Chapter 23: Inferences for Regression – Study Notes

Inferences for Regression

Introduction to Regression Analysis

Regression analysis is a statistical method used to model and analyze the relationship between a dependent (response) variable and one or more independent (predictor) variables. The goal is to predict or explain the dependent variable using the independent variable(s).

  • Dependent variable (y): The outcome or variable to be predicted.

  • Independent variable (x): The variable used to predict y.

  • The equation for the line of best fit (simple linear regression): $\hat{y} = b_0 + b_1 x$, where $b_0$ is the estimated intercept and $b_1$ is the estimated slope.

Example: Predicting percent body fat from waist size in men.

Scatterplot of % Body Fat vs Waist size
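As a rough illustration of how the line of best fit is obtained, the sketch below computes the least-squares intercept and slope by hand. The helper name `fit_line` and the six data points are made up for illustration; they are not the body fat data from the example.

```python
from statistics import mean

def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

# Made-up illustrative values (NOT the data from the example above).
waist = [32, 34, 36, 38, 40, 42]
body_fat = [10.0, 13.0, 19.0, 22.0, 28.0, 31.0]

b0, b1 = fit_line(waist, body_fat)
print(f"y-hat = {b0:.2f} + {b1:.3f} * x")
```

The slope is the sum of cross-products divided by the sum of squared x-deviations, and the intercept is chosen so that the line passes through the point of means.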

The Population Regression Model

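We use regression to make inferences about the population, not just the sample. The idealized regression line for the population is written as: $\mu_y = \beta_0 + \beta_1 x$

  • For individual data points, we include random error: $y = \beta_0 + \beta_1 x + \varepsilon$

  • Where $\varepsilon$ is the random error for that point, assumed to be normally distributed with mean 0 and the same standard deviation at every x.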

Key Terms:

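  • Intercept ($\beta_0$): Expected value of y when x = 0.

  • Slope ($\beta_1$): Expected change in y for a one-unit increase in x.

  • Residual ($e$): The difference between observed and predicted y, $e = y - \hat{y}$. Residuals are the sample estimates of the errors $\varepsilon$.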

Distribution of % Body Fat at different Waist sizes
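One way to picture the population model is to simulate it: at a fixed x, individual responses scatter normally around the point on the population line. The sketch below does this with the standard library. The parameter values (`BETA0`, `BETA1`, `SIGMA`) are hypothetical and chosen only for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population parameters (not estimated from any real data).
BETA0, BETA1, SIGMA = -60.0, 2.0, 4.0

def simulate_y(x):
    """Draw one response: y = beta0 + beta1 * x + epsilon, epsilon ~ N(0, sigma)."""
    return BETA0 + BETA1 * x + random.gauss(0.0, SIGMA)

x = 38
draws = [simulate_y(x) for _ in range(10_000)]

mu_y = BETA0 + BETA1 * x          # point on the population line at this x
avg = sum(draws) / len(draws)     # should land close to mu_y
print(f"mu_y = {mu_y:.1f}, simulated mean = {avg:.2f}")
```

The average of many simulated responses at the same x settles near $\mu_y$, which is what the population line describes.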

Assumptions and Conditions for Regression Inference

1. Linearity Assumption

The relationship between x and y must be linear. This is checked using a scatterplot of the data (Straight Enough Condition).

  • If the scatterplot is not linear, consider transforming the data.

Scatterplot of % Body Fat vs Waist size

2. Independence Assumption

The observations must be independent. This is often ensured by random sampling (Randomization Condition). Independence can also be checked by plotting residuals versus the x-variable; the residuals should appear randomly scattered.

3. Equal Variance Assumption (Homoscedasticity)

The spread of the residuals should be roughly constant for all values of x (Does the Plot Thicken? Condition). This is checked using a residuals vs fits plot.

  • If the spread increases or decreases with x, the assumption is violated.

Residual plot with increasing variance

4. Normal Population Assumption

The residuals should be nearly normally distributed (Nearly Normal Condition). This is checked using a histogram or a normal probability plot of the residuals. Outliers should also be checked.

Histogram of residuals

Summary of Assumptions

  • Linearity: Relationship between x and y is linear.

  • Independence: Observations are independent.

  • Equal Variance: Residuals have constant variance.

  • Normality: Residuals are normally distributed.

Regression model with normal distributions at each x

Order of Conducting a Simple Regression Analysis

  1. Check linearity with a scatterplot.

  2. Fit the regression model and examine residuals vs fits plot for patterns.

  3. Check the histogram and normal probability plot of residuals for normality.

  4. If data are time series, plot residuals vs time to check for independence.
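The checks above are graphical, but the quantities being plotted are easy to compute. The sketch below fits the line, produces the fitted values and residuals one would plot, and adds a crude numeric stand-in for the "Does the Plot Thicken?" check by comparing residual spread in the lower and upper halves of the fitted values. The data and the helper name `residuals_and_fits` are made up for illustration, and the half-split comparison is an assumption of this sketch, not a formal test.

```python
from statistics import mean, stdev

def residuals_and_fits(x, y):
    """Fit y-hat = b0 + b1 * x by least squares; return (fits, residuals)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    fits = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fits)]
    return fits, resid

# Made-up illustrative data.
x = [30, 32, 34, 36, 38, 40, 42, 44]
y = [8.0, 14.0, 15.0, 22.0, 21.0, 29.0, 30.0, 37.0]

fits, resid = residuals_and_fits(x, y)

# Pairs (fit, residual) are what a residuals vs fits plot displays.
pairs = sorted(zip(fits, resid))
half = len(pairs) // 2
low_spread = stdev(r for _, r in pairs[:half])    # spread at small fitted values
high_spread = stdev(r for _, r in pairs[half:])   # spread at large fitted values
print(f"residual SD (low fits) = {low_spread:.2f}, (high fits) = {high_spread:.2f}")
```

Very different spreads in the two halves would be a hint to look more closely at the residual plot; the plot itself remains the primary check.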

Intuition About Regression Inference

Factors Affecting the Standard Error of the Slope

  • Spread around the line ($s_e$): Less scatter means more consistent slope estimates.

  • Spread of x values ($s_x$): Greater spread in x gives more stable regression estimates.

  • Sample size (n): Larger sample size gives more consistent estimates.

Scatterplots showing effect of x spread on slope estimate

Scatterplots showing effect of sample size on slope estimate

Standard Error of the Slope

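The standard error of the regression slope is calculated as: $SE(b_1) = \dfrac{s_e}{\sqrt{n-1}\; s_x}$

  • $s_e$ = residual standard deviation, $s_e = \sqrt{\sum (y - \hat{y})^2 / (n - 2)}$

  • $s_x$ = standard deviation of x-values

  • $n$ = sample size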
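A minimal sketch of this calculation, assuming the standard formula $SE(b_1) = s_e / (\sqrt{n-1}\, s_x)$ with $s_e$ computed on $n - 2$ degrees of freedom. The function name `slope_standard_error` and the five data points are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def slope_standard_error(x, y):
    """Return SE(b1) = s_e / (sqrt(n - 1) * s_x) for a simple linear regression."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    # Residual standard deviation: n - 2 because two parameters were estimated.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = sqrt(sse / (n - 2))
    s_x = stdev(x)  # sample standard deviation of x (divisor n - 1)
    return s_e / (sqrt(n - 1) * s_x)

# Made-up illustrative data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

se_b1 = slope_standard_error(x, y)
print(f"SE(b1) = {se_b1:.4f}")
```

The formula shows all three factors at once: a smaller $s_e$, a larger $s_x$, or a larger $n$ each shrink the standard error.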

Inference for Regression Slope

Hypothesis Test for Slope

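  • Null hypothesis: $H_0: \beta_1 = 0$ (no linear relationship between x and y)

  • Alternative hypothesis: $H_A: \beta_1 \neq 0$ (x and y are linearly related)

  • Test statistic: $t = \dfrac{b_1 - 0}{SE(b_1)}$ with $n - 2$ degrees of freedom

  • Confidence interval for slope: $b_1 \pm t^*_{n-2} \times SE(b_1)$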

Example: Testing the Relationship Between % Body Fat and Waist Size

| Term     | Coef  | SE Coef | T-Value | P-Value |
|----------|-------|---------|---------|---------|
| Constant | -62.6 | 10.2    | -6.16   | 0.000   |
| Waist    | 2.222 | 0.273   | 8.14    | 0.000   |

  • Test statistic: $t = \dfrac{2.222 - 0}{0.273} = 8.14$

  • P-value < 0.05, so we reject $H_0$ and conclude a significant linear relationship exists.
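The test statistic can be reproduced from the Waist row of the output. The sketch below uses the critical-value form of the test rather than a p-value, because the Python standard library has no t-distribution. The critical value 2.101 is the two-sided 5% point of the t-distribution with 18 degrees of freedom (the degrees of freedom these notes quote for this example), taken from a t-table.

```python
# Values from the Waist row of the regression output.
b1 = 2.222
se_b1 = 0.273

# Two-sided 5% critical value for t with 18 df (from a t-table).
t_crit = 2.101

t_stat = (b1 - 0) / se_b1            # test H0: beta1 = 0
reject_h0 = abs(t_stat) > t_crit     # two-tailed decision at the 5% level
print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")  # t = 8.14, reject H0: True
```

Comparing |t| with the critical value gives the same decision as comparing the p-value with 0.05.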

Confidence Interval for the Slope

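For 95% confidence and 18 degrees of freedom, $t^*_{18} = 2.101$: $b_1 \pm t^*_{18} \times SE(b_1) = 2.222 \pm 2.101 \times 0.273 = 2.222 \pm 0.574$, or (1.648, 2.796).

Interpretation: We are 95% confident that each additional inch of waist size is associated with an increase in mean percent body fat of between 1.648 and 2.796 percentage points.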
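The interval arithmetic can be checked with a few lines of code. The helper name `slope_ci` is hypothetical. The slope and standard error come from the Waist row of the output, and 2.101 is the t-table critical value for 95% confidence with 18 degrees of freedom.

```python
def slope_ci(b1, se_b1, t_star):
    """Return (lower, upper) for the interval b1 +/- t_star * SE(b1)."""
    margin = t_star * se_b1
    return b1 - margin, b1 + margin

low, high = slope_ci(b1=2.222, se_b1=0.273, t_star=2.101)
print(f"({low:.3f}, {high:.3f})")  # (1.648, 2.796)
```

Because the interval lies entirely above 0, it agrees with the hypothesis test: a slope of 0 is not a plausible value.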

Estimation and Prediction Using the Regression Model

Confidence Interval for Mean Response

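To estimate the mean value of y for a given x ($x_\nu$): $\hat{y}_\nu \pm t^*_{n-2} \times SE(\hat{\mu}_\nu)$, where $\hat{y}_\nu = b_0 + b_1 x_\nu$.

Where $SE(\hat{\mu}_\nu) = \sqrt{SE^2(b_1)\,(x_\nu - \bar{x})^2 + \dfrac{s_e^2}{n}}$ is the standard error for the mean response at $x_\nu$.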

Prediction Interval for a New Observation

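To predict the value of y for a new individual at $x_\nu$: $\hat{y}_\nu \pm t^*_{n-2} \times SE(\hat{y}_\nu)$, where $\hat{y}_\nu = b_0 + b_1 x_\nu$.

Where $SE(\hat{y}_\nu) = \sqrt{SE^2(b_1)\,(x_\nu - \bar{x})^2 + \dfrac{s_e^2}{n} + s_e^2}$ includes an extra term, $s_e^2$, for the variability of individual observations.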

Comparison of Intervals

  • Confidence Interval (CI): For the mean response at a given x; narrower interval.

  • Prediction Interval (PI): For a single new observation at a given x; wider interval due to extra uncertainty.
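To see why the prediction interval is always wider, the sketch below computes both margins of error at the same x. It assumes the standard textbook formulas $SE(\hat{\mu}) = \sqrt{SE^2(b_1)(x - \bar{x})^2 + s_e^2/n}$ and $SE(\hat{y}) = \sqrt{SE^2(b_1)(x - \bar{x})^2 + s_e^2/n + s_e^2}$. The slope standard error (0.273), sample size (20, implied by 18 degrees of freedom), and t* (2.101) match the example; the values of `s_e`, `x_bar`, and `x_new` are hypothetical, since the notes do not report them.

```python
from math import sqrt

def interval_margins(x_new, x_bar, n, s_e, se_b1, t_star):
    """Return (CI margin, PI margin) at x_new for a simple linear regression."""
    shared = se_b1 ** 2 * (x_new - x_bar) ** 2 + s_e ** 2 / n
    se_mean = sqrt(shared)              # uncertainty in the mean response
    se_pred = sqrt(shared + s_e ** 2)   # adds individual-to-individual variation
    return t_star * se_mean, t_star * se_pred

ci_margin, pi_margin = interval_margins(
    x_new=40,      # hypothetical waist size of interest
    x_bar=37,      # hypothetical mean waist size
    n=20,          # implied by 18 degrees of freedom
    s_e=4.5,       # hypothetical residual standard deviation
    se_b1=0.273,   # from the regression output
    t_star=2.101,  # 95% confidence, 18 df
)
print(f"CI margin = {ci_margin:.2f}, PI margin = {pi_margin:.2f}")
```

Both margins grow as `x_new` moves away from `x_bar`, but the extra $s_e^2$ term keeps the prediction interval wider at every x.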

Common Pitfalls in Regression Inference

  • Do not fit a linear model to non-linear data.

  • Check for non-constant variance (plot thickening).

  • Be cautious with extrapolation (predicting outside the range of data).

  • Watch for outliers and high-influence points.

  • Ensure residuals are nearly normal.

  • Use two-tailed tests unless a one-tailed test is justified.

Summary of Key Skills

  • Check all regression assumptions before making inferences.

  • Interpret regression output, including slope, intercept, standard errors, t-values, and p-values.

  • Construct and interpret confidence and prediction intervals.

  • Understand the difference between inference for the mean response and for individual predictions.
