Linear Regression: Modeling, Residuals, and Interpretation

Study Guide - Smart Notes

Linear Regression and Its Application

Introduction to Linear Regression

Linear regression is a statistical method used to model the relationship between two quantitative variables by fitting a straight line to the observed data. This line, called the regression line, allows us to predict the value of one variable (the response or dependent variable) based on the value of another (the explanatory or independent variable).

  • Scatterplots are used to visualize the relationship between two quantitative variables.

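  • The correlation coefficient (r) measures the strength and direction of the linear relationship; it always lies between -1 and +1.

  • An r value close to +1 indicates a strong, positive linear relationship. A moderately strong positive relationship has a smaller positive r, well above 0 but not near +1.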
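
As a rough illustration of how r is computed, the sketch below averages the products of z-scores. The `protein` and `fat` lists are invented values, not the Burger King data, and `correlation` is a helper name chosen for this example.

```python
from statistics import mean, stdev

# Hypothetical data, invented for illustration (not the Burger King values).
protein = [5, 12, 18, 25, 31, 40]   # x, grams
fat = [10, 19, 22, 33, 36, 45]      # y, grams

def correlation(x, y):
    """Pearson r: sum of z-score products divided by n - 1."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)   # sample standard deviations
    products = ((xi - x_bar) / s_x * (yi - y_bar) / s_y for xi, yi in zip(x, y))
    return sum(products) / (n - 1)

r = correlation(protein, fat)
print(f"r = {r:.3f}")
```

Because r is built from z-scores, it has no units and does not change if either variable is rescaled.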

Scatterplot of Fat vs Protein for Burger King menu items

The Linear Regression Model

The linear regression model provides an equation for the best-fitting straight line through the data. This model is used for prediction and understanding the association between variables.

  • The general form of the regression equation is: $\hat{y} = b_0 + b_1 x$, where $\hat{y}$ is the predicted value of y.

  • Slope ($b_1$): Indicates the expected change in y for a one-unit increase in x.

  • Intercept ($b_0$): The predicted value of y when x = 0.

Scatterplot with regression line and means marked

Finding the Regression Equation

To determine the regression equation, calculate the slope and intercept using summary statistics and the correlation coefficient.

  • The slope is calculated as: $b_1 = r \frac{s_y}{s_x}$

  • The intercept is calculated as: $b_0 = \bar{y} - b_1 \bar{x}$

  • Where $r$ is the correlation coefficient, $s_y$ and $s_x$ are the standard deviations of y and x, and $\bar{y}$ and $\bar{x}$ are their means.
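
The slope and intercept formulas can be sketched in a few lines. The data are invented for illustration (not the Burger King values), and `regression_from_summary` is a hypothetical helper name.

```python
from statistics import mean, stdev

# Hypothetical data, invented for illustration (not the Burger King values).
protein = [5, 12, 18, 25, 31, 40]   # x, grams
fat = [10, 19, 22, 33, 36, 45]      # y, grams

def regression_from_summary(r, s_x, s_y, x_bar, y_bar):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Summary statistics of the two variables.
x_bar, y_bar = mean(protein), mean(fat)
s_x, s_y = stdev(protein), stdev(fat)
n = len(protein)

# Sample correlation from the cross-products of deviations.
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(protein, fat)) / ((n - 1) * s_x * s_y)

b0, b1 = regression_from_summary(r, s_x, s_y, x_bar, y_bar)
print(f"fat-hat = {b0:.2f} + {b1:.2f} * protein")
```

A consequence of the intercept formula is that the fitted line always passes through the point of means $(\bar{x}, \bar{y})$.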

Worked example of finding the regression equation for Burger King data

Making Predictions with the Regression Model

Once the regression equation is established, it can be used to predict the response variable for given values of the explanatory variable.

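  • Substitute the value of x into the regression equation to obtain the predicted y (denoted $\hat{y}$).

  • Example: To predict the fat content of a Burger King item with 29 g of protein, substitute x = 29 into the fitted equation; the resulting $\hat{y}$ is the predicted fat content in grams.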
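
Prediction is a single substitution, as the sketch below shows. The coefficients `b0_demo` and `b1_demo` are placeholders assumed for illustration; they are not the fitted Burger King values, so the printed number is not the textbook answer.

```python
def predict(b0, b1, x):
    """Predicted response: y-hat = b0 + b1 * x."""
    return b0 + b1 * x

# Placeholder coefficients, assumed for illustration only;
# they are NOT the fitted Burger King intercept and slope.
b0_demo, b1_demo = 5.0, 1.0

protein_grams = 29
fat_hat = predict(b0_demo, b1_demo, protein_grams)
print(f"Predicted fat for {protein_grams} g protein: {fat_hat:.1f} g")
```

Predictions are most trustworthy for x values inside the range of the data used to fit the line.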

Scatterplot with regression line and prediction for 29g protein

Understanding and Analyzing Residuals

Definition and Calculation of Residuals

Residuals are the differences between observed values and the values predicted by the regression model. They measure the vertical distance from each data point to the regression line.

  • Residual for the i-th observation: $e_i = y_i - \hat{y}_i$ (observed minus predicted).

  • Positive residual: The observed value is greater than the predicted value (the model underestimates).

  • Negative residual: The observed value is less than the predicted value (the model overestimates).
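
A minimal sketch of the residual calculation follows, using invented data rather than the Burger King values. The helper `fit_line` is a hypothetical name; it uses the least-squares form $b_1 = S_{xy}/S_{xx}$, which is algebraically the same as $r \, s_y / s_x$.

```python
from statistics import mean

# Hypothetical data, invented for illustration (not the Burger King values).
protein = [5, 12, 18, 25, 31, 40]   # x, grams
fat = [10, 19, 22, 33, 36, 45]      # y, grams

def fit_line(x, y):
    """Least-squares intercept and slope for predicting y from x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

b0, b1 = fit_line(protein, fat)
fitted = [b0 + b1 * x for x in protein]
residuals = [y - y_hat for y, y_hat in zip(fat, fitted)]   # observed - predicted

for x, y, e in zip(protein, fat, residuals):
    label = "underestimate" if e > 0 else "overestimate" if e < 0 else "exact"
    print(f"x={x:>2}  y={y:>2}  residual={e:+.2f}  ({label})")
```

For a least-squares line the residuals always sum to zero (up to rounding), which is a quick sanity check on the arithmetic.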

Definition of predicted value and residual

Visualizing Residuals

Plotting residuals helps assess the fit of the model and check for violations of regression assumptions.

  • A good model will have residuals scattered randomly around zero, with no obvious patterns.

  • Residual plots can reveal non-linearity, unequal variance, or outliers.
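
To see what a pattern in the residuals looks like, the sketch below fits a line to deliberately curved, made-up data and prints the sign of each residual in order of x as a crude text stand-in for a residual plot. `fit_line` is a hypothetical helper.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for predicting y from x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Made-up data that follow a curve (y = x^2), not a straight line.
x = list(range(11))
y = [xi ** 2 for xi in x]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Sign of each residual from smallest x to largest x.
print("".join("+" if e > 0 else "-" for e in residuals))
```

The residuals are positive at both ends and negative in the middle. A systematic run of signs like this, rather than random scatter around zero, is the bend that signals non-linearity.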

Scatterplot of residuals versus fitted values

Distribution of Residuals

The distribution of residuals should be approximately normal (unimodal and symmetric) if the linear model is appropriate.

  • Histograms and normal probability plots are used to assess normality.

  • About 95% of residuals should fall within two standard deviations of zero (68-95-99.7 rule).

Histogram of residuals for Burger King regression

Assessing Model Fit: R-Squared and Standard Deviation of Residuals

R-Squared (Coefficient of Determination)

R-squared ($R^2$) measures the proportion of the variance in the response variable that is explained by the regression model.

  • Ranges from 0 to 1 (or 0% to 100%).

  • Higher $R^2$ indicates a better fit; for example, $R^2 = 0.62$ means 62% of the variation in y is explained by x.
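
One way to compute $R^2$ is as one minus the ratio of unexplained to total variation. The sketch below does this on invented data (not the Burger King values) and also checks the standard fact that, with a single explanatory variable, $R^2$ equals the square of the correlation coefficient.

```python
from statistics import mean

# Hypothetical data, invented for illustration (not the Burger King values).
protein = [5, 12, 18, 25, 31, 40]   # x, grams
fat = [10, 19, 22, 33, 36, 45]      # y, grams

x_bar, y_bar = mean(protein), mean(fat)
sxx = sum((x - x_bar) ** 2 for x in protein)
syy = sum((y - y_bar) ** 2 for y in fat)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(protein, fat))

# Least-squares line.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(protein, fat))  # unexplained variation
sst = syy                                                          # total variation in y
r_squared = 1 - sse / sst

r = sxy / (sxx * syy) ** 0.5
print(f"R^2 = {r_squared:.3f}, r^2 = {r ** 2:.3f}")
```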

Just Checking: Interpreting R-squared in regression

Standard Deviation of Residuals ($s_e$)

The standard deviation of the residuals, $s_e$, quantifies the typical size of the prediction errors. It is used to assess the accuracy of the regression model.

  • Calculated as: $s_e = \sqrt{\frac{\sum e_i^2}{n - 2}}$

  • About 95% of residuals should fall within $2s_e$ of zero if the model is appropriate.
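
The sketch below computes $s_e$ and the share of residuals within $2s_e$ of zero. It runs on simulated data: the true line (10 + 2x), the noise standard deviation (3), the sample size, and the seed are all assumptions made for this example.

```python
import random
from math import sqrt
from statistics import mean

random.seed(0)

# Simulated data under assumed values: true line 10 + 2x, Normal noise with SD 3.
x = [random.uniform(0, 50) for _ in range(200)]
y = [10 + 2 * xi + random.gauss(0, 3) for xi in x]

# Least-squares fit.
x_bar, y_bar = mean(x), mean(y)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
n = len(residuals)

# Divide by n - 2 because two parameters (b0 and b1) were estimated.
s_e = sqrt(sum(e ** 2 for e in residuals) / (n - 2))
within = sum(abs(e) <= 2 * s_e for e in residuals) / n
print(f"s_e = {s_e:.2f}, share within 2 s_e = {within:.1%}")
```

With Normal noise, the share should land near 95%, and $s_e$ should be close to the noise standard deviation used in the simulation.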

Regression Assumptions and Conditions

Key Assumptions for Linear Regression

  • Quantitative Variables Condition: Both variables must be quantitative.

  • Straight Enough Condition: The relationship between x and y should be linear.

  • Does the Plot Thicken? Condition: The variance of residuals should be roughly constant for all values of x (homoscedasticity).

  • Outlier Condition: There should be no influential outliers that distort the regression line.

Plots showing increasing variance as a violation of equal variance assumption

Checking Assumptions with Residual Plots

After fitting the regression model, plot residuals against predicted values or x to check for:

  • Bends (non-linearity)

  • Outliers

  • Changes in spread (heteroscedasticity)
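
A numeric stand-in for eyeballing a change in spread is to compare the typical residual size for small and large x. The sketch below is only a rough check, not a formal test. The data are constructed so that the scatter grows with x, and splitting the data into halves is an arbitrary choice for this example.

```python
from statistics import mean

# Constructed data: deviations from the line 2x alternate in sign and grow with x.
x = list(range(1, 21))
y = [2 * xi + 0.2 * xi * (-1) ** xi for xi in x]

# Least-squares fit.
x_bar, y_bar = mean(x), mean(y)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Typical residual size in the lower and upper halves of x (x is already sorted).
half = len(x) // 2
spread_low = mean([abs(e) for e in residuals[:half]])
spread_high = mean([abs(e) for e in residuals[half:]])
print(f"mean |residual|: low x = {spread_low:.2f}, high x = {spread_high:.2f}")
```

A clearly larger spread at one end is the fan shape that indicates heteroscedasticity.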

Residual plot showing random scatter

Interpreting Regression Results and Common Pitfalls

Interpreting Slope and Intercept

  • Slope: The expected change in y for a one-unit increase in x, with units of y per x.

  • Intercept: The predicted value of y when x = 0 (may not always be meaningful).

Explanation of slope and intercept in regression

Common Pitfalls in Regression Analysis

  • Do not fit a linear model to a non-linear relationship.

  • Do not ignore outliers; they can greatly affect the regression line.

  • Do not invert the regression equation to solve for x unless a new regression is performed with x as the response.

  • Do not choose a model based solely on $R^2$; check all assumptions and residuals.
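
The warning about inverting the regression equation can be checked numerically. In the sketch below (made-up data, hypothetical helper `slope`), the slope for predicting x from y is compared with the algebraic inverse of the slope for predicting y from x.

```python
from statistics import mean

# Made-up data with a moderately strong positive association.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

def slope(explanatory, response):
    """Least-squares slope for predicting `response` from `explanatory`."""
    e_bar, r_bar = mean(explanatory), mean(response)
    sxy = sum((e - e_bar) * (r - r_bar) for e, r in zip(explanatory, response))
    sxx = sum((e - e_bar) ** 2 for e in explanatory)
    return sxy / sxx

b1_y_on_x = slope(x, y)               # regression of y on x
b1_x_on_y = slope(y, x)               # new regression with x as the response
algebraic_inverse = 1 / b1_y_on_x     # what solving the y-on-x equation for x would use

print(f"x-on-y slope = {b1_x_on_y:.3f}, algebraic inverse = {algebraic_inverse:.3f}")
```

The two values agree only when the points lie exactly on a line (r = ±1). In general the product of the two regression slopes equals $r^2$, not 1.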

Summary Table: Key Regression Concepts

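| Concept | Definition | Formula |
| --- | --- | --- |
| Regression Equation | Best-fitting line for predicting y from x | $\hat{y} = b_0 + b_1 x$ |
| Slope ($b_1$) | Change in y per unit change in x | $b_1 = r \frac{s_y}{s_x}$ |
| Intercept ($b_0$) | Predicted y when x = 0 | $b_0 = \bar{y} - b_1 \bar{x}$ |
| Residual ($e$) | Difference between observed and predicted y | $e = y - \hat{y}$ |
| R-squared ($R^2$) | Fraction of variance in y explained by x | $R^2 = r^2$ |
| Residual Std. Dev. ($s_e$) | Typical prediction error | $s_e = \sqrt{\frac{\sum e^2}{n - 2}}$ |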

Conclusion

Linear regression is a powerful tool for modeling and predicting relationships between quantitative variables. Always check assumptions, interpret results in context, and use residual analysis to validate your model.
