Examining Residuals and the Residual Standard Deviation in Linear Regression

Linear Regression: Examining Residuals

Introduction to Residuals in Regression

In linear regression, residuals are the differences between observed values and the values predicted by the regression model. Examining residuals is crucial for assessing the appropriateness of a regression model and for diagnosing potential problems with the fit.

Residual (e): The difference between the observed value (Y) and the predicted value (\( \hat{Y} \)), calculated as \( e = Y - \hat{Y} \).
After fitting a regression model, we plot the residuals to check for patterns. Ideally, the residual plot should show no systematic structure.
A scatterplot of residuals versus the x-values (or versus predicted values) should appear random, with no clear direction, shape, or outliers.
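
As a minimal sketch of the definition above, the following Python fits a least-squares line and computes \( e = Y - \hat{Y} \) for each point. The data values and the helper names `fit_line` and `residuals` are invented for illustration; they do not come from the course materials.

```python
from statistics import mean

def fit_line(x, y):
    """Return the least-squares (slope, intercept) for predicting y from x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def residuals(x, y):
    """Return e = observed - predicted for every data point."""
    slope, intercept = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Hypothetical data, made up for this example only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
e = residuals(x, y)
```

For a least-squares line with an intercept, the residuals always sum to zero (up to rounding), so a residual plot is centered on zero by construction. What the plot reveals is whether any pattern remains around that center.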

Scatterplot of Max Wind Speed vs Central Pressure with regression line and residuals

Purpose of Residual Plots

Residual plots help us determine whether the regression model adequately captures the relationship between variables. They are used to check for:

Non-linearity (bends or curves in the plot)
Outliers (points far from the rest)
Non-constant variance (spread of residuals changes across values of x)

Scatterplot and residual plot for Fat vs Protein in Burger King items

Key Points:

The residual plot should stretch horizontally with about the same amount of vertical scatter around zero.
There should be no bends or outliers. If either appears, investigate further, because the regression model may be missing an important feature of the data.

Understanding the Residual Standard Deviation (\( S_e \))

Definition and Calculation

The residual standard deviation (also called the standard error of estimate) measures the typical vertical distance of the data points from the regression line. It quantifies the typical size of the prediction errors made by the regression model, in the same units as Y.

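Formula:

\( S_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\frac{\sum e^2}{n - 2}} \)

Formulas for residual standard deviation and visual representation

Alternatively, when summary statistics are available, use:

\( S_e \approx S_Y \sqrt{1 - r^2} \)

Where n is the number of observations, \( S_Y \) is the standard deviation of Y, and r is the correlation coefficient.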
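
The two routes to \( S_e \) can be compared with a short sketch. It assumes the usual definition with n - 2 in the denominator for the residual-based version and \( S_Y \sqrt{1 - r^2} \) for the summary-statistics version. The data and the function names (`residual_sd`, `correlation`, `residual_sd_from_summary`) are hypothetical, chosen only for this illustration.

```python
import math
from statistics import mean, stdev

def residual_sd(x, y):
    """S_e computed from the residuals, with n - 2 in the denominator."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (n - 2))

def correlation(x, y):
    """Pearson correlation coefficient r."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def residual_sd_from_summary(s_y, r):
    """Approximate S_e from the standard deviation of Y and r."""
    return s_y * math.sqrt(1 - r ** 2)

# Hypothetical data, made up for this example only
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3.1, 4.8, 7.4, 8.9, 11.2, 12.8, 15.3, 16.7]

exact = residual_sd(x, y)
approx = residual_sd_from_summary(stdev(y), correlation(x, y))
```

When \( S_Y \) is the sample standard deviation (n - 1 in its denominator), the two results differ by a factor of \( \sqrt{(n - 1)/(n - 2)} \), which approaches 1 as n grows. That is why the summary-statistics version is treated as an approximation.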

Interpreting \( S_e \)

\( S_e \) tells us roughly how far the observed values typically fall from the regression line, in the units of Y. The smaller \( S_e \) is relative to the spread of Y, the better the model fits the data.

Scatterplot showing errors as vertical distances from regression line

For a perfect linear relationship (r = 1 or r = -1), \( S_e = 0 \).
For no relationship (r = 0), \( S_e \approx S_Y \).

Residual standard deviation for perfect and no relationship

Equal Variance Assumption (Homoscedasticity)

Definition and Importance

The equal variance assumption (homoscedasticity) states that the variability of the residuals should be roughly constant across all values of the explanatory variable (x). This is a key condition for valid inference in regression analysis.

If the scatter diagram is elliptical (football-shaped) and there are no outliers, the equal variance assumption likely holds.
If the spread of residuals increases or decreases with x, the assumption is violated (heteroscedasticity).
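
The notes check equal variance by eye. As a rough numerical companion (an informal heuristic, not a formal test and not the method used in the notes), one can compare the spread of the residuals for the smaller x-values with the spread for the larger x-values. The function `spread_ratio` and both residual lists below are hypothetical.

```python
import math

def spread_ratio(x, resid):
    """Ratio of residual spread in the larger-x half to the smaller-x half.

    Values near 1 are consistent with equal variance; values far from 1
    suggest the spread changes with x. Assumes neither half is all zeros.
    """
    pairs = sorted(zip(x, resid))          # order residuals by x
    half = len(pairs) // 2
    low = [e for _, e in pairs[:half]]
    high = [e for _, e in pairs[half:]]

    def rms(values):
        # Root mean square around zero, since residuals are centered on zero
        return math.sqrt(sum(v * v for v in values) / len(values))

    return rms(high) / rms(low)

# Hypothetical residuals, made up for this example only
x = list(range(1, 11))
even_resid = [(-1) ** i for i in range(10)]                    # constant spread
fan_resid = [0.5 * xi * (-1) ** i for i, xi in enumerate(x)]   # spread grows with x

ratio_even = spread_ratio(x, even_resid)
ratio_fan = spread_ratio(x, fan_resid)
```

A ratio well above or below 1 is only a prompt to look at the residual plot more closely; the plot remains the primary diagnostic.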

Visual comparison of equal and unequal variances in regression

Empirical Rule for Residuals

If the equal variance assumption holds and the residuals are roughly bell-shaped (approximately normal), the residuals follow a pattern similar to the empirical rule for normal distributions:

About 68% of points are within one \( S_e \) of the regression line (vertically).
About 95% are within two \( S_e \).
About 99.7% are within three \( S_e \).
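
The percentages above can be checked on any set of residuals by counting. The sketch below uses simulated, normally distributed residuals with an assumed \( S_e \) of 4.0; the data, the seed, and the helper `fraction_within` are all hypothetical.

```python
import random

def fraction_within(resid, s_e, k):
    """Share of residuals that lie no more than k * s_e from zero."""
    return sum(abs(e) <= k * s_e for e in resid) / len(resid)

random.seed(0)   # fixed seed so the sketch is repeatable
s_e = 4.0        # hypothetical residual standard deviation
resid = [random.gauss(0, s_e) for _ in range(10_000)]  # simulated residuals

# Shares within 1, 2, and 3 residual standard deviations
shares = {k: fraction_within(resid, s_e, k) for k in (1, 2, 3)}
```

With real data, shares that fall far from 68%, 95%, and 99.7% suggest that the residuals are not roughly normal or that their spread is not constant.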

Scatterplot with shaded region showing 68% of points within 1 Se

Worked Example: Exam Scores

Regression and Residuals

Consider a regression model predicting Exam 2 scores (Y) from Exam 1 scores (X):

Scatterplot of Exam 2 vs Exam 1 with regression line

The residuals are the vertical distances from each point to the regression line.

Residual plot versus Exam 1 Score
Residual plot versus Predicted Exam 2 Score

Calculating \( S_e \) for the Example

Given: \( S_Y = 5.5 \), \( r = 0.67 \)
\( S_e \approx 5.5 \sqrt{1 - (0.67)^2} = 5.5 \sqrt{0.5511} \approx 4.1 \)

Calculation of Se and interpretation for exam scores

Typically, observed Exam 2 scores deviate from the scores predicted by the regression line by about 4 points.

Visualizing the Empirical Rule with Residuals

About 68% of the points are within 4 points (vertically) of the regression line.
About 95% are within 8 points, and 99.7% are within 12 points.
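
The band widths follow directly from \( S_e \). A small sketch of the arithmetic, using the \( S_Y \) and r given for this example (the variable names are illustrative):

```python
import math

s_y = 5.5   # standard deviation of Exam 2 scores, as given in the notes
r = 0.67    # correlation between Exam 1 and Exam 2, as given in the notes

# Summary-statistics approximation to the residual standard deviation
s_e = s_y * math.sqrt(1 - r ** 2)

# Half-widths of the 68%, 95%, and 99.7% bands around the regression line
bands = {k: k * s_e for k in (1, 2, 3)}
```

The notes round \( S_e \) to 4 before multiplying, which gives the 4, 8, and 12 point bands; the unrounded half-widths are slightly larger.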

Residual plot with 68% band
Residual plot with 95% band
Residual plot with 99.7% band

Summary Table: Residual Standard Deviation Formulas

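Situation	Formula for \( S_e \)	Interpretation
General case	\( S_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} \)	Standard deviation of residuals
Using summary statistics	\( S_e \approx S_Y \sqrt{1 - r^2} \)	Approximation; most accurate for large n
Perfect relationship (r = 1 or r = -1)	\( S_e = 0 \)	No prediction error
No relationship (r = 0)	\( S_e \approx S_Y \)	All variation remains unexplained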

Application: Using Software Output

Statistical software often reports the residual standard deviation (sometimes labeled as "s" or "Std Error of Estimate") in regression output. For example, in a regression predicting Math SAT scores from Verbal SAT scores:

\( S_e \) is reported as 71.755
Alternatively, calculate using summary statistics (\( S_Y = 98.1 \), \( r = 0.685 \)): \( S_e \approx 98.1 \sqrt{1 - (0.685)^2} \approx 71.47 \), close to the software value.
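
To see how close the two values are, the comparison can be written out as a short sketch. The numbers are the ones quoted above; the variable names are illustrative.

```python
import math

s_y = 98.1         # value used as S_Y in the calculation above
r = 0.685          # correlation used in the calculation above
reported = 71.755  # S_e as reported by the software

# Summary-statistics approximation
approx = s_y * math.sqrt(1 - r ** 2)

# Gap between the software value and the approximation, as a share of the software value
relative_gap = (reported - approx) / reported
```

The gap is well under 1%. A small difference in this direction is what the approximation would be expected to produce, and rounding of the summary statistics may also contribute.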

Statistical software output for regression and calculation of Se

Key Takeaways

Residuals and their standard deviation are essential for diagnosing regression models.
Residual plots should show no pattern; patterns indicate model inadequacy.
The residual standard deviation quantifies the typical prediction error.
The equal variance assumption must be checked for valid inference.
Use summary statistics or software output to compute \( S_e \).