Regression Diagnostics: Changing Variation, Outliers, and Dependent Errors

Regression Diagnostics

Introduction to Regression Diagnostics

Regression diagnostics are essential tools for evaluating the validity and reliability of regression models. They help identify issues such as changing variation (heteroscedasticity), outliers, and dependence among observations, which can undermine statistical inference and prediction accuracy.

Changing Variation in Regression Models

Understanding Changing Variation

In regression analysis, the assumption of constant variance (homoscedasticity) of errors is crucial. When the variance of errors changes with the level of the explanatory variable, the data are said to exhibit heteroscedasticity. This can lead to unreliable confidence intervals and hypothesis tests.

Homoscedasticity: Errors have equal variance across all levels of the explanatory variable.
Heteroscedasticity: Errors have different variances, often increasing or decreasing with the explanatory variable.

Example: Home prices tend to be more variable for larger homes, leading to heteroscedasticity in a regression of price on home size.

Scatterplot of Price vs. Square Feet showing increasing spread

Detecting Changing Variation

Several graphical and statistical methods can be used to detect changing variation:

Residual Plots: A fan-shaped pattern in a plot of residuals versus the fitted values or the explanatory variable indicates heteroscedasticity.
Side-by-Side Boxplots: Comparing the spread of residuals across groups of the explanatory variable can reveal differences in variance.

Fan-shaped residual plot indicating heteroscedasticity
Boxplots of residuals by size range showing increasing variance
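The boxplot comparison can also be done numerically. The sketch below is only an illustration on simulated data: the sample size, the coefficients 50,000 and 160, and the error scale are hypothetical, chosen so that the error spread grows with home size. It fits the line by least squares and reports the residual standard deviation within three size ranges.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical values): error SD grows in proportion to size.
sqft = rng.uniform(500, 3500, size=300)
price = 50_000 + 160 * sqft + rng.normal(0, 1, size=300) * 40 * sqft

# Least squares fit of price on size; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(sqft, price, 1)
residuals = price - (intercept + slope * sqft)

# Numerical analogue of side-by-side boxplots: residual SD within size ranges.
edges = [500, 1500, 2500, 3500]
group_sd = []
for low, high in zip(edges[:-1], edges[1:]):
    in_group = (sqft >= low) & (sqft < high)
    group_sd.append(residuals[in_group].std(ddof=1))
    print(f"{low}-{high} sq ft: residual SD = {group_sd[-1]:,.0f}")
```

A residual SD that rises steadily from the smallest to the largest size range is the numerical counterpart of the fan shape in the residual plot.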

Consequences of Heteroscedasticity

Prediction intervals may be too narrow or too wide, depending on the value of the explanatory variable.
Confidence intervals for regression coefficients (slope and intercept) are unreliable.
Hypothesis tests for coefficients may not be valid.

Prediction intervals too wide for small homes and too narrow for large homes

Fixing Heteroscedasticity: Model Revision

One common remedy is to transform the response and/or explanatory variable to stabilize variance. For example, dividing both sides of the regression equation by the explanatory variable can yield a model with more constant variance:

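Let Price = F + M × SqFt + ε (original model), where F is the fixed cost and M is the marginal cost per square foot.
Divide both sides by SqFt: Price/SqFt = M + F × (1/SqFt) + ε/SqFt
Now, regress Price per SqFt on 1/SqFt. The roles of the coefficients swap: the intercept of the transformed regression estimates M, and the slope estimates F. If the standard deviation of ε grows in proportion to SqFt, the new error term ε/SqFt has roughly constant standard deviation.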

This transformation often results in residuals with similar variances (homoscedasticity).
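A minimal sketch of the transformed regression, assuming simulated data in which the error SD is exactly proportional to size. The names F_true and M_true and all numeric settings are hypothetical. It regresses price per square foot on the reciprocal of size, reads the marginal cost from the intercept and the fixed cost from the slope, and compares the residual spread for small and large homes.

```python
import numpy as np

rng = np.random.default_rng(1)
F_true, M_true = 50_000.0, 160.0   # hypothetical fixed and marginal costs

sqft = rng.uniform(500, 3500, size=400)
# Errors whose SD is proportional to size, the case the transformation targets.
price = F_true + M_true * sqft + rng.normal(0, 1, size=400) * 30 * sqft

# Transformed regression: Price/SqFt on 1/SqFt.
y = price / sqft
x = 1.0 / sqft
slope, intercept = np.polyfit(x, y, 1)

# Roles swap: the intercept estimates M and the slope estimates F.
M_hat, F_hat = intercept, slope

# Compare residual spread for smaller and larger homes.
resid = y - (intercept + slope * x)
small, large = sqft < 2000, sqft >= 2000
sd_ratio = resid[large].std(ddof=1) / resid[small].std(ddof=1)

print(f"marginal cost estimate: {M_hat:.1f} per sq ft")
print(f"fixed cost estimate:    {F_hat:,.0f}")
print(f"residual SD ratio (large/small homes): {sd_ratio:.2f}")
```

A ratio near 1 suggests similar variances in the two groups. Real data will rarely have error SD exactly proportional to size, so the residuals of the transformed model still need to be checked.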

Boxplots confirming homoscedastic errors after transformation

Comparing Models: Original vs. Transformed

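Although the transformed model may have a lower r², it provides more reliable confidence and prediction intervals. The two r² values are not directly comparable in any case, because the models have different responses (Price versus Price per Sq Ft).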

Response	Similar Variances?	Estimated Fixed Cost	95% Confidence Interval
Price	No	$50,599	−$4,000 to $105,000
Price/Sq Ft	Yes	$53,887	$19,000 to $89,000

Table comparing fixed cost estimates and confidence intervals

Response	Similar Variances?	Estimated Marginal Cost	95% Confidence Interval
Price	No	$159/Sq Ft	$135 to $183/Sq Ft
Price/Sq Ft	Yes	$158/Sq Ft	$137 to $179/Sq Ft

Table comparing marginal cost estimates and confidence intervals

Size (Sq Ft)	Response	Similar Variances?	95% Prediction Interval Lower	95% Prediction Interval Upper	Interval Length
1,000	Price	No	$28,000	$392,000	$364,000
1,000	Price/Sq Ft	Yes	$133,000	$290,000	$157,000
3,000	Price	No	$347,000	$711,000	$364,000
3,000	Price/Sq Ft	Yes	$291,000	$765,000	$474,000

Table comparing prediction intervals for different models
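The interval lengths in the prediction interval table can be checked with a few lines of arithmetic. The sketch below uses only the tabled dollar figures. It divides each length by the home size to express it per square foot, which is presumably the scale on which the transformed model's interval was first computed before being converted to dollars.

```python
# Prediction intervals (in dollars) copied from the table above.
intervals = {
    ("Price", 1000): (28_000, 392_000),
    ("Price", 3000): (347_000, 711_000),
    ("Price/Sq Ft", 1000): (133_000, 290_000),
    ("Price/Sq Ft", 3000): (291_000, 765_000),
}

length_per_sqft = {}
for (response, size), (lower, upper) in intervals.items():
    length = upper - lower
    length_per_sqft[(response, size)] = length / size
    print(f"{response:12s} {size:5d} sq ft: length ${length:,}, "
          f"${length / size:,.0f} per sq ft")
```

The original model gives the same $364,000 length at both sizes. The transformed model's length is nearly constant per square foot (about $157 to $158), so in dollars it grows with the size of the home, matching the increasing spread in the data.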

Outliers in Regression

Identifying and Understanding Outliers

An outlier is an observation that deviates markedly from other observations. In regression, an outlier can have a large influence on the fitted line, especially if it is a leveraged observation (i.e., has an extreme value of the explanatory variable).

Leveraged Observation: An observation with an unusually high or low value of the explanatory variable, which can pull the regression line toward itself.

Scatterplot showing an outlier with high leverage
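Leverage can be quantified. The notes do not give a formula, so the sketch below assumes the standard leverage for a simple regression, h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)², and applies it to hypothetical x values in which the last observation lies far from the rest.

```python
import numpy as np

def leverage(x):
    """Leverage of each observation in a simple regression on x."""
    x = np.asarray(x, dtype=float)
    deviations = x - x.mean()
    return 1.0 / len(x) + deviations**2 / np.sum(deviations**2)

# Hypothetical values: nine ordinary observations and one far to the right.
x = [100, 120, 130, 150, 160, 180, 200, 210, 230, 900]
h = leverage(x)
print(np.round(h, 3))
```

Leverage depends only on the explanatory variable, not on the response. In a simple regression the leverages sum to 2, so a single value close to 1 means that one observation largely determines the fitted line.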

Consequences of Outliers

Outliers can significantly affect regression estimates:

Including an outlier can shift the estimated intercept and slope by more than one standard error.
Prediction intervals can change substantially depending on whether the outlier is included.

Statistic	Including Outlier	Excluding Outlier
r²	0.5586	0.3765
s_e	3,196.80	3,093.18
n	30	29

Term	Estimate (Incl.)	Std Error (Incl.)	Estimate (Excl.)	Std Error (Excl.)
Intercept	5,887.74	1,400.02	1,558.17	2,877.88
Slope	27.44	4.61	44.74	11.08

Regression lines with and without outlier
Prediction intervals with outlier included
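The claim that the estimates shift by more than one standard error can be verified from the tabled values. The sketch below measures each shift against the standard error of the fit that excludes the outlier. That choice of yardstick is an assumption. Using the smaller standard errors from the fit that includes the outlier would make the shifts look larger still.

```python
# Estimates and standard errors copied from the table above.
fits = {
    "intercept": {"incl": (5887.74, 1400.02), "excl": (1558.17, 2877.88)},
    "slope": {"incl": (27.44, 4.61), "excl": (44.74, 11.08)},
}

shift_in_se = {}
for term, fit in fits.items():
    est_incl, _ = fit["incl"]
    est_excl, se_excl = fit["excl"]
    # Measure the shift against the standard error of the fit without the outlier.
    shift_in_se[term] = abs(est_incl - est_excl) / se_excl
    print(f"{term}: shift of {shift_in_se[term]:.2f} standard errors")
```

Both the intercept and the slope move by roughly one and a half standard errors when the single outlier is added or removed.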

Handling Outliers

If the outlier is representative of future data, it should be included in the analysis.
If the outlier is due to error or is not representative, it may be excluded, but justification is needed.

Dependent Errors and Time Series

Detecting Dependence in Errors

In time series data, errors may be correlated across time (autocorrelation), violating the independence assumption of regression. This can be detected by plotting residuals versus time and using statistical tests such as the Durbin-Watson statistic.

Autocorrelation: Correlation between adjacent residuals in time series data.
Durbin-Watson Statistic: Tests the null hypothesis H0: ρ = 0 (no autocorrelation between adjacent errors).

Scatterplot showing fit for time series data
Timeplot of residuals showing dependence

Durbin-Watson Statistic

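The Durbin-Watson statistic is calculated from the residuals e_1, ..., e_n, taken in time order, as:

$$D = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$$

D ranges from 0 to 4 and is near 2 when adjacent residuals are uncorrelated. Critical values depend on sample size. If D is much less than 2, positive autocorrelation is indicated; if D is much greater than 2, negative autocorrelation is indicated.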

n	Reject the null hypothesis if D is less than	or if D is greater than
15	1.36	2.64
20	1.41	2.59
30	1.49	2.51
40	1.54	2.46
50	1.59	2.41
75	1.65	2.35
100	1.69	2.31

Table of Durbin-Watson critical values
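The statistic and the table lookup can be sketched as follows. The function names durbin_watson and indicates_dependence are hypothetical helpers written for this illustration, and the residual series are simulated. The statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals.

```python
import numpy as np

# Critical values copied from the table above: n -> (lower, upper).
CRITICAL = {15: (1.36, 2.64), 20: (1.41, 2.59), 30: (1.49, 2.51),
            40: (1.54, 2.46), 50: (1.59, 2.41), 75: (1.65, 2.35),
            100: (1.69, 2.31)}

def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def indicates_dependence(d, n):
    """True if D falls outside the tabled limits for a tabled sample size n."""
    lower, upper = CRITICAL[n]
    return d < lower or d > upper

rng = np.random.default_rng(2)
n = 50
independent = rng.normal(size=n)
# Residuals that drift slowly (a random walk) are strongly positively dependent.
drifting = np.cumsum(rng.normal(size=n))

d_indep = durbin_watson(independent)
d_drift = durbin_watson(drifting)
print(f"independent residuals: D = {d_indep:.2f}")
print(f"drifting residuals:    D = {d_drift:.2f}, "
      f"dependence indicated: {indicates_dependence(d_drift, n)}")
```

Residuals that stay on one side of zero for long stretches change little from one period to the next, so the numerator is small and D falls well below 2. Residuals that alternate in sign push D toward 4.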

Consequences and Remedies for Dependent Errors

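With positive autocorrelation, the usual formulas typically underestimate the standard errors, so confidence intervals are too narrow and hypothesis tests reject too readily.
Regression coefficients are less precise than the reported standard errors indicate.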
Best remedy: Incorporate dependence structure into the model (e.g., use time series models).
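A small simulation can illustrate how far the reported standard errors can be off. This is a sketch under stated assumptions, not a general result: the errors follow a first-order autoregression with a hypothetical coefficient rho = 0.8, the explanatory variable is a time index, and all other settings are made up. It compares the actual variability of the slope estimate across many simulated series with the average standard error that the usual formula reports.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps, rho = 50, 1000, 0.8          # hypothetical settings
t = np.arange(n, dtype=float)
sxx = np.sum((t - t.mean()) ** 2)

slopes, nominal_se = [], []
for _ in range(reps):
    # AR(1) errors: each error carries over a fraction rho of the previous one.
    shocks = rng.normal(size=n)
    errors = np.empty(n)
    errors[0] = shocks[0]
    for i in range(1, n):
        errors[i] = rho * errors[i - 1] + shocks[i]
    y = 10 + 0.5 * t + errors

    b1, b0 = np.polyfit(t, y, 1)
    resid = y - (b0 + b1 * t)
    s_e = np.sqrt(np.sum(resid ** 2) / (n - 2))
    slopes.append(b1)
    nominal_se.append(s_e / np.sqrt(sxx))   # usual formula, assumes independence

actual_sd = np.std(slopes, ddof=1)
average_nominal = np.mean(nominal_se)
print(f"actual SD of slope estimates: {actual_sd:.4f}")
print(f"average reported std error:   {average_nominal:.4f}")
```

When the actual SD is well above the average reported standard error, intervals built from the reported value cover the true slope less often than their nominal 95%.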

Case Study: Predicting Cell Phone Subscribers

Application of Regression Diagnostics

Simple regression can be used to predict the number of cell phone subscribers over time. However, if residuals show autocorrelation, statistical inferences are not reliable.

Regression equation:
Timeplot of residuals and Durbin-Watson statistic indicate violation of independence.

Scatterplot of subscribers over time
Timeplot of residuals showing autocorrelation

Conclusion: While the regression shows a strong upward trend, the violation of model assumptions means that prediction intervals and statistical inferences are not trustworthy.