Diagnostics on Regression: Coefficient of Determination, Residual Analysis, and Influential Observations
Study Guide - Smart Notes
Diagnostics on Regression
Recap of Regression Concepts
Regression analysis is a fundamental statistical technique used to examine the relationship between two variables. The following concepts are essential:
Linear Correlation Coefficient (r): Measures the strength and direction of a linear relationship between two variables.
Least Squares Regression Line: The line of best fit that minimizes the sum of squared residuals.
Prediction: Using the regression line to estimate values of the response variable for given values of the explanatory variable.
Interpretation of Slope and Intercept: The slope indicates the change in the response variable for a one-unit change in the explanatory variable; the intercept is the predicted value when the explanatory variable is zero.
Regression and Correlation
Explaining Variation with Regression
Regression analysis seeks to explain the variation in the response variable (y) using the explanatory variable (x). The regression equation is:

ŷ = b₀ + b₁x

where b₁ is the slope and b₀ is the intercept.
The coefficient of determination (R²) quantifies the fraction of the variation in y that is explained by the regression model; for simple linear regression, R² = r².
Example: If r = 0.52, then R² = (0.52)² ≈ 0.27. This means 27% of the variation in y can be explained by x through the regression equation.
Interpreting R²
R² is preferred over r for commenting on the strength of association:
R² = 0.50: About half the variation in y is explained by x.
R² = 0.25: Only a quarter of the variation in y is explained.
Scatter Plots vs. R²
Scatter plots visually demonstrate the strength of association. Higher R² values correspond to tighter clustering around the regression line.
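For concreteness, r and R² can be computed directly from paired data. A minimal sketch using NumPy (the data values here are made up for illustration, not from the notes):

```python
import numpy as np

# Hypothetical paired data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.4, 4.8, 5.1])

r = np.corrcoef(x, y)[0, 1]   # linear correlation coefficient
r_squared = r ** 2            # coefficient of determination

print(round(r, 3), round(r_squared, 3))
```

Squaring r turns a signed strength measure into a directly interpretable "fraction of variation explained."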
Decomposition of Variation
Types of Deviation
Variation in the response variable can be decomposed as follows:
Total Deviation: y − ȳ (difference between observed and mean value)
Explained Deviation: ŷ − ȳ (difference between predicted and mean value)
Unexplained Deviation: y − ŷ (difference between observed and predicted value)
The relationship is: y − ȳ = (ŷ − ȳ) + (y − ŷ)
In terms of variation: Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)², i.e., total variation = explained variation + unexplained variation.
Coefficient of Determination: R² = explained variation / total variation
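The decomposition above can be verified numerically. This sketch (hypothetical data) fits a least-squares line and checks that total variation equals explained plus unexplained variation:

```python
import numpy as np

# Hypothetical data; fit a least-squares line and check SST = SSR + SSE
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b1, b0 = np.polyfit(x, y, 1)        # slope, intercept
y_hat = b0 + b1 * x
y_bar = y.mean()

sst = ((y - y_bar) ** 2).sum()      # total variation
ssr = ((y_hat - y_bar) ** 2).sum()  # explained variation
sse = ((y - y_hat) ** 2).sum()      # unexplained variation

r_squared = ssr / sst
print(np.isclose(sst, ssr + sse))   # the identity holds for least-squares fits
```

Note that the identity Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)² is guaranteed only for the least-squares line, not for an arbitrary line through the data.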
Table: Data Sets and R²
| Data Set | Coefficient of Determination (R²) | Interpretation |
|---|---|---|
| A | 0.89 | 89% of the variability in y is explained by the regression line. |
| B | 0.67 | 67% of the variability in y is explained by the regression line. |
| C | 0.41 | 41% of the variability in y is explained by the regression line. |
Practice Questions
Regression Prediction Example
Given:
Find: Predicted systolic blood pressure for mg/day.
Solution:
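A hypothetical worked version of this kind of prediction (the intercept, slope, and sodium value below are illustrative placeholders, not the original givens) simply applies ŷ = b₀ + b₁x:

```python
# Hypothetical regression of systolic blood pressure (mmHg) on sodium
# intake (mg/day); every number here is an illustrative placeholder.
b0 = 100.0    # hypothetical intercept: predicted pressure at zero sodium
b1 = 0.01     # hypothetical slope: mmHg per extra mg/day of sodium
sodium = 2300.0                    # hypothetical intake in mg/day

predicted_sbp = b0 + b1 * sodium   # apply y-hat = b0 + b1 * x
print(predicted_sbp)               # approximately 123 mmHg
```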
Correlation and Slope
If r &lt; 0, the slope of the regression line must be negative, but its exact value depends on the standard deviations of x and y, since b₁ = r(s_y / s_x).
Residuals and Residual Analysis
Definition of Residuals
In regression, the residual for an individual observation is the difference between the observed and predicted values: residual = y − ŷ.
The least-squares regression line minimizes the sum of squared residuals.
Residual Plots
A residual plot is a scatter diagram with residuals on the vertical axis and the explanatory variable on the horizontal axis. It is used to assess the appropriateness of the linear model.
Ideal Residual Plot: No discernible pattern; residuals appear random.
Non-Ideal Residual Plot: Patterns (e.g., curves, increasing/decreasing spread) indicate model inadequacy.
Constant Error Variance (Homoskedasticity): The spread of residuals should be roughly constant across all values of x. Violation of this assumption (heteroskedasticity) undermines the reliability of regression predictions.
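A residual analysis starts by computing the residuals themselves. A sketch with hypothetical data (a real check would then plot residuals against x and look for patterns or changing spread):

```python
import numpy as np

# Hypothetical data, roughly linear in x
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1.2, 2.1, 2.8, 4.3, 4.9, 6.2, 6.8, 8.1])

b1, b0 = np.polyfit(x, y, 1)        # least-squares slope and intercept
residuals = y - (b0 + b1 * x)       # residual = observed - predicted

# For a least-squares fit with an intercept, residuals sum to
# (numerically) zero; what matters is their pattern and spread.
print(np.isclose(residuals.sum(), 0.0))
```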
Outliers and Influential Observations
Outliers
An outlier is an observation whose response value is inconsistent with the overall pattern of the data. Outliers can be detected using residual plots or boxplots of residuals.
Influence of Outliers
To assess the influence of an outlier, remove it and recalculate the regression line. Significant changes in slope or intercept indicate high influence.
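The remove-and-refit check can be sketched as follows (hypothetical data in which the last point has high leverage and a large residual):

```python
import numpy as np

# Hypothetical data: the last observation sits far from the other
# x-values (high leverage) and far from the trend (large residual).
x = np.array([1, 2, 3, 4, 5, 15], dtype=float)
y = np.array([2.0, 3.1, 3.9, 5.2, 6.1, 1.0])

slope_all, _ = np.polyfit(x, y, 1)            # fit with the suspect point
slope_trim, _ = np.polyfit(x[:-1], y[:-1], 1) # refit without it

# A large change in slope after removal flags an influential observation.
print(round(slope_all, 2), round(slope_trim, 2))
```

Here the single point is enough to flip the sign of the slope, which is exactly the kind of change that marks it as influential.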
Influential Observations
An influential observation significantly affects the regression line's slope, intercept, or the correlation coefficient. Influence is determined by:
Residuals: Vertical position relative to the regression line.
Leverage: Horizontal position; how far the explanatory variable is from its mean.
Example: Adding a data point with high leverage and a large residual can substantially change the regression line.
Handling Influential Observations
Remove only with justification.
If removal is not warranted, collect more data near the influential point or use robust estimation techniques (e.g., minimizing absolute deviations).
Importance of Plotting Data
Visual Assessment of Regression Appropriateness
Plotting data is crucial for identifying linearity, outliers, and influential points. Four datasets may have similar regression lines and correlation coefficients, but their scatterplots reveal different relationships:
| Set | Description | Regression OK? |
|---|---|---|
| A | Moderate linear association | Yes |
| B | Obvious nonlinear relationship | No |
| C | One clear outlier | Examine closely |
| D | One influential point; others have same x-value | More study needed |
Historical Context: Why the Term Regression?
The term "regression" was coined by Francis Galton in the 19th century. He observed that sons of tall fathers tended to be shorter than their fathers, and sons of short fathers tended to be taller, a phenomenon he called "regression towards the mean." The regression line predicts values closer to the mean than the original values, especially when the slope is less than 1 and units are the same on both axes.