Diagnostics on Regression: Coefficient of Determination, Residual Analysis, and Influential Observations
Study Guide - Smart Notes
Diagnostics on Regression
Recap of Regression Concepts
Regression analysis is a fundamental statistical technique used to examine the relationship between two variables. The following concepts are essential:
Linear Correlation Coefficient (r): Measures the strength and direction of a linear relationship between two variables.
Least Squares Regression Line: The line of best fit that minimizes the sum of squared residuals.
Prediction: Using the regression line to estimate values of the response variable for given values of the explanatory variable.
Interpretation of Slope and Intercept: The slope indicates the change in the response variable for a one-unit change in the explanatory variable; the intercept is the predicted value when the explanatory variable is zero.
Regression and Correlation
Explaining Variation with Regression
Regression analysis seeks to explain the variation in the response variable (y) using the explanatory variable (x). The regression equation is:

ŷ = b₀ + b₁x

where b₁ is the slope and b₀ is the intercept.
The coefficient of determination (R²) quantifies the fraction of the variation in y that is explained by the regression model; for simple linear regression, R² = r².
Example: If r = 0.52, then R² = (0.52)² ≈ 0.27. This means 27% of the variation in y can be explained by x through the regression equation.
Interpreting R²
R² is preferred over r for commenting on the strength of association:
R² = 0.50: About half the variation in y is explained by x.
R² = 0.25: Only a quarter of the variation in y is explained.
Scatter Plots vs. R²
Scatter plots visually demonstrate the strength of association. Higher R² values correspond to tighter clustering around the regression line.
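For concreteness, r and R² can be computed directly from paired data. A minimal sketch using NumPy (the data values here are made up for illustration, not from the notes):

```python
import numpy as np

# Hypothetical paired data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.4, 4.8, 5.1])

r = np.corrcoef(x, y)[0, 1]   # linear correlation coefficient
r_squared = r ** 2            # coefficient of determination

print(round(r, 3), round(r_squared, 3))
```

Squaring r turns a signed strength measure into a directly interpretable "fraction of variation explained."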
Decomposition of Variation
Types of Deviation
Variation in the response variable can be decomposed as follows:
Total Deviation: y − ȳ (difference between observed and mean value)
Explained Deviation: ŷ − ȳ (difference between predicted and mean value)
Unexplained Deviation: y − ŷ (difference between observed and predicted value)
The relationship is: y − ȳ = (ŷ − ȳ) + (y − ŷ)
In terms of variation: Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)², i.e., total variation = explained variation + unexplained variation.
Coefficient of Determination: R² = explained variation / total variation
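The decomposition above can be verified numerically. This sketch (hypothetical data) fits a least-squares line and checks that total variation equals explained plus unexplained variation:

```python
import numpy as np

# Hypothetical data; fit a least-squares line and check SST = SSR + SSE
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b1, b0 = np.polyfit(x, y, 1)        # slope, intercept
y_hat = b0 + b1 * x
y_bar = y.mean()

sst = ((y - y_bar) ** 2).sum()      # total variation
ssr = ((y_hat - y_bar) ** 2).sum()  # explained variation
sse = ((y - y_hat) ** 2).sum()      # unexplained variation

r_squared = ssr / sst
print(np.isclose(sst, ssr + sse))   # the identity holds for least-squares fits
```

Note that the identity Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)² is guaranteed only for the least-squares line, not for an arbitrary line through the data.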
Table: Data Sets and R²
| Data Set | Coefficient of Determination (R²) | Interpretation |
|---|---|---|
| A | 0.89 | 89% of the variability in y is explained by the regression line. |
| B | 0.67 | 67% of the variability in y is explained by the regression line. |
| C | 0.41 | 41% of the variability in y is explained by the regression line. |
Practice Questions
Regression Prediction Example
Given:
Find: Predicted systolic blood pressure for mg/day.
Solution:
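A hypothetical worked version of this kind of prediction (the intercept, slope, and sodium value below are illustrative placeholders, not the original givens) simply applies ŷ = b₀ + b₁x:

```python
# Hypothetical regression of systolic blood pressure (mmHg) on sodium
# intake (mg/day); every number here is an illustrative placeholder.
b0 = 100.0    # hypothetical intercept: predicted pressure at zero sodium
b1 = 0.01     # hypothetical slope: mmHg per extra mg/day of sodium
sodium = 2300.0                    # hypothetical intake in mg/day

predicted_sbp = b0 + b1 * sodium   # apply y-hat = b0 + b1 * x
print(predicted_sbp)               # approximately 123 mmHg
```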
Correlation and Slope
If r &lt; 0, the slope of the regression line must be negative, but its exact value depends on the standard deviations of x and y, since b₁ = r(s_y / s_x).
Residuals and Residual Analysis
Definition of Residuals
In regression, the residual for an individual observation is the difference between the observed and predicted values: residual = y − ŷ.
The least-squares regression line minimizes the sum of squared residuals.
Residual Plots
A residual plot is a scatter diagram with residuals on the vertical axis and the explanatory variable on the horizontal axis. It is used to assess the appropriateness of the linear model.
Ideal Residual Plot: No discernible pattern; residuals appear random.
Non-Ideal Residual Plot: Patterns (e.g., curves, increasing/decreasing spread) indicate model inadequacy.
Constant Error Variance (Homoskedasticity): The spread of residuals should be roughly constant across all values of x. Violation of this assumption (heteroskedasticity) undermines the reliability of regression predictions.
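A residual analysis starts by computing the residuals themselves. A sketch with hypothetical data (a real check would then plot residuals against x and look for patterns or changing spread):

```python
import numpy as np

# Hypothetical data, roughly linear in x
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1.2, 2.1, 2.8, 4.3, 4.9, 6.2, 6.8, 8.1])

b1, b0 = np.polyfit(x, y, 1)        # least-squares slope and intercept
residuals = y - (b0 + b1 * x)       # residual = observed - predicted

# For a least-squares fit with an intercept, residuals sum to
# (numerically) zero; what matters is their pattern and spread.
print(np.isclose(residuals.sum(), 0.0))
```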
Outliers and Influential Observations
Outliers
An outlier is an observation whose response value is inconsistent with the overall pattern of the data. Outliers can be detected using residual plots or boxplots of residuals.
Influence of Outliers
To assess the influence of an outlier, remove it and recalculate the regression line. Significant changes in slope or intercept indicate high influence.
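The remove-and-refit check can be sketched as follows (hypothetical data in which the last point has high leverage and a large residual):

```python
import numpy as np

# Hypothetical data: the last observation sits far from the other
# x-values (high leverage) and far from the trend (large residual).
x = np.array([1, 2, 3, 4, 5, 15], dtype=float)
y = np.array([2.0, 3.1, 3.9, 5.2, 6.1, 1.0])

slope_all, _ = np.polyfit(x, y, 1)            # fit with the suspect point
slope_trim, _ = np.polyfit(x[:-1], y[:-1], 1) # refit without it

# A large change in slope after removal flags an influential observation.
print(round(slope_all, 2), round(slope_trim, 2))
```

Here the single point is enough to flip the sign of the slope, which is exactly the kind of change that marks it as influential.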
Influential Observations
An influential observation significantly affects the regression line's slope, intercept, or the correlation coefficient. Influence is determined by:
Residuals: Vertical position relative to the regression line.
Leverage: Horizontal position; how far the explanatory variable is from its mean.
Example: Adding a data point with high leverage and a large residual can substantially change the regression line.
Handling Influential Observations
Remove only with justification.
If removal is not warranted, collect more data near the influential point or use robust estimation techniques (e.g., minimizing absolute deviations).
Importance of Plotting Data
Visual Assessment of Regression Appropriateness
Plotting data is crucial for identifying linearity, outliers, and influential points. Four datasets may have similar regression lines and correlation coefficients, but their scatterplots reveal different relationships:
| Set | Description | Regression OK? |
|---|---|---|
| A | Moderate linear association | Yes |
| B | Obvious nonlinear relationship | No |
| C | One clear outlier | Examine closely |
| D | One influential point; others have same x-value | More study needed |
Historical Context: Why the Term Regression?
The term "regression" was coined by Francis Galton in the 19th century. He observed that sons of tall fathers tended to be shorter than their fathers, and sons of short fathers tended to be taller, a phenomenon he called "regression towards the mean." The regression line predicts values closer to the mean than the original values, especially when the slope is less than 1 and units are the same on both axes.