
Regression Analysis: Outliers, Model Interpretation, and Error Measures

Study Guide - Smart Notes

Tailored notes based on your materials, expanded with key definitions, examples, and context.

Analyzing the Association Between Quantitative Variables: Regression Analysis

Simple Linear Regression and Outliers

Simple linear regression is a statistical method used to model the relationship between two quantitative variables by fitting a straight line (regression line) to the observed data. Outliers can significantly affect the regression model, especially if they are influential or have high leverage.

  • Regression Line: The best-fit line is determined by minimizing the sum of squared residuals (differences between observed and predicted values).

  • Outlier: An observation that lies far from the general trend of the data. Outliers can distort the slope and intercept of the regression line.

  • Influential Point: An outlier that, if removed, would substantially change the regression results.

Example: The point (1900, 105) is an outlier and influential. Including it in the model changes the regression line and predictions. Removing it results in a model that better fits the general trend of the data.
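This effect can be sketched with a short least-squares fit. The (year, duration-in-seconds) data below and the `fit_line` helper are illustrative assumptions, not the source's actual dataset; only the influential point (1900, 105) comes from the example above:

```python
# Sketch: how one influential outlier shifts a least-squares fit.
# The data are hypothetical (year, duration) pairs; only (1900, 105)
# comes from the example in the text.

def fit_line(points):
    """Return (intercept b0, slope b1) minimizing the sum of squared residuals."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

data = [(1995, 240), (1998, 250), (2001, 255), (2004, 262), (2007, 270)]
outlier = (1900, 105)

b0_all, b1_all = fit_line(data + [outlier])
b0_clean, b1_clean = fit_line(data)
print(f"with outlier:    slope = {b1_all:.3f}")
print(f"without outlier: slope = {b1_clean:.3f}")
```

Refitting without the outlier changes the slope noticeably, which is exactly the "influential point" criterion: removal substantially changes the results.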

Interpreting Regression Output

Regression output typically includes the estimated slope and intercept, the coefficient of determination (r²), and other statistics.

  • Slope (b₁): Indicates the average change in the response variable for each one-unit increase in the explanatory variable.

  • Intercept (b₀): The predicted value of the response variable when the explanatory variable is zero. Sometimes, the intercept is not meaningful if zero is outside the range of observed data.

  • Coefficient of Determination (r²): Represents the proportion of variance in the response variable explained by the explanatory variable.

  • Correlation Coefficient (r): Measures the strength and direction of the linear relationship between two variables. r is the square root of r², with the sign matching the slope.

Formulas:

  • Regression Equation: ŷ = b₀ + b₁x

  • Coefficient of Determination: r² = 1 − SSE/SST

  • Correlation Coefficient: r = ±√r² (sign matches slope)
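These quantities can be computed directly from their definitions. A minimal sketch on hypothetical (x, y) data — the values themselves are illustrative assumptions:

```python
# Sketch: computing b0, b1, r^2, and r from the definitions, on hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept: y-hat = b0 + b1 * x
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
     / sum((x - mean_x) ** 2 for x in xs)
b0 = mean_y - b1 * mean_x

preds = [b0 + b1 * x for x in xs]
sse = sum((y - p) ** 2 for y, p in zip(ys, preds))   # error sum of squares
sst = sum((y - mean_y) ** 2 for y in ys)             # total sum of squares

r_squared = 1 - sse / sst
r = (r_squared ** 0.5) * (1 if b1 >= 0 else -1)      # sign matches slope
print(f"r^2 = {r_squared:.4f}, r = {r:.4f}")
```

Because the slope is positive here, r takes the positive square root of r²; a negative slope would flip the sign.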

Making Predictions and Calculating Residuals

The regression equation can be used to predict the response variable for a given value of the explanatory variable. The residual is the difference between the observed and predicted values.

  • Prediction: Substitute the value of x into the regression equation to estimate ŷ.

  • Residual: residual = y − ŷ (observed minus predicted)

  • Extrapolation: Predicting for values outside the range of the observed data is called extrapolation and can be unreliable.

Example: Predict the length of a song released in 2001 using the regression equation. Calculate the residual for a song with a known duration.
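A sketch of this example, with assumed coefficients — the source does not give the fitted equation, so the intercept, slope, and observed duration below are hypothetical:

```python
# Sketch: prediction and residual for a song released in 2001.
# b0 and b1 are assumed values, not the source's actual fitted coefficients.
b0, b1 = -4547.0, 2.4   # hypothetical intercept and slope (duration in seconds)

def predict(year):
    """Apply the regression equation: duration-hat = b0 + b1 * year."""
    return b0 + b1 * year

year, observed = 2001, 250.0   # hypothetical observed song duration (seconds)
predicted = predict(year)
residual = observed - predicted   # residual = observed - predicted
print(f"predicted = {predicted:.1f} s, residual = {residual:.1f} s")
```

A negative residual means the song is shorter than the model predicts for its release year.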

Error Measures in Regression

Several statistics are used to assess the fit of a regression model and the accuracy of predictions.

  • Error Sum of Squares (SSE): The sum of squared residuals.

  • Total Sum of Squares (SST): The total variation in the response variable.

  • Standard Error of Estimate (sₑ): Measures the typical distance that the observed values fall from the regression line.

  • Total Squared Distance: The sum of squared differences between the observed values and the regression line; for the fitted line this is the same quantity as SSE.

Formulas:

  • Error Sum of Squares: SSE = Σ(y − ŷ)²

  • Total Sum of Squares: SST = Σ(y − ȳ)²

  • Standard Error of Estimate: sₑ = √(SSE / (n − 2))
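The three error measures above can be sketched in a few lines. The (x, y) data here are illustrative assumptions:

```python
import math

# Sketch: SSE, SST, and the standard error of estimate on hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Least-squares fit (same formulas as above)
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
     / sum((x - mean_x) ** 2 for x in xs)
b0 = mean_y - b1 * mean_x

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # SSE: squared residuals
sst = sum((y - mean_y) ** 2 for y in ys)                     # SST: total variation
se = math.sqrt(sse / (n - 2))                                # standard error of estimate
print(f"SSE = {sse:.4f}, SST = {sst:.4f}, s_e = {se:.4f}")
```

Note that SSE ≤ SST always holds for the least-squares line, which is why r² = 1 − SSE/SST lands between 0 and 1.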

Summary Table: Effects of Outliers on Regression

Scenario        | Slope            | Intercept        | r²     | Prediction Accuracy
With Outlier    | Distorted        | Distorted        | Lower  | Poor (especially for most data)
Without Outlier | Represents trend | Represents trend | Higher | Better (for general data)

Additional info: Outliers, especially those with high leverage, can disproportionately influence regression results. It is important to assess whether such points should be included in the model based on their validity and influence.
