Linear Regression: Concepts, Methods, and Diagnostics

Linear Regression

Introduction to Linear Regression

Linear regression is a fundamental statistical method used to model the relationship between two quantitative variables. It allows us to describe and predict the value of a dependent variable (response) based on the value of an independent variable (predictor).

  • Correlation vs. Regression: Correlation treats both variables as equals and measures the direction and strength of their linear relationship. Regression, however, distinguishes between predictor and response variables.

  • Deterministic vs. Statistical Relationships: A deterministic relationship is exact (e.g., Fahrenheit and Celsius conversion), while regression models statistical relationships that are not perfect.

Goals of Linear Regression

  • Description: Summarize the linear relationship between variables.

  • Prediction: Use the model to estimate future or unknown values of the response variable.

Motivating Examples

  • Facebook Friends and Brain Structure: Researchers studied the association between the number of Facebook friends and brain density in specific regions, using regression to assess predictive relationships.

  • Burger King Nutrition: Examining the relationship between protein and fat content in menu items, regression helps predict fat content based on protein amount.

Scatterplots and Linear Association

Scatterplots visually display the relationship between two quantitative variables. If the points suggest a linear pattern, fitting a straight line (regression line) is appropriate.

  • Least Squares Line: The best-fitting line is called the least squares regression line, which minimizes the sum of squared residuals.

  • Multiple Possible Lines: There are many possible lines; the least squares criterion selects the one with the smallest total squared error.
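
As a minimal sketch of the least squares criterion, the snippet below scores a few candidate lines by their sum of squared residuals and picks the smallest. The data and the candidate lines are made up for illustration; the pair (2.2, 0.6) is the least squares intercept and slope for this particular toy data set.

```python
# Made-up illustrative data (not taken from the study guide).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def sse(b0, b1, xs, ys):
    """Sum of squared vertical distances between the points and the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(xs, ys))

# Hand-picked candidate lines as (intercept, slope) pairs.
candidates = {
    "flat line at mean of y": (4.0, 0.0),
    "steep guess": (1.0, 1.0),
    "least squares": (2.2, 0.6),
}

errors = {name: sse(b0, b1, x, y) for name, (b0, b1) in candidates.items()}
best = min(errors, key=errors.get)

for name, err in errors.items():
    print(f"{name}: SSE = {err:.2f}")
print("smallest SSE:", best)
```

Any other intercept and slope would give a larger SSE on this data than the least squares pair.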

Regression Model and Parameters

Each regression model is defined by two parameters: the intercept and the slope.

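  • Model Equation: The true regression model is $y = \beta_0 + \beta_1 x + \varepsilon$.

  • Fitted Regression Line: The estimated model from sample data is $\hat{y} = b_0 + b_1 x$.

  • Parameters: $\beta_0$ (intercept) and $\beta_1$ (slope) are estimated from data by $b_0$ and $b_1$.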

Finding the Best Line: Least Squares Method

The least squares method minimizes the sum of squared residuals (errors) between observed and predicted values.

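  • Residual: $e_i = y_i - \hat{y}_i$

  • Sum of Squared Errors (SSE): $\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$

  • Minimization: Find $b_0$ and $b_1$ that minimize $\text{SSE}$.

Formulas:

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$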
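
The closed-form least squares estimates, slope = S_xy / S_xx and intercept = (mean of y) - slope * (mean of x), can be sketched in a few lines of plain Python. The function name `least_squares` and the data are illustrative assumptions.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # S_xy: sum of cross-products of deviations; S_xx: sum of squared x deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up illustrative data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares(x, y)
print(f"fitted line: y_hat = {b0:.2f} + {b1:.2f} * x")
```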

Relationship Between Correlation and Regression

  • The regression line always passes through $(\bar{x}, \bar{y})$.

  • The slope is related to the correlation coefficient $r$ and the standard deviations of $x$ and $y$: $b_1 = r \, \dfrac{s_y}{s_x}$.

  • For standardized variables, the regression line's slope equals the correlation coefficient $r$.
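
These three facts can be checked numerically. The sketch below, on made-up data, computes the slope as r * s_y / s_x, confirms that the line passes through the point of means, and refits the line on z-scores to show that the standardized slope equals r. The helper names are illustrative assumptions.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def sd(v):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(v)
    return sqrt(sum((vi - m) ** 2 for vi in v) / (len(v) - 1))

def corr(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    s_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - mx) ** 2 for xi in xs)
    s_yy = sum((yi - my) ** 2 for yi in ys)
    return s_xy / sqrt(s_xx * s_yy)

def ls_slope(xs, ys):
    """Least squares slope S_xy / S_xx."""
    mx, my = mean(xs), mean(ys)
    s_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    return s_xy / sum((xi - mx) ** 2 for xi in xs)

# Made-up illustrative data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = corr(x, y)
b1 = r * sd(y) / sd(x)          # slope from correlation and standard deviations
b0 = mean(y) - b1 * mean(x)     # intercept forces the line through the means

# Standardize both variables and refit: the slope should equal r.
zx = [(xi - mean(x)) / sd(x) for xi in x]
zy = [(yi - mean(y)) / sd(y) for yi in y]
slope_standardized = ls_slope(zx, zy)

print(f"r = {r:.4f}, b1 = {b1:.4f}, standardized slope = {slope_standardized:.4f}")
```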

Interpreting Regression Coefficients

  • Intercept ($b_0$): The predicted value of $y$ when $x = 0$.

  • Slope ($b_1$): The change in the predicted value of $y$ for a one-unit increase in $x$.

Worked Example: Burger King Data

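  • Given: summary statistics $\bar{x}$, $\bar{y}$, $s_x$, $s_y$, and $r$, with protein as $x$ and fat as $y$

  • Compute slope: $b_1 = r \, s_y / s_x$

  • Compute intercept: $b_0 = \bar{y} - b_1 \bar{x}$

  • Regression equation: $\hat{y} = b_0 + b_1 x$

  • Prediction: For a given protein value $x$, $\hat{y} = b_0 + b_1 x$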
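
The same steps can be sketched in Python. The summary statistics below are hypothetical placeholders chosen for easy arithmetic; they are not the Burger King values, and the variable names are assumptions.

```python
# Hypothetical summary statistics (placeholders, NOT the Burger King data).
x_bar = 20.0   # mean of the predictor (e.g. protein, g)
y_bar = 25.0   # mean of the response (e.g. fat, g)
s_x = 10.0     # standard deviation of the predictor
s_y = 15.0     # standard deviation of the response
r = 0.8        # correlation between predictor and response

# Step 1: slope from the correlation and the two standard deviations.
b1 = r * s_y / s_x

# Step 2: intercept so that the line passes through (x_bar, y_bar).
b0 = y_bar - b1 * x_bar

def predict(x_new):
    """Predicted response for a given predictor value."""
    return b0 + b1 * x_new

# Step 3: prediction for a hypothetical predictor value of 30.
prediction = predict(30.0)
print(f"y_hat = {b0:.2f} + {b1:.2f} * x; prediction at x = 30: {prediction:.2f}")
```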

Switching Predictor and Response Variables

  • Switching the roles of $x$ and $y$ does not simply invert the slope.

  • Original slope: $b_1 = r \, \dfrac{s_y}{s_x}$; new slope: $b_1' = r \, \dfrac{s_x}{s_y}$
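
A short numerical check, on made-up data, shows that the slope from regressing x on y is not the reciprocal of the slope from regressing y on x. Instead, the product of the two slopes equals r squared. The variable names are illustrative assumptions.

```python
from math import sqrt

# Made-up illustrative data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

slope_y_on_x = s_xy / s_xx   # predict y from x
slope_x_on_y = s_xy / s_yy   # predict x from y (roles switched)
r = s_xy / sqrt(s_xx * s_yy)

print(f"slope of y on x:       {slope_y_on_x:.4f}")
print(f"slope of x on y:       {slope_x_on_y:.4f}")
print(f"1 / (slope of y on x): {1 / slope_y_on_x:.4f}")
print(f"product of slopes:     {slope_y_on_x * slope_x_on_y:.4f}  (r^2 = {r ** 2:.4f})")
```

The two slopes are reciprocals only when the correlation is exactly 1 or -1.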

Regression Assumptions and Diagnostics

Model Assumptions

  • Linearity: The relationship between $x$ and $y$ is linear.

  • Independence: Errors are independent.

  • Constant Variance (Homoscedasticity): Errors have constant variance.

  • Normality: Errors are normally distributed.

  • Outlier Condition: Outliers can strongly affect the regression model.

Influence and Leverage

  • Leverage: Data points with extreme $x$ values (far from $\bar{x}$) have high leverage.

  • Influential Points: Points whose removal would noticeably change the fitted line or other regression results.

  • Outliers and high leverage points require careful investigation.
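
As a rough illustration, the sketch below computes leverage in simple linear regression using the standard formula h_i = 1/n + (x_i - mean of x)^2 / S_xx, which is not stated in this guide and is included here as an assumption. It then refits the line without the highest-leverage point to see how much the slope moves. The data are made up, with one deliberately extreme x value.

```python
def leverages(xs):
    """Leverage of each point in simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    return [1 / n + (xi - x_bar) ** 2 / s_xx for xi in xs]

def ls_slope(xs, ys):
    """Least squares slope S_xy / S_xx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    return s_xy / sum((xi - x_bar) ** 2 for xi in xs)

# Made-up data: the last point has an extreme x value.
x = [1, 2, 3, 4, 5, 20]
y = [2, 4, 5, 4, 5, 3]

h = leverages(x)
most_extreme = h.index(max(h))

slope_all = ls_slope(x, y)
slope_without = ls_slope(x[:most_extreme] + x[most_extreme + 1:],
                         y[:most_extreme] + y[most_extreme + 1:])

print("leverages:", [round(hi, 3) for hi in h])
print(f"slope with all points: {slope_all:.3f}")
print(f"slope without point {most_extreme}: {slope_without:.3f}")
```

Here the extreme point is both high-leverage and influential: dropping it changes the sign of the slope.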

Residual Analysis

  • Residuals: $e_i = y_i - \hat{y}_i$

  • Used to check model assumptions.

  • Residual Plots: Plot residuals vs. predictor or fitted values to check for patterns.

  • Q-Q Plot: Used to assess normality of residuals.
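
The numbers behind both plots can be computed without a plotting library. The sketch below, on made-up data, builds the (fitted value, residual) pairs for a residual plot and the (theoretical normal quantile, sorted residual) pairs for a Q-Q plot. The plotting positions (i + 0.5) / n are one common convention and are an assumption here.

```python
from statistics import NormalDist

# Made-up illustrative data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Residual plot data: look for curvature or a funnel shape in these pairs.
resid_vs_fitted = list(zip(fitted, residuals))

# Q-Q plot data: sorted residuals against standard normal quantiles.
std_normal = NormalDist()
theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
qq_pairs = list(zip(theoretical, sorted(residuals)))

for q, e in qq_pairs:
    print(f"normal quantile {q:+.3f}   residual {e:+.3f}")
```

If the residuals are roughly normal, the Q-Q pairs fall close to a straight line. Least squares residuals always sum to zero, so a nonzero sum signals a computation error rather than a model problem.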

Goodness-of-Fit: Coefficient of Determination ($R^2$)

Partitioning Variation

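  • Total Sum of Squares (SST): $\text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2$

  • Sum of Squares due to Error (SSE): $\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

  • Sum of Squares due to Regression (SSR): $\text{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$

  • Relationship: $\text{SST} = \text{SSR} + \text{SSE}$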
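
The partition of variation can be verified numerically. The sketch below fits the least squares line to made-up data and computes the three sums of squares: total (SST), regression (SSR), and error (SSE).

```python
# Made-up illustrative data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variation in y
ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # variation explained by the line
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # variation left in the residuals

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"SSR + SSE = {ssr + sse:.2f}")
```

The identity holds for the least squares line with an intercept; it generally fails for an arbitrary line.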

Coefficient of Determination

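  • $R^2 = \dfrac{\text{SSR}}{\text{SST}} = 1 - \dfrac{\text{SSE}}{\text{SST}}$ represents the proportion of variance in $y$ explained by the model.

  • Ranges from 0 to 1; higher values indicate better fit.

  • Related to the correlation coefficient: in simple linear regression, $R^2 = r^2$.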
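
A quick check on made-up data that the coefficient of determination, computed as 1 - SSE/SST, matches the squared correlation coefficient in simple linear regression:

```python
from math import sqrt

# Made-up illustrative data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)   # this is also SST

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - sse / s_yy         # coefficient of determination
r = s_xy / sqrt(s_xx * s_yy)       # Pearson correlation

print(f"R^2 = {r_squared:.4f}, r = {r:.4f}, r^2 = {r ** 2:.4f}")
```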

Regression to the Mean

Concept

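  • In standardized units, the predicted response is $r$ times the standardized predictor, so whenever $|r| < 1$ the predicted value lies closer to its mean (in standard deviations) than the predictor value lies to its own mean.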

  • Example: Offspring height regresses toward the mean of parental height.
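
Regression to the mean follows from the fitted line written in standardized units: predicted z-score of the response = r * (z-score of the predictor). The sketch below uses hypothetical values for the mean, standard deviation, and correlation of heights; none of these numbers come from real data, and the function name is an assumption.

```python
# Hypothetical population values, chosen for easy arithmetic (not real data).
MEAN_HEIGHT = 170.0   # cm, assumed the same for parents and offspring
SD_HEIGHT = 7.0       # cm, assumed the same for parents and offspring
R = 0.5               # assumed parent-offspring correlation

def predict_offspring(parent_height):
    """Predicted offspring height from the standardized regression line."""
    z_parent = (parent_height - MEAN_HEIGHT) / SD_HEIGHT
    z_offspring = R * z_parent          # shrinks toward 0 when |R| < 1
    return MEAN_HEIGHT + z_offspring * SD_HEIGHT

tall = predict_offspring(184.0)    # parent 2 SDs above the mean
short = predict_offspring(156.0)   # parent 2 SDs below the mean

print(f"parent 184 cm -> predicted offspring {tall:.1f} cm")
print(f"parent 156 cm -> predicted offspring {short:.1f} cm")
```

In both directions the prediction sits only one standard deviation from the mean, half as far as the parent.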

Multiple Linear Regression

Extension of Simple Linear Regression

Multiple linear regression models the relationship between a response variable and two or more predictors.

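  • Model Equation: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$

  • Fitted Model: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$

  • Partial Regression Coefficients: Each $b_j$ measures the effect of $x_j$ on the predicted response, holding other predictors constant.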
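
A minimal sketch of fitting a multiple regression by solving the normal equations (X'X) b = X'y, written in plain Python to stay self-contained. In practice a numerical library would be used instead. The helper names and the data are assumptions; the response is generated exactly as 1 + 2*x1 + 3*x2 with no noise, so the fit should recover those coefficients.

```python
def solve(a, b):
    """Solve the square system a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    beta = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - known) / m[i][i]
    return beta

def fit_multiple(predictors, ys):
    """Least squares coefficients [b0, b1, ..., bk] for a list of predictor columns."""
    # Design matrix rows with a leading 1 for the intercept.
    rows = [[1.0] + list(values) for values in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

# Made-up data: y is exactly 1 + 2*x1 + 3*x2.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

coefficients = fit_multiple([x1, x2], y)
print("b0, b1, b2 =", [round(c, 6) for c in coefficients])
```

Here b1 is the change in the predicted response per unit of x1 with x2 held fixed, which is the partial-coefficient interpretation.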

Model Selection and Diagnostics

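  • Including many predictors can make coefficient estimates less precise (larger standard errors) and can introduce multicollinearity, where predictors are strongly correlated with one another.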

  • Variable selection and model comparison are important topics.

  • Significance of predictors is assessed using p-values; predictors with high p-values may be removed.

Correlation Matrix (Described in Text)

The correlation matrix summarizes the pairwise correlations between the variables, helping to identify multicollinearity among predictors.

Predictor 1    Predictor 2    Correlation
lstat          age             0.60
lstat          medv           -0.74
lstat          rm              0.61
lstat          crim            0.46
age            medv           -0.38
age            rm             -0.24
age            crim            0.35
medv           rm              0.69
medv           crim            0.39
rm             crim           -0.22

Additional info: Correlation values are rounded and inferred from context.
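
A pairwise correlation table like this one can be computed and screened for strongly correlated pairs with a short script. The sketch below uses three made-up columns named `a`, `b`, and `c` (not the variables in the table), and the 0.8 cutoff is an arbitrary rule of thumb assumed for illustration.

```python
from itertools import combinations
from math import sqrt

def corr(u, v):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    s_uv = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    s_uu = sum((ui - mu) ** 2 for ui in u)
    s_vv = sum((vi - mv) ** 2 for vi in v)
    return s_uv / sqrt(s_uu * s_vv)

# Made-up columns for illustration.
data = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 5, 4, 5],
    "c": [5, 3, 4, 2, 1],
}
names = list(data)

# Full symmetric correlation matrix, keyed by (row name, column name).
corr_matrix = {(p, q): corr(data[p], data[q]) for p in names for q in names}

# Flag pairs whose absolute correlation exceeds the assumed cutoff.
THRESHOLD = 0.8
flagged = [(p, q) for p, q in combinations(names, 2)
           if abs(corr_matrix[(p, q)]) > THRESHOLD]

for p, q in combinations(names, 2):
    print(f"{p:<4}{q:<4}{corr_matrix[(p, q)]:+.2f}")
print("possible multicollinearity:", flagged)
```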

Summary

  • Linear regression is a powerful tool for modeling and predicting relationships between variables.

  • Understanding assumptions, diagnostics, and model selection is essential for valid inference.

  • Multiple regression extends these concepts to more complex scenarios with several predictors.
