
Linear Regression: Concepts, Computation, and Interpretation

Study Guide - Smart Notes

Tailored notes based on your materials, expanded with key definitions, examples, and context.

Linear Regression

Introduction to Linear Regression

Linear regression is a fundamental statistical method used to describe the linear relationship between two quantitative variables. The goal is to fit a straight line to the data, which can then be used to make predictions about the response variable y based on the explanatory variable x.

  • Objective: To model and predict the relationship between two quantitative variables using a straight line.

  • Application: Predicting outcomes, understanding relationships, and quantifying the strength and direction of associations.

The Linear Regression Model

Mathematical Formulation

The linear regression model relates two variables, x and y, using the equation of a straight line:

  • General form: ŷ = b₀ + b₁x

  • b₀: Intercept (the value of y when x = 0)

  • b₁: Slope (the change in y for a one-unit increase in x)

Interpretation of Slope and Intercept

  • Slope (b₁): Indicates how much y is expected to change for a one-unit increase in x. If the slope is positive, y increases with x; if negative, y decreases with x.

  • Intercept (b₀): The predicted value of y when x is zero; the point where the line crosses the y-axis.

The Regression Line

Properties of the Regression Line

The regression line is the line that best describes the relationship between x and y. It is also called the least squares regression line because it minimizes the sum of squared residuals.

  • Slope (b₁):

    b₁ = r (s_y / s_x)

    where r is the correlation coefficient, and s_x, s_y are the standard deviations of x and y respectively.

  • Intercept (b₀):

    b₀ = ȳ − b₁x̄

    where x̄ and ȳ are the means of x and y.

  • The regression line passes through the point (x̄, ȳ) (the mean of x and the mean of y).

  • The predicted value from the model is denoted ŷ.
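The two formulas above can be sketched in a few lines of Python. This is a minimal illustration of computing the least-squares slope and intercept from summary statistics; the numeric inputs below are made-up example values, not data from these notes.

```python
# Sketch: least-squares slope and intercept from summary statistics,
# using b1 = r * (s_y / s_x) and b0 = ybar - b1 * xbar.
# The input numbers are illustrative only.

def regression_line(r, s_x, s_y, x_bar, y_bar):
    """Return (intercept b0, slope b1) of the least-squares line."""
    b1 = r * (s_y / s_x)        # slope: change in y per one-unit change in x
    b0 = y_bar - b1 * x_bar     # intercept: the line passes through (xbar, ybar)
    return b0, b1

b0, b1 = regression_line(r=-0.8, s_x=2.0, s_y=10.0, x_bar=3.0, y_bar=85.0)
print(b0, b1)  # b1 = -0.8 * (10/2) = -4.0; b0 = 85 - (-4.0)(3) = 97.0
```

Note that a negative correlation produces a negative slope, and the intercept is chosen so the line goes through the point of means.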

Example: Predicting Final Course Grade

  • Context: Predicting a student's final course grade based on the number of classes skipped.

  • Summary statistics: x̄, s_x, ȳ, s_y, and r

  • Finding the slope: b₁ = r (s_y / s_x) = −5.27

  • Finding the intercept: b₀ = ȳ − b₁x̄ = 100

  • Regression line: ŷ = 100 − 5.27x

    Interpretation: For each additional class skipped, the predicted final grade decreases by 5.27 points. For a student who skips no classes (x = 0), the predicted grade is 100.
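The fitted line ŷ = 100 − 5.27x can be used directly as a prediction function. A small sketch, with the input x values (numbers of classes skipped) chosen here purely for illustration:

```python
# Predictions from the example's fitted line: yhat = 100 - 5.27 * x,
# where x is the number of classes skipped.
def predict_grade(classes_skipped):
    """Predicted final grade from the example regression line."""
    return 100 - 5.27 * classes_skipped

# Illustrative inputs (chosen for this sketch, not from the notes)
for x in (0, 2, 5):
    print(f"skipped {x}: predicted grade {predict_grade(x):.2f}")
```

Each additional class skipped lowers the prediction by exactly the slope, 5.27 points.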

Residuals and Model Assessment

Definition and Calculation of Residuals

  • Residual (e = y − ŷ): The difference between the observed value and the predicted value.

  • The sum of residuals is always zero for the least squares regression line.

  • The regression line is found by minimizing the sum of squared residuals.

Residual Plots

  • A residual plot displays residuals versus the explanatory variable.

  • If the linear model is appropriate, the residual plot should show no systematic pattern (random scatter).

  • Patterns in the residual plot (e.g., curves, increasing/decreasing spread) suggest the linear model may not be appropriate.

Example: Residual Table

| y  | Predicted value (ŷ) | Residual (y − ŷ) |
|----|---------------------|------------------|
| 98 | 94.73               | 3.27             |
| 90 | 89.46               | 0.54             |
| 83 | 78.92               | 4.08             |
| 88 | 84.19               | 3.81             |
| 71 | 73.65               | -2.65            |
| 85 | 89.46               | -4.46            |
| 76 | 78.92               | -2.92            |
| 81 | 84.19               | -3.19            |
| 71 | 68.38               | 2.62             |
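The residual column above can be reproduced with a short computation. This sketch takes the observed and predicted values straight from the table:

```python
# Residuals e = y - yhat for the table above (observed grades y and
# predicted grades yhat copied from the notes).
y    = [98, 90, 83, 88, 71, 85, 76, 81, 71]
yhat = [94.73, 89.46, 78.92, 84.19, 73.65, 89.46, 78.92, 84.19, 68.38]

residuals = [round(obs - pred, 2) for obs, pred in zip(y, yhat)]
print(residuals)       # [3.27, 0.54, 4.08, 3.81, -2.65, -4.46, -2.92, -3.19, 2.62]
print(sum(residuals))  # close to 0, but not exactly 0 here, because the
                       # slope and intercept were rounded before predicting
```

With the exact (unrounded) least-squares coefficients the residuals would sum to precisely zero; the small leftover here is rounding error.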

Variation Accounted For: r² and Standard Deviation of Residuals

Standard Deviation of Residuals

  • The standard deviation of the residuals, s_e, measures the typical distance that the observed values fall from the regression line.

  • s_e is always between 0 and the standard deviation of the original y data.

  • The variance of the residuals is (1 − r²) times the variance of the y values: s_e² = (1 − r²) s_y²

Coefficient of Determination (r²)

  • r² (the square of the correlation coefficient r) represents the proportion of the variance in y that is explained by the regression model.

  • r² ranges from 0 (no explanatory power) to 1 (perfect fit).

  • For example, if r = 0.7, then r² = 0.49, meaning 49% of the variation in y is explained by the regression line.
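The identity relating residual variation to r² can be checked numerically: for a least-squares fit, the sum of squared residuals equals (1 − r²) times the total sum of squares of y. A sketch with made-up toy data:

```python
# Verifying SSE = (1 - r^2) * SST for a least-squares fit.
# The x and y lists are toy data invented for this illustration.
import statistics as st

x = [1, 2, 3, 4, 5, 6]
y = [9.8, 9.0, 8.3, 8.8, 7.1, 6.5]

n = len(x)
x_bar, y_bar = st.mean(x), st.mean(y)
s_x, s_y = st.stdev(x), st.stdev(y)

# Pearson correlation, computed from its definition
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b1 = r * s_y / s_x            # least-squares slope
b0 = y_bar - b1 * x_bar       # least-squares intercept

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # residual variation
sst = sum((yi - y_bar) ** 2 for yi in y)                       # total variation in y

print(round(r * r, 3))                     # proportion of variance explained
print(abs(sse - (1 - r * r) * sst) < 1e-9)  # True: the identity holds exactly
```

Equivalently, r² = 1 − SSE/SST, which is why r² is read as the fraction of the variation in y accounted for by the line.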

Limitations and Cautions in Linear Regression

Lurking Variables and Causation

  • An observed association between two variables does not imply causation.

  • Lurking variables may be responsible for the observed relationship.

Extrapolation

  • Extrapolation is using the regression model to predict y for values of x outside the range of the observed data.

  • Extrapolation can lead to unreliable or nonsensical predictions because the linear relationship may not hold outside the observed range.

  • Example: Predicting the price of a 20-year-old car using a model built on cars aged 3–8 years is not appropriate, as the relationship may change beyond the observed data.

Best Practices in Linear Regression

  • Check that the relationship between x and y is approximately linear (use scatterplots and residual plots).

  • Assess whether the slope and intercept are reasonable in the context of the data.

  • Avoid extrapolation beyond the observed data range.

  • Always interpret results in the context of the data and consider possible lurking variables.

Summary Table: Key Linear Regression Quantities

| Quantity                     | Symbol/Formula     | Interpretation                                     |
|------------------------------|--------------------|----------------------------------------------------|
| Slope                        | b₁ = r (s_y / s_x) | Change in y per unit change in x                   |
| Intercept                    | b₀ = ȳ − b₁x̄       | Predicted y when x = 0                             |
| Predicted value              | ŷ = b₀ + b₁x       | Value of y predicted by the model                  |
| Residual                     | e = y − ŷ          | Difference between observed and predicted y        |
| Coefficient of determination | r²                 | Proportion of variance in y explained by the model |

Additional info: The notes also emphasize the importance of checking model assumptions, understanding the limitations of linear regression, and the dangers of extrapolation and misinterpreting correlation as causation.
