Linear Regression: Concepts, Computation, and Interpretation
Study Guide - Smart Notes
Linear Regression
Introduction to Linear Regression
Linear regression is a fundamental statistical method used to describe the linear relationship between two quantitative variables. The goal is to fit a straight line to the data, which can then be used to make predictions about the response variable y based on the explanatory variable x.
Objective: To model and predict the relationship between two quantitative variables using a straight line.
Application: Predicting outcomes, understanding relationships, and quantifying the strength and direction of associations.
The Linear Regression Model
Mathematical Formulation
The linear regression model relates two variables, x and y, using the equation of a straight line:
General form: $\hat{y} = b_0 + b_1 x$
$b_0$: Intercept (the value of y when x = 0)
$b_1$: Slope (the change in y for a one-unit increase in x)
Interpretation of Slope and Intercept
Slope ($b_1$): Indicates how much y is expected to change for a one-unit increase in x. If the slope is positive, y increases with x; if negative, y decreases with x.
Intercept ($b_0$): The predicted value of y when x is zero; the point where the line crosses the y-axis.
The Regression Line
Properties of the Regression Line
The regression line is the line that best describes the relationship between x and y. It is also called the least squares regression line because it minimizes the sum of squared residuals.
Slope ($b_1$): $b_1 = r \dfrac{s_y}{s_x}$
where $r$ is the correlation coefficient, and $s_x$, $s_y$ are the standard deviations of x and y respectively.
Intercept ($b_0$): $b_0 = \bar{y} - b_1 \bar{x}$
where $\bar{x}$ and $\bar{y}$ are the means of x and y.
The regression line passes through the point $(\bar{x}, \bar{y})$ (the mean of x and the mean of y).
The predicted value from the model is denoted $\hat{y}$.
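The slope and intercept formulas above can be sketched in a few lines of NumPy; the data values here are hypothetical, chosen only to illustrate that $b_1 = r\,s_y/s_x$ and $b_0 = \bar{y} - b_1\bar{x}$ reproduce the least-squares line:

```python
import numpy as np

# Hypothetical data for illustration (not from the notes' example).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r = np.corrcoef(x, y)[0, 1]              # correlation coefficient
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope: b1 = r * sy / sx
b0 = y.mean() - b1 * x.mean()            # intercept: b0 = ybar - b1 * xbar

# The same line obtained directly by least squares:
slope, intercept = np.polyfit(x, y, 1)
```

Both routes give the same line, and the fitted line passes through $(\bar{x}, \bar{y})$ as the notes state.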
Example: Predicting Final Course Grade
Context: Predicting a student's final course grade based on the number of classes skipped.
Summary statistics: the means $\bar{x}$ and $\bar{y}$, the standard deviations $s_x$ and $s_y$, and the correlation $r$
Finding the slope: $b_1 = r \dfrac{s_y}{s_x} = -5.27$
Finding the intercept: $b_0 = \bar{y} - b_1 \bar{x} \approx 100$
Regression line: $\hat{y} = 100 - 5.27x$
Interpretation: For each additional class skipped, the predicted final grade decreases by 5.27 points. For a student who skips no classes (x = 0), the predicted grade is 100.
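The example's fitted line can be wrapped in a small helper for making predictions; this function is just the equation $\hat{y} = 100 - 5.27x$ from above:

```python
def predict_grade(classes_skipped: float) -> float:
    """Predicted final grade from the notes' example line: y-hat = 100 - 5.27x."""
    return 100 - 5.27 * classes_skipped

# A one-unit increase in x always changes the prediction by the slope, -5.27.
```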
Residuals and Model Assessment
Definition and Calculation of Residuals
Residual ($e$): The difference between the observed value and the predicted value: $e = y - \hat{y}$.
The sum of residuals is always zero for the least squares regression line.
The regression line is found by minimizing the sum of squared residuals.
Residual Plots
A residual plot displays residuals versus the explanatory variable.
If the linear model is appropriate, the residual plot should show no systematic pattern (random scatter).
Patterns in the residual plot (e.g., curves, increasing/decreasing spread) suggest the linear model may not be appropriate.
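A curved pattern in the residuals is easy to produce deliberately: fit a straight line to data that are actually quadratic. The data here are hypothetical, chosen so the pattern is obvious:

```python
import numpy as np

# Hypothetical curved data: a straight line is the wrong model here.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x**2                               # quadratic relationship

b1, b0 = np.polyfit(x, y, 1)           # force a linear fit anyway
resid = y - (b0 + b1 * x)              # residuals of the linear fit

# The residuals are positive at both ends and negative in the middle:
# a U-shaped pattern, signaling that the linear model is inappropriate.
```

In a residual plot of `resid` versus `x`, this U shape would be immediately visible, whereas an appropriate linear model would show only random scatter.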
Example: Residual Table
| x | y | Predicted value ($\hat{y}$) | Residual ($y - \hat{y}$) |
|---|---|---|---|
| 1 | 98 | 94.73 | 3.27 |
| 2 | 90 | 89.46 | 0.54 |
| 4 | 83 | 78.92 | 4.08 |
| 3 | 88 | 84.19 | 3.81 |
| 5 | 71 | 73.65 | -2.65 |
| 2 | 85 | 89.46 | -4.46 |
| 4 | 76 | 78.92 | -2.92 |
| 3 | 81 | 84.19 | -3.19 |
| 6 | 71 | 68.38 | 2.62 |
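The residual column can be checked directly: each entry is just $y - \hat{y}$ for the corresponding row of the table:

```python
# Rows from the residual table: observed y and predicted y-hat.
observed  = [98, 90, 83, 88, 71, 85, 76, 81, 71]
predicted = [94.73, 89.46, 78.92, 84.19, 73.65, 89.46, 78.92, 84.19, 68.38]

# Residual = observed - predicted, rounded to match the table.
residuals = [round(y - yhat, 2) for y, yhat in zip(observed, predicted)]
```

Note that because the slope and intercept were rounded to 100 and -5.27, these residuals sum to about 1.1 rather than exactly 0; with the unrounded least-squares coefficients, the sum would be exactly zero.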
Variation Accounted For: $r^2$ and Standard Deviation of Residuals
Standard Deviation of Residuals
The standard deviation of the residuals, $s_e$, measures the typical distance that the observed values fall from the regression line.
$s_e$ is always between 0 and $s_y$, the standard deviation of the original y data.
The variance of the residuals is $(1 - r^2)$ times the variance of the y values: $s_e^2 = (1 - r^2)\,s_y^2$
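The relationship between residual variation and $r^2$ can be verified numerically. For a least-squares line, the sum of squared residuals (SSE) equals $(1 - r^2)$ times the total sum of squares (SST); the data below are hypothetical, used only to check the identity:

```python
import numpy as np

# Illustrative data (hypothetical, not from the notes).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.1, 4.4, 4.2, 6.0, 6.9, 7.5])

b1, b0 = np.polyfit(x, y, 1)           # least-squares slope and intercept
resid = y - (b0 + b1 * x)

sse = np.sum(resid**2)                 # variation left over (squared residuals)
sst = np.sum((y - y.mean())**2)        # total variation in y
r = np.corrcoef(x, y)[0, 1]
# Identity for the least-squares line: SSE = (1 - r^2) * SST
```

Dividing both sides of the identity by the same degrees of freedom gives the variance form $s_e^2 = (1 - r^2)\,s_y^2$ stated above.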
Coefficient of Determination ($r^2$)
$r^2$ (the square of the correlation coefficient $r$) represents the proportion of the variance in y that is explained by the regression model.
$r^2$ ranges from 0 (no explanatory power) to 1 (perfect fit).
For example, if $r = 0.7$, then $r^2 = 0.49$, meaning 49% of the variation in y is explained by the regression line.
Limitations and Cautions in Linear Regression
Lurking Variables and Causation
An observed association between two variables does not imply causation.
Lurking variables may be responsible for the observed relationship.
Extrapolation
Extrapolation is using the regression model to predict y for values of x outside the range of the observed data.
Extrapolation can lead to unreliable or nonsensical predictions because the linear relationship may not hold outside the observed range.
Example: Predicting the price of a 20-year-old car using a model built on cars aged 3–8 years is not appropriate, as the relationship may change beyond the observed data.
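The same problem appears if the notes' grade model ($\hat{y} = 100 - 5.27x$, fit to small numbers of skipped classes) is pushed far beyond the observed range: the line happily produces an impossible negative grade.

```python
def predict_grade(classes_skipped: float) -> float:
    # Regression line from the notes' grade example.
    return 100 - 5.27 * classes_skipped

# Within the observed range the prediction is sensible,
# but far outside it the line gives a nonsensical result:
grade_at_20 = predict_grade(20)   # 100 - 5.27 * 20 = -5.4, an impossible grade
```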
Best Practices in Linear Regression
Check that the relationship between x and y is approximately linear (use scatterplots and residual plots).
Assess whether the slope and intercept are reasonable in the context of the data.
Avoid extrapolation beyond the observed data range.
Always interpret results in the context of the data and consider possible lurking variables.
Summary Table: Key Linear Regression Quantities
| Quantity | Symbol/Formula | Interpretation |
|---|---|---|
| Slope | $b_1 = r \dfrac{s_y}{s_x}$ | Change in y per unit change in x |
| Intercept | $b_0 = \bar{y} - b_1 \bar{x}$ | Predicted y when x = 0 |
| Predicted value | $\hat{y} = b_0 + b_1 x$ | Value of y predicted by the model |
| Residual | $e = y - \hat{y}$ | Difference between observed and predicted y |
| Coefficient of determination | $r^2$ | Proportion of variance in y explained by the model |
Additional info: The notes also emphasize the importance of checking model assumptions, understanding the limitations of linear regression, and the dangers of extrapolation and misinterpreting correlation as causation.