Linear Regression: Concepts, Computation, and Interpretation
Study Guide - Smart Notes
Linear Regression
Introduction to Linear Regression
Linear regression is a fundamental statistical method used to describe the linear relationship between two quantitative variables. The goal is to fit a straight line to the data, which can then be used to make predictions about the response variable y based on the explanatory variable x.
Objective: To model and predict the relationship between two quantitative variables using a straight line.
Application: Predicting outcomes, understanding relationships, and quantifying the strength and direction of associations.
The Linear Regression Model
Mathematical Formulation
The linear regression model relates two variables, x and y, using the equation of a straight line:
General form: $\hat{y} = b_0 + b_1 x$
$b_0$: Intercept (the value of y when x = 0)
$b_1$: Slope (the change in y for a one-unit increase in x)
Interpretation of Slope and Intercept
Slope ($b_1$): Indicates how much y is expected to change for a one-unit increase in x. If the slope is positive, y increases with x; if negative, y decreases with x.
Intercept ($b_0$): The predicted value of y when x is zero; the point where the line crosses the y-axis.
The Regression Line
Properties of the Regression Line
The regression line is the line that best describes the relationship between x and y. It is also called the least squares regression line because it minimizes the sum of squared residuals.
Slope ($b_1$): $b_1 = r \dfrac{s_y}{s_x}$
where $r$ is the correlation coefficient, and $s_x$, $s_y$ are the standard deviations of x and y respectively.
Intercept ($b_0$): $b_0 = \bar{y} - b_1 \bar{x}$
where $\bar{x}$ and $\bar{y}$ are the means of x and y.
The regression line passes through the point $(\bar{x}, \bar{y})$ (the mean of x and the mean of y).
The predicted value from the model is denoted $\hat{y}$.
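The slope and intercept formulas above can be sketched in a few lines of NumPy; the data values here are hypothetical, chosen only to illustrate that $b_1 = r\,s_y/s_x$ and $b_0 = \bar{y} - b_1\bar{x}$ reproduce the least-squares line:

```python
import numpy as np

# Hypothetical data for illustration (not from the notes' example).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r = np.corrcoef(x, y)[0, 1]              # correlation coefficient
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope: b1 = r * sy / sx
b0 = y.mean() - b1 * x.mean()            # intercept: b0 = ybar - b1 * xbar

# The same line obtained directly by least squares:
slope, intercept = np.polyfit(x, y, 1)
```

Both routes give the same line, and the fitted line passes through $(\bar{x}, \bar{y})$ as the notes state.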
Example: Predicting Final Course Grade
Context: Predicting a student's final course grade based on the number of classes skipped.
Summary statistics: the means $\bar{x}$ and $\bar{y}$, the standard deviations $s_x$ and $s_y$, and the correlation $r$
Finding the slope: $b_1 = r \dfrac{s_y}{s_x} = -5.27$
Finding the intercept: $b_0 = \bar{y} - b_1 \bar{x} \approx 100$
Regression line: $\hat{y} = 100 - 5.27x$
Interpretation: For each additional class skipped, the predicted final grade decreases by 5.27 points. For a student who skips no classes (x = 0), the predicted grade is 100.
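The example's fitted line can be wrapped in a small helper for making predictions; this function is just the equation $\hat{y} = 100 - 5.27x$ from above:

```python
def predict_grade(classes_skipped: float) -> float:
    """Predicted final grade from the notes' example line: y-hat = 100 - 5.27x."""
    return 100 - 5.27 * classes_skipped

# A one-unit increase in x always changes the prediction by the slope, -5.27.
```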
Residuals and Model Assessment
Definition and Calculation of Residuals
Residual ($e$): The difference between the observed value and the predicted value: $e = y - \hat{y}$.
The sum of residuals is always zero for the least squares regression line.
The regression line is found by minimizing the sum of squared residuals.
Residual Plots
A residual plot displays residuals versus the explanatory variable.
If the linear model is appropriate, the residual plot should show no systematic pattern (random scatter).
Patterns in the residual plot (e.g., curves, increasing/decreasing spread) suggest the linear model may not be appropriate.
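A curved pattern in the residuals is easy to produce deliberately: fit a straight line to data that are actually quadratic. The data here are hypothetical, chosen so the pattern is obvious:

```python
import numpy as np

# Hypothetical curved data: a straight line is the wrong model here.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x**2                               # quadratic relationship

b1, b0 = np.polyfit(x, y, 1)           # force a linear fit anyway
resid = y - (b0 + b1 * x)              # residuals of the linear fit

# The residuals are positive at both ends and negative in the middle:
# a U-shaped pattern, signaling that the linear model is inappropriate.
```

In a residual plot of `resid` versus `x`, this U shape would be immediately visible, whereas an appropriate linear model would show only random scatter.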
Example: Residual Table
| x | y | Predicted value ($\hat{y}$) | Residual ($y - \hat{y}$) |
|---|---|---|---|
| 1 | 98 | 94.73 | 3.27 |
| 2 | 90 | 89.46 | 0.54 |
| 4 | 83 | 78.92 | 4.08 |
| 3 | 88 | 84.19 | 3.81 |
| 5 | 71 | 73.65 | -2.65 |
| 2 | 85 | 89.46 | -4.46 |
| 4 | 76 | 78.92 | -2.92 |
| 3 | 81 | 84.19 | -3.19 |
| 6 | 71 | 68.38 | 2.62 |
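The residual column can be checked directly: each entry is just $y - \hat{y}$ for the corresponding row of the table:

```python
# Rows from the residual table: observed y and predicted y-hat.
observed  = [98, 90, 83, 88, 71, 85, 76, 81, 71]
predicted = [94.73, 89.46, 78.92, 84.19, 73.65, 89.46, 78.92, 84.19, 68.38]

# Residual = observed - predicted, rounded to match the table.
residuals = [round(y - yhat, 2) for y, yhat in zip(observed, predicted)]
```

Note that because the slope and intercept were rounded to 100 and -5.27, these residuals sum to about 1.1 rather than exactly 0; with the unrounded least-squares coefficients, the sum would be exactly zero.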
Variation Accounted For: $r^2$ and Standard Deviation of Residuals
Standard Deviation of Residuals
The standard deviation of the residuals, $s_e$, measures the typical distance that the observed values fall from the regression line.
$s_e$ is always between 0 and $s_y$, the standard deviation of the original y data.
The variance of the residuals is $(1 - r^2)$ times the variance of the y values: $s_e^2 = (1 - r^2)\,s_y^2$
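The relationship between residual variation and $r^2$ can be verified numerically. For a least-squares line, the sum of squared residuals (SSE) equals $(1 - r^2)$ times the total sum of squares (SST); the data below are hypothetical, used only to check the identity:

```python
import numpy as np

# Illustrative data (hypothetical, not from the notes).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.1, 4.4, 4.2, 6.0, 6.9, 7.5])

b1, b0 = np.polyfit(x, y, 1)           # least-squares slope and intercept
resid = y - (b0 + b1 * x)

sse = np.sum(resid**2)                 # variation left over (squared residuals)
sst = np.sum((y - y.mean())**2)        # total variation in y
r = np.corrcoef(x, y)[0, 1]
# Identity for the least-squares line: SSE = (1 - r^2) * SST
```

Dividing both sides of the identity by the same degrees of freedom gives the variance form $s_e^2 = (1 - r^2)\,s_y^2$ stated above.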
Coefficient of Determination ($r^2$)
$r^2$ (the square of the correlation coefficient $r$) represents the proportion of the variance in y that is explained by the regression model.
$r^2$ ranges from 0 (no explanatory power) to 1 (perfect fit).
For example, if $r = 0.7$, then $r^2 = 0.49$, meaning 49% of the variation in y is explained by the regression line.
Limitations and Cautions in Linear Regression
Lurking Variables and Causation
An observed association between two variables does not imply causation.
Lurking variables may be responsible for the observed relationship.
Extrapolation
Extrapolation is using the regression model to predict y for values of x outside the range of the observed data.
Extrapolation can lead to unreliable or nonsensical predictions because the linear relationship may not hold outside the observed range.
Example: Predicting the price of a 20-year-old car using a model built on cars aged 3–8 years is not appropriate, as the relationship may change beyond the observed data.
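The same problem appears if the notes' grade model ($\hat{y} = 100 - 5.27x$, fit to small numbers of skipped classes) is pushed far beyond the observed range: the line happily produces an impossible negative grade.

```python
def predict_grade(classes_skipped: float) -> float:
    # Regression line from the notes' grade example.
    return 100 - 5.27 * classes_skipped

# Within the observed range the prediction is sensible,
# but far outside it the line gives a nonsensical result:
grade_at_20 = predict_grade(20)   # 100 - 5.27 * 20 = -5.4, an impossible grade
```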
Best Practices in Linear Regression
Check that the relationship between x and y is approximately linear (use scatterplots and residual plots).
Assess whether the slope and intercept are reasonable in the context of the data.
Avoid extrapolation beyond the observed data range.
Always interpret results in the context of the data and consider possible lurking variables.
Summary Table: Key Linear Regression Quantities
| Quantity | Symbol/Formula | Interpretation |
|---|---|---|
| Slope | $b_1 = r \dfrac{s_y}{s_x}$ | Change in y per unit change in x |
| Intercept | $b_0 = \bar{y} - b_1 \bar{x}$ | Predicted y when x = 0 |
| Predicted value | $\hat{y} = b_0 + b_1 x$ | Value of y predicted by the model |
| Residual | $e = y - \hat{y}$ | Difference between observed and predicted y |
| Coefficient of determination | $r^2$ | Proportion of variance in y explained by the model |
Additional info: The notes also emphasize the importance of checking model assumptions, understanding the limitations of linear regression, and the dangers of extrapolation and misinterpreting correlation as causation.