Linear Regression: Least Squares and the Line of Best Fit
Chapter 7: Linear Regression
Section 7.1: Least Squares – The Line of “Best Fit”
Linear regression is a fundamental statistical method used to model the relationship between two quantitative variables. The goal is to find the line that best fits the data points, minimizing the discrepancies between observed and predicted values.
Linear Model: A linear model describes the relationship between two variables using a straight line. For example, the relationship between fat and protein content in Burger King menu items can be modeled linearly.
Correlation Coefficient (r): Measures the strength and direction of the linear relationship. In the Burger King example, r is strongly positive, indicating a strong positive linear relationship between protein and fat.
Line of Best Fit: The line should be as close as possible to all data points, representing the average trend.
The Residual
The residual is a key concept in regression analysis, representing the difference between the observed value and the value predicted by the regression line.
Definition: The residual for a data point is e = y − ŷ, where y is the observed value and ŷ is the value predicted by the regression line.
Interpretation:
Points above the line have positive residuals.
Points below the line have negative residuals.
Purpose: The regression line gives the expected (average) value of the dependent variable y for a given value of the independent variable x.
More on Residuals
Calculation: Residual = Observed – Predicted
Significance: Both large positive and large negative residuals indicate poor fit; squaring residuals ensures all are positive and emphasizes larger errors.
The Line of Best Fit
The line of best fit is the line for which the sum of the squares of the residuals is minimized. This is known as the least squares line.
Least Squares Criterion: The best-fitting line minimizes the sum of squared residuals, Σ(y − ŷ)², over all data points.
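The least squares criterion can be sketched in a few lines of Python. The data here are made-up illustrative numbers, not from the text; the closed-form slope formula is the standard Sxy/Sxx solution to the minimization.

```python
def least_squares(xs, ys):
    """Return (intercept b0, slope b1) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Closed-form least squares slope: Sxy / Sxx
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    b0 = y_bar - b1 * x_bar  # intercept so the line passes through (x_bar, y_bar)
    return b0, b1

def sse(xs, ys, b0, b1):
    """Sum of squared residuals for the line y_hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]
b0, b1 = least_squares(xs, ys)

# Nudging the line in any direction increases the sum of squared residuals:
assert sse(xs, ys, b0, b1) <= sse(xs, ys, b0 + 0.5, b1)
assert sse(xs, ys, b0, b1) <= sse(xs, ys, b0, b1 + 0.1)
```

The two assertions at the end are the criterion in action: no perturbed line beats the least squares line on the sum of squared residuals.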
Section 7.2: The Linear Model
Equation of the Line of Best Fit
The equation for a straight line from algebra is y = mx + b.
For regression, the equation is written ŷ = b0 + b1x.
b1 (Slope): Indicates how rapidly ŷ changes with respect to x.
b0 (Intercept): The value of ŷ when x = 0.
Interpreting the Line of Best Fit
Slope Example: In the Burger King example, b1 = 0.91, so each additional gram of protein is associated with an expected increase of 0.91 grams of fat.
Intercept Example: The intercept b0 = 8.4 represents the expected fat content, in grams, when protein is zero.
Example: Linear Model for Hurricanes
Regression Equation: predicted maximum wind speed = b0 + b1 × (central pressure), with slope b1 = −0.9748 knots per millibar.
Interpretation: For every 1 mb increase in central pressure, the predicted maximum wind speed decreases by 0.9748 knots.
Intercept Note: The y-intercept may not always be meaningful, especially if x = 0 is not a realistic value in context (no hurricane has a central pressure of 0 mb).
Section 7.3: Finding the Least Squares Line
Slope and Correlation
Formula for Slope: b1 = r(sy/sx), where sx and sy are the standard deviations of x and y.
Units: The slope has units of y per unit of x; the correlation r is unitless.
Finding the y-Intercept
Formula: b0 = ȳ − b1x̄
The point (x̄, ȳ) always lies on the line of best fit.
Step-by-Step Example: Calculating Regression Equation
Given summary statistics (x̄, ȳ, sx, sy, and r), use the formulas for b1 and b0 to find the regression line.
Example: For the Burger King data, b1 = 0.91 grams of fat per gram of protein and b0 = 8.4 grams of fat.
Regression Equation: predicted fat = 8.4 + 0.91 × protein
Best Fit Line with Technology
Statistical software (e.g., StatCrunch) can compute the regression line and correlation coefficient efficiently.
Output Example: the software reports the fitted intercept and slope along with an error (residual) standard deviation of 2.25.
Predicting with the Line of Best Fit
Use the regression equation to predict for a given .
Example: For a sandwich with 31 g of protein: predicted fat = 8.4 + 0.91 × 31 ≈ 36.6 g
Residual Calculation: If the actual fat content is 22 g, the residual is 22 − 36.6 = −14.6 g
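The prediction and residual arithmetic, written out in Python using the slope 0.91 and intercept 8.4 quoted in these notes:

```python
def predicted_fat(protein_g):
    """Predicted fat (g) from the notes' Burger King line: fat-hat = 8.4 + 0.91 * protein."""
    return 8.4 + 0.91 * protein_g

y_hat = predicted_fat(31)   # predicted fat for a sandwich with 31 g of protein
residual = 22 - y_hat       # residual = observed minus predicted
print(round(y_hat, 2), round(residual, 2))   # prints: 36.61 -14.61
```

The large negative residual means this sandwich has far less fat than the line predicts for its protein content.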
Conditions for Using Regression
Variables must be quantitative.
Relationship must be straight enough (linear).
There should be no outliers.
Section 7.4: Regression to the Mean
Correlation and Prediction
When predicting one variable from another, the best guess is the mean if no other information is available.
With standardized variables (z-scores), the regression equation simplifies to ẑy = r·zx.
Regression to the Mean: Predictions tend to be closer to the mean (in standard deviation units) than the original values whenever |r| < 1.
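A minimal sketch of regression to the mean in standardized units, assuming an illustrative correlation of r = 0.6 (any |r| < 1 shows the same effect):

```python
r = 0.6   # assumed correlation for illustration; any |r| < 1 works

for z_x in [2.0, -1.5, 0.5]:
    z_hat_y = r * z_x   # regression prediction in standardized units
    # The predicted z-score is always closer to the mean (z = 0) than z_x:
    assert abs(z_hat_y) < abs(z_x)
```

For example, a value 2 standard deviations above the mean in x predicts a y only 1.2 standard deviations above its mean.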
Section 7.5: Examining the Residuals
Residuals Revisited
Residual for a data point: e = y − ŷ
Calculate ŷ using the regression equation, then subtract it from y.
Example: Hurricane Katrina: substitute Katrina's central pressure into the regression equation to get the predicted maximum wind speed ŷ in knots; with an actual maximum wind speed of 150 knots, the residual is 150 − ŷ knots.
Assessing Model Fit
A good regression model will have residuals that show no pattern (random scatter).
Check for direction, shape, bends, and outliers in the residual plot.
Residual Standard Deviation
The mean of residuals is zero; the standard deviation measures the typical size of prediction errors.
Histogram of residuals helps assess their distribution.
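A sketch of these residual properties on made-up data, using NumPy's polyfit for the least squares fit: the residuals average to zero, and their standard deviation (computed here with n − 2 degrees of freedom, the usual convention for simple regression) measures typical prediction error.

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b1, b0 = np.polyfit(xs, ys, 1)        # least squares slope and intercept
residuals = ys - (b0 + b1 * xs)       # observed minus predicted

assert abs(residuals.mean()) < 1e-9   # residuals from the fit average to zero
s_e = residuals.std(ddof=2)           # residual SD, n - 2 degrees of freedom
assert s_e > 0                        # data are not perfectly linear
```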
Section 7.6: R Squared – Variation Accounted for by the Model
Comparing Variation
If all residuals were zero, the model would explain all of the variation in y.
In practice, the variation in the residuals is less than the total variation in y.
R Squared (R²): The fraction of the variation in y explained by the model; it equals the square of the correlation, r².
Example: R² ≈ 0.58 (so r ≈ 0.76); 58% of the variability in fat content is explained by protein content.
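One way to see R² as "variation accounted for" is to compute it directly as 1 − SSE/SST on made-up data:

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([2.2, 3.9, 6.1, 7.8, 10.1])

b1, b0 = np.polyfit(xs, ys, 1)        # least squares fit
y_hat = b0 + b1 * xs
sse = ((ys - y_hat) ** 2).sum()       # variation left in the residuals
sst = ((ys - ys.mean()) ** 2).sum()   # total variation in y
r_squared = 1 - sse / sst             # fraction of variation accounted for
assert 0.0 <= r_squared <= 1.0
```

Because these points lie nearly on a line, r_squared comes out close to 1; it also equals the square of the correlation between xs and ys.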
Interpreting R Squared
High R² (close to 1): The model is useful for prediction.
Low R² (close to 0): The model is not useful.
Always report R² along with the standard deviation of the residuals (se).
Switching x and y
Switching the roles of x and y requires recalculating the regression equation; do not simply solve the original equation for x.
Section 7.7: Regression Assumptions and Conditions
Conditions to Check
Quantitative Variable Condition: Both variables must be quantitative.
Straight Enough Condition: Scatterplot should show a linear pattern.
Outlier Condition: Outliers can strongly affect the regression line.
Does the Plot Thicken? Condition: The spread of the residuals should be consistent across all values of x.
Conditions on the Residual Plot
No bends.
No outliers.
No changes in spread from one part of the plot to another.
Step-by-Step Example: Regression in Practice
Example: Breakfast Cereals
Variables: Sugar content (g) and calories (per serving).
Both variables are quantitative; scatterplot is linear; no outliers; spread is consistent.
Summary statistics: ȳ = 107.0, sy = 19.5 (calories); x̄ = 7.0, sx = 4.4 (grams of sugar); r = 0.564
Slope: b1 = r(sy/sx) = 0.564 × 19.5/4.4 ≈ 2.50 calories per gram of sugar
Intercept: b0 = ȳ − b1x̄ = 107.0 − 2.50 × 7.0 ≈ 89.5 calories
Regression equation: predicted calories = 89.5 + 2.50 × sugar
R² = 0.564² ≈ 0.318 (31.8% of variation explained)
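The slope and intercept can be reproduced from summary statistics. The means, standard deviations, and r below are assumed values chosen to be consistent with the reported R² of 31.8% (r = √0.318 ≈ 0.564), not taken verbatim from these notes.

```python
# Assumed summary statistics, consistent with R^2 = 31.8%:
x_bar, s_x = 7.0, 4.4      # sugar (grams)
y_bar, s_y = 107.0, 19.5   # calories
r = 0.564                  # sqrt(0.318) ~= 0.564

b1 = r * s_y / s_x         # slope: calories per gram of sugar
b0 = y_bar - b1 * x_bar    # intercept: calories

print(f"{b1:.2f} {b0:.1f}")          # prints: 2.50 89.5
assert abs(r ** 2 - 0.318) < 0.001   # matches the reported R^2
```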
Model Assessment
Slope and intercept are reasonable; prediction error should be checked for practical usefulness.
Always check actual data to validate the model.
Causation and Regression
Regression analysis shows association, not causation.
High correlation and a good model fit do not imply that changes in x cause changes in y.
Scientific explanation is needed to establish causality.
Common Pitfalls in Regression
Do not fit a straight line to a nonlinear relationship.
Do not ignore outliers; report and consider them carefully.
Do not invert the regression equation without recalculating.
Do not claim causation based solely on regression analysis.
Summary: What Have We Learned?
Linear models are useful for describing linear relationships between quantitative variables.
Residuals help assess model fit and identify violations of assumptions.
Correlation coefficient (r) quantifies the strength and direction of linear association.
Regression to the mean is a common phenomenon in related variables.
Always check regression assumptions, and report R² along with the residual standard deviation (se).
Interpret the regression slope as the expected change in y per unit change in x.