Skip to main content
Back

Linear Regression, Correlation, and Analysis of Variance in Statistics

Study Guide - Smart Notes

Tailored notes based on your materials, expanded with key definitions, examples, and context.

Statistical Notation and Normal Distribution

Notation for Parameters and Statistics

Statistical notation is essential for distinguishing between population parameters and sample statistics. For the normal distribution, the notation y ~ N(\mu, \sigma) indicates that the variable y is normally distributed with mean \mu and standard deviation \sigma.

  • Parameter: A value that describes a characteristic of a population (e.g., \mu, \sigma).

  • Statistic: A value calculated from sample data (e.g., sample mean, sample standard deviation).

Graphs with Two Quantitative Variables

Scatterplots and Correlation

Scatterplots are used to visualize the relationship between two quantitative variables. Correlation quantifies the strength and direction of a linear relationship between variables.

  • Scatterplot: Each point represents a pair of values for two variables.

  • Correlation coefficient (r): Measures linear association; ranges from -1 to 1.

  • Positive correlation: As one variable increases, the other tends to increase.

  • Negative correlation: As one variable increases, the other tends to decrease.

Example: The correlation between MaxSpeed and CentralPressure is r = -0.951, indicating a strong negative linear relationship.Scatterplot of MaxSpeed vs CentralPressureFitted Line Plot of MaxSpeed vs CentralPressureCorrelation table for MaxSpeed and CentralPressure

Regression and Linear Models

Simple Linear Regression

Regression analysis estimates the relationship between a response variable and a predictor variable. The fitted line plot shows the regression equation and the data points.

  • Regression equation:

  • Slope (\beta_1): Change in y for a one-unit increase in x.

  • Intercept (\beta_0): Predicted value of y when x = 0 (may not always have a logical interpretation).

Example: For predicting MaxHR from Age: Fitted Line Plot of MaxHR vs Age

Anscombe’s Quartet

Importance of Data Visualization

Anscombe’s Quartet consists of four datasets with nearly identical summary statistics but very different relationships when graphed. This demonstrates the importance of visualizing data before analysis.

  • Non-resistant statistics: Statistics that can be heavily influenced by outliers or unusual data patterns.

  • Graphical analysis: Reveals patterns, outliers, and relationships not evident in summary statistics.

Example: All four datasets have similar means, standard deviations, and correlations, but their scatterplots are visually distinct.Statistics for Anscombe's Quartet x variablesStatistics for Anscombe's Quartet y variablesCorrelations for Anscombe's QuartetCorrelations for Anscombe's Quartet x2 and y4

Examining Residuals

Definition and Calculation

Residuals are the differences between observed values and predicted values from a regression model. They are used to assess model fit and identify outliers.

  • Residual formula:

  • Interpretation: Positive residual: observed value is above the predicted value; negative residual: observed value is below the predicted value.

Residual formula diagram

Visualization and Storage of Residuals

Residuals can be visualized on fitted line plots and stored for further analysis. Comparing residuals helps identify influential points and assess model assumptions.

  • Largest residual: Indicates the observation farthest from the regression line.

  • Smallest residual: Indicates the observation closest to the regression line.

Table of residuals for MaxHR vs AgeFitted line plot with residuals visualizedBoxplot of MaxHR and ResidualsStatistics for residuals

Ordinary Least Squares (OLS) Regression

Least Squares Criterion

OLS regression estimates coefficients by minimizing the sum of squared residuals. This method ensures the best linear fit to the data.

  • Sum of squared residuals:

  • Least squares criterion: Minimizes the total squared error between observed and predicted values.

Table of residuals and squared residualsPie chart partitioning explained and unexplained variation

Analysis of Variance (ANOVA) in Regression

Partitioning Variation

ANOVA tables partition the total variation in the response variable into variation explained by the model and variation due to error (residuals).

  • Total Sum of Squares (SSTO):

  • Regression Sum of Squares (SSR):

  • Error Sum of Squares (SSE):

ANOVA table for regressionANOVA table outputANOVA table output

R Squared (Coefficient of Determination)

Interpretation and Calculation

R squared (R2) measures the proportion of variance in the response variable explained by the model. It is calculated as the squared correlation coefficient.

  • Formula:

  • Interpretation: Higher R2 indicates a more useful model for prediction.

  • Range: 0 to 1 (or 0% to 100%)

Pie chart partitioning explained and unexplained variationModel summary with R-squaredModel summary with R-squaredPie chart partitioning explained and unexplained variationPie chart partitioning explained and unexplained variation

Standard Deviation of Residuals (se) and Model Evaluation

Comparing se and sy

The standard deviation of residuals (se) measures the average spread of residuals. Comparing se to the standard deviation of the response variable (sy) helps evaluate model effectiveness.

  • se: Standard deviation of residuals; lower values indicate better model fit.

  • sy: Standard deviation of the response variable.

  • Interpretation: If se < sy, the model is useful for prediction.

Boxplot of MaxHR and ResidualsStatistics for MaxHR standard deviationStatistics for MaxHR standard deviationStatistics for MaxHR standard deviationStatistics for MaxHR standard deviation

Regression Assumptions and Conditions

Key Assumptions

For valid inference in linear regression, several assumptions must be met:

  • Linearity: The relationship between predictor and response is linear.

  • Independence: Observations are independent.

  • Homoscedasticity: Constant variance of residuals across levels of predictor.

  • Normality: Residuals are approximately normally distributed.

Summary Table: ANOVA Components

The ANOVA table summarizes the partitioning of variation in regression analysis.

Source of Variation

SS

df

Regression

SSR =

1

Error

SSE =

n - 2

Total

SSTO =

n - 1

ANOVA table for regressionANOVA table outputANOVA table output

Attributes of R Squared and se

Reporting and Interpretation

Always report both R2 and se when evaluating regression models. R2 is a descriptive measure, while se is reported in the original units of the response variable.

  • R2: Indicates the fraction of variability explained by the model.

  • se: Provides context for prediction accuracy.

Applications and Examples

Predicting Maximum Heart Rate

Regression models can be used to predict physiological variables such as maximum heart rate based on age or other predictors. The effectiveness of the model is evaluated using R2 and se.

  • Example:

  • R2: 54.9% of the variation in MaxHR is explained by age.

  • se: 15.65 beats/minute (standard deviation of residuals).

Boxplot of MaxHR and Residuals

Anscombe’s Quartet Revisited

R2 Calculation

For all four graphs in Anscombe’s Quartet, the squared correlation coefficient is or 67%.

  • Interpretation: Despite identical R2 values, the data relationships are visually distinct, emphasizing the importance of graphing data.

Conclusion

Linear regression, correlation, and ANOVA are fundamental tools in statistics for analyzing relationships between quantitative variables. Visualizing data, examining residuals, and understanding model fit are essential for effective statistical analysis.

Pearson Logo

Study Prep