
Scatterplots, Correlation, and Linear Regression: Study Guide for Statistics Students

Study Guide - Smart Notes

Tailored notes based on your materials, expanded with key definitions, examples, and context.

Scatterplots, Association & Correlation

Comparing Variables

When analyzing relationships between variables, the method depends on the variable type. For two categorical variables, contingency tables and segmented bar charts are used. For two quantitative variables, scatterplots and correlation coefficients are appropriate.

  • Scatterplots show the relationship between two quantitative variables.

  • A timeplot is a special scatterplot in which the explanatory (x-axis) variable is time.

  • Scatterplots help identify patterns, trends, unusual features, and associations between variables.

Guidelines for Analyzing Scatterplots (DFSU)

To interpret scatterplots, consider:

  • Direction: Is the association positive, negative, or neither?

  • Form: Is the relationship linear, curved, or something else?

  • Strength: How tightly do the points cluster around the form?

  • Unusual Features: Are there outliers or subgroups?

Direction

  • Positive association: As one variable increases, the other increases (bottom left to upper right).

  • Negative association: As one variable increases, the other decreases (upper left to lower right).

Form

  • Linear form: Points follow a straight-ish line.

  • Other forms (curved, clustered, etc.) mean that linear tools such as correlation and regression are not appropriate.

Strength

  • Strong: Points tightly cluster around the form.

  • Weak: Points are scattered with no discernible pattern.

  • Moderate: Intermediate clustering.

  • Strength for linear forms is measured by correlation.

Unusual Features

  • Outliers: Points far from the rest.

  • Subgroups: Small clusters outside the general form.

Roles for Variables

  • Explanatory variable (x): Independent variable, plotted on the x-axis (time is always x).

  • Response variable (y): Dependent variable, plotted on the y-axis.

  • Assignment of roles depends on context, but does not imply prediction or causation.

Correlation

Definition and Conditions

Correlation measures the direction and strength of linear relationships between two quantitative variables. It is only valid when:

  • Quantitative Variables Condition: Both variables must be quantitative.

  • Straight Enough Condition: The scatterplot should show a linear form.

  • No Outliers Condition: Outliers can drastically affect correlation.

Correlation Coefficient (r)

  • The sign (+/-) indicates direction.

  • Range: −1 ≤ r ≤ 1.

  • Correlation of x vs y is the same as y vs x.

  • No units; calculated using z-scores.

  • Not affected by changes of center or scale.

  • Measures the strength of linear relationships only; even a strong curved relationship can have r near 0.

  • Very sensitive to outliers.

Formula (r as the average product of z-scores):

r = Σ(z_x · z_y) / (n − 1)
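The z-score definition above can be checked directly. A minimal NumPy sketch (NumPy is not part of the source material; the data values are made up for illustration):

```python
import numpy as np

def correlation(x, y):
    """Correlation as the average product of z-scores, divided by n - 1."""
    zx = (x - x.mean()) / x.std(ddof=1)   # z-scores use the sample SD
    zy = (y - y.mean()) / y.std(ddof=1)
    return (zx * zy).sum() / (len(x) - 1)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = correlation(x, y)
# Unitless, and unaffected by changes of center or scale:
r_scaled = correlation(10 * x + 3, y)   # same r as before
```

Because z-scores already remove each variable's center and scale, rescaling x (here, 10x + 3) leaves r unchanged.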

Guidelines for Strength

There are general guidelines for interpreting the strength of correlation:

Bounds              Category
|r| = 0             None
0 < |r| < 0.25      Weak
0.25 < |r| < 0.5    Moderately Weak
0.5 < |r| < 0.85    Moderately Strong
0.85 < |r| < 1      Strong
|r| = 1             Perfect
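The guideline table can be expressed as a small helper function. This is a sketch, not part of the source; the table leaves the boundary values (e.g., r = 0.25 exactly) unassigned, so as a convention this version puts each boundary in the stronger category:

```python
def strength_category(r):
    """Map a correlation coefficient to a strength guideline category."""
    a = abs(r)
    if a > 1:
        raise ValueError("r must be between -1 and 1")
    if a == 0:
        return "None"
    if a < 0.25:
        return "Weak"
    if a < 0.5:
        return "Moderately Weak"
    if a < 0.85:
        return "Moderately Strong"
    if a < 1:
        return "Strong"
    return "Perfect"   # |r| == 1
```

The sign of r only indicates direction, so the category depends on |r| alone: strength_category(-0.9) and strength_category(0.9) both return "Strong".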

Correlation ≠ Causation

  • Correlation indicates association, not causation.

  • Lurking variables may affect the relationship.

  • Scatterplots and r do not tell the whole story.

Common Pitfalls

  • Do not use correlation for categorical variables.

  • Do not confuse correlation with causation.

  • Only use correlation for linear relationships.

  • Beware of outliers.

Linear Regression

Linear Models

Linear regression estimates the response variable based on its relationship to the explanatory variable. The goal is to create the line of best fit (least squares line).

  • The line of best fit minimizes the sum of squared residuals.

  • Regression models are written in context, with units specified.

Regression Equation

Formula:

ŷ = b₀ + b₁x

  • ŷ: Predicted y value

  • b₀: y-intercept

  • b₁: Slope

Slope

Formula:

b₁ = r · (s_y / s_x)

  • Interpretation: On average, for every 1 unit increase in x, y increases (or decreases) by the slope value.

  • Example: For every 1 kg increase in car weight, fuel efficiency decreases by 2.3 mpg.

Y-Intercept

Formula:

b₀ = ȳ − b₁x̄

  • Represents the predicted y when x = 0.

  • May not always be meaningful in context.
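The slope and intercept formulas above can be verified against a standard least squares fit. A minimal NumPy sketch with invented data (a decreasing relationship, loosely echoing the weight-vs-mpg example):

```python
import numpy as np

x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])       # explanatory variable
y = np.array([33.0, 30.0, 28.0, 25.5, 23.0])  # response variable

r = np.corrcoef(x, y)[0, 1]
b1 = r * y.std(ddof=1) / x.std(ddof=1)  # slope: b1 = r * sy / sx
b0 = y.mean() - b1 * x.mean()           # intercept: the line passes through (x̄, ȳ)

# np.polyfit(degree 1) computes the same least squares line:
slope, intercept = np.polyfit(x, y, 1)
```

Both routes give the identical line, which is the point of the formulas: the least squares line is fully determined by r, the two standard deviations, and the two means.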

Residuals

Residuals measure the difference between observed and predicted values.

  • Formula: e = y − ŷ

  • Negative residual: the model overestimated the observed value (ŷ > y).

  • Positive residual: the model underestimated the observed value (ŷ < y).

Residual Plots

Residual plots help assess the accuracy of the model. A good model will have a residual plot with no direction, form, strength, or unusual features.

[Figures: scatterplot of Dive Heart Rate vs. Duration; residual plot of Dive Heart Rate vs. Duration]

Least Squares

  • The best model minimizes the sum of squared residuals.

  • The least squares residuals sum to zero: positive and negative errors cancel out.
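Both properties are easy to check numerically. A minimal NumPy sketch with invented data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.3, 9.7])

b1, b0 = np.polyfit(x, y, 1)  # least squares slope and intercept
y_hat = b0 + b1 * x           # predicted values
residuals = y - y_hat         # e = y - ŷ

# Negative entries are points the line overestimated, positive ones
# are underestimated, and for the least squares line they cancel:
total = residuals.sum()       # essentially zero (up to rounding)
```

Because the residuals always cancel, their plain sum cannot rank candidate lines; that is why least squares minimizes the sum of *squared* residuals instead.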

Standard Error (se)

  • Standard deviation of the residuals.

  • Summarizes the typical error size.

  • Interpretation: Estimates will typically be off by about se (y units).
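The standard error of the residuals can be computed directly. A sketch with invented data, assuming the usual n − 2 degrees of freedom that regression output reports (two parameters, b₀ and b₁, are estimated):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.3, 9.7])

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# se summarizes the typical prediction error, in y's units
n = len(x)
se = np.sqrt((residuals ** 2).sum() / (n - 2))
```

Here se comes out around 0.25, so predictions from this line would typically be off by about 0.25 y-units.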

Computer Outputs

Regression output tables summarize key statistics: coefficients, standard errors, t-ratios, p-values, R squared, and standard error of residuals.

[Figure: annotated regression output table]

Variability and R Squared

R squared (R²) measures the proportion of variability in the response variable accounted for by the model.

  • Formula: R² = 1 − Σe² / Σ(y − ȳ)²

  • Properties: 0 ≤ R² ≤ 1; R² is the square of the correlation coefficient (R² = r²).

  • Interpretation: R squared indicates the percentage of variability in y explained by x.
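The "variability explained" definition and the r² identity agree, which is easy to confirm. A minimal NumPy sketch with invented data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.3, 9.7])

r = np.corrcoef(x, y)[0, 1]
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# R^2 as the fraction of y's variability the model accounts for:
r_squared = 1 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()
# ...which equals the square of the correlation coefficient, r**2.
```

For simple (one-predictor) regression the two computations always match; the ratio form is the one that generalizes to multiple regression.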

Regression Wisdom

Checking Residuals

Residual plots are essential for verifying the linearity assumption. Patterns in residuals may indicate violations of regression conditions.

Getting the "Bends"

  • Curved relationships may not be apparent in scatterplots but are visible in residual plots.

  • Always check residuals for bends after fitting regression.

Extrapolation

  • Extrapolation is predicting values outside the range of the data.

  • Predictions far from the mean in x are less reliable.

  • Extrapolation is risky and can lead to inaccurate predictions.

[Figure: timeplot of oil price forecasts and actual prices]

Outliers, Leverage, and Influence

  • Outliers can strongly influence regression results.

  • High leverage points have x-values far from the mean and can change the regression line.

  • Influential points are those whose removal significantly alters the model.

[Figures: scatterplot showing outlier effect on regression; scatterplot with high-leverage point; scatterplot with influential point]
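The effect of a single high-leverage point can be demonstrated numerically. A sketch with invented data: five points on a perfect line of slope 2, plus one point with an x-value far from the mean and a y-value off the pattern:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # exactly y = 2x

slope_clean = np.polyfit(x, y, 1)[0]        # slope is 2

# add one high-leverage point: x far from the mean, y off the pattern
x2 = np.append(x, 15.0)
y2 = np.append(y, 5.0)
slope_with_point = np.polyfit(x2, y2, 1)[0]  # slope collapses toward 0
```

One point out of six drags the slope from 2 down to roughly 0.08, which is why comparing the regression with and without unusual points is recommended.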

Lurking Variables and Causation

  • Regression cannot prove causation.

  • Lurking variables may drive observed associations.

  • Observational data cannot rule out lurking variables.

[Figures: scatterplots of life expectancy vs. doctors per person, and life expectancy vs. TVs per person]

Working With Summary Values

  • Scatterplots of summary statistics (e.g., averages) show less variability than individual data.

  • Summary data can inflate the impression of relationship strength.

  • There is no simple correction for this phenomenon.
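The inflation effect of averaging can be simulated. A sketch, not from the source: simulated individual data with heavy scatter around a linear trend, then the same data reduced to group means:

```python
import numpy as np

rng = np.random.default_rng(7)

# individual-level data: 50 noisy observations at each of 5 x-values
group_x = np.repeat(np.array([1.0, 2.0, 3.0, 4.0, 5.0]), 50)
y = 2 * group_x + rng.normal(0, 4, size=group_x.size)  # lots of scatter

r_individual = np.corrcoef(group_x, y)[0, 1]

# the same data summarized by group means
mean_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
mean_y = np.array([y[group_x == g].mean() for g in mean_x])
r_summary = np.corrcoef(mean_x, mean_y)[0, 1]
```

Averaging within groups cancels most of the individual scatter, so r_summary comes out much closer to 1 than r_individual even though the underlying relationship is the same.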

[Figures: % voting for Trump vs. % rural population, using state averages and using county data]

What Can Go Wrong?

  • Do not fit a straight line to a nonlinear relationship.

  • Do not ignore outliers.

  • Do not infer causation from strong linear relationships.

  • Do not choose a model based on R squared alone.

  • Do not invert regression (switching x and y changes the model).

  • Check for different groups and fit separate models if needed.

  • Beware of extrapolation, especially into the future.

  • Look for unusual points and compare regressions with and without them.

  • Treat unusual points honestly; do not remove them just to improve fit.

  • Watch out for lurking variables and summary data.

Summary

  • Regression analysis requires careful checking of assumptions and conditions.

  • Residuals, outliers, leverage, and influential points must be considered.

  • Extrapolation and summary data can mislead interpretations.

  • Correlation and regression do not imply causation.
