Scatterplots, Correlation, and Simple Linear Regression: Study Notes for Stat 250

Scatterplots and Relationships Between Quantitative Variables

Introduction to Scatterplots

Scatterplots are graphical representations used to visualize the relationship between two quantitative variables. Each point on the scatterplot corresponds to a pair of values from the dataset, allowing for the identification of patterns, trends, and associations.

Key Point 1: Scatterplots help determine the direction, form, and strength of relationships between variables.
Key Point 2: The axes represent the two variables being compared, typically with the explanatory variable on the x-axis and the response variable on the y-axis.
Example: The scatterplot of Bodyfat vs Weight shows how body fat percentage relates to weight in pounds for a sample of men.

Figures: Description of the body fat dataset; scatterplots of Bodyfat vs Weight, Bodyfat vs Abdomen, Height vs Age, and Wrist vs Age.

Patterns of Association in Scatterplots

The pattern of points in a scatterplot reveals the type of association between variables. Associations can be positive, negative, or show no clear direction.

Positive Association: As one variable increases, the other also increases.
Negative Association: As one variable increases, the other decreases.
No Association: No discernible pattern between the variables.
Complex Association: Patterns that are not linear, such as curved relationships.
Example: The scatterplot of Bodyfat vs Abdomen shows a strong positive association, while Height vs Age shows a weak negative association.

Figures: Example scatterplots showing positive association, negative association, no association, and complex association.

Correlation: Measuring Linear Relationships

Definition and Features of Correlation

Correlation quantifies the strength and direction of a linear relationship between two quantitative variables. The most common measure is the Pearson correlation coefficient, denoted as r.

Key Point 1: r ranges from -1 (perfect negative) to +1 (perfect positive), with 0 indicating no linear relationship.
Key Point 2: Correlation is unitless and does not require identification of explanatory or response variables.
Key Point 3: Only linear relationships are measured; non-linear associations are not captured by r.
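Formula: The Pearson correlation coefficient is calculated as $r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$, where $\bar{x}$ and $\bar{y}$ are the sample means, $s_x$ and $s_y$ are the sample standard deviations, and n is the number of observations.
Example: The correlation between Bodyfat and Weight is 0.596, indicating a moderate positive relationship.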
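The calculation of r can be sketched in a few lines of Python. This is a minimal illustration that averages the products of z-scores with n - 1 in the denominator; the function name `pearson_r` and the toy data are assumptions made for this example and are not the body fat measurements.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: sum of z-score products divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (n - 1 in the denominator).
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("r is undefined when a variable has no variation")
    z_products = (
        ((x - mean_x) / s_x) * ((y - mean_y) / s_y) for x, y in zip(xs, ys)
    )
    return sum(z_products) / (n - 1)

# Hypothetical toy data, chosen only to keep the arithmetic small.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
# Rescaling x (for example, changing its units) should leave r unchanged.
r_rescaled = pearson_r([2.2 * v for v in x], y)

print(round(r, 3))           # 0.775
print(round(r_rescaled, 3))  # 0.775
```

Swapping the two arguments gives the same value, which matches the point that r does not depend on which variable is treated as explanatory.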

Interpreting Correlation Values

Correlation values are interpreted based on their magnitude and sign. The closer the value is to ±1, the stronger the linear relationship.

Perfect Positive Correlation: r = +1
Perfect Negative Correlation: r = -1
No Linear Correlation: r = 0 (a non-linear relationship may still be present)
Example: The correlation between Bodyfat and Abdomen is 0.812, which is considered strong and positive.
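A value of r = 0 rules out only a linear pattern. The sketch below, using made-up data, compares a perfectly curved relationship (y = x squared over x values symmetric about 0) with a perfectly straight one. The helper `pearson_r` is written for this example and uses the sum-of-products form of the formula, which is algebraically equivalent to averaging z-score products.

```python
import math

def pearson_r(xs, ys):
    """Pearson r via sums of deviations (equivalent to the z-score form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: x values symmetric around 0.
x = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [v ** 2 for v in x]     # perfect U-shaped relationship
y_line = [2 * v + 1 for v in x]   # perfect straight-line relationship

r_curve = pearson_r(x, y_curve)
r_line = pearson_r(x, y_line)

print(r_curve)  # 0 for the curve, even though y is fully determined by x
print(r_line)   # 1 for the straight line
```

This is why a scatterplot should be checked before relying on r alone.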

Figure: Correlation values and their corresponding scatterplots.

Correlation in Practice: Body Fat Data

Correlation analysis can be applied to real datasets to quantify relationships between variables.

Variable Pair	Correlation (r)	Interpretation
Bodyfat vs Weight	0.596	Moderate positive
Bodyfat vs Abdomen	0.812	Strong positive
Height vs Age	-0.269	Weak negative
Wrist vs Age	0.216	Weak positive
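The verbal labels attached to these correlations can be reproduced with a small helper. The cutoffs used here (0.5 for "moderate", 0.8 for "strong") are assumptions chosen only because they agree with the labels in these notes; they are not an official rule, and other courses or textbooks may draw the lines elsewhere.

```python
def describe_correlation(r, moderate_cutoff=0.5, strong_cutoff=0.8):
    """Label r by direction and strength using assumed (hypothetical) cutoffs."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "no linear relationship"
    size = abs(r)
    if size >= strong_cutoff:
        strength = "strong"
    elif size >= moderate_cutoff:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

# Correlations reported in the notes for the body fat data.
body_fat_correlations = {
    "Bodyfat vs Weight": 0.596,
    "Bodyfat vs Abdomen": 0.812,
    "Height vs Age": -0.269,
    "Wrist vs Age": 0.216,
}

for pair, r in body_fat_correlations.items():
    print(f"{pair}: r = {r} ({describe_correlation(r)})")
```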

Figures: Correlation output for Bodyfat vs Weight, Height vs Age, and Wrist vs Age.

Correlation Does Not Imply Causation

A strong correlation does not by itself mean that one variable causes changes in the other. The association may be produced by a lurking variable that influences both, or it may be coincidental.

Key Point: Always consider the possibility of confounding factors or underlying mechanisms.
Example: Chocolate consumption and the number of Nobel laureates are strongly correlated, but that alone does not establish that one causes the other.

Figures: Article on chocolate consumption and Nobel laureates; scatterplot of chocolate consumption vs Nobel laureates.

Simple Linear Regression

The Linear Model

Simple linear regression models the relationship between two quantitative variables by fitting a straight line to the data. The equation of the line is:

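Equation: $\hat{y} = b_0 + b_1 x$, where $\hat{y}$ is the predicted value of the response variable y and x is the explanatory variable.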
y-intercept (b0): The predicted value of y when x = 0.
Slope (b1): The predicted change in y for each one unit increase in x.
Example: In hurricane data, the regression equation predicts maximum wind speed from central pressure.

Figure: Linear regression equation diagram.

Finding the Least Squares Line

The least squares method determines the line of best fit by minimizing the sum of squared differences between observed and predicted values.

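Key Point: Among all possible straight lines, the least squares line has the smallest sum of squared residuals for the data.
Formula: The slope and intercept are chosen to minimize $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $\hat{y}_i = b_0 + b_1 x_i$ is the predicted value for observation i. The solution is $b_1 = r \, s_y / s_x$ and $b_0 = \bar{y} - b_1 \bar{x}$.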
Example: Fitting a regression line to hurricane data to predict wind speed.
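The least squares idea can be checked numerically. The sketch below computes the slope as the sum of cross-deviations divided by the sum of squared x-deviations (the standard closed form, equal to r times s_y / s_x) and the intercept as the mean of y minus the slope times the mean of x. It then confirms that shifting or tilting the line increases the sum of squared residuals. The function names and the toy data are assumptions for illustration; this is not the hurricane data.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of (observed y - predicted y)^2 for the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(x, y)
best = sum_squared_residuals(x, y, b0, b1)
shifted = sum_squared_residuals(x, y, b0 + 0.5, b1)  # same slope, moved up
tilted = sum_squared_residuals(x, y, b0, b1 + 0.1)   # same intercept, steeper

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")  # b0 = 2.20, b1 = 0.60
print(f"{best:.2f} {shifted:.2f} {tilted:.2f}")
```

Any other line tried on the same data gives a larger sum of squared residuals than `best`, which is what "line of best fit" means here.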

Figures: Least squares regression line diagram; comparison of regression lines.

Regression Example: Hurricanes

Regression analysis can be used to predict hurricane wind speed based on central pressure. The fitted line plot shows the linear relationship and the regression equation.

Regression Equation: MaxSpeed = 1264 - 1.20 × CentralPressure
Interpretation: For each increase of 1 millibar in central pressure, the predicted maximum speed decreases by 1.20 miles per hour.
Correlation: r = -0.951 (strong negative relationship)
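Trustworthy Prediction: Predictions are reliable only when the x-value lies within the range of central pressures observed in the data. Predicting outside that range (extrapolation) is not trustworthy.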
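As a worked example, the fitted equation MaxSpeed = 1264 - 1.20 × CentralPressure can be evaluated for a hurricane with central pressure 940, the case listed among the figures for this section. The sketch also includes a simple range check. The observed range of pressures is not stated in these notes, so the bounds below are hypothetical placeholders that should be replaced with the minimum and maximum central pressure in the actual data.

```python
B0 = 1264.0  # intercept from the regression equation in the notes
B1 = -1.20   # slope: change in predicted mph per 1 millibar of pressure

# Hypothetical bounds; the real observed range is not given in the notes.
HYPOTHETICAL_X_MIN = 900.0
HYPOTHETICAL_X_MAX = 1010.0

def predict_max_speed(central_pressure_mb):
    """Predicted maximum wind speed (mph) from central pressure (millibars)."""
    return B0 + B1 * central_pressure_mb

def is_within_observed_range(x, x_min, x_max):
    """True when x lies inside the range of x-values used to fit the line."""
    return x_min <= x <= x_max

speed_940 = predict_max_speed(940)
ok_940 = is_within_observed_range(940, HYPOTHETICAL_X_MIN, HYPOTHETICAL_X_MAX)
ok_zero = is_within_observed_range(0, HYPOTHETICAL_X_MIN, HYPOTHETICAL_X_MAX)

print(f"{speed_940:.1f}")  # 136.0, since 1264 - 1.20 * 940 = 1264 - 1128
print(ok_940, ok_zero)
```

The y-intercept of 1264 is the prediction at a central pressure of 0, which lies far outside any plausible data, so it anchors the line but has no practical interpretation as a wind speed.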

Figures: Scatterplot of MaxSpeed vs CentralPressure; fitted line plot for hurricane data; regression output for hurricane data; slope and correlation interpretation; y-intercept interpretation; prediction for a hurricane with central pressure 940.

Summary Table: Correlation and Regression Results

Variable Pair	Correlation (r)	Regression Equation
Bodyfat vs Weight	0.596	Not provided
Bodyfat vs Abdomen	0.812	Not provided
Height vs Age	-0.269	Not provided
Wrist vs Age	0.216	Not provided
MaxSpeed vs CentralPressure	-0.951	MaxSpeed = 1264 - 1.20 × CentralPressure

Key Takeaways

Scatterplots are essential for visualizing relationships between quantitative variables.
Correlation measures the strength and direction of linear relationships, but does not imply causation.
Simple linear regression models the relationship and allows for prediction using the least squares method.
Interpret regression coefficients in context, and ensure predictions are made within the observed range of data.