Scatter Diagrams, Correlation, and Least-Squares Regression: Study Guide

Study Guide - Smart Notes

Tailored notes based on your materials, expanded with key definitions, examples, and context.

Scatter Diagrams and Correlation

Drawing and Interpreting Scatter Diagrams

Scatter diagrams (or scatter plots) are graphical tools used to visualize the relationship between two quantitative variables. Each point on the plot represents an observation, with one variable plotted on the x-axis and the other on the y-axis.

Key Point 1: Scatter diagrams help identify patterns, trends, and possible relationships between variables.
Key Point 2: Variables are often classified as explanatory (independent) and response (dependent).
Example: A doctor records the ages and cholesterol levels of 14 female patients to study the relationship between age and cholesterol.

Scatter diagram example with age and cholesterol data

Identifying Explanatory and Response Variables

In statistical studies, the response variable is the outcome of interest, while the explanatory variable is used to explain or predict changes in the response.

Key Point 1: The response variable is often called the dependent variable.
Key Point 2: The explanatory variable is often called the independent variable.
Example: In the cholesterol study, age is the explanatory variable and cholesterol is the response variable.

Scatter plot with explanatory and response variables labeled

Properties of the Linear Correlation Coefficient

Understanding the Correlation Coefficient (r)

The correlation coefficient (r) measures the strength and direction of the linear relationship between two variables. It ranges from -1 to 1.

Key Point 1:
Key Point 2: r < 0 indicates a negative linear relationship; r > 0 indicates a positive linear relationship.
Key Point 3: The closer r is to -1 or 1, the stronger the linear relationship.
Example: A correlation of 0.72 indicates a moderately strong positive relationship.

Correlation coefficient table and scatter plot examples

Interpreting the Strength and Direction of Correlation

The sign of r indicates the direction (positive or negative), while the magnitude indicates the strength of the relationship.

Key Point 1: Values near 0 indicate weak or no linear relationship.
Key Point 2: Values near 1 or -1 indicate strong linear relationships.
Example: See the table below for interpretation:

r Value	Strength	Direction
0.90 to 1.00	Very strong	Positive
0.70 to 0.89	Strong	Positive
0.40 to 0.69	Moderate	Positive
0.10 to 0.39	Weak	Positive
0.00 to 0.09	No relationship	None
-0.10 to -0.39	Weak	Negative
-0.40 to -0.69	Moderate	Negative
-0.70 to -0.89	Strong	Negative
-0.90 to -1.00	Very strong	Negative

Correlation vs. Causation

Distinguishing Correlation from Causation

Correlation measures the association between two variables, but does not imply that one variable causes the other.

Key Point 1: Correlation can exist without causation due to confounding or lurking variables.
Key Point 2: Causation requires evidence that changes in one variable directly result in changes in another.
Example: Body mass index and cholesterol may be correlated, but other factors (diet, activity) may influence both.

Notes on correlation vs causation with highlighted terms

Least-Squares Regression

Finding the Least-Squares Regression Line

The least-squares regression line is the line that best fits the data, minimizing the sum of squared differences between observed and predicted values.

Key Point 1: The equation of the regression line is , where m is the slope and b is the y-intercept.
Key Point 2: The slope represents the change in the response variable for each unit increase in the explanatory variable.
Example: For cholesterol data, the regression line might be .

Least-squares regression calculation example

Using the Regression Line for Prediction

The regression line can be used to predict the response variable for given values of the explanatory variable.

Key Point 1: Predictions are valid only within the range of observed data (interpolation).
Key Point 2: Extrapolation (predicting outside the observed range) is risky and often unreliable.
Example: Predicting cholesterol for a 40-year-old woman using the regression equation.

Regression prediction example with software screenshot

Interpreting Slope and Y-Intercept

The slope and y-intercept of the regression line have specific interpretations in context.

Key Point 1: The slope indicates the expected change in the response variable per unit change in the explanatory variable.
Key Point 2: The y-intercept represents the predicted value of the response variable when the explanatory variable is zero (may not always be meaningful).
Example: In cholesterol data, a slope of 2.8 means cholesterol increases by 2.8 points per year of age.

Interpretation of slope and y-intercept in regression Notes on y-intercept interpretation

Interpolation and Extrapolation

Interpolation refers to making predictions within the range of observed data, while extrapolation refers to predictions outside this range.

Key Point 1: Interpolation is generally safe; extrapolation can be unreliable.
Example: Using the regression equation to predict cholesterol for ages within the sample is interpolation; for ages outside the sample, it is extrapolation.

Notes on interpolation and extrapolation

Using Student Survey Data

Applying Regression and Correlation to Survey Data

Survey data can be used to explore relationships between variables, such as predicting one variable from another using regression and assessing the strength of association with correlation.

Key Point 1: Identify explanatory and response variables in the context of the survey.
Key Point 2: Use regression and correlation to analyze the relationship and make predictions.
Example: Predicting distance from campus based on commute time, and interpreting the strength of the relationship.

Student survey regression and correlation example

Additional info: These notes cover Section 4.1 (Scatter Diagrams and Correlation) and Section 4.2 (Least-Squares Regression), directly relevant to Chapter 4 and Chapter 14 of a college statistics course.