Scatterplots, Association, and Correlation: Visualizing and Quantifying Relationships Between Quantitative Variables
Study Guide - Smart Notes
Chapter 6: Scatterplots, Association, and Correlation
Motivating Examples
This chapter introduces the statistical investigation of relationships between two quantitative variables, using real-world examples to illustrate the process.
Example 1: Is human intelligence related to brain size?
Example 2: Does skipping classes affect academic performance?
The statistical process involves:
Formulate research question: Define the relationship of interest (e.g., does brain size predict intelligence?).
Collect data: Decide what variables to measure and on whom.
Examine the data: Use graphical and numerical techniques to explore relationships.
Interpret results and draw conclusions: Assess the evidence and its implications.
Scatterplots
Definition and Purpose
Scatterplots are graphical tools used to visualize the relationship between two quantitative variables. Each point on the plot represents a pair of observations (x, y) for an individual.
x-axis: Typically the explanatory variable (e.g., brain size).
y-axis: Typically the response variable (e.g., IQ).
Mean-mean point: The point $(\bar{x}, \bar{y})$, formed by the mean of the x values and the mean of the y values, represents the center of the data cloud.
Example: Plotting MRI brain scan pixel counts (x) against IQ scores (y) for 10 individuals.
Sample Data Table
| MRI count | IQ |
|---|---|
| 816932 | 124 |
| 951545 | 122 |
| 991305 | 135 |
| 833868 | 114 |
| 856472 | 125 |
| 852244 | 113 |
| 790619 | 121 |
| 866662 | 116 |
| 857782 | 129 |
| 948066 | 105 |
Note: Both IQ and MRI count are quantitative variables.
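As a small illustration, the mean-mean point $(\bar{x}, \bar{y})$ for the table above can be computed with Python's standard library. This is only a sketch; the variable names (`mri_count`, `iq`, `x_bar`, `y_bar`) are illustrative choices and do not come from the original notes.

```python
from statistics import mean

# MRI pixel counts (x) and IQ scores (y) for the 10 individuals in the table
mri_count = [816932, 951545, 991305, 833868, 856472,
             852244, 790619, 866662, 857782, 948066]
iq = [124, 122, 135, 114, 125, 113, 121, 116, 129, 105]

# The mean-mean point is the pair of sample means
x_bar = mean(mri_count)
y_bar = mean(iq)

print(f"mean-mean point: ({x_bar}, {y_bar})")
```

On a scatterplot of these data, this point would sit in the middle of the cloud of 10 points.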
Interpreting Scatterplots
Direction:
Positive: x and y increase together.
Negative: y tends to decrease as x increases.
Form: Linear or non-linear (e.g., quadratic, exponential).
Strength: How closely the points follow a pattern (strong vs. weak relationship).
Outliers: Points that deviate markedly from the overall pattern.
Types of Relationships
Linear relationship: Points follow a straight line (can be weak or strong).
Non-linear relationship: Points follow a curved pattern (e.g., quadratic, exponential).
Example: A scatterplot of MRI count vs. IQ shows a positive linear association if individuals with larger MRI counts tend to have higher IQ scores.
Roles for Variables
Explanatory vs. Response Variables
When plotting two variables, the explanatory variable (predictor) is usually placed on the x-axis, and the response variable (outcome) on the y-axis. The explanatory variable is thought to influence the response variable.
Sometimes, the distinction is not clear (e.g., English vs. Math grades); either variable can be on the x-axis.
Case Study: Class Skipping and Academic Performance
Data Example
Suppose we measure class skipping by the number of classes missed and academic performance by final grade for 9 students:
| # classes skipped | 1 | 2 | 4 | 3 | 5 | 2 | 4 | 3 | 6 |
|---|---|---|---|---|---|---|---|---|---|
| Final grade | 98 | 90 | 83 | 88 | 71 | 85 | 76 | 81 | 71 |
Explanatory variable: Number of classes skipped
Response variable: Final grade
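To get a feel for the pattern before any numerical summary, the nine pairs can be drawn as a rough text scatterplot. The sketch below uses only plain Python; the helper `text_scatter` and its `y_step` parameter are hypothetical names introduced for illustration, and grades are grouped into 5-point bands, which is an arbitrary choice.

```python
# Class-skipping data from the table above
skipped = [1, 2, 4, 3, 5, 2, 4, 3, 6]
grade = [98, 90, 83, 88, 71, 85, 76, 81, 71]

def text_scatter(xs, ys, y_step=5):
    """Return a crude text scatterplot: one row per y-band, one column per x value."""
    x_min, x_max = min(xs), max(xs)
    # Group y values into bands of width y_step, highest band first
    top = (max(ys) // y_step) * y_step
    bottom = (min(ys) // y_step) * y_step
    lines = []
    for band in range(top, bottom - 1, -y_step):
        row = ""
        for x in range(x_min, x_max + 1):
            # Mark the cell if any observation falls in this x column and y band
            hit = any(xi == x and band <= yi < band + y_step
                      for xi, yi in zip(xs, ys))
            row += " * " if hit else " . "
        lines.append(f"{band:>3} |{row}")
    # x-axis line and tick labels
    lines.append("    +" + "---" * (x_max - x_min + 1))
    lines.append("     " + "".join(f" {x} " for x in range(x_min, x_max + 1)))
    return "\n".join(lines)

print(text_scatter(skipped, grade))
```

The stars run from the upper left to the lower right, which is what a negative association looks like. In practice a plotting library (for example `matplotlib.pyplot.scatter`) would be the usual tool; the text version is used here only to keep the sketch dependency-free.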
Scatterplot Interpretation
Does final grade seem to be related to the number of classes skipped?
For these data the correlation coefficient is $r \approx -0.92$, which indicates a strong negative linear relationship: more classes skipped is associated with lower final grades.
Concepts on Correlation
Definition and Properties
Correlation quantifies the degree of linear association between two quantitative variables.
Positive correlation: Large values of x are associated with large values of y.
Negative correlation: Large values of x are associated with small values of y.
Correlation coefficient ($r$): Measures the strength and direction of linear association.
Properties of $r$:
$r = 1$ for perfect positive linear correlation
$r = -1$ for perfect negative linear correlation
$r \approx 0$ implies weak or no linear relationship
$r$ is unitless
Swapping x and y does not change $r$
Adding a constant to all values of either variable, or multiplying them by a positive constant, does not change $r$ (multiplying by a negative constant reverses its sign)
$r$ is sensitive to outliers
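Several of these properties can be checked numerically. The sketch below uses the class-skipping data from the case study and a small helper, `pearson_r`, which is a hypothetical name introduced here; it uses the algebraically equivalent form $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$. The extra outlier point is made up purely for illustration and is not part of the original data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation via r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Class-skipping data from the case study
skipped = [1, 2, 4, 3, 5, 2, 4, 3, 6]
grade = [98, 90, 83, 88, 71, 85, 76, 81, 71]

r = pearson_r(skipped, grade)
r_swapped = pearson_r(grade, skipped)                    # swap x and y
r_shifted = pearson_r([x + 10 for x in skipped], grade)  # add a constant
r_scaled = pearson_r([3 * x for x in skipped], grade)    # multiply by a positive constant
r_negated = pearson_r([-x for x in skipped], grade)      # multiply by a negative constant

# Hypothetical outlier (NOT in the original data): 10 classes skipped, grade 100
r_outlier = pearson_r(skipped + [10], grade + [100])

print(f"original:              {r:.3f}")
print(f"x and y swapped:       {r_swapped:.3f}")
print(f"x shifted by 10:       {r_shifted:.3f}")
print(f"x multiplied by 3:     {r_scaled:.3f}")
print(f"x multiplied by -1:    {r_negated:.3f}")
print(f"with one outlier:      {r_outlier:.3f}")
```

The first four values agree, the negated version has the opposite sign, and the single made-up outlier is enough to pull the correlation close to zero.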
Association vs. Causality
Association does not imply causation. A strong correlation between two variables does not mean that changes in one cause changes in the other. There may be lurking variables that influence both.
Example: Number of firefighters at a fire scene vs. amount of damage. Larger fires require more firefighters and cause more damage, but sending more firefighters does not cause more damage.
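The firefighter example can be mimicked with a small simulation. Everything in the sketch below is synthetic: the numbers, the noise levels, and the names `fire_size`, `firefighters`, `damage`, and `pearson_r` are assumptions made up for illustration. Fire size plays the role of the lurking variable and drives both of the other quantities; neither of them is computed from the other.

```python
import random
from math import sqrt

random.seed(0)  # fixed seed so the sketch is reproducible

def pearson_r(xs, ys):
    """Sample correlation via r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Lurking variable: size of each fire (arbitrary made-up units)
fire_size = [random.uniform(1, 10) for _ in range(200)]

# Both quantities depend on fire size plus independent noise;
# damage is never computed from the number of firefighters.
firefighters = [3 * s + random.gauss(0, 2) for s in fire_size]
damage = [20 * s + random.gauss(0, 15) for s in fire_size]

r = pearson_r(firefighters, damage)
print(f"correlation between firefighters and damage: {r:.2f}")
```

The two simulated variables come out strongly positively correlated even though, by construction, one has no effect on the other.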
Calculating the Correlation Coefficient ($r$)
Steps
Standardize the x and y values (subtract mean, divide by standard deviation).
Multiply the standardized x and y values for each pair.
Sum the products.
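Divide the sum by $n - 1$, where n is the number of pairs.
Formula:
$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$
Here $\bar{x}$ and $\bar{y}$ are the sample means, and $s_x$ and $s_y$ are the sample standard deviations of x and y.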
Example: For the class skipping data, $r \approx -0.92$ indicates a strong negative linear relationship.
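The four steps listed above can be sketched directly in Python for the class skipping data. This is a minimal illustration using the standard library, with variable names chosen here for readability; `statistics.stdev` is the sample standard deviation (divisor n - 1), which is what makes the final division by n - 1 consistent.

```python
from statistics import mean, stdev

# Class skipping data: number of classes skipped (x) and final grade (y)
skipped = [1, 2, 4, 3, 5, 2, 4, 3, 6]
grade = [98, 90, 83, 88, 71, 85, 76, 81, 71]
n = len(skipped)

# Step 1: standardize (subtract the mean, divide by the sample standard deviation)
x_bar, s_x = mean(skipped), stdev(skipped)
y_bar, s_y = mean(grade), stdev(grade)
z_x = [(x - x_bar) / s_x for x in skipped]
z_y = [(y - y_bar) / s_y for y in grade]

# Step 2: multiply the standardized values pair by pair
products = [zx * zy for zx, zy in zip(z_x, z_y)]

# Step 3: sum the products
total = sum(products)

# Step 4: divide by n - 1
r = total / (n - 1)

print(f"r = {r:.2f}")  # r = -0.92
```

Most of the products in step 2 are negative because students with above-average skipping tend to have below-average grades, which is why the sum, and therefore r, is negative.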
Additional info: The computation of $r$ involves standardizing each variable and then assessing how much they co-vary. Outliers can have a large effect on $r$, so always inspect scatterplots before interpreting correlation coefficients.