Scatterplots, Association, and Correlation: Visualizing and Quantifying Relationships Between Quantitative Variables

Study Guide - Smart Notes

Chapter 6: Scatterplots, Association, and Correlation

Motivating Examples

This chapter introduces the statistical investigation of relationships between two quantitative variables, using real-world examples to illustrate the process.

Example 1: Is human intelligence related to brain size?
Example 2: Does skipping classes affect academic performance?

The statistical process involves:

Formulate research question: Define the relationship of interest (e.g., does brain size predict intelligence?).
Collect data: Decide what variables to measure and on whom.
Examine the data: Use graphical and numerical techniques to explore relationships.
Interpret results and draw conclusions: Assess the evidence and its implications.

Scatterplots

Definition and Purpose

Scatterplots are graphical tools used to visualize the relationship between two quantitative variables. Each point on the plot represents a pair of observations (x, y) for an individual.

x-axis: Typically the explanatory variable (e.g., brain size).
y-axis: Typically the response variable (e.g., IQ).
Mean-mean point: The point (x̄, ȳ), formed from the mean of x and the mean of y, marks the center of the data cloud.

Example: Plotting MRI brain scan pixel counts (x) against IQ scores (y) for 10 individuals.

Sample Data Table

Person	MRI count	IQ
1	816932	124
2	951545	122
3	991305	135
4	833868	114
5	856472	125
6	852244	113
7	790619	121
8	866662	116
9	857782	129
10	948066	105

Note: Both IQ and MRI count are quantitative variables.
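
As a small sketch of how the table becomes a set of plotted points, the snippet below stores the ten (MRI count, IQ) pairs and computes the mean-mean point. The variable names are illustrative choices, not part of the original notes.

```python
from statistics import mean

# (MRI count, IQ) values copied from the sample data table
mri_count = [816932, 951545, 991305, 833868, 856472,
             852244, 790619, 866662, 857782, 948066]
iq = [124, 122, 135, 114, 125, 113, 121, 116, 129, 105]

# Each person contributes one point (x, y) to the scatterplot
points = list(zip(mri_count, iq))

# The mean-mean point sits at the center of the data cloud
mean_mean_point = (mean(mri_count), mean(iq))
print(mean_mean_point)
```

The list of points is what a plotting tool would draw, with MRI count on the x-axis and IQ on the y-axis.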

Interpreting Scatterplots

Direction:
- Positive: y tends to increase as x increases.
- Negative: y tends to decrease as x increases.
Form: Linear or non-linear (e.g., quadratic, exponential).
Strength: How closely the points follow a pattern (strong vs. weak relationship).
Outliers: Points that deviate markedly from the overall pattern.
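
One rough way to check direction numerically is to look at the sign of the summed products of deviations from the means. The sketch below assumes that approach. The function name `direction` and the toy data are hypothetical, and the check says nothing about form, strength, or outliers, so it complements the plot rather than replacing it.

```python
def direction(xs, ys):
    """Classify direction from the sign of the summed products of
    deviations from the means (a simple linear check only)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if cross > 0:
        return "positive"
    if cross < 0:
        return "negative"
    return "none"

# Hypothetical toy data, not taken from the notes
print(direction([1, 2, 3, 4], [2, 4, 5, 9]))  # prints positive
print(direction([1, 2, 3, 4], [9, 6, 4, 1]))  # prints negative
```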

Types of Relationships (Illustration)

Linear relationship: Points follow a straight line (can be weak or strong).
Non-linear relationship: Points follow a curved pattern (e.g., quadratic, exponential).

Example: A scatterplot of IQ against MRI count shows a positive linear association if larger brain size tends to go with higher IQ.

Roles for Variables

Explanatory vs. Response Variables

When plotting two variables, the explanatory variable (predictor) is usually placed on the x-axis, and the response variable (outcome) on the y-axis. The explanatory variable is thought to influence the response variable.

Sometimes, the distinction is not clear (e.g., English vs. Math grades); either variable can be on the x-axis.

Case Study: Class Skipping and Academic Performance

Data Example

Suppose we measure class skipping by the number of classes missed and academic performance by final grade for 9 students:

# classes skipped	1	2	4	3	5	2	4	3	6
final grade	98	90	83	88	71	85	76	81	71

Explanatory variable: Number of classes skipped
Response variable: Final grade

Scatterplot Interpretation

Does final grade seem to be related to the number of classes skipped?
The correlation coefficient computed from the data above, r ≈ -0.92, indicates a strong negative linear relationship: more classes skipped is associated with lower final grades.

Concepts on Correlation

Definition and Properties

Correlation quantifies the degree of linear association between two quantitative variables.

Positive correlation: Large values of x are associated with large values of y.
Negative correlation: Large values of x are associated with small values of y.
Correlation coefficient (r): Measures the strength and direction of linear association.

Properties of r:

r = 1 for perfect positive linear correlation
r = -1 for perfect negative linear correlation
r near 0 implies a weak or no linear relationship
r is unitless
Swapping x and y does not change r
Adding a constant to all values of either variable, or multiplying them by a positive constant, does not change r (multiplying by a negative constant flips the sign of r)
r is sensitive to outliers
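
Several of these properties can be checked numerically. The sketch below uses a hypothetical helper `pearson_r` and made-up toy data. It computes r from sums of deviations, which is algebraically the same as averaging products of standardized values with n - 1 in the denominator.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient of paired lists xs and ys."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical toy data, not taken from the notes
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

r = pearson_r(x, y)
r_swapped = pearson_r(y, x)                      # swap the roles of x and y
r_shifted = pearson_r([v + 100 for v in x], y)   # add a constant to x
r_scaled = pearson_r([2.5 * v for v in x], y)    # multiply x by a positive constant
r_flipped = pearson_r([-2.5 * v for v in x], y)  # multiply x by a negative constant
r_outlier = pearson_r(x + [20], y + [0])         # append one extreme point

print(r, r_swapped, r_shifted, r_scaled)  # all equal up to rounding error
print(r_flipped)                          # same size as r, opposite sign
print(r_outlier)                          # a single outlier reverses the sign here
```

In this toy example one added point is enough to turn a clearly positive r into a negative one, which is why the scatterplot should be inspected first.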

Association vs. Causality

Association does not imply causation. A strong correlation between two variables does not mean that changes in one cause changes in the other. There may be lurking variables that influence both.

Example: Number of firefighters at a fire scene vs. amount of damage. Larger fires require more firefighters and cause more damage, but sending more firefighters does not cause more damage.
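
The firefighter example can be imitated with a small simulation. Everything in the sketch below is invented for illustration: the sample size, the coefficients, and the noise levels are assumptions, not data from the notes. Fire size drives both variables, and the number of firefighters never enters the damage calculation, yet the two end up strongly correlated.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient of paired lists xs and ys."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is repeatable

# Simulated (hypothetical) fires: size is the lurking variable
fire_size = [random.uniform(1, 10) for _ in range(200)]

# Both variables depend on fire size plus independent noise;
# firefighters is NOT an input to damage
firefighters = [2 * s + random.gauss(0, 1) for s in fire_size]
damage = [5 * s + random.gauss(0, 3) for s in fire_size]

r = pearson_r(firefighters, damage)
print(round(r, 2))
```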

Calculating the Correlation Coefficient (r)

Steps

Standardize the x and y values (subtract the mean, divide by the sample standard deviation).
Multiply the standardized x and y values for each pair.
Sum the products.
Divide the sum by n - 1, where n is the number of pairs.

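Formula:

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

Here \bar{x} and \bar{y} are the sample means, and s_x and s_y are the sample standard deviations of x and y.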

Example: For the class skipping data, r ≈ -0.92, which indicates a strong negative linear relationship.
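
The four steps can be sketched in Python for the class skipping data. This is a minimal illustration that assumes the sample standard deviation (as returned by `statistics.stdev`) and an n - 1 divisor. The variable names are illustrative.

```python
from statistics import mean, stdev

# Class skipping data from the case study
skipped = [1, 2, 4, 3, 5, 2, 4, 3, 6]
grade = [98, 90, 83, 88, 71, 85, 76, 81, 71]
n = len(skipped)

# Step 1: standardize each variable (z-scores)
x_bar, s_x = mean(skipped), stdev(skipped)
y_bar, s_y = mean(grade), stdev(grade)
z_x = [(x - x_bar) / s_x for x in skipped]
z_y = [(y - y_bar) / s_y for y in grade]

# Step 2: multiply the standardized values pair by pair
products = [zx * zy for zx, zy in zip(z_x, z_y)]

# Steps 3 and 4: sum the products and divide by n - 1
r = sum(products) / (n - 1)
print(round(r, 2))  # prints -0.92
```

Most products are negative because students with above-average skipping tend to have below-average grades, which is what pulls r toward -1.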

Additional info: The computation of r involves standardizing each variable and then assessing how much they co-vary. Outliers can have a large effect on r, so always inspect scatterplots before interpreting correlation coefficients.