Exploring Relationships Between Variables: Scatterplots, Correlation, and Linear Regression

Exploring Relationships Between Variables

Scatterplots and the Correlation Coefficient

Scatterplots are a fundamental graphical tool in statistics for visualizing the relationship between two quantitative variables. They help identify patterns, trends, and possible associations between variables.

Definition: A scatterplot is a graph in which each point represents a pair of values for two variables, typically labeled X (predictor) and Y (response).
Example: The relationship between monthly advertising expenditures (in thousands of dollars) and total sales (in thousands of dollars) for a brewery is shown below.

Month	Advertising (thousand dollars)	Sales (thousand dollars)
1	2.2	27.5
2	1.6	26.9
3	2.9	29.8
4	3.2	29.2
5	3.5	30.3
6	3.4	30.0
7	2.3	29.1
8	2.5	28.3
9	1.0	26.5
10	2.5	28.9
11	3.4	30.4
12	0.9	26.8
13	0.9	26.6
14	1.4	26.9
15	2.6	28.2

Interpretation: The scatterplot of these data shows that as advertising expenditures increase, total sales tend to increase, indicating a positive linear relationship.
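
The upward pattern can be checked directly from the table. The following is a minimal sketch that draws a crude text scatterplot using only the Python standard library; the function name `ascii_scatter` and the grid size are illustrative assumptions, not part of the course materials.

```python
# Brewery data from the table above (both columns in thousands of dollars).
advertising = [2.2, 1.6, 2.9, 3.2, 3.5, 3.4, 2.3, 2.5, 1.0, 2.5, 3.4, 0.9, 0.9, 1.4, 2.6]
sales = [27.5, 26.9, 29.8, 29.2, 30.3, 30.0, 29.1, 28.3, 26.5, 28.9, 30.4, 26.8, 26.6, 26.9, 28.2]


def ascii_scatter(xs, ys, width=40, height=12):
    """Return a text scatterplot: X runs left to right, Y runs bottom to top."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each coordinate onto the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # flip rows so larger Y prints higher
    return "\n".join("".join(line) for line in grid)


plot = ascii_scatter(advertising, sales)
print(plot)
```

The points run from the lower left to the upper right of the grid, which is what a positive linear relationship looks like.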

How to Interpret a Scatterplot

Interpretation of a scatterplot focuses on the direction and strength of the relationship between variables.

Positive Linear Relationship: As X increases, Y tends to increase.
Negative Linear Relationship: As X increases, Y tends to decrease.
No Linear Relationship: As X increases, Y tends to neither increase nor decrease.
Note: Always interpret scatterplots in terms of the actual variable names, not just X and Y.
Example: A scatterplot of EPA gas mileage ratings (MPG) versus horsepower for car models shows a negative linear relationship: as horsepower increases, MPG tends to decrease.

The Correlation Coefficient (r)

The correlation coefficient quantifies the strength and direction of a linear relationship between two quantitative variables.

Definition: The correlation coefficient, denoted as $r$, measures the degree to which two variables are linearly related.
Properties:
- The value of $r$ is always between -1 and 1.
- The sign of $r$ matches the trend of the scatterplot: positive for upward, negative for downward, zero for no trend.
- If all points fall perfectly on a straight line with an upward trend, $r = 1$.
- If all points fall perfectly on a straight line with a downward trend, $r = -1$.
- If the relationship is very weak or non-linear, $r$ is near 0.
Visual Examples:
- $r = 1$: Perfect positive linear relationship
- $r = -1$: Perfect negative linear relationship
- $r$ near 0: Very weak or no linear relationship
- $r$ near +1: Strong positive linear relationship
- $r$ near -1: Strong negative linear relationship
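
To make these properties concrete, here is a minimal sketch that computes the correlation coefficient for the brewery data using the usual Pearson formula (sum of cross-products divided by the square root of the product of the two sums of squares). The function name `correlation` is an illustrative choice.

```python
from math import sqrt

# Brewery data from the earlier table (both in thousands of dollars).
advertising = [2.2, 1.6, 2.9, 3.2, 3.5, 3.4, 2.3, 2.5, 1.0, 2.5, 3.4, 0.9, 0.9, 1.4, 2.6]
sales = [27.5, 26.9, 29.8, 29.2, 30.3, 30.0, 29.1, 28.3, 26.5, 28.9, 30.4, 26.8, 26.6, 26.9, 28.2]


def correlation(xs, ys):
    """Pearson correlation coefficient for paired samples."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


r = correlation(advertising, sales)
print(f"r = {r:.3f}")
```

For the brewery data the result is close to +1, consistent with the strong upward trend in the scatterplot.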

Testing for a Correlation Between Two Quantitative Variables

Statistical hypothesis testing can determine whether a significant linear relationship exists between two variables.

Null Hypothesis ($H_0$): X and Y are not correlated.
Alternative Hypothesis ($H_a$): X and Y are correlated.
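Test Statistic: $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, which has $n - 2$ degrees of freedom when the null hypothesis is true.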
Decision Rule: Reject the null hypothesis (and accept the alternative) if the p-value $\le \alpha$ (commonly $\alpha = 0.05$).
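Example: For advertising expenditure and sales, $r \approx 0.947$ (computed from the table above), and the p-value is far below 0.05. Since the p-value is less than $\alpha$, we reject the null hypothesis and conclude a positive linear relationship exists.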
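
The test can be sketched in code. This example assumes the standard t statistic for a correlation, $t = r\sqrt{n-2}/\sqrt{1-r^2}$ with $n - 2$ degrees of freedom, and compares it with a tabulated critical value instead of computing a p-value (the standard library has no t distribution). The constant `T_CRITICAL` and the function name are assumptions made for illustration.

```python
from math import sqrt

# Brewery data from the earlier table (both in thousands of dollars).
advertising = [2.2, 1.6, 2.9, 3.2, 3.5, 3.4, 2.3, 2.5, 1.0, 2.5, 3.4, 0.9, 0.9, 1.4, 2.6]
sales = [27.5, 26.9, 29.8, 29.2, 30.3, 30.0, 29.1, 28.3, 26.5, 28.9, 30.4, 26.8, 26.6, 26.9, 28.2]


def correlation_t_statistic(xs, ys):
    """Return (r, t), where t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    r = s_xy / sqrt(s_xx * s_yy)
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    return r, t


# Assumption: two-sided test at alpha = 0.05 with n - 2 = 13 degrees of freedom.
# 2.160 is the tabulated critical value of the t distribution for that case.
T_CRITICAL = 2.160

r, t = correlation_t_statistic(advertising, sales)
reject_h0 = abs(t) > T_CRITICAL
print(f"r = {r:.3f}, t = {t:.2f}, reject H0: {reject_h0}")
```

Comparing |t| with the critical value leads to the same decision as comparing the p-value with $\alpha$; statistical software normally reports the p-value directly.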

Fitting a Straight Line to Sampled Data: The Least Squares Prediction Equation

General Form of the Least Squares Line

The least squares regression line is the best-fitting straight line through a set of data points, minimizing the sum of squared vertical distances from the points to the line.

Equation: $\hat{Y} = b_0 + b_1 X$
Where:
- $b_0$ = Y-intercept (value of Y when X = 0)
- $b_1$ = slope (change in Y for a one-unit increase in X)
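Example: For the brewery data, the prediction equation (fitted to the table above, coefficients rounded) is: $\hat{Y} = 25.05 + 1.447X$, where X is advertising and $\hat{Y}$ is predicted sales, both in thousands of dollars.
Interpretation: The slope ($b_1 \approx 1.447$) indicates the expected increase in sales, about 1.447 thousand dollars, for each additional thousand dollars spent on advertising. The intercept ($b_0 \approx 25.05$) estimates sales, in thousands of dollars, when advertising is zero.
Prediction: If the brewery spends 2 thousand dollars on advertising ($X = 2$): $\hat{Y} = 25.05 + 1.447(2) \approx 27.9$, or about 27.9 thousand dollars in sales.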
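
The least squares fit itself is short to compute. The sketch below uses the closed-form formulas (slope = sum of cross-products divided by the sum of squared X deviations; intercept = mean of Y minus slope times mean of X) on the brewery data. The function name `least_squares_line` is an illustrative choice.

```python
# Brewery data from the earlier table (both in thousands of dollars).
advertising = [2.2, 1.6, 2.9, 3.2, 3.5, 3.4, 2.3, 2.5, 1.0, 2.5, 3.4, 0.9, 0.9, 1.4, 2.6]
sales = [27.5, 26.9, 29.8, 29.2, 30.3, 30.0, 29.1, 28.3, 26.5, 28.9, 30.4, 26.8, 26.6, 26.9, 28.2]


def least_squares_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared vertical distances."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


b0, b1 = least_squares_line(advertising, sales)

# Predicted sales when advertising is 2 (thousand dollars).
predicted_sales = b0 + b1 * 2.0
print(f"intercept = {b0:.2f}, slope = {b1:.3f}, prediction at X = 2: {predicted_sales:.2f}")
```

Because X = 2 lies inside the observed advertising range (0.9 to 3.5), this prediction is an interpolation; the intercept, by contrast, describes X = 0, which is outside the data.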

Describing the Strength of the Linear Relationship: Coefficient of Determination ($R^2$)

The coefficient of determination ($R^2$) quantifies the proportion of variance in the response variable explained by the predictor variable.

Definition: $R^2$ is the fraction of the variability in Y explained by a linear relationship with X.
Properties:
- $R^2$ is always between 0 and 1.
- $1 - R^2$ is the fraction of variability in Y not explained by X (unexplained or due to error).
Example: If $R^2 = 0.8977$, then 89.77% of the variation in monthly sales is explained by advertising expenditures.
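
As a check on the example, the coefficient of determination can be computed as one minus the ratio of unexplained variation (squared residuals around the fitted line) to total variation (squared deviations of Y around its mean). This is a minimal sketch on the brewery data; the function name `r_squared` is an illustrative choice.

```python
# Brewery data from the earlier table (both in thousands of dollars).
advertising = [2.2, 1.6, 2.9, 3.2, 3.5, 3.4, 2.3, 2.5, 1.0, 2.5, 3.4, 0.9, 0.9, 1.4, 2.6]
sales = [27.5, 26.9, 29.8, 29.2, 30.3, 30.0, 29.1, 28.3, 26.5, 28.9, 30.4, 26.8, 26.6, 26.9, 28.2]


def r_squared(xs, ys):
    """Fraction of the variability in ys explained by a least squares line in xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # Unexplained variation: squared residuals around the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation: squared deviations around the mean of Y.
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst


explained = r_squared(advertising, sales)
unexplained = 1 - explained
print(f"explained = {explained:.4f}, unexplained = {unexplained:.4f}")
```

The explained fraction agrees with the 89.77% quoted in the example, leaving roughly 10% of the variation in sales unexplained by advertising.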

Prediction and Confidence Intervals

Predicted values from the regression line can be interpreted in two ways:

Estimated Mean Sales: The average sales for all months with a given advertising expenditure.
Predicted Sales for a Single Month: The expected sales for a specific month with a given advertising expenditure.
Confidence Interval: Used to estimate the mean value of Y for a particular value of X.
Prediction Interval: Used to predict a single value of Y for a particular value of X.

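Additional info: Both intervals quantify the uncertainty in a regression prediction. At the same value of X, the prediction interval is always wider than the confidence interval, because a single month's sales vary around the mean in addition to the uncertainty in estimating that mean.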
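
The difference between the two intervals can be seen numerically. The sketch below uses the standard simple-regression formulas, with half-width $t^* s \sqrt{1/n + (x_0 - \bar{x})^2 / S_{xx}}$ for the confidence interval and an extra 1 under the square root for the prediction interval. The 95% level, the tabulated critical value 2.160 for 13 degrees of freedom, and the function name are assumptions made for illustration.

```python
from math import sqrt

# Brewery data from the earlier table (both in thousands of dollars).
advertising = [2.2, 1.6, 2.9, 3.2, 3.5, 3.4, 2.3, 2.5, 1.0, 2.5, 3.4, 0.9, 0.9, 1.4, 2.6]
sales = [27.5, 26.9, 29.8, 29.2, 30.3, 30.0, 29.1, 28.3, 26.5, 28.9, 30.4, 26.8, 26.6, 26.9, 28.2]


def interval_half_widths(xs, ys, x0, t_crit):
    """Return (fitted value, CI half-width, PI half-width) at x0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # Residual standard error with n - 2 degrees of freedom.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = sqrt(sse / (n - 2))
    y_hat = intercept + slope * x0
    mean_term = 1 / n + (x0 - x_bar) ** 2 / s_xx
    ci_half = t_crit * s * sqrt(mean_term)      # uncertainty in the mean of Y
    pi_half = t_crit * s * sqrt(1 + mean_term)  # adds single-observation scatter
    return y_hat, ci_half, pi_half


# Assumption: 95% intervals, so t_crit is the tabulated value for 13 degrees of freedom.
y_hat, ci_half, pi_half = interval_half_widths(advertising, sales, x0=2.0, t_crit=2.160)
print(f"mean sales CI:   {y_hat - ci_half:.2f} to {y_hat + ci_half:.2f}")
print(f"single month PI: {y_hat - pi_half:.2f} to {y_hat + pi_half:.2f}")
```

Both intervals are centered on the same fitted value; only the width differs, with the prediction interval for a single month several times wider than the confidence interval for mean sales.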