Exploring Relationships Between Variables: Scatterplots, Correlation, and Linear Regression
Exploring Relationships Between Variables
Scatterplots and the Correlation Coefficient
Scatterplots are a fundamental graphical tool in statistics for visualizing the relationship between two quantitative variables. They help identify patterns, trends, and possible associations between variables.
Definition: A scatterplot is a graph in which each point represents a pair of values for two variables, typically labeled X (predictor) and Y (response).
Example: The relationship between monthly advertising expenditures (in thousands of dollars) and total sales (in thousands of dollars) for a brewery is shown below.
| Month | Advertising (thousands of dollars) | Sales (thousands of dollars) |
|---|---|---|
| 1 | 2.2 | 27.5 |
| 2 | 1.6 | 26.9 |
| 3 | 2.9 | 29.8 |
| 4 | 3.2 | 29.2 |
| 5 | 3.5 | 30.3 |
| 6 | 3.4 | 30.0 |
| 7 | 2.3 | 29.1 |
| 8 | 2.5 | 28.3 |
| 9 | 1.0 | 26.5 |
| 10 | 2.5 | 28.9 |
| 11 | 3.4 | 30.4 |
| 12 | 0.9 | 26.8 |
| 13 | 0.9 | 26.6 |
| 14 | 1.4 | 26.9 |
| 15 | 2.6 | 28.2 |
Interpretation: The scatterplot of these data shows that as advertising expenditures increase, total sales tend to increase, indicating a positive linear relationship.
How to Interpret a Scatterplot
Interpretation of a scatterplot focuses on the direction and strength of the relationship between variables.
Positive Linear Relationship: As X increases, Y tends to increase.
Negative Linear Relationship: As X increases, Y tends to decrease.
No Linear Relationship: As X increases, Y tends to neither increase nor decrease.
Note: Always interpret scatterplots in terms of the actual variable names, not just X and Y.
Example: A scatterplot of EPA gas mileage ratings (MPG) versus horsepower for car models shows a negative linear relationship: as horsepower increases, MPG tends to decrease.
The Correlation Coefficient (r)
The correlation coefficient quantifies the strength and direction of a linear relationship between two quantitative variables.
Definition: The correlation coefficient, denoted as $r$, measures the degree to which two variables are linearly related.
Properties:
The value of $r$ is always between -1 and 1.
The sign of $r$ matches the trend of the scatterplot: positive for an upward trend, negative for a downward trend, and near zero for no trend.
If all points fall perfectly on a straight line with an upward trend, $r = 1$.
If all points fall perfectly on a straight line with a downward trend, $r = -1$.
If there is little or no linear relationship, $r$ is near 0. A strong non-linear relationship (for example, a U-shaped pattern) can also give $r$ near 0.
Visual Examples:
$r = 1$: Perfect positive linear relationship
$r = -1$: Perfect negative linear relationship
$r$ near 0: Very weak or no linear relationship
$r$ near +1: Strong positive linear relationship
$r$ near -1: Strong negative linear relationship
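As a minimal sketch of how $r$ is computed, the block below applies the usual Pearson formula, $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$, to the brewery data from the table above. The function name `correlation` and the variable names are illustrative choices, not part of the original notes.

```python
import math

# Brewery data from the table above (both in thousands of dollars).
advertising = [2.2, 1.6, 2.9, 3.2, 3.5, 3.4, 2.3, 2.5,
               1.0, 2.5, 3.4, 0.9, 0.9, 1.4, 2.6]
sales = [27.5, 26.9, 29.8, 29.2, 30.3, 30.0, 29.1, 28.3,
         26.5, 28.9, 30.4, 26.8, 26.6, 26.9, 28.2]


def correlation(x, y):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products about the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(advertising, sales)
print(f"r = {r:.3f}")  # prints r = 0.947
```

A value this close to +1 agrees with the strong upward trend described for the advertising and sales scatterplot.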
Testing for a Correlation Between Two Quantitative Variables
Statistical hypothesis testing can determine whether a significant linear relationship exists between two variables.
Null Hypothesis ($H_0$): X and Y are not correlated.
Alternative Hypothesis ($H_a$): X and Y are correlated.
Test Statistic: $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, which has $n - 2$ degrees of freedom when $H_0$ is true.
Decision Rule: Reject $H_0$ in favor of $H_a$ if the p-value is less than the significance level $\alpha$ (commonly $\alpha = 0.05$).
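Example: For advertising expenditure and sales, $r \approx 0.947$ (computed from the table above) and the p-value is below 0.0001. Since the p-value is less than $\alpha = 0.05$, we reject $H_0$ and conclude that a positive linear relationship exists.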
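The test can be sketched with the standard t form of the statistic, $t = r\sqrt{n-2}/\sqrt{1-r^2}$, which has $n - 2$ degrees of freedom. The block below is an illustration under stated assumptions: `r` is the rounded correlation computed from the brewery table, and `t_crit` is the two-sided 5% critical value for 13 degrees of freedom read from a standard t table. Comparing $|t|$ with the critical value gives the same decision as comparing the p-value with $\alpha$.

```python
import math

n = 15        # number of months in the brewery table
r = 0.9475    # sample correlation computed from that table (rounded)

# Standard t statistic for H0: X and Y are not correlated.
df = n - 2
t_stat = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

# Assumption: alpha = 0.05, two-sided, 13 degrees of freedom (t table value).
t_crit = 2.160

decision = "reject H0" if abs(t_stat) > t_crit else "fail to reject H0"
print(f"t = {t_stat:.2f} with {df} df: {decision}")
```

Statistical software reports the p-value directly. With only the standard library, the critical-value comparison is a reasonable stand-in.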
Fitting a Straight Line to Sampled Data: The Least Squares Prediction Equation
General Form of the Least Squares Line
The least squares regression line is the best-fitting straight line through a set of data points, minimizing the sum of squared vertical distances from the points to the line.
Equation: $\hat{y} = b_0 + b_1 x$
Where:
$b_0$ = Y-intercept (value of Y when X = 0)
$b_1$ = slope (change in Y for a one-unit increase in X)
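Example: For the brewery data, the prediction equation (computed from the table above, rounded) is: $\hat{y} \approx 25.05 + 1.447x$
Interpretation: The slope ($b_1 \approx 1.447$) indicates that sales are expected to increase by about 1.447 thousand dollars for each additional thousand dollars spent on advertising. The intercept ($b_0 \approx 25.05$) estimates sales of about 25.05 thousand dollars when advertising is zero. This is an extrapolation, since no month in the data had zero advertising.
Prediction: If the brewery spends 2,000 dollars on advertising ($x = 2$, because advertising is measured in thousands of dollars): $\hat{y} \approx 25.05 + 1.447(2) \approx 27.9$, or about 27,900 dollars in sales.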
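The least squares coefficients can be reproduced from the table with the textbook formulas $b_1 = S_{xy}/S_{xx}$ and $b_0 = \bar{y} - b_1\bar{x}$. The following is a minimal sketch. The helper name `least_squares` is hypothetical, chosen only for this illustration.

```python
# Brewery data from the table above (both in thousands of dollars).
advertising = [2.2, 1.6, 2.9, 3.2, 3.5, 3.4, 2.3, 2.5,
               1.0, 2.5, 3.4, 0.9, 0.9, 1.4, 2.6]
sales = [27.5, 26.9, 29.8, 29.2, 30.3, 30.0, 29.1, 28.3,
         26.5, 28.9, 30.4, 26.8, 26.6, 26.9, 28.2]


def least_squares(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept: the line passes through the means
    return b0, b1


b0, b1 = least_squares(advertising, sales)
print(f"y-hat = {b0:.2f} + {b1:.3f} x")

# Advertising is recorded in thousands, so 2,000 dollars means x = 2.
predicted = b0 + b1 * 2.0
print(f"predicted sales at x = 2: {predicted:.1f} thousand dollars")
```

Run on the brewery data, this gives an intercept of about 25.05, a slope of about 1.447, and predicted sales of about 27.9 thousand dollars for 2 thousand dollars of advertising.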
Describing the Strength of the Linear Relationship: Coefficient of Determination ($r^2$)
The coefficient of determination ($r^2$) quantifies the proportion of variance in the response variable explained by the predictor variable.
Definition: $r^2$ is the fraction of the variability in Y explained by a linear relationship with X.
Properties:
$r^2$ is always between 0 and 1.
$1 - r^2$ is the fraction of variability in Y not explained by X (unexplained or due to error).
Example: If $r^2 = 0.8977$, then 89.77% of the variation in monthly sales is explained by advertising expenditures.
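To make the "fraction of variability explained" concrete, the sketch below computes the coefficient of determination for the brewery data as one minus the ratio of the error sum of squares (SSE) to the total sum of squares (SST). The variable names are illustrative. The 89.77% figure in the example above is what this calculation should reproduce.

```python
# Brewery data from the table above (both in thousands of dollars).
advertising = [2.2, 1.6, 2.9, 3.2, 3.5, 3.4, 2.3, 2.5,
               1.0, 2.5, 3.4, 0.9, 0.9, 1.4, 2.6]
sales = [27.5, 26.9, 29.8, 29.2, 30.3, 30.0, 29.1, 28.3,
         26.5, 28.9, 30.4, 26.8, 26.6, 26.9, 28.2]

n = len(sales)
x_bar = sum(advertising) / n
y_bar = sum(sales) / n

# Least squares slope and intercept.
s_xx = sum((x - x_bar) ** 2 for x in advertising)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(advertising, sales))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# SST: total variation in sales. SSE: variation left over after the fit.
sst = sum((y - y_bar) ** 2 for y in sales)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(advertising, sales))

r_squared = 1 - sse / sst
print(f"r^2 = {r_squared:.4f}")  # prints r^2 = 0.8977
```

The leftover share, `sse / sst`, is the roughly 10% of the variation in sales that advertising does not explain.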
Prediction and Confidence Intervals
Predicted values from the regression line can be interpreted in two ways:
Estimated Mean Sales: The average sales for all months with a given advertising expenditure.
Predicted Sales for a Single Month: The expected sales for a specific month with a given advertising expenditure.
Confidence Interval: Used to estimate the mean value of Y for a particular value of X.
Prediction Interval: Used to predict a single value of Y for a particular value of X.
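Additional info: For the same value of X and the same confidence level, the prediction interval is always wider than the confidence interval, because a single month's sales vary around the mean in addition to the uncertainty in estimating that mean.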
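The two intervals can be compared numerically with the standard simple-regression formulas: the confidence interval for mean Y at $x_0$ is $\hat{y} \pm t^* s\sqrt{1/n + (x_0 - \bar{x})^2/S_{xx}}$, and the prediction interval adds 1 under the square root. The sketch below applies them to the brewery data at $x_0 = 2$. The 95% level and the table value `t_crit = 2.160` (13 degrees of freedom) are assumptions made for this illustration, and the variable names are hypothetical.

```python
import math

# Brewery data from the table above (both in thousands of dollars).
advertising = [2.2, 1.6, 2.9, 3.2, 3.5, 3.4, 2.3, 2.5,
               1.0, 2.5, 3.4, 0.9, 0.9, 1.4, 2.6]
sales = [27.5, 26.9, 29.8, 29.2, 30.3, 30.0, 29.1, 28.3,
         26.5, 28.9, 30.4, 26.8, 26.6, 26.9, 28.2]

n = len(advertising)
x_bar = sum(advertising) / n
y_bar = sum(sales) / n
s_xx = sum((x - x_bar) ** 2 for x in advertising)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(advertising, sales))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residual standard error with n - 2 degrees of freedom.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(advertising, sales))
s = math.sqrt(sse / (n - 2))

x0 = 2.0        # advertising of 2 thousand dollars
t_crit = 2.160  # assumed: 95% confidence, 13 df, from a t table
y_hat = b0 + b1 * x0

# Shared term: uncertainty in the estimated line at x0.
leverage = 1 / n + (x0 - x_bar) ** 2 / s_xx
ci_half = t_crit * s * math.sqrt(leverage)      # mean sales at x0
pi_half = t_crit * s * math.sqrt(1 + leverage)  # one month's sales at x0

print(f"95% CI for mean sales:   {y_hat - ci_half:.2f} to {y_hat + ci_half:.2f}")
print(f"95% PI for single month: {y_hat - pi_half:.2f} to {y_hat + pi_half:.2f}")
```

Both intervals are centered on the same predicted value of about 27.9 thousand dollars. The confidence interval is roughly plus or minus 0.28, while the prediction interval is roughly plus or minus 1.06.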