Exploring Relationships Between Variables: Scatterplots, Correlation, and Linear Regression

Exploring Relationships Between Variables

Scatterplots and the Correlation Coefficient

Scatterplots are a fundamental graphical tool in statistics for visualizing the relationship between two quantitative variables. They help identify patterns, trends, and possible associations between variables.

  • Definition: A scatterplot is a graph in which each point represents a pair of values for two variables, typically labeled X (predictor) and Y (response).

  • Example: The relationship between monthly advertising expenditures (in thousands of dollars) and total sales (in thousands of dollars) for a brewery is shown below.

    Month   Advertising (thousands of $)   Sales (thousands of $)
    1       2.2                            27.5
    2       1.6                            26.9
    3       2.9                            29.8
    4       3.2                            29.2
    5       3.5                            30.3
    6       3.4                            30.0
    7       2.3                            29.1
    8       2.5                            28.3
    9       1.0                            26.5
    10      2.5                            28.9
    11      3.4                            30.4
    12      0.9                            26.8
    13      0.9                            26.6
    14      1.4                            26.9
    15      2.6                            28.2

  • Interpretation: The scatterplot of these data shows that as advertising expenditures increase, total sales tend to increase, indicating a positive linear relationship.

How to Interpret a Scatterplot

Interpretation of a scatterplot focuses on the direction and strength of the relationship between variables.

  • Positive Linear Relationship: As X increases, Y tends to increase.

  • Negative Linear Relationship: As X increases, Y tends to decrease.

  • No Linear Relationship: As X increases, Y tends to neither increase nor decrease.

  • Note: Always interpret scatterplots in terms of the actual variable names, not just X and Y.

  • Example: A scatterplot of EPA gas mileage ratings (MPG) versus horsepower for car models shows a negative linear relationship: as horsepower increases, MPG tends to decrease.

The Correlation Coefficient (r)

The correlation coefficient quantifies the strength and direction of a linear relationship between two quantitative variables.

  • Definition: The correlation coefficient, denoted \(r\), measures the degree to which two variables are linearly related.

  • Properties:

    • The value of \(r\) is always between -1 and 1.

    • The sign of \(r\) matches the trend of the scatterplot: positive for upward, negative for downward, and near zero for no trend.

    • If all points fall perfectly on a straight line with an upward trend, \(r = 1\).

    • If all points fall perfectly on a straight line with a downward trend, \(r = -1\).

    • If the linear relationship is very weak, \(r\) is near 0. A strongly curved (non-linear) pattern can also give \(r\) near 0, so check the scatterplot before concluding the variables are unrelated.

  • Visual Examples:

    • \(r = +1\): Perfect positive linear relationship

    • \(r = -1\): Perfect negative linear relationship

    • \(r\) near 0: Very weak or no linear relationship

    • \(r\) near +1: Strong positive linear relationship

    • \(r\) near -1: Strong negative linear relationship
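As a rough illustration of these properties, the sketch below computes \(r\) for the brewery data using the standard formula \(r = S_{xy} / \sqrt{S_{xx} S_{yy}}\). The function name `pearson_r` and the variable names are illustrative choices, not taken from the original notes.

```python
from math import sqrt

# Brewery data from the table above (both in thousands of dollars).
advertising = [2.2, 1.6, 2.9, 3.2, 3.5, 3.4, 2.3, 2.5, 1.0, 2.5, 3.4, 0.9, 0.9, 1.4, 2.6]
sales = [27.5, 26.9, 29.8, 29.2, 30.3, 30.0, 29.1, 28.3, 26.5, 28.9, 30.4, 26.8, 26.6, 26.9, 28.2]


def pearson_r(x, y):
    """Sample correlation coefficient: S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r(advertising, sales)
print(f"r = {r:.4f}")
```

A value close to +1 here is consistent with the strong upward trend described for the advertising and sales scatterplot.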

Testing for a Correlation Between Two Quantitative Variables

Statistical hypothesis testing can determine whether a significant linear relationship exists between two variables.

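  • Null Hypothesis (\(H_0\)): X and Y are not correlated.

  • Alternative Hypothesis (\(H_a\)): X and Y are correlated.

  • Test Statistic: The sample correlation \(r\), commonly converted to \(t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}\) with \(n - 2\) degrees of freedom.

  • Decision Rule: Reject \(H_0\) (accept \(H_a\)) if the p-value \(< \alpha\) (commonly \(\alpha = 0.05\)).

  • Example: For advertising expenditure and sales, \(r \approx 0.947\) (the positive square root of \(r^2 = 0.8977\)), which gives \(t \approx 10.7\) with 13 degrees of freedom and a p-value below 0.001. Since p-value \(< \alpha = 0.05\), we reject \(H_0\) and conclude a positive linear relationship exists.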
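The test can be sketched numerically as follows. This is a minimal illustration, assuming the usual \(t\) conversion of \(r\) and a critical value of 2.160 (two-sided, \(\alpha = 0.05\), 13 degrees of freedom) read from a standard t table; neither the critical value nor the function name comes from the original notes. Comparing \(|t|\) with the critical value leads to the same decision as comparing the p-value with \(\alpha\).

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """t statistic for H0: no correlation, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


r = 0.9475   # positive square root of r^2 = 0.8977 from the brewery example
n = 15       # number of months in the brewery data

# Assumption: two-sided critical value for alpha = 0.05 and 13 degrees of
# freedom, taken from a standard t table.
T_CRITICAL = 2.160

t_stat = correlation_t_statistic(r, n)
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```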

Fitting a Straight Line to Sampled Data: The Least Squares Prediction Equation

General Form of the Least Squares Line

The least squares regression line is the best-fitting straight line through a set of data points, minimizing the sum of squared vertical distances from the points to the line.

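  • Equation: \(\hat{Y} = b_0 + b_1 X\)

  • Where:

    • \(b_0\) = Y-intercept (predicted value of Y when X = 0)

    • \(b_1\) = slope (change in predicted Y for a one-unit increase in X)

  • Example: For the brewery data, the least squares fit to the 15 months listed above is approximately \(\hat{Y} = 25.05 + 1.447X\), with X and Y both in thousands of dollars.

  • Interpretation: The slope (\(b_1 \approx 1.447\)) indicates that sales are expected to increase by about $1,447 for each additional thousand dollars spent on advertising. The intercept (\(b_0 \approx 25.05\)) estimates sales of about $25,050 when advertising is zero; this is an extrapolation, since no month in the data has advertising below $900.

  • Prediction: If the brewery spends $2,000 on advertising (X = 2): \(\hat{Y} \approx 25.05 + 1.447(2) \approx 27.94\), or predicted sales of roughly $27,900.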
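The fit can be reproduced with the textbook least squares formulas \(b_1 = S_{xy}/S_{xx}\) and \(b_0 = \bar{Y} - b_1\bar{X}\). The sketch below is one way to do it in plain Python; the function and variable names are illustrative and not part of the original notes.

```python
# Brewery data from the table above (both in thousands of dollars).
advertising = [2.2, 1.6, 2.9, 3.2, 3.5, 3.4, 2.3, 2.5, 1.0, 2.5, 3.4, 0.9, 0.9, 1.4, 2.6]
sales = [27.5, 26.9, 29.8, 29.2, 30.3, 30.0, 29.1, 28.3, 26.5, 28.9, 30.4, 26.8, 26.6, 26.9, 28.2]


def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    slope = s_xy / s_xx
    # The least squares line always passes through (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return intercept, slope


b0, b1 = least_squares_line(advertising, sales)

# Predicted sales when advertising is $2,000, i.e. X = 2 (thousands).
predicted_sales = b0 + b1 * 2.0

print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
print(f"predicted sales at X = 2: {predicted_sales:.2f} thousand dollars")
```

Note that X must be entered in the same units as the data (thousands of dollars), so $2,000 corresponds to X = 2, not X = 2000.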

Describing the Strength of the Linear Relationship: Coefficient of Determination (r²)

The coefficient of determination (\(r^2\)) quantifies the proportion of variance in the response variable explained by the predictor variable.

  • Definition: \(r^2\) is the fraction of the variability in Y explained by a linear relationship with X. In simple linear regression it equals the square of the correlation coefficient \(r\).

  • Properties:

    • \(r^2\) is always between 0 and 1.

    • \(1 - r^2\) is the fraction of variability in Y not explained by X (unexplained or due to error).

  • Example: If \(r^2 = 0.8977\), then 89.77% of the variation in monthly sales is explained by advertising expenditures.
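One common way to compute this quantity is \(r^2 = 1 - \mathrm{SSE}/\mathrm{SST}\), where SSE is the sum of squared residuals about the fitted line and SST is the total sum of squares of Y about its mean. The sketch below applies that definition to the brewery data; the names `sse` and `sst` are illustrative and not taken from the original notes.

```python
# Brewery data from the table above (both in thousands of dollars).
advertising = [2.2, 1.6, 2.9, 3.2, 3.5, 3.4, 2.3, 2.5, 1.0, 2.5, 3.4, 0.9, 0.9, 1.4, 2.6]
sales = [27.5, 26.9, 29.8, 29.2, 30.3, 30.0, 29.1, 28.3, 26.5, 28.9, 30.4, 26.8, 26.6, 26.9, 28.2]


def coefficient_of_determination(x, y):
    """r^2 = 1 - SSE / SST for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # SSE: variation left unexplained by the line.
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    # SST: total variation in y about its mean.
    sst = sum((b - y_bar) ** 2 for b in y)
    return 1 - sse / sst


r_squared = coefficient_of_determination(advertising, sales)
print(f"r^2 = {r_squared:.4f}  (unexplained fraction: {1 - r_squared:.4f})")
```

The result should land close to the 0.8977 quoted in the example, with the remaining fraction attributed to error.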

Prediction and Confidence Intervals

Predicted values from the regression line can be interpreted in two ways:

  • Estimated Mean Sales: The average sales for all months with a given advertising expenditure.

  • Predicted Sales for a Single Month: The expected sales for a specific month with a given advertising expenditure.

  • Confidence Interval: Used to estimate the mean value of Y for a particular value of X.

  • Prediction Interval: Used to predict a single value of Y for a particular value of X.

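Additional info: Both intervals quantify the uncertainty in a regression prediction. For the same value of X and the same confidence level, the prediction interval is always wider than the confidence interval, because a single month varies around the mean in addition to the uncertainty in the estimated line.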
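The difference between the two intervals can be sketched with the standard simple-regression formulas. With \(s = \sqrt{\mathrm{SSE}/(n-2)}\), the confidence interval for the mean at \(X = x_0\) is \(\hat{Y} \pm t \, s \sqrt{1/n + (x_0 - \bar{X})^2 / S_{xx}}\), and the prediction interval adds 1 under the square root. The example below assumes a 95% level and a critical value of 2.160 (13 degrees of freedom) from a standard t table; that value and all names in the code are assumptions for illustration, not part of the original notes.

```python
from math import sqrt

# Brewery data from the table above (both in thousands of dollars).
advertising = [2.2, 1.6, 2.9, 3.2, 3.5, 3.4, 2.3, 2.5, 1.0, 2.5, 3.4, 0.9, 0.9, 1.4, 2.6]
sales = [27.5, 26.9, 29.8, 29.2, 30.3, 30.0, 29.1, 28.3, 26.5, 28.9, 30.4, 26.8, 26.6, 26.9, 28.2]

# Assumption: two-sided 95% critical value of t with n - 2 = 13 degrees of
# freedom, taken from a standard t table.
T_CRIT_95_DF13 = 2.160


def interval_half_widths(x, y, x0, t_crit):
    """Return (y_hat, ci_half_width, pi_half_width) at X = x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # Residual standard error with n - 2 degrees of freedom.
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    s = sqrt(sse / (n - 2))
    # Uncertainty in the estimated mean grows as x0 moves away from x_bar.
    mean_term = 1 / n + (x0 - x_bar) ** 2 / s_xx
    y_hat = intercept + slope * x0
    ci_half = t_crit * s * sqrt(mean_term)      # mean of Y at x0
    pi_half = t_crit * s * sqrt(1 + mean_term)  # a single new Y at x0
    return y_hat, ci_half, pi_half


y_hat, ci_half, pi_half = interval_half_widths(advertising, sales, 2.0, T_CRIT_95_DF13)
print(f"estimate at X = 2: {y_hat:.2f}")
print(f"95% confidence interval: {y_hat - ci_half:.2f} to {y_hat + ci_half:.2f}")
print(f"95% prediction interval: {y_hat - pi_half:.2f} to {y_hat + pi_half:.2f}")
```

Both intervals are centered on the same predicted value; only the width differs, reflecting whether the target is the average over all such months or one particular month.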
