Correlation and Simple Linear Regression: Study Notes
Study Guide - Smart Notes
Tailored notes based on your materials, expanded with key definitions, examples, and context.
14 Correlation and Simple Linear Regression
14.1 Scatterplots and the Correlation Coefficient
Understanding the relationship between two quantitative variables is fundamental in statistics. Scatterplots and the correlation coefficient are essential tools for visualizing and quantifying the strength and direction of linear associations.
Scatterplot: A graph of paired quantitative data where each point represents one observation, used to visualize the relationship between two variables.
Linear Association: The tendency of data points to cluster around a straight line, indicating a linear relationship between variables.
To quantify the strength and direction of a linear relationship, we use the correlation coefficient (r).
Formula for the Correlation Coefficient
The correlation coefficient is computed as:

r = Sxy / √(Sxx · Syy)

where:

Sxx = Σ(x − x̄)²
Syy = Σ(y − ȳ)²
Sxy = Σ(x − x̄)(y − ȳ)

These quantities measure the variability of x and y and how they vary together.
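As a quick illustration, the formula above can be computed directly in Python. The dataset below is made up for demonstration and is not from the notes:

```python
import math

# Small made-up dataset (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squared deviations and cross-products
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Correlation coefficient: r = Sxy / sqrt(Sxx * Syy)
r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 4))
```

Since these made-up points lie almost exactly on a line with positive slope, r comes out very close to 1.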
Properties of the Correlation Coefficient
Range: −1 ≤ r ≤ 1
Interpretation:
r = 1: Perfect positive linear relationship
r = −1: Perfect negative linear relationship
r = 0: No linear relationship
Sensitivity: Pearson’s r is sensitive to outliers and assumes both variables are approximately normal.
Example: Volume of Lumber
In a study, the volume of timber from black cherry trees is plotted against the diameter of the tree. The scatterplot shows a strong positive association, with most points above the mean lines for both variables. The correlation coefficient is high, indicating a strong linear relationship.
| x | y | (x−x̄)² | (y−ȳ)² | (x−x̄)(y−ȳ) |
|---|---|---|---|---|
| 8.3 | 10.3 | 24.448634 | 613.586473 | 18.382064 |
| 10.8 | 10.3 | 21.087036 | 398.615593 | 18.357092 |
| 11.4 | 10.3 | 15.781478 | 398.615593 | 13.411667 |
| ... | ... | ... | ... | ... |

Additional info: The table shows how deviations from the mean are used to compute Sxx, Syy, and Sxy.
Interpretation of r
If r is positive, as one variable increases, so does the other (positive association).
If r is negative, as one variable increases, the other decreases (negative association).
The closer |r| is to 1, the stronger the linear relationship.
Example: Age and Systolic Blood Pressure
A scatterplot of age versus systolic blood pressure for adults shows a strong positive association, indicating that older participants tend to have higher blood pressure.
Population Correlation Coefficient
The sample correlation coefficient (r) estimates the population correlation coefficient (ρ). To test H₀: ρ = 0 (no linear relationship in the population), use:

t = r√(n − 2) / √(1 − r²)

where n is the sample size. The test statistic follows a Student’s t-distribution with n − 2 degrees of freedom.
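The test statistic is simple to compute by hand. A minimal sketch, using a hypothetical sample correlation and sample size (not values from the notes):

```python
import math

# Hypothetical sample values (illustrative only)
r = 0.85   # sample correlation coefficient
n = 30     # sample size

# Test statistic for H0: rho = 0
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
df = n - 2  # degrees of freedom

print(round(t, 2), df)
```

The resulting t value would then be compared against the t-distribution with n − 2 degrees of freedom (e.g. via a table or statistical software) to obtain a p-value.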
Using JMP for Scatterplots and Correlation
Load data into JMP.
Use Graph Builder for scatterplots and Fit Line for regression lines.
Use Analyze → Multivariate Methods → Multivariate for correlation matrices.
14.2 Least Squares Regression Line
Simple linear regression models the relationship between two variables by fitting a straight line that best predicts the response variable from the explanatory variable.
Regression Model
The simple linear regression model is:

y = β₀ + β₁x + ε

where:

y: Response variable
x: Predictor (explanatory) variable
β₀: Intercept (value of y when x = 0)
β₁: Slope (change in y for a one-unit increase in x)
ε: Random error term
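One way to build intuition for the model is to simulate data from it. In this sketch the "true" parameter values (β₀ = 5, β₁ = 2, σ = 1) are arbitrary choices for illustration, not values from the notes:

```python
import random

random.seed(42)  # reproducible results

# Hypothetical true parameters (illustrative only)
beta0, beta1, sigma = 5.0, 2.0, 1.0

# Generate (x, y) pairs from y = beta0 + beta1*x + eps,
# where eps is a normal random error with mean 0 and sd sigma
data = [(x, beta0 + beta1 * x + random.gauss(0, sigma))
        for x in range(20)]
```

Fitting a least squares line to such simulated data should recover estimates close to the chosen β₀ and β₁.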
Fitting the Model: Least Squares Method
The best-fitting line ŷ = b₀ + b₁x minimizes the sum of squared residuals (vertical distances between observed and predicted values):

SSE = Σ(yᵢ − ŷᵢ)²

The least squares estimates for the intercept and slope are:

b₁ = Sxy / Sxx
b₀ = ȳ − b₁x̄
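These estimates can be computed in a few lines. A minimal sketch with made-up data (not from the notes):

```python
# Made-up data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.8, 5.1, 6.9, 9.2, 10.9]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sums of squared deviations and cross-products
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares slope and intercept
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

print(round(b1, 3), round(b0, 3))
```

Note that the line always passes through the point of means (x̄, ȳ), which is exactly what the intercept formula b₀ = ȳ − b₁x̄ enforces.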
Example: Advertising and Sales
Suppose a regression line for advertising spend (x, in thousands of dollars) and weekly sales (ŷ, in thousands of dollars) is:

ŷ = 12.47 + 1.74x
This means each additional thousand dollars spent on advertising is associated with an increase of about 1.74 thousand dollars in weekly sales. The intercept (12.47) is the predicted sales when advertising spend is zero.
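A fitted line like this one is used for prediction by plugging in a value of x. A minimal sketch using the example's coefficients:

```python
# Fitted line from the example: predicted weekly sales (in $1000s)
# as a function of advertising spend (in $1000s)
def predict_sales(spend):
    return 12.47 + 1.74 * spend

# Predicted sales for a $10,000 advertising spend
print(round(predict_sales(10.0), 2))
```

As with any regression line, predictions are only trustworthy within the range of advertising spends actually observed in the data (extrapolation is risky).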
Key Terms Recap
| Keyword/Concept | Explanation |
|---|---|
| simple linear regression | A model relating a quantitative response to a single quantitative predictor via a straight line. |
| least squares | A method that determines the intercept and slope by minimizing the sum of squared residuals. |
| slope (b₁) | The change in the predicted response for a one-unit increase in the explanatory variable. |
| intercept (b₀) | The predicted response when the explanatory variable equals zero. |
| residual | The difference between an observed value and its predicted value: e = y − ŷ. |
Example: Interpreting Regression Output
If ŷ = 2.5 + 0.8x (exam score predicted from hours studied), the slope 0.8 means each additional hour of study increases the predicted exam score by 0.8 points. The intercept 2.5 is the predicted score for a student who studies zero hours (may not always be meaningful).
Squaring residuals ensures that both positive and negative errors contribute to the measure of fit, preventing cancellation and emphasizing larger discrepancies.
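This property can be checked numerically: for a least squares fit, the raw residuals sum to (essentially) zero, while the squared residuals give a non-trivial measure of fit. A small sketch with made-up data:

```python
# Made-up data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0]
y = [1.2, 1.9, 3.2, 3.7]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares fit: b1 = Sxy / Sxx, b0 = y_bar - b1 * x_bar
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Residuals: observed minus predicted
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)

print(round(sum(residuals), 10))  # raw residuals cancel out to ~0
print(round(sse, 4))              # squared residuals do not cancel
```

The raw residual sum being zero is exactly why it cannot serve as a measure of fit, and why SSE is used instead.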
Additional info: These notes include expanded definitions, formulas, and examples for clarity and exam preparation.