Correlation and Simple Linear Regression: Study Notes
Study Guide - Smart Notes
Tailored notes based on your materials, expanded with key definitions, examples, and context.
14 Correlation and Simple Linear Regression
14.1 Scatterplots and the Correlation Coefficient
Understanding the relationship between two quantitative variables is fundamental in statistics. Scatterplots and the correlation coefficient are essential tools for visualizing and quantifying the strength and direction of linear associations.
Scatterplot: A graph of paired quantitative data where each point represents one observation, used to visualize the relationship between two variables.
Linear Association: The tendency of data points to cluster around a straight line, indicating a linear relationship between variables.
To quantify the strength and direction of a linear relationship, we use the correlation coefficient (r).
Formula for the Correlation Coefficient
The correlation coefficient is computed as:

r = Sxy / √(Sxx · Syy)

where:

Sxx = Σ(x − x̄)²
Syy = Σ(y − ȳ)²
Sxy = Σ(x − x̄)(y − ȳ)

These quantities measure the variability of x and y and how they vary together.
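As a quick illustration, the formula above can be computed directly in Python. The dataset below is made up for demonstration and is not from the notes:

```python
import math

# Small made-up dataset (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squared deviations and cross-products
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Correlation coefficient: r = Sxy / sqrt(Sxx * Syy)
r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 4))
```

Since these made-up points lie almost exactly on a line with positive slope, r comes out very close to 1.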
Properties of the Correlation Coefficient
Range: −1 ≤ r ≤ 1
Interpretation:
r = 1: Perfect positive linear relationship
r = −1: Perfect negative linear relationship
r = 0: No linear relationship
Sensitivity: Pearson’s r is sensitive to outliers and assumes both variables are approximately normal.
Example: Volume of Lumber
In a study, the volume of timber from black cherry trees is plotted against the diameter of the tree. The scatterplot shows a strong positive association, with most points above the mean lines for both variables. The correlation coefficient is high, indicating a strong linear relationship.
| x | y | (x−x̄)² | (y−ȳ)² | (x−x̄)(y−ȳ) |
|---|---|---|---|---|
| 8.3 | 10.3 | 24.448634 | 613.586473 | 18.382064 |
| 10.8 | 10.3 | 21.087036 | 398.615593 | 18.357092 |
| 11.4 | 10.3 | 15.781478 | 398.615593 | 13.411667 |
| ... | ... | ... | ... | ... |

Additional info: The table shows how deviations from the mean are used to compute Sxx, Syy, and Sxy.
Interpretation of r
If r is positive, as one variable increases, so does the other (positive association).
If r is negative, as one variable increases, the other decreases (negative association).
The closer |r| is to 1, the stronger the linear relationship.
Example: Age and Systolic Blood Pressure
A scatterplot of age versus systolic blood pressure for adults shows a strong positive association, indicating that older participants tend to have higher blood pressure.
Population Correlation Coefficient
The sample correlation coefficient (r) estimates the population correlation coefficient (ρ). To test H₀: ρ = 0 (no linear relationship in the population), use:

t = r√(n − 2) / √(1 − r²)

where n is the sample size. The test statistic follows a Student’s t-distribution with n − 2 degrees of freedom.
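The test statistic is simple to compute by hand. A minimal sketch, using a hypothetical sample correlation and sample size (not values from the notes):

```python
import math

# Hypothetical sample values (illustrative only)
r = 0.85   # sample correlation coefficient
n = 30     # sample size

# Test statistic for H0: rho = 0
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
df = n - 2  # degrees of freedom

print(round(t, 2), df)
```

The resulting t value would then be compared against the t-distribution with n − 2 degrees of freedom (e.g. via a table or statistical software) to obtain a p-value.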
Using JMP for Scatterplots and Correlation
Load data into JMP.
Use Graph Builder for scatterplots and Fit Line for regression lines.
Use Analyze → Multivariate Methods → Multivariate for correlation matrices.
14.2 Least Squares Regression Line
Simple linear regression models the relationship between two variables by fitting a straight line that best predicts the response variable from the explanatory variable.
Regression Model
The simple linear regression model is:

y = β₀ + β₁x + ε

where:

y: Response variable
x: Predictor (explanatory) variable
β₀: Intercept (value of y when x = 0)
β₁: Slope (change in y for a one-unit increase in x)
ε: Random error term
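One way to build intuition for the model is to simulate data from it. In this sketch the "true" parameter values (β₀ = 5, β₁ = 2, σ = 1) are arbitrary choices for illustration, not values from the notes:

```python
import random

random.seed(42)  # reproducible results

# Hypothetical true parameters (illustrative only)
beta0, beta1, sigma = 5.0, 2.0, 1.0

# Generate (x, y) pairs from y = beta0 + beta1*x + eps,
# where eps is a normal random error with mean 0 and sd sigma
data = [(x, beta0 + beta1 * x + random.gauss(0, sigma))
        for x in range(20)]
```

Fitting a least squares line to such simulated data should recover estimates close to the chosen β₀ and β₁.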
Fitting the Model: Least Squares Method
The best-fitting line ŷ = b₀ + b₁x minimizes the sum of squared residuals (vertical distances between observed and predicted values):

SSE = Σ(yᵢ − ŷᵢ)²

The least squares estimates for the intercept and slope are:

b₁ = Sxy / Sxx
b₀ = ȳ − b₁x̄
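These estimates can be computed in a few lines. A minimal sketch with made-up data (not from the notes):

```python
# Made-up data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.8, 5.1, 6.9, 9.2, 10.9]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sums of squared deviations and cross-products
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares slope and intercept
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

print(round(b1, 3), round(b0, 3))
```

Note that the line always passes through the point of means (x̄, ȳ), which is exactly what the intercept formula b₀ = ȳ − b₁x̄ enforces.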
Example: Advertising and Sales
Suppose a regression line for advertising spend (x, in thousands of dollars) and weekly sales (ŷ, in thousands of dollars) is:

ŷ = 12.47 + 1.74x
This means each additional thousand dollars spent on advertising is associated with an increase of about 1.74 thousand dollars in weekly sales. The intercept (12.47) is the predicted sales when advertising spend is zero.
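A fitted line like this one is used for prediction by plugging in a value of x. A minimal sketch using the example's coefficients:

```python
# Fitted line from the example: predicted weekly sales (in $1000s)
# as a function of advertising spend (in $1000s)
def predict_sales(spend):
    return 12.47 + 1.74 * spend

# Predicted sales for a $10,000 advertising spend
print(round(predict_sales(10.0), 2))
```

As with any regression line, predictions are only trustworthy within the range of advertising spends actually observed in the data (extrapolation is risky).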
Key Terms Recap
| Keyword/Concept | Explanation |
|---|---|
| simple linear regression | A model relating a quantitative response to a single quantitative predictor via a straight line. |
| least squares | A method that determines the intercept and slope by minimizing the sum of squared residuals. |
| slope (b₁) | The change in the predicted response for a one-unit increase in the explanatory variable. |
| intercept (b₀) | The predicted response when the explanatory variable equals zero. |
| residual | The difference between an observed value and its predicted value: e = y − ŷ. |
Example: Interpreting Regression Output
If ŷ = 2.5 + 0.8x (exam score predicted from hours studied), the slope 0.8 means each additional hour of study increases the predicted exam score by 0.8 points. The intercept 2.5 is the predicted score for a student who studies zero hours (may not always be meaningful).
Squaring residuals ensures that both positive and negative errors contribute to the measure of fit, preventing cancellation and emphasizing larger discrepancies.
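This property can be checked numerically: for a least squares fit, the raw residuals sum to (essentially) zero, while the squared residuals give a non-trivial measure of fit. A small sketch with made-up data:

```python
# Made-up data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0]
y = [1.2, 1.9, 3.2, 3.7]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares fit: b1 = Sxy / Sxx, b0 = y_bar - b1 * x_bar
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Residuals: observed minus predicted
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)

print(round(sum(residuals), 10))  # raw residuals cancel out to ~0
print(round(sse, 4))              # squared residuals do not cancel
```

The raw residual sum being zero is exactly why it cannot serve as a measure of fit, and why SSE is used instead.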
Additional info: These notes include expanded definitions, formulas, and examples for clarity and exam preparation.