
Correlation and Simple Linear Regression: Chapter 14 Study Guide


Correlation and Simple Linear Regression

14.1 Dependent and Independent Variables

In regression and correlation analysis, variables are classified as either independent or dependent. The independent variable (x) is used to explain or predict the dependent variable (y). The direction of causality is assumed to be from x to y.

  • Independent Variable (x): The variable that is manipulated or used as a predictor.

  • Dependent Variable (y): The variable that is measured or predicted.

  • Examples: The size of a television screen (x) explains the selling price (y); the number of visitors per day (x) explains the amount of sales per day (y).

Independent Variable                           | Dependent Variable
----------------------------------------------|----------------------------------------------
The size of a television screen               | The selling price of the television
The number of visitors per day on a Web site  | The amount of sales per day from the Web site
The curb weight of a car                      | The car's gas mileage

Examples of independent and dependent variables

14.2 Correlation Analysis

Correlation analysis measures the strength and direction of a linear relationship between two variables. A relationship is linear if the scatter plot of the variables forms a straight-line pattern.

  • Strength: How closely the data points fit a straight line.

  • Direction: Positive (both variables increase together) or negative (one increases as the other decreases).

  • Scatter Plot: Used to visually assess linearity.

Scatter plot of TV ads vs. cars sold

Correlation Coefficient (r): Indicates both strength and direction. Values range from -1.0 (perfect negative) to +1.0 (perfect positive). r = 0 means no linear relationship.

  • Perfect Positive: r = +1.0

  • Perfect Negative: r = -1.0

  • Moderate Positive: r between 0 and +1.0

  • Moderate Negative: r between -1.0 and 0

  • No Correlation: r = 0

Perfect positive and negative correlation
Examples of moderate and no correlation

Formula for the Correlation Coefficient:

r = (nΣxy − (Σx)(Σy)) / √[(nΣx² − (Σx)²)(nΣy² − (Σy)²)]

Example: Using Excel's CORREL function to calculate r for TV ads and cars sold.

Excel CORREL function example
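The same calculation can be sketched in Python. This is a minimal illustration of the r formula above, using a small made-up ads/cars-sold dataset rather than the chapter's actual data:

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r, built from the sums in the formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))

# Hypothetical data: TV ads run (x) and cars sold (y)
ads = [1, 2, 3, 4]
cars = [2, 3, 5, 6]
r = correlation(ads, cars)   # close to +1: a strong positive linear relationship
```

Excel's CORREL function performs this same computation on two input ranges.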

Hypothesis Testing for the Correlation Coefficient

To determine if the population correlation coefficient (ρ) is significantly different from zero, a hypothesis test is performed:

  • Null Hypothesis (H0): ρ ≤ 0 (no positive relationship)

  • Alternative Hypothesis (H1): ρ > 0 (positive linear relationship)

Test Statistic:

t = r √(n − 2) / √(1 − r²)

Compare t to the critical value from the t-distribution with n − 2 degrees of freedom.
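A quick sketch of this test statistic in Python, using a hypothetical r from n = 4 pairs (these numbers are illustrative, not from the chapter):

```python
from math import sqrt

def t_stat(r, n):
    """t statistic for testing the correlation coefficient; df = n - 2."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

# Hypothetical observed correlation r = 0.98995 from n = 4 pairs
t = t_stat(0.9899494936611665, 4)
# reject H0 if t exceeds the critical t value for n - 2 degrees of freedom
```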

14.3 Simple Linear Regression Analysis

Simple linear regression fits a straight line to a set of ordered pairs (x, y) to model the relationship between the independent and dependent variables. The regression equation is:

ŷ = b0 + b1x

  • ŷ: Predicted value of y

  • b0: y-intercept

  • b1: Slope

Regression line with slope and intercept

Residuals (ei): The difference between the actual and predicted values for each observation.

Least Squares Method: Minimizes the sum of squared errors (SSE) to find the best fit line.

Formulas for Slope and Intercept:

b1 = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²)

b0 = ȳ − b1x̄

Example: Regression equation for TV ads and cars sold.

Regression equation example
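The slope and intercept formulas can be sketched directly in Python. The dataset here is made up for illustration (it is not the chapter's TV-ads example):

```python
def fit_line(xs, ys):
    """Least squares slope b1 and intercept b0 for y-hat = b0 + b1 * x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b0 = sy / n - b1 * sx / n        # b0 = y-bar - b1 * x-bar
    return b0, b1

# Hypothetical data: TV ads (x) and cars sold (y)
b0, b1 = fit_line([1, 2, 3, 4], [2, 3, 5, 6])
# regression equation: y-hat = b0 + b1 * x
```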

Calculating Regression in Excel

Excel's Data Analysis tool can be used to perform regression analysis. Input the dependent and independent variables, select regression, and review the output for coefficients, standard error, and significance.

Excel regression dialog box
Excel regression input ranges
Excel regression output summary

Partitioning the Sum of Squares

The total variation in the dependent variable is partitioned into explained and unexplained components:

  • Total Sum of Squares (SST): SST = Σ(y − ȳ)²

  • Sum of Squares Regression (SSR): SSR = Σ(ŷ − ȳ)²

  • Sum of Squares Error (SSE): SSE = Σ(y − ŷ)²

  • Relationship: SST = SSR + SSE

Partitioning sum of squares
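The partition can be verified numerically. A minimal sketch, using a made-up dataset and its fitted line (b0 = 0.5, b1 = 1.4 are illustrative values, not from the chapter):

```python
def partition_ss(xs, ys, b0, b1):
    """Split total variation in y into explained (SSR) and unexplained (SSE) parts."""
    ybar = sum(ys) / len(ys)
    preds = [b0 + b1 * x for x in xs]
    sst = sum((y - ybar) ** 2 for y in ys)              # total variation
    ssr = sum((p - ybar) ** 2 for p in preds)           # explained by the line
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))  # residual variation
    return sst, ssr, sse

sst, ssr, sse = partition_ss([1, 2, 3, 4], [2, 3, 5, 6], 0.5, 1.4)
# sst equals ssr + sse (up to floating point rounding)
```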

Coefficient of Determination (R2)

R2 measures the proportion of variation in y explained by x: R2 = SSR / SST. It ranges from 0 to 1 (or 0% to 100%).

Higher values indicate a better fit.
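In code, R2 is just the ratio of explained to total variation. The sums of squares below are hypothetical values chosen for illustration:

```python
def r_squared(ssr, sst):
    """Coefficient of determination: share of variation in y explained by x."""
    return ssr / sst

r2 = r_squared(9.8, 10.0)   # hypothetical SSR and SST -> about 98% explained
```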

Hypothesis Testing for R2

To test if the population coefficient of determination (ρ²) is significantly greater than zero:

  • Null Hypothesis (H0): ρ² ≤ 0

  • Alternative Hypothesis (H1): ρ² > 0

F-test Statistic:

F = SSR / (SSE / (n − 2)), with 1 and n − 2 degrees of freedom

Excel regression output with F-test
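A minimal sketch of the F statistic, using hypothetical sums of squares (SSR = 9.8, SSE = 0.2, n = 4; not the chapter's data):

```python
def f_stat(ssr, sse, n):
    """F statistic for simple regression: MSR / MSE with df = (1, n - 2)."""
    msr = ssr / 1          # one independent variable, so df for SSR is 1
    mse = sse / (n - 2)
    return msr / mse

F = f_stat(9.8, 0.2, 4)
# reject H0 if F exceeds the critical F value with (1, n - 2) degrees of freedom
```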

14.4 Using Regression to Make Predictions

Regression equations can be used to predict y for a given x. A point estimate is the predicted value, and confidence intervals can be constructed around this estimate.

  • Standard Error of Estimate (se): Measures the dispersion of observed data around the regression line: se = √(SSE / (n − 2))

Standard error of estimate in Excel

Confidence Interval for Average y:

CI = ŷ ± t(α/2) · se · √( 1/n + (x − x̄)² / (Σx² − (Σx)²/n) )

Prediction Interval for Specific y:

PI = ŷ ± t(α/2) · se · √( 1 + 1/n + (x − x̄)² / (Σx² − (Σx)²/n) )

Confidence interval for regression prediction
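The two interval half-widths can be sketched as follows. The x data, se = 0.316, and the critical value 4.303 (the two-tailed 95% t value for 2 degrees of freedom) are illustrative assumptions, not the chapter's numbers:

```python
from math import sqrt

def interval_halfwidths(xs, x_new, se, t_crit):
    """Half-widths of the confidence interval (average y) and
    prediction interval (specific y) at x_new."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    leverage = 1 / n + (x_new - xbar) ** 2 / sxx
    ci = t_crit * se * sqrt(leverage)        # interval for the average y
    pi = t_crit * se * sqrt(1 + leverage)    # interval for a single new y
    return ci, pi

ci, pi = interval_halfwidths([1, 2, 3, 4], 2.5, 0.316, 4.303)
# the prediction interval is always wider than the confidence interval
```

Both intervals are narrowest at x = x̄ and widen as x moves away from the mean.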

14.5 Testing the Significance of the Slope

To determine if the slope (β) is significantly different from zero, a t-test is performed:

  • Null Hypothesis (H0): β = 0

  • Alternative Hypothesis (H1): β ≠ 0

Test Statistic:

t = b1 / sb, with n − 2 degrees of freedom

Standard Error of Slope (sb): Measures the variation in the estimate of the slope: sb = se / √(Σ(x − x̄)²)

Small standard error of slope
Large standard error of slope
Excel output for slope significance
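The slope test can be sketched end to end in Python, again on a made-up dataset with an illustrative fitted line (b0 = 0.5, b1 = 1.4):

```python
from math import sqrt

def slope_test(xs, ys, b0, b1):
    """t statistic for H0: beta = 0, computed as t = b1 / s_b."""
    n = len(xs)
    preds = [b0 + b1 * x for x in xs]
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))
    se = sqrt(sse / (n - 2))                  # standard error of estimate
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sb = se / sqrt(sxx)                       # standard error of the slope
    return b1 / sb

t = slope_test([1, 2, 3, 4], [2, 3, 5, 6], 0.5, 1.4)
# compare |t| to the two-tailed critical value with n - 2 degrees of freedom
```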

14.6 Assumptions for Regression Analysis

Regression analysis relies on several key assumptions:

  • Linearity: The relationship between x and y is linear.

  • Independence: Residuals are independent and exhibit no patterns.

  • Homoscedasticity: Constant variance of residuals across values of x.

  • Normality: Residuals follow a normal distribution.

Scatter plots and residual plots are used to check these assumptions.

14.7 Simple Regression Example with Negative Correlation

Regression can also model negative relationships. For example, as driving speed increases, gas mileage decreases. The slope of the regression equation is negative.

Negative correlation between speed and MPG

Example: For each 1 MPH increase in speed, gas mileage decreases by 0.26 MPG.
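A negative slope means each unit increase in x lowers the prediction. A tiny sketch using the chapter's slope of −0.26 MPG per MPH with a hypothetical intercept (46.6 is an assumed value for illustration):

```python
def predict_mpg(speed, b0=46.6, b1=-0.26):
    """Hypothetical negative-slope regression: mpg = b0 + b1 * speed."""
    return b0 + b1 * speed

# each additional 1 MPH of speed lowers predicted mileage by 0.26 MPG
drop = predict_mpg(50) - predict_mpg(51)
```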

14.8 Final Thoughts and Pitfalls

Key cautions in regression analysis:

  • Do not extrapolate: Avoid predicting y for x values outside the observed range.

  • Correlation does not imply causation: Statistical significance does not prove causality.

Negative correlation between price and demand

Example: The law of supply and demand demonstrates negative correlation: as price increases, demand decreases.

Summary: Correlation and regression are powerful tools for analyzing relationships between variables, but must be used with care and proper understanding of their assumptions and limitations.
