Scatter Diagrams, Correlation, and Linear Regression in Statistics

Study Guide - Smart Notes


Scatter Diagrams and Correlation

Introduction to Bivariate Data

Bivariate data involves measurements of two variables for each individual in a study. Analyzing bivariate data allows us to explore relationships between variables, often using graphical and numerical methods.

Response (Dependent) Variable: The variable whose value is explained by the explanatory variable.
Explanatory (Independent) Variable: The variable that explains or influences changes in the response variable.

Scatter Diagrams

A scatter diagram (or scatter plot) is a graph that displays the relationship between two quantitative variables. Each point represents an individual, with the explanatory variable on the horizontal axis and the response variable on the vertical axis.

Purpose: To visually assess the type and strength of the relationship between two variables.
Types of Relationships: Linear, nonlinear, or no relation.

Example: Predicting the selling price of a home using data from Zillow, where the Zestimate is the explanatory variable and Sale Price is the response variable.

Sample Table: Drilling Data

Depth at Which Drilling Begins (in feet), x	Time to Drill Five Feet (in minutes), y
54	5.98
75	6.41
93	5.90
110	6.74
130	6.27
145	7.47
155	6.82
165	7.42
178	7.89
190	7.90

Additional info: This table is used to illustrate how scatter diagrams can reveal relationships between depth and drilling time.
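A scatter diagram of the drilling data can be roughed out without any plotting library. The sketch below is one possible way to do it, not a standard tool: the helper name `ascii_scatter` and the grid size are illustrative assumptions, and each point is rounded to the nearest character cell, so the picture is only approximate.

```python
def ascii_scatter(xs, ys, width=40, height=12):
    """Return a list of text rows plotting (x, y) points as '*' characters."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column/row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        # Row 0 of the grid is the top, so flip the vertical index.
        grid[height - 1 - row][col] = "*"
    return ["".join(cells) for cells in grid]


# Drilling data from the table: depth (feet) is explanatory, time (minutes) is response.
depth = [54, 75, 93, 110, 130, 145, 155, 165, 178, 190]
minutes = [5.98, 6.41, 5.90, 6.74, 6.27, 7.47, 6.82, 7.42, 7.89, 7.90]

rows = ascii_scatter(depth, minutes)
for line in rows:
    print("|" + line)          # vertical axis: time to drill five feet
print("+" + "-" * 40)          # horizontal axis: depth at which drilling begins
```

The points drift upward from left to right, which suggests a positive association between depth and drilling time.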

Interpreting Scatter Diagrams

Positive Association: Higher values of one variable are associated with higher values of the other.
Negative Association: Higher values of one variable are associated with lower values of the other.
No Association: No apparent relationship between the variables.

Linear Correlation Coefficient

Definition and Properties

The linear correlation coefficient (Pearson product moment correlation coefficient), denoted by $r$ for a sample and $\rho$ for a population, measures the strength and direction of the linear relationship between two quantitative variables.

Range: $-1 \le r \le 1$
Interpretation:
- $r = 1$: Perfect positive linear relationship
- $r = -1$: Perfect negative linear relationship
- $r = 0$: No linear relationship
- The closer $|r|$ is to 1, the stronger the linear association
Not Resistant: Outliers can greatly affect $r$.
Only Measures Linear Association: $r$ does not detect nonlinear relationships.
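The last two properties are easy to see numerically. The following is a small sketch on made-up data (the numbers are illustrative assumptions, not from the drilling example): one outlier pulls a perfect correlation down sharply, and a perfect parabola produces a correlation of zero. The helper `pearson_r` is a hypothetical name for the usual sample formula.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample linear correlation: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


# Illustrative data lying exactly on the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r_line = pearson_r(xs, ys)

# Not resistant: adding the single outlier (6, 0) changes r drastically.
r_outlier = pearson_r(xs + [6], ys + [0])

# Only linear: points on the parabola y = x^2 are perfectly related, yet r is 0.
px = [-2, -1, 0, 1, 2]
py = [x * x for x in px]
r_parabola = pearson_r(px, py)

print(f"line: {r_line:.3f}, with outlier: {r_outlier:.3f}, parabola: {r_parabola:.3f}")
```

Here the correlation drops from 1 to about 0.143 after one outlier is added, which is why a scatter diagram should accompany any reported value of r.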

Formula for Sample Linear Correlation Coefficient

The formula for the sample linear correlation coefficient is:

$$r = \frac{\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}$$

$x_i$: $i$th observation of the explanatory variable
$y_i$: $i$th observation of the response variable
$\bar{x}$: Mean of the explanatory variable
$\bar{y}$: Mean of the response variable
$s_x$: Standard deviation of the explanatory variable
$s_y$: Standard deviation of the response variable
$n$: Number of individuals in the sample
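The sample formula (standardize each observation, multiply the paired z-scores, add the products, and divide by n - 1) translates almost line for line into code. The sketch below applies it to the drilling data from the earlier table; the function name `sample_correlation` is an illustrative choice, and `statistics.stdev` is used because it computes the sample standard deviation (divisor n - 1).

```python
from statistics import mean, stdev

# Drilling data: depth in feet (explanatory), time in minutes (response).
depth = [54, 75, 93, 110, 130, 145, 155, 165, 178, 190]
minutes = [5.98, 6.41, 5.90, 6.74, 6.27, 7.47, 6.82, 7.42, 7.89, 7.90]


def sample_correlation(xs, ys):
    """Sample linear correlation coefficient via z-scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)            # sample standard deviations
    z_x = [(x - x_bar) / s_x for x in xs]      # standardized explanatory values
    z_y = [(y - y_bar) / s_y for y in ys]      # standardized response values
    return sum(a * b for a, b in zip(z_x, z_y)) / (n - 1)


r = sample_correlation(depth, minutes)
print(f"r = {r:.4f}")  # prints "r = 0.8766"
```

A value near 0.88 indicates a fairly strong positive linear association between depth and drilling time.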

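Example: Computing $r$ by Hand

For the drilling data, $\bar{x} = 129.5$, $\bar{y} = 6.88$, $s_x \approx 45.3952$, and $s_y \approx 0.7518$.

Depth, x	Time, y	$x_i - \bar{x}$	$y_i - \bar{y}$	$(x_i - \bar{x})/s_x$	$(y_i - \bar{y})/s_y$	Product
54	5.98	-75.5	-0.90	-1.6632	-1.1971	1.9910
75	6.41	-54.5	-0.47	-1.2006	-0.6252	0.7506
93	5.90	-36.5	-0.98	-0.8041	-1.3035	1.0481
110	6.74	-19.5	-0.14	-0.4296	-0.1862	0.0800
130	6.27	0.5	-0.61	0.0110	-0.8114	-0.0089
145	7.47	15.5	0.59	0.3414	0.7848	0.2680
155	6.82	25.5	-0.06	0.5617	-0.0798	-0.0448
165	7.42	35.5	0.54	0.7820	0.7183	0.5617
178	7.89	48.5	1.01	1.0684	1.3434	1.4353
190	7.90	60.5	1.02	1.3327	1.3567	1.8082

Additional info: The table above shows the step-by-step calculation of $r$ for the drilling data. Each z-score column divides the deviation from the mean by the corresponding sample standard deviation, and the Product column multiplies the two z-scores.

Final calculation: $r = \frac{\sum \text{Product}}{n - 1} \approx \frac{7.8891}{9} \approx 0.877$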

Using Technology to Compute

Statistical software such as StatCrunch or online applets can be used to quickly compute the linear correlation coefficient for large datasets.
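General-purpose languages can play the same role. As one example, Python 3.10 and later include `statistics.correlation` in the standard library; the sketch below uses it when available and otherwise falls back to the equivalent computational form $S_{xy} / \sqrt{S_{xx} S_{yy}}$. The variable names are illustrative, and the data are the ten drilling observations.

```python
import math
import statistics

depth = [54, 75, 93, 110, 130, 145, 155, 165, 178, 190]
minutes = [5.98, 6.41, 5.90, 6.74, 6.27, 7.47, 6.82, 7.42, 7.89, 7.90]

if hasattr(statistics, "correlation"):
    # Available in Python 3.10+: Pearson's correlation coefficient.
    r_tech = statistics.correlation(depth, minutes)
else:
    # Fallback for older interpreters: sums of squares and cross-products.
    x_bar = statistics.mean(depth)
    y_bar = statistics.mean(minutes)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(depth, minutes))
    s_xx = sum((x - x_bar) ** 2 for x in depth)
    s_yy = sum((y - y_bar) ** 2 for y in minutes)
    r_tech = s_xy / math.sqrt(s_xx * s_yy)

print(round(r_tech, 3))
```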

Testing for a Linear Relation

Steps to Test for Linearity

Determine the absolute value of the correlation coefficient, $|r|$.
Find the critical value for the sample size $n$ in a table of critical values for the correlation coefficient.
If $|r|$ is greater than the critical value, a linear relation exists; otherwise, the data do not support a linear relation.

Example: Testing whether a linear relation exists between drilling depth and time to drill five feet.
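The three steps can be sketched as a table lookup. The critical values below are an assumption: they are the commonly tabulated values for a two-tailed test at the 0.05 level for n = 3 to 12, and the table in your own course materials should take precedence if it differs. The function name `linear_relation_exists` is hypothetical, and 0.877 is the approximate sample correlation of the ten drilling observations.

```python
# Assumed critical values for |r| by sample size n (0.05 level, two-tailed).
CRITICAL_VALUES = {
    3: 0.997, 4: 0.950, 5: 0.878, 6: 0.811, 7: 0.754,
    8: 0.707, 9: 0.666, 10: 0.632, 11: 0.602, 12: 0.576,
}


def linear_relation_exists(r, n, table=CRITICAL_VALUES):
    """Return True when |r| exceeds the critical value for sample size n."""
    if n not in table:
        raise KeyError(f"no critical value tabulated for n = {n}")
    return abs(r) > table[n]


r_drilling = 0.877   # approximate r for the drilling data (n = 10)
print(linear_relation_exists(r_drilling, 10))
```

Since 0.877 is greater than 0.632, the check reports that a linear relation exists between drilling depth and time to drill five feet.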

Correlation vs. Causation

Key Differences

Correlation measures the strength and direction of a linear relationship between two variables, but does not imply that changes in one variable cause changes in the other.

Causation: Implies that one variable directly affects another.
Lurking Variable: A variable not included in the analysis that may influence both variables being studied.

Example: Ice cream sales and drowning rates may be correlated due to a lurking variable (temperature), not because one causes the other.

Least-Squares Regression Line

Objectives

Find the least-squares regression line and use it for predictions.
Interpret the slope and y-intercept.
Compute the sum of squared residuals.

Finding the Regression Line

Given data points $(x_i, y_i)$, the least-squares regression line is the line that minimizes the sum of squared residuals, where each residual is the observed value minus the predicted value, $y_i - \hat{y}_i$.

Equation of the Regression Line: $\hat{y} = b_1 x + b_0$

$b_1 = r \cdot \frac{s_y}{s_x}$: Slope of the line
$b_0 = \bar{y} - b_1 \bar{x}$: y-intercept
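As an illustration, the slope and intercept can be computed for the drilling data from the standard relationships: slope equals r times s_y / s_x, and the intercept equals the mean of y minus the slope times the mean of x. This is a minimal sketch; `least_squares_line` is an illustrative name, and the depth of 100 feet used for the prediction is an arbitrary value chosen inside the range of the observed depths.

```python
from statistics import mean, stdev

depth = [54, 75, 93, 110, 130, 145, 155, 165, 178, 190]
minutes = [5.98, 6.41, 5.90, 6.74, 6.27, 7.47, 6.82, 7.42, 7.89, 7.90]


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Sample correlation coefficient.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar
    return slope, intercept


b1, b0 = least_squares_line(depth, minutes)
predicted = b1 * 100 + b0   # predicted minutes to drill five feet starting at 100 ft
print(f"y-hat = {b1:.4f} x + {b0:.4f}; prediction at 100 ft: {predicted:.2f} min")
```

The slope of about 0.0145 means that each additional foot of starting depth is associated with roughly 0.0145 more minutes to drill five feet. The intercept of about 5.0 corresponds to a depth of 0 feet, which lies outside the observed depths, so it should be interpreted with caution.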

Example: Using sample data to find the regression line and make predictions.

x	y
0	5.3
2	5.7
3	5.2
5	2.8
6	1.9

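As a first approximation, choose points (2, 5.7) and (6, 1.9) and find the equation of the line through them: the slope is $(1.9 - 5.7)/(6 - 2) = -0.95$, so $y = -0.95x + 7.6$. This line passes through two of the data points, but it is generally not the least-squares regression line, which is the line with the smallest sum of squared residuals.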

Prediction: Use the regression equation to predict $\hat{y}$ for a given value of $x$.
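To see why the least-squares line is preferred over a line drawn through two chosen points, the sums of squared residuals can be compared directly on the five-point sample data. The sketch below is an illustration with hypothetical helper names; it computes the slope as $S_{xy} / S_{xx}$, which is algebraically the same as $r \cdot s_y / s_x$.

```python
xs = [0, 2, 3, 5, 6]
ys = [5.3, 5.7, 5.2, 2.8, 1.9]


def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of (observed - predicted)^2 for the line y = slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def fit_least_squares(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


# Line through the chosen points (2, 5.7) and (6, 1.9).
two_point = (-0.95, 7.6)
ls = fit_least_squares(xs, ys)

ssr_two_point = sum_squared_residuals(xs, ys, *two_point)
ssr_ls = sum_squared_residuals(xs, ys, *ls)

print(f"two-point line: SSR = {ssr_two_point:.3f}")
print(f"least-squares line y-hat = {ls[0]:.4f} x + {ls[1]:.4f}: SSR = {ssr_ls:.3f}")
```

For these data the two-point line has a sum of squared residuals of about 5.495, while the least-squares line (slope about -0.635, intercept about 6.212) brings it down to about 2.512.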

Summary Table: Properties of the Linear Correlation Coefficient

Property	Description
Range	$-1 \le r \le 1$
Strength	The closer $|r|$ is to 1, the stronger the linear association
Direction	Positive $r$ indicates positive association; negative $r$ indicates negative association
Interpretation	$r = 0$ means no linear association
Resistant?	Not resistant to outliers
Type of Relation	Measures only linear relationships

Cautions and Limitations

A correlation coefficient close to 0 does not imply no relationship, only no linear relationship.
Always examine scatter diagrams to detect nonlinear associations.
Do not infer causation from correlation without further investigation.

Practice and Application

Use scatter diagrams and correlation coefficients to analyze real-world data (e.g., SAT scores vs. teacher salaries).
Consider lurking variables when interpreting results.
Apply regression analysis to make predictions and interpret relationships.