(Lecture 7) Association Between Two Quantitative Variables: Correlation and Regression

Study Guide - Smart Notes

Section 3.2: The Association Between Two Quantitative Variables

Introduction to Association

In statistics, understanding the relationship between two quantitative variables is essential for data analysis and prediction. This section explores how to describe, visualize, and quantify associations between variables such as Internet and Facebook usage rates across countries.

Example: Internet and Facebook Penetration Rates

Consider the following data, drawn from a set of 32 countries, showing the percentage of the population using the Internet and Facebook:

Country	Internet Penetration	Facebook Penetration
Brazil	49.9%	29.5%
Canada	86.8%	51.9%
China	42.3%	0.1%
France	83.0%	39.0%
India	12.6%	5.6%
United States	81.0%	52.9%
Thailand	26.5%	26.5%
United Kingdom	87.0%	52.1%
Sweden	94.0%	52.0%
Philippines	36.2%	30.9%
Saudi Arabia	54.0%	20.7%
Spain	72.0%	38.1%
Turkey	45.1%	43.4%
Russia	53.3%	5.6%
Netherlands	93.0%	45.1%
Peru	38.2%	31.2%
Poland	65.0%	25.6%
South Africa	41.0%	12.3%
Japan	65.0%	13.5%
Malaysia	65.0%	41.5%
Mexico	38.4%	38.4%
Colombia	49.0%	30.3%
Egypt	44.1%	15.1%
Germany	72.8%	56.4%
Hong Kong	72.8%	56.4%
Indonesia	15.4%	20.7%
Italy	58.0%	38.1%
Venezuela	44.1%	32.6%

Additional info: Table entries inferred and grouped for clarity.

Measures of Center and Spread

To summarize the data, we use measures of center (mean, median) and spread (standard deviation, quartiles, minimum, maximum):

Variable	N	Mean	StDev	Minimum	Q1	Median	Q3	Maximum
Internet Use	32	59.2	22.4	12.6	43.6	56.9	81.3	94.0
Facebook Use	32	33.9	16.0	0.0	24.4	34.5	47.1	56.4
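
The summary measures can be computed with Python's standard library. This is a minimal sketch that uses only the Internet values for the first eight countries listed in the table above, so its results are not expected to match the N = 32 summary. The helper name `five_number_summary` is hypothetical.

```python
import statistics

# Internet penetration (%) for the first eight countries listed in the table above.
internet = [49.9, 86.8, 42.3, 83.0, 12.6, 81.0, 26.5, 87.0]

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

mean = statistics.mean(internet)
stdev = statistics.stdev(internet)  # sample standard deviation (divides by n - 1)
summary = five_number_summary(internet)

print(f"n = {len(internet)}, mean = {mean:.1f}, stdev = {stdev:.1f}")
print("min, Q1, median, Q3, max =", summary)
```

Software packages use slightly different conventions for quartiles, so Q1 and Q3 may differ a little from a hand calculation.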

Graphical Displays: Histograms and Scatterplots

Histograms show the distribution of each variable, helping to identify outliers and the shape of the data.
Scatterplots display the relationship between two quantitative variables. The horizontal axis (x) represents the explanatory variable, and the vertical axis (y) represents the response variable.

Example: Scatterplot Interpretation

In the scatterplot of Internet vs. Facebook use, each point represents a country. Outliers, such as Japan (x = 79%, y = 13%), can be identified as points that deviate from the overall pattern.

How to Examine a Scatterplot

Describe the trend: Is the pattern linear, curved, clustered, or random?
Identify the direction: Is the association positive, negative, or none?
Assess the strength: How closely do the points follow the trend?
Look for outliers: Points that do not fit the overall pattern.

Interpreting Scatterplots: Direction and Association

Positive association: High values of x tend to occur with high values of y; low values of x with low values of y.
Negative association: High values of one variable tend to pair with low values of the other.

Section 3.2: Summarizing the Strength of Association: The Correlation Coefficient

Definition of Correlation

The correlation coefficient (r) measures the strength and direction of the linear association between two quantitative variables.

A positive r indicates a positive association.
A negative r indicates a negative association.
r close to +1 or -1 indicates a strong linear association.
r close to 0 indicates a weak linear association.

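Formula for the correlation coefficient:

$$r = \frac{1}{n-1} \sum z_x z_y = \frac{1}{n-1} \sum \left(\frac{x - \bar{x}}{s_x}\right)\left(\frac{y - \bar{y}}{s_y}\right)$$

Here $n$ is the number of observations, $\bar{x}$ and $\bar{y}$ are the sample means, and $s_x$ and $s_y$ are the sample standard deviations.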
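
As a sketch of how the correlation is computed, the function below averages the products of z-scores, dividing by n - 1. The function name `correlation` is hypothetical, and the data are the first eight rows of the table in these notes, so the result is not the value for the full data set.

```python
import math

def correlation(x, y):
    """Correlation r: sum of z-score products divided by n - 1."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations (divide by n - 1).
    s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    z_products = (
        ((xi - mean_x) / s_x) * ((yi - mean_y) / s_y) for xi, yi in zip(x, y)
    )
    return sum(z_products) / (n - 1)

# First eight countries from the table (Internet %, Facebook %).
internet = [49.9, 86.8, 42.3, 83.0, 12.6, 81.0, 26.5, 87.0]
facebook = [29.5, 51.9, 0.1, 39.0, 5.6, 52.9, 26.5, 52.1]

r = correlation(internet, facebook)
print(f"r for the eight-country subset = {r:.3f}")
```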

Properties of Correlation

Always falls between -1 and +1.
The sign of r denotes direction: negative for negative association, positive for positive association.
Unitless measure: does not depend on the units of the variables.
Correlation is not resistant to outliers.
Measures only the strength of a linear relationship.
Correlation is the same regardless of which variable is treated as the response or explanatory variable.
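
Two of these properties are easy to check numerically. The sketch below uses small hypothetical data and an algebraically equivalent form of r (cross-products over the square root of the sums of squares). It illustrates symmetry and sensitivity to a single outlier, not a general proof.

```python
import math

def correlation(x, y):
    """Correlation r computed as S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data that follow a nearly straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

r = correlation(x, y)

# Symmetry: swapping the roles of x and y leaves r unchanged.
r_swapped = correlation(y, x)

# Not resistant: one point far from the pattern pulls r toward 0.
r_with_outlier = correlation(x + [7.0], y + [0.0])

print(f"r = {r:.3f}, swapped = {r_swapped:.3f}, with outlier = {r_with_outlier:.3f}")
```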

Examples and Applications

Scatterplots with points close to a straight line have stronger correlation.
Example: Internet and Facebook use for 32 countries yields $r = 0.614$.

Section 3.3: Predicting the Outcome of a Variable: Regression Analysis

Regression Line

A regression line is used to predict the value of the response variable (y) as a function of the explanatory variable (x). The equation of the regression line is:

$$\hat{y} = a + bx$$

where $\hat{y}$ is the predicted value of y, and:

a: y-intercept (predicted value of y when x = 0)
b: slope (change in y for a one-unit increase in x)

Example: Predicting Height from Femur Length

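Regression equation: $\hat{y} = 61.4 + 2.4x$, where x is femur length (cm) and $\hat{y}$ is predicted height (cm). For a femur length of 50 cm: $\hat{y} = 61.4 + 2.4(50) = 181.4$ cm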
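
The prediction step can be sketched in Python. The intercept (61.4) and slope (2.4) used here are assumed values for this example, and the function name `predict_height_cm` is hypothetical.

```python
def predict_height_cm(femur_length_cm, intercept=61.4, slope=2.4):
    """Predicted height from femur length using y_hat = a + b * x.

    The default intercept and slope are assumed values for illustration.
    """
    return intercept + slope * femur_length_cm

predicted = predict_height_cm(50)
print(f"Predicted height for a 50 cm femur: {predicted:.1f} cm")
```

With these coefficients, each additional centimeter of femur length raises the predicted height by the slope, 2.4 cm.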

Interpreting the y-Intercept and Slope

y-intercept: Predicted value for y when x = 0. May not always have practical meaning.
Slope: Amount that y changes for each one-unit increase in x. Positive slope indicates positive association; negative slope indicates negative association.

Residuals: Measuring Prediction Errors

Residuals measure the difference between observed and predicted values:

$$\text{residual} = y - \hat{y}$$

Large residuals indicate unusual observations.
Smaller absolute residuals mean better predictions.
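
A short sketch of the residual calculation follows. Both the observations and the line are hypothetical, chosen only so the arithmetic is easy to follow. Each residual is the observed y minus the predicted y.

```python
def residuals(x, y, intercept, slope):
    """Return the residuals y - y_hat for the line y_hat = intercept + slope * x."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Hypothetical observations and a hypothetical line, for illustration only.
x = [10.0, 20.0, 30.0, 40.0]
y = [12.0, 19.0, 33.0, 38.0]
intercept, slope = 2.0, 0.9

res = residuals(x, y, intercept, slope)

# Index of the observation with the largest absolute residual.
worst = max(range(len(res)), key=lambda i: abs(res[i]))

for xi, yi, ei in zip(x, y, res):
    print(f"x = {xi:5.1f}  observed y = {yi:5.1f}  residual = {ei:+.1f}")
print(f"Largest absolute residual is at x = {x[worst]}")
```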

Method of Least Squares

The least squares regression line minimizes the sum of squared residuals:

$$\sum (\text{residual})^2 = \sum (y - \hat{y})^2$$

The line passes through the point $(\bar{x}, \bar{y})$.
The sum (and mean) of the residuals equals zero.
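
These properties can be checked numerically. The sketch below fits a line with the cross-product form of the least squares slope, then confirms that nudging the slope or intercept does not reduce the sum of squared residuals. The data are the first eight rows of the table in these notes, and the function names are hypothetical.

```python
def least_squares_fit(x, y):
    """Return (intercept, slope) of the least squares line for paired data."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared residuals for the line y_hat = intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# First eight countries from the table (Internet %, Facebook %).
internet = [49.9, 86.8, 42.3, 83.0, 12.6, 81.0, 26.5, 87.0]
facebook = [29.5, 51.9, 0.1, 39.0, 5.6, 52.9, 26.5, 52.1]

a, b = least_squares_fit(internet, facebook)
best = sum_squared_residuals(internet, facebook, a, b)

# Moving the slope or intercept away from the fitted values should not lower the sum.
nudged_slope = sum_squared_residuals(internet, facebook, a, b + 0.05)
nudged_intercept = sum_squared_residuals(internet, facebook, a + 1.0, b)

residual_sum = sum(yi - (a + b * xi) for xi, yi in zip(internet, facebook))
mean_x = sum(internet) / len(internet)
mean_y = sum(facebook) / len(facebook)

print(f"fitted line: y_hat = {a:.2f} + {b:.3f} x")
print(f"sum of residuals = {residual_sum:.2e}")
print(f"prediction at mean x = {a + b * mean_x:.2f}, mean y = {mean_y:.2f}")
```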

Formulas for Slope and Intercept

Slope: $b = r \left( \frac{s_y}{s_x} \right)$
Intercept: $a = \bar{y} - b\bar{x}$
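
The slope and intercept can be computed from summary statistics alone: the slope is r times the ratio of the standard deviation of y to that of x, and the intercept is the mean of y minus the slope times the mean of x. The following is a minimal sketch with hypothetical function names, applied to the first eight rows of the table in these notes.

```python
import math
import statistics

def correlation(x, y):
    """Correlation r computed as S_xy / sqrt(S_xx * S_yy)."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def fit_from_summaries(x, y):
    """Return (intercept, slope) using b = r * s_y / s_x and a = mean_y - b * mean_x."""
    r = correlation(x, y)
    slope = r * statistics.stdev(y) / statistics.stdev(x)
    intercept = statistics.mean(y) - slope * statistics.mean(x)
    return intercept, slope

# First eight countries from the table (Internet %, Facebook %).
internet = [49.9, 86.8, 42.3, 83.0, 12.6, 81.0, 26.5, 87.0]
facebook = [29.5, 51.9, 0.1, 39.0, 5.6, 52.9, 26.5, 52.1]

a, b = fit_from_summaries(internet, facebook)
print(f"y_hat = {a:.2f} + {b:.3f} x")
```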

Relationship Between Slope and Correlation

Correlation describes the strength of the linear association and is unitless.
Slope depends on the units of measurement and requires identification of response and explanatory variables.
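
The contrast can be illustrated with a small, entirely hypothetical data set of femur lengths and heights. Converting femur length from centimeters to millimeters divides the slope by 10 but leaves r unchanged, and swapping the roles of the two variables changes the slope but not r.

```python
import math

def slope_and_r(x, y):
    """Return (least squares slope of y on x, correlation r)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / s_xx, s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical measurements, for illustration only.
femur_cm = [40.0, 43.0, 45.0, 48.0, 50.0]
height_cm = [158.0, 164.0, 170.0, 176.0, 182.0]

slope_cm, r_cm = slope_and_r(femur_cm, height_cm)

# Same femur lengths expressed in millimeters.
femur_mm = [10 * v for v in femur_cm]
slope_mm, r_mm = slope_and_r(femur_mm, height_cm)

# Roles swapped: predict femur length from height.
slope_swapped, r_swapped = slope_and_r(height_cm, femur_cm)

print(f"slope per cm = {slope_cm:.3f}, slope per mm = {slope_mm:.4f}")
print(f"slope with roles swapped = {slope_swapped:.3f}")
print(f"r = {r_cm:.4f} (cm), {r_mm:.4f} (mm), {r_swapped:.4f} (swapped)")
```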

Coefficient of Determination ($r^2$)

The squared correlation ($r^2$) measures the proportion of the variation in the response variable explained by the linear relationship with the explanatory variable.

Example: For Internet and Facebook use, $r = 0.614$, so $r^2 = 0.377$ (37.7%).
This means 37.7% of the variability in Facebook use is explained by its linear relationship with Internet use.
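
The "proportion of variation explained" reading can be verified directly. For a least squares line, the squared correlation equals one minus the ratio of the residual sum of squares to the total sum of squares of y. The sketch below checks this on the first eight rows of the table in these notes, so the value differs from the 32-country result. The function name is hypothetical.

```python
import math

def r_squared_two_ways(x, y):
    """Return (r squared, 1 - SSE/SST) for the least squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)  # total variation in y (SST)

    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    # Variation left unexplained by the line (SSE).
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

    r = s_xy / math.sqrt(s_xx * s_yy)
    return r ** 2, 1 - sse / s_yy

# First eight countries from the table (Internet %, Facebook %).
internet = [49.9, 86.8, 42.3, 83.0, 12.6, 81.0, 26.5, 87.0]
facebook = [29.5, 51.9, 0.1, 39.0, 5.6, 52.9, 26.5, 52.1]

r2, explained = r_squared_two_ways(internet, facebook)
print(f"r^2 = {r2:.3f}, 1 - SSE/SST = {explained:.3f}")
```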

Summary Table: Correlation vs. Regression

Feature	Correlation	Regression
Measures	Strength & direction of linear association	Predicts response variable from explanatory variable
Unit	Unitless	Depends on variable units
Symmetry	Same regardless of variable roles	Requires response/explanatory distinction
Interpretation	Strong/weak, positive/negative	Change in y per unit change in x

Additional info: Table synthesized for comparison.