(Lecture 7) Association Between Two Quantitative Variables: Correlation and Regression
Section 3.2: The Association Between Two Quantitative Variables
Introduction to Association
In statistics, understanding the relationship between two quantitative variables is essential for data analysis and prediction. This section explores how to describe, visualize, and quantify associations between variables such as Internet and Facebook usage rates across countries.
Example: Internet and Facebook Penetration Rates
Consider the following data on the percentage of the population using the Internet and Facebook. The full data set covers 32 countries; 28 of them are listed below:
Country | Internet Penetration | Facebook Penetration |
|---|---|---|
Brazil | 49.9% | 29.5% |
Canada | 86.8% | 51.9% |
China | 42.3% | 0.1% |
France | 83.0% | 39.0% |
India | 12.6% | 5.6% |
United States | 81.0% | 52.9% |
Thailand | 26.5% | 26.5% |
United Kingdom | 87.0% | 52.1% |
Sweden | 94.0% | 52.0% |
Philippines | 36.2% | 30.9% |
Saudi Arabia | 54.0% | 20.7% |
Spain | 72.0% | 38.1% |
Turkey | 45.1% | 43.4% |
Russia | 53.3% | 5.6% |
Netherlands | 93.0% | 45.1% |
Peru | 38.2% | 31.2% |
Poland | 65.0% | 25.6% |
South Africa | 41.0% | 12.3% |
Japan | 65.0% | 13.5% |
Malaysia | 65.0% | 41.5% |
Mexico | 38.4% | 38.4% |
Colombia | 49.0% | 30.3% |
Egypt | 44.1% | 15.1% |
Germany | 72.8% | 56.4% |
Hong Kong | 72.8% | 56.4% |
Indonesia | 15.4% | 20.7% |
Italy | 58.0% | 38.1% |
Venezuela | 44.1% | 32.6% |
Additional info: Table entries inferred and grouped for clarity.
Measures of Center and Spread
To summarize the data, we use measures of center (mean, median) and spread (standard deviation, quartiles, minimum, maximum):
Variable | N | Mean | StDev | Minimum | Q1 | Median | Q3 | Maximum |
|---|---|---|---|---|---|---|---|---|
Internet Use | 32 | 59.2 | 22.4 | 12.6 | 43.6 | 56.9 | 81.3 | 94.0 |
Facebook Use | 32 | 33.9 | 16.0 | 0.0 | 24.4 | 34.5 | 47.1 | 56.4 |
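A minimal sketch of how such summaries can be computed, using only Python's standard library. It uses eight of the countries from the table above, so the results will not match the N = 32 summary. Quartile conventions also differ between software packages; the `"inclusive"` method here is one assumption among several.

```python
import statistics

# Internet and Facebook penetration (%) for eight countries from the table above.
internet = [49.9, 86.8, 42.3, 83.0, 12.6, 81.0, 94.0, 53.3]
facebook = [29.5, 51.9, 0.1, 39.0, 5.6, 52.9, 52.0, 5.6]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # quantiles(n=4) returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

for label, values in (("Internet", internet), ("Facebook", facebook)):
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    print(f"{label}: mean={mean:.1f} stdev={stdev:.1f}")
    print("  min, Q1, median, Q3, max =", five_number_summary(values))

summary = five_number_summary(internet)
```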
Graphical Displays: Histograms and Scatterplots
Histograms show the distribution of each variable, helping to identify outliers and the shape of the data.
Scatterplots display the relationship between two quantitative variables. The horizontal axis (x) represents the explanatory variable, and the vertical axis (y) represents the response variable.
Example: Scatterplot Interpretation
In the scatterplot of Internet vs. Facebook use, each point represents a country. Outliers, such as Japan (x = 79%, y = 13%), can be identified as points that deviate from the overall pattern.
How to Examine a Scatterplot
Describe the trend: Is the pattern linear, curved, clustered, or random?
Identify the direction: Is the association positive, negative, or none?
Assess the strength: How closely do the points follow the trend?
Look for outliers: Points that do not fit the overall pattern.
Interpreting Scatterplots: Direction and Association
Positive association: High values of x tend to occur with high values of y; low values of x with low values of y.
Negative association: High values of one variable tend to pair with low values of the other.
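One rough way to check direction numerically is to count how many points fall on the same side of both means (high with high, low with low) versus opposite sides. The sketch below does this for eight countries from the table above; the quadrant count is an illustrative device, not a method named in these notes.

```python
import statistics

# (Internet %, Facebook %) for eight countries taken from the table above.
data = [(49.9, 29.5), (86.8, 51.9), (42.3, 0.1), (83.0, 39.0),
        (12.6, 5.6), (81.0, 52.9), (94.0, 52.0), (53.3, 5.6)]

x_bar = statistics.mean(x for x, _ in data)
y_bar = statistics.mean(y for _, y in data)

# Same side of both means: supports a positive association.
same_side = sum(1 for x, y in data if (x - x_bar) * (y - y_bar) > 0)
# Opposite sides of the means: supports a negative association.
opposite_side = sum(1 for x, y in data if (x - x_bar) * (y - y_bar) < 0)

print("same side:", same_side, "opposite side:", opposite_side)
```

When most points land on the same side of both means, the association is positive; when most land on opposite sides, it is negative.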
Section 3.2: Summarizing the Strength of Association: The Correlation Coefficient
Definition of Correlation
The correlation coefficient (r) measures the strength and direction of the linear association between two quantitative variables.
A positive r indicates a positive association.
A negative r indicates a negative association.
r close to +1 or -1 indicates a strong linear association.
r close to 0 indicates a weak linear association (a strong curved relationship can still have r near 0).
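Formula for the correlation coefficient:
$$r = \frac{1}{n-1} \sum \left(\frac{x - \bar{x}}{s_x}\right)\left(\frac{y - \bar{y}}{s_y}\right)$$
where $n$ is the number of observations, $\bar{x}$ and $\bar{y}$ are the sample means, and $s_x$ and $s_y$ are the sample standard deviations. Each factor in parentheses is a z-score.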
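The correlation can be computed as the sum of the products of z-scores divided by n - 1. The sketch below is a minimal version of that calculation, applied to eight countries from the table above; because it uses only a subset, its result differs from the value for the full data set.

```python
import statistics

def correlation(xs, ys):
    """Correlation r: sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

# Internet and Facebook penetration (%) for eight countries from the table.
internet = [49.9, 86.8, 42.3, 83.0, 12.6, 81.0, 94.0, 53.3]
facebook = [29.5, 51.9, 0.1, 39.0, 5.6, 52.9, 52.0, 5.6]

r = correlation(internet, facebook)
print(f"r = {r:.3f}")
```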
Properties of Correlation
Always falls between -1 and +1.
The sign of r denotes direction: negative for negative association, positive for positive association.
Unitless measure: does not depend on the units of the variables.
Correlation is not resistant to outliers.
Measures only the strength of a linear relationship.
Correlation is the same regardless of which variable is treated as the response or explanatory variable.
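Several of these properties can be checked directly. The sketch below uses made-up, nearly linear data (not from the notes) to show that r is unchanged when the variables swap roles or change units, and that a single outlier can pull it down sharply.

```python
import statistics

def correlation(xs, ys):
    """Correlation r: sum of z-score products divided by n - 1."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    total = sum((x - x_bar) / s_x * (y - y_bar) / s_y for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Made-up, nearly linear data used only to illustrate the properties.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = correlation(xs, ys)
r_swapped = correlation(ys, xs)                      # roles of x and y exchanged
r_rescaled = correlation([100 * x for x in xs], ys)  # x measured in different units
r_outlier = correlation(xs + [6.0], ys + [0.0])      # one added outlier

print(f"original: {r:.3f}")
print(f"swapped:  {r_swapped:.3f}")
print(f"rescaled: {r_rescaled:.3f}")
print(f"outlier:  {r_outlier:.3f}")
```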
Examples and Applications
Scatterplots with points close to a straight line have stronger correlation.
Example: Internet and Facebook use for 32 countries yields $r \approx 0.614$.
Section 3.3: Predicting the Outcome of a Variable: Regression Analysis
Regression Line
A regression line is used to predict the value of the response variable (y) as a function of the explanatory variable (x). The equation of the regression line is:
$$\hat{y} = a + bx$$
where $\hat{y}$ is the predicted value of y.
a: y-intercept (predicted value of y when x = 0)
b: slope (change in y for a one-unit increase in x)
Example: Predicting Height from Femur Length
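Regression equation: $\hat{y} = a + bx$, where $x$ is femur length (cm) and $\hat{y}$ is predicted height (cm). For a femur length of 50 cm, the predicted height is $\hat{y} = a + b(50)$ cm.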
Interpreting the y-Intercept and Slope
y-intercept: Predicted value for y when x = 0. May not always have practical meaning.
Slope: Amount that y changes for each one-unit increase in x. Positive slope indicates positive association; negative slope indicates negative association.
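A small sketch of these two interpretations. The intercept and slope below are hypothetical values chosen only for illustration; they do not come from any example in these notes.

```python
def predict(x, a, b):
    """Predicted response y-hat = a + b * x."""
    return a + b * x

# Hypothetical intercept and slope, for illustration only.
a, b = 10.0, 0.5

at_zero = predict(0, a, b)                            # equals the y-intercept a
one_unit_change = predict(41, a, b) - predict(40, a, b)  # equals the slope b

print("prediction at x = 0:", at_zero)
print("change in prediction for a one-unit increase in x:", one_unit_change)
```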
Residuals: Measuring Prediction Errors
Residuals measure the difference between observed and predicted values:
$$\text{residual} = y - \hat{y}$$
Residuals that are large in absolute value indicate unusual observations.
Smaller absolute residuals mean better predictions.
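The sketch below computes residuals for eight countries from the table above. The line used for prediction is hypothetical (it is not fitted to the data and does not appear in the notes); it serves only to show how each residual is obtained and how the largest one flags an unusual observation.

```python
# (country, Internet %, Facebook %) for eight countries from the table above.
observations = [
    ("Brazil", 49.9, 29.5), ("Canada", 86.8, 51.9), ("China", 42.3, 0.1),
    ("France", 83.0, 39.0), ("India", 12.6, 5.6), ("United States", 81.0, 52.9),
    ("Sweden", 94.0, 52.0), ("Russia", 53.3, 5.6),
]

# Hypothetical line chosen only for illustration; it is not fitted to the data.
A, B = -5.0, 0.6

def residual(x, y, a=A, b=B):
    """Observed y minus predicted y-hat = a + b * x."""
    return y - (a + b * x)

residuals = {name: residual(x, y) for name, x, y in observations}
largest = max(residuals, key=lambda name: abs(residuals[name]))

for name, value in residuals.items():
    print(f"{name:15s} residual = {value:7.2f}")
print("largest absolute residual:", largest)
```

A negative residual means the observed value lies below the line, so the line over-predicts for that country.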
Method of Least Squares
The least squares regression line minimizes the sum of squared residuals:
$$\sum (y - \hat{y})^2$$
The line passes through the point $(\bar{x}, \bar{y})$.
The sum (and mean) of the residuals equals zero.
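These properties can be verified numerically. The sketch below fits a least squares line to made-up data (not from the notes), then checks that the residuals sum to zero, that the line passes through the point of means, and that nudging the slope or intercept increases the sum of squared residuals.

```python
import statistics

def least_squares(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b * x."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

def sse(xs, ys, a, b):
    """Sum of squared residuals for the line y-hat = a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data used only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print("sum of residuals:", round(sum(residuals), 10))
print("line at x-bar:", a + b * statistics.mean(xs), "y-bar:", statistics.mean(ys))
print("SSE, least squares line:", sse(xs, ys, a, b))
print("SSE, slope nudged:      ", sse(xs, ys, a, b + 0.1))
print("SSE, intercept nudged:  ", sse(xs, ys, a + 0.1, b))
```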
Formulas for Slope and Intercept
Slope: $b = r \left(\dfrac{s_y}{s_x}\right)$
Intercept: $a = \bar{y} - b\bar{x}$
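A sketch of the calculation, assuming the standard formulas: slope equals r times the ratio of standard deviations (s_y over s_x), and intercept equals the mean of y minus the slope times the mean of x. It uses eight countries from the table above, so the fitted line is for that subset only. The slope is also computed directly from sums of deviations as a cross-check.

```python
import statistics

# Internet (x) and Facebook (y) penetration (%) for eight countries from the table.
internet = [49.9, 86.8, 42.3, 83.0, 12.6, 81.0, 94.0, 53.3]
facebook = [29.5, 51.9, 0.1, 39.0, 5.6, 52.9, 52.0, 5.6]

n = len(internet)
x_bar, y_bar = statistics.mean(internet), statistics.mean(facebook)
s_x, s_y = statistics.stdev(internet), statistics.stdev(facebook)

# Correlation from the z-score definition.
r = sum((x - x_bar) / s_x * (y - y_bar) / s_y
        for x, y in zip(internet, facebook)) / (n - 1)

b = r * s_y / s_x        # slope
a = y_bar - b * x_bar    # intercept

# Cross-check: the same slope from sums of deviations.
b_direct = (sum((x - x_bar) * (y - y_bar) for x, y in zip(internet, facebook))
            / sum((x - x_bar) ** 2 for x in internet))

print(f"r = {r:.3f}")
print(f"y-hat = {a:.2f} + {b:.3f} x")
```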
Relationship Between Slope and Correlation
Correlation describes the strength of the linear association and is unitless.
Slope depends on the units of measurement and requires identification of response and explanatory variables.
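The contrast can be seen by refitting the same data in different units and with the roles reversed. The sketch below uses eight countries from the table above: expressing Internet use as a proportion instead of a percentage multiplies the slope by 100 but leaves r unchanged, and swapping response and explanatory variables changes the slope but not r.

```python
import statistics

def slope_and_r(xs, ys):
    """Return (b, r) for the least squares line predicting ys from xs."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = s_xy / ((len(xs) - 1) * s_x * s_y)
    return r * s_y / s_x, r

internet_pct = [49.9, 86.8, 42.3, 83.0, 12.6, 81.0, 94.0, 53.3]
facebook_pct = [29.5, 51.9, 0.1, 39.0, 5.6, 52.9, 52.0, 5.6]

b_pct, r_pct = slope_and_r(internet_pct, facebook_pct)
# Same data with x as a proportion (0 to 1) instead of a percentage.
b_prop, r_prop = slope_and_r([x / 100 for x in internet_pct], facebook_pct)
# Roles exchanged: predict Internet use from Facebook use.
b_rev, r_rev = slope_and_r(facebook_pct, internet_pct)

print(f"x in percent:    slope = {b_pct:.4f}, r = {r_pct:.3f}")
print(f"x as proportion: slope = {b_prop:.4f}, r = {r_prop:.3f}")
print(f"roles reversed:  slope = {b_rev:.4f}, r = {r_rev:.3f}")
```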
Coefficient of Determination ($r^2$)
The squared correlation ($r^2$) measures the proportion of the variation in the response variable explained by the linear relationship with the explanatory variable.
Example: For Internet and Facebook use, $r \approx 0.614$, so $r^2 \approx 0.377$ (37.7%).
This means 37.7% of the variability in Facebook use is explained by Internet use.
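The "proportion of variation explained" reading can be checked by comparing the total variation in y with the variation left over after fitting the line. The sketch below does this for eight countries from the table above (a subset, so the value differs from 37.7%) and confirms that one minus the ratio of residual to total variation equals the squared correlation. The names `sse` and `sst` are labels chosen here, not terms from the notes.

```python
import statistics

# Internet (x) and Facebook (y) penetration (%) for eight countries from the table.
xs = [49.9, 86.8, 42.3, 83.0, 12.6, 81.0, 94.0, 53.3]
ys = [29.5, 51.9, 0.1, 39.0, 5.6, 52.9, 52.0, 5.6]

x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

# Least squares line y-hat = a + b * x.
b = s_xy / s_xx
a = y_bar - b * x_bar

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # variation left unexplained
sst = s_yy                                                 # total variation in y

r = s_xy / (s_xx * s_yy) ** 0.5
r_squared = 1 - sse / sst

print(f"r = {r:.3f}, r squared = {r ** 2:.3f}")
print(f"1 - SSE/SST = {r_squared:.3f}")
```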
Summary Table: Correlation vs. Regression
Feature | Correlation | Regression |
|---|---|---|
Measures | Strength & direction of linear association | Predicts response variable from explanatory variable |
Unit | Unitless | Depends on variable units |
Symmetry | Same regardless of variable roles | Requires response/explanatory distinction |
Interpretation | Strong/weak, positive/negative | Change in y per unit change in x |
Additional info: Table synthesized for comparison.