Chapter 3: Association – Exploring Relationships Between Two Variables
Study Guide - Smart Notes
Association Between Two Variables
Introduction to Association
Understanding the association between variables is fundamental in statistics. An association exists when certain values of one variable are more likely to occur with specific values of another variable. This chapter explores how to identify, describe, and analyze associations between both categorical and quantitative variables.
Response Variable (Dependent Variable): The outcome variable on which comparisons are made.
Explanatory Variable (Independent Variable): The variable that explains or influences changes in the response variable.
Association: Exists if particular values for one variable are more likely to occur with certain values of the other variable.
Association Between Two Categorical Variables
Contingency Tables
A contingency table is used to display the relationship between two categorical variables. The rows represent categories of one variable, and the columns represent categories of the other. The entries are frequencies (counts).
Row and column totals provide marginal frequencies for each variable.
Cell counts show the joint frequency for each combination of categories.
Example: Meal Plans in College
Recommend a Meal Plan | Have a Meal Plan: Yes | Have a Meal Plan: No | Total |
|---|---|---|---|
Yes | 58 | 99 | 157 |
No | 51 | 2 | 53 |
Total | 109 | 101 | 210 |
The percentage of students who would recommend a meal plan is $157/210 \approx 0.748$, or about 74.8%.
Conditional Proportions
Conditional proportions help determine if an association exists by comparing the proportion of one variable within levels of another.
Recommend a Meal Plan | Have a Meal Plan: Yes | Have a Meal Plan: No | Total | n |
|---|---|---|---|---|
Yes | 0.37 | 0.63 | 1 | 157 |
No | 0.96 | 0.04 | 1 | 53 |
Reading across the rows of the table, only 37% of students who recommend a meal plan have one, while 96% of students who do not recommend a meal plan have one.
This large difference between the conditional proportions indicates an association between the variables.
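The arithmetic behind the conditional proportions can be sketched in a few lines of Python. This is a minimal illustration, assuming rows are "Recommend a Meal Plan" and columns are "Have a Meal Plan", as the table is labeled; the function and variable names are made up for this example.

```python
# Counts from the meal plan table, following the table's labels:
# outer key = recommend a meal plan (row), inner key = have a meal plan (column).
counts = {
    "Yes": {"Yes": 58, "No": 99},
    "No": {"Yes": 51, "No": 2},
}


def row_conditional_proportions(table):
    """Divide each cell by its row total, so every row sums to 1."""
    result = {}
    for row_label, row in table.items():
        row_total = sum(row.values())
        result[row_label] = {col: count / row_total for col, count in row.items()}
    return result


# Marginal proportion: row total divided by the grand total.
grand_total = sum(sum(row.values()) for row in counts.values())
marginal_recommend_yes = sum(counts["Yes"].values()) / grand_total

conditional = row_conditional_proportions(counts)
for row_label, row in conditional.items():
    print(row_label, {col: round(p, 2) for col, p in row.items()})
print("Marginal proportion recommending:", round(marginal_recommend_yes, 3))
```

Rounded to two decimals, the rows reproduce the 0.37/0.63 and 0.96/0.04 entries of the conditional proportion table.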
Visualizing Associations: Side-by-Side Bar Plots
Side-by-side bar plots display conditional proportions for each category of the explanatory variable.
If there is no association, the bars for each group will be similar in height.
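As a rough text-only stand-in for a side-by-side bar plot, the sketch below draws one bar of `#` characters per conditional proportion from the meal plan example. The function name `text_bars`, the group labels, and the bar width are assumptions made for illustration; a real plot would normally come from a plotting library.

```python
# Conditional proportions copied from the meal plan example
# (each group's proportions sum to 1).
conditional = {
    "Recommend: Yes": {"Plan: Yes": 0.37, "Plan: No": 0.63},
    "Recommend: No": {"Plan: Yes": 0.96, "Plan: No": 0.04},
}


def text_bars(table, width=40):
    """Return one text bar per (group, category); a proportion of 1.0 fills `width` characters."""
    lines = []
    for group, props in table.items():
        for category, p in props.items():
            bar = "#" * round(p * width)
            lines.append(f"{group:<15} {category:<10} {bar} {p:.2f}")
    return lines


lines = text_bars(conditional)
print("\n".join(lines))
```

The two groups show very different bar lengths for the same category, which is the visual signature of an association.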
Class Exercise: Gender Gap in Party Identification
Gender | Democrat | Independent | Republican | Total |
|---|---|---|---|---|
Male | 299 | 365 | 232 | 896 |
Female | 422 | 381 | 273 | 1,076 |
Total | 721 | 746 | 505 | 1,972 |
Identify response and explanatory variables.
Calculate joint, marginal, and conditional proportions to analyze the association.
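One way to organize the calculations for this exercise is sketched below. It assumes gender is treated as the explanatory variable and party identification as the response, which is one reasonable choice rather than the only one; all variable names are illustrative.

```python
# Counts from the party identification table (rows = gender, columns = party).
party_counts = {
    "Male": {"Democrat": 299, "Independent": 365, "Republican": 232},
    "Female": {"Democrat": 422, "Independent": 381, "Republican": 273},
}
parties = ["Democrat", "Independent", "Republican"]
grand_total = sum(sum(row.values()) for row in party_counts.values())

# Joint proportions: each cell divided by the grand total.
joint = {
    gender: {party: count / grand_total for party, count in row.items()}
    for gender, row in party_counts.items()
}

# Marginal proportions: row or column totals divided by the grand total.
marginal_gender = {
    gender: sum(row.values()) / grand_total for gender, row in party_counts.items()
}
marginal_party = {
    party: sum(row[party] for row in party_counts.values()) / grand_total
    for party in parties
}

# Conditional proportions of party given gender: each cell divided by its row total.
conditional_on_gender = {
    gender: {party: count / sum(row.values()) for party, count in row.items()}
    for gender, row in party_counts.items()
}

for gender, row in conditional_on_gender.items():
    print(gender, {party: round(p, 3) for party, p in row.items()})
```

Comparing the two printed rows category by category shows where the conditional proportions differ between men and women.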
Association Between Two Quantitative Variables
Scatterplots
A scatterplot is a graphical display of the relationship between two quantitative variables. The explanatory variable is plotted on the horizontal axis (x), and the response variable on the vertical axis (y).
Trend: Linear, curved, clusters, or no pattern.
Direction: Positive, negative, or none.
Strength: How closely the points fit the trend.
Outliers should be noted as they can affect analysis.
Example: There is a strong negative linear association between car weight and miles per gallon (mpg).
Correlation Coefficient (r)
The correlation coefficient measures the strength and direction of the linear association between two quantitative variables.
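Formula: $r = \frac{1}{n-1} \sum \left(\frac{x - \bar{x}}{s_x}\right)\left(\frac{y - \bar{y}}{s_y}\right)$, where $\bar{x}$ and $\bar{y}$ are the sample means and $s_x$ and $s_y$ are the sample standard deviations.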
Range: -1 to +1
r > 0: Positive association; r < 0: Negative association
r close to ±1: Strong linear association; r close to 0: Weak linear association
Unitless: changing the units of either variable does not change r
Not resistant to outliers
Only measures linear relationships
Example: For mpg and weight, $r = -0.87$ indicates a strong negative linear correlation.
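The sketch below computes r as the sum of products of z-scores divided by n − 1, the standard textbook form of the correlation coefficient. The weight and mpg values are hypothetical numbers chosen for illustration, not the data behind the notes' example, and the helper names are assumptions.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(x, y):
    """r = (1 / (n - 1)) * sum of (z-score of x) * (z-score of y)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sx, sy = sample_sd(x), sample_sd(y)
    return sum(((xi - mx) / sx) * ((yi - my) / sy) for xi, yi in zip(x, y)) / (n - 1)


# Hypothetical data (not from the notes): car weight in pounds and miles per gallon.
weight = [2500, 3000, 3500, 4000, 4500]
mpg = [34, 29, 25, 22, 17]

r = correlation(weight, mpg)
# Changing units (pounds -> thousands of pounds) leaves r unchanged.
r_rescaled = correlation([w / 1000 for w in weight], mpg)
print(round(r, 3), round(r_rescaled, 3))
```

Because the z-scores remove the units, the two printed values agree, which illustrates why r is unitless.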
Regression Line
The regression line predicts the value of the response variable y as a linear function of the explanatory variable x.
Equation: $\hat{y} = a + bx$, where $\hat{y}$ is the predicted value of y
a: y-intercept
b: Slope
Formulas for a and b: $b = r\left(\frac{s_y}{s_x}\right)$ and $a = \bar{y} - b\bar{x}$
Example: For mpg and weight,
The slope indicates the change in predicted y for a one-unit increase in x.
The y-intercept is the predicted value when x = 0 (may not always be meaningful).
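A minimal sketch of fitting the least-squares line follows. It computes the slope as the sum of cross-products over the sum of squared x deviations, which is algebraically the same as r times the ratio of the standard deviations of y and x. The data are hypothetical, and `least_squares_line` is a name invented for this example.

```python
def least_squares_line(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b * x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx      # slope, equal to r * (s_y / s_x)
    a = my - b * mx    # intercept: the line passes through (x-bar, y-bar)
    return a, b


# Hypothetical data (not from the notes): car weight in pounds and miles per gallon.
weight = [2500, 3000, 3500, 4000, 4500]
mpg = [34, 29, 25, 22, 17]

a, b = least_squares_line(weight, mpg)
predicted_mpg = a + b * 3200  # prediction at a weight inside the observed range
print(round(a, 2), round(b, 4), round(predicted_mpg, 2))
```

For these made-up numbers the slope is negative: each additional pound lowers the predicted mpg by about 0.0082. The intercept is the prediction at a weight of 0, which has no practical meaning here.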
Coefficient of Determination ($r^2$)
The squared correlation ($r^2$) measures the proportion of variability in the response variable explained by the linear relationship with the explanatory variable.
Example: If $r = -0.87$, then $r^2 = 0.7569$. This means 75.69% of the variation in mpg is explained by the linear relationship with car weight.
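To make "proportion of variability explained" concrete, the sketch below compares the squared prediction errors of the least-squares line with the total squared deviations of y from its mean. For a least-squares line, one minus that ratio equals the squared correlation. The data are hypothetical and the function name is an assumption.

```python
def r_squared(x, y):
    """1 - (sum of squared residuals) / (total sum of squares) for the least-squares line."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    residual_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    total_ss = sum((yi - my) ** 2 for yi in y)
    return 1 - residual_ss / total_ss


# Hypothetical data (not from the notes).
weight = [2500, 3000, 3500, 4000, 4500]
mpg = [34, 29, 25, 22, 17]
explained = r_squared(weight, mpg)
print(round(explained, 4))

# The arithmetic from the notes' example: squaring r = -0.87.
print(round((-0.87) ** 2, 4))
```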
Important Points in Analyzing Associations
Extrapolation: Predicting y for x values outside the observed range is risky and may not be valid.
Influential Outliers: Points with extreme x values that do not follow the trend can greatly affect the regression line.
Regression Outlier: An observation far from the trend of the data.
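The effect of an influential outlier can be seen by fitting the slope twice, with and without one point that has an extreme x value and falls far from the trend. The numbers below are hypothetical and chosen only to make the effect obvious; `slope` is an illustrative helper name.

```python
def slope(x, y):
    """Least-squares slope b for y-hat = a + b * x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical data following a clear upward trend (slope near 2).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope_without = slope(x, y)

# Add one point with an extreme x value that does not follow the trend.
slope_with = slope(x + [15], y + [5.0])

print(round(slope_without, 2), round(slope_with, 2))
```

A single added point pulls the fitted slope from about 2 to nearly 0, which is why such observations should be identified before interpreting a regression line.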
Correlation vs. Causation
A strong correlation does not imply that one variable causes changes in the other.
Correlation only indicates association, not causality.
Lurking Variables
A lurking variable is an unmeasured variable that influences the relationship between the explanatory and response variables.
Example: Age can be a lurking variable affecting both height and math score in children.
Lurking variables can create spurious associations or mask real ones.
Simpson’s Paradox and Confounding
Simpson’s Paradox: The direction of an association between two variables reverses when a third variable is considered.
Confounding: Occurs when two explanatory variables are both associated with the response variable and with each other, making it difficult to separate their effects.
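A small numeric sketch can show how the direction of an association reverses once a third variable is considered. The counts below are illustrative and are not taken from these notes; the labels "A", "B", "Subgroup 1", and "Subgroup 2" are placeholders.

```python
# Illustrative (successes, trials) counts for two treatments within two subgroups.
outcomes = {
    "Subgroup 1": {"A": (81, 87), "B": (234, 270)},
    "Subgroup 2": {"A": (192, 263), "B": (55, 80)},
}

# Success rate of each treatment within each subgroup.
within = {
    subgroup: {t: successes / trials for t, (successes, trials) in by_treatment.items()}
    for subgroup, by_treatment in outcomes.items()
}

# Success rate of each treatment after pooling the subgroups together.
overall = {}
for treatment in ("A", "B"):
    successes = sum(outcomes[sg][treatment][0] for sg in outcomes)
    trials = sum(outcomes[sg][treatment][1] for sg in outcomes)
    overall[treatment] = successes / trials

for subgroup, rates in within.items():
    print(subgroup, {t: round(p, 3) for t, p in rates.items()})
print("Overall", {t: round(p, 3) for t, p in overall.items()})
```

Treatment A has the higher success rate within each subgroup, yet B has the higher rate overall. The reversal arises because A was used mostly in the subgroup where success is less common for both treatments.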
Summary Table: Key Concepts in Association Analysis
Concept | Description | Example |
|---|---|---|
Contingency Table | Displays frequencies for two categorical variables | Meal plan vs. recommendation |
Scatterplot | Graphical display for two quantitative variables | Weight vs. mpg |
Correlation (r) | Measures strength and direction of linear association | r = -0.87 for weight and mpg |
Regression Line | Predicts y from x using $\hat{y} = a + bx$ | Predicting mpg from weight |
$r^2$ | Proportion of variance explained | 0.7569 (75.69%) |
Lurking Variable | Unmeasured variable affecting association | Age in height/math score |
Simpson’s Paradox | Association reverses with third variable | Party ID by gender and age |