Association Between Variables: Categorical and Quantitative Analysis
Study Guide - Smart Notes
Chapter 3: Association
Association Between Two Categorical Variables
Understanding how two categorical variables relate is fundamental in statistics. An association exists if certain values of one variable are more likely to occur with specific values of the other variable.
Response Variable (Dependent Variable): The outcome variable on which comparisons are made.
Explanatory Variable (Independent Variable): The variable that explains changes in the response variable.
Association: Exists when particular values for one variable are more likely to occur with certain values of the other variable.
Example: Is there an association between college GPA and high school GPA? Here, college GPA is the response variable, and high school GPA is the explanatory variable.
Contingency Tables
Contingency tables are used to display the frequencies of two categorical variables.
Rows represent categories of one variable.
Columns represent categories of the other variable.
Entries are frequencies (counts).
Example Table: Meal Plans in College
| Recommend a Meal Plan | Have a Meal Plan: Yes | Have a Meal Plan: No | Total |
|---|---|---|---|
| Yes | 58 | 51 | 109 |
| No | 99 | 2 | 101 |
| Total | 157 | 53 | 210 |
The percentage of students who would recommend a meal plan is 109/210 ≈ 51.9%.
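As a quick check on the arithmetic, the overall (marginal) percentage can be computed directly from the counts in the meal plan table. This is a minimal sketch; the dictionary layout and the function name `marginal_row_proportion` are illustrative choices, not part of the course materials.

```python
# Counts from the meal plan table.
# Outer key: would recommend a meal plan; inner key: has a meal plan.
counts = {
    "Yes": {"Yes": 58, "No": 51},
    "No": {"Yes": 99, "No": 2},
}

def marginal_row_proportion(table, row):
    """Share of all observations that fall in the given row category."""
    grand_total = sum(sum(cols.values()) for cols in table.values())
    return sum(table[row].values()) / grand_total

recommend_share = marginal_row_proportion(counts, "Yes")
print(f"{recommend_share:.1%}")  # 109 out of 210 students
```

The marginal proportion ignores the second variable entirely, which is why it cannot reveal an association on its own.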
Conditional Proportions
Conditional proportions help determine if an association exists by comparing proportions within levels of the explanatory variable.
| Have a Meal Plan | Recommend: Yes | Recommend: No | Total | n |
|---|---|---|---|---|
| Yes | 0.37 | 0.63 | 1.00 | 157 |
| No | 0.96 | 0.04 | 1.00 | 53 |
Only 37% of those with a meal plan recommend it, while 96% of those without a meal plan recommend it.
Substantial differences in conditional proportions across levels of the explanatory variable suggest an association.
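The conditional proportions for the meal plan example can be reproduced by dividing each count by the total for its level of the explanatory variable (having a meal plan). A minimal sketch, with hypothetical names such as `counts_by_plan` and `conditional_proportions`:

```python
# Counts keyed first by the explanatory variable (has a meal plan),
# then by the response (would recommend a meal plan).
counts_by_plan = {
    "Has plan": {"Recommend": 58, "Do not recommend": 99},
    "No plan": {"Recommend": 51, "Do not recommend": 2},
}

def conditional_proportions(table):
    """Divide each count by its group total so every group sums to 1."""
    result = {}
    for group, responses in table.items():
        n = sum(responses.values())
        result[group] = {resp: count / n for resp, count in responses.items()}
    return result

props = conditional_proportions(counts_by_plan)
for group, responses in props.items():
    print(group, {resp: round(p, 2) for resp, p in responses.items()})
```

Comparing the "Recommend" proportion between the two groups (about 0.37 versus 0.96) is what points to an association here.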
Side-By-Side Bar Plots
Bar plots visually compare conditional proportions across categories. If there is no association, proportions for the response variable are similar across levels of the explanatory variable.
Class Exercise: Gender Gap in Party Identification
| Gender | Democrat | Independent | Republican | Total |
|---|---|---|---|---|
| Male | 299 | 365 | 232 | 896 |
| Female | 422 | 381 | 273 | 1,076 |
| Total | 721 | 746 | 505 | 1,972 |
Calculate proportions for specific combinations (e.g., male and Republican).
Find conditional proportions for party identification given gender.
Visualize differences using side-by-side bar plots.
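One way to work through the exercise is sketched below, using the counts from the party identification table. The variable names are illustrative, and the text bars are only a rough stand-in for a real side-by-side bar plot.

```python
party_counts = {
    "Male": {"Democrat": 299, "Independent": 365, "Republican": 232},
    "Female": {"Democrat": 422, "Independent": 381, "Republican": 273},
}
grand_total = sum(sum(row.values()) for row in party_counts.values())

# Joint proportion: share of ALL respondents who are male and Republican.
male_republican = party_counts["Male"]["Republican"] / grand_total

# Conditional proportions: party identification given gender.
party_given_gender = {
    gender: {party: count / sum(row.values()) for party, count in row.items()}
    for gender, row in party_counts.items()
}

# Crude side-by-side comparison: one text bar per gender within each party.
for party in ("Democrat", "Independent", "Republican"):
    for gender in ("Male", "Female"):
        p = party_given_gender[gender][party]
        print(f"{party:<12}{gender:<8}{'#' * round(p * 40)} {p:.2f}")
```

Note the different denominators: the joint proportion divides by all 1,972 respondents, while each conditional proportion divides by the row total for that gender.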
Association Between Two Quantitative Variables
Scatterplots are used to display the association between two quantitative variables. The explanatory variable is plotted on the horizontal axis, and the response variable on the vertical axis.
Trend: Linear, curved, clusters, or no pattern.
Direction: Positive, negative, or no direction.
Strength: How closely points fit the trend.
Outliers: Points that deviate from the overall trend.
Example: There is a strong negative linear association between weight and miles per gallon (mpg) of a car.
Scatterplot Creation (TI-Calculator)
Store explanatory variable values under L1 and response variable values under L2.
Select scatterplot option and assign lists to X and Y axes.
Press ZOOM 9 (ZoomStat) so the window is scaled to fit the data.
The Correlation Coefficient, r
The correlation coefficient measures the strength and direction of the linear association between two quantitative variables.
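Formula: $r = \frac{1}{n-1} \sum \left(\frac{x - \bar{x}}{s_x}\right)\left(\frac{y - \bar{y}}{s_y}\right)$, where $\bar{x}, \bar{y}$ are the sample means and $s_x, s_y$ are the sample standard deviations.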
r ranges from -1 to +1.
Positive r: positive association; negative r: negative association.
r close to ±1: strong linear association; r close to 0: weak linear association.
Correlation is unitless and not resistant to outliers.
Correlation only measures linear relationships.
Example: For mpg and weight, r = -0.87 indicates a strong negative linear correlation.
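The properties listed above can be checked numerically. The sketch below computes r as the sum of products of z-scores divided by n - 1, which is the standard textbook definition. The data values are hypothetical and are not the car data behind the mpg example.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(x, y):
    """r = (1 / (n - 1)) * sum of z_x * z_y over all observations."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = sample_sd(x), sample_sd(y)
    products = (((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y))
    return sum(products) / (n - 1)

# Hypothetical values, invented for illustration only.
weight_1000lb = [2.0, 2.5, 3.0, 3.5, 4.0]
mpg = [33, 28, 25, 21, 18]

r = correlation(weight_1000lb, mpg)
# Changing the units of x (1000 lb -> lb) should leave r unchanged.
r_in_pounds = correlation([w * 1000 for w in weight_1000lb], mpg)
print(round(r, 3), round(r_in_pounds, 3))
```

Because each variable is standardized before multiplying, the units cancel, which is why r is unitless.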
Regression Line
The regression line predicts the value of the response variable as a linear function of the explanatory variable.
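Equation: $\hat{y} = a + bx$
y-intercept (a): $a = \bar{y} - b\bar{x}$, the predicted value of y when x = 0.
Slope (b): $b = r \frac{s_y}{s_x}$, the predicted change in y for a one-unit increase in x.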
Example: For mpg and weight (weight measured in 1000 lb), the slope is b = -5.344.
Slope interpretation: On average, predicted mpg decreases by 5.344 for each 1000 lb increase in weight.
y-intercept may not be meaningful if x = 0 is outside the observed range.
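A minimal sketch of fitting the regression line from summary statistics, using the standard least-squares formulas b = r(s_y / s_x) and a = ȳ - b·x̄. The function name `fit_line` and the data are hypothetical; the points were chosen to lie exactly on y = 11 - 2x so the result is easy to verify by hand.

```python
from math import sqrt

def fit_line(x, y):
    """Return (a, b) for the least-squares line y_hat = a + b * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x      # slope
    a = y_bar - b * x_bar  # y-intercept
    return a, b

# Hypothetical points lying exactly on y = 11 - 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [9.0, 7.0, 5.0, 3.0]

a, b = fit_line(x, y)
predicted = a + b * 2.5  # prediction at an x value inside the observed range
print(a, b, predicted)
```

The formula for a guarantees that the fitted line passes through the point (x̄, ȳ).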
Squared Correlation, r²
The squared correlation measures the proportion of variability in the response variable explained by the linear relationship.
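Formula: $r^2 = \frac{\sum (y - \bar{y})^2 - \sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}$, which equals the square of the correlation coefficient r.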
Interpretation: For r = -0.87, r² = 0.7569 means that about 75.69% of the variation in mpg is explained by its linear relationship with weight.
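To see why r² is read as "proportion of variation explained", the sketch below compares two quantities on hypothetical data: the relative reduction in squared prediction error when the regression line is used instead of ȳ, and the square of r. The function names and data values are illustrative assumptions.

```python
from math import sqrt

def sums_of_squares(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return x_bar, y_bar, s_xx, s_yy, s_xy

def r_squared_from_residuals(x, y):
    """(total variation - residual variation) / total variation."""
    x_bar, y_bar, s_xx, s_yy, s_xy = sums_of_squares(x, y)
    b = s_xy / s_xx          # least-squares slope
    a = y_bar - b * x_bar    # least-squares intercept
    residual_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return (s_yy - residual_ss) / s_yy

def correlation(x, y):
    _, _, s_xx, s_yy, s_xy = sums_of_squares(x, y)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical values, invented for illustration only.
weight_1000lb = [2.0, 2.5, 3.0, 3.5, 4.0]
mpg = [33, 28, 25, 21, 18]

explained = r_squared_from_residuals(weight_1000lb, mpg)
r = correlation(weight_1000lb, mpg)
print(round(explained, 4), round(r ** 2, 4))  # the two values agree
```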
Important Points in Analyzing Associations
Extrapolation: Predicting y for x values outside the observed range is risky; the relationship may not hold.
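Influential Outliers: An observation is influential when its x value is extreme relative to the rest of the data and it falls far from the trend of the other points; such a point can substantially change the regression line and the correlation.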
Regression Outlier: An observation far from the trend.
Example: An influential outlier can distort the regression line, while a non-influential outlier may not.
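The difference between an influential outlier and a non-influential regression outlier can be illustrated with a small experiment. The sketch below uses made-up points on the line y = 2x, then adds one outlier at an extreme x value and, separately, one outlier at a middle x value.

```python
def slope(points):
    """Least-squares slope for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    return s_xy / s_xx

# Hypothetical data lying exactly on y = 2x.
base = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]

slope_base = slope(base)
# Outlier with an extreme x value, far below the trend.
slope_extreme_x = slope(base + [(15, 0)])
# Regression outlier at the middle of the x range, far above the trend.
slope_middle_x = slope(base + [(3, 12)])

print(slope_base, round(slope_extreme_x, 3), slope_middle_x)
```

In this constructed example the extreme-x point reverses the sign of the slope, while the middle-x point leaves the slope unchanged and only shifts the line upward. Real data will rarely be this clean, but the pattern is the same.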
Correlation vs Causation
Strong correlation does not imply causation.
Correlation indicates association, not a cause-effect relationship.
Lurking Variables
A lurking variable is not included in the analysis but can influence the relationship between variables.
Example: Age is a lurking variable affecting both height and math score in children.
Lurking variables may be common causes for both explanatory and response variables.
Simpson’s Paradox and Confounding
Simpson’s Paradox: The direction of an association between two variables can reverse when a third variable is included and the data are analyzed separately at each of its levels.
Confounding: Two explanatory variables are both associated with the response variable and with each other, making it difficult to distinguish their effects.
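A small numeric illustration of Simpson's paradox follows. The counts are entirely hypothetical: two options, A and B, are compared on a success rate within two levels of a third variable and then with the levels pooled.

```python
# Hypothetical (successes, trials) counts, invented for illustration only.
results = {
    "Level 1": {"A": (90, 100), "B": (800, 1000)},
    "Level 2": {"A": (300, 1000), "B": (20, 100)},
}

def rate(successes, trials):
    return successes / trials

def pooled_rate(option):
    """Success rate for one option after ignoring the third variable."""
    successes = sum(level[option][0] for level in results.values())
    trials = sum(level[option][1] for level in results.values())
    return successes / trials

for level, options in results.items():
    print(level, {opt: round(rate(*st), 2) for opt, st in options.items()})
print("Pooled", {opt: round(pooled_rate(opt), 2) for opt in ("A", "B")})
```

Within each level A has the higher success rate, yet B looks better once the levels are pooled. The reversal arises because the third variable is associated with both the option and the outcome: A is mostly observed at Level 2, where success is rare for both options.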