Association Between Variables: Categorical and Quantitative Analysis

Study Guide - Smart Notes


Chapter 3: Association

Association Between Two Categorical Variables

Understanding the relationship between two categorical variables is fundamental in statistics. An association exists if certain values of one variable are more likely to occur with specific values of the other variable.

Response Variable (Dependent Variable): The outcome variable on which comparisons are made.
Explanatory Variable (Independent Variable): The variable that explains changes in the response variable.
Association: Exists when particular values for one variable are more likely to occur with certain values of the other variable.

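Example: Is there an association between college GPA and high school GPA? Here, college GPA is the response variable and high school GPA is the explanatory variable. (Both variables are quantitative; the example only illustrates the two roles, which apply to categorical variables in the same way.)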

Contingency Tables

Contingency tables are used to display the frequencies of two categorical variables.

Rows represent categories of one variable.
Columns represent categories of the other variable.
Entries are frequencies (counts).

Example Table: Meal Plans in College

Recommend a Meal Plan	Have a Meal Plan: Yes	Have a Meal Plan: No	Total
Yes	58	51	109
No	99	2	101
Total	157	53	210

The percentage of students who would recommend a meal plan is 109/210 ≈ 51.9%.

Conditional Proportions

Conditional proportions help determine if an association exists by comparing proportions within levels of the explanatory variable.

Have a Meal Plan	Recommend: Yes	Recommend: No	Total	n
Yes	0.37	0.63	1.00	157
No	0.96	0.04	1.00	53

Only 37% of those with a meal plan recommend it, while 96% of those without a meal plan recommend it.
Large differences in conditional proportions, such as 37% versus 96%, suggest an association.
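The conditional proportions above can be reproduced directly from the counts in the meal plan table. The following is a minimal sketch; the dictionary layout and the helper name conditional_proportions are illustrative choices, not part of the original notes.

```python
# Counts from the meal plan table, keyed by (has_meal_plan, recommends).
counts = {
    ("yes", "yes"): 58, ("yes", "no"): 99,
    ("no", "yes"): 51, ("no", "no"): 2,
}

def conditional_proportions(counts, group):
    """Proportion of each response within one level of the explanatory variable."""
    total = sum(n for (g, _), n in counts.items() if g == group)
    return {resp: n / total for (g, resp), n in counts.items() if g == group}

with_plan = conditional_proportions(counts, "yes")      # among the 157 with a plan
without_plan = conditional_proportions(counts, "no")    # among the 53 without a plan

# Marginal (overall) proportion who recommend a plan, ignoring plan status.
overall_recommend = (58 + 51) / sum(counts.values())

print(f"with plan:    {with_plan['yes']:.2f} recommend")
print(f"without plan: {without_plan['yes']:.2f} recommend")
print(f"overall:      {overall_recommend:.3f} recommend")
```

Conditioning on the explanatory variable (having a meal plan) is what makes the two groups comparable even though their sizes differ.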

Side-By-Side Bar Plots

Bar plots visually compare conditional proportions across categories. If there is no association, proportions for the response variable are similar across levels of the explanatory variable.

Class Exercise: Gender Gap in Party Identification

Gender	Democrat	Independent	Republican	Total
Male	299	365	232	896
Female	422	381	273	1,076
Total	721	746	505	1,972

Calculate proportions for specific combinations (e.g., male and Republican).
Find conditional proportions for party identification given gender.
Visualize differences using side-by-side bar plots.
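One possible way to work through the exercise is sketched below, using the counts from the table. The nested-dictionary layout and the text-based bars are assumptions made for illustration; a graphing tool would normally draw the side-by-side bar plot.

```python
# Counts from the party identification table: gender -> party -> count.
table = {
    "Male":   {"Democrat": 299, "Independent": 365, "Republican": 232},
    "Female": {"Democrat": 422, "Independent": 381, "Republican": 273},
}
grand_total = sum(sum(row.values()) for row in table.values())

# Proportion of ALL respondents who are male and Republican.
male_republican = table["Male"]["Republican"] / grand_total

# Conditional proportions of party identification given gender.
conditional = {
    gender: {party: n / sum(row.values()) for party, n in row.items()}
    for gender, row in table.items()
}

# Rough text version of a side-by-side bar plot (one '#' per 2 percentage points).
for party in ("Democrat", "Independent", "Republican"):
    for gender in table:
        p = conditional[gender][party]
        print(f"{party:<12}{gender:<8}{p:.3f} {'#' * round(p * 50)}")
```

If the bars for males and females had nearly the same length within each party, that would point toward no association between gender and party identification.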

Association Between Two Quantitative Variables

Scatterplots are used to display the association between two quantitative variables. The explanatory variable is plotted on the horizontal axis, and the response variable on the vertical axis.

Trend: Linear, curved, clusters, or no pattern.
Direction: Positive, negative, or no direction.
Strength: How closely points fit the trend.
Outliers: Points that deviate from the overall trend.

Example: There is a strong negative linear association between weight and miles per gallon (mpg) of a car.

Scatterplot Creation (TI-Calculator)

Store explanatory variable values under L1 and response variable values under L2.
Select scatterplot option and assign lists to X and Y axes.
Press ZOOM 9 (ZoomStat) to fit the viewing window to the data.

The Correlation Coefficient, r

The correlation coefficient measures the strength and direction of the linear association between two quantitative variables.

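Formula: $r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right)$, where $\bar{x}$, $\bar{y}$ are the sample means and $s_x$, $s_y$ are the sample standard deviations.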

r ranges from -1 to +1.
Positive r: positive association; negative r: negative association.
r close to ±1: strong linear association; r close to 0: weak linear association.
Correlation is unitless and not resistant to outliers.
Correlation only measures linear relationships.

Example: For mpg and weight, r = -0.87 indicates a strong negative linear correlation.
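The correlation can be computed as the sum of products of standardized x and y values, divided by n - 1. The sketch below follows that definition with Python's standard library. The five (weight, mpg) pairs are hypothetical values invented for illustration; they are not the data set behind the notes.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: weight in 1000s of lb, fuel economy in mpg.
weight = [2.0, 2.5, 3.0, 3.5, 4.0]
mpg = [33, 28, 25, 20, 18]

r = correlation(weight, mpg)

# Changing units (1000s of lb -> lb) leaves r unchanged: correlation is unitless.
r_pounds = correlation([w * 1000 for w in weight], mpg)

print(f"r = {r:.3f}, r with weight in lb = {r_pounds:.3f}")
```

Because each variable is standardized before multiplying, rescaling either variable does not change r.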

Regression Line

The regression line predicts the value of the response variable as a linear function of the explanatory variable.

Equation: $\hat{y} = a + bx$
y-intercept (a): $a = \bar{y} - b\bar{x}$
Slope (b): $b = r \frac{s_y}{s_x}$

Example: For mpg and weight (in 1000s of pounds), the fitted line has slope b = -5.344.

Slope interpretation: On average, predicted mpg decreases by 5.344 for each 1000 lb increase in weight.
y-intercept may not be meaningful if x = 0 is outside the observed range.
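As a sketch of how a regression line is obtained from summary statistics, the code below uses the standard least-squares relations b = r(s_y / s_x) and a = y-bar - b(x-bar). The data are hypothetical, and the function name fit_line is an illustrative choice, so the resulting coefficients are not those of the mpg example in the notes.

```python
from statistics import mean, stdev

def fit_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x, with b = r * s_y / s_x and a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data: weight in 1000s of lb, fuel economy in mpg.
weight = [2.0, 2.5, 3.0, 3.5, 4.0]
mpg = [33, 28, 25, 20, 18]

a, b = fit_line(weight, mpg)

# Predict only inside the observed weight range (2.0 to 4.0) to avoid extrapolation.
predicted = a + b * 3.2

print(f"y-hat = {a:.1f} + ({b:.1f})x, prediction at x = 3.2: {predicted:.2f}")
```

The line always passes through the point (x-bar, y-bar), which is why the intercept is computed from the two means and the slope.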

Squared Correlation, r²

The squared correlation measures the proportion of variability in the response variable explained by the linear relationship.

Formula: $r^2 = (r)^2$
Interpretation: For $r = -0.87$, $r^2 = 0.7569$ means 75.69% of the variation in mpg is explained by its linear relationship with weight.
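The "proportion of variation explained" reading can be checked numerically: for a least-squares line, 1 - SSE/SST equals the square of the correlation. The sketch below verifies this on hypothetical data (the function name r_and_explained is illustrative) and also squares the r = -0.87 implied by the 75.69% figure.

```python
from statistics import mean

def r_and_explained(xs, ys):
    """Return (r, proportion of variation in y explained by the least-squares line)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)            # total variation in y (SST)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # variation left around the line
    r = s_xy / (s_xx * s_yy) ** 0.5
    return r, 1 - sse / s_yy

# Hypothetical data: weight in 1000s of lb, fuel economy in mpg.
weight = [2.0, 2.5, 3.0, 3.5, 4.0]
mpg = [33, 28, 25, 20, 18]

r, explained = r_and_explained(weight, mpg)
print(f"r^2 = {r ** 2:.4f}, 1 - SSE/SST = {explained:.4f}")

# Squaring the correlation from the notes' mpg example.
r_notes = -0.87
print(f"(-0.87)^2 = {r_notes ** 2:.4f}")
```

Note that squaring discards the sign, so r² alone does not say whether the association is positive or negative.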

Important Points in Analyzing Associations

Extrapolation: Predicting y for x values outside the observed range is risky; the relationship may not hold.
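Influential Outliers: An observation is influential when its x value is far from the rest of the data and it falls far from the trend of the other points; such a point can substantially change the slope and intercept.
Regression Outlier: An observation that lies far from the trend followed by the rest of the data.

Example: A regression outlier with an extreme x value can pull the regression line toward itself, while one whose x value is near the middle of the data usually changes the slope very little.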
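The difference between an influential and a non-influential outlier can be seen by refitting the line after adding a single point. This is a small sketch with hypothetical data: five points lying exactly on y = 2x, plus one added outlier at a time.

```python
from statistics import mean

def slope(xs, ys):
    """Least-squares slope of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx

# Hypothetical base data lying exactly on y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

slope_base = slope(xs, ys)

# Outlier in y at a typical x value (x = 3, the mean of x): not influential for the slope.
slope_mid = slope(xs + [3], ys + [12])

# Outlier with an extreme x value that is far from the trend: influential.
slope_extreme = slope(xs + [15], ys + [5])

print(f"base: {slope_base:.3f}, middle outlier: {slope_mid:.3f}, "
      f"extreme-x outlier: {slope_extreme:.3f}")
```

In this toy setup the middle outlier shifts only the intercept, whereas the extreme-x point flattens the line almost completely.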

Correlation vs Causation

Strong correlation does not imply causation.
Correlation indicates association, not a cause-effect relationship.

Lurking Variables

A lurking variable is not included in the analysis but can influence the relationship between variables.

Example: Age is a lurking variable affecting both height and math score in children, so height and math score are associated even though neither causes the other.
Lurking variables may be common causes for both explanatory and response variables.

Simpson’s Paradox and Confounding

Simpson’s Paradox: The direction of an association between two variables reverses when a third variable is included and the data are analyzed separately at each of its levels.
Confounding: Two explanatory variables are both associated with the response variable and with each other, making it difficult to distinguish their effects.
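A small numerical illustration of Simpson's paradox follows. All counts, the treatment labels "A" and "B", and the severity levels are hypothetical, chosen so that the reversal is easy to see.

```python
# Hypothetical (successes, trials) for two treatments within two levels of a third variable.
data = {
    "mild":   {"A": (18, 20), "B": (64, 80)},
    "severe": {"A": (24, 80), "B": (4, 20)},
}

# Success rate of each treatment within each severity level.
within = {
    level: {t: s / n for t, (s, n) in groups.items()}
    for level, groups in data.items()
}

# Success rate of each treatment when severity is ignored (levels pooled).
overall = {}
for t in ("A", "B"):
    successes = sum(data[level][t][0] for level in data)
    trials = sum(data[level][t][1] for level in data)
    overall[t] = successes / trials

for level, rates in within.items():
    print(f"{level:<7} A: {rates['A']:.2f}  B: {rates['B']:.2f}")
print(f"overall A: {overall['A']:.2f}  B: {overall['B']:.2f}")
```

Treatment A has the higher success rate within each severity level, yet B looks better overall because A was given mostly to severe cases; severity is associated with both the treatment and the outcome.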