Chapter 3: Relationships Between Categorical Variables – Contingency Tables
Relationships Between Categorical Variables – Contingency Tables
Introduction
This chapter explores how to analyze relationships between categorical variables using contingency tables. It covers the construction and interpretation of contingency tables, calculation of marginal and conditional distributions, and the use of graphical displays to reveal associations and potential confounding variables.
Contingency Tables
Definition and Structure
Contingency Table: A table that displays the count (frequency) for each combination of categories of two or more categorical variables, so that the relationship between the variables can be examined.
Rows and columns represent categories of each variable.
Marginal Distribution: The totals for each row or column, representing the distribution of each variable separately.
Example Table:
| Pets | Female | Male | Total |
|---|---|---|---|
| Has cats | 3412 | 2388 | 5800 |
| Has dogs | 3431 | 3587 | 7018 |
| Has both | 897 | 577 | 1474 |
| Total | 7740 | 6552 | 14292 |
For example, there are 897 females who have both a cat and a dog.
The bottom row and rightmost column show the marginal distributions for gender and pet ownership, respectively.
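The margins of the example table can be recomputed from the six cell counts. The following is a minimal sketch in plain Python; the variable names (`counts`, `row_totals`, `col_totals`) are illustrative choices, not part of the source material.

```python
# Cell counts copied from the example table: rows are pet categories,
# columns are genders.
counts = {
    "Has cats": {"Female": 3412, "Male": 2388},
    "Has dogs": {"Female": 3431, "Male": 3587},
    "Has both": {"Female": 897, "Male": 577},
}
genders = ["Female", "Male"]

# Row totals: the marginal distribution of pet ownership (rightmost column).
row_totals = {pet: sum(row.values()) for pet, row in counts.items()}

# Column totals: the marginal distribution of gender (bottom row).
col_totals = {g: sum(counts[pet][g] for pet in counts) for g in genders}

# Grand total: every individual in the table, counted once.
grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```

Summing the row totals and summing the column totals should give the same grand total, which is a quick consistency check on any contingency table.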
Tables of Percents
Column Percents
Percentages are calculated within each column, showing the distribution of one variable for each category of the other.
The percentages in each column sum to 100%.
Useful for comparing the distribution of pet ownership within each gender.
Example Table (Column Percents):
| Pets | Female | Male | Total |
|---|---|---|---|
| Has cats | 44.1% | 36.4% | 40.6% |
| Has dogs | 44.3% | 54.8% | 49.1% |
| Has both | 11.6% | 8.8% | 10.3% |
| Total | 100% | 100% | 100% |
Example: 54.8% of pet-owning men have dogs but not cats.
Row Percents
Percentages are calculated within each row, showing the distribution of gender for each pet ownership category.
The percentages in each row sum to 100%.
Example Table (Row Percents):
| Pets | Female | Male | Total |
|---|---|---|---|
| Has cats | 58.8% | 41.2% | 100% |
| Has dogs | 48.9% | 51.1% | 100% |
| Has both | 60.9% | 39.1% | 100% |
| Total | 54.2% | 45.8% | 100% |
Example: 60.9% of dual pet owners are women.
Overall Percents
Percentages are calculated out of the grand total, showing the proportion of all individuals in each category combination.
Example Table (Overall Percents):
| Pets | Female | Male | Total |
|---|---|---|---|
| Has cats | 23.9% | 16.7% | 40.6% |
| Has dogs | 24.0% | 25.1% | 49.1% |
| Has both | 6.3% | 4.0% | 10.3% |
| Total | 54.2% | 45.8% | 100% |
Example: 6.3% of OkCupid pet owners are women who have both a dog and a cat.
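Column, row, and overall percents all use the same cell counts and differ only in the denominator. The sketch below makes that explicit for one cell (women who have both a cat and a dog). The function name `percent_table` and its `denominator` argument are hypothetical, chosen only for this illustration.

```python
# Cell counts copied from the example table of pet ownership by gender.
counts = {
    "Has cats": {"Female": 3412, "Male": 2388},
    "Has dogs": {"Female": 3431, "Male": 3587},
    "Has both": {"Female": 897, "Male": 577},
}


def percent_table(counts, denominator):
    """Return cell percents based on 'column', 'row', or 'overall' totals."""
    genders = list(next(iter(counts.values())))
    col_totals = {g: sum(row[g] for row in counts.values()) for g in genders}
    row_totals = {pet: sum(row.values()) for pet, row in counts.items()}
    grand_total = sum(row_totals.values())

    table = {}
    for pet, row in counts.items():
        table[pet] = {}
        for g, n in row.items():
            if denominator == "column":
                base = col_totals[g]      # everyone of this gender
            elif denominator == "row":
                base = row_totals[pet]    # everyone in this pet category
            elif denominator == "overall":
                base = grand_total        # everyone in the table
            else:
                raise ValueError(f"unknown denominator: {denominator}")
            table[pet][g] = 100 * n / base
    return table


results = {}
for kind in ("column", "row", "overall"):
    results[kind] = percent_table(counts, kind)["Has both"]["Female"]
    print(f"{kind:8s} {results[kind]:.1f}%")
# Prints 11.6%, 60.9%, and 6.3%, matching the three tables above.
```

The same count of 897 produces three different percentages, so each one has to be read together with its base group. Recomputed values can differ from a printed table by 0.1 in the last digit, depending on how the printed table was rounded.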
Marginal Distributions
Definition and Calculation
Marginal Distribution: The distribution of either variable alone, found in the margins (totals) of the contingency table.
Calculated by summing across rows or columns.
Example: In a survey about Super Bowl viewing preferences, the marginal distribution of what people plan to watch is found by summing the counts for each response across genders.
Conditional Distributions
Definition and Use
Conditional Distribution: The distribution of one variable for a specific category of another variable.
Calculated by dividing the count in each cell by the total for the conditioning category (row or column total).
Example: Among those who prefer commercials during the Super Bowl, 66% are women and 34% are men.
Independence and Association
Definitions
Independence: Two variables are independent if the distribution of one variable is the same for all categories of the other variable.
Association (Dependence): There is an association if the distribution of one variable differs across categories of the other variable.
Example: If the percentage who plan to watch the Super Bowl differs between women and men, there is an association between gender and viewing preference.
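One way to check these definitions numerically is to compare the observed counts with the counts that would appear if the two variables were independent. The sketch below uses the pet-ownership counts from the earlier example table. The expected-count formula (row total times column total, divided by the grand total) is the standard one and is an addition here, not something stated in these notes.

```python
# Cell counts from the pet-ownership example table earlier in the chapter.
counts = {
    "Has cats": {"Female": 3412, "Male": 2388},
    "Has dogs": {"Female": 3431, "Male": 3587},
    "Has both": {"Female": 897, "Male": 577},
}
genders = ["Female", "Male"]

row_totals = {pet: sum(row.values()) for pet, row in counts.items()}
col_totals = {g: sum(counts[pet][g] for pet in counts) for g in genders}
grand_total = sum(row_totals.values())

# Under independence, each gender would have the same distribution of pet
# ownership as the whole group, so the expected count in a cell is
# row_total * col_total / grand_total.
expected = {
    pet: {g: row_totals[pet] * col_totals[g] / grand_total for g in genders}
    for pet in counts
}

for pet in counts:
    for g in genders:
        observed = counts[pet][g]
        gap = observed - expected[pet][g]
        print(f"{pet:9s} {g:6s} observed={observed:5d} "
              f"expected={expected[pet][g]:8.1f} gap={gap:+7.1f}")
```

Gaps that are large relative to the expected counts suggest an association. Deciding whether a gap is larger than chance alone would produce requires a formal test, which is outside the scope of these notes.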
Graphical Displays of Contingency Tables
Bar Charts and Segmented Bar Charts
Bar Chart: Used to display the distribution of a categorical variable.
Segmented Bar Chart: Each bar is divided into segments representing categories of a second variable, showing conditional distributions.
Useful for visualizing associations between variables.
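A segmented bar chart can be approximated in plain text, which shows how each bar is built from a conditional distribution. This is a rough sketch using the pet-ownership counts from the earlier example table. The bar width of 50 characters and the function name `segmented_bar` are arbitrary choices for illustration.

```python
# Cell counts from the pet-ownership example table earlier in the chapter.
counts = {
    "Has cats": {"Female": 3412, "Male": 2388},
    "Has dogs": {"Female": 3431, "Male": 3587},
    "Has both": {"Female": 897, "Male": 577},
}

WIDTH = 50  # characters per bar; every bar has the same length (100%)


def segmented_bar(row, width=WIDTH):
    """Render one bar: 'F' marks the female share, 'M' the male share."""
    total = row["Female"] + row["Male"]
    female_chars = round(width * row["Female"] / total)
    # The male segment takes whatever is left so the bar stays full width.
    return "F" * female_chars + "M" * (width - female_chars)


bars = {pet: segmented_bar(row) for pet, row in counts.items()}
for pet, bar in bars.items():
    print(f"{pet:9s} |{bar}|")
```

Because every bar is the same length, differences in where the F segment ends correspond to differences in the conditional distribution of gender across pet categories.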
Mosaic Plots
Mosaic Plot: A graphical representation of a contingency table where the area of each rectangle is proportional to the cell frequency.
Helps visualize relationships and associations between categorical variables, especially with three or more variables.
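The geometry of a two-variable mosaic plot can be sketched without a plotting library. In the layout assumed here, each column's width is that gender's share of the grand total, and each rectangle's height is the conditional share within that gender, so the area equals the cell's share of all individuals. The function name `mosaic_rectangles` and the unit-square coordinates are illustrative assumptions, and the counts come from the earlier pet-ownership table.

```python
# Cell counts from the pet-ownership example table earlier in the chapter.
counts = {
    "Has cats": {"Female": 3412, "Male": 2388},
    "Has dogs": {"Female": 3431, "Male": 3587},
    "Has both": {"Female": 897, "Male": 577},
}


def mosaic_rectangles(counts):
    """Return {(pet, gender): (x, y, width, height)} inside a unit square."""
    genders = list(next(iter(counts.values())))
    col_totals = {g: sum(row[g] for row in counts.values()) for g in genders}
    grand_total = sum(col_totals.values())

    rects = {}
    x = 0.0
    for g in genders:
        width = col_totals[g] / grand_total       # marginal share of gender
        y = 0.0
        for pet, row in counts.items():
            height = row[g] / col_totals[g]       # conditional share of pet
            rects[(pet, g)] = (x, y, width, height)
            y += height                           # stack the next rectangle
        x += width                                # move to the next column
    return rects


rects = mosaic_rectangles(counts)
for (pet, g), (x, y, w, h) in rects.items():
    print(f"{pet:9s} {g:6s} area={w * h:.4f}")
```

If the two variables were independent, the horizontal dividing lines would sit at the same heights in every column. Misaligned dividers are the visual sign of an association.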
Three Categorical Variables and Simpson’s Paradox
Simpson’s Paradox
Simpson’s Paradox: A phenomenon in which a trend appears in different groups of data but disappears or reverses when the groups are combined.
Occurs when a third variable (lurking variable) affects the association between the two variables of interest.
Example: In university admissions, overall data may suggest discrimination, but when broken down by department, the trend reverses or disappears.
Example Table: Simpson’s Paradox in Admissions
| Gender | Admit | Reject | Total |
|---|---|---|---|
| Men | 512 | 313 | 825 |
| Women | 89 | 19 | 108 |
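In the table above, women are admitted at a higher rate than men: 89 of 108 (about 82.4%) versus 512 of 825 (about 62.1%). When data are broken down by department, women may have higher admission rates in several departments even when their combined rate is lower, which is why all relevant variables need to be considered.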
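The reversal can be reproduced with a small calculation. In the sketch below, the first department uses the counts from the table above, while the second department is HYPOTHETICAL: its numbers were invented for this illustration and do not come from the source data.

```python
# (admitted, applicants) per gender. Department 1 comes from the table above.
# Department 2 is HYPOTHETICAL, made up only to make the reversal visible.
admissions = {
    "Dept 1 (from table)": {"Men": (512, 825), "Women": (89, 108)},
    "Dept 2 (hypothetical)": {"Men": (20, 200), "Women": (120, 800)},
}

# Admission rate within each department.
per_dept = {
    dept: {gender: admitted / applicants
           for gender, (admitted, applicants) in groups.items()}
    for dept, groups in admissions.items()
}

# Admission rate after pooling both departments.
combined = {}
for gender in ("Men", "Women"):
    admitted = sum(groups[gender][0] for groups in admissions.values())
    applicants = sum(groups[gender][1] for groups in admissions.values())
    combined[gender] = admitted / applicants

for dept, rates in per_dept.items():
    print(f"{dept:22s} men={rates['Men']:.1%} women={rates['Women']:.1%}")
print(f"{'Combined':22s} men={combined['Men']:.1%} "
      f"women={combined['Women']:.1%}")
```

In this constructed example, women have the higher rate inside each department but the lower rate overall, because most women applied to the department with the low admission rate. Department is the lurking variable.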
Common Pitfalls in Interpreting Contingency Tables
Do not confuse similar-sounding percentages (e.g., percent of survivors who were in first class vs. percent of first-class passengers who survived).
Always consider the context and the variables being compared.
Be cautious of small sample sizes and ensure enough individuals are included for reliable conclusions.
Watch for lurking variables that may affect the observed association.
Key Formulas
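Conditional Probability: The probability of event A given event B: $P(A \mid B) = \dfrac{P(A \text{ and } B)}{P(B)}$. In a contingency table, this is the cell count divided by the total for the conditioning category B.
Marginal Probability: The probability of a single event occurring: $P(A) = \dfrac{\text{row or column total for } A}{\text{grand total}}$.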
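As a worked instance of these two probabilities, the counts from the pet-ownership table earlier in the chapter give the values below. The event labels are taken from that table, and the decimals are rounded to three places.

```latex
\begin{align*}
P(\text{Female} \mid \text{Has both}) &= \frac{897}{1474} \approx 0.609 \\
P(\text{Has both} \mid \text{Female}) &= \frac{897}{7740} \approx 0.116 \\
P(\text{Female}) &= \frac{7740}{14292} \approx 0.542
\end{align*}
```

The first two lines share a numerator but condition on different groups, which is exactly the kind of similar-sounding pair that is easy to confuse.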
Summary
Contingency tables are essential tools for analyzing relationships between categorical variables.
Marginal, conditional, and overall distributions provide different perspectives on the data.
Graphical displays such as bar charts and mosaic plots help visualize associations and potential confounding variables.
Simpson’s Paradox highlights the importance of considering all relevant variables before drawing conclusions.