Relationships Between Categorical Variables: Contingency Tables
Introduction
This section explores how to analyze relationships between two categorical variables using contingency tables. It covers both intuitive and formal approaches, including the use of marginal and conditional distributions, and introduces key concepts such as Simpson's Paradox.
Overview of Data Types and Analysis Tools
Categorical vs. Categorical: Analyzed using contingency tables and Chi-Square tests.
Numerical vs. Numerical: Explored with scatterplots, correlation, and regression.
Categorical vs. Numerical: Analyzed using boxplots, histograms, t-tests, and ANOVA.
Reference Table:
| Data Types | Intuitive Exploration | Formal Analysis |
|---|---|---|
| Categorical vs Categorical | Contingency Tables | Chi-Square |
| Numerical vs Numerical | Scatterplots & Correlation | Regression |
| Categorical vs Numerical | Boxplots and Histograms | T-tests / ANOVA |
Contingency Tables
Definition and Purpose
Contingency Table: A table that cross-tabulates two categorical variables, with the categories of one variable as rows and the categories of the other as columns, so their relationship can be studied.
Each cell in the table shows the count (or frequency) of observations for a specific combination of categories.
Marginal and Conditional Distributions
Marginal Distribution: The distribution of values for one variable, ignoring the other variable. Found in the margins (totals) of the table.
Conditional Distribution: The distribution of one variable for a given category of the other variable.
Example: Gender and Pet Ownership
Consider a dataset from OKCupid about pet ownership by gender. The contingency table below summarizes the counts:
| Gender | Has cats | Has dogs | Has both | Total |
|---|---|---|---|---|
| Female | 3452 | 5800 | 408 | 9660 |
| Male | 2413 | 5080 | 266 | 7759 |
| Total | 5865 | 10880 | 674 | 17419 |
Table Percents
To compare proportions, convert counts to percentages by dividing each cell by the grand total (17419 observations, the sum of all cell counts).
| Gender | Has cats | Has dogs | Has both | Total |
|---|---|---|---|---|
| Female | 19.8% | 33.3% | 2.3% | 55.5% |
| Male | 13.9% | 29.2% | 1.5% | 44.5% |
| Total | 33.7% | 62.5% | 3.9% | 100.0% |
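As a rough sketch of the table-percent calculation, the Python snippet below recomputes the totals from the cell counts and divides each cell by the grand total. The dictionary layout and the function names (`table_percents`, `marginal_percents`) are illustrative assumptions, not part of the original notes or of any statistics package.

```python
# Cell counts copied from the gender-by-pet table; the layout is illustrative.
counts = {
    "Female": {"Has cats": 3452, "Has dogs": 5800, "Has both": 408},
    "Male": {"Has cats": 2413, "Has dogs": 5080, "Has both": 266},
}


def grand_total_of(counts):
    """Sum every cell in the table."""
    return sum(sum(row.values()) for row in counts.values())


def table_percents(counts):
    """Each cell as a percentage of all observations (table percents)."""
    total = grand_total_of(counts)
    return {
        group: {cat: round(100 * n / total, 1) for cat, n in row.items()}
        for group, row in counts.items()
    }


def marginal_percents(counts):
    """Marginal distributions: row and column totals as shares of the grand total."""
    total = grand_total_of(counts)
    columns = list(next(iter(counts.values())))
    row_marginal = {
        group: round(100 * sum(row.values()) / total, 1)
        for group, row in counts.items()
    }
    col_marginal = {
        cat: round(100 * sum(row[cat] for row in counts.values()) / total, 1)
        for cat in columns
    }
    return row_marginal, col_marginal


grand_total = grand_total_of(counts)
percents = table_percents(counts)
row_marginal, col_marginal = marginal_percents(counts)

print("Grand total:", grand_total)
for group, row in percents.items():
    print(group, row)
print("Marginal by gender:", row_marginal)
print("Marginal by pet type:", col_marginal)
```

Because each percentage is rounded separately, the rounded cells in a row may not add up exactly to the rounded row total.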
Row and Column Percents
Row Percents: Each cell is divided by the row total, showing the distribution within each row category.
Column Percents: Each cell is divided by the column total, showing the distribution within each column category.
These percentages help answer questions like "What percent of male cat/dog owners own only cats?"
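The row and column percents described above can be sketched in a few lines of Python. This is a minimal illustration using the cell counts from the gender-by-pet table; the function names `row_percents` and `column_percents` are hypothetical helpers, not functions from any library.

```python
# Cell counts copied from the gender-by-pet table; the layout is illustrative.
counts = {
    "Female": {"Has cats": 3452, "Has dogs": 5800, "Has both": 408},
    "Male": {"Has cats": 2413, "Has dogs": 5080, "Has both": 266},
}


def row_percents(counts):
    """Conditional distribution of pet type within each gender."""
    result = {}
    for group, row in counts.items():
        row_total = sum(row.values())
        result[group] = {cat: round(100 * n / row_total, 1) for cat, n in row.items()}
    return result


def column_percents(counts):
    """Conditional distribution of gender within each pet type."""
    columns = list(next(iter(counts.values())))
    col_totals = {cat: sum(row[cat] for row in counts.values()) for cat in columns}
    return {
        group: {cat: round(100 * row[cat] / col_totals[cat], 1) for cat in columns}
        for group, row in counts.items()
    }


rows = row_percents(counts)
cols = column_percents(counts)

# Row percent: among male owners, what share have only cats?
print("Male owners with only cats:", rows["Male"]["Has cats"], "%")
# Column percent: among cat-only owners, what share are female?
print("Cat-only owners who are female:", cols["Female"]["Has cats"], "%")
```

With these counts, the question "What percent of male cat/dog owners own only cats?" is a row percent: 2413/7759, or about 31.1%. The column percent 3452/5865, about 58.9%, answers a different question: what share of cat-only owners are female.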
Visualizing Categorical Data
Bar Charts: Useful for comparing categorical breakdowns.
Pie Charts: Show proportions but can be harder to compare across categories.
Side-by-side bar charts are often preferred for comparing groups.
Simpson's Paradox
Definition and Example
Simpson's Paradox: A phenomenon where a trend appears in several groups of data but reverses or disappears when the groups are combined.
This occurs due to the influence of a lurking variable or confounding factor.
Example: Compare on-time flight rates for two people, Jill and Moe, by time of day. Jill flies mostly during the day and Moe mostly at night, so the combined rates largely reflect when each person flies rather than how each performs within a given time period. Here, time of day acts as the lurking variable.
| Time of Day | Jill (on time / flights) | Moe (on time / flights) | Total Flights |
|---|---|---|---|
| Day | 90/100 | 19/20 | 120 |
| Night | 10/20 | 48/100 | 120 |
When analyzing data, always consider whether to look at the overall table or at subgroups, and be aware of possible lurking variables.
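To see how subgroup and combined rates can tell different stories, the sketch below computes both from the on-time counts in the flight table. The data layout and the helper names (`rate`, `subgroup_rates`, `overall_rates`) are assumptions made for illustration.

```python
# (on_time, flights) per person and time of day, copied from the flight table.
flights = {
    "Jill": {"Day": (90, 100), "Night": (10, 20)},
    "Moe": {"Day": (19, 20), "Night": (48, 100)},
}


def rate(on_time, total):
    """On-time percentage, rounded to one decimal place."""
    return round(100 * on_time / total, 1)


def subgroup_rates(flights):
    """On-time rate for each person within each time of day."""
    return {
        person: {period: rate(*cell) for period, cell in periods.items()}
        for person, periods in flights.items()
    }


def overall_rates(flights):
    """On-time rate for each person with day and night flights pooled."""
    result = {}
    for person, periods in flights.items():
        on_time = sum(cell[0] for cell in periods.values())
        total = sum(cell[1] for cell in periods.values())
        result[person] = rate(on_time, total)
    return result


by_period = subgroup_rates(flights)
overall = overall_rates(flights)

for person in flights:
    print(person, "by period:", by_period[person], "overall:", overall[person])
```

With the counts as given, Moe is ahead during the day (95.0% vs. 90.0%), the two are nearly tied at night (48.0% vs. 50.0%), yet Jill is far ahead once the periods are pooled (83.3% vs. 55.8%).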
Key Terms and Definitions
| Term | Definition |
|---|---|
| Contingency Table | A table showing the distribution of two categorical variables. |
| Marginal Distribution | The distribution of values for one variable, ignoring the other variable. |
| Conditional Distribution | The distribution of one variable for a specific value of the other variable. |
| Segmented Bar Chart | A bar chart that displays the conditional distribution of a categorical variable within each category of another variable. |
| Mosaic Plot | A graphical representation of a contingency table, with area proportional to the number of cases in each group. |
| Simpson's Paradox | When a trend appears in different groups but reverses when the groups are combined. |
| Lurking Variable | A variable not included in the analysis that can affect the results. |
Practical Tips
Be careful with rounding when reporting percentages.
Use software tools (e.g., StatCrunch) to generate contingency tables and percentages.
For two-way tables, use the appropriate menu options to summarize data.
Always check whether to use row, column, or table percentages based on the question.
Summary
Contingency tables are essential for analyzing relationships between categorical variables.
Understanding marginal and conditional distributions helps interpret the data correctly.
Be aware of phenomena like Simpson's Paradox and lurking variables when drawing conclusions.