Relationships Between Categorical Variables: Contingency Tables

Introduction

This section explores how to analyze relationships between two categorical variables using contingency tables. It covers both intuitive and formal approaches, including the use of marginal and conditional distributions, and introduces key concepts such as Simpson's Paradox.

Overview of Data Types and Analysis Tools

  • Categorical vs. Categorical: Analyzed using contingency tables and Chi-Square tests.

  • Numerical vs. Numerical: Explored with scatterplots, correlation, and regression.

  • Categorical vs. Numerical: Analyzed using boxplots, histograms, t-tests, and ANOVA.

Reference Table:

| Variable Types | Intuitive Explorations | Formal Analysis |
| --- | --- | --- |
| Categorical vs Categorical | Contingency Tables | Chi-Square |
| Numerical vs Numerical | Scatterplots & Correlation | Regression |
| Categorical vs Numerical | Boxplots and Histograms | T-tests / ANOVA |

Contingency Tables

Definition and Purpose

  • Contingency Table: A table that displays the joint frequency distribution of two categorical variables, used to study the relationship between them.

  • Each cell in the table shows the count (or frequency) of observations for a specific combination of categories.

Marginal and Conditional Distributions

  • Marginal Distribution: The distribution of values for one variable, ignoring the other variable. Found in the margins (totals) of the table.

  • Conditional Distribution: The distribution of one variable for a given category of the other variable.

Example: Gender and Pet Ownership

Consider a dataset from OKCupid about pet ownership by gender. The contingency table below summarizes the counts:

| Gender | Has cats | Has dogs | Has both | Total |
| --- | --- | --- | --- | --- |
| Female | 3452 | 5800 | 408 | 9660 |
| Male | 2413 | 5080 | 266 | 7759 |
| Total | 5865 | 10880 | 674 | 17419 |
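The marginal totals can be recomputed from the cell counts alone. The sketch below is a minimal illustration in plain Python; the nested-dictionary layout and the variable names (`counts`, `row_totals`, `col_totals`) are assumptions made for this example and are not part of the original notes.

```python
# Cell counts copied from the gender / pet ownership table above
# (margins are left out on purpose so they can be recomputed).
counts = {
    "Female": {"Has cats": 3452, "Has dogs": 5800, "Has both": 408},
    "Male": {"Has cats": 2413, "Has dogs": 5080, "Has both": 266},
}

# Marginal totals for gender: sum across each row.
row_totals = {gender: sum(cells.values()) for gender, cells in counts.items()}

# Marginal totals for pet ownership: sum down each column.
columns = list(next(iter(counts.values())))
col_totals = {col: sum(counts[g][col] for g in counts) for col in columns}

# The grand total is the sum of either set of margins.
grand_total = sum(row_totals.values())

print(row_totals)  # {'Female': 9660, 'Male': 7759}
print(col_totals)  # {'Has cats': 5865, 'Has dogs': 10880, 'Has both': 674}
print(grand_total)
```

Summing the row margins and summing the column margins should give the same grand total, which is a quick consistency check on any two-way table.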

Table Percents

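  • To compare proportions, convert counts to table percents by dividing each cell by the grand total (the sum of all cells, 17419 here).

| Gender | Has cats | Has dogs | Has both | Total |
| --- | --- | --- | --- | --- |
| Female | 19.8% | 33.3% | 2.3% | 55.5% |
| Male | 13.9% | 29.2% | 1.5% | 44.5% |
| Total | 33.7% | 62.5% | 3.9% | 100.0% |

Each entry is rounded separately, so a total may differ from the sum of its rounded parts by 0.1.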

Row and Column Percents

  • Row Percents: Each cell is divided by the row total, showing the distribution within each row category.

  • Column Percents: Each cell is divided by the column total, showing the distribution within each column category.

These percentages help answer questions like "What percent of male cat/dog owners own only cats?"
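The three kinds of percentages differ only in the denominator. The following is a minimal sketch, assuming the cell counts are stored in a nested dictionary; the function name `percents` and its `by` parameter are hypothetical names chosen for this illustration.

```python
def percents(counts, by="table"):
    """Convert a nested dict of counts to percentages (rounded to 0.1).

    by="table"  divides each cell by the grand total,
    by="row"    divides each cell by its row total,
    by="column" divides each cell by its column total.
    """
    if by not in ("table", "row", "column"):
        raise ValueError(f"unknown option: {by!r}")

    row_totals = {r: sum(cells.values()) for r, cells in counts.items()}
    col_totals = {}
    for cells in counts.values():
        for c, n in cells.items():
            col_totals[c] = col_totals.get(c, 0) + n
    grand_total = sum(row_totals.values())

    def denominator(r, c):
        if by == "row":
            return row_totals[r]
        if by == "column":
            return col_totals[c]
        return grand_total

    return {
        r: {c: round(100 * n / denominator(r, c), 1) for c, n in cells.items()}
        for r, cells in counts.items()
    }


# Cell counts from the gender / pet ownership example.
counts = {
    "Female": {"Has cats": 3452, "Has dogs": 5800, "Has both": 408},
    "Male": {"Has cats": 2413, "Has dogs": 5080, "Has both": 266},
}

row_pct = percents(counts, by="row")
col_pct = percents(counts, by="column")

# "What percent of male cat/dog owners own only cats?" is a row percent.
print(row_pct["Male"]["Has cats"])  # 31.1

# "What percent of cat-only owners are male?" is a column percent.
print(col_pct["Male"]["Has cats"])  # 41.1
```

The two printed values use the same cell (2413) with different denominators (7759 and 5865), which is why the wording of the question determines which percentage to report.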

Visualizing Categorical Data

  • Bar Charts: Useful for comparing categorical breakdowns.

  • Pie Charts: Show proportions but can be harder to compare across categories.

  • Side-by-side bar charts are often preferred for comparing groups.
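As a rough illustration of comparing groups, the sketch below prints the conditional distribution of pet ownership within each gender as text bars placed one group after the other. It uses only the standard library; the helper names `conditional_distribution` and `text_bars` are hypothetical, and a plotting tool would normally draw real bars instead.

```python
# Cell counts from the gender / pet ownership example.
counts = {
    "Female": {"Has cats": 3452, "Has dogs": 5800, "Has both": 408},
    "Male": {"Has cats": 2413, "Has dogs": 5080, "Has both": 266},
}


def conditional_distribution(cells):
    """Percent of one group's observations that fall in each category."""
    total = sum(cells.values())
    return {category: 100 * n / total for category, n in cells.items()}


def text_bars(distribution, width=40):
    """Return one text bar per category; 100% spans `width` characters."""
    return [
        f"{category:<10}{'#' * round(width * pct / 100)} {pct:.1f}%"
        for category, pct in distribution.items()
    ]


for gender, cells in counts.items():
    print(gender)
    for line in text_bars(conditional_distribution(cells)):
        print("  " + line)
```

Because each group's bars are scaled to that group's own total, the bars compare proportions rather than raw counts, which keeps groups of different sizes comparable.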

Simpson's Paradox

Definition and Example

  • Simpson's Paradox: A phenomenon where a trend appears in several groups of data but reverses or disappears when the groups are combined.

  • This occurs due to the influence of a lurking variable or confounding factor.

Example: Consider the on-time flight rates of two people, Jill and Moe, broken down by time of day. Because each flies a different mix of day and night flights, the combined rates can rank them differently than the rates within each time period.

| Time of Day | Jill On Time | Moe On Time | Total Flights |
| --- | --- | --- | --- |
| Day | 90/100 | 19/20 | 120 |
| Night | 10/20 | 48/100 | 120 |
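Whether a reversal occurs can be checked by computing the rate within each time period and the combined rate for each person. The sketch below does this for the figures in the table above; the data layout and the names `flights`, `combined_rate`, and `summary` are assumptions made for this illustration.

```python
# (on-time flights, total flights) copied from the table above.
flights = {
    "Jill": {"Day": (90, 100), "Night": (10, 20)},
    "Moe": {"Day": (19, 20), "Night": (48, 100)},
}


def combined_rate(periods):
    """On-time rate over all periods: total on-time / total flights."""
    on_time = sum(o for o, _ in periods.values())
    total = sum(t for _, t in periods.values())
    return on_time / total


summary = {}
for person, periods in flights.items():
    by_period = {p: round(100 * o / t, 1) for p, (o, t) in periods.items()}
    overall = round(100 * combined_rate(periods), 1)
    summary[person] = {"by_period": by_period, "overall": overall}
    print(person, by_period, overall)

# Jill {'Day': 90.0, 'Night': 50.0} 83.3
# Moe {'Day': 95.0, 'Night': 48.0} 55.8
```

With the figures as listed, Moe has the higher rate by day (95.0% vs 90.0%), Jill has the higher rate at night (50.0% vs 48.0%), and Jill has the higher combined rate (83.3% vs 55.8%). The combined rate is a weighted average of the period rates, so it is pulled toward the period in which a person flies most: Jill's toward her day rate, Moe's toward his night rate. That weighting is the mechanism that can produce a reversal.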

When analyzing data, always consider whether to look at the overall table or at subgroups, and be aware of possible lurking variables.

Key Terms and Definitions

| Term | Definition |
| --- | --- |
| Contingency Table | A table showing the distribution of two categorical variables. |
| Marginal Distribution | The distribution of values for one variable, ignoring the other variable. |
| Conditional Distribution | The distribution of one variable for a specific value of the other variable. |
| Segmented Bar Chart | A bar chart that displays the conditional distribution of a categorical variable within each category of another variable. |
| Mosaic Plot | A graphical representation of a contingency table, with area proportional to the number of cases in each group. |
| Simpson's Paradox | When a trend appears in different groups but reverses when the groups are combined. |
| Lurking Variable | A variable not included in the analysis that can affect the results. |

Practical Tips

  • Be careful with rounding when reporting percentages.

  • Use software tools (e.g., StatCrunch) to generate contingency tables and percentages.

  • For two-way tables, use the appropriate menu options to summarize data.

  • Always check whether to use row, column, or table percentages based on the question.

Summary

  • Contingency tables are essential for analyzing relationships between categorical variables.

  • Understanding marginal and conditional distributions helps interpret the data correctly.

  • Be aware of phenomena like Simpson's Paradox and lurking variables when drawing conclusions.
