Chapter 3: Relationships Between Categorical Variables – Contingency Tables

Introduction

This chapter explores how to analyze relationships between categorical variables using contingency tables. It covers the construction and interpretation of contingency tables, calculation of marginal and conditional distributions, and the use of graphical displays to reveal associations and potential confounding variables.

Contingency Tables

Definition and Structure

  • Contingency Table: A table of counts showing how individuals are distributed across each combination of categories of two or more categorical variables, used to examine the relationship between those variables.

  • Rows and columns represent categories of each variable.

  • Marginal Distribution: The totals for each row or column, representing the distribution of each variable separately.

Example Table:

Pets        Female    Male    Total
Has cats      3412    2388     5800
Has dogs      3431    3587     7018
Has both       897     577     1474
Total         7740    6552    14292

  • For example, there are 897 females who have both a cat and a dog.

  • The bottom row and rightmost column show the marginal distributions for gender and pet ownership, respectively.
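The totals in the example table can be recomputed from the six cell counts. The following is a minimal Python sketch; the variable names (counts, row_totals, col_totals, grand_total) are illustrative choices, not part of the original notes.

```python
# Cell counts copied from the example table: counts[pet_category][gender].
counts = {
    "Has cats": {"Female": 3412, "Male": 2388},
    "Has dogs": {"Female": 3431, "Male": 3587},
    "Has both": {"Female": 897, "Male": 577},
}
genders = ["Female", "Male"]

# Row totals: the marginal distribution of pet ownership, as counts.
row_totals = {pet: sum(row.values()) for pet, row in counts.items()}

# Column totals: the marginal distribution of gender, as counts.
col_totals = {g: sum(row[g] for row in counts.values()) for g in genders}

# Grand total: every individual in the table, counted once.
grand_total = sum(row_totals.values())

print("Row totals:   ", row_totals)
print("Column totals:", col_totals)
print("Grand total:  ", grand_total)
```

Summing the row totals and summing the column totals should give the same grand total, which is a quick check that a table was transcribed correctly.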

Tables of Percents

Column Percents

  • Percentages are calculated within each column, showing the distribution of one variable for each category of the other.

  • Each column sums to 100%.

  • Useful for comparing the distribution of pet ownership within each gender.

Example Table (Column Percents):

Pets        Female    Male     Total
Has cats     44.1%    36.4%    40.6%
Has dogs     44.3%    54.8%    49.1%
Has both     11.6%     8.8%    10.3%
Total         100%     100%     100%

  • Example: 54.8% of pet-owning men have dogs but not cats.
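As a rough sketch of how column percents are obtained, each cell count is divided by the total of its own column. The code below assumes the cell counts from the example table and uses illustrative variable names.

```python
# Cell counts from the example table: counts[pet_category][gender].
counts = {
    "Has cats": {"Female": 3412, "Male": 2388},
    "Has dogs": {"Female": 3431, "Male": 3587},
    "Has both": {"Female": 897, "Male": 577},
}
genders = ["Female", "Male"]

# Denominator for column percents: the total of each gender column.
col_totals = {g: sum(row[g] for row in counts.values()) for g in genders}

# Each cell as a percentage of its column total.
column_percents = {
    pet: {g: 100 * row[g] / col_totals[g] for g in genders}
    for pet, row in counts.items()
}

for pet, row in column_percents.items():
    print(pet, {g: f"{p:.2f}%" for g, p in row.items()})
```

The male "Has dogs" value is about 54.75% before rounding, so a printed table may show 54.7% or 54.8% depending on how it was rounded.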

Row Percents

  • Percentages are calculated within each row, showing the distribution of gender for each pet ownership category.

  • Each row sums to 100%.

Example Table (Row Percents):

Pets        Female    Male     Total
Has cats     58.8%    41.2%     100%
Has dogs     48.9%    51.1%     100%
Has both     60.9%    39.1%     100%
Total        54.2%    45.8%     100%

  • Example: 60.9% of dual pet owners are women.
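Row percents use the same cell counts but a different denominator: the total of the cell's own row. A minimal sketch, again assuming the counts from the example table and illustrative names:

```python
# Cell counts from the example table: counts[pet_category][gender].
counts = {
    "Has cats": {"Female": 3412, "Male": 2388},
    "Has dogs": {"Female": 3431, "Male": 3587},
    "Has both": {"Female": 897, "Male": 577},
}

# Each cell as a percentage of its row total (the pet-ownership category).
row_percents = {}
for pet, row in counts.items():
    row_total = sum(row.values())
    row_percents[pet] = {g: 100 * n / row_total for g, n in row.items()}

for pet, row in row_percents.items():
    print(pet, {g: f"{p:.1f}%" for g, p in row.items()})
```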

Overall Percents

  • Percentages are calculated out of the grand total, showing the proportion of all individuals in each category combination.

Example Table (Overall Percents):

Pets        Female    Male     Total
Has cats     23.9%    16.7%    40.6%
Has dogs     24.0%    25.1%    49.1%
Has both      6.3%     4.0%    10.3%
Total        54.2%    45.8%     100%

  • Example: 6.3% of OkCupid pet owners are women who have both a dog and a cat.
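For overall percents, every cell shares one denominator, the grand total. The sketch below assumes the cell counts from the example table; the names are illustrative.

```python
# Cell counts from the example table: counts[pet_category][gender].
counts = {
    "Has cats": {"Female": 3412, "Male": 2388},
    "Has dogs": {"Female": 3431, "Male": 3587},
    "Has both": {"Female": 897, "Male": 577},
}

# Single denominator: everyone in the table.
grand_total = sum(n for row in counts.values() for n in row.values())

# Each cell as a percentage of the grand total.
overall_percents = {
    pet: {g: 100 * n / grand_total for g, n in row.items()}
    for pet, row in counts.items()
}

for pet, row in overall_percents.items():
    print(pet, {g: f"{p:.1f}%" for g, p in row.items()})
```

Because all six cells are divided by the same number, the six overall percents together add up to 100%.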

Marginal Distributions

Definition and Calculation

  • Marginal Distribution: The distribution of either variable alone, found in the margins (totals) of the contingency table.

  • Calculated by summing across rows or columns.

Example: In a survey about Super Bowl viewing preferences, the marginal distribution of what people plan to watch is found by summing the counts for each response across genders.

Conditional Distributions

Definition and Use

  • Conditional Distribution: The distribution of one variable for a specific category of another variable.

  • Calculated by dividing the count in each cell by the total for the conditioning category (row or column total).

Example: Among those who prefer commercials during the Super Bowl, 66% are women and 34% are men.

Independence and Association

Definitions

  • Independence: Two variables are independent if the distribution of one variable is the same for all categories of the other variable.

  • Association (Dependence): There is an association if the distribution of one variable differs across categories of the other variable.

Example: If the percentage of women who plan to watch the Super Bowl differs from the percentage of men who plan to watch, there is an association between gender and viewing preference.
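One common way to make the independence definition concrete is to compute the count each cell would hold if the two variables were independent, namely (row total × column total) / grand total, and compare it with the observed count. This technique is an addition here rather than something the notes state, and the sketch reuses the pet ownership counts from the earlier example table because no Super Bowl counts are given.

```python
# Observed cell counts from the pet ownership example table.
counts = {
    "Has cats": {"Female": 3412, "Male": 2388},
    "Has dogs": {"Female": 3431, "Male": 3587},
    "Has both": {"Female": 897, "Male": 577},
}
genders = ["Female", "Male"]

row_totals = {pet: sum(row.values()) for pet, row in counts.items()}
col_totals = {g: sum(row[g] for row in counts.values()) for g in genders}
grand_total = sum(row_totals.values())

# Count each cell would hold if pet ownership had the same distribution
# for both genders (that is, if the variables were independent).
expected = {
    pet: {g: row_totals[pet] * col_totals[g] / grand_total for g in genders}
    for pet in counts
}

for pet in counts:
    for g in genders:
        observed = counts[pet][g]
        diff = observed - expected[pet][g]
        print(f"{pet:8s} {g:6s} observed={observed:5d} "
              f"expected={expected[pet][g]:8.1f} diff={diff:+8.1f}")
```

Gaps between observed and expected counts point toward an association. Whether a gap is larger than chance alone would produce is a separate question that this sketch does not address.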

Graphical Displays of Contingency Tables

Bar Charts and Segmented Bar Charts

  • Bar Chart: Used to display the distribution of a categorical variable.

  • Segmented Bar Chart: Each bar is divided into segments representing categories of a second variable, showing conditional distributions.

  • Useful for visualizing associations between variables.

Mosaic Plots

  • Mosaic Plot: A graphical representation of a contingency table where the area of each rectangle is proportional to the cell frequency.

  • Helps visualize relationships and associations between categorical variables, especially with three or more variables.
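The claim that each rectangle's area is proportional to the cell frequency can be checked numerically. The sketch below assumes one common two-variable mosaic layout (column widths follow the marginal distribution of gender, and heights within a column follow the conditional distribution of pet ownership) and reuses the pet ownership counts from the earlier example table. It computes the geometry only and does not draw anything.

```python
# Cell counts from the pet ownership example table.
counts = {
    "Has cats": {"Female": 3412, "Male": 2388},
    "Has dogs": {"Female": 3431, "Male": 3587},
    "Has both": {"Female": 897, "Male": 577},
}
genders = ["Female", "Male"]

col_totals = {g: sum(row[g] for row in counts.values()) for g in genders}
grand_total = sum(col_totals.values())

# One rectangle per cell, stored as (width, height) fractions of a unit square.
rectangles = {}
for g in genders:
    width = col_totals[g] / grand_total        # marginal share of this gender
    for pet, row in counts.items():
        height = row[g] / col_totals[g]        # share of this pet category within the gender
        rectangles[(pet, g)] = (width, height)

for (pet, g), (w, h) in rectangles.items():
    print(f"{pet:8s} {g:6s} width={w:.3f} height={h:.3f} area={w * h:.3f}")
```

Width times height reduces to the cell count divided by the grand total, so the areas match the overall percents of the table.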

Three Categorical Variables and Simpson’s Paradox

Simpson’s Paradox

  • Simpson’s Paradox: A phenomenon in which a trend appears in different groups of data but disappears or reverses when the groups are combined.

  • Occurs when a third variable (lurking variable) affects the association between the two variables of interest.

Example: In university admissions, overall data may suggest discrimination, but when broken down by department, the trend reverses or disappears.

Example Table: Simpson’s Paradox in Admissions

Gender    Admit    Reject    Total
Men         512       313      825
Women        89        19      108

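  • In the counts above, 89 of 108 women (82.4%) were admitted, compared with 512 of 825 men (62.1%). When data are broken down by department, women may have higher admission rates in several departments even when the combined data favor men, which is why all relevant variables must be considered.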
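The reversal can be reproduced with a small calculation. In the sketch below, "Dept 1" reuses the counts from the admissions table above, while "Dept 2" is entirely hypothetical: its numbers were invented for illustration and do not come from the notes or from any real data set.

```python
# (admitted, rejected) counts per department and gender.
# "Dept 1" uses the counts from the admissions table in the notes.
# "Dept 2 (hypothetical)" is INVENTED so the reversal can be shown end to end.
departments = {
    "Dept 1": {"Men": (512, 313), "Women": (89, 19)},
    "Dept 2 (hypothetical)": {"Men": (10, 90), "Women": (120, 680)},
}

def admit_rate(admitted, rejected):
    """Fraction of applicants who were admitted."""
    return admitted / (admitted + rejected)

# Admission rate within each department.
dept_rates = {
    name: {g: admit_rate(*cell) for g, cell in by_gender.items()}
    for name, by_gender in departments.items()
}

# Admission rate after pooling both departments.
combined_rates = {}
for g in ("Men", "Women"):
    admitted = sum(by_gender[g][0] for by_gender in departments.values())
    rejected = sum(by_gender[g][1] for by_gender in departments.values())
    combined_rates[g] = admit_rate(admitted, rejected)

for name, rates in dept_rates.items():
    print(name, {g: f"{r:.1%}" for g, r in rates.items()})
print("Combined", {g: f"{r:.1%}" for g, r in combined_rates.items()})
```

With these numbers, women have the higher admission rate in each department, yet men have the higher rate once the departments are pooled. The reversal arises because most of the women in this made-up example applied to the department with the low admission rate.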

Common Pitfalls in Interpreting Contingency Tables

  • Do not confuse similar-sounding percentages (e.g., percent of survivors who were in first class vs. percent of first-class passengers who survived).

  • Always consider the context and the variables being compared.

  • Be cautious of small sample sizes and ensure enough individuals are included for reliable conclusions.

  • Watch for lurking variables that may affect the observed association.
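The first pitfall can be illustrated with a single cell read against two different denominators. The sketch uses the pet ownership counts from the earlier example table rather than survival data, since no survival counts appear in these notes; the variable names are illustrative.

```python
# One cell from the pet ownership table, plus the two totals it belongs to.
women_with_both = 897     # women who have both a cat and a dog
all_women = 7740          # column total: all women in the table
all_with_both = 1474      # row total: everyone who has both

# "What percent of women have both?" conditions on gender.
pct_of_women_with_both = 100 * women_with_both / all_women

# "What percent of dual pet owners are women?" conditions on pet ownership.
pct_of_both_who_are_women = 100 * women_with_both / all_with_both

print(f"Percent of women who have both:        {pct_of_women_with_both:.1f}%")
print(f"Percent of dual pet owners who are women: {pct_of_both_who_are_women:.1f}%")
```

The numerator is identical in both lines, and only the denominator changes, which is why the wording of a percentage has to name the group being conditioned on.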

Key Formulas

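  • Conditional Probability: The probability of event A given event B: $P(A \mid B) = \dfrac{P(A \text{ and } B)}{P(B)}$, defined when $P(B) > 0$.

  • Marginal Probability: The probability of a single event occurring, found by summing over the categories $B_1, \dots, B_k$ of the other variable: $P(A) = \sum_{i=1}^{k} P(A \text{ and } B_i)$.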
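As a worked illustration using the pet ownership table from earlier in these notes, take A to be "Female" and B to be "Has both". The standard definition of conditional probability, $P(A \mid B) = P(A \text{ and } B) / P(B)$, then reproduces the 60.9% row percent, and summing the joint probabilities over gender reproduces the 10.3% marginal percent.

```latex
\[
P(\text{Female} \mid \text{Has both})
  = \frac{P(\text{Female and Has both})}{P(\text{Has both})}
  = \frac{897/14292}{1474/14292}
  = \frac{897}{1474}
  \approx 0.609
\]
\[
P(\text{Has both})
  = P(\text{Has both and Female}) + P(\text{Has both and Male})
  = \frac{897}{14292} + \frac{577}{14292}
  = \frac{1474}{14292}
  \approx 0.103
\]
```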

Summary

  • Contingency tables are essential tools for analyzing relationships between categorical variables.

  • Marginal, conditional, and overall distributions provide different perspectives on the data.

  • Graphical displays such as bar charts and mosaic plots help visualize associations and potential confounding variables.

  • Simpson’s Paradox highlights the importance of considering all relevant variables before drawing conclusions.
