Chapter 10

Chi-Square Tests & Goodness-of-Fit

Introduction to Categorical Data Analysis

Chi-square tests are fundamental tools in statistics for analyzing categorical data. They are used to determine whether observed frequencies differ significantly from expected frequencies under a specific hypothesis. These tests are widely applied in research involving survey data, experiments, and observational studies where variables are categorical.

Categorical Variables: Variables that take on a limited, fixed number of possible values, representing categories or groups (e.g., color, brand, gender).
Two-way Tables: Tables that display the frequency counts for combinations of two categorical variables.

Section 10.1: Requirements and Testing with Categorical Variables

Learning Objectives

Distinguish between one and two categorical variables in data analysis.
Differentiate between the chi-square test for independence and the test for homogeneity.
Compute expected counts and the chi-square statistic.
Conduct and interpret chi-square tests for one and two categorical variables.

Introductory Example: Fair Die

Suppose a die is rolled several times and the number of times each face appears is recorded. The outcomes are summarized in a frequency table. The goal is to determine if the die is fair (i.e., each face is equally likely).

Observed Frequencies: The actual counts recorded for each category.
Expected Frequencies: The counts expected if the null hypothesis is true (e.g., for a fair die, each face should appear with equal probability).

Example: A Two-Way Table

Two-way tables summarize the relationship between two categorical variables. For example, a table might show the number of students who prefer different brands of soda, broken down by gender.

Expected Frequencies

Calculating Expected Frequencies

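For a single categorical variable with $k$ categories and $n$ total observations, the expected frequency for category $i$ under the null hypothesis is:

$$E_i = n \, p_i$$

where $p_i$ is the hypothesized probability for category $i$.
For two categorical variables, the expected frequency in the cell at row $i$ and column $j$ of a two-way table is:

$$E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$$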

Example Table: Expected Frequencies (1-Categorical)

Category	Observed	Expected
A	4	5
B	2	5
C	6	5
D	8	5
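
The expected column in the table above can be reproduced with a short sketch. It assumes the null hypothesis gives all four categories equal probability, which the table does not state outright but which matches an expected count of 5 per category out of 20 observations. The function name `expected_counts` is illustrative only.

```python
import math

def expected_counts(n, probabilities):
    """Expected count for each category: n times its hypothesized probability."""
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("hypothesized probabilities must sum to 1")
    return [n * p for p in probabilities]

observed = [4, 2, 6, 8]          # categories A, B, C, D from the table
n = sum(observed)                # 20 observations in total
k = len(observed)                # 4 categories

# Assumption: equal probability 1/k for every category under the null.
expected = expected_counts(n, [1 / k] * k)
print(expected)  # [5.0, 5.0, 5.0, 5.0]
```

Observed and expected counts always share the same total, which is a quick sanity check on any expected-count calculation.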

The Chi-Square Statistic

Definition and Formula

The chi-square statistic measures how far the observed frequencies deviate from the expected frequencies. It is calculated as:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where $O$ is the observed frequency and $E$ is the expected frequency for each cell, and the sum runs over all cells.

Example Calculation

Observed	Expected
25	17
57	65
41	41
205	205

Compute $\chi^2$ by applying the formula to each cell and summing the results.
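
As a minimal sketch of that calculation, the snippet below applies (O - E)^2 / E to each of the four cells in the table and adds up the contributions. The helper name `chi_square_statistic` is an illustrative choice, not something from the notes.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [25, 57, 41, 205]
expected = [17, 65, 41, 205]

# Per-cell contributions show which cells drive the statistic.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
for o, e, c in zip(observed, expected, contributions):
    print(f"O={o:>3}  E={e:>3}  contribution={c:.4f}")

stat = chi_square_statistic(observed, expected)
print(f"chi-square = {stat:.4f}")  # chi-square = 4.7493
```

Only the first two cells contribute (64/17 and 64/65), because the last two cells have observed counts equal to their expected counts.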

The Chi-Square Distribution

Properties

The chi-square distribution is right-skewed and defined only for non-negative values.
The shape depends on the degrees of freedom ($df$), which is typically $k - 1$ for goodness-of-fit tests (where $k$ is the number of categories).
As $df$ increases, the distribution becomes more symmetric.
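
One way to see the shape change numerically is through the standard moments of the chi-square distribution: with df degrees of freedom, the mean is df, the variance is 2·df, and the skewness is sqrt(8/df). These are textbook results rather than something stated in these notes, and the function name below is illustrative.

```python
import math

def chi2_summary(df):
    """Return (mean, variance, skewness) of a chi-square distribution with df degrees of freedom."""
    return df, 2 * df, math.sqrt(8 / df)

# Skewness shrinks toward 0 (a symmetric shape) as df grows.
for df in (1, 2, 5, 10, 50):
    mean, variance, skewness = chi2_summary(df)
    print(f"df={df:>2}  mean={mean:>2}  variance={variance:>3}  skewness={skewness:.3f}")
```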

Degrees of Freedom

For a one-way table (goodness-of-fit): $df = k - 1$, where $k$ is the number of categories.
For a two-way table: $df = (r - 1)(c - 1)$, where $r$ is the number of rows and $c$ is the number of columns.
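
A small sketch of both rules, df = k - 1 for a one-way table and df = (r - 1)(c - 1) for a two-way table, applied to two situations from this chapter: a six-sided die and a 2 × 3 table (two marital statuses by three income levels). The function names are illustrative.

```python
def df_goodness_of_fit(k):
    """Degrees of freedom for a one-way table with k categories."""
    return k - 1

def df_two_way(r, c):
    """Degrees of freedom for a two-way table with r rows and c columns."""
    return (r - 1) * (c - 1)

print(df_goodness_of_fit(6))  # 5  (six faces of a die)
print(df_two_way(2, 3))       # 2  (2 rows by 3 columns)
```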

Conditions for Chi-Square Tests

1-CAT Chi-Square Tests (Goodness-of-Fit)

Counted Data Condition: Data must be counts of categorical outcomes.
Randomization Condition: Data should be from a random sample or randomized experiment.
Expected Cell Frequency Condition: Each expected count should be at least 5.

2-CAT Chi-Square Tests (Independence/Homogeneity)

Same as above, plus:
Independence Assumption: Observations must be independent.
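
Of the conditions above, only the expected cell frequency condition can be checked from the numbers alone. Randomization and independence depend on how the data were collected. The sketch below flags cells whose expected count falls under 5. The helper name and the second list of expected counts are hypothetical, made up for illustration.

```python
def small_expected_cells(expected, minimum=5):
    """Return the positions of cells whose expected count is below `minimum`."""
    return [i for i, e in enumerate(expected) if e < minimum]

# Expected counts from the 1-categorical example table: condition met.
print(small_expected_cells([5, 5, 5, 5]))      # []

# Hypothetical expected counts where two cells are too small.
print(small_expected_cells([12.0, 3.5, 4.5]))  # [1, 2]
```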

Types of Chi-Square Tests

Goodness-of-Fit Test (1-CAT)

Used to determine whether the distribution of a single categorical variable follows a specified distribution.
Null Hypothesis ($H_0$): The observed distribution matches the expected distribution.
Alternative Hypothesis ($H_a$): The observed distribution does not match the expected distribution.

Test of Independence (2-CAT)

Used to determine whether two categorical variables are independent in a single population.
Each object in the sample is measured on two categorical variables.
Null Hypothesis ($H_0$): The variables are independent.

Test of Homogeneity (2-CAT)

Used to compare the distribution of a categorical variable across two or more populations.
Each sample comes from a different population.
Null Hypothesis ($H_0$): The distributions are the same across populations.

Comparison Table: Independence vs. Homogeneity

Test	Sample Structure	Research Question
Independence	One sample, two variables measured	Are the variables related?
Homogeneity	Two or more samples, one variable measured	Are the distributions the same?

Steps for Conducting a Chi-Square Test

State the Hypotheses: Define $H_0$ and $H_a$.
Check Conditions: Ensure all assumptions are met.
Calculate Expected Counts: Use the formulas above.
Compute the Chi-Square Statistic: Apply the formula to observed and expected counts.
Find the p-value: Use the chi-square distribution with appropriate degrees of freedom.
Draw a Conclusion: Compare the p-value to the significance level ($\alpha$) and interpret the result.

Example: Goodness-of-Fit Test

Operator	New Customers
1	11
2	12
3	15
4	13
5	21

Test whether the number of new customers is equally distributed among operators using the chi-square goodness-of-fit test.
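
The operator data can be run through the six steps with the standard library alone. This is a sketch under stated assumptions: the significance level 0.05 is assumed (the notes call it the common choice), and `chi2_sf` is a hand-written helper for the upper-tail probability of a chi-square distribution with whole-number degrees of freedom, not a library function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)             # even df: start from df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # odd df: start from df = 1
    # Each pass raises the degrees of freedom by 2 until a equals df / 2.
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Step 1: H0 = new customers are equally distributed among the 5 operators.
observed = [11, 12, 15, 13, 21]
alpha = 0.05  # assumed significance level

# Steps 2-3: expected counts under H0, then check that each is at least 5.
n, k = sum(observed), len(observed)
expected = [n / k] * k                       # 72 / 5 = 14.4 per operator
assert all(e >= 5 for e in expected)

# Step 4: chi-square statistic.
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Step 5: p-value with df = k - 1.
df = k - 1
p_value = chi2_sf(stat, df)

# Step 6: conclusion.
print(f"chi-square = {stat:.3f}, df = {df}, p-value = {p_value:.3f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

With these counts the statistic is about 4.389 on 4 degrees of freedom and the p-value is about 0.356, so at the assumed 0.05 level the data do not give enough evidence that new customers are unevenly distributed among operators. If SciPy is available, `scipy.stats.chisquare` performs the same test.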

Example: Test of Independence

Marital Status	Income Level
Single	Low, Middle, High
Married	Low, Middle, High

Test whether marital status and income level are independent in the population.
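
The table above gives the layout but no counts, so the sketch below uses HYPOTHETICAL counts, invented purely to illustrate the mechanics of a test of independence on a 2 × 3 table. It computes each expected count as (row total × column total) / grand total, then the statistic, the degrees of freedom, and the p-value. The helper names and the 0.05 significance level are assumptions.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def expected_table(observed):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# HYPOTHETICAL counts (not from the notes). Columns: Low, Middle, High income.
observed = [
    [30, 40, 30],  # Single
    [20, 50, 30],  # Married
]

expected = expected_table(observed)
stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = chi2_sf(stat, df)

print(expected)  # [[25.0, 45.0, 30.0], [25.0, 45.0, 30.0]]
print(f"chi-square = {stat:.3f}, df = {df}, p-value = {p_value:.3f}")
```

For these made-up counts the statistic is about 3.111 with 2 degrees of freedom and a p-value of about 0.211, so independence would not be rejected at the 0.05 level. The same arithmetic applies to a test of homogeneity; only the sampling design and the wording of the hypotheses differ. If SciPy is available, `scipy.stats.chi2_contingency` carries out this computation.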

Interpreting Results

If the p-value is less than $\alpha$ (commonly 0.05), reject $H_0$; there is evidence of a significant association or difference.
If the p-value is greater than or equal to $\alpha$, fail to reject $H_0$; there is not enough evidence to conclude a significant association or difference.

Summary Table: Chi-Square Test Types

Test	Number of Variables	Purpose
Goodness-of-Fit	1	Compare observed distribution to expected
Test of Independence	2	Test association between variables
Test of Homogeneity	2	Compare distributions across groups

Key Formulas

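Expected Frequency (1-CAT): $E_i = n \, p_i$
Expected Frequency (2-CAT): $E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$
Chi-Square Statistic: $\chi^2 = \sum \frac{(O - E)^2}{E}$
Degrees of Freedom (1-CAT): $df = k - 1$
Degrees of Freedom (2-CAT): $df = (r - 1)(c - 1)$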

Additional info:

Chi-square tests are non-parametric and do not require the assumption of normality.
They are sensitive to sample size: with a large sample, even a small departure from the expected counts can be statistically significant.
Expected cell counts less than 5 may invalidate the test; consider combining categories or using Fisher's Exact Test in such cases.
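
The sample-size point can be illustrated with a sketch that reuses the operator counts from the goodness-of-fit example and then multiplies every count by 10. The scaled data set is HYPOTHETICAL: it keeps exactly the same proportions but represents ten times as many customers. Because both observed and expected counts scale by 10, the chi-square statistic also scales by 10 while the degrees of freedom stay the same. The helper `chi2_sf` is hand-written for illustration.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def equal_split_test(observed):
    """Goodness-of-fit test against equal proportions; returns (statistic, p-value)."""
    n, k = sum(observed), len(observed)
    expected = n / k
    stat = sum((o - expected) ** 2 / expected for o in observed)
    return stat, chi2_sf(stat, k - 1)

original = [11, 12, 15, 13, 21]        # operator counts from the example
scaled = [10 * o for o in original]    # HYPOTHETICAL: same proportions, 10x the sample

stat_small, p_small = equal_split_test(original)
stat_large, p_large = equal_split_test(scaled)

print(f"n = {sum(original):>3}: chi-square = {stat_small:.3f}, p-value = {p_small:.3f}")
print(f"n = {sum(scaled):>3}: chi-square = {stat_large:.3f}, p-value = {p_large:.2e}")
```

The proportions are identical in both runs, yet only the larger sample gives a p-value below 0.05, so a significant result should be read alongside the size of the difference and not taken as proof that the difference is large.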