Chapter 10
Chi-Square Tests & Goodness-of-Fit
Introduction to Categorical Data Analysis
Chi-square tests are fundamental tools in statistics for analyzing categorical data. They are used to determine whether observed frequencies differ significantly from expected frequencies under a specific hypothesis. These tests are widely applied in research involving survey data, experiments, and observational studies where variables are categorical.
Categorical Variables: Variables that take on a limited, fixed number of possible values, representing categories or groups (e.g., color, brand, gender).
Two-way Tables: Tables that display the frequency counts for combinations of two categorical variables.
Section 10.1: Requirements and Testing with Categorical Variables
Learning Objectives
Distinguish between one and two categorical variables in data analysis.
Differentiate between the chi-square test for independence and the test for homogeneity.
Compute expected counts and the chi-square statistic.
Conduct and interpret chi-square tests for one and two categorical variables.
Introductory Example: Fair Die
Suppose a die is rolled several times and the number of times each face appears is recorded. The outcomes are summarized in a frequency table. The goal is to determine if the die is fair (i.e., each face is equally likely).
Observed Frequencies: The actual counts recorded for each category.
Expected Frequencies: The counts expected if the null hypothesis is true (e.g., for a fair die, each face should appear with equal probability).
Example: A Two-Way Table
Two-way tables summarize the relationship between two categorical variables. For example, a table might show the number of students who prefer different brands of soda, broken down by gender.
Expected Frequencies
Calculating Expected Frequencies
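For a single categorical variable with $k$ categories and $n$ total observations, the expected frequency for each category under the null hypothesis is:
$$E_i = n \, p_i$$
Where $p_i$ is the hypothesized probability for the $i$-th category.
For two categorical variables, expected frequencies in each cell of a two-way table are calculated as:
$$E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n}$$
Where $n$ is the grand total of the table.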
Example Table: Expected Frequencies (1-Categorical)
| Category | Observed | Expected |
|---|---|---|
| A | 4 | 5 |
| B | 2 | 5 |
| C | 6 | 5 |
| D | 8 | 5 |
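The expected counts in this table can be reproduced with a short sketch. It assumes the null hypothesis assigns the same probability to each of the four categories, which is what an expected count of 5 per category out of 20 observations implies; the variable names are illustrative only.

```python
# Observed counts from the example table (categories A-D).
observed = {"A": 4, "B": 2, "C": 6, "D": 8}

# Assumption: the null hypothesis gives every category the same probability.
hypothesized_p = {category: 1 / len(observed) for category in observed}

n = sum(observed.values())  # total number of observations

# Expected count for each category: n times its hypothesized probability.
expected = {category: n * p for category, p in hypothesized_p.items()}

for category in observed:
    print(category, observed[category], expected[category])
```

The expected counts always sum to the same total as the observed counts, which is a quick sanity check on the calculation.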
The Chi-Square Statistic
Definition and Formula
The chi-square statistic measures how far the observed frequencies deviate from the expected frequencies. It is calculated as:
$$\chi^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$$
Where $O$ is the observed frequency and $E$ is the expected frequency for each cell.
Example Calculation
| Observed | Expected |
|---|---|
| 25 | 17 |
| 57 | 65 |
| 41 | 41 |
| 205 | 205 |
Compute $\chi^2$ by applying the formula to each cell and summing the results.
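As a minimal sketch, the calculation for the four cells in the table above can be carried out as follows. Each cell contributes its squared difference between observed and expected counts, divided by the expected count.

```python
observed = [25, 57, 41, 205]
expected = [17, 65, 41, 205]

# Each cell contributes (O - E)^2 / E to the statistic.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)

print([round(c, 3) for c in contributions])
print(round(chi_square, 3))  # 4.749
```

Only the first two cells contribute (64/17 and 64/65), because the last two observed counts equal their expected counts.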
The Chi-Square Distribution
Properties
The chi-square distribution is right-skewed and defined only for non-negative values.
The shape depends on the degrees of freedom ($df$), which is typically $k - 1$ for goodness-of-fit tests (where $k$ is the number of categories).
As $df$ increases, the distribution becomes more symmetric.
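One way to see the move toward symmetry is through the skewness of the chi-square distribution, which equals the square root of 8 divided by the degrees of freedom. This fact is not stated in the notes above and is brought in here only as an illustration; the function name is hypothetical.

```python
import math


def chi_square_skewness(df: int) -> float:
    """Skewness of a chi-square distribution with df degrees of freedom."""
    return math.sqrt(8 / df)


# Skewness shrinks toward 0 (perfect symmetry) as df grows.
for df in (1, 2, 5, 10, 30, 100):
    print(df, round(chi_square_skewness(df), 3))
```

The skewness stays positive for every value of the degrees of freedom, so the distribution remains right-skewed, only less so.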
Degrees of Freedom
For a one-way table (goodness-of-fit): $df = k - 1$, where $k$ is the number of categories.
For a two-way table: $df = (r - 1)(c - 1)$, where $r$ is the number of rows and $c$ is the number of columns.
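The two degrees-of-freedom rules can be written as small helper functions. This is a sketch with hypothetical function names: a one-way table with a given number of categories has one fewer degree of freedom, and a two-way table multiplies (rows minus one) by (columns minus one).

```python
def df_goodness_of_fit(k: int) -> int:
    """Degrees of freedom for a one-way table with k categories."""
    return k - 1


def df_two_way(r: int, c: int) -> int:
    """Degrees of freedom for a two-way table with r rows and c columns."""
    return (r - 1) * (c - 1)


print(df_goodness_of_fit(6))  # a six-sided die: 5
print(df_two_way(2, 3))       # a 2 x 3 table: 2
```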
Conditions for Chi-Square Tests
1-CAT Chi-Square Tests (Goodness-of-Fit)
Counted Data Condition: Data must be counts of categorical outcomes.
Randomization Condition: Data should be from a random sample or randomized experiment.
Expected Cell Frequency Condition: Each expected count should be at least 5.
2-CAT Chi-Square Tests (Independence/Homogeneity)
Same as above, plus:
Independence Assumption: Observations must be independent.
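Of the conditions above, only the expected cell frequency condition can be checked mechanically from the numbers; the others depend on how the data were collected. A minimal sketch of that check, using a hypothetical helper name and one invented set of counts:

```python
def expected_counts_ok(expected, minimum=5):
    """Return True when every expected count is at least `minimum`."""
    return all(e >= minimum for e in expected)


print(expected_counts_ok([5, 5, 5, 5]))       # expected counts from the A-D table
print(expected_counts_ok([17, 65, 41, 205]))  # expected counts from the calculation example
print(expected_counts_ok([3.2, 6.8, 10.0]))   # hypothetical counts that fail the check
```

The check is applied to the expected counts, not the observed ones; an observed count below 5 does not by itself violate the condition.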
Types of Chi-Square Tests
Goodness-of-Fit Test (1-CAT)
Used to determine whether the distribution of a single categorical variable follows a specified distribution.
Null Hypothesis ($H_0$): The observed distribution matches the expected distribution.
Alternative Hypothesis ($H_a$): The observed distribution does not match the expected distribution.
Test of Independence (2-CAT)
Used to determine whether two categorical variables are independent in a single population.
Each object in the sample is measured on two categorical variables.
Null Hypothesis ($H_0$): The variables are independent.
Test of Homogeneity (2-CAT)
Used to compare the distribution of a categorical variable across two or more populations.
Each sample comes from a different population.
Null Hypothesis ($H_0$): The distributions are the same across populations.
Comparison Table: Independence vs. Homogeneity
| Test | Sample Structure | Research Question |
|---|---|---|
| Independence | One sample, two variables measured | Are the variables related? |
| Homogeneity | Two or more samples, one variable measured | Are the distributions the same? |
Steps for Conducting a Chi-Square Test
State the Hypotheses: Define $H_0$ and $H_a$.
Check Conditions: Ensure all assumptions are met.
Calculate Expected Counts: Use the formulas above.
Compute the Chi-Square Statistic: Apply the formula to observed and expected counts.
Find the p-value: Use the chi-square distribution with appropriate degrees of freedom.
Draw a Conclusion: Compare the p-value to the significance level ($\alpha$) and interpret the result.
Example: Goodness-of-Fit Test
| Operator | New Customers |
|---|---|
| 1 | 11 |
| 2 | 12 |
| 3 | 15 |
| 4 | 13 |
| 5 | 21 |
Test whether the number of new customers is equally distributed among operators using the chi-square goodness-of-fit test.
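The test for the operator data can be sketched end to end in plain Python. Two things here are assumptions rather than part of the notes: the significance level is set to 0.05, and `chi2_sf` is a hypothetical helper that approximates the upper-tail chi-square probability with a power series for the incomplete gamma function. A statistics library would normally supply that value.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X >= x) for a chi-square variable.

    Hypothetical helper: uses the power series for the regularized lower
    incomplete gamma function P(a, z) with a = df / 2 and z = x / 2.
    Adequate for moderate x; it loses precision for very small p-values.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2, x / 2
    term = 1.0
    total = 1.0
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= z / (a + n)
        total += term
    lower = math.exp(a * math.log(z) - z - math.lgamma(a + 1)) * total
    return 1.0 - lower


observed = [11, 12, 15, 13, 21]  # new customers per operator
n = sum(observed)                # 72 customers in total
k = len(observed)                # 5 operators

# Under H0 every operator gets the same share, so each expected count is n / k.
expected = [n / k] * k

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = k - 1
p_value = chi2_sf(chi_square, df)

alpha = 0.05  # assumed significance level
print(round(chi_square, 3), df, round(p_value, 3))
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Each expected count is 14.4, which satisfies the expected cell frequency condition. The statistic comes out near 4.39 on 4 degrees of freedom, with a p-value of about 0.36, so at the assumed 0.05 level these data do not give evidence that new customers are unevenly distributed among operators.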
Example: Test of Independence
| Marital Status | Income Level |
|---|---|
| Single | Low, Middle, High |
| Married | Low, Middle, High |
Test whether marital status and income level are independent in the population.
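Because the notes give only the layout of this table and no counts, the sketch below uses HYPOTHETICAL counts invented purely to illustrate the mechanics. It computes each expected count as row total times column total divided by the grand total, then the chi-square statistic. For the p-value it relies on a special case: with 2 degrees of freedom the upper-tail chi-square probability is exactly exp(-x/2).

```python
import math

# HYPOTHETICAL counts, invented only for illustration (not from the notes).
rows = ["Single", "Married"]
cols = ["Low", "Middle", "High"]
observed = [
    [30, 40, 30],  # Single
    [20, 50, 30],  # Married
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)  # grand total

# Expected count per cell: (row total * column total) / grand total.
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(rows) - 1) * (len(cols) - 1)

# Special case used here: for df = 2 the upper-tail probability is exp(-x / 2).
assert df == 2
p_value = math.exp(-chi_square / 2)

print(expected)
print(round(chi_square, 3), df, round(p_value, 3))
```

With these invented counts, the expected counts are 25, 45, and 30 in both rows, and the p-value is about 0.21, so independence would not be rejected at the 0.05 level. Real data would of course lead to its own conclusion.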
Interpreting Results
If the p-value is less than $\alpha$ (commonly 0.05), reject $H_0$; there is evidence of a significant association or difference.
If the p-value is greater than $\alpha$, fail to reject $H_0$; there is not enough evidence to conclude a significant association or difference.
Summary Table: Chi-Square Test Types
| Test | Number of Variables | Purpose |
|---|---|---|
| Goodness-of-Fit | 1 | Compare observed distribution to expected |
| Test of Independence | 2 | Test association between variables |
| Test of Homogeneity | 2 | Compare distributions across groups |
Key Formulas
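Expected Frequency (1-CAT): $E_i = n \, p_i$ ($n$ total observations, $p_i$ the hypothesized probability of category $i$)
Expected Frequency (2-CAT): $E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n}$ ($n$ the grand total)
Chi-Square Statistic: $\chi^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$
Degrees of Freedom (1-CAT): $df = k - 1$ ($k$ categories)
Degrees of Freedom (2-CAT): $df = (r - 1)(c - 1)$ ($r$ rows, $c$ columns)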
Additional info:
Chi-square tests are non-parametric and do not require the assumption of normality.
They are sensitive to sample size; large samples can detect small differences as significant.
Expected cell counts less than 5 may invalidate the test; consider combining categories or using Fisher's Exact Test in such cases.
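The sensitivity to sample size noted above can be illustrated with a short sketch. The counts are HYPOTHETICAL: both samples show the same 45% / 55% split against an expected 50% / 50%, but the second sample is ten times larger. The cutoff 3.841 is the standard table value for 1 degree of freedom at the 0.05 level.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# HYPOTHETICAL counts with identical proportions but different sample sizes.
small = chi_square_statistic([45, 55], [50, 50])      # n = 100
large = chi_square_statistic([450, 550], [500, 500])  # n = 1000

critical_value = 3.841  # chi-square cutoff for df = 1 at alpha = 0.05

print(small, small > critical_value)
print(large, large > critical_value)
```

Multiplying every count by 10 multiplies the statistic by 10 as well, so the same 5-point departure from the expected split is not significant with 100 observations but is significant with 1000.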