Association: Contingency, Correlation, and Regression (Chapter 3 Study Notes)
Introduction to Association
In statistics, we often analyze the relationship between two or more variables. Understanding these associations helps us interpret data and make informed decisions. This chapter distinguishes between response and explanatory variables and explores how to measure and visualize associations between variables.
Response Variable (Dependent Variable): Measures the outcome of a study.
Explanatory Variable (Independent Variable): May explain or influence changes in the response variable.
Association: Exists if particular values for one variable are more likely to occur with certain values of another variable.
Note: Association does not necessarily imply causation.
Example: In a study where individuals are given different amounts of alcohol and their reaction times are measured, the amount of alcohol is the explanatory variable, and reaction time is the response variable.
Section 3.1: Association Between Two Categorical Variables
Contingency Tables
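A contingency table displays the counts for every combination of categories of two categorical variables: the rows list the categories of one variable and the columns list the categories of the other. Each cell shows the count for one combination of categories. The table below cross-classifies the people aboard the Titanic by class and survival.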
| Class | Alive | Dead | Total |
|---|---|---|---|
| First | 203 | 122 | 325 |
| Second | 118 | 167 | 285 |
| Third | 178 | 528 | 706 |
| Crew | 212 | 673 | 885 |
| Total | 711 | 1490 | 2201 |
Marginal Distribution: The frequency distribution of one variable, found in the margins of the table.
Cross-tabulation: The process of finding frequencies for a contingency table.
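Cross-tabulation can be sketched in a few lines of Python. The records below are a small hypothetical sample (not the actual passenger list), and the function name `cross_tabulate` is illustrative only.

```python
from collections import Counter


def cross_tabulate(records):
    """Count how often each (row category, column category) pair occurs."""
    return Counter(records)


# Hypothetical raw data: one (class, outcome) pair per person.
records = [
    ("First", "Alive"), ("First", "Alive"), ("First", "Dead"),
    ("Third", "Alive"), ("Third", "Dead"), ("Third", "Dead"),
]

cells = cross_tabulate(records)

# Marginal counts: sum the cell counts over the other variable.
row_totals = Counter()
col_totals = Counter()
for (row, col), count in cells.items():
    row_totals[row] += count
    col_totals[col] += count

print(cells[("First", "Alive")])  # 2
print(row_totals["Third"])        # 3
print(col_totals["Dead"])         # 3
```

The cell counts fill the body of the contingency table, while `row_totals` and `col_totals` give the two marginal distributions.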
Types of Percentages in Contingency Tables
Overall Percentage: Percentage of the total sample.
Row Percentage: Percentage within a specific row (conditional on the row variable).
Column Percentage: Percentage within a specific column (conditional on the column variable).
Example: Titanic Survival by Class (Overall Percentages)
| Class | Alive (%) | Dead (%) | Total (%) |
|---|---|---|---|
| First | 9.2 | 5.5 | 14.8 |
| Second | 5.4 | 7.6 | 12.9 |
| Third | 8.1 | 24.0 | 32.1 |
| Crew | 9.6 | 30.6 | 40.2 |
| Total | 32.3 | 67.7 | 100 |
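As a rough check, the overall percentages can be recomputed from the counts in the Titanic table: each cell count is divided by the grand total of 2201. The dictionary layout and the helper name `overall_percentage` are assumptions made for this sketch.

```python
# Counts from the Titanic contingency table: class -> (alive, dead).
counts = {
    "First": (203, 122),
    "Second": (118, 167),
    "Third": (178, 528),
    "Crew": (212, 673),
}

# Grand total of all cells (everyone in the sample).
grand_total = sum(alive + dead for alive, dead in counts.values())


def overall_percentage(cell_count, total=grand_total):
    """Percentage of the whole sample that falls in one cell."""
    return 100 * cell_count / total


for cls, (alive, dead) in counts.items():
    print(cls,
          round(overall_percentage(alive), 1),
          round(overall_percentage(dead), 1))
```

Because every cell shares the same denominator, the overall percentages across the whole table sum to 100.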
Conditional Distributions
Conditional distributions show the distribution of one variable for cases that satisfy a condition on another variable. The proportions are called conditional proportions.
| Class | Alive (%) | Dead (%) | Total (%) |
|---|---|---|---|
| First | 62.5 | 37.5 | 100 |
| Second | 41.4 | 58.6 | 100 |
| Third | 25.2 | 74.8 | 100 |
| Crew | 24.0 | 76.0 | 100 |
| Total | 32.3 | 67.7 | 100 |
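Marginal Proportion: Proportion of the entire dataset that falls in one category of a single variable (a row or column total divided by the grand total).
Conditional Proportion: Proportion relative to a specific subgroup (a cell count divided by its row or column total).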
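Example Calculations:
Percentage of Titanic survivors who were first-class passengers: $203 / 711 \approx 28.6\%$ (conditional on surviving, so the denominator is the Alive column total).
Percentage of crew members who did not survive: $673 / 885 \approx 76.0\%$ (conditional on being crew, so the denominator is the Crew row total).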
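The two example calculations condition on different variables: the first on survival (a column of the table), the second on class (a row). A minimal Python sketch using the Titanic counts makes the choice of denominator explicit; the function names are illustrative, not from the source.

```python
# Counts from the Titanic contingency table: class -> (alive, dead).
counts = {
    "First": (203, 122),
    "Second": (118, 167),
    "Third": (178, 528),
    "Crew": (212, 673),
}
ALIVE, DEAD = 0, 1  # positions within each (alive, dead) tuple


def row_proportion(cls, outcome):
    """Proportion within one class (conditional on the row variable)."""
    return counts[cls][outcome] / sum(counts[cls])


def column_proportion(cls, outcome):
    """Proportion within one outcome (conditional on the column variable)."""
    column_total = sum(row[outcome] for row in counts.values())
    return counts[cls][outcome] / column_total


survivors_in_first = column_proportion("First", ALIVE)  # 203 / 711
crew_who_died = row_proportion("Crew", DEAD)            # 673 / 885

print(f"{survivors_in_first:.1%}")  # 28.6%
print(f"{crew_who_died:.1%}")       # 76.0%
```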
Graphical Display: Segmented (Stacked) Bar Chart
A segmented or stacked bar chart divides each bar proportionally into segments corresponding to the percentage in each group. This is useful for visualizing conditional distributions.
Measuring Strength of Association Between Categorical Variables
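Difference of Proportions: The absolute difference between the proportions in two groups, $p_1 - p_2$. A value of 0 indicates no association.
Risk Ratio (Relative Risk): The ratio of the proportions in two groups, $p_1 / p_2$. A value of 1 indicates no association.
Odds Ratio: The ratio of the odds of an event in one group to the odds in another group, where the odds are $p / (1 - p)$. A value of 1 indicates no association.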
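To see how the three measures differ, the sketch below compares survival for first-class and third-class passengers using the Titanic counts. Choosing these two groups is an assumption made for illustration; any pair of rows could be compared the same way.

```python
# Titanic counts for the two groups being compared: (alive, dead).
first_alive, first_dead = 203, 122
third_alive, third_dead = 178, 528

# Conditional proportions of survival within each class.
p_first = first_alive / (first_alive + first_dead)
p_third = third_alive / (third_alive + third_dead)

# Difference of proportions: how many percentage points apart the groups are.
difference = p_first - p_third

# Risk ratio: how many times more likely survival was in first class.
risk_ratio = p_first / p_third

# Odds ratio: odds of survival (alive / dead) in first class vs. third class.
odds_ratio = (first_alive / first_dead) / (third_alive / third_dead)

print(round(difference, 3))  # 0.372
print(round(risk_ratio, 2))  # 2.48
print(round(odds_ratio, 2))  # 4.94
```

The odds ratio is noticeably larger than the risk ratio here because survival was not a rare outcome; the two measures are close only when the proportions are small.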
Section 3.2: Association Between Two Quantitative Variables
Scatterplots
A scatterplot displays the relationship between two quantitative variables measured on the same individuals. The explanatory variable is plotted on the x-axis, and the response variable on the y-axis.
Look for form (linear, curved), direction (positive, negative), and strength (strong, weak) of the relationship.
Identify any outliers (points that deviate from the overall pattern).
Types of Association
Positive Association: Above-average values of one variable tend to accompany above-average values of the other.
Negative Association: Above-average values of one variable tend to accompany below-average values of the other.
Linear Relationship: The trend in the scatterplot is well approximated by a straight line.
Example: Height and weight in a population are positively associated; number of cigarettes smoked and length of life are negatively associated.
Examples of Scatterplots and Relationships
Outlier Example: In the 2000 U.S. Presidential Election, Palm Beach County was a severe outlier in the relationship between votes for Buchanan and votes for Perot.
Adding Categorical Variables: Use different colors or symbols in scatterplots to represent categories (e.g., teams in sports data).
Describing Form, Direction, and Strength
Form: Linear or curved.
Direction: Positive or negative.
Strength: How closely the points follow a clear form.
Example: For sparrowhawks, the relationship between percent returning and number of new adults is negative and moderately strong, indicating that as more adults return, fewer new adults join the colony.
Curved Relationships
Not all relationships are linear. For example, fuel consumption versus speed in a car may show a curved pattern, where fuel use is higher at both low and high speeds than at moderate speeds.
It may not make sense to describe such relationships as simply positive or negative.
Summarizing the Strength of Association: Correlation
The Correlation Coefficient (r)
The correlation coefficient (denoted as r) measures the strength and direction of a linear relationship between two quantitative variables. It ranges from -1 to 1.
r = 1: Perfect positive linear relationship.
r = -1: Perfect negative linear relationship.
r = 0: No linear relationship.
Calculating the Correlation Coefficient
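Find the mean ($\bar{x}$) and standard deviation ($s_x$) of the x-values. Compute the standard score for each x-observation: $z_x = (x - \bar{x}) / s_x$
Find the mean ($\bar{y}$) and standard deviation ($s_y$) of the y-values. Compute the standard score for each y-observation: $z_y = (y - \bar{y}) / s_y$
The correlation coefficient is the average of the products of the standard scores, with $n - 1$ in the divisor to match the sample standard deviation: $r = \frac{1}{n - 1} \sum z_x z_y$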
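The three steps can be sketched in Python with the standard library. The data are hypothetical, the function names are illustrative, and the sketch assumes the sample standard deviation (divisor n - 1) in both the standard scores and the final average.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standard score of each observation: (value - mean) / sample std dev."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def correlation(xs, ys):
    """Sum of the products of paired z-scores, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
print(round(r, 3))  # 0.775
```

Because the calculation uses only standard scores, r has no units and does not change if x or y is rescaled (for example, from inches to centimeters).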
Interpreting r
Values close to 1 or -1 indicate a strong linear relationship.
Values close to 0 indicate a weak or no linear relationship.
Correlation only measures linear association and is sensitive to outliers.
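The sensitivity to outliers can be illustrated with a small hypothetical dataset: five points that lie exactly on a line, followed by the same points plus one observation far from the pattern. The helper `pearson_r` is a sketch written for this example, using the equivalent sum-of-deviations form of the formula.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: a perfectly linear pattern.
x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 5]
r_clean = pearson_r(x, y)

# The same data with one outlier at (10, 0) added.
r_outlier = pearson_r(x + [10], y + [0])

print(round(r_clean, 2))    # 1.0
print(round(r_outlier, 2))  # -0.25
```

A single point changes the sign of r here, which is why a scatterplot should be checked before interpreting the coefficient.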
Visual Patterns and Correlation
Scatterplots with points closely clustered around a straight line have high |r| values.
Scatterplots with widely scattered points have low |r| values.
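Additional info: For non-linear relationships, other measures may be more appropriate. Spearman's rank correlation, which applies the correlation formula to the ranks of the data, captures relationships that are curved but consistently increasing or decreasing; for other patterns, graphical analysis is usually the better guide.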
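As an illustration of the difference, the sketch below computes both coefficients for hypothetical data where y is the cube of x: the relationship is curved but always increasing. The ranking helper assumes there are no tied values; handling ties (by averaging ranks) is left out to keep the example short.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman_r(xs, ys):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


# Hypothetical curved but strictly increasing relationship.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

print(round(pearson_r(x, y), 3))   # 0.943
print(round(spearman_r(x, y), 3))  # 1.0
```

Pearson's r falls short of 1 because the points do not lie on a straight line, while the rank-based coefficient equals 1 because larger x always pairs with larger y.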