Association, Correlation, and Regression: Analyzing Relationships Between Variables

Study Guide - Smart Notes

Chapter 3: Association – Contingency, Correlation, and Regression

Introduction to Association

In statistics, we often analyze the relationship between two or more variables to understand how they interact. This chapter introduces key concepts and methods for examining associations between categorical and quantitative variables, including contingency tables, correlation, and regression analysis.

Association Between Two Categorical Variables

Contingency Tables and Marginal Distributions

Contingency Table: A table of counts for two categorical variables, with one variable's categories in the rows and the other's in the columns, so that each cell shows how many individuals fall into that combination of categories.
Marginal Distribution: The frequency distribution of one variable, found in the margins of the contingency table.
Cross-tabulation: The process of creating a contingency table from raw data.

Example: Titanic survival data by class is summarized in a contingency table, showing counts and percentages for each class and survival status.
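
As a rough illustration of cross-tabulation, the sketch below builds a small contingency table and its marginal totals from raw records. The records are hypothetical, invented for this example; they are not the actual Titanic counts.

```python
from collections import Counter

# Hypothetical raw records (passenger class, outcome); NOT real Titanic data.
records = [
    ("First", "Survived"), ("First", "Survived"), ("First", "Died"),
    ("Second", "Survived"), ("Second", "Died"), ("Second", "Died"),
    ("Third", "Survived"), ("Third", "Died"), ("Third", "Died"), ("Third", "Died"),
]

# Cross-tabulation: count how often each (class, outcome) pair occurs.
table = Counter(records)

# Marginal distributions: totals for one variable, ignoring the other.
class_totals = Counter(cls for cls, _ in records)
outcome_totals = Counter(outcome for _, outcome in records)

print("Class   Survived  Died  Total")
for cls in ("First", "Second", "Third"):
    print(cls, table[(cls, "Survived")], table[(cls, "Died")], class_totals[cls])
print("Total", outcome_totals["Survived"], outcome_totals["Died"], len(records))
```

The row and column totals are the marginal distributions; the inner counts are the cells of the contingency table.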

Conditional Distributions and Proportions

Conditional Distribution: The distribution of one variable for cases that satisfy a condition on another variable (e.g., survival rate within each class).
Conditional Proportion: The proportion of cases in a specific category, conditional on another variable.
Marginal Proportion: The proportion of cases in a category, regardless of other variables.

Conditional distributions help us determine if an association exists between variables (e.g., survival rates differ by class on the Titanic).
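
The comparison described above can be sketched in a few lines. The counts are hypothetical (chosen only to make the arithmetic easy), and the names `counts`, `conditional_survival`, and `marginal_survival` are illustrative.

```python
# Hypothetical counts, invented for illustration: class -> (survived, died).
counts = {"First": (120, 80), "Second": (90, 110), "Third": (100, 300)}

# Marginal proportion: survival rate ignoring class.
total_survived = sum(s for s, _ in counts.values())
total = sum(s + d for s, d in counts.values())
marginal_survival = total_survived / total

# Conditional proportions: survival rate within each class.
conditional_survival = {cls: s / (s + d) for cls, (s, d) in counts.items()}

print(f"Overall survival proportion: {marginal_survival:.4f}")
for cls, p in conditional_survival.items():
    print(f"Survival proportion given {cls} class: {p:.2f}")
```

If the conditional proportions were all equal to the marginal proportion, the data would show no association; here they differ by class, which suggests an association in these made-up counts.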

Graphical Display: Segmented (Stacked) Bar Chart

A segmented or stacked bar chart visually represents conditional distributions, with each bar divided into segments proportional to the percentage in each group.

[Figure: Segmented bar chart of Titanic survival by class]

Measuring Strength of Association (Categorical Variables)

Difference of Proportions: The absolute difference between conditional proportions.
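Risk Ratio (Relative Risk): The ratio of two conditional proportions, p1 / p2. A value of 1 means the proportions are equal.
Odds Ratio: The ratio of the odds of an outcome in one group to the odds in another group, where the odds for a proportion p are p / (1 - p).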
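
The three measures above are simple arithmetic on two conditional proportions. A minimal sketch, using hypothetical proportions p1 = 0.60 and p2 = 0.25 (the function names are illustrative, not from any particular library):

```python
def difference_of_proportions(p1, p2):
    """Absolute difference between two conditional proportions."""
    return abs(p1 - p2)

def risk_ratio(p1, p2):
    """Ratio of two conditional proportions (relative risk)."""
    return p1 / p2

def odds(p):
    """Odds corresponding to a proportion p."""
    return p / (1 - p)

def odds_ratio(p1, p2):
    """Ratio of the odds in group 1 to the odds in group 2."""
    return odds(p1) / odds(p2)

# Hypothetical survival proportions in two groups.
p1, p2 = 0.60, 0.25

diff = difference_of_proportions(p1, p2)
rr = risk_ratio(p1, p2)
oratio = odds_ratio(p1, p2)
print(f"difference = {diff:.2f}, risk ratio = {rr:.2f}, odds ratio = {oratio:.2f}")
```

Note that the risk ratio (2.4) and the odds ratio (4.5) differ noticeably here; they are close only when both proportions are small.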

Association Between Two Quantitative Variables

Scatterplots

A scatterplot displays the relationship between two quantitative variables, with each point representing an observation. The explanatory variable is typically on the x-axis, and the response variable on the y-axis.

To interpret a scatterplot, examine the form (linear, curved, etc.), direction (positive or negative), strength (tightness of points), and outliers.
Positive Association: Above-average values of one variable tend to accompany above-average values of the other.
Negative Association: Above-average values of one variable tend to accompany below-average values of the other.

Correlation Coefficient (r)

The correlation coefficient r measures the strength and direction of a linear relationship between two quantitative variables. It ranges from -1 (perfect negative) to +1 (perfect positive), with 0 indicating no linear association.

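Formula for r: $r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$, where $\bar{x}$ and $\bar{y}$ are the sample means, $s_x$ and $s_y$ are the sample standard deviations, and n is the number of observations.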

Properties of r:
- r is unitless and unaffected by changes in measurement units.
- r is sensitive to outliers (not resistant).
- r only measures linear relationships.
- r does not distinguish between explanatory and response variables.
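
The properties above can be checked numerically. The sketch below computes r as the average product of z-scores (dividing by n - 1) and then verifies that it is unchanged by a change of units and by swapping x and y. The car data are hypothetical, invented for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: sum of products of z-scores divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    return sum(
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)
    ) / (n - 1)

# Hypothetical data: car weight in pounds and fuel economy in mpg.
weight_lb = [2500, 3000, 3500, 4000, 4500]
mpg = [32, 28, 25, 21, 18]

r = correlation(weight_lb, mpg)

# Changing units (pounds to kilograms) leaves r unchanged.
weight_kg = [w * 0.4536 for w in weight_lb]
r_kg = correlation(weight_kg, mpg)

# Swapping the roles of x and y also leaves r unchanged.
r_swapped = correlation(mpg, weight_lb)

print(f"r = {r:.4f}, r (kg) = {r_kg:.4f}, r (swapped) = {r_swapped:.4f}")
```

For these made-up values r is close to -1, reflecting a strong negative linear association.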

[Figure: Scatterplots showing different values of the correlation coefficient r]

Descriptive Statistics Example

Descriptive statistics summarize the central tendency and spread of variables before further analysis.
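
A minimal sketch of such a summary, using Python's standard `statistics` module. The weight and MPG values are hypothetical, and `describe` is an illustrative helper, not a library function.

```python
from statistics import mean, median, stdev

# Hypothetical car data, invented for illustration.
weight_lb = [2500, 3000, 3500, 4000, 4500]
mpg = [32, 28, 25, 21, 18]

def describe(values):
    """Return basic summaries of center and spread."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),  # sample standard deviation (n - 1 divisor)
        "min": min(values),
        "max": max(values),
    }

weight_summary = describe(weight_lb)
mpg_summary = describe(mpg)
print("Weight:", weight_summary)
print("MPG:", mpg_summary)
```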

[Figure: Descriptive statistics for car weight and MPG]

Linear Regression: Predicting the Outcome of a Variable

Regression Line and Least-Squares Method

A regression line models the relationship between an explanatory variable x and a response variable y. The least-squares regression line minimizes the sum of squared vertical distances (residuals) between observed and predicted values.

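Regression Equation: $\hat{y} = a + bx$
Slope (b): $b = r \frac{s_y}{s_x}$
Intercept (a): $a = \bar{y} - b\bar{x}$
Residual: $y - \hat{y}$ (observed value minus predicted value)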

The regression line always passes through the point $(\bar{x}, \bar{y})$, the means of the two variables.
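
A minimal sketch of a least-squares fit by hand, assuming the standard formulas (slope = r times s_y / s_x, which equals the sum of cross-deviations divided by the sum of squared x-deviations; intercept = mean of y minus slope times mean of x). The data are hypothetical.

```python
# Hypothetical data: car weight in pounds and fuel economy in mpg.
weight_lb = [2500, 3000, 3500, 4000, 4500]
mpg = [32, 28, 25, 21, 18]

n = len(weight_lb)
x_bar = sum(weight_lb) / n
y_bar = sum(mpg) / n

# Sums of cross-deviations and squared x-deviations.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight_lb, mpg))
sxx = sum((x - x_bar) ** 2 for x in weight_lb)

b = sxy / sxx          # slope; algebraically equal to r * s_y / s_x
a = y_bar - b * x_bar  # intercept

# Residual = observed y minus predicted y.
predicted = [a + b * x for x in weight_lb]
residuals = [y - y_hat for y, y_hat in zip(mpg, predicted)]

print(f"predicted mpg = {a:.1f} + ({b:.4f}) * weight")
print("residuals:", [round(e, 2) for e in residuals])
print("line at x_bar:", round(a + b * x_bar, 2), "vs y_bar:", round(y_bar, 2))
```

Two checks follow from the least-squares construction: the residuals sum to zero (up to rounding), and the fitted line evaluated at the mean of x returns the mean of y.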

Regression Analysis Example: MPG vs. Weight

Regression analysis can be performed with statistical software such as Minitab or with a graphing calculator such as the TI-84. The output provides the regression equation, the coefficients, and measures of fit.

[Figure: Regression analysis output for MPG versus Weight]

Using Calculators and Software for Regression

TI-84: Enter the data in lists, then use LinReg(a+bx) to compute the regression equation and related statistics.
Minitab: Enter the data in columns, then use the regression menu to fit the model.

[Figures: TI-84 calculator regression menu; Minitab regression dialog; TI-84 regression output with r and r^2]

Interpreting Regression Output

Slope: The expected change in y for a one-unit increase in x.
Intercept: The predicted value of y when x = 0 (meaningful only if x = 0 is plausible).
Coefficient of Determination (r^2): The proportion of the variation in y explained by the regression on x.
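
To make these interpretations concrete, the sketch below fits a line to hypothetical car data and computes r^2 in two ways: as 1 minus (unexplained variation / total variation), and as the square of r. The data and variable names are illustrative assumptions.

```python
from math import sqrt

# Hypothetical car data, invented for illustration.
weight_lb = [2500, 3000, 3500, 4000, 4500]
mpg = [32, 28, 25, 21, 18]

n = len(weight_lb)
x_bar = sum(weight_lb) / n
y_bar = sum(mpg) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight_lb, mpg))
sxx = sum((x - x_bar) ** 2 for x in weight_lb)
syy = sum((y - y_bar) ** 2 for y in mpg)  # total variation in y

b = sxy / sxx              # slope
a = y_bar - b * x_bar      # intercept
r = sxy / sqrt(sxx * syy)  # correlation

# Unexplained variation: sum of squared residuals.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(weight_lb, mpg))
r_squared = 1 - sse / syy

print(f"slope: {b:.4f} mpg per lb, i.e. {100 * b:.2f} mpg per 100 lb")
print(f"intercept: {a:.1f} mpg at weight 0 (not meaningful for a car)")
print(f"r^2 from variation: {r_squared:.4f}; r squared directly: {r * r:.4f}")
```

Here the slope says predicted MPG drops by about 0.7 for each additional 100 lb, and the intercept has no practical meaning because a weight of 0 is not plausible.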

Analysis of Residuals

Residual plots help assess the fit of a regression line. A good fit shows no systematic pattern in the residuals. Curved patterns or changing spread indicate that a linear model may not be appropriate.
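
As a small numerical illustration of a "curved pattern", the sketch below fits a straight line to data that actually follow y = x^2 (a deliberately artificial example) and lists the residuals in order of x.

```python
# Artificial data that follow a curve (y = x squared), not a line.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar

# Residuals in order of increasing x.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
signs = ["+" if e > 0 else "-" for e in residuals]

print(f"fitted line: y-hat = {a:.1f} + {b:.1f} x")
print("residuals:", residuals)
print("signs:", " ".join(signs))
```

The residuals run positive, negative, negative, negative, positive: a systematic U-shape rather than random scatter, which signals that a straight line is the wrong form for these data.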

Cautions in Analyzing Associations

Outliers and Influential Points

Outlier: An observation far from the overall pattern of the data.
Influential Point: An observation that, if removed, would substantially change the regression results. Outliers in the x-direction are often influential.
Correlation and regression are not resistant to outliers or influential points.
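
The lack of resistance is easy to demonstrate. In the sketch below (artificial data, illustrative helper name `slope_and_r`), five points lie exactly on a line with slope 2; adding a single point far out in the x-direction reverses the sign of both the slope and r.

```python
from math import sqrt

def slope_and_r(xs, ys):
    """Return the least-squares slope and the correlation r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sxx, sxy / sqrt(sxx * syy)

# Artificial data lying exactly on y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
slope_clean, r_clean = slope_and_r(xs, ys)

# Add one point that is an outlier in the x-direction.
slope_out, r_out = slope_and_r(xs + [20], ys + [0])

print(f"without outlier: slope = {slope_clean:.2f}, r = {r_clean:.2f}")
print(f"with outlier:    slope = {slope_out:.2f}, r = {r_out:.2f}")
```

One observation out of six is enough to turn a perfect positive association into a moderate negative one, which is why it is worth reporting results with and without suspected influential points.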

[Figure: Effect of an outlier on the regression line]

Extrapolation

Extrapolation: Using a regression line to predict values far outside the range of observed data. This is risky and may lead to inaccurate predictions.
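
A short sketch of why this is risky. It assumes a hypothetical fitted line, predicted MPG = 49.3 - 0.007 × weight, fitted on cars weighing 2500 to 4500 lb; the coefficients, range, and function names are invented for illustration.

```python
# Hypothetical fitted line and observed x-range (invented for illustration).
A, B = 49.3, -0.007          # intercept and slope
X_MIN, X_MAX = 2500, 4500    # range of weights used to fit the line

def predict(weight_lb):
    """Predicted mpg from the hypothetical fitted line."""
    return A + B * weight_lb

def is_extrapolation(weight_lb):
    """True if the weight lies outside the range of the observed data."""
    return weight_lb < X_MIN or weight_lb > X_MAX

for w in (3200, 8000):
    label = "extrapolation" if is_extrapolation(w) else "within observed range"
    print(f"weight {w} lb -> predicted {predict(w):.1f} mpg ({label})")
```

At 8000 lb the line predicts a negative MPG, which is impossible. The line describes the pattern only within the range of weights that were actually observed.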

[Figure: Scatterplot showing extrapolation risk]

Lurking Variables and Causation

Lurking Variable: A variable not included in the analysis that affects the relationship between the studied variables.
Association does not imply causation. Even strong associations may be due to lurking variables or confounding factors.
Simpson’s Paradox: The direction of an association between two variables reverses once the data are broken down by a third variable.
Criteria for causation (when experiments are not possible): strength, consistency, dose-response, temporal precedence, and plausibility.
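
Simpson's Paradox is easiest to see with counts. In the hypothetical example below (all numbers invented), hospital X has the higher survival rate among mild cases and among severe cases, yet the lower survival rate overall, because it treats far more severe cases. Patient condition acts as the lurking variable.

```python
# Hypothetical counts, invented for illustration:
# condition -> hospital -> (survived, total patients).
data = {
    "mild":   {"X": (19, 20), "Y": (72, 80)},
    "severe": {"X": (48, 80), "Y": (10, 20)},
}

# Survival rate within each condition (conditional on the third variable).
within = {
    cond: {h: s / t for h, (s, t) in hospitals.items()}
    for cond, hospitals in data.items()
}

# Survival rate ignoring condition (data pooled over the third variable).
overall = {}
for h in ("X", "Y"):
    survived = sum(data[cond][h][0] for cond in data)
    total = sum(data[cond][h][1] for cond in data)
    overall[h] = survived / total

for cond, rates in within.items():
    print(f"{cond}: X = {rates['X']:.2f}, Y = {rates['Y']:.2f}")
print(f"overall: X = {overall['X']:.2f}, Y = {overall['Y']:.2f}")
```

Within each condition X does better (0.95 vs 0.90 and 0.60 vs 0.50), but pooled together Y looks better (0.82 vs 0.67). The reversal comes entirely from the different mix of mild and severe cases at the two hospitals.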

[Figure: Simpson's Paradox illustrated with a scatterplot]

Summary Table: Key Concepts in Association, Correlation, and Regression

Concept	Definition	Key Properties
Contingency Table	Displays frequency distribution for two categorical variables	Shows marginal and conditional distributions
Correlation (r)	Measures strength and direction of linear association	Range: -1 to 1; unitless; sensitive to outliers
Regression Line	Best-fit line predicting y from x	Minimizes sum of squared residuals
Coefficient of Determination (r^2)	Proportion of variation in y explained by the regression on x	Range: 0 to 1; values nearer 1 indicate a closer linear fit
Outlier	Observation far from overall pattern	Can strongly affect r and regression
Lurking Variable	Unmeasured variable affecting association	Can create or mask associations