(Lecture 8) Cautions in Analyzing Associations: Outliers, Lurking Variables, and Simpson’s Paradox

Section 3.4: Cautions in Analyzing Associations

Extrapolation in Regression Analysis

Extrapolation refers to using a regression line to predict the response y at x values that fall outside the observed range of the data. This practice is risky because the data give no evidence about how the variables behave there, so the predictions may be unreliable.

  • Definition: Extrapolation is the process of estimating values beyond the range of observed data using a regression equation.

  • Key Point: The farther we move from the range of observed x values, the less reliable the predictions become.

  • Limitation: There is no guarantee that the relationship described by the regression equation holds outside the sampled range.

  • Example: Predicting future sales from past data may be inaccurate if market conditions after the observed period differ from those during it.
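The risk of extrapolation can be illustrated with a small sketch. The data below are hypothetical (generated from a square-root curve, not taken from the lecture), and the names `x_obs`, `y_obs`, `predict`, and `true_y` are chosen only for this illustration. A line fitted to x values between 0 and 5 predicts reasonably well inside that range but drifts far from the true curve at x = 20.

```python
import numpy as np

# Hypothetical data: the true relationship is curved, observed only for 0 <= x <= 5.
x_obs = np.linspace(0, 5, 11)
y_obs = 10 * np.sqrt(x_obs)

# Least-squares regression line fitted to the observed range only.
slope, intercept = np.polyfit(x_obs, y_obs, 1)

def predict(x):
    """Prediction from the fitted regression line."""
    return intercept + slope * x

def true_y(x):
    """True (normally unknown) relationship used to generate the data."""
    return 10 * np.sqrt(x)

# Prediction error inside the observed range versus far outside it.
err_inside = abs(predict(2.5) - true_y(2.5))
err_outside = abs(predict(20.0) - true_y(20.0))

print(f"error at x = 2.5 (inside range):  {err_inside:.2f}")
print(f"error at x = 20  (extrapolation): {err_outside:.2f}")
```

In real data the true relationship outside the sampled range is unknown, which is exactly why the size of the extrapolation error cannot be checked.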

Outliers and Influential Points

Outliers and influential points can significantly affect the results of correlation and regression analyses. It is important to identify and consider these points before drawing conclusions.

  • Regression Outlier: An observation that lies far away from the trend followed by the rest of the data.

  • Influential Observation: An observation is influential if its x value is much lower or higher than the rest of the data and it is also a regression outlier, falling far from the trend that the other points follow.

  • Effect: Influential points can pull the regression line toward themselves, distorting the overall trend.

  • Example: In a scatterplot, a single point far from the cluster can change the slope of the regression line.

Table: Characteristics of Outliers and Influential Points

| Type | Description | Effect on Regression |
|---|---|---|
| Regression Outlier | Far from trend of data | May distort slope/correlation |
| Influential Point | Extreme x-value and far from trend of data | Pulls regression line toward itself |
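A quick way to check whether a point is influential is to fit the regression line with and without it and compare the results. The sketch below uses hypothetical data (ten points exactly on y = 2x + 1, plus one made-up observation with an extreme x value that falls far below the trend); none of these numbers come from the lecture.

```python
import numpy as np

# Hypothetical data: ten points lying exactly on the line y = 2x + 1.
x = np.arange(1.0, 11.0)
y = 2.0 * x + 1.0

# One extra observation: extreme x value AND far below the trend of the rest.
x_all = np.append(x, 30.0)
y_all = np.append(y, 10.0)

# Slope and correlation without the extra point.
slope_without, _ = np.polyfit(x, y, 1)
r_without = np.corrcoef(x, y)[0, 1]

# Slope and correlation with the extra point included.
slope_with, _ = np.polyfit(x_all, y_all, 1)
r_with = np.corrcoef(x_all, y_all)[0, 1]

print(f"without point: slope = {slope_without:.2f}, r = {r_without:.2f}")
print(f"with point:    slope = {slope_with:.2f}, r = {r_with:.2f}")
```

A large change in slope or correlation when a single observation is removed is a sign that the observation is influential and should be examined before conclusions are drawn.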

Correlation Does Not Imply Causation

While a strong correlation between two variables indicates a strong linear association, it does not necessarily mean that one variable causes changes in the other.

  • Key Point: Correlation measures association, not causality.

  • Example: Ice cream sales and drowning rates may be correlated due to a lurking variable (temperature), not because one causes the other.

Lurking Variables

A lurking variable is an unobserved variable that influences the association between the variables of primary interest.

  • Definition: A lurking variable is a variable that is not included in the analysis but affects the relationship between the explanatory and response variables.

  • Examples:

    • Ice cream sales and drowning: lurking variable = temperature

    • Reading level and shoe size: lurking variable = age

    • Childhood obesity rate and GDP: lurking variable = income
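The ice cream and drowning example can be mimicked with a simulation. This is a minimal sketch under made-up assumptions: temperature is drawn at random, it drives both variables, and neither variable affects the other. The coefficients, noise levels, and the helper `residuals` are hypothetical choices for illustration only. The raw correlation is strong, but it nearly vanishes once the linear effect of temperature is removed from both variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated (not real) data: temperature is the common cause of both variables.
temperature = rng.uniform(10, 35, n)
ice_cream = 5.0 * temperature + rng.normal(0, 10, n)
drownings = 0.3 * temperature + rng.normal(0, 1, n)

# Correlation when temperature is ignored.
r_raw = np.corrcoef(ice_cream, drownings)[0, 1]

def residuals(y, x):
    """Part of y left over after removing its least-squares line on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (intercept + slope * x)

# Correlation after removing the linear effect of temperature from both variables.
r_adjusted = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"correlation ignoring temperature:      {r_raw:.2f}")
print(f"correlation adjusted for temperature:  {r_adjusted:.2f}")
```

With real data the lurking variable is usually unmeasured, so this kind of adjustment is only possible after the variable has been identified and recorded.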

Simpson’s Paradox

Simpson’s Paradox occurs when the direction of an association between two variables reverses after including a third variable and analyzing the data at separate levels of that variable.

  • Definition: Simpson’s Paradox is the phenomenon where a trend appears in several different groups of data but disappears or reverses when these groups are combined.

  • Example: The association between education and crime rate may change when urbanization is considered as a third variable.

Table: Simpson’s Paradox Example (Kidney Stone Treatment)

| Treatment | Success (Overall) | Success (Small Stones) | Success (Large Stones) |
|---|---|---|---|
| A | 78% | 93% | 73% |
| B | 83% | 87% | 69% |

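Additional info: Although Treatment B appears more effective overall, Treatment A is more effective for both small and large stones when the data are stratified by stone size. The reversal arises because the two treatments were not applied to comparable mixes of cases: in this widely cited data set, Treatment A was used mostly on large stones, which have lower success rates under either treatment. This reversal is Simpson’s Paradox.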
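The reversal can be reproduced from counts. The notes give only percentages, so the (successes, patients) counts below are an assumption: they are the figures commonly reported for this kidney stone study, and they reproduce the percentages in the table. The names `counts` and `rates` are illustrative. The overall rate for each treatment is a weighted average of its two stratum rates, with weights set by how many small-stone and large-stone patients received that treatment.

```python
# Assumed counts (successes, patients) per treatment and stone size;
# not stated in the notes, but consistent with the table's percentages.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

rates = {}
for treatment, strata in counts.items():
    # Success rate within each stone-size group.
    rates[treatment] = {size: s / n for size, (s, n) in strata.items()}
    # Overall rate pools successes and patients across both groups.
    total_success = sum(s for s, _ in strata.values())
    total_patients = sum(n for _, n in strata.values())
    rates[treatment]["overall"] = total_success / total_patients

for treatment, r in rates.items():
    print(
        f"Treatment {treatment}: small = {r['small']:.0%}, "
        f"large = {r['large']:.0%}, overall = {r['overall']:.0%}"
    )
```

Under these counts, most Treatment A patients had large stones while most Treatment B patients had small stones, so the overall rate for A is weighted toward the harder cases.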

Confounding Variables

Confounding occurs when two explanatory variables are both associated with a response variable and with each other, making it difficult to determine the effect of each variable.

  • Definition: A confounding variable is associated with both the explanatory and response variables, potentially distorting the observed relationship.

  • Example: In the kidney stone treatment example, stone size is a confounding variable affecting the success rate of treatments.

Table: Variables in Simpson’s Paradox Example

| Explanatory Variable | Response Variable | Confounding Variable |
|---|---|---|
| Type of Treatment | Success Rate | Stone Size |

The Effect of Lurking Variables on Associations

Lurking variables can affect associations in various ways, often acting as common causes for both explanatory and response variables. In practice, multiple causes may exist, making it challenging to isolate the effect of any single variable.

  • Key Point: Multiple lurking or confounding variables may exist, complicating the analysis of associations.

  • Example: Age can affect both smoking status and survival rates, as seen in the Simpson’s Paradox example with smokers and nonsmokers.
