Step-by-Step Guidance for Stat1P99 Final Exam Practice Questions

Q1. Anscombe’s Data: Two data sets have almost the same r ≈ 0.816. Explain why reporting only r can be very misleading and what important feature you would miss without drawing scatterplots.

Background

Topic: Correlation and Data Visualization

This question tests your understanding of the limitations of the correlation coefficient (r) and the importance of visualizing data with scatterplots.

Key Terms and Concepts:

Correlation coefficient (r): A measure of the strength and direction of a linear relationship between two variables.
Scatterplot: A graphical representation of the relationship between two quantitative variables.
Outlier: A data point that differs significantly from other observations.
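
For reference, one common way to write the sample correlation coefficient for $n$ paired observations is shown below. The notation ($x_i$, $y_i$, $\bar{x}$, $\bar{y}$) is an assumption made for illustration; your course materials may instead use the equivalent form based on standardized values.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Every observation enters both the numerator and the denominator, so a single unusual point can move $r$ noticeably.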

Step-by-Step Guidance

Recall that the correlation coefficient $r$ measures only the strength and direction of a linear relationship; it does not reveal non-linear patterns or the presence of outliers.
Consider that two very different data sets can have the same value of $r$ even when their underlying relationships look nothing alike.
Think about what information a scatterplot provides that a single summary statistic like $r$ does not. For example, scatterplots can reveal non-linear relationships, clusters, or outliers.
Reflect on why relying solely on $r$ could lead to incorrect conclusions about the nature of the relationship between the variables.
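
The steps above can be checked numerically with a minimal sketch. It assumes the two data sets in the question are sets I and II of Anscombe's quartet (the question does not say which two it means), and the helper name `pearson_r` is made up for this example. A crude text plot stands in for a real scatterplot.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Assumption: Anscombe's sets I and II, which share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_set1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_set2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

r_set1 = pearson_r(x, y_set1)
r_set2 = pearson_r(x, y_set2)
print(f"Set I:  r = {r_set1:.3f}")
print(f"Set II: r = {r_set2:.3f}")

# Crude text "scatterplot": one row per point, sorted by x,
# with the star placed further right for larger y.
for label, ys in (("Set I", y_set1), ("Set II", y_set2)):
    print(label)
    for xv, yv in sorted(zip(x, ys)):
        print(f"  x={xv:2d} " + " " * int(yv * 4) + "*")
```

Both values of $r$ come out close to 0.816, yet the two text plots have visibly different shapes, which is the feature a lone summary number hides.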

Q2. Effect of Outlier on r: A scatterplot has 10 points with one clear outlier at (10,10). Describe how this outlier affects the value of r. What happens to r if the outlier is removed? What lesson does this teach?

Background

Topic: Sensitivity of Correlation to Outliers

This question examines how outliers can influence the correlation coefficient and the importance of checking for outliers in data analysis.

Key Terms and Concepts:

Outlier: An observation that lies an abnormal distance from other values in a data set.
Correlation coefficient (r): Sensitive to extreme values, especially in small data sets.

Step-by-Step Guidance

Recall that $r$ is calculated using all data points, so a single outlier can have a large effect on its value, especially with only 10 points.
Consider how the outlier at (10,10) might increase or decrease $r$, depending on its position relative to the trend of the other points.
Think about what would happen to $r$ if you removed the outlier: would $r$ become stronger (closer to 1 or -1) or weaker (closer to 0)?
Reflect on the broader lesson about the importance of examining data visually and not relying solely on summary statistics.
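
A small experiment can make these steps concrete. The question does not give the other nine points, so both nine-point data sets below are invented purely for illustration: a trendless cluster and a perfectly decreasing line. Only the outlier at (10,10) comes from the question, and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(points):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)

outlier = (10, 10)  # the outlier named in the question

# Hypothetical case A: nine points on a 3x3 grid with no trend at all.
cluster = [(x, y) for x in (1, 2, 3) for y in (1, 2, 3)]
r_cluster_without = pearson_r(cluster)
r_cluster_with = pearson_r(cluster + [outlier])

# Hypothetical case B: nine points on a perfectly decreasing line.
line = [(x, 10 - x) for x in range(1, 10)]
r_line_without = pearson_r(line)
r_line_with = pearson_r(line + [outlier])

print(f"Cluster: r = {r_cluster_without:.3f} without, {r_cluster_with:.3f} with the outlier")
print(f"Line:    r = {r_line_without:.3f} without, {r_line_with:.3f} with the outlier")
```

In case A the outlier manufactures a strong positive correlation (about 0.91) out of a trendless cluster; in case B the same point drags a perfect negative correlation of -1 toward zero (about -0.45). Which way $r$ moves depends on where the outlier sits relative to the other points.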

Q3. Correlation vs Causation: Strong correlations are found between (i) storks and human births, (ii) pleasure boats and manatee deaths, (iii) cheese consumption and engineering PhDs. Does this mean one causes the other? Give a possible lurking variable for each.

Background

Topic: Correlation vs. Causation and Lurking Variables

This question tests your understanding of the difference between correlation and causation, and the concept of lurking variables.

Key Terms and Concepts:

Correlation: A statistical association between two variables.
Causation: When one variable directly affects another.
Lurking variable: A variable not included in the analysis that influences both variables being studied.

Step-by-Step Guidance

Recall that correlation does not imply causation; two variables can be correlated due to a third, unmeasured variable.
For each example, think about what external factor (lurking variable) could be influencing both variables.
Consider how these lurking variables could create a spurious correlation between the two variables in each case.
Write down a plausible lurking variable for each pair.
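
To see how a lurking variable can produce a spurious correlation, here is a minimal simulation sketch. All data are randomly generated and hypothetical: a lurking variable `lurking` drives two variables `a` and `b`, which never influence each other. The partial-correlation formula at the end goes beyond what the question asks; it is included only as one way to check what remains once the lurking variable is accounted for.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical lurking variable, plus two variables that each depend on it
# (with independent noise) but not on each other.
lurking = [random.gauss(0, 1) for _ in range(n)]
a = [z + random.gauss(0, 0.5) for z in lurking]
b = [z + random.gauss(0, 0.5) for z in lurking]

r_ab = pearson_r(a, b)
r_az = pearson_r(a, lurking)
r_bz = pearson_r(b, lurking)

# Partial correlation of a and b, controlling for the lurking variable.
partial_ab = (r_ab - r_az * r_bz) / math.sqrt((1 - r_az**2) * (1 - r_bz**2))

print(f"r(a, b) ignoring the lurking variable: {r_ab:.2f}")
print(f"r(a, b) controlling for it:            {partial_ab:.2f}")
```

The raw correlation between `a` and `b` is strong even though neither causes the other, while the correlation that remains after controlling for the lurking variable is close to zero.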