Understanding and Comparing Distributions: Visual Displays, Summaries, and Transformations
Study Guide - Smart Notes
Chapter 4: Understanding and Comparing Distributions
Section 4.1: Displays for Comparing Groups
Comparing distributions is a fundamental task in statistics, allowing us to understand differences and similarities between groups. This section introduces graphical and numerical methods for effective comparison.
Wind Speeds in Hopkins Memorial Forest: A Case Study
Data Description: Daily average wind speeds for every day of 2011.
Distribution Shape: Unimodal and skewed right (most days have low wind speeds, with a few very windy days).
Summary Statistics:
Median ≈ 1.12 mph
Interquartile Range (IQR) ≈ 1.82 mph
Possible outliers, e.g., a very windy day at 6.73 mph
Example: The histogram shows wind speeds concentrated at the low end (half of all days fall at or below the median of 1.12 mph), with a long tail to the right.
Comparing Seasons: Summer vs. Winter Wind Speeds
Summer: Unimodal, skewed right, typically calm (<1 mph), few high wind days.
Winter: Less skewed, nearly uniform, more spread out, with windy days more common.
Statistical Comparison: Both standard deviation and IQR are higher in winter, indicating greater variability.
| Season | Mean (mph) | StdDev (mph) | Median (mph) | IQR (mph) |
|---|---|---|---|---|
| Summer | 1.11 | 1.10 | 0.71 | 1.27 |
| Winter | 1.90 | 1.29 | 1.72 | 1.82 |
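A quick numerical check on the shape descriptions is the gap between mean and median: in a right-skewed distribution the long tail pulls the mean above the median. The sketch below applies that idea to the table values. Scaling the gap by the standard deviation is a choice made here for illustration, and the name `skew_gap` is hypothetical; neither comes from the original notes.

```python
# Summary statistics copied from the seasonal table (mph).
seasons = {
    "Summer": {"mean": 1.11, "sd": 1.10, "median": 0.71, "iqr": 1.27},
    "Winter": {"mean": 1.90, "sd": 1.29, "median": 1.72, "iqr": 1.82},
}

def skew_gap(stats):
    """(mean - median) / sd: a positive value hints at right skew."""
    return (stats["mean"] - stats["median"]) / stats["sd"]

gaps = {name: skew_gap(s) for name, s in seasons.items()}
for name, s in seasons.items():
    print(f"{name}: mean-median gap = {gaps[name]:.2f} SDs, IQR = {s['iqr']} mph")
```

The gap is positive in both seasons and larger in summer, which agrees with summer being described as the more strongly right-skewed season.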
Example 1: Comparing Groups with Stem-and-Leaf Plots
Stem-and-Leaf Diagram: Useful for comparing distributions of small datasets, such as nest egg indices (savings and investments) across regions.
Back-to-Back Format: Allows direct visual comparison between two groups (e.g., South/West vs. Northeast/Midwest).
Interpretation: Northeast and Midwest generally have higher indices than South and West.
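To make the back-to-back layout concrete, here is a minimal sketch that builds one as text. It assumes non-negative integer values with tens digits as stems; the function name and the two small data sets are hypothetical and are not the nest egg data.

```python
from collections import defaultdict

def back_to_back_stem_leaf(left, right):
    """Build a back-to-back stem-and-leaf display for non-negative integers.

    Stems are the tens (and higher) digits; leaves are the ones digits.
    Left-group leaves are written in reverse so they read outward from the stem.
    """
    def split(values):
        leaves = defaultdict(list)
        for v in sorted(values):
            stem, leaf = divmod(v, 10)
            leaves[stem].append(leaf)
        return leaves

    left_map, right_map = split(left), split(right)
    all_stems = list(left_map) + list(right_map)
    width = max(len(v) for v in left_map.values())
    lines = []
    for stem in range(min(all_stems), max(all_stems) + 1):
        left_leaves = "".join(str(x) for x in reversed(left_map.get(stem, [])))
        right_leaves = "".join(str(x) for x in right_map.get(stem, []))
        lines.append(f"{left_leaves:>{width}} | {stem} | {right_leaves}".rstrip())
    return lines

# Hypothetical index values for two groups (not the data from the notes).
group_a = [42, 47, 51, 55, 58, 63]
group_b = [55, 59, 61, 64, 68, 72, 75]
for line in back_to_back_stem_leaf(group_a, group_b):
    print(line)
```

Sharing one column of stems is what allows the direct comparison: a group whose leaves sit on higher stems has generally higher values.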
Five-Number Summary
Components: Minimum, First Quartile (Q1), Median, Third Quartile (Q3), Maximum.
Example: For wind speed data:
Min: 0.00 mph
Q1: 0.46 mph
Median: 1.12 mph
Q3: 2.28 mph
Max: 6.73 mph
IQR Calculation: $\text{IQR} = Q_3 - Q_1 = 2.28 - 0.46 = 1.82$ mph
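The five-number summary can be computed with Python's standard library, as in the sketch below. Quartile conventions differ between textbooks and software, so the `method="inclusive"` setting is an assumption and may give slightly different quartiles than a hand calculation. The sample is hypothetical, not the Hopkins Forest data.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a sample with at least two values."""
    data = sorted(values)
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, med, q3, data[-1]

# Hypothetical sample, not the wind speed data.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
low, q1, med, q3, high = five_number_summary(sample)
iqr = q3 - q1
print(f"min={low}, Q1={q1}, median={med}, Q3={q3}, max={high}, IQR={iqr}")

# The same subtraction applied to the wind speed quartiles reported in the notes.
wind_iqr = round(2.28 - 0.46, 2)
print(f"wind speed IQR = {wind_iqr} mph")
```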
Boxplots
Definition: A graphical display of the five-number summary. The box spans Q1 to Q3 (the central 50% of the data), a line inside the box marks the median, and whiskers extend toward the extremes, with potential outliers shown separately.
Interpretation:
If the median is roughly centered in the box, the middle half of the data is roughly symmetric.
Whiskers of unequal length indicate skewness.
Outliers are plotted individually.
Use: Excellent for comparing multiple groups side by side.
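The notes do not say how a boxplot decides which points to plot as outliers. A common convention (Tukey's rule, assumed here) flags any value more than 1.5 IQRs beyond the quartiles. The sketch applies that rule to the wind speed quartiles given earlier; the function name `outlier_fences` is hypothetical.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences placed k IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles and maximum from the wind speed five-number summary (mph).
lower, upper = outlier_fences(q1=0.46, q3=2.28)
max_speed = 6.73
is_outlier = max_speed < lower or max_speed > upper
print(f"fences: ({lower:.2f}, {upper:.2f}); 6.73 mph flagged: {is_outlier}")
```

Under this rule the upper fence is about 5.01 mph, so the 6.73 mph day falls outside it. The lower fence is negative, so no day can be a low outlier.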
Example 2: Comparing Roller Coaster Speeds with Boxplots
Comparison: Median speed for wooden coasters is higher than for steel, but steel coasters have a much wider range and more variability.
Outliers: Exceptionally fast steel coasters are visible as outliers.
Step-by-Step Example: Comparing Coffee Cup Types
Experiment: Four types of coffee cups tested for heat retention (temperature measured after 30 minutes, 8 trials each).
Variables: Cup type (categorical) and temperature change (quantitative).
Five-Number Summaries and IQRs:
| Cup | Min | Q1 | Median | Q3 | Max | IQR |
|---|---|---|---|---|---|---|
| UPPS | 6.0 | 6.0 | 8.25 | 14.25 | 18.50 | 8.25 |
| Nissan | 0.0 | 1.0 | 2.0 | 4.50 | 7.0 | 3.50 |
| GG | 9.0 | 11.50 | 14.25 | 21.75 | 24.50 | 10.25 |
| Starbucks | 6.0 | 6.50 | 8.50 | 14.25 | 17.50 | 7.75 |
Conclusion: Nissan cups retain heat best (lowest median temperature loss, smallest IQR); GG cups perform worst.
Section 4.2: Outliers
Outliers are data points that differ significantly from other observations. Identifying and understanding outliers is crucial for accurate data analysis.
Approach:
Check for data entry or measurement errors (e.g., unit confusion, transposed digits).
Consider if the outlier is a legitimate but extraordinary value (e.g., special events, rare occurrences).
Common Causes: Data entry errors, misunderstanding survey questions, misreading results, confusion about units, cheating, rare events.
Importance: Outliers can be the most interesting data values, revealing important phenomena or errors.
Example: The fastest roller coaster (Formula Rossa) is an outlier; its extreme speed comes from a hydraulic launch mechanism.
Section 4.3: Re-Expressing Data (Transformations)
Re-expressing or transforming data can make distributions more symmetric, clarify relationships, and improve interpretability.
Motivation: Skewed data can make it difficult to summarize or compare groups. Transformations can help.
Common Transformations:
For right-skewed data: Use logarithm ($\log y$), square root ($\sqrt{y}$), or reciprocal ($1/y$).
For left-skewed data: Use square ($y^2$).
Example: CEO compensation is highly skewed; taking the logarithm makes the distribution more symmetric and easier to interpret.
Boxplots after Transformation: Log transformation can reveal differences between groups that were hidden in the original scale.
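The effect of a log re-expression can be seen in a small sketch. The values are hypothetical and deliberately built to be symmetric on the log scale; they are not real compensation data. The mean-minus-median gap (in standard deviations) is used here as a rough skewness indicator, which is an illustrative choice rather than a method from the notes.

```python
import math
import statistics

# Hypothetical right-skewed values, constructed to be symmetric after log10.
pay = [1, 10, 10, 100, 100, 100, 1000, 1000, 10000]
log_pay = [math.log10(x) for x in pay]

def mean_minus_median_in_sds(values):
    """Gap between mean and median, measured in standard deviations."""
    gap = statistics.mean(values) - statistics.median(values)
    return gap / statistics.stdev(values)

raw_gap = mean_minus_median_in_sds(pay)
log_gap = mean_minus_median_in_sds(log_pay)
print(f"raw scale gap: {raw_gap:.2f} SDs")
print(f"log scale gap: {log_gap:.2f} SDs")
```

On the raw scale the single largest value drags the mean far above the median. After taking logs the mean and median coincide, which is what a symmetric distribution looks like numerically.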
What Can Go Wrong?
Avoid inconsistent scales when comparing groups.
Label plots clearly.
Be aware of outliers and consider transformations when appropriate.
Be careful when taking logarithms (cannot take log of zero or negative numbers).
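The logarithm caution matters for data like the wind speeds, whose minimum is 0.00 mph. A minimal sketch of a guard, assuming you would rather stop with an error than silently drop values (the helper name `safe_log10` is hypothetical):

```python
import math

def safe_log10(values):
    """Return base-10 logs, refusing zero or negative inputs."""
    bad = [v for v in values if v <= 0]
    if bad:
        raise ValueError(f"cannot take log of non-positive values: {bad}")
    return [math.log10(v) for v in values]

print(safe_log10([1, 10, 100]))

try:
    # Includes the 0.00 mph minimum from the wind speed summary.
    safe_log10([0.0, 0.46, 1.12])
except ValueError as err:
    print(f"rejected: {err}")
```

When zeros are legitimate data, another re-expression such as the square root may be a better choice, since it is defined at zero.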
Summary: What Have We Learned?
Choose the right graphical tool: histograms for a few groups, boxplots for many groups.
Treat outliers with care—investigate their cause and consider their impact.
Transform data when necessary to improve symmetry and comparability.
Use both graphical and numerical summaries for a complete understanding of distributions.