Boxplots, Outliers, Standardization, and the Normal Model: Study Notes for Statistics

Boxplots and Comparing Distributions

Boxplots: Construction and Interpretation

A boxplot is a graphical display that summarizes the distribution of a quantitative variable using its five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. Boxplots are especially useful for comparing groups and identifying outliers.

  • Five-number summary: Minimum, Q1, Median, Q3, Maximum

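  • Quartiles: Q1, the median, and Q3 divide the ordered data into four parts, each containing about 25% of the observations

  • Median: The middle value of the ordered data; half of the observations lie at or below it and half at or above it

  • Outliers: Values that fall far from the bulk of the data; on a boxplot they are plotted individually, often as asterisks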
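
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the notes' tables: the data values are hypothetical, and the "inclusive" quartile convention is an assumption (textbooks and software packages use several slightly different quartile rules).

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # Assumption: the "inclusive" convention; other quartile rules can
    # give slightly different Q1 and Q3 for the same data.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Hypothetical data, invented for illustration only.
states_visited = [1, 5, 8, 10, 12, 14, 16, 20, 29]
summary = five_number_summary(states_visited)
print(summary)
```

The five values returned are exactly the ones a boxplot draws: the box spans Q1 to Q3, the line inside the box marks the median, and the minimum and maximum bound the data.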

Example: The boxplot for the number of U.S. states visited shows the median, quartiles, and an outlier.

[Figures: Boxplot with quartiles and median; Boxplot with outliers marked; Boxplot: Number of U.S. States Visited]

Five-Number Summary Table

Variable                       Minimum  Q1  Median  Q3  Maximum
Number of U.S. States Visited  1        8   12      16  29

Statistics table for Number of U.S. States Visited

Comparing Boxplots and Histograms

Boxplots and histograms both display the distribution of data, but boxplots are better for comparing groups and identifying outliers, while histograms show the shape of the distribution.

  • Boxplot: Summarizes spread, center, and outliers

  • Histogram: Shows frequency and shape (e.g., symmetric, skewed, bimodal)

Example: The histogram and boxplot for the number of U.S. states visited show a roughly symmetric distribution with one outlier.

[Figures: Histogram: Number of U.S. States Visited; Boxplot: Number of U.S. States Visited]

Identifying Outliers

Formal Criterion for Outliers

Outliers are formally identified using the interquartile range (IQR). The boundaries for outliers are:

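  • Lower boundary: $\text{Q1} - 1.5 \times \text{IQR}$

  • Upper boundary: $\text{Q3} + 1.5 \times \text{IQR}$

Here $\text{IQR} = \text{Q3} - \text{Q1}$. Values outside these boundaries are considered outliers.

[Figure: Boxplot with outliers marked]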
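
As a quick check, the usual 1.5 × IQR criterion can be applied to the states-visited summary given earlier (Q1 = 8, Q3 = 16, maximum = 29). The function name `outlier_fences` and its parameter `k` are illustrative choices, not part of the original notes.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier boundaries from the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles taken from the five-number summary for states visited.
lower, upper = outlier_fences(q1=8, q3=16)
maximum = 29
max_is_outlier = maximum > upper

print(f"lower boundary = {lower}, upper boundary = {upper}")
print(f"maximum of {maximum} flagged as outlier: {max_is_outlier}")
```

With an IQR of 8, the boundaries are -4 and 28, so the maximum of 29 falls just above the upper boundary. This agrees with the single outlier shown in the boxplot.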

Comparing Groups: Side-by-Side Boxplots

Comparative Studies and Grouping Variables

When comparing groups, side-by-side boxplots are used to visualize differences in distributions. The grouping variable is categorical, and the response variable is quantitative.

  • Grouping variable: Separates data into categories (e.g., voice part in a choral group)

  • Response variable: The quantitative variable being measured (e.g., height)

Example: Heights of singers separated by voice part.

[Figure: Boxplot of Height by Voice Part]

Voice Part  Count
Bass        36
Tenor       20
Alto        35
Soprano     39
Total       130

Tally table for Voice Part
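
A side-by-side comparison starts by splitting the response variable according to the grouping variable. The sketch below shows that step with a handful of hypothetical height records (the values and the unit, inches, are assumptions for illustration; the actual singer data are not reproduced in these notes).

```python
from collections import defaultdict
import statistics

# Hypothetical (voice part, height in inches) records, invented for illustration.
records = [
    ("Soprano", 62), ("Soprano", 64), ("Soprano", 66),
    ("Alto", 63), ("Alto", 65), ("Alto", 67),
    ("Tenor", 68), ("Tenor", 69), ("Tenor", 71),
    ("Bass", 69), ("Bass", 71), ("Bass", 74),
]

# The grouping variable (voice part) decides which list each
# response value (height) goes into.
heights_by_part = defaultdict(list)
for part, height in records:
    heights_by_part[part].append(height)

# One summary per group; a side-by-side boxplot draws one box per group.
medians = {part: statistics.median(h) for part, h in heights_by_part.items()}
print(medians)
```

Each list in `heights_by_part` would supply one box in a side-by-side boxplot, which makes differences in center and spread between groups easy to see.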

Numerical Summaries: Mean, Median, Standard Deviation, and IQR

Mean and Median

  • Mean: The arithmetic average, sensitive to outliers

  • Median: The middle value, resistant to outliers

[Figure: Mean and median on a balance]
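
To see what "sensitive" and "resistant" mean in practice, the sketch below adds a single extreme value to a small hypothetical data set and recomputes both statistics.

```python
import statistics

# Hypothetical values, invented for illustration.
values = [10, 12, 13, 15, 20]
with_outlier = values + [200]  # same data plus one extreme value

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

The one added value pulls the mean from 14 to 45, while the median only moves from 13 to 14.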

Standard Deviation and IQR

  • Standard deviation (SD): Measures spread around the mean, not resistant to outliers

  • Interquartile Range (IQR): Measures spread of the middle 50% of data, resistant to outliers

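Formula for Standard Deviation: $s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$

Formula for IQR: $\text{IQR} = \text{Q3} - \text{Q1}$

[Figures: Histogram: Summer Olympics 2024; Statistics table for Olympic Medals]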
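
The two measures of spread can be computed directly from their definitions. The sketch below assumes the sample standard deviation (dividing by n - 1) and the "inclusive" quartile convention; the data are hypothetical.

```python
import math
import statistics

def sample_sd(values):
    """Sample SD: square root of the summed squared deviations over n - 1."""
    mean = sum(values) / len(values)
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (len(values) - 1))

def iqr(values):
    """IQR = Q3 - Q1, using the 'inclusive' quartile convention (an assumption)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical values, invented for illustration.
values = [10, 12, 13, 15, 20]
with_outlier = values + [200]

print(f"SD:  {sample_sd(values):.2f} -> {sample_sd(with_outlier):.2f}")
print(f"IQR: {iqr(values)} -> {iqr(with_outlier)}")
```

Adding one extreme value inflates the standard deviation by more than a factor of ten, while the IQR changes far less. This is why the IQR is described as resistant and the SD is not.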

Standardization: Z-scores

Using the Standard Deviation to Standardize Values

Standardization transforms data into Z-scores, which indicate how many standard deviations an observation is from the mean.

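  • Z-score formula: $z = \dfrac{x - \bar{x}}{s}$, where $\bar{x}$ is the mean and $s$ is the standard deviation

  • Interpretation: A positive Z-score means the value is above the mean; a negative Z-score means it is below the mean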

  • Application: Useful for comparing values from different distributions

Example: Body temperature observations compared to the mean.

Variable  N        Mean  StDev  Median
BodyTemp  578,222  98.0  0.7    98.0

Statistics table for Body Temperature

[Figures: Body temperature study article; Body temperature study news; Histogram of Body Temperature with 98.6 marked; Histogram of Body Temperature with 98.0 marked]
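
The body temperature summary gives a natural worked example: how unusual is a reading of 98.6 when the mean is 98.0 and the standard deviation is 0.7? The helper name `z_score` is illustrative.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

# Mean and StDev taken from the BodyTemp summary table.
z = z_score(98.6, mean=98.0, sd=0.7)
print(f"z = {z:.2f}")
```

The result is a little under 0.9, so 98.6 sits less than one standard deviation above the mean in these data.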

The Normal Model and the Empirical Rule

Normal Distribution

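The normal model describes a bell-shaped, symmetric distribution defined by its mean ($\mu$) and standard deviation ($\sigma$).

  • Standard normal distribution: The normal model with mean 0 and standard deviation 1, written $N(0, 1)$

  • Notation: $N(\mu, \sigma)$ denotes a normal model with mean $\mu$ and standard deviation $\sigma$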

Empirical Rule (68-95-99.7 Rule)

  • About 68% of observations fall within 1 standard deviation of the mean

  • About 95% fall within 2 standard deviations

  • About 99.7% fall within 3 standard deviations

Formula:

  • $\mu \pm \sigma$ contains 68%

  • $\mu \pm 2\sigma$ contains 95%

  • $\mu \pm 3\sigma$ contains 99.7%

[Figure: Empirical Rule diagram]
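
The 68-95-99.7 percentages can be checked against the standard normal distribution using Python's `statistics.NormalDist`. The sketch also prints the matching intervals for the body temperature summary (mean 98.0, standard deviation 0.7); treating body temperature as approximately normal is an assumption made here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)
mean, sd = 98.0, 0.7  # from the BodyTemp summary table

shares = {}
for k in (1, 2, 3):
    # Proportion of a normal model lying within k standard deviations.
    shares[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    low, high = mean - k * sd, mean + k * sd
    print(f"within {k} SD: {shares[k]:.1%}, interval {low:.1f} to {high:.1f}")
```

The computed proportions round to 68%, 95%, and 99.7%, which is where the rule gets its name.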

Application Examples

Olympic Medals Data

  • Histogram and boxplot show right-skewed distribution with multiple outliers

  • Comparing the mean and median helps indicate skewness: in a right-skewed distribution, the mean is typically larger than the median

[Figures: Histogram: Summer Olympics 2024; Boxplot of Total Olympic Medals; Summary report for Olympic Medals]

Life Expectancy Data

  • Histogram and boxplot show distribution of life expectancy across countries

  • Five-number summary provides key statistics

Variable                 Minimum  Q1    Median  Q3    Maximum
Average Life Expectancy  49.0     65.1  73.1    76.8  84.0

Statistics table for Life Expectancy

[Figures: Histogram: Life Expectancy; Boxplot: Life Expectancy]
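
As a worked check, the 1.5 × IQR outlier criterion can be applied to the life expectancy summary. The arithmetic below uses only the five-number summary values; the 1.5 multiplier is the usual convention for this criterion.

```latex
\begin{align*}
\text{IQR} &= \text{Q3} - \text{Q1} = 76.8 - 65.1 = 11.7 \\
\text{Lower boundary} &= \text{Q1} - 1.5 \times \text{IQR} = 65.1 - 17.55 = 47.55 \\
\text{Upper boundary} &= \text{Q3} + 1.5 \times \text{IQR} = 76.8 + 17.55 = 94.35
\end{align*}
```

The minimum (49.0) and maximum (84.0) both lie inside these boundaries, so this criterion would flag no outliers. The median (73.1) is also closer to Q3 than to Q1, which hints at a left-skewed distribution.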

Summary Table: Resistant vs. Non-Resistant Statistics

Statistic           Resistant?
Mean                No
Median              Yes
Standard Deviation  No
IQR                 Yes

Key Takeaways

  • Boxplots and histograms are essential for visualizing and comparing distributions

  • Outliers are formally identified using the IQR criterion

  • Standardization (Z-scores) allows comparison across different scales

  • The normal model and empirical rule provide benchmarks for interpreting data

  • Resistant statistics (median, IQR) are preferred for skewed or outlier-prone data
