Displaying and Summarizing Quantitative Data

Study Guide - Smart Notes

Tailored notes based on your materials, expanded with key definitions, examples, and context.

W2 Displaying and Summarizing Quantitative Data

Introduction

Quantitative data analysis is a foundational aspect of statistics, enabling us to understand, summarize, and communicate the characteristics of numerical datasets. This section covers the main graphical and numerical methods for displaying and summarizing quantitative data, including histograms, stem-and-leaf plots, dotplots, measures of center and spread, the five-number summary, and boxplots.

Displaying Quantitative Data

Histograms

Histograms are graphical representations that display the distribution of a quantitative variable by grouping data into bins of equal width and plotting the frequency or relative frequency of observations in each bin. Unlike bar charts, which are used for categorical data, histograms are used for quantitative data and the bars touch each other to indicate the continuity of the data.

Key Features: Bins (intervals), frequency or relative frequency on the y-axis, no gaps between bars unless a bin is empty.
Purpose: To visualize the shape, center, and spread of a distribution.
Steps to Construct:
1. Choose data boundaries and bin size.
2. Build a frequency table (frequency, relative frequency, cumulative frequency).
3. Plot bins on the x-axis and frequencies on the y-axis.

Histogram of singers' heights Histogram of earthquake magnitudes Relative frequency histogram of earthquake magnitudes

Stem-and-Leaf Plots

Stem-and-leaf plots display quantitative data while preserving individual data values. Each value is split into a "stem" (all but the final digit) and a "leaf" (the final digit). This method provides a quick way to visualize the distribution and retains the original data.

Construction: Separate each value into stem and leaf, list stems in order, and align leaves accordingly.
Advantage: Retains all original data values and shows distribution shape.

Stem-and-leaf plot of pulse rates

Dotplots

Dotplots are simple displays where each data value is represented by a dot placed along a number line. Dotplots are useful for small to moderate-sized datasets and allow for easy identification of clusters, gaps, and outliers.

Key Features: Each dot represents one observation; can be displayed horizontally or vertically.

Dotplot of Kentucky Derby winning times

Describing Distributions

Shape, Center, and Spread

When describing a distribution, always address three main aspects: shape, center, and spread.

Shape: Modality (number of peaks), symmetry, skewness, and presence of outliers or gaps.
Center: Mode, median, or mean.
Spread: Range, interquartile range (IQR), variance, and standard deviation.

Shape: Modality and Symmetry

Unimodal: One peak.
Bimodal: Two peaks.
Multimodal: More than two peaks.
Uniform: All bars approximately the same height.
Symmetric: Both sides of the distribution are mirror images.
Skewed: One tail is longer than the other (skewed left or right).

Histogram showing modality Unimodal and bimodal histograms Uniform histogram Symmetric histogram Skewed left and right histograms

Outliers and Gaps

Outliers: Data points that fall far from the main body of the distribution.
Gaps: Intervals with no data, possibly indicating multiple groups.

Histogram with outliers and gaps

Measures of Center

Mode: Most common value.
Median: Middle value when data are ordered; divides data into two equal halves.
Mean (Average): Sum of all values divided by the number of values. For a dataset :

Histogram showing center Median in a histogram Histogram with balancing point (mean)

Choosing Mean or Median

If the distribution is symmetric, mean and median are similar; either can be used.
If the distribution is skewed or has outliers, the median is preferred as it is less affected by extreme values.

Symmetric distribution: mean and median Skewed distribution: median preferred

Measures of Spread

Range: Difference between the maximum and minimum values. Range = Highest Value − Lowest Value
Interquartile Range (IQR): Difference between the upper quartile (Q3, 75th percentile) and lower quartile (Q1, 25th percentile). IQR = Q3 − Q1
Variance: Average of the squared deviations from the mean (for sample variance):
Standard Deviation: Square root of the variance; measures average distance from the mean:

Histogram showing range Histogram showing IQR Histogram showing IQR and percentiles Deviations from the mean

The Five-Number Summary

The five-number summary consists of the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. It provides a concise summary of the distribution and is the basis for constructing boxplots.

Statistic	Description
Minimum	Smallest value (0th percentile)
Q1	Lower quartile (25th percentile)
Median	Middle value (50th percentile)
Q3	Upper quartile (75th percentile)
Maximum	Largest value (100th percentile)

Boxplots

Boxplots are graphical displays of the five-number summary. They are especially useful for comparing distributions between groups and identifying outliers.

Box spans from Q1 to Q3, with a line at the median.
Whiskers extend to the most extreme values within 1.5 IQRs of the quartiles.
Outliers are plotted individually beyond the whiskers.

Boxplot and histogram Boxplot construction steps Boxplot with outliers

Standard Deviation and the Empirical Rule

Standard Deviation in Symmetric Distributions

For unimodal and symmetric distributions, the standard deviation is the most informative measure of spread. The Empirical Rule states:

About 68% of data lie within ±1 standard deviation from the mean.
About 95% within ±2 standard deviations.
About 99.7% within ±3 standard deviations.

Empirical rule (68-95-99.7)

Choosing Appropriate Measures

If the distribution is symmetric, use mean and standard deviation or variance.
If the distribution is skewed or has outliers, use median and IQR.

Symmetric vs. skewed: which measure to use

Summary Table: Measures of Center and Spread

Distribution Type	Center	Spread
Symmetric	Mean or Median	Standard Deviation, Variance, IQR, Range
Skewed/Outliers	Median	IQR, Range

Common Mistakes

Do not use histograms for categorical data.
Do not describe shape, center, and spread using bar charts.
Choose appropriate bin widths for histograms.
Use the correct units and scales for your data.

Key Takeaways

Make and interpret histograms, stem-and-leaf plots, and dotplots.
Describe the shape, center, and spread of a distribution.
Calculate the standard deviation and IQR.
Use the five-number summary to construct a boxplot and identify outliers.