Displaying and Summarizing Quantitative Data & Comparing Distributions

Study Guide - Smart Notes

Displaying and Summarizing Quantitative Data

Frequency Distribution

A frequency distribution organizes quantitative data into classes (bins) and counts the number of occurrences (frequency) in each class. This is a foundational method for summarizing large datasets and identifying patterns.

Steps to Construct a Frequency Distribution:
1. Identify the smallest and largest observations in the dataset.
2. Divide the interval between the smallest and largest observations into non-overlapping subintervals (bins).
3. Assign each observation to exactly one bin. Left end-point convention: every interval except the last includes its left end-point but not its right end-point (so 50 falls in "50 to 60", not "40 to 50"); the last interval includes both end-points.
4. Count the number of observations (frequency) in each bin. Optionally, compute the relative frequency (proportion in each bin).
Example: Age (in years) data of 25 patients: 42, 44, 44, 47, 50, 51, 52, 55, 55, 57, 58, 58, 60, 61, 61, 64, 68, 69, 71, 72, 80, 82, 85

Bin	Frequency	Relative frequency
40 to 50	5	20%
50 to 60	9	36%
60 to 70	6	24%
70 to 80	2	8%
80 to 90	3	12%
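The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `frequency_distribution` and the `edges` list are assumptions made for the example. It uses the 23 ages actually listed above, so the counts for the first two bins come out one lower than in the table, which is based on all 25 patients.

```python
# The 23 ages listed in the example above.
ages = [42, 44, 44, 47, 50, 51, 52, 55, 55, 57, 58, 58,
        60, 61, 61, 64, 68, 69, 71, 72, 80, 82, 85]


def frequency_distribution(data, edges):
    """Count observations per bin using the left end-point convention.

    Every bin is [left, right) except the last, which is [left, right].
    Values outside the outermost edges are not counted.
    """
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for x in data:
        for i in range(len(counts)):
            left, right = edges[i], edges[i + 1]
            if left <= x < right or (i == last and x == right):
                counts[i] += 1
                break
    return counts


edges = [40, 50, 60, 70, 80, 90]          # bin boundaries, as in the table
counts = frequency_distribution(ages, edges)
rel_freq = [c / len(ages) for c in counts]  # proportion in each bin

for i, (c, r) in enumerate(zip(counts, rel_freq)):
    print(f"{edges[i]} to {edges[i + 1]}\t{c}\t{r:.1%}")
```

Note that the value 50 is counted in "50 to 60" and 80 in "80 to 90", as the left end-point convention requires.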

Graphical Presentations of Distributions

Histogram

A histogram is a graphical representation of a frequency distribution. It uses adjacent rectangular bars to show the frequency or relative frequency of data within each bin.

Draw bars for each bin with height proportional to the frequency or relative frequency.
The horizontal axis represents the bins (data intervals), and the vertical axis represents frequency or relative frequency.
Histograms are useful for visualizing the shape, center, and spread of data.
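As a rough sketch of the idea, the following Python snippet draws a text histogram from the bins and frequencies in the age table above. The bars run sideways rather than upward, and the helper name `text_histogram` is an assumption for this example; a plotting library would normally be used instead.

```python
# Bins and frequencies copied from the age table above.
bins = ["40 to 50", "50 to 60", "60 to 70", "70 to 80", "80 to 90"]
frequencies = [5, 9, 6, 2, 3]


def text_histogram(labels, freqs, symbol="#"):
    """Return one line per bin; the bar length equals the bin's frequency."""
    width = max(len(label) for label in labels)
    return [
        f"{label:<{width}} | {symbol * f} ({f})"
        for label, f in zip(labels, freqs)
    ]


lines = text_histogram(bins, frequencies)
print("\n".join(lines))
```

Even in this crude form, the single peak in the 50 to 60 bin and the longer tail toward higher ages are visible.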

Stem-and-Leaf Display

A stem-and-leaf display is a method of displaying quantitative data that retains the original data values while showing the distribution.

Partition each observation into a stem (leading digits) and a leaf (trailing digit).
Write stems in a vertical column; record leaves in order in the row corresponding to their stem.
Do not omit repeated values.
Example: For the age data above:

Stem	Leaf
4	2 4 4 7
5	0 1 2 5 5 7 8 8
6	0 1 1 4 8 9
7	1 2
8	0 2 5
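The display above can be reproduced with a short Python sketch. It assumes non-negative whole-number data where the tens digit is the stem and the units digit is the leaf; the function name `stem_and_leaf` is chosen for illustration.

```python
from collections import defaultdict

# The ages listed in the frequency distribution example.
ages = [42, 44, 44, 47, 50, 51, 52, 55, 55, 57, 58, 58,
        60, 61, 61, 64, 68, 69, 71, 72, 80, 82, 85]


def stem_and_leaf(data):
    """Map each stem (tens digit) to its ordered list of leaves (units digit)."""
    display = defaultdict(list)
    for x in sorted(data):          # sorting keeps the leaves in order
        stem, leaf = divmod(x, 10)
        display[stem].append(leaf)  # repeated values are kept, not dropped
    return dict(display)


display = stem_and_leaf(ages)

# Print every stem from the smallest to the largest, including empty ones.
for stem in range(min(display), max(display) + 1):
    leaves = " ".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem}\t{leaves}")
```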

Describing a Distribution for Quantitative Data

Shape

Unimodal: One peak
Bimodal: Two peaks
Multimodal: More than two peaks
Symmetric: Both sides of the center are approximately mirror images
Skewed: One tail is longer than the other
- Skewed to the left: Long left tail
- Skewed to the right: Long right tail
Outliers: Unusually large or small observations (extreme values)

Center

Where are the observations centered? (e.g., mean or median)

Spread

How spread out are the observations? (e.g., range, interquartile range, standard deviation)

Comparing Distributions

When comparing two or more distributions, plot histograms on the same scale for accurate comparison.
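One way to keep two histograms on the same scale is to compute a single set of bin edges that covers both samples and to compare relative frequencies when the group sizes differ. The sketch below assumes whole-number data and an integer bin width; the two groups are hypothetical values made up for illustration, and the helper names are not from the source.

```python
import math

# Hypothetical samples, invented purely for illustration.
group_a = [42, 47, 51, 55, 58, 61, 64, 72]
group_b = [55, 58, 63, 66, 69, 71, 77, 88]


def shared_edges(samples, width):
    """Bin edges of a fixed integer width that cover every sample."""
    low = math.floor(min(min(s) for s in samples) / width) * width
    high = math.ceil(max(max(s) for s in samples) / width) * width
    return list(range(low, high + width, width))


def bin_counts(data, edges):
    """Counts per bin (left end-point convention; last bin is closed)."""
    width = edges[1] - edges[0]
    counts = [0] * (len(edges) - 1)
    for x in data:
        i = min((x - edges[0]) // width, len(counts) - 1)
        counts[i] += 1
    return counts


edges = shared_edges([group_a, group_b], width=10)
counts_a = bin_counts(group_a, edges)
counts_b = bin_counts(group_b, edges)

for i in range(len(edges) - 1):
    print(f"{edges[i]} to {edges[i + 1]}\t{counts_a[i]}\t{counts_b[i]}")
```

Because both groups share the same bins, the shift of the second group toward higher values can be read directly from the two count columns.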

Measures of Center

Mean

The mean is the arithmetic average of the observations.

Formula:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

where $x_i$ is the $i$th observation and $n$ is the number of observations.
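As a small worked example, the mean of the ages listed in the frequency distribution example can be computed directly from the definition. The function name `mean` is an illustrative choice.

```python
# The 23 ages listed in the frequency distribution example.
ages = [42, 44, 44, 47, 50, 51, 52, 55, 55, 57, 58, 58,
        60, 61, 61, 64, 68, 69, 71, 72, 80, 82, 85]


def mean(data):
    """Arithmetic mean: the sum of the observations divided by their count."""
    return sum(data) / len(data)


x_bar = mean(ages)
print(f"sum = {sum(ages)}, n = {len(ages)}, mean = {x_bar:.2f}")
# prints: sum = 1386, n = 23, mean = 60.26
```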

Median

The median is the middle value of the ordered data set, dividing it into two equal parts.

Arrange data in ascending order.
If the number of observations $n$ is odd: the median is the $\frac{n+1}{2}$th observation.
If $n$ is even: the median is the average of the $\frac{n}{2}$th and $\left(\frac{n}{2}+1\right)$th observations.
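The odd and even cases can be sketched as follows. The function name `median` is chosen for the example, and the short even-sized list is simply the first four listed ages.

```python
# The 23 ages listed in the frequency distribution example.
ages = [42, 44, 44, 47, 50, 51, 52, 55, 55, 57, 58, 58,
        60, 61, 61, 64, 68, 69, 71, 72, 80, 82, 85]


def median(data):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(data)              # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]             # position (n + 1) / 2, counting from 1
    # positions n / 2 and n / 2 + 1, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median(ages))       # n = 23 is odd: the 12th ordered value, 58
print(median(ages[:4]))   # n = 4 is even: (44 + 44) / 2 = 44.0
```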

Measures of Spread

Range

The range is the difference between the maximum and minimum values.

Interquartile Range (IQR)

The interquartile range (IQR) is the range that encloses the middle 50% of the observations.

Quartiles:
- $Q_1$ (first quartile): 25th percentile
- $Q_2$ (second quartile): 50th percentile (median)
- $Q_3$ (third quartile): 75th percentile
Formula:

$$\mathrm{IQR} = Q_3 - Q_1$$

Percentile: The $p$th percentile is the value below which $p\%$ of the observations fall.
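The quartiles and IQR of the listed ages can be computed with Python's standard library, as sketched below. Textbooks and software use slightly different conventions for locating quartiles, so the `method="inclusive"` interpolation rule used here is an assumption and may give values that differ a little from a hand calculation.

```python
from statistics import quantiles

# The 23 ages listed in the frequency distribution example.
ages = [42, 44, 44, 47, 50, 51, 52, 55, 55, 57, 58, 58,
        60, 61, 61, 64, 68, 69, 71, 72, 80, 82, 85]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1  # width of the interval holding the middle 50% of the data

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
# prints: Q1 = 51.5, Q2 = 58.0, Q3 = 68.5, IQR = 17.0
```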

Variance and Standard Deviation

Variance measures the average squared deviation from the mean. Standard deviation (SD) is the square root of the variance and has the same units as the data.

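Variance formula (sample version, dividing by $n - 1$):

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Standard deviation formula:

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$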

Variance and SD are always non-negative.
SD has the same unit as the original data.
Both are zero if all observations are equal.
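The following sketch computes both quantities for the listed ages. It assumes the sample version of the variance, which divides by n - 1 rather than n; the function names are illustrative.

```python
import math

# The 23 ages listed in the frequency distribution example.
ages = [42, 44, 44, 47, 50, 51, 52, 55, 55, 57, 58, 58,
        60, 61, 61, 64, 68, 69, 71, 72, 80, 82, 85]


def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)


def sample_sd(data):
    """Square root of the sample variance; same units as the data (years)."""
    return math.sqrt(sample_variance(data))


print(f"variance = {sample_variance(ages):.2f} years^2")
print(f"SD       = {sample_sd(ages):.2f} years")
print(sample_variance([5, 5, 5]))  # all observations equal, so 0.0
```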

Five-Number Summary & Boxplots

The five-number summary consists of the minimum, $Q_1$, the median ($Q_2$), $Q_3$, and the maximum. A boxplot graphically displays these values and is useful for comparing distributions.

Outliers: Observations above $Q_3 + 1.5 \times \mathrm{IQR}$ or below $Q_1 - 1.5 \times \mathrm{IQR}$ are considered suspected outliers and are plotted as separate points.
Whiskers: Extend to the smallest and largest non-outlier observations.
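The numbers behind a boxplot can be sketched as below. Two assumptions are made: quartiles use the standard library's `method="inclusive"` rule (other conventions exist), and suspected outliers are flagged with the common 1.5 × IQR fences. The extra value 120 is hypothetical, added only to show an outlier being flagged.

```python
from statistics import quantiles

# The 23 ages listed in the frequency distribution example.
ages = [42, 44, 44, 47, 50, 51, 52, 55, 55, 57, 58, 58,
        60, 61, 61, 64, 68, 69, 71, 72, 80, 82, 85]


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)


def suspected_outliers(data):
    """Observations beyond the 1.5 * IQR fences (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lower or x > upper]


def whiskers(data):
    """Smallest and largest observations that are not suspected outliers."""
    flagged = set(suspected_outliers(data))
    kept = [x for x in data if x not in flagged]
    return min(kept), max(kept)


print(five_number_summary(ages))
print(suspected_outliers(ages))           # no value lies beyond the fences
print(suspected_outliers(ages + [120]))   # the hypothetical 120 is flagged
print(whiskers(ages + [120]))             # whiskers stop at 42 and 85
```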

Sensitivity to Outliers

Sensitive to Outliers	Not Sensitive to Outliers
Mean	Median
Range	IQR
Variance, SD

Summary statistics like the mean, range, variance, and SD are easily influenced by outliers.
The median and IQR are more robust to outliers.
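A quick experiment illustrates the difference. In the sketch below, the largest listed age (85) is replaced by 185 to mimic a hypothetical recording error, and each summary statistic is computed before and after. The quartile rule (`method="inclusive"`) and the sample SD are assumptions of this example.

```python
from statistics import mean, median, quantiles, stdev

# The 23 ages listed in the frequency distribution example.
ages = [42, 44, 44, 47, 50, 51, 52, 55, 55, 57, 58, 58,
        60, 61, 61, 64, 68, 69, 71, 72, 80, 82, 85]

# Hypothetical recording error: the largest age, 85, entered as 185.
ages_with_error = ages[:-1] + [185]


def summarize(data):
    """Collect the summary statistics discussed in this section."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "mean": mean(data),
        "median": median(data),
        "range": max(data) - min(data),
        "iqr": q3 - q1,
        "sd": stdev(data),
    }


clean = summarize(ages)
distorted = summarize(ages_with_error)

for name in clean:
    print(f"{name:>6}: {clean[name]:8.2f} -> {distorted[name]:8.2f}")
```

In this particular example the median and IQR do not change at all, while the mean, range, and SD all increase. Robust statistics are not always completely unaffected, but a single extreme value moves them far less.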

Choosing Summary Statistics

If the data distribution is roughly symmetric, use the mean, variance, and SD.
If the distribution is skewed, use the median and IQR.
Report the median together with the IQR, and the mean together with the variance or SD.
For multimodal distributions, summary statistics may not adequately describe the data.
Do not simply discard outliers; consider reporting statistics with and without them.