Displaying and Summarizing Quantitative Data & Comparing Distributions
Study Guide - Smart Notes
Displaying and Summarizing Quantitative Data
Frequency Distribution
A frequency distribution organizes quantitative data into classes (bins) and counts the number of occurrences (frequency) in each class. This is a foundational method for summarizing large datasets and identifying patterns.
Steps to Construct a Frequency Distribution:
Identify the smallest and largest observations in the dataset.
Divide the interval between the smallest and largest observations into non-overlapping subintervals (bins).
Assign each observation to a bin, ensuring each falls into one and only one subinterval. Left end-point convention: All intervals (except the last) include the left end-point but not the right.
Count the number of observations (frequency) in each bin. Optionally, compute the relative frequency (proportion in each bin).
Example: Age (in years) data of 25 patients: 42, 44, 44, 47, 50, 51, 52, 55, 55, 57, 58, 58, 60, 61, 61, 64, 68, 69, 71, 72, 80, 82, 85
| Bin | Frequency | Relative frequency |
|---|---|---|
| 40 to 50 | 5 | 20% |
| 50 to 60 | 9 | 36% |
| 60 to 70 | 6 | 24% |
| 70 to 80 | 2 | 8% |
| 80 to 90 | 3 | 12% |
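The construction steps can be sketched in Python. This is a minimal illustration, not a prescribed method: the function name `frequency_distribution` and the `edges` list are assumptions made for this example, and following the left end-point convention the last bin is taken to include its right end-point. The `ages` list holds the values exactly as listed in the example; only 23 values appear in that listing, so the counts for the first two bins come out one lower than in the table.

```python
def frequency_distribution(data, edges):
    """Count observations in the bins defined by consecutive edges.

    Left end-point convention: each bin includes its left edge and excludes
    its right edge, except the last bin, which also includes its right edge.
    """
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for x in data:
        for i in range(len(counts)):
            lo, hi = edges[i], edges[i + 1]
            if lo <= x < hi or (i == last and x == hi):
                counts[i] += 1
                break
        else:
            # No bin matched: the bins do not cover this observation
            raise ValueError(f"{x} falls outside the bins")
    n = len(data)
    # One row per bin: (left edge, right edge, frequency, relative frequency)
    return [(edges[i], edges[i + 1], c, c / n) for i, c in enumerate(counts)]


# Ages as listed in the example (23 values appear in the listing)
ages = [42, 44, 44, 47, 50, 51, 52, 55, 55, 57, 58, 58,
        60, 61, 61, 64, 68, 69, 71, 72, 80, 82, 85]
edges = [40, 50, 60, 70, 80, 90]

table = frequency_distribution(ages, edges)
for lo, hi, count, rel in table:
    print(f"{lo} to {hi}: {count} ({rel:.0%})")
```

An age of exactly 50 is counted in the 50 to 60 bin, not the 40 to 50 bin, which is what the left end-point convention requires.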
Graphical Presentations of Distributions
Histogram
A histogram is a graphical representation of a frequency distribution. It uses adjacent rectangular bars to show the frequency or relative frequency of data within each bin.
Draw bars for each bin with height proportional to the frequency or relative frequency.
The horizontal axis represents the bins (data intervals), and the vertical axis represents frequency or relative frequency.
Histograms are useful for visualizing the shape, center, and spread of data.
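As a rough illustration, a histogram can be drawn as text with one row per bin. This sketch is rotated relative to the description above: bins run down the page and bar length stands in for bar height. The function name `text_histogram` and the `#` bar symbol are choices made for this example only, and the data are the 23 ages listed in the frequency distribution example.

```python
def text_histogram(data, edges, symbol="#"):
    """Return one text line per bin; bar length equals the bin frequency."""
    lines = []
    n_bins = len(edges) - 1
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        is_last = i == n_bins - 1
        # Left end-point convention; the last bin also includes its right edge
        count = sum(1 for x in data if lo <= x < hi or (is_last and x == hi))
        lines.append(f"{lo:>3}-{hi:<3}| {symbol * count} {count}")
    return lines


ages = [42, 44, 44, 47, 50, 51, 52, 55, 55, 57, 58, 58,
        60, 61, 61, 64, 68, 69, 71, 72, 80, 82, 85]
edges = [40, 50, 60, 70, 80, 90]

lines = text_histogram(ages, edges)
print("\n".join(lines))
```

The longest bar sits in the 50 to 60 bin and the bars thin out toward the higher ages, which is the kind of shape information a histogram is meant to reveal.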
Stem-and-Leaf Display
A stem-and-leaf display is a method of displaying quantitative data that retains the original data values while showing the distribution.
Partition each observation into a stem (leading digits) and a leaf (trailing digit).
Write stems in a vertical column; record leaves in order in the row corresponding to their stem.
Do not omit repeated values.
Example: For the age data above:
| Stem | Leaf |
|---|---|
| 4 | 2 4 4 7 |
| 5 | 0 1 2 5 5 7 8 8 |
| 6 | 0 1 1 4 8 9 |
| 7 | 1 2 |
| 8 | 0 2 5 |
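The stem-and-leaf construction can be sketched as follows, assuming non-negative whole-number data with the tens as the stem and the units digit as the leaf. The function name `stem_and_leaf` is hypothetical; the data are the ages as listed in the example.

```python
from collections import defaultdict


def stem_and_leaf(data):
    """Map each stem (tens) to its ordered list of leaves (units digit).

    Assumes non-negative integers. Repeated values are kept, and stems with
    no observations are included so that gaps in the data stay visible.
    """
    rows = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)
        rows[stem].append(leaf)
    return {s: rows.get(s, []) for s in range(min(rows), max(rows) + 1)}


ages = [42, 44, 44, 47, 50, 51, 52, 55, 55, 57, 58, 58,
        60, 61, 61, 64, 68, 69, 71, 72, 80, 82, 85]

display = stem_and_leaf(ages)
for stem, leaves in display.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```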
Describing a Distribution for Quantitative Data
Shape
Unimodal: One peak
Bimodal: Two peaks
Multimodal: More than two peaks
Symmetric: Both sides of the center are approximately mirror images
Skewed: One tail is longer than the other
Skewed to the left: Long left tail
Skewed to the right: Long right tail
Outliers: Unusually large or small observations (extreme values)
Center
Where are the observations centered? (e.g., mean or median)
Spread
How spread out are the observations? (e.g., range, interquartile range, standard deviation)
Comparing Distributions
When comparing two or more distributions, plot histograms on the same scale for accurate comparison.
Measures of Center
Mean
The mean is the arithmetic average of the observations.
Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
where $x_i$ is the $i$th observation and $n$ is the number of observations.
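A short worked example of the mean, using the 23 ages listed in the frequency distribution example. The function name `mean` is just a label for this sketch.

```python
def mean(xs):
    """Arithmetic average: the sum of the observations divided by their number."""
    if not xs:
        raise ValueError("mean of an empty data set is undefined")
    return sum(xs) / len(xs)


ages = [42, 44, 44, 47, 50, 51, 52, 55, 55, 57, 58, 58,
        60, 61, 61, 64, 68, 69, 71, 72, 80, 82, 85]

print(sum(ages), len(ages))   # 1386 23
print(round(mean(ages), 2))   # 60.26
```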
Median
The median is the middle value of the ordered data set, dividing it into two equal parts.
Arrange data in ascending order.
If $n$ is odd: the median is the $\frac{n+1}{2}$th observation.
If $n$ is even: the median is the average of the $\frac{n}{2}$th and $\left(\frac{n}{2}+1\right)$th observations.
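The odd and even cases can be sketched in a few lines. Python lists are 0-based, so the $k$th observation of the ordered data sits at index $k-1$; the function name `median` is only a label for this example.

```python
def median(xs):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(xs)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the (n + 1) / 2-th observation, which is index mid in 0-based terms
        return ordered[mid]
    # n even: average of the (n / 2)-th and (n / 2 + 1)-th observations
    return (ordered[mid - 1] + ordered[mid]) / 2


ages = [42, 44, 44, 47, 50, 51, 52, 55, 55, 57, 58, 58,
        60, 61, 61, 64, 68, 69, 71, 72, 80, 82, 85]

print(median(ages))          # 58 (the 12th of the 23 listed values)
print(median([1, 2, 3, 4]))  # 2.5
```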
Measures of Spread
Range
The range is the difference between the maximum and minimum values.
Interquartile Range (IQR)
The interquartile range (IQR) is the width of the interval that encloses the middle 50% of the observations.
Quartiles:
$Q_1$ (first quartile): 25th percentile
$Q_2$ (second quartile): 50th percentile (median)
$Q_3$ (third quartile): 75th percentile
Formula: $\text{IQR} = Q_3 - Q_1$
Percentile: The $p$th percentile is the value below which $p\%$ of the observations fall.
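Textbooks and software use several slightly different conventions for computing quartiles. The sketch below assumes one common convention, which may differ from the one used in your course: $Q_1$ and $Q_3$ are the medians of the lower and upper halves of the ordered data, with the overall median left out of both halves when $n$ is odd. The names `quartiles` and `_median` are hypothetical.

```python
def _median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(xs):
    """Return (Q1, Q2, Q3) using the median-of-halves convention (an assumption)."""
    ordered = sorted(xs)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]             # values below the median position
    upper = ordered[half + (n % 2):]   # values above the median position
    return _median(lower), _median(ordered), _median(upper)


ages = [42, 44, 44, 47, 50, 51, 52, 55, 55, 57, 58, 58,
        60, 61, 61, 64, 68, 69, 71, 72, 80, 82, 85]

q1, q2, q3 = quartiles(ages)
print(q1, q2, q3)       # 51 58 69
print("IQR =", q3 - q1)  # IQR = 18
```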
Variance and Standard Deviation
Variance measures the average squared deviation from the mean. Standard deviation (SD) is the square root of the variance and has the same units as the data.
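Variance formula (sample form, dividing by $n-1$): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Standard deviation formula: $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$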
Variance and SD are always non-negative.
SD has the same unit as the original data.
Both are zero if all observations are equal.
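A small sketch of the computation, assuming the sample form of the variance (sum of squared deviations divided by $n-1$); some texts divide by $n$ instead. The function names and the short example list are made up for illustration.

```python
import math


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1 (assumed form)."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)


def sample_sd(xs):
    """Square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(xs))


# Hypothetical data: mean is 5, squared deviations sum to 32
values = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(values))     # 32 / 7, about 4.57
print(sample_sd(values))           # about 2.14

# All observations equal: variance and SD are both zero
print(sample_variance([5, 5, 5]))  # 0.0
```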
Five-Number Summary & Boxplots
The five-number summary consists of: minimum, $Q_1$, median ($Q_2$), $Q_3$, and maximum. A boxplot graphically displays these values and is useful for comparing distributions.
Outliers: Observations above $Q_3 + 1.5 \times \text{IQR}$ or below $Q_1 - 1.5 \times \text{IQR}$ are considered suspected outliers and are plotted as separate points.
Whiskers: Extend to the smallest and largest non-outlier observations.
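The numbers behind a boxplot can be computed as in the sketch below. It assumes the common $1.5 \times \text{IQR}$ rule for suspected outliers and the median-of-halves convention for quartiles; other conventions give slightly different quartiles. The function name `boxplot_stats` and the second, small data set are hypothetical.

```python
def _median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2


def boxplot_stats(xs):
    """Five-number summary, 1.5 * IQR fences, whisker ends, and suspected outliers."""
    ordered = sorted(xs)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    q1 = _median(ordered[:half])
    q2 = _median(ordered)
    q3 = _median(ordered[half + (n % 2):])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    return {
        "five_number": (ordered[0], q1, q2, q3, ordered[-1]),
        "fences": (low_fence, high_fence),
        # Whiskers reach the most extreme observations that are not outliers
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in ordered if x < low_fence or x > high_fence],
    }


ages = [42, 44, 44, 47, 50, 51, 52, 55, 55, 57, 58, 58,
        60, 61, 61, 64, 68, 69, 71, 72, 80, 82, 85]
print(boxplot_stats(ages))

# Hypothetical data with one unusually small and one unusually large value
print(boxplot_stats([1, 10, 11, 12, 13, 14, 40]))
```

For the listed ages the fences are 24 and 96, so no observation is flagged and the whiskers run from 42 to 85. In the hypothetical data, 1 and 40 fall outside the fences and would be drawn as separate points.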
Sensitivity to Outliers
| Sensitive to Outliers | Not Sensitive to Outliers |
|---|---|
| Mean | Median |
| Range | IQR |
| Variance, SD | |
Summary statistics like the mean, range, variance, and SD are easily influenced by outliers.
The median and IQR are more robust to outliers.
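A quick way to see this is to compute the same summaries with and without an extreme value. In the sketch below, the second data set is hypothetical: it replaces the largest listed age (85) with 850, as if a recording error had occurred. The quartiles come from Python's `statistics.quantiles` with its default method, which is only one of several quartile conventions.

```python
import statistics


def summarize(xs):
    """Common summary statistics for a list of numbers."""
    q1, _, q3 = statistics.quantiles(xs, n=4)  # default 'exclusive' method
    return {
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "range": max(xs) - min(xs),
        "iqr": q3 - q1,
        "sd": statistics.stdev(xs),  # sample SD (n - 1 in the denominator)
    }


ages = [42, 44, 44, 47, 50, 51, 52, 55, 55, 57, 58, 58,
        60, 61, 61, 64, 68, 69, 71, 72, 80, 82, 85]

# Hypothetical recording error: 85 entered as 850
with_error = ages[:-1] + [850]

clean = summarize(ages)
dirty = summarize(with_error)
for name in clean:
    print(f"{name:>6}: {clean[name]:8.2f} -> {dirty[name]:8.2f}")
```

The mean, range, and SD all jump when the single value changes, while the median and IQR stay where they were. Printing both columns side by side is also one simple way to report statistics with and without a suspected outlier.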
Choosing Summary Statistics
If the data distribution is roughly symmetric, use the mean, variance, and SD.
If the distribution is skewed, use the median and IQR.
The median is always reported with the IQR; the mean with the variance or SD.
For multimodal distributions, summary statistics may not adequately describe the data.
Do not simply discard outliers; consider reporting statistics with and without them.