Describing Data with Numbers: Measures of Center, Variability, Outliers, and Summary Statistics
Study Guide - Smart Notes
Describing Data with Numbers
Measures of Center (Mean, Median, Mode)
Measures of center help us summarize a dataset by identifying a typical or central value. The three most common measures are the mean, median, and mode.
Mean (Arithmetic Average): The mean is the sum of all observations divided by the number of observations. It is sensitive to extreme values (outliers).
Median: The median is the middle value when the data are ordered from smallest to largest. If there is an even number of observations, the median is the average of the two middle values. The median is resistant to outliers.
Mode: The mode is the value that occurs most frequently in the dataset. There can be more than one mode or no mode at all if all values are unique.
Example: For the dataset 10, 12, 12, 15, 17, 18, 18, 19, 20, 22, 90:
Mean: 253 / 11 = 23
Median: 18
Mode: 12 and 18 (bimodal)
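The three measures for this dataset can be checked with a short sketch using Python's standard `statistics` module. The second half, which drops the value 90 and recomputes, is an added illustration rather than part of the original notes; it is there to show how much more the mean moves than the median.

```python
import statistics

# Dataset from the example above
data = [10, 12, 12, 15, 17, 18, 18, 19, 20, 22, 90]

mean = statistics.mean(data)        # sum of the values divided by their count
median = statistics.median(data)    # middle value of the sorted data
modes = statistics.multimode(data)  # every value tied for most frequent

print("mean:", mean, "median:", median, "modes:", modes)

# Illustration (an assumption for this sketch): drop the extreme value 90
# and recompute to compare how each measure reacts.
without_90 = [x for x in data if x != 90]
mean_without = statistics.mean(without_90)
median_without = statistics.median(without_90)

print("mean without 90:", mean_without, "median without 90:", median_without)
```

Removing the single value 90 lowers the mean from 23 to 16.3, while the median only moves from 18 to 17.5, which is what "sensitive" and "resistant" mean in practice.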
Choosing the Right Measure:
Use the mean for symmetric distributions without outliers.
Use the median for skewed distributions or when outliers are present.
Use the mode for categorical data or to identify the most common category.
The Shape of a Distribution
Understanding the shape of a distribution helps determine which measure of center is most appropriate.
Symmetric Distribution: Mean and median are close together.
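Right-Skewed Distribution: Mean is typically greater than the median, because the long right tail pulls the mean upward.
Left-Skewed Distribution: Mean is typically less than the median, because the long left tail pulls the mean downward.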
Illustration: Histograms can visually show the relationship between mean and median in different distributions.
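As a rough numerical companion to a histogram, the mean and median can be compared directly. The sketch below is only a heuristic: the function name `skew_hint` and the `tolerance` cutoff (a fraction of the data's range) are hypothetical choices made for this example, and a histogram should still be the main check of shape.

```python
import statistics

def skew_hint(data, tolerance=0.05):
    """Guess the shape of a distribution by comparing mean and median.

    `tolerance` is a hypothetical cutoff: the mean and median count as
    "close together" when they differ by at most this fraction of the range.
    """
    mean = statistics.mean(data)
    median = statistics.median(data)
    spread = max(data) - min(data)
    if spread == 0 or abs(mean - median) <= tolerance * spread:
        return "roughly symmetric"
    return "right-skewed" if mean > median else "left-skewed"

# Two datasets that appear elsewhere in these notes
with_outlier = [10, 12, 12, 15, 17, 18, 18, 19, 20, 22, 90]
test_scores = [60, 70, 80, 90, 100]

print(skew_hint(with_outlier))
print(skew_hint(test_scores))
```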
Measures of Variability (Range, IQR, Standard Deviation)
Measures of variability describe how spread out the data are. Common measures include the range, interquartile range (IQR), and standard deviation.
Range: The difference between the maximum and minimum values.
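Quartiles (Q1, Q3): Values that divide the ordered data into quarters. Q1 is the 25th percentile and Q3 is the 75th percentile; one common convention takes Q1 as the median of the lower half of the data and Q3 as the median of the upper half.
Interquartile Range (IQR): The difference between the third and first quartiles, IQR = Q3 - Q1. It measures the spread of the middle 50% of the data and is resistant to outliers.
Variance and Standard Deviation: The variance is the average squared deviation from the mean; for a sample, the sum of squared deviations is usually divided by n - 1 rather than n. The standard deviation is the square root of the variance and is in the same units as the data.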
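Written as formulas, the definitions above might look as follows. The symbols are assumptions introduced for this sketch and do not appear in the original notes: the observations are x_1, ..., x_n, the mean is written as x-bar, sigma squared is the variance that divides by n, and s squared is the sample variance that divides by n - 1.

```latex
\text{Range} = x_{\max} - x_{\min}, \qquad \text{IQR} = Q_3 - Q_1

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

\sigma = \sqrt{\sigma^2}, \qquad s = \sqrt{s^2}
```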
Example: For the dataset 2, 4, 4, 4, 5, 5, 7, 9:
Mean: 5
Range: 7
IQR: 2 (Q1 = 4, Q3 = 6, each taken as the median of its half of the ordered data)
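Standard deviation: 2 (the squared deviations from the mean sum to 32, so the variance is 32 / 8 = 4). Dividing by n - 1 = 7 instead gives a sample standard deviation of about 2.14.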
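The same quantities can be computed with a minimal sketch in Python. The helper `quartiles` is a hypothetical function written for this example; it assumes the median-of-halves convention, so other textbooks or software may report slightly different quartiles for the same data. Both the divide-by-n and the divide-by-(n - 1) standard deviations are computed so they can be compared.

```python
import statistics

def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumed convention: when the number of values is odd, the overall
    median is left out of both halves.
    """
    if len(data) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]    # smallest half of the values
    upper = ordered[-half:]   # largest half of the values
    return statistics.median(lower), statistics.median(upper)

# Dataset from the example above
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)
q1, q3 = quartiles(data)
iqr = q3 - q1
population_sd = statistics.pstdev(data)  # divides by n
sample_sd = statistics.stdev(data)       # divides by n - 1

print("range:", data_range)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("population SD:", population_sd, "sample SD:", round(sample_sd, 2))
```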
Identifying Outliers
Outliers are values that are unusually far from the rest of the data. Two common rules for identifying outliers are the IQR rule and the z-score rule.
IQR Rule: A value is a potential outlier if it is below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
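Z-score Rule: A z-score measures how many standard deviations a value lies from the mean: z = (value - mean) / standard deviation. A value is a potential outlier if its z-score is greater than 3 or less than -3.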
Empirical Rule: For bell-shaped distributions, about 68% of data fall within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3.
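Both rules can be sketched in a few lines of Python and tried on the dataset from the measures-of-center example. The function names are hypothetical, the quartiles again assume the median-of-halves convention, and the z-scores here use the sample standard deviation (dividing by n - 1), which is one of several reasonable choices.

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Values below Q1 - k * IQR or above Q3 + k * IQR."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])    # median of the lower half
    q3 = statistics.median(ordered[-half:])   # median of the upper half
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [x for x in data if x < low_fence or x > high_fence]

def z_scores(data):
    """Distance of each value from the mean, in sample standard deviations."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # assumption: sample SD, dividing by n - 1
    return [(x - mean) / sd for x in data]

def z_outliers(data, threshold=3.0):
    """Values whose z-score is above `threshold` or below -`threshold`."""
    return [x for x, z in zip(data, z_scores(data)) if abs(z) > threshold]

data = [10, 12, 12, 15, 17, 18, 18, 19, 20, 22, 90]

print("IQR rule flags:", iqr_outliers(data))
print("z-score rule flags:", z_outliers(data))
print("largest z-score:", round(max(z_scores(data)), 2))
```

On this small dataset the IQR rule flags 90, while its z-score comes out just under 3 because the extreme value itself inflates the standard deviation. This is one reason the two rules can disagree, especially for small samples.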
The Five-Number Summary and Boxplots
The five-number summary consists of the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. A boxplot is a graphical display of the five-number summary.
Boxplot: The box spans Q1 to Q3 (the IQR), the line inside the box marks the median, and the "whiskers" extend to the smallest and largest values that are not outliers. Outliers are plotted as individual points.
| Statistic | Group A | Group B |
|---|---|---|
| Min | 70.0 | 60.0 |
| Q1 | 74.0 | 68.0 |
| Median | 75.0 | 71.0 |
| Q3 | 76.0 | 74.0 |
| Max | 78.0 | 80.0 |
Example: Side-by-side boxplots allow for comparison of distributions between groups.
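A five-number summary is straightforward to compute, and two summaries can be compared numerically before drawing boxplots. The sketch below uses hypothetical helper names and the median-of-halves quartile convention; the Group A and Group B values are copied from the table above, since the raw data for the groups are not given.

```python
import statistics

def five_number_summary(data):
    """Return min, Q1, median, Q3 and max (median-of-halves quartiles)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    return {
        "min": ordered[0],
        "q1": statistics.median(ordered[:half]),
        "median": statistics.median(ordered),
        "q3": statistics.median(ordered[-half:]),
        "max": ordered[-1],
    }

def spread(summary):
    """Range and IQR computed from a five-number summary."""
    return {
        "range": summary["max"] - summary["min"],
        "iqr": summary["q3"] - summary["q1"],
    }

# Summary of the raw dataset used in the variability example
example = five_number_summary([2, 4, 4, 4, 5, 5, 7, 9])
print(example)

# Five-number summaries copied from the table above
group_a = {"min": 70.0, "q1": 74.0, "median": 75.0, "q3": 76.0, "max": 78.0}
group_b = {"min": 60.0, "q1": 68.0, "median": 71.0, "q3": 74.0, "max": 80.0}

print("Group A:", spread(group_a))
print("Group B:", spread(group_b))
```

Group A has the higher median (75 versus 71) and the smaller spread (IQR of 2 versus 6, range of 8 versus 20), so its box would sit higher and be much narrower than Group B's.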
Choosing Appropriate Summary Statistics
The choice of summary statistic depends on the type of data and the shape of the distribution.
Categorical and Binary Data: Use counts and proportions.
Ordinal Data: Use medians and quartiles.
Quantitative Data: For symmetric distributions without outliers, use mean and standard deviation. For skewed distributions or those with outliers, use median and IQR.
| Test Scores | Mean | Median |
|---|---|---|
| 80, 82, 85, 88, 90 | 85 | 85 |
| 60, 70, 80, 90, 100 | 80 | 80 |
Example: For highly skewed income data, the median is a better measure of center than the mean.
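For quantitative data, the guideline above can be turned into a simple decision rule. The sketch below is a deliberate simplification and not a standard procedure: the hypothetical function `summarize` reports the median and IQR whenever the 1.5 × IQR rule flags at least one value, and the mean and sample standard deviation otherwise. It does not check skewness, which would still need a look at a histogram.

```python
import statistics

def summarize(data):
    """Pick a center/spread pair for quantitative data.

    Assumed rule for this sketch: if the 1.5 * IQR rule flags any value,
    report median and IQR; otherwise report mean and sample SD.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])    # median of the lower half
    q3 = statistics.median(ordered[-half:])   # median of the upper half
    iqr = q3 - q1
    has_outliers = any(
        x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr for x in ordered
    )
    if has_outliers:
        return {"center": ("median", statistics.median(ordered)),
                "spread": ("IQR", iqr)}
    return {"center": ("mean", statistics.mean(ordered)),
            "spread": ("SD", statistics.stdev(ordered))}

# First row of the test-score table above: symmetric, no outliers
scores = summarize([80, 82, 85, 88, 90])
print(scores)

# Dataset with the outlier 90 from the measures-of-center example
with_outlier = summarize([10, 12, 12, 15, 17, 18, 18, 19, 20, 22, 90])
print(with_outlier)
```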
Recap: Key Terms
| Keyword | Definition |
|---|---|
| Mean | The arithmetic average; sum of all observations divided by the number of observations. |
| Median | The middle value when data are ordered; resistant to outliers. |
| Mode | The most frequent value in a dataset. |
| Range | Difference between maximum and minimum values. |
| Quartile | Values that divide the data into four equal parts. |
| IQR | Interquartile range; difference between Q3 and Q1. |
| Variance | Average squared deviation from the mean. |
| Standard deviation | Square root of the variance; typical distance from the mean. |
| Outlier | Observation that lies far from the rest of the data. |
Additional info:
Boxplots are especially useful for comparing multiple groups on the same scale.
Summary statistics should be chosen based on data type and distribution shape to avoid misleading conclusions.