Descriptive Statistics 2: Measures of Variation and Boxplots
Measures of Variation and Dispersion
Introduction
Understanding the spread or variability of data is essential in statistics. Measures of variation quantify how much the data values differ from each other and from the center of the distribution. This section covers the main measures of variation and their applications.
Range
Definition: The range is the simplest measure of dispersion, representing the difference between the maximum and minimum values in a dataset.
Formula: $\text{Range} = x_{\max} - x_{\min}$
Example: If the highest return on an investment is 2700 and the lowest is -2250, the range is 4950.
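The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch: the list `returns` is hypothetical except for its two extremes (2700 and -2250), which come from the example, and `value_range` is an illustrative helper name.

```python
# Hypothetical investment returns; only the maximum (2700) and the
# minimum (-2250) are taken from the example above.
returns = [2700, 450, -300, 1200, -2250]


def value_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)


print(value_range(returns))  # 4950
```

Because the range depends only on the two most extreme observations, changing any of the middle values leaves it unchanged.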
Variance and Standard Deviation
Variance: Measures the average squared deviation of each data point from the mean. There are two types: sample variance and population variance.
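Sample Variance Formula: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, where $\bar{x}$ is the sample mean and $n$ is the sample size.
Population Variance Formula: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, where $\mu$ is the population mean and $N$ is the population size.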
Standard Deviation: The square root of the variance, providing a measure of spread in the same units as the data.
Formula: $s = \sqrt{s^2}$ for a sample, $\sigma = \sqrt{\sigma^2}$ for a population.
Application: Standard deviation is widely used to measure risk or volatility, such as in investment returns.
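The difference between the sample and population versions comes down to the divisor (n - 1 versus N). The sketch below works this through on a small hypothetical dataset (the values in `data` are not from the notes) and cross-checks the hand computation against Python's standard `statistics` module.

```python
import math
import statistics

# Hypothetical data for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)  # 5

# Sum of squared deviations from the mean: 9+1+1+1+0+0+4+16 = 32.
sum_sq = sum((x - mean) ** 2 for x in data)

population_variance = sum_sq / len(data)    # divide by N      -> 4.0
sample_variance = sum_sq / (len(data) - 1)  # divide by n - 1  -> 32/7

population_sd = math.sqrt(population_variance)  # 2.0
sample_sd = math.sqrt(sample_variance)

# The standard library implements the same two definitions.
assert math.isclose(population_variance, statistics.pvariance(data))
assert math.isclose(sample_variance, statistics.variance(data))

print(population_variance, round(sample_variance, 3))
print(population_sd, round(sample_sd, 3))
```

The sample variance is always slightly larger than the population variance computed on the same values, and the gap shrinks as the number of observations grows.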
When to Use Mean and Standard Deviation
Use the mean and standard deviation to summarize symmetric data distributions.
Example: Restaurant rating scores (n = 100, Mean = 21.27, Standard deviation = 2.39) are summarized using these measures when the distribution is approximately symmetric.
Percentiles and Interquartile Range (IQR)
Percentile: The value below which a given percentage of observations fall.
Key Percentiles:
– 25th percentile (lower quartile)
– 50th percentile (median)
– 75th percentile (upper quartile)
Interquartile Range (IQR): Measures the spread of the middle 50% of data values.
Formula: $\text{IQR} = Q_3 - Q_1$, where $Q_1$ is the 25th percentile and $Q_3$ is the 75th percentile.
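Example: For company size data (n = 2546), Median = 41, and the middle 50% of values span the interval [14, 161], that is, $Q_1 = 14$ and $Q_3 = 161$, so IQR = 161 - 14 = 147.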
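As a rough illustration, quartiles and the IQR can be computed with `statistics.quantiles` from the Python standard library. The dataset `data` below is hypothetical, and the `"inclusive"` method is only one of several quartile conventions, so other software may give slightly different values on the same data. The last lines assume that the two numbers reported in the company-size example (14 and 161) are the lower and upper quartiles.

```python
import statistics

# Hypothetical sample (not the company-size data).
data = [1, 3, 5, 7, 9, 11, 13, 15, 17]

# n=4 splits the data into quarters and returns the three cut points.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(q1, median, q3, iqr)  # 5.0 9.0 13.0 8.0

# Assumption: 14 and 161 in the company-size example are Q1 and Q3.
company_q1, company_q3 = 14, 161
company_iqr = company_q3 - company_q1
print(company_iqr)  # 147
```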
How to Summarize Skewed Data
For skewed data, use the median and IQR instead of the mean and standard deviation.
Example: Company size (number of employees) is often right-skewed, so median and IQR are more informative than mean and standard deviation.
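A small sketch shows why the mean is a poor summary under right skew. The employee counts in `sizes` are invented for illustration: nine small-to-medium firms and one very large one.

```python
import statistics

# Hypothetical employee counts: mostly small firms plus one very large firm.
sizes = [5, 8, 12, 20, 35, 41, 60, 90, 150, 5000]

mean_size = statistics.mean(sizes)      # pulled upward by the single large firm
median_size = statistics.median(sizes)  # middle of the sorted values

print(mean_size, median_size)  # 542.1 38.0
```

Nine of the ten hypothetical firms are far smaller than the mean, while the median sits in the middle of the bulk of the data, which is why the median (with the IQR) is the more informative summary here.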
Boxplots
Introduction
Boxplots are graphical representations that display the distribution of data based on quartiles, highlighting the median, spread, and potential outliers.
Components of a Boxplot
Box: Spans from $Q_1$ (25th percentile) to $Q_3$ (75th percentile).
Median: Shown as a line inside the box (50th percentile).
Whiskers: Extend from the box to the smallest and largest data values that lie within 1.5 × IQR of the lower and upper quartiles, respectively.
Outliers: Data points outside the whiskers are plotted individually.
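The components listed above can be sketched as a small function. This is an illustrative implementation of the 1.5 × IQR rule, not the exact procedure of any particular plotting package. The name `boxplot_summary` and the data in `sample` are hypothetical, and the quartiles use the `"inclusive"` convention of `statistics.quantiles`.

```python
import statistics


def boxplot_summary(values):
    """Return the quantities a boxplot draws, using the 1.5 x IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    # Anything beyond the fences is plotted as an individual point.
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical data with one unusually large value.
sample = [10, 12, 13, 14, 15, 16, 17, 18, 40]
summary = boxplot_summary(sample)
print(summary["whisker_low"], summary["whisker_high"], summary["outliers"])
# 10 18 [40]
```

Note that the whiskers stop at actual observations (10 and 18 here), not at the fences themselves (7 and 23).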
Example: Systolic Blood Pressure (200 adults)
Max: 217
$Q_3$: 146.5
Median: 133
$Q_1$: 119
Min: 93.5
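Taking 146.5 as the upper quartile and 119 as the lower quartile (the natural reading of this five-number summary), the 1.5 × IQR rule can be applied directly. The sketch below only uses the five values listed; the individual measurements are not available, so it cannot say where the upper whisker actually ends.

```python
# Five-number summary from the systolic blood pressure example.
minimum, q1, median, q3, maximum = 93.5, 119, 133, 146.5, 217

iqr = q3 - q1                 # 27.5
lower_fence = q1 - 1.5 * iqr  # 77.75
upper_fence = q3 + 1.5 * iqr  # 187.75

has_low_outliers = minimum < lower_fence   # False: 93.5 is above 77.75
has_high_outliers = maximum > upper_fence  # True: 217 is above 187.75

print(iqr, lower_fence, upper_fence)
print(has_low_outliers, has_high_outliers)
```

Under this rule the maximum of 217 would be drawn as an individual outlier point, while the lower whisker would reach all the way down to the minimum of 93.5.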
Comparing Groups with Boxplots
Boxplots are effective for comparing distributions between groups.
Example: Earnings per hour of managers in small (<50 employees) and large (>500 employees) firms.
Median values: Small firm = €26, Large firm = €42
Case Study: Boxplot in Technology
Network latency (NL) is the time taken for data to travel between devices.
Boxplots were used to compare latency measurements between two products, visually demonstrating differences in performance.
Coefficient of Variation
Definition and Formula
The coefficient of variation (CV) expresses the standard deviation as a percentage of the mean, allowing comparison of variability across datasets with different units or means.
Population CV: $CV = \frac{\sigma}{\mu} \times 100\%$
Sample CV: $CV = \frac{s}{\bar{x}} \times 100\%$
Example: Risky Investments
Given share price data for several companies, the coefficient of variation helps identify which investment is riskier.
| Company | Sample size | Mean | Standard deviation | Coefficient of variation |
|---|---|---|---|---|
| A | 1000 | 15.64 | 2.63 | 16.82% |
| B | 250 | 21.52 | 5.61 | 26.07% |
| C | 200 | 10.21 | 1.21 | 11.85% |
| D | 450 | 1.13 | 0.02 | 1.76% |
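Company B has the highest CV (26.07%), meaning its share price varies the most relative to its mean, so it is the riskiest of the four by this measure.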
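The CV column can be recomputed from the printed means and standard deviations. This is a sketch using the rounded table values, so a recomputed CV may differ from the table in the last digit (for company D in particular, where the standard deviation is shown to only two decimals). The helper name `coefficient_of_variation` is illustrative.

```python
# (mean, standard deviation) as printed in the table above.
companies = {
    "A": (15.64, 2.63),
    "B": (21.52, 5.61),
    "C": (10.21, 1.21),
    "D": (1.13, 0.02),
}


def coefficient_of_variation(mean, sd):
    """Sample CV as a percentage: (s / x-bar) * 100."""
    return sd / mean * 100


cvs = {name: coefficient_of_variation(m, s) for name, (m, s) in companies.items()}
riskiest = max(cvs, key=cvs.get)

for name, cv in cvs.items():
    print(f"{name}: {cv:.2f}%")
print("Highest CV:", riskiest)  # Highest CV: B
```

Company B does not simply have the largest standard deviation; it has the largest standard deviation relative to its mean, which is what makes the comparison fair across share prices of very different magnitudes.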
Example: Length of Stay (Nights) for Visitors
| | Package | Independent |
|---|---|---|
| Mean | 3.5 | 8.1 |
| Median | 4.0 | 5.2 |
| Std. Deviation | 2.3 | 4.2 |
| Interquartile Range | 2.0 | 3.9 |
| Coefficient of Variation | 65.7% | 51.9% |
| Sample size | 180 | 225 |
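Package visitors have a shorter length of stay on average and less absolute variability (smaller standard deviation and IQR) than independent visitors. Relative to the mean, however, their length of stay is more variable: the CV is 65.7% for package visitors versus 51.9% for independent visitors.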
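Absolute and relative spread can rank the two groups differently, which a quick check on the table values makes explicit. The dictionary layout and variable names below are illustrative; the numbers are the ones printed in the table.

```python
# Mean, standard deviation and IQR (in nights) from the table above.
stay = {
    "Package": {"mean": 3.5, "sd": 2.3, "iqr": 2.0},
    "Independent": {"mean": 8.1, "sd": 4.2, "iqr": 3.9},
}

# Relative spread: standard deviation as a percentage of the mean.
cv = {group: s["sd"] / s["mean"] * 100 for group, s in stay.items()}

smaller_sd = min(stay, key=lambda g: stay[g]["sd"])  # smaller absolute spread
smaller_cv = min(cv, key=cv.get)                     # smaller relative spread

print({group: round(value, 1) for group, value in cv.items()})
# {'Package': 65.7, 'Independent': 51.9}
print(smaller_sd, smaller_cv)  # Package Independent
```

Which measure is appropriate depends on the question: the standard deviation and IQR describe spread in nights, while the CV describes spread in proportion to the typical length of stay.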
Summary
For symmetric data, use mean and standard deviation to summarize central tendency and dispersion.
For skewed data, use median and IQR.
Boxplots are valuable for identifying outliers and comparing groups visually.