Describing Data: Numerical Measures – Study Notes for Statistics for Business
Describing Data: Numerical Measures
Chapter Overview
This chapter introduces essential numerical measures used to summarize and describe data in business statistics. Understanding these measures allows for effective data analysis, comparison, and interpretation, which are crucial for informed business decision-making.
Central Tendency: Mean, median, and mode
Variation: Range, variance, standard deviation, coefficient of variation
Relative Location: Percentiles, quartiles
Distribution Shape: Symmetry and skewness
Graphical Summaries: Five-number summary, box-and-whisker plots
Relationships: Covariance and correlation
Measures of Central Tendency
Arithmetic Mean
The arithmetic mean (or simply, mean) is the most common measure of central tendency. It is calculated as the sum of all values divided by the number of values.
Population Mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
Sample Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Sensitivity: The mean is affected by extreme values (outliers).
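As a small illustration of this sensitivity, the sketch below computes the mean of a hypothetical data set with and without one extreme value. The figures are made up for illustration and do not come from the notes.

```python
from statistics import mean

# Hypothetical weekly sales figures (illustrative values, not from the notes).
sales = [10, 12, 11, 13, 14]
sales_with_outlier = sales + [60]  # same data plus one extreme value

mean_without = mean(sales)             # sum of values / number of values
mean_with = mean(sales_with_outlier)   # the single outlier pulls the mean up

print("mean without outlier:", mean_without)
print("mean with outlier:", mean_with)
```

Here one unusual observation moves the mean from 12 to 20, even though five of the six values lie between 10 and 14.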
Median
The median is the middle value in an ordered list, so half of the observations lie at or below it and half at or above it. Unlike the mean, it is largely unaffected by outliers.
Median Position: the $\frac{n+1}{2}$-th value in the ordered data
If n is odd, the median is the middle value; if even, it is the average of the two middle values.
Mode
The mode is the value that occurs most frequently in a dataset. It can be used for both numerical and categorical data and is not affected by outliers.
There may be no mode, one mode (unimodal), or multiple modes (bimodal, multimodal).
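The odd/even rule for the median and the different mode cases can be sketched with Python's standard `statistics` module. The data values are hypothetical, and `multimode` assumes Python 3.8 or later.

```python
from statistics import median, multimode

odd_data = [7, 3, 9, 5, 1]        # n = 5: the middle value after sorting
even_data = [7, 3, 9, 5, 1, 11]   # n = 6: average of the two middle values

median_odd = median(odd_data)     # sorted: 1, 3, 5, 7, 9
median_even = median(even_data)   # sorted: 1, 3, 5, 7, 9, 11

modes_one = multimode([2, 3, 3, 4])      # one value is most frequent (unimodal)
modes_two = multimode([2, 2, 3, 3, 4])   # two values tie (bimodal)
modes_tie = multimode([1, 2, 3])         # every value occurs once

print(median_odd, median_even)
print(modes_one, modes_two, modes_tie)
```

Note that when no value repeats, `multimode` returns every value; in the terminology of these notes that case would be read as "no mode".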
Choosing the Best Measure
Mean: Generally preferred, unless outliers are present.
Median: Preferred when data contain outliers or are skewed.
Mode: Useful for categorical data or when identifying the most common value.
Shape of a Distribution
The shape describes how data are distributed:
Symmetric: Mean = Median = Mode (for a unimodal distribution)
Positively Skewed (long right tail): typically Mean > Median > Mode
Negatively Skewed (long left tail): typically Mean < Median < Mode
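The relationship between mean and median can serve as a quick check of skew direction. The helper below, `skew_direction`, is a hypothetical name and only a rough heuristic based on the orderings listed above, not a formal skewness statistic.

```python
from statistics import mean, median

def skew_direction(data):
    """Rough heuristic (assumption for illustration): compare mean and median."""
    m, med = mean(data), median(data)
    if m > med:
        return "positive"    # mean pulled toward a long right tail
    if m < med:
        return "negative"    # mean pulled toward a long left tail
    return "symmetric"

right_tail = [1, 2, 2, 3, 3, 3, 4, 20]         # hypothetical data, one large value
left_tail = [1, 17, 18, 18, 19, 19, 19, 20]    # hypothetical data, one small value
balanced = [1, 2, 3, 4, 5]

print(skew_direction(right_tail), skew_direction(left_tail), skew_direction(balanced))
```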
Measures of Relative Location
Percentiles and Quartiles
Percentiles: Divide ordered data into 100 equal parts. The p-th percentile is the value below which p% of observations fall.
Quartiles: Divide data into four equal segments.
First quartile (Q1): 25% below
Second quartile (Q2): 50% below (the median)
Third quartile (Q3): 75% below
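Quartile Position Formulas (ordered positions, under a common textbook convention): $Q_1$ position $= 0.25(n+1)$, $Q_2$ position $= 0.50(n+1)$, $Q_3$ position $= 0.75(n+1)$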
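Conventions for locating quartiles differ between textbooks and software. The sketch below assumes one common convention, in which the p-th quantile sits at ordered position p(n + 1) and fractional positions are resolved by linear interpolation. The function names and data are hypothetical.

```python
def value_at_position(sorted_data, position):
    """Value at a 1-based, possibly fractional, ordered position,
    using linear interpolation between neighbouring observations."""
    lower = int(position)          # whole part of the position
    fraction = position - lower    # fractional part
    if lower < 1:
        return sorted_data[0]
    if lower >= len(sorted_data):
        return sorted_data[-1]
    below = sorted_data[lower - 1]   # 1-based position -> 0-based index
    above = sorted_data[lower]
    return below + fraction * (above - below)

def quartiles(data):
    """Q1, Q2, Q3 under the assumed p * (n + 1) position convention."""
    ordered = sorted(data)
    n = len(ordered)
    return tuple(value_at_position(ordered, p * (n + 1)) for p in (0.25, 0.50, 0.75))

data = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]   # hypothetical, n = 10
q1, q2, q3 = quartiles(data)                  # positions 2.75, 5.5, 8.25
print(q1, q2, q3)
```

With n = 10 the positions are 2.75, 5.5, and 8.25, so each quartile falls between two observations and is interpolated.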
Five-Number Summary and Box-and-Whisker Plots
Five-Number Summary
Minimum
First Quartile (Q1)
Median (Q2)
Third Quartile (Q3)
Maximum
Order: Minimum ≤ Q1 ≤ Median ≤ Q3 ≤ Maximum
Box-and-Whisker Plot
Graphical representation of the five-number summary
Box shows Q1 to Q3 with a line at the median
Whiskers extend to minimum and maximum values
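The five numbers that a box-and-whisker plot draws can be collected in a few lines. This is a minimal sketch that assumes the quartile convention of `statistics.quantiles` with `method="exclusive"` (interpolation at position p(n + 1)); other tools may give slightly different quartiles. The function name and data are hypothetical.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(data, n=4, method="exclusive")
    return (min(data), q1, q2, q3, max(data))

data = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]   # hypothetical observations
summary = five_number_summary(data)

# Box: Q1 to Q3 with a line at the median; whiskers: minimum and maximum.
print(summary)
```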
Measures of Variability
Range
Definition: Difference between the largest and smallest values
Limitation: Sensitive to outliers and ignores data distribution
Interquartile Range (IQR)
Definition: Spread of the middle 50% of data, $\text{IQR} = Q_3 - Q_1$
Advantage: Reduces the effect of outliers
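To see why the IQR is less sensitive to outliers than the range, the sketch below compares both measures on a hypothetical data set before and after its largest value is replaced by an extreme one. Quartiles again assume the `method="exclusive"` convention of `statistics.quantiles`.

```python
from statistics import quantiles

def data_range(data):
    """Largest value minus smallest value."""
    return max(data) - min(data)

def iqr(data):
    """Q3 - Q1, the spread of the middle 50% of the data."""
    q1, _, q3 = quantiles(data, n=4, method="exclusive")
    return q3 - q1

typical = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]        # hypothetical data
with_outlier = [6, 7, 8, 9, 10, 11, 12, 13, 14, 150]  # largest value replaced

print(data_range(typical), data_range(with_outlier))
print(iqr(typical), iqr(with_outlier))
```

In this example the range jumps from 9 to 144, while the IQR stays at 5.5 because the outlier does not change the middle half of the data.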
Variance and Standard Deviation
Variance: Average squared deviation from the mean
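Population Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Sample Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$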
Standard Deviation: Square root of variance; restores original units
Population: $\sigma = \sqrt{\sigma^2}$
Sample: $s = \sqrt{s^2}$
Coefficient of Variation (CV)
Definition: Measures variation relative to the mean (unitless, percentage)
Population: $CV = \frac{\sigma}{\mu} \times 100\%$
Sample: $CV = \frac{s}{\bar{x}} \times 100\%$
Use: Compare variability between datasets with different units or means
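The population and sample versions of these measures differ only in the divisor (N versus n - 1). A minimal sketch using Python's standard `statistics` module, with hypothetical data:

```python
from statistics import mean, pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations, mean = 5

pop_var = pvariance(data)    # sum of squared deviations divided by N
samp_var = variance(data)    # sum of squared deviations divided by n - 1
pop_sd = pstdev(data)        # square root of the population variance
samp_sd = stdev(data)        # square root of the sample variance

# Coefficient of variation: standard deviation relative to the mean, in percent.
cv_population = 100 * pop_sd / mean(data)
cv_sample = 100 * samp_sd / mean(data)

print(pop_var, pop_sd, cv_population)
print(samp_var, samp_sd, cv_sample)
```

The squared deviations here sum to 32, so the population variance is 32 / 8 = 4 and the sample variance is 32 / 7, which is slightly larger.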
Empirical Rule and Chebyshev's Theorem
Empirical Rule (for bell-shaped distributions)
About 68% of data within 1 standard deviation of the mean
About 95% within 2 standard deviations
About 99.7% within 3 standard deviations
Chebyshev's Theorem (any distribution)
At least $\left(1 - \frac{1}{k^2}\right) \times 100\%$ of the data fall within k standard deviations of the mean (for k > 1)
For k = 2: at least 75% within 2 standard deviations
For k = 3: at least 88.9% (about 89%) within 3 standard deviations
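Chebyshev's bound, 1 - 1/k², is a guaranteed minimum, so the share actually observed in a data set is usually higher. The sketch below computes the bound and compares it with the observed share for a hypothetical data set; both function names are made up for illustration.

```python
from statistics import mean, pstdev

def chebyshev_lower_bound(k):
    """Minimum proportion within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2

def share_within(data, k):
    """Observed proportion of values within k population standard deviations."""
    m, s = mean(data), pstdev(data)
    return sum(1 for x in data if abs(x - m) <= k * s) / len(data)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data: mean 5, standard deviation 2

for k in (1.5, 2, 3):
    print(k, chebyshev_lower_bound(k), share_within(data, k))
```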
z-Score
A z-score standardizes a value by expressing its distance from the mean in terms of standard deviations.
$z = \frac{x - \mu}{\sigma}$ (population)
z > 0: value above mean; z < 0: value below mean; z = 0: value equals mean
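A z-score is the value's deviation from the mean divided by the standard deviation. A minimal sketch, where the mean of 5 and standard deviation of 2 are hypothetical population parameters:

```python
def z_score(x, mu, sigma):
    """Distance of x from the mean mu, in units of the standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

mu, sigma = 5, 2   # hypothetical population mean and standard deviation

z_above = z_score(9, mu, sigma)     # value above the mean -> positive z
z_below = z_score(3, mu, sigma)     # value below the mean -> negative z
z_at_mean = z_score(5, mu, sigma)   # value equal to the mean -> zero

print(z_above, z_below, z_at_mean)
```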
Weighted Mean and Grouped Data
Weighted Mean
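Used when data values carry different weights: $\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$, where $w_i$ is the weight of observation $x_i$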
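Assuming the usual definition (the sum of each value times its weight, divided by the sum of the weights), a weighted mean can be sketched as follows. The prices and order sizes are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Hypothetical example: unit prices paid on three orders of different sizes.
prices = [4.0, 3.0, 2.0]
units = [3, 3, 4]

average_price = weighted_mean(prices, units)   # (12 + 9 + 8) / 10
simple_average = sum(prices) / len(prices)     # ignores order sizes

print(average_price, simple_average)
```

Because the largest order was placed at the lowest price, the weighted mean comes out below the simple average of the three prices.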
Grouped Data Approximations
For data grouped into classes, use class midpoints and frequencies to estimate mean and variance
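Mean: $\bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n}$, where $f_i$ is the frequency and $m_i$ is the midpoint of class $i$, with $n = \sum_{i=1}^{K} f_i$
Variance: $s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n - 1}$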
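For grouped data, each class midpoint stands in for every observation in that class. The sketch below assumes the sample form of the variance (divisor n - 1); the classes and frequencies are hypothetical.

```python
def grouped_mean(midpoints, frequencies):
    """Approximate mean: sum of f * m divided by n, where n = sum of f."""
    n = sum(frequencies)
    return sum(f * m for m, f in zip(midpoints, frequencies)) / n

def grouped_sample_variance(midpoints, frequencies):
    """Approximate sample variance: sum of f * (m - mean)^2 divided by n - 1."""
    n = sum(frequencies)
    x_bar = grouped_mean(midpoints, frequencies)
    return sum(f * (m - x_bar) ** 2 for m, f in zip(midpoints, frequencies)) / (n - 1)

# Hypothetical classes 0 to 10, 10 to 20, 20 to 30, represented by their midpoints.
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]

approx_mean = grouped_mean(midpoints, frequencies)
approx_variance = grouped_sample_variance(midpoints, frequencies)

print(approx_mean, approx_variance)
```

These are approximations: the true mean and variance depend on where the observations actually fall inside each class.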
Measures of Relationship: Covariance and Correlation
Covariance
Measures the direction of the linear relationship between two variables X and Y
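Population: $\text{Cov}(X,Y) = \sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$
Sample: $\text{Cov}(X,Y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$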
Cov(X,Y) > 0: positive relationship; Cov(X,Y) < 0: negative relationship; Cov(X,Y) = 0: no linear relationship
Correlation Coefficient (r)
Measures both the strength and direction of a linear relationship
Population: $\rho = \frac{\text{Cov}(X,Y)}{\sigma_x \sigma_y}$
Sample: $r = \frac{\text{Cov}(X,Y)}{s_x s_y}$
Range: -1 ≤ r ≤ 1
r close to 1: strong positive; r close to -1: strong negative; r close to 0: weak or no linear relationship
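Sample covariance and correlation can be computed directly from their definitions. This is a minimal sketch with hypothetical paired data; it uses the sample divisor n - 1 and the fact that the variance of a variable is its covariance with itself.

```python
from math import sqrt

def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """Covariance divided by the product of the two sample standard deviations."""
    s_x = sqrt(sample_covariance(x, x))   # variance is covariance with itself
    s_y = sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

x = [1, 2, 3, 4, 5]          # hypothetical paired observations
y = [2, 4, 5, 4, 5]
y_linear = [2, 4, 6, 8, 10]  # exactly 2 * x

cov_xy = sample_covariance(x, y)
r_xy = sample_correlation(x, y)
r_linear = sample_correlation(x, y_linear)

print(cov_xy, r_xy, r_linear)
```

Covariance depends on the units of X and Y, whereas r is unitless and always lies between -1 and 1, which is why r is the measure used to judge strength.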
Tabular Example: Summary Statistics for Four Locations
The following table summarizes key statistics for four locations (from the boxplot example):
| Location | Mean | Min | Q1 | Median | Q3 | Max | IQR | Range |
|---|---|---|---|---|---|---|---|---|
| 1 | 10.1 | 6 | 8.0 | 10.5 | 12.5 | 14 | 4.5 | 8 |
| 2 | 13.6 | 8 | 10.75 | 13.5 | 16.75 | 19 | 6.0 | 11 |
| 3 | 17.5 | 11 | 15.0 | 17.5 | 20.5 | 25 | 5.5 | 14 |
| 4 | 12.5 | 8 | 10.5 | 12.0 | 15.0 | 18 | 4.5 | 10 |
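The IQR and Range columns follow directly from the other columns (IQR = Q3 - Q1, Range = Max - Min). The sketch below recomputes them from the values in the table as a consistency check; the variable and function names are hypothetical.

```python
# Values copied from the table: (min, Q1, Q3, max, reported IQR, reported range).
table = {
    1: (6, 8.0, 12.5, 14, 4.5, 8),
    2: (8, 10.75, 16.75, 19, 6.0, 11),
    3: (11, 15.0, 20.5, 25, 5.5, 14),
    4: (8, 10.5, 15.0, 18, 4.5, 10),
}

def recompute(row):
    """Return (Q3 - Q1, max - min) for one table row."""
    minimum, q1, q3, maximum, _, _ = row
    return q3 - q1, maximum - minimum

# True for a location when the recomputed values match the reported ones.
checks = {loc: recompute(row) == (row[4], row[5]) for loc, row in table.items()}

widest_iqr = max(table, key=lambda loc: recompute(table[loc])[0])
widest_range = max(table, key=lambda loc: recompute(table[loc])[1])

print(checks)
print("largest IQR:", widest_iqr, "largest range:", widest_range)
```

On these figures Location 2 has the widest middle 50% while Location 3 has the widest overall range, so the two measures of spread do not rank the locations identically.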
Key Takeaways
Central tendency and variability are fundamental for summarizing data.
Relative location measures (percentiles, quartiles) help interpret individual values within a dataset.
Boxplots and five-number summaries provide visual and numerical summaries of data distribution.
Covariance and correlation quantify relationships between variables, essential for regression and prediction.