Comparing and Summarizing Distributions & The Standard Deviation as a Ruler and the Normal Model

Study Guide - Smart Notes

Tailored notes based on your materials, expanded with key definitions, examples, and context.

Comparing and Summarizing Distributions

Displaying Quantitative Data

Quantitative data can be visualized using histograms, stem-and-leaf plots, dotplots, and boxplots. These tools help us understand the shape, center, and spread of a distribution.

Shape: Describes modality (number of peaks), symmetry, skewness, and presence of outliers.
Center: Measured by mean, median, or mode.
Spread: Measured by range, interquartile range (IQR), variance, and standard deviation.

Example: The histogram and boxplot below show the distribution of average wind speed in Hopkins Forest, 2011. The distribution is unimodal and skewed to the right, with several outliers.

Histogram and boxplot of average wind speed

Comparing Groups

Comparing groups is a fundamental aspect of statistical analysis. We use summary statistics and graphical displays to compare distributions across categories or time periods.

Histograms and boxplots allow us to compare shapes, centers, and spreads.
Boxplots are especially useful for side-by-side comparisons, as they summarize key features and hide unnecessary details.

Example: The histograms below compare average wind speed in summer and winter.

Histograms of wind speed for summer and winter

The table summarizes the statistics for each season:

Group	Mean	StdDev	Median	IQR	Q1	Q3	Minimum	Maximum
Winter	1.90	1.29	1.72	1.82	0.84	2.66	0.02	6.73
Summer	1.11	1.10	0.71	1.27	0.34	1.62	0.00	5.47

Summary statistics table for wind speed

Monthly boxplots further illustrate variation throughout the year:

Monthly boxplots of wind speed

Timeplots

Timeplots (time series plots) display data values against time, revealing trends and patterns in temporal data. They are useful for identifying point-to-point variation and underlying smooth trends.

Timeplot of average wind speed Timeplot with smooth trend line

Dealing with Outliers

Outliers are values that deviate markedly from the rest of the data. They may indicate exceptional cases or errors. When reporting summary statistics, it is important to consider the effect of outliers.

Report statistics with and without outliers.
Outliers can distort the mean and standard deviation.

Re-expressing Skewed Data

Skewed distributions can be made more symmetric by applying transformations, such as logarithmic functions. This process, called re-expression, facilitates analysis and interpretation.

Histogram of log-transformed CEO compensation

Global Income Distribution Example

Income distributions often exhibit right skewness, with a long tail of high values. Comparing distributions across years can reveal changes in economic inequality.

Global income distribution in 2003 and 2013

Common Pitfalls in Displaying Data

Avoid inconsistent scales when comparing displays.
Label axes clearly.
Be cautious when comparing groups with different spreads.

Example of inconsistent scales in a graph Boxplots of cotinine levels by smoke exposure

The Standard Deviation as a Ruler and the Normal Model

Standardizing Data: Z-Scores

Standardization allows comparison of values measured in different units or scales. The z-score measures the distance of an observation from the mean, expressed in standard deviations.

Formula: for a sample, for a population
Interpretation: z = 0 means the value equals the mean; positive z indicates above the mean; negative z indicates below the mean.
Standardized values allow comparison across different groups or variables.

Z-score interpretation diagram

Shifting and Scaling Data

Shifting (subtracting a constant) changes the center but not the spread. Scaling (multiplying/dividing by a constant) changes the spread. Standardizing (z-scores) shifts the mean to zero and rescales the standard deviation to one.

Histogram of men's weights Histogram of kg above recommended weight Histograms of weights in kg and pounds

Properties of Z-Scores

Does not change the shape of the distribution.
Changes the center (mean = 0).
Changes the spread (standard deviation = 1).
Values below the mean have negative z-scores; values above have positive z-scores.

Density Curves and the Normal Model

Density curves are smooth curves fitted to histograms of quantitative data. The Normal Model is a bell-shaped, symmetric, unimodal density curve defined by two parameters: mean () and standard deviation ().

Density curves must be non-negative and have total area = 1.
The Normal Model is used to generalize and simplify data analysis.

Histogram with density curve Density curve with area under curve Normal model density curve

Parameters and Statistics

Sample summaries use Latin letters (mean , standard deviation ).
Population parameters use Greek letters (mean , standard deviation ).

Normal model with parameters Normal model with mu and sigma

The 68-95-99.7 Rule

In a Normal distribution:

About 68% of values fall within ±1 standard deviation of the mean.
About 95% fall within ±2 standard deviations.
About 99.7% fall within ±3 standard deviations.

Normal curve with 68-95-99.7 rule Normal curve with shaded regions

Finding Normal Percentiles

To find the percentile for a value in a Normal distribution:

Convert the value to a z-score.
Use the standard Normal table (Table Z) to find the area to the left of the z-score.

Example: If the mean TOEFL score is 540 and the standard deviation is 60, a score of 648 has a z-score of .

Normal curve with z-score and area

To find the proportion between two z-scores, subtract the areas:

Area (z < 1.0) = 0.8413
Area (z < -0.5) = 0.3085
Proportion = 0.8413 - 0.3085 = 0.5328 (53.28%)

Normal curve with shaded region between z-scores Normal table Z

Normal Probability Plots

Normal probability plots help assess whether data are approximately Normal. If the plot is roughly a straight line, the data are nearly Normal. Curvature indicates skewness or deviation from Normality.

Normal probability plot example Normal probability plot with skewness Histogram and normal probability plot for nearly normal data

What Can Go Wrong?

Do not use the Normal model for distributions that are not unimodal and symmetric.
Do not use mean and standard deviation when outliers are present.
Do not round results in the middle of calculations.

Summary of Key Concepts

Choose appropriate tools for comparing distributions (histograms, boxplots).
Treat outliers with care.
Use timeplots for temporal data.
Re-express skewed data to improve symmetry.
Standardize data using z-scores for comparison.
Check for Normality before applying the Normal model.
Use the 68-95-99.7 Rule and Normal tables to find probabilities and percentiles.