Chapter 3: Describing, Exploring, and Comparing Data – Study Notes

Describing, Exploring, and Comparing Data

Overview

This chapter focuses on methods for summarizing and comparing data sets. The main topics include measures of center, measures of variation, and measures of relative standing, which are fundamental for understanding and interpreting statistical data.

Measures of Center

Definition and Importance

Measures of center are values that represent the middle or central point of a data set. They help summarize a data set with a single value, making it easier to interpret and compare.

Mean (Arithmetic Mean): The sum of all data values divided by the number of values. It is sensitive to outliers and uses every data value.
Median: The middle value when data is ordered. It is resistant to outliers and does not use every data value directly.
Mode: The value(s) that occur most frequently. Can be used with qualitative data and may have no mode, one mode, or multiple modes.
Midrange: The value midway between the maximum and minimum values. It is not resistant to outliers and is rarely used in practice.

Mean (Arithmetic Mean)

The mean is calculated as follows:

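Sample Mean: $\bar{x} = \frac{\sum x}{n}$, where n is the number of values in the sample.
Population Mean: $\mu = \frac{\sum x}{N}$, where N is the number of values in the population.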

Properties:

Sample means drawn from the same population tend to vary less than other measures of center.
Mean is not resistant to outliers.
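
To see why the mean is not resistant, the sketch below computes it with and without a single extreme value. The data values are made up for illustration only.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


# Hypothetical data, chosen only to show the effect of one extreme value.
data = [2, 4, 4, 5, 5]
with_outlier = data + [100]

mean_clean = mean(data)            # 20 / 5
mean_outlier = mean(with_outlier)  # 120 / 6
print(mean_clean, mean_outlier)    # 4.0 20.0
```

One added value moves the mean from 4.0 to 20.0, well above every original data value.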

Median

The median is found by sorting the data and:

If odd number of values: Median is the middle value.
If even number of values: Median is the mean of the two middle values.

Properties:

Median is resistant to outliers.
Median does not use every data value directly.
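
The odd and even cases can be sketched as below. This is a minimal illustration with made-up values, not a replacement for a statistics library.

```python
def median(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: mean of the two middle values


odd_median = median([7, 1, 5])               # sorted: 1, 5, 7
even_median = median([7, 1, 5, 3])           # sorted: 1, 3, 5, 7
with_outlier = median([7, 1, 5, 3, 1000])    # sorted: 1, 3, 5, 7, 1000
print(odd_median, even_median, with_outlier)  # 5 4.0 5
```

Adding the extreme value 1000 leaves the median at 5, which is what resistance to outliers means in practice.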

Mode

The mode is the value(s) with the highest frequency. Data sets can be:

Bimodal: Two modes
Multimodal: More than two modes
No mode: No repeated values
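
A small sketch of finding the mode(s) follows. The helper name `modes` and the sample values are illustrative assumptions; it returns an empty list when no value repeats, matching the "no mode" case above.

```python
from collections import Counter


def modes(values):
    """Return every value that has the highest frequency, or [] if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no repeated values, so no mode
    return sorted(v for v, c in counts.items() if c == top)


one_mode = modes([1, 2, 2, 3])
two_modes = modes([1, 1, 2, 2, 3])
no_mode = modes([1, 2, 3])
color_mode = modes(["red", "blue", "red"])  # the mode also works for qualitative data
```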

Midrange

The midrange is calculated as:

Midrange = $\frac{\text{maximum value} + \text{minimum value}}{2}$

Properties:

Very sensitive to extremes; not resistant.
Easy to compute.
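
Because the midrange depends only on the two most extreme values, a single outlier can shift it dramatically. A short sketch with made-up data:

```python
def midrange(values):
    """Value midway between the maximum and the minimum."""
    return (max(values) + min(values)) / 2


# Hypothetical data; only the largest value differs between the two lists.
typical = midrange([2, 4, 4, 5, 5])     # (5 + 2) / 2
extreme = midrange([2, 4, 4, 5, 100])   # (100 + 2) / 2
print(typical, extreme)                 # 3.5 51.0
```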

Round-Off Rules

Mean, median, midrange: Carry one more decimal place than original data.
Mode: Leave as is, no rounding.

Critical Thinking

Always consider whether measures of center are meaningful for the data type and the sampling method used.

Mean from a Frequency Distribution

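When data is summarized in a frequency distribution, the mean is approximated by treating every value in a class as if it were equal to the class midpoint:

$\bar{x} = \frac{\sum (f \cdot x)}{\sum f}$, where x is a class midpoint and f is the frequency of that class.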
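
As a rough sketch, the approximation multiplies each class midpoint by its frequency, sums the products, and divides by the total frequency. The classes and frequencies below are hypothetical.

```python
def mean_from_frequency_table(midpoints, frequencies):
    """Approximate mean: sum of (frequency * midpoint) divided by sum of frequencies."""
    total = sum(frequencies)
    weighted_sum = sum(f * x for f, x in zip(frequencies, midpoints))
    return weighted_sum / total


# Hypothetical classes 0-9, 10-19, 20-29 represented by their midpoints.
midpoints = [4.5, 14.5, 24.5]
frequencies = [2, 5, 3]

approx_mean = mean_from_frequency_table(midpoints, frequencies)  # 155 / 10
```

The result is an approximation because the original values inside each class are no longer known.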

Weighted Mean

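When data values have different weights, the weighted mean is:

$\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$, where w is the weight assigned to the value x.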
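
A typical use of a weighted mean is a grade-point average, where each grade is weighted by credit hours. The sketch below assumes made-up grades and credits.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


# Hypothetical grade points (4 = A, 3 = B, 2 = C) and credit hours per course.
grade_points = [4, 3, 2]
credit_hours = [3, 4, 3]

gpa = weighted_mean(grade_points, credit_hours)  # (12 + 12 + 6) / 10
```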

Measures of Variation

Definition and Importance

Measures of variation describe how spread out the data values are. They are crucial for understanding the reliability and consistency of data.

Range: Difference between maximum and minimum values.
Standard Deviation: Measures how much data values typically deviate from the mean.
Variance: Square of the standard deviation.

Range

Calculated as:

Range = maximum value − minimum value

Properties:

Very sensitive to extremes; not resistant.
Does not reflect variation among all values.

Standard Deviation

Standard deviation quantifies the spread of data values around the mean.

Sample Standard Deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Population Standard Deviation: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$

Properties:

Never negative; zero if all values are identical.
Units are same as original data.
Sample standard deviation is a biased estimator of population standard deviation.
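
The only difference between the sample and population formulas is the divisor (n − 1 versus N). A minimal sketch using made-up data:

```python
import math


def sample_std(values):
    """Sample standard deviation: divide the sum of squared deviations by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in values) / (n - 1))


def population_std(values):
    """Population standard deviation: divide the sum of squared deviations by N."""
    N = len(values)
    mu = sum(values) / N
    return math.sqrt(sum((x - mu) ** 2 for x in values) / N)


# Hypothetical data with mean 5 and sum of squared deviations 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
s = sample_std(data)          # sqrt(32 / 7), about 2.14
sigma = population_std(data)  # sqrt(32 / 8) = 2.0
identical = sample_std([3, 3, 3])  # 0.0, since all values are the same
```

Python's standard `statistics` module provides `stdev` and `pstdev` for the same two calculations.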

Range Rule of Thumb

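Most values lie within 2 standard deviations of the mean. Values outside that interval are considered significant: significantly low values are $\mu - 2\sigma$ or lower, and significantly high values are $\mu + 2\sigma$ or higher.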

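Estimating Standard Deviation

The range rule of thumb also gives a rough estimate of the standard deviation from the range: $s \approx \frac{\text{range}}{4}$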
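
Both uses of the range rule of thumb can be sketched in a few lines. The estimate s ≈ range / 4 is the usual form of the rule; the function names and the numbers (mean 100, standard deviation 15) are illustrative assumptions.

```python
def estimate_std(values):
    """Range rule of thumb: the standard deviation is roughly range / 4."""
    return (max(values) - min(values)) / 4


def significance_limits(mean, std):
    """Values at or beyond these limits are considered significantly low or high."""
    return mean - 2 * std, mean + 2 * std


rough_s = estimate_std([2, 4, 4, 4, 5, 5, 7, 9])  # (9 - 2) / 4
low, high = significance_limits(100, 15)          # hypothetical mean and standard deviation
print(rough_s, low, high)                         # 1.75 70 130
```

The estimate is crude: for the same data the sample standard deviation computed from the formula is about 2.14.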

Variance

Variance is the square of the standard deviation.

Sample Variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
Population Variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$

Properties:

Units are squares of original units.
Not resistant to outliers.
Sample variance is an unbiased estimator of population variance.
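
As a quick check of the relationship between variance and standard deviation, the sketch below uses Python's standard `statistics` module on made-up data.

```python
import statistics

# Hypothetical data with mean 5 and sum of squared deviations 32.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sample_var = statistics.variance(data)       # divides by n - 1: 32 / 7
population_var = statistics.pvariance(data)  # divides by N: 32 / 8
sample_sd = statistics.stdev(data)           # square root of the sample variance
```

If the data were measured in centimeters, the variances here would be in square centimeters, while the standard deviation would be in centimeters.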

Empirical Rule

For bell-shaped distributions:

About 68% of values lie within 1 standard deviation of the mean
About 95% of values lie within 2 standard deviations of the mean
About 99.7% of values lie within 3 standard deviations of the mean
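
For a given mean and standard deviation, the three intervals can be computed directly. The sketch assumes a bell-shaped distribution with hypothetical mean 100 and standard deviation 15.

```python
def empirical_rule_intervals(mean, std):
    """Intervals holding about 68%, 95%, and 99.7% of values in a bell-shaped distribution."""
    return {
        label: (mean - k * std, mean + k * std)
        for k, label in ((1, "68%"), (2, "95%"), (3, "99.7%"))
    }


intervals = empirical_rule_intervals(100, 15)
for label, (low, high) in intervals.items():
    print(f"about {label} of values between {low} and {high}")
```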

Chebyshev’s Theorem

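For any data set, at least $1 - \frac{1}{k^2}$ of the values lie within k standard deviations of the mean, for any k > 1. For example, at least 75% of values lie within 2 standard deviations and at least 89% lie within 3.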
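
The Chebyshev bound, 1 − 1/k², is easy to evaluate for any k > 1. A minimal sketch (the function name is an illustrative choice):

```python
def chebyshev_min_fraction(k):
    """Minimum fraction of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem applies only for k > 1")
    return 1 - 1 / k ** 2


within_2 = chebyshev_min_fraction(2)  # 1 - 1/4
within_3 = chebyshev_min_fraction(3)  # 1 - 1/9
```

Unlike the empirical rule, these are guaranteed lower bounds for any distribution, which is why they are smaller than 95% and 99.7%.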

Coefficient of Variation

Expresses standard deviation relative to the mean as a percentage:

Sample: $CV = \frac{s}{\bar{x}} \cdot 100\%$
Population: $CV = \frac{\sigma}{\mu} \cdot 100\%$
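
A sketch of the sample coefficient of variation, using Python's standard `statistics` module and made-up data:

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical data: mean 5, sample standard deviation about 2.14.
cv = coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9])
print(f"CV = {cv:.1f}%")  # CV = 42.8%
```

Because the units cancel, the CV lets you compare variation between data sets measured on different scales.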

Measures of Relative Standing and Boxplots

Definition and Importance

Measures of relative standing indicate the position of a data value within a data set. Common measures include z scores, percentiles, and quartiles. Boxplots visually summarize these measures.

z Scores

A z score shows how many standard deviations a value is from the mean:

Sample: $z = \frac{x - \bar{x}}{s}$
Population: $z = \frac{x - \mu}{\sigma}$

Properties:

z scores have no units.
z ≤ -2: Significantly low; z ≥ 2: Significantly high.
Negative z: Value below mean; positive z: Value above mean.
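
The z score calculation and the ±2 significance cutoffs can be sketched as below. The mean of 100 and standard deviation of 15 are hypothetical.

```python
def z_score(x, mean, std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std


def classify(z):
    """Apply the cutoffs z <= -2 (significantly low) and z >= 2 (significantly high)."""
    if z <= -2:
        return "significantly low"
    if z >= 2:
        return "significantly high"
    return "not significant"


z_high = z_score(130, 100, 15)  # (130 - 100) / 15
z_below = z_score(85, 100, 15)  # (85 - 100) / 15
print(z_high, classify(z_high))    # 2.0 significantly high
print(z_below, classify(z_below))  # -1.0 not significant
```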

Percentiles

Percentiles divide data into 100 groups, each with about 1% of values. To find the percentile of a value:

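Percentile of x = $\frac{\text{number of values less than } x}{\text{total number of values}} \cdot 100$ (rounded to the nearest whole number)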
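
A short sketch of finding the percentile of a value, assuming the common textbook definition: the count of values less than x, divided by the total count, times 100, rounded to the nearest whole number. The data are made up.

```python
import math


def percentile_of(x, values):
    """Percentile of x: percentage of values strictly less than x, rounded to a whole number."""
    below = sum(1 for v in values if v < x)
    percent = below * 100 / len(values)
    return math.floor(percent + 0.5)  # round half up, avoiding Python's round-half-to-even


scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical data
p_of_7 = percentile_of(7, scores)  # 6 of 10 values are below 7
p_of_1 = percentile_of(1, scores)  # no values are below 1
```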

Converting a Percentile to a Data Value

To find the kth percentile:

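Sort the data from lowest to highest, then compute the locator: $L = \frac{k}{100} \cdot n$, where n is the number of values.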
If L is a whole number, the percentile is midway between the Lth and (L+1)th values.
If L is not a whole number, round up and use the Lth value.
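
The locator procedure can be sketched as follows, with L = (k / 100) · n. The function name and data are illustrative; whole-number k is assumed so that the "is L a whole number" test can use integer arithmetic. Note that statistical software often uses different percentile conventions, so results may not match library functions exactly.

```python
def percentile_value(values, k):
    """Data value at the kth percentile (integer k from 1 to 99), using the locator L."""
    if not 0 < k < 100:
        raise ValueError("k must be between 1 and 99")
    ordered = sorted(values)
    n = len(ordered)
    product = k * n                    # L = product / 100
    if product % 100 == 0:             # L is a whole number
        L = product // 100
        # Midway between the Lth and (L+1)th values (1-based positions).
        return (ordered[L - 1] + ordered[L]) / 2
    # L is not whole: round up and take that value.
    # The 1-based position ceil(L) is the 0-based index floor(L).
    return ordered[product // 100]


data = [2, 4, 4, 4, 5, 5, 7, 9]               # hypothetical, n = 8
p25 = percentile_value(data, 25)              # L = 2, so average the 2nd and 3rd values
p75 = percentile_value(data, 75)              # L = 6, so average the 6th and 7th values
p33 = percentile_value(list(range(1, 11)), 33)  # L = 3.3, round up to the 4th value
```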

Quartiles

Quartiles divide data into four groups:

Q1: First quartile (25th percentile)
Q2: Second quartile (50th percentile, median)
Q3: Third quartile (75th percentile)

Statistics defined using quartiles:

Interquartile Range (IQR): $Q_3 - Q_1$
Semi-interquartile Range: $\frac{Q_3 - Q_1}{2}$
Midquartile: $\frac{Q_3 + Q_1}{2}$
10-90 Percentile Range: $P_{90} - P_{10}$
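
Given quartiles that have already been found, these statistics are one-line calculations. The quartile values below are hypothetical.

```python
def iqr(q1, q3):
    """Interquartile range: the spread of the middle half of the data."""
    return q3 - q1


def semi_iqr(q1, q3):
    """Half of the interquartile range."""
    return (q3 - q1) / 2


def midquartile(q1, q3):
    """Value midway between the first and third quartiles."""
    return (q3 + q1) / 2


q1, q3 = 4, 7  # hypothetical quartiles
spread = iqr(q1, q3)
half_spread = semi_iqr(q1, q3)
center = midquartile(q1, q3)
print(spread, half_spread, center)  # 3 1.5 5.5
```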

5-Number Summary

The 5-number summary consists of:

Minimum
Q1
Median (Q2)
Q3
Maximum
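
A sketch of computing the 5-number summary, assuming quartiles are found with the percentile locator L = (k / 100) · n described in these notes. Other quartile conventions exist, so software output may differ slightly. The data are made up.

```python
def percentile_value(ordered, k):
    """kth percentile of already-sorted data using the locator L = (k / 100) * n."""
    n = len(ordered)
    product = k * n
    if product % 100 == 0:                     # L is a whole number
        L = product // 100
        return (ordered[L - 1] + ordered[L]) / 2
    return ordered[product // 100]             # L rounded up, as a 0-based index


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    return (
        ordered[0],
        percentile_value(ordered, 25),
        percentile_value(ordered, 50),
        percentile_value(ordered, 75),
        ordered[-1],
    )


summary = five_number_summary([9, 4, 2, 5, 4, 7, 5, 4])  # hypothetical data
print(summary)  # (2, 4.0, 4.5, 6.0, 9)
```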

Boxplot (Box-and-Whisker Diagram)

A boxplot is a graphical representation of the 5-number summary. It consists of a box from Q1 to Q3, a line at the median, and whiskers extending to the minimum and maximum values.

Skewness

A boxplot can help identify skewness. A distribution is skewed if it is not symmetric and extends more to one side.

Identifying Outliers for Modified Boxplots

A data value is an outlier if it is:

Above Q3 by more than 1.5 × IQR (greater than Q3 + 1.5 × IQR)
Below Q1 by more than 1.5 × IQR (less than Q1 − 1.5 × IQR)

Modified boxplots mark outliers with special symbols and extend whiskers only to the most extreme non-outlier values.
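
The outlier criteria and the whisker endpoints of a modified boxplot can be sketched as below. The quartiles are passed in as arguments and assumed to be already computed; the data and function names are illustrative.

```python
def outlier_fences(q1, q3):
    """Cutoffs beyond which a value counts as an outlier: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def split_outliers(values, q1, q3):
    """Return (outliers, whisker_low, whisker_high) for a modified boxplot."""
    low, high = outlier_fences(q1, q3)
    outliers = [v for v in values if v < low or v > high]
    inside = [v for v in values if low <= v <= high]
    # Whiskers extend only to the most extreme values that are not outliers.
    return outliers, min(inside), max(inside)


data = [2, 4, 4, 4, 5, 5, 7, 9, 15]  # hypothetical data
q1, q3 = 4, 7                        # quartiles for this data, assumed already computed
fences = outlier_fences(q1, q3)
outliers, whisker_low, whisker_high = split_outliers(data, q1, q3)
print(fences, outliers, whisker_low, whisker_high)  # (-0.5, 11.5) [15] 2 9
```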