
Exploring Data with Graphs and Numerical Summaries

Study Guide - Smart Notes


Chapter 2: Exploring Data with Graphs and Numerical Summaries

Introduction to Data Exploration

Exploring data involves both graphical and numerical methods to understand the distribution, center, and spread of a dataset. Graphical summaries provide a visual representation, while numerical summaries offer precise measures of central tendency and variability.

  • Graphical summaries help visualize the shape and distribution of data.

  • Always graph the data first to gain an initial understanding.

  • Follow up with numerical summaries to describe typical values and the spread of observations.

Section 2.3: Describing the Center of Quantitative Data

Learning Objectives

  • Calculating the mean

  • Calculating the median

  • Comparing the mean and median

  • Definition of resistant measures

  • Identifying the mode of a distribution

Mean

The mean is the arithmetic average of a set of observations and represents the center of mass of the data.

  • Definition: The mean is the sum of all observations divided by the number of observations.

  • Formula: $\bar{x} = \frac{\sum x_i}{n}$

  • The mean is sensitive to every value in the dataset, including outliers.

  • Example: For sodium content in cereals, the mean can be calculated using statistical software or calculators.
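
As a minimal sketch of the definition above, the mean can be computed by hand or with Python's standard library. The values below are illustrative only (an assumption for this sketch); they are not the cereal sodium data, which these notes do not list.

```python
from statistics import mean

# Illustrative values (assumed for this sketch; not the cereal data)
values = [78, 91, 94, 98, 99, 101, 103, 105, 114]

# Mean = sum of all observations / number of observations
x_bar = sum(values) / len(values)

print(round(x_bar, 2))         # 98.11
print(round(mean(values), 2))  # 98.11, same result from the standard library
```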

Median

The median is the midpoint of the ordered observations and divides the data into two equal halves.

  • Definition: The median is the value that separates the higher half from the lower half of the data.

  • Calculation Steps:

    1. Order the observations from smallest to largest.

    2. If the number of observations (n) is odd, the median is the middle value.

    3. If n is even, the median is the average of the two middle values.

  • Example: The ordered data [78, 91, 94, 98, 99, 101, 103, 105, 114] have n = 9 (odd), so the median is the 5th value: 99. If a tenth observation of at least 101 were added, n would be 10 (even) and the median would be the average of the 5th and 6th values: (99 + 101)/2 = 100.
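
The calculation steps above can be sketched as a small function. The name `median_by_steps` and the extra tenth value (120) are assumptions made for illustration; any tenth value of at least 101 would give the same even-n result.

```python
def median_by_steps(values):
    """Median following the three steps: sort, then pick or average the middle."""
    ordered = sorted(values)           # Step 1: order smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                     # Step 2: odd n, take the middle value
        return ordered[mid]
    # Step 3: even n, average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [78, 91, 94, 98, 99, 101, 103, 105, 114]
print(median_by_steps(data))           # 99
print(median_by_steps(data + [120]))   # 100.0 (hypothetical tenth value)
```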

Mode

The mode is the value that occurs most frequently in a dataset.

  • In a histogram or bar graph, the mode corresponds to the highest bar.

  • Most useful for categorical data, but can be applied to quantitative data as well.
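
A short sketch of finding the mode by counting. The list of colors is made-up categorical data, and `modes` is a hypothetical name; a list is returned because a dataset can have more than one most frequent value.

```python
from collections import Counter

# Made-up categorical data for illustration
colors = ["red", "blue", "red", "green", "blue", "red"]

counts = Counter(colors)                 # frequency of each category
top = max(counts.values())               # highest frequency
modes = [value for value, c in counts.items() if c == top]

print(modes)  # ['red']
```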

Comparing Mean and Median

The mean and median are both measures of center, but they respond differently to the shape of the distribution and outliers.

  • For symmetric distributions, the mean and median are close together.

  • For skewed distributions, the mean is pulled toward the tail, while the median remains closer to the center.

  • Mean is preferred for symmetric data; median is preferred for skewed data or data with outliers.

Resistant Measures

A resistant measure is not significantly affected by extreme values (outliers).

  • The median is resistant to outliers.

  • The mean is not resistant and can be greatly influenced by outliers.
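
To illustrate resistance, the sketch below replaces one value with an extreme one and compares how the mean and median react. The data and the outlier (1140, imagined as a typo for 114) are assumptions chosen for illustration.

```python
from statistics import mean, median

data = [78, 91, 94, 98, 99, 101, 103, 105, 114]
# Hypothetical data-entry error: 114 recorded as 1140
with_outlier = data[:-1] + [1140]

print(round(mean(data), 2), median(data))                  # 98.11 99
print(round(mean(with_outlier), 2), median(with_outlier))  # 212.11 99
```

The mean more than doubles, while the median does not move at all.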

Section 2.4: Describing the Spread of Quantitative Data

Learning Objectives

  • Calculate the range

  • Calculate the standard deviation

  • Understand properties of the standard deviation

  • Interpret the magnitude of s

  • Apply the Empirical Rule

Range

The range measures the spread by calculating the difference between the largest and smallest values.

  • Formula: $\text{Range} = \text{maximum} - \text{minimum}$

  • The range is strongly affected by outliers.
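
A minimal sketch of the range and its sensitivity to a single extreme value. The data and the outlier (1140) are illustrative assumptions.

```python
data = [78, 91, 94, 98, 99, 101, 103, 105, 114]
with_outlier = data[:-1] + [1140]   # hypothetical extreme value

# Range = largest value minus smallest value
data_range = max(data) - min(data)
outlier_range = max(with_outlier) - min(with_outlier)

print(data_range)     # 36
print(outlier_range)  # 1062
```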

Standard Deviation

The standard deviation describes the typical distance of an observation from the mean.

  • Definition: The standard deviation s is the square root of the variance, which is the sum of squared deviations from the mean divided by n - 1.

  • Formula: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

  • Calculation Steps:

    1. Find the mean.

    2. Calculate the deviation of each value from the mean.

    3. Square each deviation.

    4. Sum the squared deviations.

    5. Divide by n-1 and take the square root.

  • Example: For metabolic rates of 7 men: [1792, 1666, 1362, 1614, 1460, 1867, 1439], mean = 1600, sum of squared deviations = 214,870, so s = sqrt(214,870/6) ≈ 189.24 calories.
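
The five calculation steps can be traced in code using the metabolic rate data from the example. This is a sketch of the by-hand procedure; the variable names are illustrative.

```python
from math import sqrt

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]
n = len(rates)

x_bar = sum(rates) / n                       # Step 1: mean
deviations = [x - x_bar for x in rates]      # Step 2: deviations from the mean
squared = [d ** 2 for d in deviations]       # Step 3: square each deviation
sum_sq = sum(squared)                        # Step 4: sum of squared deviations
s = sqrt(sum_sq / (n - 1))                   # Step 5: divide by n-1, take the root

print(x_bar, sum_sq, round(s, 2))  # 1600.0 214870.0 189.24
```

The same value is returned by `statistics.stdev(rates)`, which also divides by n - 1.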

Properties of Standard Deviation

  • s = 0 only when all observations are identical; otherwise, s > 0.

  • As the spread increases, s increases.

  • s has the same units as the original data; the variance ($s^2$) has squared units.

  • s is not resistant to outliers or skewness.

Empirical Rule

The Empirical Rule applies to bell-shaped (normal) distributions:

  • Approximately 68% of observations fall within 1 standard deviation of the mean ($\bar{x} \pm s$).

  • Approximately 95% fall within 2 standard deviations ($\bar{x} \pm 2s$).

  • Nearly all (99.7%) fall within 3 standard deviations ($\bar{x} \pm 3s$).
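
As a check on where these percentages come from, the sketch below computes the share of an exactly normal distribution lying within 1, 2, and 3 standard deviations of the mean, using `statistics.NormalDist` from the Python standard library (3.8+). Real data are only approximately bell-shaped, so observed percentages will differ somewhat.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

shares = {}
for k in (1, 2, 3):
    # Proportion of the distribution between -k and +k standard deviations
    shares[k] = (standard_normal.cdf(k) - standard_normal.cdf(-k)) * 100
    print(k, round(shares[k], 1))
# Prints roughly 68.3, 95.4, and 99.7 percent
```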

Section 2.5: Measures of Position and Spread

Learning Objectives

  • Obtain quartiles and the 5-number summary

  • Calculate interquartile range (IQR) and detect outliers

  • Draw boxplots

  • Compare distributions

  • Calculate a z-score

Percentiles and Quartiles

A percentile is a value below which a given percentage of observations fall. Quartiles divide the data into four equal parts:

  • First quartile (Q1): 25% of data below

  • Second quartile (Q2): Median, 50% below

  • Third quartile (Q3): 75% below
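
One common by-hand way to find quartiles, sketched below, is to split the ordered data at the median and take the median of each half, leaving the median itself out of both halves when n is odd. This convention is an assumption for the sketch; statistical software uses several slightly different rules, so results can differ a little. The function name `quartiles` is illustrative.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                   # values below the median position
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # values above it
    return median(lower), median(ordered), median(upper)

data = [78, 91, 94, 98, 99, 101, 103, 105, 114]
print(quartiles(data))  # (92.5, 99, 104.0)
```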

Interquartile Range (IQR)

The IQR measures the spread of the middle 50% of the data.

  • Formula: $\text{IQR} = Q_3 - Q_1$

  • Used to identify potential outliers.

Identifying Outliers

  • An observation is a potential outlier if it falls below $Q_1 - 1.5 \times \text{IQR}$ or above $Q_3 + 1.5 \times \text{IQR}$.
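
A sketch of the 1.5 × IQR criterion, flagging values more than 1.5 × IQR below Q1 or above Q3. The quartiles are found as medians of the lower and upper halves of the ordered data (an assumed convention), and the value 180 is a hypothetical observation added so that something gets flagged.

```python
from statistics import median

# Illustrative data; 180 is a hypothetical extreme value
data = sorted([78, 91, 94, 98, 99, 101, 103, 105, 114, 180])
n = len(data)
half = n // 2

q1 = median(data[:half])
q3 = median(data[half + 1:] if n % 2 else data[half:])
iqr = q3 - q1

low_fence = q1 - 1.5 * iqr    # anything below this is a potential outlier
high_fence = q3 + 1.5 * iqr   # anything above this is a potential outlier
outliers = [x for x in data if x < low_fence or x > high_fence]

print(q1, q3, iqr)            # 94 105 11
print(low_fence, high_fence)  # 77.5 121.5
print(outliers)               # [180]
```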

Five-Number Summary

The five-number summary consists of:

  • Minimum

  • First quartile (Q1)

  • Median

  • Third quartile (Q3)

  • Maximum
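
The five-number summary can be assembled in a few lines. This sketch assumes quartiles are taken as medians of the lower and upper halves of the ordered data; `five_number_summary` is a hypothetical helper name.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    return ordered[0], q1, median(ordered), q3, ordered[-1]

data = [78, 91, 94, 98, 99, 101, 103, 105, 114]
print(five_number_summary(data))  # (78, 92.5, 99, 104.0, 114)
```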

Boxplots

A boxplot visually displays the five-number summary and highlights outliers.

  • The box spans from Q1 to Q3, with a line at the median.

  • Whiskers extend to the smallest and largest non-outlier values.

  • Outliers are plotted individually.
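
Before drawing a boxplot, it can help to compute its pieces. The sketch below finds the box edges, the whisker ends (the most extreme values that are not outliers), and the points plotted individually. It assumes the 1.5 × IQR criterion for outliers and the median-of-halves quartile convention; `boxplot_stats` and the value 180 are hypothetical.

```python
from statistics import median

def boxplot_stats(values):
    """Compute the elements of a boxplot without drawing it."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    return {
        "box": (q1, median(ordered), q3),      # box from Q1 to Q3, line at the median
        "whiskers": (inside[0], inside[-1]),   # smallest and largest non-outliers
        "outliers": [x for x in ordered if x < low_fence or x > high_fence],
    }

stats = boxplot_stats([78, 91, 94, 98, 99, 101, 103, 105, 114, 180])
print(stats["box"])       # (94, 100.0, 105)
print(stats["whiskers"])  # (78, 114)
print(stats["outliers"])  # [180]
```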

Comparing Distributions

  • Boxplots are useful for comparing several distributions side by side, though they do not show the shape of a distribution in as much detail as histograms do.

Z-Score

The z-score indicates how many standard deviations an observation is from the mean.

  • Formula: $z = \frac{x - \bar{x}}{s}$

  • Observations with $z < -3$ or $z > 3$ are considered potential outliers in a bell-shaped (normal) distribution.
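
A sketch of computing z-scores, using the metabolic rate data from the standard deviation example. The cutoff of 3 standard deviations used to flag values is an assumption here, a common convention for bell-shaped data.

```python
from statistics import mean, stdev

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]
x_bar = mean(rates)   # sample mean
s = stdev(rates)      # sample standard deviation (divides by n - 1)

# z = (observation - mean) / standard deviation
z_scores = [(x - x_bar) / s for x in rates]

# Flag observations more than 3 standard deviations from the mean (assumed cutoff)
flagged = [x for x, z in zip(rates, z_scores) if abs(z) > 3]

print(round(max(z_scores), 2))  # 1.41, the z-score of 1867
print(flagged)                  # []
```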

Section 2.6: Misuse of Graphical Summaries

Guidelines for Constructing Effective Graphs

  • Label both axes and provide a clear heading.

  • Start the vertical axis at zero for accurate comparison.

  • Use simple graphical elements (bars, lines, points).

  • Be cautious when comparing groups with very different values.

Lesson Summary

  • Descriptive statistics use graphical and numerical methods to summarize data.

  • Graphical summaries reveal the shape of a distribution and any outliers; numerical summaries describe center and spread.

  • Use mean and standard deviation for symmetric data without outliers.

  • Use the five-number summary for skewed data or data with outliers.

Comparison Table: Measures of Center and Spread

Measure            | Definition                 | Resistant to Outliers? | Best Used For
Mean               | Arithmetic average         | No                     | Symmetric distributions
Median             | Middle value               | Yes                    | Skewed distributions, data with outliers
Mode               | Most frequent value        | Yes                    | Categorical data
Range              | Max - Min                  | No                     | Quick measure of spread
Standard Deviation | Average distance from mean | No                     | Symmetric distributions
IQR                | Q3 - Q1                    | Yes                    | Skewed distributions, data with outliers
