Exploring Data with Graphs and Numerical Summaries
Study Guide - Smart Notes
Chapter 2: Exploring Data with Graphs and Numerical Summaries
Introduction to Data Exploration
Exploring data involves both graphical and numerical methods to understand the distribution, center, and spread of a dataset. Graphical summaries provide a visual representation, while numerical summaries offer precise measures of central tendency and variability.
Graphical summaries help visualize the shape and distribution of data.
Always graph the data first to gain an initial understanding.
Follow up with numerical summaries to describe typical values and the spread of observations.
Section 2.3: Describing the Center of Quantitative Data
Learning Objectives
Calculating the mean
Calculating the median
Comparing the mean and median
Definition of resistant measures
Identifying the mode of a distribution
Mean
The mean is the arithmetic average of a set of observations and represents the center of mass of the data.
Definition: The mean is the sum of all observations divided by the number of observations.
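Formula: $\bar{x} = \frac{\sum x_i}{n}$, where $x_i$ are the observations and $n$ is the number of observations.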
The mean is sensitive to every value in the dataset, including outliers.
Example: For sodium content in cereals, the mean can be calculated using statistical software or calculators.
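The definition of the mean can be sketched in a few lines of Python. The cereal sodium values are not listed in these notes, so this sketch borrows the metabolic-rate data from the standard deviation example in Section 2.4; the variable names are illustrative choices.

```python
from statistics import mean

# Metabolic rates (calories) of 7 men, from the standard deviation example in these notes.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

# Mean = sum of all observations divided by the number of observations.
x_bar = sum(rates) / len(rates)

# The standard library's mean() should agree with the hand calculation.
assert x_bar == mean(rates)
print(x_bar)  # 1600.0
```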
Median
The median is the midpoint of the ordered observations and divides the data into two equal halves.
Definition: The median is the value that separates the higher half from the lower half of the data.
Calculation Steps:
Order the observations from smallest to largest.
If the number of observations (n) is odd, the median is the middle value.
If n is even, the median is the average of the two middle values.
Example: For the ordered data [78, 91, 94, 98, 99, 101, 103, 105, 114], n = 9 (odd), so the median is the 5th value: 99. If one more observation larger than 114 were added, n = 10 (even) and the median would be the average of the 5th and 6th values: (99 + 101)/2 = 100.
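The calculation steps above can be sketched as a small function. The function name is an illustrative choice, and the tenth value (121) is a made-up observation used only to show the even case.

```python
def median(values):
    """Median following the steps above: sort, then pick or average the middle."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value.
        return ordered[mid]
    # Even n: the average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [78, 91, 94, 98, 99, 101, 103, 105, 114]
print(median(data))          # 99
print(median(data + [121]))  # 100.0 (121 is a hypothetical tenth observation)
```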
Mode
The mode is the value that occurs most frequently in a dataset.
In a histogram or bar graph, the mode corresponds to the tallest bar.
Most useful for categorical data, but can be applied to quantitative data as well.
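A minimal sketch of finding the mode by counting how often each value occurs. The data values and the helper name `modes` are made up for illustration; the function returns a list because a dataset can have more than one mode.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

# Made-up categorical data: one clear mode.
colors = ["red", "blue", "red", "green", "blue", "red"]
print(modes(colors))           # ['red']

# Made-up quantitative data: two values tie for most frequent.
print(modes([1, 2, 2, 3, 3]))  # [2, 3]
```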
Comparing Mean and Median
The mean and median are both measures of center, but they respond differently to the shape of the distribution and outliers.
For symmetric distributions, the mean and median are close together.
For skewed distributions, the mean is pulled toward the tail, while the median remains closer to the center.
Mean is preferred for symmetric data; median is preferred for skewed data or data with outliers.
Resistant Measures
A resistant measure is not significantly affected by extreme values (outliers).
The median is resistant to outliers.
The mean is not resistant and can be greatly influenced by outliers.
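The difference between a resistant and a non-resistant measure can be seen by changing a single value. This sketch reuses the data from the median example and replaces the largest value, 114, with a made-up extreme value of 1140.

```python
from statistics import mean, median

data = [78, 91, 94, 98, 99, 101, 103, 105, 114]

# Replace the largest observation with a hypothetical outlier.
with_outlier = data[:-1] + [1140]

# The median does not move; the mean is pulled strongly toward the outlier.
print(median(data), median(with_outlier))                  # 99 99
print(round(mean(data), 1), round(mean(with_outlier), 1))  # 98.1 212.1
```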
Section 2.4: Describing the Spread of Quantitative Data
Learning Objectives
Calculate the range
Calculate the standard deviation
Understand properties of the standard deviation
Interpret the magnitude of s
Apply the Empirical Rule
Range
The range measures the spread by calculating the difference between the largest and smallest values.
Formula: $\text{range} = x_{\max} - x_{\min}$
The range is strongly affected by outliers.
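A short sketch of the range, using the data from the median example. The added value 200 is made up to show how one extreme observation changes the result; `data_range` is an illustrative name.

```python
def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

data = [78, 91, 94, 98, 99, 101, 103, 105, 114]
print(data_range(data))          # 36
print(data_range(data + [200]))  # 122 (200 is a hypothetical extreme value)
```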
Standard Deviation
The standard deviation describes roughly the typical distance of an observation from the mean.
Definition: The standard deviation is the square root of the sum of squared deviations from the mean divided by n - 1.
Formula: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Calculation Steps:
Find the mean.
Calculate the deviation of each value from the mean.
Square each deviation.
Sum the squared deviations.
Divide by n-1 and take the square root.
Example: For metabolic rates of 7 men: [1792, 1666, 1362, 1614, 1460, 1867, 1439], mean = 1600 calories, sum of squared deviations = 214,870, variance = 214,870/6 ≈ 35,811.67, and s = √35,811.67 ≈ 189.24 calories.
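The calculation steps above can be followed one at a time in Python using the metabolic-rate data from the example. The variable names are illustrative; the final check against `statistics.stdev` is only a sanity test, since that function also divides by n - 1.

```python
import math
from statistics import stdev

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]
n = len(rates)

x_bar = sum(rates) / n                       # Step 1: find the mean
deviations = [x - x_bar for x in rates]      # Step 2: deviation of each value
squared = [d ** 2 for d in deviations]       # Step 3: square each deviation
sum_sq = sum(squared)                        # Step 4: sum the squared deviations
s = math.sqrt(sum_sq / (n - 1))              # Step 5: divide by n - 1, take the root

print(x_bar, sum_sq, round(s, 2))  # 1600.0 214870.0 189.24

# The library function should give the same value.
assert math.isclose(s, stdev(rates))
```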
Properties of Standard Deviation
s = 0 only when all observations are identical; otherwise, s > 0.
As the spread increases, s increases.
s has the same units as the original data; the variance ($s^2$) has squared units.
s is not resistant to outliers or skewness.
Empirical Rule
The Empirical Rule applies to bell-shaped (normal) distributions:
Approximately 68% of observations fall within 1 standard deviation of the mean ($\bar{x} \pm s$).
Approximately 95% fall within 2 standard deviations ($\bar{x} \pm 2s$).
Nearly all (99.7%) fall within 3 standard deviations ($\bar{x} \pm 3s$).
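One way to get a feel for the Empirical Rule is to simulate bell-shaped data and count how many values fall within 1, 2, and 3 standard deviations of the mean. This is a sketch under assumptions: the sample is simulated (center 100, spread 15, and the seed are arbitrary choices), and `fraction_within` is a hypothetical helper.

```python
import random
from statistics import mean, stdev

# Simulated bell-shaped data; the parameters and seed are arbitrary.
rng = random.Random(0)
sample = [rng.gauss(100, 15) for _ in range(10_000)]

x_bar = mean(sample)
s = stdev(sample)

def fraction_within(values, center, sd, k):
    """Share of values inside the interval center ± k * sd."""
    low, high = center - k * sd, center + k * sd
    return sum(low <= v <= high for v in values) / len(values)

shares = {k: fraction_within(sample, x_bar, s, k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The printed shares should land near 68%, 95%, and 99.7%, with small differences from run to run if the seed is changed.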
Section 2.5: Measures of Position and Spread
Learning Objectives
Obtain quartiles and the 5-number summary
Calculate interquartile range (IQR) and detect outliers
Draw boxplots
Compare distributions
Calculate a z-score
Percentiles and Quartiles
A percentile is a value below which a given percentage of observations fall. Quartiles divide the data into four equal parts:
First quartile (Q1): 25% of data below
Second quartile (Q2): Median, 50% below
Third quartile (Q3): 75% below
Interquartile Range (IQR)
The IQR measures the spread of the middle 50% of the data.
Formula: $\text{IQR} = Q3 - Q1$
Used to identify potential outliers.
Identifying Outliers
An observation is a potential outlier if it falls below $Q1 - 1.5 \times \text{IQR}$ or above $Q3 + 1.5 \times \text{IQR}$.
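A sketch of the quartiles, the IQR, and the usual 1.5 × IQR outlier check. It assumes one common textbook convention: Q1 is the median of the lower half of the ordered data and Q3 is the median of the upper half, leaving out the overall median when n is odd. Software packages use slightly different conventions, so their quartiles may differ a little. The function names and the added value 140 are illustrative.

```python
from statistics import median

def quartiles(values):
    """Q1, median, Q3 using the median-of-halves convention (an assumption)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle value when forming the upper half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

def potential_outliers(values):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

data = [78, 91, 94, 98, 99, 101, 103, 105, 114]
q1, med, q3 = quartiles(data)
print(q1, med, q3, q3 - q1)              # 92.5 99 104.0 11.5
print(potential_outliers(data))          # []
print(potential_outliers(data + [140]))  # [140] (140 is a hypothetical value)
```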
Five-Number Summary
The five-number summary consists of:
Minimum
First quartile (Q1)
Median
Third quartile (Q3)
Maximum
Boxplots
A boxplot visually displays the five-number summary and highlights outliers.
The box spans from Q1 to Q3, with a line at the median.
Whiskers extend to the smallest and largest non-outlier values.
Outliers are plotted individually.
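The numbers a boxplot is drawn from can be computed without any plotting library. This sketch returns the five-number summary, the whisker ends, and the points that would be plotted individually. It uses `statistics.quantiles` with its default method, which is one of several quartile conventions and can differ slightly from hand calculations; `boxplot_stats` and the value 140 are illustrative.

```python
from statistics import quantiles

def boxplot_stats(values):
    """Five-number summary, whisker ends, and outliers for a boxplot."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers stop at the most extreme values that are not outliers.
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    return {
        "five_number": (ordered[0], q1, med, q3, ordered[-1]),
        "whiskers": (inside[0], inside[-1]),
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }

# Data from the median example plus one hypothetical extreme value.
stats = boxplot_stats([78, 91, 94, 98, 99, 101, 103, 105, 114, 140])
print(stats["five_number"])  # (78, 93.25, 100.0, 107.25, 140)
print(stats["whiskers"])     # (78, 114)
print(stats["outliers"])     # [140]
```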
Comparing Distributions
Boxplots are useful for comparing multiple distributions side by side, though they do not show the shape of a distribution as clearly as histograms do.
Z-Score
The z-score indicates how many standard deviations an observation is from the mean.
Formula: $z = \frac{x - \bar{x}}{s}$
Observations with $z < -3$ or $z > 3$ are considered potential outliers in a bell-shaped (approximately normal) distribution.
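A minimal sketch of computing z-scores, assuming the usual definition: the observation minus the mean, divided by the standard deviation. It reuses the metabolic-rate data from Section 2.4 and flags values more than 3 standard deviations from the mean, which is the common cutoff for bell-shaped data; the variable names are illustrative.

```python
from statistics import mean, stdev

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]
x_bar = mean(rates)
s = stdev(rates)

# z-score: how many standard deviations each observation is from the mean.
z_scores = [(x - x_bar) / s for x in rates]

# Flag observations more than 3 standard deviations from the mean.
flagged = [x for x, z in zip(rates, z_scores) if abs(z) > 3]

print(round(max(z_scores), 2))  # 1.41 (the largest rate, 1867)
print(flagged)                  # []
```

None of the seven rates is flagged here; the largest sits only about 1.4 standard deviations above the mean.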
Section 2.6: Misuse of Graphical Summaries
Guidelines for Constructing Effective Graphs
Label both axes and provide a clear heading.
Start the vertical axis at zero for accurate comparison.
Use simple graphical elements (bars, lines, points).
Be cautious when comparing groups with very different values.
Lesson Summary
Descriptive statistics use graphical and numerical methods to summarize data.
Graphical summaries reveal the shape of a distribution and any outliers; numerical summaries describe center and spread.
Use mean and standard deviation for symmetric data without outliers.
Use the five-number summary for skewed data or data with outliers.
Comparison Table: Measures of Center and Spread
| Measure | Definition | Resistant to Outliers? | Best Used For |
|---|---|---|---|
| Mean | Arithmetic average | No | Symmetric distributions |
| Median | Middle value | Yes | Skewed distributions, data with outliers |
| Mode | Most frequent value | Yes | Categorical data |
| Range | Max - Min | No | Quick measure of spread |
| Standard Deviation | Average distance from mean | No | Symmetric distributions |
| IQR | Q3 - Q1 | Yes | Skewed distributions, data with outliers |