Displaying and Describing Data: Chapter 2 Study Notes (Intro Stats)
Chapter 2: Displaying and Describing Data
Three Rules of Data Analysis
Effective data analysis begins with visualizing data. The following three rules guide the process:
Make a Picture: Helps you think clearly about patterns and relationships hidden in the data table.
Make a Picture: Shows the important features of the data.
Make a Picture: Tells others about the data.
Visual representations are essential for both understanding and communicating statistical findings.
Titanic Misconception
Visual displays can sometimes mislead. For example, when comparing the number of Titanic crew members to passengers, a graphic that scales pictures in two dimensions can trick the eye.
There were about three times as many crew members as second-class passengers.
However, a graphic that draws the crew picture three times as wide and three times as tall gives it nine times the area, so the crew appears nine times as large.
Key Point: Always interpret visual data carefully and be aware of graphical distortions.
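The arithmetic behind the distortion can be sketched in a few lines. This is a minimal illustration, assuming the graphic scales both width and height by the ratio of the counts; the function name `apparent_area_ratio` is made up for this example, and the counts are the crew and second-class totals from the frequency table in Section 2.1.

```python
def apparent_area_ratio(value_ratio: float) -> float:
    """Area ratio the eye sees when width AND height are both scaled by value_ratio."""
    return value_ratio ** 2

crew, second_class = 891, 284  # counts taken from the frequency table in these notes
ratio = crew / second_class    # true ratio of the two counts (a little over 3)

print(f"true ratio of counts: {ratio:.2f}")
print(f"ratio of drawn areas: {apparent_area_ratio(ratio):.2f}")
```

With a ratio of exactly 3, the drawn areas differ by a factor of 9, which is the distortion described above.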
The Area Principle
The Area Principle states that the area occupied by a part of a graph should correspond to the magnitude of the value it represents.
Bars in a bar chart should have equal widths.
Be cautious when using two-dimensional pictures to exhibit one-dimensional data.
Example: In a bar chart of exam grades, each bar's area should accurately reflect the number of students in each grade category.
Section 2.1: Summarizing and Displaying a Categorical Variable
Frequency Tables
A frequency table is a table whose first column displays each distinct outcome and whose second column displays that outcome’s frequency.
If there are many distinct outcomes, combine them into a few categories.
| Class | Count |
|---|---|
| First | 324 |
| Second | 284 |
| Third | 709 |
| Crew | 891 |
Application: Frequency tables are useful for summarizing categorical data such as passenger classes on the Titanic.
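A frequency table is just a tally of how often each outcome occurs. The sketch below builds one with Python's standard `collections.Counter`; the `labels` list is a tiny made-up sample, not the real Titanic data.

```python
from collections import Counter

# Hypothetical raw data: one class label per person (a made-up sample).
labels = ["Crew", "Third", "First", "Crew", "Second", "Third", "Crew"]

freq = Counter(labels)  # maps each distinct outcome to its count

# Print the table, most frequent outcome first.
for outcome, count in freq.most_common():
    print(f"{outcome:<8}{count}")
```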
Relative Frequency Tables
A relative frequency table displays the proportion (or percentage) of each outcome rather than the count.
Useful for comparing categories when sample sizes differ.
| Class | Relative Frequency |
|---|---|
| First | 0.15 |
| Second | 0.13 |
| Third | 0.32 |
| Crew | 0.40 |
Additional info: Relative frequencies are calculated by dividing each count by the total number of observations.
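As a quick check of that calculation, the sketch below divides each Titanic count from the frequency table by the total. The variable names are chosen for this example only.

```python
# Counts from the frequency table in these notes.
counts = {"First": 324, "Second": 284, "Third": 709, "Crew": 891}

total = sum(counts.values())  # total number of people on board in this table
rel_freq = {cls: n / total for cls, n in counts.items()}

for cls, p in rel_freq.items():
    print(f"{cls:<8}{p:.2f}")
```

The relative frequencies should add up to 1 (apart from rounding), which is a handy way to catch arithmetic mistakes.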
Bar Charts
A bar chart displays the frequency or relative frequency of each category using bars of equal width.
Bar charts are effective for general audiences.
Each bar’s height represents the count or proportion for a category.
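A rough text-only bar chart shows the idea without any plotting library. This is only a sketch: `text_bar` is a hypothetical helper, and the scale of 25 people per symbol is an arbitrary choice.

```python
def text_bar(count: int, per_symbol: int) -> str:
    """Return a bar of '#' symbols whose length is proportional to count."""
    return "#" * round(count / per_symbol)

# Counts from the frequency table in these notes.
counts = {"First": 324, "Second": 284, "Third": 709, "Crew": 891}

for cls, n in counts.items():
    # Every bar has the same thickness (one text row), so only length carries the value.
    print(f"{cls:<8}{text_bar(n, 25)} {n}")
```

Because every bar has the same thickness, the area of each bar stays proportional to its count, in line with the area principle.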
Pie Charts and Ring Charts
Pie charts present each category as a slice of a circle, with the size proportional to the whole. Ring charts (or donut charts) partition a ring in proportion to each category’s value.
Both are good for displaying the fraction of the whole that each category represents.
Best used when categories do not overlap.
Choosing the Right Chart
Choose the chart that best tells the story of your data and suits your audience. Charts work best when categories do not overlap, and honesty in representation is crucial.
Bar and pie charts for categorical data.
Histograms, stem-and-leaf, and dotplots for quantitative data.
Section 2.2: Displaying a Quantitative Variable
Histograms
A histogram is a chart that displays quantitative data by grouping values into bins and showing the frequency of values in each bin.
Useful for visualizing the distribution of data.
Bin width selection affects the story told by the histogram.
Example: Earthquake magnitudes and hours worked per week are often displayed using histograms.
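To see how bin width changes the picture, the sketch below counts the same values with two different widths. The helper `bin_counts` and the `hours` data are made up for illustration; bins are taken as half-open intervals, which is one common convention.

```python
import math
from collections import Counter

def bin_counts(values, bin_width, start=0.0):
    """Count values per bin [start + k*w, start + (k+1)*w). Empty bins are omitted."""
    counts = Counter(math.floor((v - start) / bin_width) for v in values)
    return {start + k * bin_width: counts[k] for k in sorted(counts)}

# Hypothetical hours worked per week for 12 people.
hours = [12, 20, 35, 38, 40, 40, 40, 42, 45, 50, 55, 60]

print(bin_counts(hours, 10))  # narrower bins show more detail
print(bin_counts(hours, 20))  # wider bins give a smoother, coarser picture
```

Each dictionary key is the left edge of a bin, and each value is the height of the corresponding histogram bar.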
Stem-and-Leaf Displays
Stem-and-leaf displays show both the shape of the distribution and all individual values. They are best for small data sets.
Stems represent the leftmost digit(s); leaves show the remaining digit(s).
Example: Pulse rates: 5|6 means 56 beats per minute.
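A stem-and-leaf display can be built by splitting each value into its tens digit (stem) and ones digit (leaf). The sketch below assumes non-negative whole numbers; the function name and the pulse rates are invented for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: [leaves]} with stem = tens digit(s) and leaf = ones digit."""
    rows = defaultdict(list)
    for v in sorted(values):       # sorting keeps stems and leaves in order
        rows[v // 10].append(v % 10)
    return dict(rows)

# Hypothetical pulse rates in beats per minute.
pulses = [56, 60, 64, 64, 68, 72, 72, 76, 80, 88]

for stem, leaves in stem_and_leaf(pulses).items():
    print(f"{stem} | {''.join(str(d) for d in leaves)}")
```

Every original value can be read back from the display, which is what sets it apart from a histogram.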
Dotplots
Dotplots display dots to describe the shape of the distribution, suitable for small data sets and visually appealing.
Density Plots
Density plots smooth the bins in a histogram, providing a continuous estimate of the distribution’s shape.
Section 2.3: Shape
Modes
The mode of a histogram is a hump or high-frequency region.
One mode: Unimodal
Two modes: Bimodal
Three or more: Multimodal
Uniform Distributions
A uniform distribution has all bins with approximately the same frequency, resulting in a flat histogram.
Symmetry and Skewness
A symmetric distribution looks the same on the left and right of its center. A histogram is skewed right if the longer tail is on the right, and skewed left if the longer tail is on the left.
Outliers
An outlier is a data value far above or below the rest. Outliers may be errors or important data points.
Examples: CEO income, high fever temperature, Death Valley elevation.
Section 2.4: Center
Median
The median is the center value of ordered data. Half the data values are to the left, half to the right.
For odd n: Median is the middle value.
For even n: Median is the average of the two middle values.
Example: For data 2, 4, 5, 6, 7, 9 (n = 6, even), the two middle values are 5 and 6, so the median is (5 + 6) / 2 = 5.5.
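The odd and even rules can be written out directly. This is a minimal sketch with made-up data; Python's standard library also offers `statistics.median`, which follows the same rules.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)  # the data must be sorted first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the middle pair

print(median([7, 2, 9, 4, 5]))  # odd number of values
print(median([7, 2, 9, 4]))     # even number of values
```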
Mean
The mean is the arithmetic average.
Formula: $\bar{y} = \frac{\sum y}{n}$
Sum all values and divide by the number of data points, n.
Mean vs. Median
Calculate both mean and median, investigate outliers, and decide which to report based on the data’s characteristics. Median is preferred for skewed distributions (e.g., income).
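A small experiment shows why the median is preferred for skewed data. The incomes below are invented for illustration; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

incomes = [32, 35, 38, 41, 44]  # hypothetical incomes, in thousands of dollars
with_ceo = incomes + [900]      # the same group plus one extreme value

print("without outlier:", mean(incomes), median(incomes))
print("with outlier:   ", round(mean(with_ceo), 1), median(with_ceo))
```

One extreme value pulls the mean far above most of the data, while the median barely moves.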
Section 2.5: Spread
Range
The range is the difference between the maximum and minimum values.
Formula: $\text{Range} = \max - \min$
Sensitive to outliers.
Percentiles and Quartiles
Percentiles divide ordered data into 100 equal-sized groups. The p-th percentile is the value below which p percent of the data lie.
Median is the 50th percentile.
First quartile (Q1) is the 25th percentile.
Third quartile (Q3) is the 75th percentile.
Interquartile Range (IQR)
The interquartile range (IQR) is the difference between the upper and lower quartiles.
Formula: $\text{IQR} = Q_3 - Q_1$
Measures the range of the middle half of the data.
Not sensitive to outliers.
Example: If Q1 = 23 and Q3 = 44, then IQR = 21.
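Quartiles can be found in more than one way, and textbooks and software do not all agree. The sketch below uses one common convention, assumed here for illustration: Q1 and Q3 are the medians of the lower and upper halves of the sorted data, leaving out the overall median when n is odd. The function name and data are made up.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves of the sorted data.

    Needs at least two values. For odd n the middle value is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]   # smallest half of the data
    upper = ordered[-half:]  # largest half of the data
    return median(lower), median(upper)

data = [12, 15, 19, 23, 27, 31, 36, 44, 52]  # hypothetical values
q1, q3 = quartiles(data)
iqr = q3 - q1

print("Q1 =", q1, " Q3 =", q3, " IQR =", iqr)
```

Other conventions can give slightly different quartiles for the same data, especially in small data sets.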
Standard Deviation and Variance
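Variance measures how far the data are spread from the mean: add up the squared deviations from the mean and divide by n − 1.
Formula: $s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$, where $\bar{y}$ is the mean.
Units are the square of the original data units.
Standard deviation is the square root of the variance.
Formula: $s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$
Represents roughly the typical distance of a data value from the mean.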
Small standard deviation: data close to mean; large: data spread out.
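The calculation can be followed step by step in code. This sketch assumes the sample version of the variance, with n − 1 in the denominator; the function names and data are chosen for this example only.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)

def sample_sd(values):
    """Standard deviation: the square root of the sample variance."""
    return math.sqrt(sample_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values with mean 5

print("variance:", round(sample_variance(data), 3))
print("standard deviation:", round(sample_sd(data), 3))
```

Python's standard `statistics.variance` and `statistics.stdev` use the same n − 1 divisor and can be used to check hand calculations.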
Reporting Center and Spread
For skewed distributions: report median and IQR.
For symmetric distributions: report mean and standard deviation.
Example: Credit card expenditures: mean and standard deviation are affected by outliers, while median and IQR are more robust.
Step-by-Step: Summarizing a Distribution
To summarize a quantitative variable:
Make a histogram or stem-and-leaf display.
Discuss shape (unimodal, symmetric, outliers).
Report center and spread (median/IQR or mean/SD).
Discuss unusual features (multiple modes, outliers).
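The numerical part of these steps can be collected in one small helper. This is a sketch only: `summarize` is a hypothetical function, the data are made up, and `statistics.quantiles` is called with its default method, which is just one of several quartile conventions.

```python
from statistics import mean, median, quantiles, stdev

def summarize(values):
    """Return both pairs of center/spread summaries so they can be compared."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    return {
        "n": len(values),
        "median": median(values),
        "iqr": q3 - q1,
        "mean": mean(values),
        "sd": stdev(values),
    }

data = [12, 15, 19, 23, 27, 31, 36, 44, 52]  # hypothetical values
summary = summarize(data)
for name, value in summary.items():
    print(f"{name:<7}{value:.2f}")
```

The shape of the distribution still has to be judged from a picture; the numbers only say which pair of summaries to report once the shape is known.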
What Can Go Wrong?
Do not violate the area principle.
Keep displays honest and clear.
Do not make histograms of categorical variables.
Do not compute numerical summaries for categorical variables.
Choose appropriate bin widths for histograms.
Sort values before finding median or percentiles.
Do not report excessive decimal places.
Do not round in the middle of calculations.
Watch for multiple modes and outliers.
Be aware of inappropriate summaries.