Exploring Data with Graphs and Numerical Summaries (STA 215, Chapter 2)
Study Guide - Smart Notes
Section 2.1: Different Types of Data
Understanding the types of variables in a data set is essential for selecting appropriate methods of analysis and visualization. Variables can be classified as either categorical or quantitative, each requiring different summary and graphical techniques.
Variable: Any characteristic of an individual that can take different values for different cases.
Categorical Variable: Places an individual into one of several groups or categories. Analyzed using frequencies, proportions, and percentages; visualized with bar or pie charts.
Quantitative Variable: Takes numerical values for which arithmetic operations make sense. Analyzed using mean, median, standard deviation; visualized with histograms or scatterplots.
Examples:
Categorical: Favorite color, employee classification, policy agreement (yes/no).
Quantitative: Length of chalkboard (inches), time spent studying (minutes), football rushing yards (season total).
Distribution: Describes what values a variable takes and how often it takes these values.
For quantitative variables: Key features are shape, center, and variability (spread).
For categorical variables: Focus on the relative number of observations in each category; the category with the largest frequency is the modal category.
Types of Categorical Variables:
Nominal: No inherent ordering (e.g., gender, religious affiliation).
Ordinal: Categories can be ordered (e.g., letter grades, cancer stages).
Types of Quantitative Variables:
Discrete: Finite or countable values (e.g., number of students in a class).
Continuous: Any value within an interval (e.g., time on ice, amount of gas in a car).
Proportion: The number of observations in a category divided by the total number of observations.
Frequency Table: Lists possible values for a variable and the number of observations for each value.
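As a minimal sketch of how a frequency table, the proportions, and the modal category fit together, the following Python snippet tallies a small categorical variable. The responses and the helper name `frequency_table` are hypothetical and serve only as an illustration.

```python
from collections import Counter

# Hypothetical responses to a yes/no policy question (illustrative data only).
responses = ["yes", "no", "yes", "yes", "no", "yes", "undecided", "yes"]

def frequency_table(values):
    """Return {category: (count, proportion)} for a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {category: (count, count / total) for category, count in counts.items()}

table = frequency_table(responses)

# The modal category is the one with the largest frequency.
modal_category = max(table, key=lambda category: table[category][0])

for category, (count, proportion) in table.items():
    print(f"{category:10s} {count:3d} {proportion:.3f}")
print("Modal category:", modal_category)
```

The proportions always sum to 1, which is a quick check that no observation was dropped or counted twice.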
Section 2.2: Graphical Summaries of Data
Graphical summaries help visualize the main features of data distributions. The choice of graph depends on the type of variable.
Data Tables: Should have a clear title, variable labels (with units), and data source.
Pie Charts: Show how a whole is divided into parts; each slice represents a category's proportion.
Bar Graphs: Display frequencies or percentages for categories; bars are typically vertical and separated.
Pareto Chart: Bar graph with categories ordered from the most to the least frequent (tallest bar to shortest); highlights the most common categories.
Pictograms: Use images instead of bars; can be misleading if area does not match data ratios.
Example Table: Class Distribution of Titanic Passengers
| Class | Frequency | Relative Frequency | Percent | Degrees in Angle |
|---|---|---|---|---|
| First | 325 | 0.1477 | 14.77% | 53 |
| Second | 285 | 0.1294 | 12.94% | 47 |
| Third | 706 | 0.3207 | 32.07% | 115 |
| Crew | 885 | 0.4021 | 40.21% | 145 |
| Total | 2201 | 0.9999 | 99.99% | 360 |
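Additional info: Degrees in Angle are calculated as (Relative Frequency) × 360°. The totals of 0.9999 and 99.99% fall short of 1 and 100% only because the individual entries are rounded.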
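The columns of the Titanic table can be recomputed from the frequencies alone. The sketch below is one way to do it in Python; the variable names are illustrative, and because it works from unrounded ratios, the last decimal place of a relative frequency may differ slightly from the printed table.

```python
# Frequencies copied from the Titanic class table above.
frequencies = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}
total = sum(frequencies.values())  # 2201 people in all

rows = {}
for label, count in frequencies.items():
    relative_frequency = count / total
    rows[label] = {
        "relative_frequency": relative_frequency,
        "percent": 100 * relative_frequency,
        "degrees": 360 * relative_frequency,  # slice angle in a pie chart
    }

for label, row in rows.items():
    print(f"{label:6s} {row['relative_frequency']:.4f} "
          f"{row['percent']:6.2f}% {round(row['degrees']):4d}")
```

Rounded to whole degrees, the slice angles come out as 53, 47, 115, and 145, matching the table and summing to 360.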
Bar Graph vs. Histogram:
Histogram: Quantitative data; bars touch; the horizontal axis is a numerical scale divided into equal-width intervals; bar width is meaningful.
Bar Graph: Categorical data; bars are separated; the horizontal axis lists categories; bar width is not meaningful.
Histograms: Used for quantitative data; bars represent frequencies or relative frequencies for intervals (bins).
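Find the range: $\text{range} = \text{maximum} - \text{minimum}$
Choose the number of intervals (approx. $\sqrt{n}$ for $n$ observations is a common rule of thumb)
Calculate the interval width: $\text{width} = \dfrac{\text{range}}{\text{number of intervals}}$ (always round up)
Count the data points in each interval and draw the histogram.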
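The binning steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a prescribed method: the number of intervals is taken to be about $\sqrt{n}$ (a common rule of thumb), the data are hypothetical, and `histogram_counts` is a made-up helper name.

```python
import math

# Hypothetical study times in minutes (illustrative data only).
study_minutes = [12, 15, 21, 22, 25, 28, 30, 31, 34, 35, 38, 41, 44, 47, 52, 58]

def histogram_counts(values):
    """Bin values into equal-width intervals; assumes the values are not all equal."""
    low, high = min(values), max(values)
    data_range = high - low                            # find the range
    num_bins = max(1, round(math.sqrt(len(values))))   # assumed sqrt(n) rule of thumb
    width = math.ceil(data_range / num_bins)           # round the width up
    counts = [0] * num_bins
    for value in values:
        # Intervals are [left, right); cap the index so the maximum stays in the last bin.
        index = min(int((value - low) // width), num_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(num_bins + 1)]
    return edges, counts

edges, counts = histogram_counts(study_minutes)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left}, {right}): {'*' * count}")
```

Rounding the width up guarantees that the intervals cover the whole range, so every observation lands in some bin.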
Describing Distributions:
Center: Typical value (e.g., mean, median).
Spread: Variability (e.g., range, standard deviation).
Shape: Symmetric, skewed right, skewed left, unimodal, bimodal.
Outliers: Observations outside the overall pattern.
Other Graphical Displays:
Stem-and-Leaf Plot: Retains actual data values; best for small data sets.
Dot Plot: Useful for small quantitative data sets; each dot represents a value.
Time Plot: Shows variable behavior over time (time series); reveals trends and seasonal variation.
Section 2.3: Measuring the Center of Quantitative Data
Measures of central tendency describe the "typical" value in a data set. The three main measures are the mean, median, and mode.
Mean (Arithmetic Mean): Sum of all values divided by the number of values.
Formula: $\bar{x} = \dfrac{\sum x_i}{n}$
Median: The middle value when data are ordered. If $n$ is odd, it's the middle value; if $n$ is even, it's the average of the two middle values.
Mode: The value that occurs most frequently. Data can be unimodal, bimodal, or have no mode.
Comparing Mean and Median:
Symmetric distribution: mean ≈ median
Skewed right: mean > median (the long right tail pulls the mean up)
Skewed left: mean < median (the long left tail pulls the mean down)
Median is resistant to outliers; mean is not.
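A short Python sketch can make the resistance of the median concrete. The quiz scores are hypothetical, and the standard-library `statistics` module is used here simply as a convenient calculator.

```python
import statistics

# Hypothetical quiz scores (illustrative data only).
scores = [4, 5, 5, 6, 7, 8, 9]
mean_before = statistics.mean(scores)
median_before = statistics.median(scores)
modes = statistics.multimode(scores)  # list of all most-frequent values

# Replace the largest score with an extreme value to mimic an outlier.
scores_with_outlier = scores[:-1] + [90]
mean_after = statistics.mean(scores_with_outlier)
median_after = statistics.median(scores_with_outlier)

print(f"before: mean={mean_before:.2f}, median={median_before}, modes={modes}")
print(f"after:  mean={mean_after:.2f}, median={median_after}")
```

The single extreme value moves the mean from about 6.29 to about 17.86, while the median stays at 6. This also matches the right-skew pattern of mean > median.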
Section 2.4: Measuring the Variability of Quantitative Data
Measures of variability describe the spread of data values.
Range: Difference between highest and lowest values.
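Variance: Average squared deviation from the mean.
Sample Variance Formula: $s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard Deviation: Square root of the variance; typical distance from the mean.
Sample Standard Deviation Formula: $s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$
Mean Absolute Deviation (MAD): Average absolute deviation from the mean.
Formula: $\text{MAD} = \dfrac{\sum |x_i - \bar{x}|}{n}$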
Properties of Standard Deviation:
Increases as data spread increases.
Zero only if all values are identical.
Sensitive to outliers.
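To see how these measures are computed step by step, here is a small Python sketch on hypothetical data. It assumes the usual conventions: the sample variance divides the summed squared deviations by $n - 1$, and the mean absolute deviation divides the summed absolute deviations by $n$.

```python
import math

# Hypothetical data set (illustrative values only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n
squared_deviations = [(x - mean) ** 2 for x in data]

data_range = max(data) - min(data)
sample_variance = sum(squared_deviations) / (n - 1)  # divides by n - 1
sample_sd = math.sqrt(sample_variance)
mad = sum(abs(x - mean) for x in data) / n           # mean absolute deviation

print(f"range={data_range}, variance={sample_variance:.3f}, "
      f"sd={sample_sd:.3f}, MAD={mad:.3f}")
```

Here the mean is 5, the squared deviations sum to 32, and the sample variance is 32/7. The standard deviation (about 2.14) is larger than the MAD (1.5) because squaring gives extra weight to the values farthest from the mean.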
The Empirical Rule (68-95-99.7 Rule): For bell-shaped distributions:
~68% of data within 1 standard deviation of mean
~95% within 2 standard deviations
~99.7% within 3 standard deviations
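Mathematically (with $\bar{x}$ the mean and $s$ the standard deviation):
$\bar{x} \pm s$ contains about 68% of data
$\bar{x} \pm 2s$ contains about 95% of data
$\bar{x} \pm 3s$ contains about 99.7% of data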
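The Empirical Rule can be checked informally by simulation. The sketch below draws bell-shaped data with Python's `random.gauss` and counts how much of it falls within 1, 2, and 3 standard deviations of the mean. The mean of 100, standard deviation of 15, sample size, and seed are arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable

# Simulated bell-shaped data; 100 and 15 are arbitrary illustrative parameters.
sample = [random.gauss(100, 15) for _ in range(10_000)]

x_bar = statistics.mean(sample)
s = statistics.stdev(sample)

def share_within(k):
    """Proportion of the sample within k standard deviations of the mean."""
    inside = sum(1 for x in sample if x_bar - k * s <= x <= x_bar + k * s)
    return inside / len(sample)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.3f}")
```

The printed shares should land close to 0.68, 0.95, and 0.997. For strongly skewed data the same check would generally give different proportions, which is why the rule is stated only for bell-shaped distributions.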
Section 2.5: Using Measures of Position to Describe Variability
Measures of position describe the relative standing of a value within a data set.
Quartiles: Divide data into four equal parts.
Percentiles: The pth percentile is the value below which p% of observations fall.
Procedure to Compute Quartiles:
Order data from smallest to largest.
Find the median (Q2).
Q1: Median of lower half (not including Q2 if n is odd).
Q3: Median of upper half (not including Q2 if n is odd).
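The procedure above translates directly into a few lines of Python. This is a sketch of the median-of-halves convention described in these notes; the function name `quartiles` and the example data are hypothetical.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves procedure."""
    ordered = sorted(values)                # order data from smallest to largest
    n = len(ordered)
    lower = ordered[: n // 2]               # lower half, excluding the middle value if n is odd
    upper = ordered[(n + 1) // 2 :]         # upper half, excluding the middle value if n is odd
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

print(quartiles([7, 1, 3, 9, 5, 11, 13]))     # odd n
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))    # even n
```

Statistical software often uses other interpolation conventions for quartiles, so values from a package may differ slightly from this hand method, especially for small data sets.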
Interquartile Range (IQR): Measures spread of the middle 50% of data.
Formula: $\text{IQR} = Q_3 - Q_1$
Detecting Potential Outliers: An observation is a potential outlier if it is:
Below $Q_1 - 1.5 \times \text{IQR}$
Above $Q_3 + 1.5 \times \text{IQR}$
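A minimal Python sketch of the 1.5 × IQR criterion follows, where $\text{IQR} = Q_3 - Q_1$. The function names and the commute times are hypothetical, and the quartiles are supplied by hand (for this data the median-of-halves method gives $Q_1 = 20$ and $Q_3 = 30$).

```python
def outlier_fences(q1, q3):
    """Lower and upper fences from the 1.5 x IQR criterion."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def potential_outliers(values, q1, q3):
    """Values strictly below the lower fence or strictly above the upper fence."""
    lower, upper = outlier_fences(q1, q3)
    return [x for x in values if x < lower or x > upper]

# Hypothetical commute times in minutes (illustrative data only).
commute_minutes = [4, 19, 21, 22, 25, 27, 29, 31, 46]

print(outlier_fences(20, 30))
print(potential_outliers(commute_minutes, 20, 30))
```

With an IQR of 10 the fences are 5 and 45, so 4 and 46 are flagged. A flagged value is only a potential outlier and deserves a closer look before it is treated as an error.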
Box Plot (Box-and-Whisker Plot): Graphical summary based on the five-number summary (minimum, Q1, median, Q3, maximum). Useful for comparing distributions and identifying outliers.
Box from Q1 to Q3; line at median.
Whiskers extend to min and max (or to non-outlier values in modified box plots).
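The numbers behind a box plot can be computed without drawing anything. The sketch below returns the five-number summary and, for a modified box plot, the whisker ends (the most extreme values that are not potential outliers). It assumes median-of-halves quartiles and the 1.5 × IQR criterion; the function names and wait times are hypothetical.

```python
import statistics

def five_number_summary(values):
    """(minimum, Q1, median, Q3, maximum) using median-of-halves quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]
    upper = ordered[(n + 1) // 2 :]
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])

def modified_whiskers(values):
    """Whisker ends for a modified box plot: extreme values inside the fences."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    inside = [x for x in values if q1 - 1.5 * iqr <= x <= q3 + 1.5 * iqr]
    return min(inside), max(inside)

# Hypothetical wait times in minutes (illustrative data only).
wait_times = [3, 8, 9, 10, 11, 12, 13, 14, 30]

print(five_number_summary(wait_times))
print(modified_whiskers(wait_times))
```

For this data the upper whisker stops at 14 rather than at the maximum of 30, and 30 would be drawn as a separate point.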
Five-Number Summary Example (Presidents' Ages):
| Minimum | Q1 | Median | Q3 | Maximum |
|---|---|---|---|---|
| 42 | 51 | 55 | 61 | 78 |
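Interquartile Range: $\text{IQR} = Q_3 - Q_1 = 61 - 51 = 10$ years
Potential Outliers: Any value above $Q_3 + 1.5 \times 10 = 76$ or below $Q_1 - 1.5 \times 10 = 36$ is a potential outlier, so the maximum of 78 is flagged while the minimum of 42 is not.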
Summary Table: Graphical Methods for Data Types
| Graph Type | Variable Type | Main Use |
|---|---|---|
| Pie Chart | Categorical | Show how a whole is divided into parts |
| Bar Graph | Categorical | Display frequencies or percentages |
| Histogram | Quantitative | Show distribution shape, center, spread |
| Stem-and-Leaf Plot | Quantitative (small data) | Retain actual values, show distribution |
| Dot Plot | Quantitative (small data) | Show distribution, identify modes |
| Time Plot | Quantitative (time series) | Show trends and patterns over time |
| Box Plot | Quantitative | Compare distributions, identify outliers |
Additional info: For large data sets, histograms are preferred over stem-and-leaf or dot plots. Box plots are best for comparing multiple groups.