Exploring Data with Graphs and Numerical Summaries (STA 215, Chapter 2)

Study Guide - Smart Notes

Section 2.1: Different Types of Data

Understanding the types of variables in a data set is essential for selecting appropriate methods of analysis and visualization. Variables can be classified as either categorical or quantitative, each requiring different summary and graphical techniques.

Variable: Any characteristic of an individual that can take different values for different cases.
Categorical Variable: Places an individual into one of several groups or categories. Analyzed using frequencies, proportions, and percentages; visualized with bar or pie charts.
Quantitative Variable: Takes numerical values for which arithmetic operations make sense. Analyzed using mean, median, standard deviation; visualized with histograms or scatterplots.

Examples:

Categorical: Favorite color, employee classification, policy agreement (yes/no).
Quantitative: Length of chalkboard (inches), time spent studying (minutes), football rushing yards (season total).

Distribution: Describes what values a variable takes and how often it takes these values.

For quantitative variables: Key features are shape, center, and variability (spread).
For categorical variables: Focus on the relative number of observations in each category; the category with the largest frequency is the modal category.

Types of Categorical Variables:

Nominal: No inherent ordering (e.g., gender, religious affiliation).
Ordinal: Categories can be ordered (e.g., letter grades, cancer stages).

Types of Quantitative Variables:

Discrete: Finite or countable values (e.g., number of students in a class).
Continuous: Any value within an interval (e.g., time on ice, amount of gas in a car).

Proportion: The number of observations in a category divided by the total number of observations.

Frequency Table: Lists possible values for a variable and the number of observations for each value.
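
As a minimal sketch of the two definitions above, the following Python builds a frequency table with proportions and picks out the modal category. The responses are made-up data and the helper name `frequency_table` is hypothetical, chosen only for illustration.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (frequency, proportion)} for categorical values."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, n / total) for cat, n in counts.items()}

# Hypothetical answers to a yes/no policy question (made-up data).
responses = ["yes", "no", "yes", "yes", "no", "yes", "yes", "no"]
table = frequency_table(responses)

# The modal category is the one with the largest frequency.
modal_category = max(table, key=lambda cat: table[cat][0])

for cat, (n, prop) in table.items():
    print(f"{cat}: frequency={n}, proportion={prop:.3f}")
print("modal category:", modal_category)
```

Multiplying each proportion by 100 gives the corresponding percentage.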

Section 2.2: Graphical Summaries of Data

Graphical summaries help visualize the main features of data distributions. The choice of graph depends on the type of variable.

Data Tables: Should have a clear title, variable labels (with units), and data source.
Pie Charts: Show how a whole is divided into parts; each slice represents a category's proportion.
Bar Graphs: Display frequencies or percentages for categories; bars are typically vertical and separated.
Pareto Chart: Bar graph with bars ordered by height, from tallest to shortest; highlights the most common categories.
Pictograms: Use images instead of bars; can be misleading if area does not match data ratios.

Example Table: Class Distribution of Titanic Passengers

Class	Frequency	Relative Frequency	Percent	Degrees in Angle
First	325	0.1477	14.77%	53
Second	285	0.1294	12.94%	47
Third	706	0.3207	32.07%	115
Crew	885	0.4021	40.21%	145
Total	2201	0.9999	99.99%	360

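Additional info: Degrees in Angle are calculated as (Relative Frequency) × 360°. The relative frequencies total 0.9999 rather than 1 because each value is reported to only four decimal places.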
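
The columns of the Titanic table can be recomputed from the frequencies alone. The sketch below uses the four counts from the table; because it rounds rather than truncates, the last printed digit may differ slightly from the table for some rows.

```python
# Frequencies taken from the Titanic class table above.
counts = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}
total = sum(counts.values())

rows = {}
for cls, n in counts.items():
    rel_freq = n / total          # relative frequency (proportion)
    percent = 100 * rel_freq      # percent of all passengers and crew
    degrees = 360 * rel_freq      # pie-slice angle in degrees
    rows[cls] = (rel_freq, percent, degrees)
    print(f"{cls:<6} {n:>4} {rel_freq:.4f} {percent:6.2f}% {degrees:6.1f}")

print("Total ", total)
```

Before rounding, the slice angles add up to exactly 360°, which is why the pie chart closes.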

Bar Graph vs. Histogram:
- Histogram: Quantitative data, bars touch, base scale is numerical and equal units, bar width is meaningful.
- Bar Graph: Categorical data, bars separated, base scale is categories, bar width not meaningful.

Histograms: Used for quantitative data; bars represent frequencies or relative frequencies for intervals (bins).

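Find the range: $\text{range} = \text{maximum} - \text{minimum}$.
Choose the number of intervals (a common rule of thumb is about $\sqrt{n}$).
Calculate the interval width: $\text{width} = \text{range} / (\text{number of intervals})$ (always round up).
Count the data points in each interval and draw the histogram.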
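
The histogram steps can be sketched in Python as follows. Several choices here are assumptions, not part of the notes: the number of intervals defaults to about the square root of n (a common rule of thumb), the width is rounded up to a whole number (which suits integer-scale data), and the last interval includes its right endpoint so the maximum is always counted. The data and the name `histogram_counts` are hypothetical.

```python
import math

def histogram_counts(data, num_intervals=None):
    """Bin data into equal-width intervals.

    Returns (width, bins) where bins is a list of (low, high, count).
    """
    lo, hi = min(data), max(data)
    data_range = hi - lo                      # step 1: find the range
    if num_intervals is None:
        # Assumption: roughly sqrt(n) intervals.
        num_intervals = max(1, round(math.sqrt(len(data))))
    # Step 3: interval width, always rounded up (at least 1).
    width = max(1, math.ceil(data_range / num_intervals))
    bins = []
    for i in range(num_intervals):
        low = lo + i * width
        high = low + width
        is_last = i == num_intervals - 1
        # Intervals are [low, high); the last one also includes high.
        count = sum(1 for x in data
                    if low <= x < high or (is_last and x == high))
        bins.append((low, high, count))
    return width, bins

# Hypothetical minutes spent studying for 16 students (made-up data).
minutes = [12, 15, 21, 22, 25, 30, 31, 33, 38, 40, 41, 45, 47, 52, 58, 60]
width, bins = histogram_counts(minutes)
for low, high, count in bins:
    print(f"{low:>3} to {high:<3} {'*' * count}")
```

Each row of asterisks stands in for one histogram bar; the counts must add up to the number of observations.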

Describing Distributions:

Center: Typical value (e.g., mean, median).
Spread: Variability (e.g., range, standard deviation).
Shape: Symmetric, skewed right, skewed left, unimodal, bimodal.
Outliers: Observations outside the overall pattern.

Other Graphical Displays:

Stem-and-Leaf Plot: Retains actual data values; best for small data sets.
Dot Plot: Useful for small quantitative data sets; each dot represents a value.
Time Plot: Shows variable behavior over time (time series); reveals trends and seasonal variation.

Section 2.3: Measuring the Center of Quantitative Data

Measures of central tendency describe the "typical" value in a data set. The three main measures are the mean, median, and mode.

Mean (Arithmetic Mean): Sum of all values divided by the number of values.

Formula: $\bar{x} = \frac{\sum x_i}{n}$

Median: The middle value when data are ordered. If $n$ is odd, it's the middle value; if $n$ is even, it's the average of the two middle values.
Mode: The value that occurs most frequently. Data can be unimodal, bimodal, or have no mode.
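
A short illustration of the three measures, using Python's standard `statistics` module on a made-up list of quiz scores. The data are hypothetical; `multimode` returns every value tied for most frequent, so it also handles bimodal data.

```python
import statistics

# Hypothetical quiz scores (made-up data), n = 6 so n is even.
scores = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(scores)          # sum of values / number of values
median = statistics.median(scores)      # average of the two middle values
modes = statistics.multimode(scores)    # list of most frequent value(s)

print("mean:", mean)
print("median:", median)
print("mode(s):", modes)

# With an odd number of values the median is the single middle value.
odd_median = statistics.median([2, 3, 3, 5, 7])
print("median with n odd:", odd_median)
```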

Comparing Mean and Median:

Symmetric distribution: mean ≈ median
Skewed right: typically mean > median
Skewed left: typically mean < median
Median is resistant to outliers; mean is not.
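
To see what "resistant" means in practice, the sketch below adds one extreme value to a small made-up data set and compares how far each measure moves. The incomes are hypothetical figures in thousands of dollars.

```python
import statistics

# Hypothetical incomes in thousands of dollars (made-up data).
incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [400]   # add one very large value

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
```

The single large value pulls the mean far to the right while the median barely changes, which also produces the mean > median pattern of a right-skewed distribution.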

Section 2.4: Measuring the Variability of Quantitative Data

Measures of variability describe the spread of data values.

Range: Difference between highest and lowest values.
Variance: Average squared deviation from the mean.

Sample Variance Formula: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

Standard Deviation: Square root of the variance; typical distance from the mean.

Sample Standard Deviation Formula: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

Mean Absolute Deviation (MAD): Average absolute deviation from the mean.

Formula: $\text{MAD} = \frac{\sum |x_i - \bar{x}|}{n}$

Properties of Standard Deviation:
- Increases as data spread increases.
- Zero only if all values are identical.
- Sensitive to outliers.
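
The measures of variability can be computed directly from their definitions. This is a sketch under two assumptions: the sample variance divides by n - 1, and the mean absolute deviation divides by n. The data and function names are hypothetical.

```python
import math

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    xbar = sum(data) / n
    return sum((x - xbar) ** 2 for x in data) / (n - 1)

def sample_sd(data):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(data))

def mean_absolute_deviation(data):
    """Average absolute deviation from the mean (assumed divisor: n)."""
    n = len(data)
    xbar = sum(data) / n
    return sum(abs(x - xbar) for x in data) / n

# Hypothetical data set (made-up values) with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)
print("range:", data_range)
print("variance:", round(sample_variance(data), 4))
print("standard deviation:", round(sample_sd(data), 4))
print("MAD:", mean_absolute_deviation(data))

# The standard deviation is zero only when every value is identical.
print("sd of identical values:", sample_sd([3, 3, 3]))
```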

The Empirical Rule (68-95-99.7 Rule): For bell-shaped distributions:

~68% of data within 1 standard deviation of mean
~95% within 2 standard deviations
~99.7% within 3 standard deviations

Mathematically:

$\bar{x} \pm s$ contains about 68% of data
$\bar{x} \pm 2s$ contains about 95% of data
$\bar{x} \pm 3s$ contains about 99.7% of data
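
One way to check the Empirical Rule is to simulate bell-shaped data and count how much of it falls within 1, 2, and 3 standard deviations of the mean. The sketch below assumes normally distributed values with mean 100 and standard deviation 15; both numbers, the fixed seed, and the helper name `fraction_within` are arbitrary choices for illustration.

```python
import random
import statistics

def fraction_within(data, k):
    """Fraction of observations within k standard deviations of the mean."""
    xbar = statistics.mean(data)
    s = statistics.stdev(data)   # sample standard deviation
    inside = sum(1 for x in data if xbar - k * s <= x <= xbar + k * s)
    return inside / len(data)

# Simulated bell-shaped data (assumed mean 100, sd 15; seed fixed for repeatability).
rng = random.Random(0)
sample = [rng.gauss(100, 15) for _ in range(10_000)]

fractions = {k: fraction_within(sample, k) for k in (1, 2, 3)}
for k, frac in fractions.items():
    print(f"within {k} sd: {100 * frac:.1f}%")
```

The simulated percentages land close to, but not exactly on, 68%, 95%, and 99.7%. The rule is an approximation and does not apply to strongly skewed data.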

Section 2.5: Using Measures of Position to Describe Variability

Measures of position describe the relative standing of a value within a data set.

Quartiles: Divide data into four equal parts.
Percentiles: The pth percentile is the value below which p% of observations fall.

Procedure to Compute Quartiles:

Order data from smallest to largest.
Find the median (Q2).
Q1: Median of lower half (not including Q2 if n is odd).
Q3: Median of upper half (not including Q2 if n is odd).
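
The quartile procedure above can be sketched as a small function. The data and the name `quartiles` are hypothetical. Statistical software often uses slightly different interpolation rules, so its quartiles may not match this hand method exactly.

```python
import statistics

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves procedure."""
    ordered = sorted(data)                 # order smallest to largest
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                 # values below the median position
    # When n is odd, skip the middle value so Q2 is in neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    q2 = statistics.median(ordered)
    return statistics.median(lower), q2, statistics.median(upper)

# Hypothetical data with n odd (made-up values).
print(quartiles([7, 1, 9, 3, 5, 11, 13]))

# Hypothetical data with n even.
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))
```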

Interquartile Range (IQR): Measures spread of the middle 50% of data.

Formula: $\text{IQR} = Q_3 - Q_1$

Detecting Potential Outliers: An observation is a potential outlier if it is:

Below $Q_1 - 1.5 \times \text{IQR}$
Above $Q_3 + 1.5 \times \text{IQR}$

Box Plot (Box-and-Whisker Plot): Graphical summary based on the five-number summary (minimum, Q1, median, Q3, maximum). Useful for comparing distributions and identifying outliers.

Box from Q1 to Q3; line at median.
Whiskers extend to min and max (or to non-outlier values in modified box plots).

Five-Number Summary Example (Presidents' Ages):

Minimum	Q1	Median	Q3	Maximum
42	51	55	61	78

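Interquartile Range: $\text{IQR} = Q_3 - Q_1 = 61 - 51 = 10$ years

Potential Outliers: Any value above $Q_3 + 1.5 \times \text{IQR} = 61 + 1.5 \times 10 = 76$ or below $Q_1 - 1.5 \times \text{IQR} = 51 - 1.5 \times 10 = 36$ is a potential outlier, so the maximum of 78 is flagged.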
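
The outlier check for the Presidents' ages can be reproduced from the five-number summary alone. The sketch assumes the usual 1.5 × IQR rule; the function name `outlier_fences` is hypothetical.

```python
def outlier_fences(q1, q3, multiplier=1.5):
    """Return (lower_fence, upper_fence) for the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

# Five-number summary from the Presidents' ages table above.
minimum, q1, median, q3, maximum = 42, 51, 55, 61, 78

iqr = q3 - q1
lower_fence, upper_fence = outlier_fences(q1, q3)

print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("minimum is a potential outlier:", minimum < lower_fence)
print("maximum is a potential outlier:", maximum > upper_fence)
```

Only the extremes need checking here: if the minimum and maximum are inside the fences, every other observation is too.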

Summary Table: Graphical Methods for Data Types

Graph Type	Variable Type	Main Use
Pie Chart	Categorical	Show how a whole is divided into parts
Bar Graph	Categorical	Display frequencies or percentages
Histogram	Quantitative	Show distribution shape, center, spread
Stem-and-Leaf Plot	Quantitative (small data)	Retain actual values, show distribution
Dot Plot	Quantitative (small data)	Show distribution, identify modes
Time Plot	Quantitative (time series)	Show trends and patterns over time
Box Plot	Quantitative	Compare distributions, identify outliers

Additional info: For large data sets, histograms are preferred over stem-and-leaf or dot plots. Box plots are best for comparing multiple groups.