Exploring Data with Graphs and Numerical Summaries (STA 215, Chapter 2)

Section 2.1: Different Types of Data

Understanding the types of variables in a data set is essential for selecting appropriate methods of analysis and visualization. Variables can be classified as either categorical or quantitative, each requiring different summary and graphical techniques.

  • Variable: Any characteristic of an individual that can take different values for different cases.

  • Categorical Variable: Places an individual into one of several groups or categories. Analyzed using frequencies, proportions, and percentages; visualized with bar or pie charts.

  • Quantitative Variable: Takes numerical values for which arithmetic operations make sense. Analyzed using mean, median, standard deviation; visualized with histograms or scatterplots.

Examples:

  • Categorical: Favorite color, employee classification, policy agreement (yes/no).

  • Quantitative: Length of chalkboard (inches), time spent studying (minutes), football rushing yards (season total).

Distribution: Describes what values a variable takes and how often it takes these values.

  • For quantitative variables: Key features are shape, center, and variability (spread).

  • For categorical variables: Focus on the relative number of observations in each category; the category with the largest frequency is the modal category.

Types of Categorical Variables:

  • Nominal: No inherent ordering (e.g., gender, religious affiliation).

  • Ordinal: Categories can be ordered (e.g., letter grades, cancer stages).

Types of Quantitative Variables:

  • Discrete: Finite or countable values (e.g., number of students in a class).

  • Continuous: Any value within an interval (e.g., time on ice, amount of gas in a car).

Proportion: The number of observations in a category divided by the total number of observations.

Frequency Table: Lists possible values for a variable and the number of observations for each value.
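
As a rough illustration of the definitions above, the following Python sketch builds a frequency table with proportions for a categorical variable and picks out the modal category. The response data and the helper name `frequency_table` are hypothetical and used only for illustration.

```python
from collections import Counter

# Hypothetical responses to a policy question (illustrative data only).
responses = ["yes", "no", "yes", "yes", "no", "yes", "undecided", "yes"]


def frequency_table(values):
    """Return {category: (frequency, proportion)} for a categorical variable."""
    counts = Counter(values)
    total = len(values)
    # Proportion = observations in the category / total observations.
    return {category: (count, count / total) for category, count in counts.items()}


table = frequency_table(responses)

# The modal category is the one with the largest frequency.
modal_category = max(table, key=lambda category: table[category][0])

for category, (count, proportion) in table.items():
    print(f"{category:<10} {count:>3} {proportion:.3f}")
print("Modal category:", modal_category)
```

The proportions in a complete frequency table add up to 1, which is a quick check that no category was missed.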

Section 2.2: Graphical Summaries of Data

Graphical summaries help visualize the main features of data distributions. The choice of graph depends on the type of variable.

  • Data Tables: Should have a clear title, variable labels (with units), and data source.

  • Pie Charts: Show how a whole is divided into parts; each slice represents a category's proportion.

  • Bar Graphs: Display frequencies or percentages for categories; bars are typically vertical and separated.

  • Pareto Chart: Bar graph with bars ordered from tallest to shortest, so the most common categories appear first.

  • Pictograms: Use images instead of bars; can be misleading if area does not match data ratios.

Example Table: Class Distribution of Titanic Passengers

| Class | Frequency | Relative Frequency | Percent | Degrees in Angle |
|---|---|---|---|---|
| First | 325 | 0.1477 | 14.77% | 53 |
| Second | 285 | 0.1294 | 12.94% | 47 |
| Third | 706 | 0.3207 | 32.07% | 115 |
| Crew | 885 | 0.4021 | 40.21% | 145 |
| Total | 2201 | 0.9999 | 99.99% | 360 |

Additional info: Degrees in Angle are calculated as (Relative Frequency) × 360°.
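
The pie-chart arithmetic can be sketched in Python as follows, using the Titanic frequencies from the table above. The helper name `pie_slice` is a hypothetical choice for this example. Values computed from unrounded proportions may differ from a printed table in the last digit.

```python
# Frequencies taken from the Titanic class table.
frequencies = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}

total = sum(frequencies.values())


def pie_slice(count, total):
    """Return (relative frequency, percent, central angle in degrees)."""
    relative = count / total
    return relative, 100 * relative, 360 * relative


for label, count in frequencies.items():
    relative, percent, degrees = pie_slice(count, total)
    print(f"{label:<7} {count:>5} {relative:.4f} {percent:6.2f}% {degrees:6.1f}")
```

Because the relative frequencies add up to 1, the angles add up to 360°.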

  • Bar Graph vs. Histogram:

    • Histogram: Quantitative data, bars touch, base scale is numerical with equal-width intervals, bar width is meaningful.

    • Bar Graph: Categorical data, bars separated, base scale is categories, bar width not meaningful.

Histograms: Used for quantitative data; bars represent frequencies or relative frequencies for intervals (bins).

  1. Find the range: $\text{range} = \text{maximum} - \text{minimum}$

  2. Choose the number of intervals (a common rule of thumb is approximately $\sqrt{n}$, where $n$ is the number of observations)

  3. Calculate interval width: $\text{width} = \dfrac{\text{range}}{\text{number of intervals}}$ (always round up)

  4. Count data points in each interval and draw the histogram.
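
The four steps can be sketched in Python as below. This is a minimal illustration, not a prescribed method: the study-time data, the function name `histogram_counts`, and the choice of 5 intervals are all assumptions made for the example. Intervals are treated as including their left endpoint and excluding their right endpoint, with the maximum placed in the last interval.

```python
import math

# Hypothetical study times in minutes (illustrative data only).
study_minutes = [12, 15, 21, 22, 25, 28, 30, 31, 35, 38, 41, 45, 47, 52, 58, 60]


def histogram_counts(values, num_intervals):
    """Return (interval edges, counts) for equal-width intervals."""
    low, high = min(values), max(values)
    data_range = high - low                                 # Step 1: range
    width = max(1, math.ceil(data_range / num_intervals))   # Step 3: round up
    counts = [0] * num_intervals
    for value in values:                                    # Step 4: count
        # Clamp so the maximum always lands in the last interval.
        index = min(int((value - low) // width), num_intervals - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(num_intervals + 1)]
    return edges, counts


# Step 2: the number of intervals is a choice; 5 is assumed here.
edges, counts = histogram_counts(study_minutes, num_intervals=5)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left}, {right}): {'*' * count}")
```

Rounding the width up guarantees that the intervals together cover the whole range of the data.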

Describing Distributions:

  • Center: Typical value (e.g., mean, median).

  • Spread: Variability (e.g., range, standard deviation).

  • Shape: Symmetric, skewed right, skewed left, unimodal, bimodal.

  • Outliers: Observations outside the overall pattern.

Other Graphical Displays:

  • Stem-and-Leaf Plot: Retains actual data values; best for small data sets.

  • Dot Plot: Useful for small quantitative data sets; each dot represents a value.

  • Time Plot: Shows variable behavior over time (time series); reveals trends and seasonal variation.

Section 2.3: Measuring the Center of Quantitative Data

Measures of central tendency describe the "typical" value in a data set. The three main measures are the mean, median, and mode.

  • Mean (Arithmetic Mean): Sum of all values divided by the number of values.

Formula: $\bar{x} = \dfrac{\sum x_i}{n}$

  • Median: The middle value when data are ordered. If the number of observations $n$ is odd, it's the middle value; if $n$ is even, it's the average of the two middle values.

  • Mode: The value that occurs most frequently. Data can be unimodal, bimodal, or have no mode.
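
A short Python sketch of the three measures, using the standard-library `statistics` module. The quiz scores are hypothetical data chosen for illustration.

```python
import statistics

# Hypothetical quiz scores (illustrative data only).
scores = [7, 8, 8, 9, 10, 6, 8, 9]

# Mean: sum of all values divided by the number of values.
mean_by_formula = sum(scores) / len(scores)
mean_value = statistics.mean(scores)

# Median: n = 8 is even, so it is the average of the two middle ordered values.
median_value = statistics.median(scores)

# Mode(s): multimode returns a list, so bimodal data yields two values.
modes = statistics.multimode(scores)

print(mean_value, median_value, modes)
```

`statistics.multimode` is used instead of `statistics.mode` so that data with more than one most-frequent value reports all of them.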

Comparing Mean and Median:

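  • Symmetric distribution: mean ≈ median

  • Skewed right: typically mean > median (the long right tail pulls the mean up)

  • Skewed left: typically mean < median (the long left tail pulls the mean down)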

  • Median is resistant to outliers; mean is not.
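
To see the resistance of the median concretely, the sketch below adds one extreme value to a small data set and compares how far the mean and the median move. The salary figures are hypothetical.

```python
import statistics

# Hypothetical annual salaries in thousands of dollars (illustrative data only).
salaries = [42, 45, 47, 50, 52]
with_outlier = salaries + [400]  # one extreme value added

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

The single large value drags the mean far to the right while the median barely changes, which is also why the mean exceeds the median in the right-skewed version of the data.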

Section 2.4: Measuring the Variability of Quantitative Data

Measures of variability describe the spread of data values.

  • Range: Difference between highest and lowest values.

  • Variance: Average squared deviation from the mean (the sample variance divides by $n - 1$ rather than $n$).

Sample Variance Formula: $s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$

  • Standard Deviation: Square root of the variance; typical distance from the mean.

Sample Standard Deviation Formula: $s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$

  • Mean Absolute Deviation (MAD): Average absolute deviation from the mean.

Formula: $\text{MAD} = \dfrac{\sum |x_i - \bar{x}|}{n}$

  • Properties of Standard Deviation:

    • Increases as data spread increases.

    • Zero only if all values are identical.

    • Sensitive to outliers.
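
The measures of variability can be computed directly from their definitions, as in this Python sketch. It assumes the usual textbook conventions: the sample variance divides by n − 1, and the mean absolute deviation divides by n. The data values and function names are hypothetical.

```python
import math
import statistics

# Hypothetical data set (illustrative values only).
data = [2, 4, 4, 4, 5, 5, 7, 9]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_std_dev(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


def mean_absolute_deviation(values):
    """Average absolute deviation from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


print("range:", max(data) - min(data))
print("variance:", sample_variance(data))
print("std dev:", sample_std_dev(data))
print("MAD:", mean_absolute_deviation(data))
```

The hand-written variance agrees with `statistics.variance`, and a data set whose values are all identical gives a standard deviation of zero.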

The Empirical Rule (68-95-99.7 Rule): For bell-shaped distributions:

  • ~68% of data within 1 standard deviation of mean

  • ~95% within 2 standard deviations

  • ~99.7% within 3 standard deviations

Mathematically:

  • $\bar{x} \pm s$ contains about 68% of data

  • $\bar{x} \pm 2s$ contains about 95% of data

  • $\bar{x} \pm 3s$ contains about 99.7% of data
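
The 68-95-99.7 figures come from the normal (bell-shaped) curve. As a sketch, the proportions can be recovered from the standard normal distribution with Python's `statistics.NormalDist`. Real data that are only roughly bell-shaped will match these percentages only approximately. The function name `proportion_within` is a hypothetical choice.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {proportion_within(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```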

Section 2.5: Using Measures of Position to Describe Variability

Measures of position describe the relative standing of a value within a data set.

  • Quartiles: Divide data into four equal parts.

  • Percentiles: The pth percentile is the value below which p% of observations fall.

Procedure to Compute Quartiles:

  1. Order data from smallest to largest.

  2. Find the median (Q2).

  3. Q1: Median of lower half (not including Q2 if n is odd).

  4. Q3: Median of upper half (not including Q2 if n is odd).
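
The procedure above can be sketched in Python as follows. The function name `quartiles` and the sample data are hypothetical, and the sketch assumes at least two observations. Software packages often use other quartile conventions (for example, `statistics.quantiles` interpolates by default), so their results may differ slightly from this median-of-halves method.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves procedure."""
    ordered = sorted(values)             # Step 1: order the data
    n = len(ordered)
    half = n // 2
    q2 = statistics.median(ordered)      # Step 2: median
    lower = ordered[:half]               # lower half (excludes Q2 when n is odd)
    # When n is odd, skip the middle value so Q2 is in neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    q1 = statistics.median(lower)        # Step 3
    q3 = statistics.median(upper)        # Step 4
    return q1, q2, q3


# Hypothetical data with an odd and an even number of observations.
print(quartiles([13, 1, 9, 5, 3, 11, 7]))
print(quartiles([1, 3, 5, 7, 9, 11]))
```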

Interquartile Range (IQR): Measures spread of the middle 50% of data.

Formula: $\text{IQR} = Q_3 - Q_1$

Detecting Potential Outliers: An observation is a potential outlier if it is:

  • Below $Q_1 - 1.5 \times \text{IQR}$

  • Above $Q_3 + 1.5 \times \text{IQR}$
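
A minimal Python sketch of the 1.5 × IQR rule, with fences at Q1 − 1.5 × IQR and Q3 + 1.5 × IQR. The function names and the data are hypothetical. The quartiles are passed in directly so that the sketch does not depend on any particular quartile convention.

```python
def outlier_fences(q1, q3):
    """Return (lower fence, upper fence) from the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def potential_outliers(values, q1, q3):
    """Return the values that fall strictly outside the fences."""
    lower, upper = outlier_fences(q1, q3)
    return [x for x in values if x < lower or x > upper]


# Hypothetical ordered data with Q1 = 11 and Q3 = 17 (medians of the two halves).
values = [1, 10, 12, 13, 15, 16, 18, 40]
fences = outlier_fences(11, 17)
flagged = potential_outliers(values, 11, 17)

print("fences:", fences)
print("potential outliers:", flagged)
```

A flagged value is only a potential outlier; it should be checked against the context of the data before being removed or corrected.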

Box Plot (Box-and-Whisker Plot): Graphical summary based on the five-number summary (minimum, Q1, median, Q3, maximum). Useful for comparing distributions and identifying outliers.

  • Box from Q1 to Q3; line at median.

  • Whiskers extend to min and max (or to non-outlier values in modified box plots).

Five-Number Summary Example (Presidents' Ages):

| Minimum | Q1 | Median | Q3 | Maximum |
|---|---|---|---|---|
| 42 | 51 | 55 | 61 | 78 |

Interquartile Range: $\text{IQR} = Q_3 - Q_1 = 61 - 51 = 10$ years

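Potential Outliers: Any value above $Q_3 + 1.5 \times \text{IQR} = 61 + 1.5 \times 10 = 76$ is a potential outlier, so the maximum of 78 is flagged. The lower fence is $Q_1 - 1.5 \times \text{IQR} = 51 - 1.5 \times 10 = 36$, which is below the minimum of 42, so no value is flagged on the low end.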

Summary Table: Graphical Methods for Data Types

| Graph Type | Variable Type | Main Use |
|---|---|---|
| Pie Chart | Categorical | Show how a whole is divided into parts |
| Bar Graph | Categorical | Display frequencies or percentages |
| Histogram | Quantitative | Show distribution shape, center, spread |
| Stem-and-Leaf Plot | Quantitative (small data) | Retain actual values, show distribution |
| Dot Plot | Quantitative (small data) | Show distribution, identify modes |
| Time Plot | Quantitative (time series) | Show trends and patterns over time |
| Box Plot | Quantitative | Compare distributions, identify outliers |

Additional info: For large data sets, histograms are preferred over stem-and-leaf or dot plots. Box plots are best for comparing multiple groups.
