(Lecture 3) Graphical Representation and Measures of Center and Spread in Statistics


Choosing a Graph Type

Guidelines for Selecting Graphical Displays

Choosing the appropriate graph type is essential for effectively representing a data distribution. The choice depends on the size and nature of the data set.

Dot Plot and Stem-and-Leaf Plot:
- Best for small data sets.
- Retain individual data values, allowing for detailed inspection.
Histogram:
- Best for large data sets.
- Provides a compact display of data.
- Allows flexibility in defining intervals (bins).
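
To illustrate how the choice of intervals changes what a histogram shows, here is a minimal Python sketch that counts observations per bin and prints a text histogram. The helper name `bin_counts` and the bin widths (50 and 100) are illustrative assumptions. The data are the 20 values from the mean example later in these notes.

```python
from collections import Counter

# The 20 values from the mean example later in these notes.
data = [0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
        140, 180, 190, 160, 290, 50, 220, 180, 200, 210]

def bin_counts(values, width):
    """Count values per interval [k*width, (k+1)*width). Hypothetical helper."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Print one text histogram per bin width; labels assume integer data.
for width in (50, 100):
    print(f"bin width = {width}")
    for start, count in bin_counts(data, width).items():
        print(f"  {start:>3}-{start + width - 1:<3} {'*' * count}")
```

Narrower bins show more detail but make the display noisier, while wider bins smooth the picture at the cost of detail. Note that this sketch omits empty bins.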

Interpreting Histograms

Key Features of Distributions

Histograms help visualize the overall pattern of a distribution, which consists of three main aspects: center, spread, and shape.

Center:
- Assessed by finding the median (the value with 50% of data below and 50% above).
Spread:
- Refers to the variability or dispersion of the data.
Shape:
- Describes the symmetry or skewness of the distribution (e.g., symmetric, skewed right, skewed left).

Shape of Distributions

Modes and Modality

The mode is the value that occurs most frequently in a data set. The modality of a distribution refers to the number of peaks or mounds it has.

Unimodal:
- One single peak or mound.
Bimodal:
- Two distinct peaks or mounds.

Symmetry and Skewness

Distributions can be classified based on their symmetry and the direction of their tails.

Symmetric Distribution:
- Both sides of the histogram are mirror images.
Skewed to the Left (Negative Skew):
- Left tail is longer than the right tail.
Skewed to the Right (Positive Skew):
- Right tail is longer than the left tail.

Examples of Skewness

Life Span:
- Skewed to the left; most people die at older ages, with fewer at younger ages.
Income:
- Skewed to the right; most people have moderate incomes, with a few having very high incomes.
San Francisco Home Prices:
- Right-skewed distribution; most homes are moderately priced, with a few very expensive ones.
Retirement Age:
- Left-skewed distribution; most people retire at a typical age, with fewer retiring much earlier.

Time Plots

Displaying Time Series Data

A time series is a data set collected over time. Time plots are used to graphically display time series data, plotting each observation against the time it was measured.

Common patterns over time, known as trends, should be noted.
Connecting data points in sequence can help visualize trends more clearly.

Examples of Time Plots

Whooping Cough Incidence Rates (1980–2012):
- Shows changes in disease incidence over time.
University Enrollment (2004–2012):
- Displays total enrollment and STEM major enrollment over several years.

Measuring the Center of Quantitative Data

The Mean

The mean (average) is the sum of all observations divided by the number of observations.

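Formula: $\bar{x} = \frac{\sum x_i}{n}$, where $x_i$ denotes the individual observations and $n$ is the number of observations.
Example: For data: 0, 340, 70, 140, 200, 180, 210, 150, 100, 130, 140, 180, 190, 160, 290, 50, 220, 180, 200, 210 (n=20), the sum is 3340, so $\bar{x} = 3340 / 20 = 167$.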
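
The arithmetic for this example can be checked with a few lines of Python. This is only a sketch of the definition (sum divided by count); the variable names are illustrative.

```python
# The 20 observations from the example above.
data = [0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
        140, 180, 190, 160, 290, 50, 220, 180, 200, 210]

n = len(data)        # number of observations
total = sum(data)    # sum of all observations
mean = total / n     # mean = sum / n

print(f"n = {n}, sum = {total}, mean = {mean}")
```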

The Median

The median is the middle value when observations are ordered from smallest to largest.

If n is odd, the median is the middle observation.
If n is even, the median is the average of the two middle observations.
Example: For data: 18, 20, 23, 32, 46, 65 (n=6, even) Median = (23 + 32) / 2 = 27.5
Example: For data: 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68, 70 (n=19, odd) Median = 62 (10th value)
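
The odd/even rule can be written as a short Python function. This is a minimal sketch; the function name `median` is our own, and the odd-length list is simply the even example with its last value dropped, used here only for illustration.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two middle ones

even_example = [18, 20, 23, 32, 46, 65]  # from the notes above
odd_example = even_example[:-1]          # illustrative: same data without 65

print(median(even_example))
print(median(odd_example))
```

Python's standard library offers the same calculation as `statistics.median`.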

Outliers

An outlier is an observation that falls well above or below the bulk of the data.

Outliers can affect the mean significantly but have little effect on the median.
Example: CO2 emissions per person in 9 countries: 0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9. Median = 1.8, while the mean is about 4.58, pulled upward by the largest values; 16.9 is an outlier.
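
To see how differently the two measures react to the outlier, the sketch below recomputes both with the largest value removed. Dropping 16.9 is our own illustrative step, not part of the original example.

```python
from statistics import mean, median

co2 = [0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9]
without_largest = co2[:-1]  # same data with the outlier 16.9 removed

# Round only for display; the comparison of interest is how far each measure moves.
print("all 9 values:    mean", round(mean(co2), 2), "median", round(median(co2), 2))
print("without 16.9:    mean", round(mean(without_largest), 2),
      "median", round(median(without_largest), 2))

mean_shift = mean(co2) - mean(without_largest)
median_shift = median(co2) - median(without_largest)
```

The mean moves by about 1.5 while the median moves by only about 0.2.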

Comparing Mean and Median

The shape of a distribution influences the relationship between the mean and median.

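For symmetric distributions: Mean ≈ Median (equal when the symmetry is exact)
For left-skewed distributions: typically Mean < Median (the long left tail pulls the mean down)
For right-skewed distributions: typically Mean > Median (the long right tail pulls the mean up)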
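
As a rough illustration, the comparison can be turned into a rule-of-thumb check. The helper `skew_hint` is hypothetical, and its output is only a hint: a histogram remains the more reliable way to judge shape. The two data sets are taken from the mean and outlier examples in these notes.

```python
from statistics import mean, median

def skew_hint(values):
    """Rule-of-thumb skew direction from mean vs. median (hypothetical helper)."""
    x_bar, med = mean(values), median(values)
    if x_bar > med:
        return "right"              # mean pulled above the median
    if x_bar < med:
        return "left"               # mean pulled below the median
    return "roughly symmetric"

co2 = [0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9]
sodium = [0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
          140, 180, 190, 160, 290, 50, 220, 180, 200, 210]

print("CO2 data:", skew_hint(co2))
print("mean-example data:", skew_hint(sodium))
```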

Resistant Measures

A measure is resistant if extreme observations (outliers) have little or no influence on its value.
The median is resistant; the mean is not.

The Mode

The mode is the value that occurs most often in a data set. It is most useful for categorical data and corresponds to the highest bar in a histogram.

Example: For data: 53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68, 70 Mode = 62
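
The mode in this example can be found by tallying how often each value occurs. The sketch below uses `collections.Counter` from Python's standard library; the variable names are illustrative.

```python
from collections import Counter

data = [53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61,
        62, 62, 62, 64, 65, 65, 67, 68, 68, 70]

counts = Counter(data)                         # value -> number of occurrences
mode_value, mode_count = counts.most_common(1)[0]

print(f"mode = {mode_value} (occurs {mode_count} times)")
```

If several values tie for the highest count, `most_common(1)` reports only one of them; `statistics.multimode` returns all of them.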

Measuring the Variability of Quantitative Data

Range

The range is the difference between the largest and smallest values in the data set.

Formula: $\text{Range} = x_{\max} - x_{\min}$
The range is simple to compute but is affected by outliers.
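
A short sketch shows how strongly the range depends on the extremes. It uses the 20 values from the mean example; trimming the single smallest and largest value is our own illustrative step.

```python
data = [0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
        140, 180, 190, 160, 290, 50, 220, 180, 200, 210]

data_range = max(data) - min(data)            # largest minus smallest

trimmed = sorted(data)[1:-1]                  # drop the single smallest and largest value
trimmed_range = max(trimmed) - min(trimmed)

print("range of all values:", data_range)
print("range without the two extremes:", trimmed_range)
```

Removing just two of the 20 observations shrinks the range considerably, which is why the range alone is a fragile summary of spread.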

Standard Deviation

The standard deviation measures, roughly, the typical distance of an observation from the mean.

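For each observation $x_i$, the deviation from the mean is $x_i - \bar{x}$, where $\bar{x}$ is the mean.
The sum of all deviations is zero, so the deviations are squared before they are averaged.
The sample variance $s^2$ is approximately the average of the squared deviations: their sum divided by $n - 1$.
Formula for sample standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Example: For data: 6, 7, 10, 11, 11, 13, 16, 18, 25 (n=9) Mean = 13. The squared deviations sum to 280, so $s^2 = 280 / 8 = 35$ and $s = \sqrt{35} \approx 5.92$.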
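
The steps for this example can be traced in Python. This is a sketch of the sample formula with the n - 1 divisor; the variable names are ours, and the final line compares the result with `statistics.stdev` from the standard library.

```python
import math
from statistics import stdev

data = [6, 7, 10, 11, 11, 13, 16, 18, 25]
n = len(data)

mean = sum(data) / n                          # step 1: the mean
deviations = [x - mean for x in data]         # step 2: deviation of each observation
sum_sq = sum(d ** 2 for d in deviations)      # step 3: sum of squared deviations
variance = sum_sq / (n - 1)                   # step 4: sample variance (divide by n - 1)
s = math.sqrt(variance)                       # step 5: sample standard deviation

print("sum of deviations:", sum(deviations))
print("variance:", variance)
print("standard deviation:", round(s, 2))
print("matches statistics.stdev:", math.isclose(s, stdev(data)))
```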

Properties of Standard Deviation

Standard deviation is zero only when all observations are identical.
As the spread of data increases, standard deviation increases.
Standard deviation has the same units as the original data; variance has squared units.
Standard deviation is not resistant to outliers.
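
These properties can be checked on a few small data sets. All numbers below are made-up illustrative values, not data from the lecture.

```python
from statistics import stdev

identical = [7, 7, 7, 7]         # no variation at all
tight = [6, 7, 7, 8]             # values close to the mean of 7
wide = [2, 5, 9, 12]             # same mean of 7, values more spread out
with_outlier = tight + [40]      # tight data plus one extreme value

print("identical:   ", stdev(identical))
print("tight:       ", round(stdev(tight), 2))
print("wide:        ", round(stdev(wide), 2))
print("with outlier:", round(stdev(with_outlier), 2))
```

The identical values give a standard deviation of zero, the wider data give a larger value than the tight data, and a single extreme observation inflates the result sharply.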

The Empirical Rule

For bell-shaped (normal) distributions, the empirical rule provides approximate percentages of data within certain standard deviations from the mean:

About 68% of observations fall within 1 standard deviation of the mean ($\bar{x} \pm s$).
About 95% fall within 2 standard deviations ($\bar{x} \pm 2s$).
Nearly all (99.7%) fall within 3 standard deviations ($\bar{x} \pm 3s$).

Empirical Rule Table

Interval	Approximate Percentage
$\bar{x} \pm s$	68%
$\bar{x} \pm 2s$	95%
$\bar{x} \pm 3s$	99.7%
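
The rule can be checked empirically on simulated bell-shaped data. The sketch below is an illustration under stated assumptions: the sample size (10,000), the mean (100), the standard deviation (15), and the random seed are all arbitrary choices, and `share_within` is a hypothetical helper.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable
# Simulated bell-shaped data; the parameters 100 and 15 are arbitrary assumptions.
sample = [random.gauss(100, 15) for _ in range(10_000)]

x_bar, s = mean(sample), stdev(sample)

def share_within(k):
    """Fraction of the sample within k standard deviations of the sample mean."""
    return sum(abs(x - x_bar) <= k * s for x in sample) / len(sample)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

The printed shares should land close to 68%, 95%, and 99.7%, though not exactly on them, because the rule is approximate and the sample is random. For clearly skewed data the percentages can differ noticeably.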

Additional info: The Empirical Rule is fundamental for understanding the spread of data in normal distributions and is widely used in statistical inference.