Summarising Data: Numerical Methods in Statistics

Study Guide - Smart Notes

Tailored notes based on your materials, expanded with key definitions, examples, and context.

Measures of Variability

Interquartile Range (IQR)

The interquartile range (IQR) is a measure of variability that focuses on the middle 50% of the data, reducing the influence of extreme values. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1):

Formula:
Advantages: Easy to interpret, not influenced by extreme values.
Disadvantage: Only considers the middle 50% of the data.
Example: For starting salaries of 12 graduates, if and , then .

Variance

Variance measures the average squared deviation from the mean, utilizing all data points. It can be defined for both populations and samples:

Population variance:
Sample variance:
Interpretation: Variance is expressed in squared units of the variable.

Standard Deviation

The standard deviation is the positive square root of the variance, representing the average deviation of observations from their mean. It is expressed in the same units as the data.

Population standard deviation:
Sample standard deviation:
Example: For class sizes [46, 54, 42, 46, 32] with , .

Coefficient of Variation (CV)

The coefficient of variation is a relative measure of variability, expressing the standard deviation as a percentage of the mean:

Formula:
Interpretation: Useful for comparing variability between datasets with different units or means.
Example: If and , .

Measures of Distribution Shape, Relative Location, and Detecting Outliers

Distribution Shapes

Understanding the shape of a distribution is essential for data analysis. Common shapes include:

Skewed to the left (Negative skew): The left tail is longer; more outliers on the left.

Negative skew distribution

Skewed to the right (Positive skew): The right tail is longer; more outliers on the right.

Positive skew distribution

Symmetric: Both tails are of equal length; the normal distribution is a key example.

Z-scores

Z-scores measure the relative location of a value within a dataset, indicating how many standard deviations an observation is from the mean:

Formula:
Interpretation: Negative z-scores are below the mean; positive are above.
Example: For , , , (1.25 standard deviations above the mean).

The Empirical Rule

The empirical rule applies to bell-shaped (normal) distributions, describing the percentage of data within certain standard deviations from the mean:

Approximately 68% within 1 standard deviation
Approximately 95% within 2 standard deviations
Approximately 100% within 3 standard deviations

Empirical rule for normal distribution

Examples Using the Empirical Rule

IQ scores between 85 and 115 (mean 100, SD 15): 68% of people

IQ scores between 85 and 115

IQ scores between 70 and 130: 95% of people

IQ scores between 70 and 130

IQ scores above 130: 2.5% of people

IQ scores above 130

16th percentile (P16): IQ = 85

16th percentile at IQ 85

84th percentile (P84): IQ = 115

84th percentile at IQ 115

Outlier detection: IQ = 160 is an outlier (z = 4, more than 3 SDs from mean)

IQ outlier at 160

Detecting Outliers

Outliers are extreme values that may affect statistical analysis. Two main methods for detecting outliers:

For bell-shaped distributions: Any value with or is considered an outlier.
For non-bell-shaped distributions: Calculate lower and upper limits:
- Lower limit:
- Upper limit:
- Values outside these limits are outliers.

Five-Number Summaries and Boxplots

Five-Number Summary

The five-number summary consists of:

Minimum
First quartile (Q1)
Median (Q2)
Third quartile (Q3)
Maximum

Box Plot

A box plot is a graphical summary based on the five-number summary, useful for visualizing the spread and detecting outliers:

The box spans Q1 to Q3 (middle 50% of data).
A line marks the median; an X may indicate the mean.
Whiskers extend to the smallest/largest values within 1.5(IQR) of Q1/Q3.
Outliers are plotted as individual points.
Box plots can indicate skewness by the position of the median and the length of whiskers.

Box plot with outlier

Measures of Association Between Two Variables

Covariance

Covariance measures the direction of the linear relationship between two variables X and Y:

Formula:
Interpretation:
- : Positive linear relationship
- : Negative linear relationship
- : No linear relationship
Limitation: Covariance does not indicate the strength of the relationship due to dependence on units.

Correlation Coefficient

The correlation coefficient (r) measures the strength and direction of the linear relationship between two variables, standardized to be unitless:

Formula:
Range:
Interpretation:
- : Perfect positive linear relationship
- : Perfect negative linear relationship
- : No linear relationship

Correlation coefficient scale Perfect negative linear relationship No linear relationship Nonlinear relationship

The Weighted Mean and Grouped Data

Weighted Mean

When observations have different levels of importance, the weighted mean is used:

Formula:
Example: Calculating the mean cost per pound for purchases with different quantities.

Grouped Data

For data presented in frequency distributions, the mean and variance can be estimated using group midpoints:

Sample mean:
Sample variance:
Where: is the midpoint of the ith group, is the frequency, and is the total number of observations.

Summary Table: Key Measures and Their Formulas

Measure	Formula	Interpretation
Interquartile Range (IQR)		Middle 50% spread
Variance (Sample)		Average squared deviation
Standard Deviation (Sample)		Average deviation from mean
Coefficient of Variation		Relative variability (%)
Z-score		Relative position in SD units
Covariance		Direction of linear relationship
Correlation Coefficient		Strength and direction of linear relationship
Weighted Mean		Mean with weights
Grouped Data Mean		Mean for grouped data