Summarising Data: Numerical Methods in Statistics

Measures of Variability

Interquartile Range (IQR)

The interquartile range (IQR) is a measure of variability that focuses on the middle 50% of the data, reducing the influence of extreme values. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1):

  • Formula: $\mathrm{IQR} = Q_3 - Q_1$

  • Advantages: Easy to interpret, not influenced by extreme values.

  • Disadvantage: Only considers the middle 50% of the data.

  • Example: For the starting salaries of 12 graduates, sort the salaries, locate $Q_1$ and $Q_3$, and subtract: $\mathrm{IQR} = Q_3 - Q_1$.
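
As a rough illustration of the IQR formula, the sketch below uses Python's standard `statistics` module on the class-size data that appears later in this guide. Textbooks and software use several slightly different quartile conventions, so the default method of `statistics.quantiles` is an assumption here and may not match hand calculations from a particular course.

```python
import statistics

# Class sizes reused from this guide's standard deviation example.
class_sizes = [46, 54, 42, 46, 32]

# With n=4, statistics.quantiles returns the three quartile cut points.
# The default "exclusive" method is only one of several quartile conventions.
q1, q2, q3 = statistics.quantiles(class_sizes, n=4)

# IQR is the spread of the middle 50% of the data.
iqr = q3 - q1

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```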

Variance

Variance measures the average squared deviation from the mean, utilizing all data points. It can be defined for both populations and samples:

  • Population variance: $\sigma^2 = \dfrac{\sum (x_i - \mu)^2}{N}$

  • Sample variance: $s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$

  • Interpretation: Variance is expressed in squared units of the variable.

Standard Deviation

The standard deviation is the positive square root of the variance. It indicates how far observations typically lie from their mean and is expressed in the same units as the data.

  • Population standard deviation: $\sigma = \sqrt{\sigma^2}$

  • Sample standard deviation: $s = \sqrt{s^2}$

  • Example: For class sizes [46, 54, 42, 46, 32] with $\bar{x} = 44$, the squared deviations sum to 256, so $s^2 = 256/4 = 64$ and $s = \sqrt{64} = 8$.
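
The class-size example can be checked with a short sketch using Python's standard `statistics` module. It treats the five class sizes as a sample (divisor $n - 1$) and, for contrast, as a full population (divisor $N$); the variable names are illustrative only.

```python
import statistics

class_sizes = [46, 54, 42, 46, 32]

mean = statistics.mean(class_sizes)

# Sample versions divide the sum of squared deviations by n - 1.
sample_variance = statistics.variance(class_sizes)
sample_sd = statistics.stdev(class_sizes)

# Population versions divide by N instead.
population_variance = statistics.pvariance(class_sizes)
population_sd = statistics.pstdev(class_sizes)

print(f"mean = {mean}")
print(f"sample:     s^2 = {sample_variance}, s = {sample_sd}")
print(f"population: sigma^2 = {population_variance}, sigma = {population_sd:.3f}")
```

The sample variance is larger than the population variance computed from the same numbers because it divides by a smaller number.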

Coefficient of Variation (CV)

The coefficient of variation is a relative measure of variability, expressing the standard deviation as a percentage of the mean:

  • Formula: $\mathrm{CV} = \left( \dfrac{s}{\bar{x}} \times 100 \right)\%$

  • Interpretation: Useful for comparing variability between datasets with different units or means.

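  • Example: For class sizes [46, 54, 42, 46, 32], $\bar{x} = 44$ and $s = 8$, so $\mathrm{CV} = (8/44 \times 100)\% \approx 18.2\%$.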
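
A minimal sketch of the CV calculation, assuming sample data and using the sample standard deviation. The helper name `coefficient_of_variation` is hypothetical, and the class-size data is reused from the standard deviation example.

```python
import statistics

def coefficient_of_variation(data):
    """Return the sample standard deviation as a percentage of the mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100

class_sizes = [46, 54, 42, 46, 32]
cv_percent = coefficient_of_variation(class_sizes)

print(f"CV = {cv_percent:.1f}%")  # prints "CV = 18.2%"
```

Because the result is a percentage, it can be compared across datasets measured in different units, which the standard deviation alone does not allow.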

Measures of Distribution Shape, Relative Location, and Detecting Outliers

Distribution Shapes

Understanding the shape of a distribution is essential for data analysis. Common shapes include:

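  • Skewed to the left (negative skew): The left tail is longer; extreme values tend to lie on the low side, which typically pulls the mean below the median.

[Figure: Negative skew distribution]

  • Skewed to the right (positive skew): The right tail is longer; extreme values tend to lie on the high side, which typically pulls the mean above the median.

[Figure: Positive skew distribution]

  • Symmetric: The left and right sides mirror each other, so both tails have equal length; the normal distribution is a key example.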

Z-scores

Z-scores measure the relative location of a value within a dataset, indicating how many standard deviations an observation is from the mean:

  • Formula: $z_i = \dfrac{x_i - \bar{x}}{s}$

  • Interpretation: Negative z-scores are below the mean; positive are above.

  • Example: For $x_i = 54$, $\bar{x} = 44$, $s = 8$: $z_i = (54 - 44)/8 = 1.25$ (1.25 standard deviations above the mean).
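
The z-score formula can be sketched as follows, again on the class-size data from the standard deviation example. The helper `z_scores` is a hypothetical name, and the sketch assumes the sample mean and sample standard deviation are the appropriate reference values.

```python
import statistics

def z_scores(data):
    """Return (x - mean) / s for every observation in a sample."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return [(x - mean) / sd for x in data]

class_sizes = [46, 54, 42, 46, 32]
scores = z_scores(class_sizes)

for x, z in zip(class_sizes, scores):
    side = "above" if z > 0 else "below"
    print(f"x = {x}: z = {z:+.2f} ({abs(z):.2f} SDs {side} the mean)")
```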

The Empirical Rule

The empirical rule applies to bell-shaped (normal) distributions, describing the percentage of data within certain standard deviations from the mean:

  • Approximately 68% within 1 standard deviation

  • Approximately 95% within 2 standard deviations

  • Approximately 99.7% (almost all) within 3 standard deviations

[Figure: Empirical rule for normal distribution]
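
The rounded percentages in the empirical rule can be compared with the exact coverage of a normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper `coverage_within` is a hypothetical name, and the result only applies when the data really is approximately normal.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def coverage_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

coverage = {k: coverage_within(k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} SD: {share:.2%}")
```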

Examples Using the Empirical Rule

  • IQ scores between 85 and 115 (mean 100, SD 15): 68% of people (within 1 SD of the mean)

  • IQ scores between 70 and 130: 95% of people (within 2 SDs of the mean)

  • IQ scores above 130: 2.5% of people (half of the 5% that falls outside 70 to 130)

  • 16th percentile (P16): IQ = 85 (50% − 34% = 16% of scores lie below 85)

  • 84th percentile (P84): IQ = 115 (50% + 34% = 84% of scores lie below 115)

  • Outlier detection: IQ = 160 is an outlier ($z = (160 - 100)/15 = 4$, more than 3 SDs from the mean)
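
For comparison with the rounded empirical-rule answers, the sketch below models IQ as exactly normal with mean 100 and SD 15 (the values given in the examples) using `statistics.NormalDist`. The exact figures differ slightly from the rule of thumb; for instance, the share above 130 comes out a little under 2.5%.

```python
from statistics import NormalDist

# Assumption: IQ scores follow a normal distribution with mean 100 and SD 15.
iq = NormalDist(mu=100, sigma=15)

share_85_to_115 = iq.cdf(115) - iq.cdf(85)   # empirical rule: about 68%
share_70_to_130 = iq.cdf(130) - iq.cdf(70)   # empirical rule: about 95%
share_above_130 = 1 - iq.cdf(130)            # empirical rule: about 2.5%
p16 = iq.inv_cdf(0.16)                       # empirical rule: about 85
p84 = iq.inv_cdf(0.84)                       # empirical rule: about 115
z_160 = (160 - iq.mean) / iq.stdev           # z-score of an IQ of 160

print(f"85 to 115: {share_85_to_115:.1%}")
print(f"70 to 130: {share_70_to_130:.1%}")
print(f"above 130: {share_above_130:.1%}")
print(f"P16 = {p16:.1f}, P84 = {p84:.1f}")
print(f"z for IQ 160 = {z_160}")
```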

Detecting Outliers

Outliers are unusually small or large values that can distort statistical analysis. Two common methods for detecting them are:

  1. For bell-shaped distributions: Any value with $z < -3$ or $z > 3$ is considered an outlier.

  2. For non-bell-shaped distributions: Calculate lower and upper limits:

    • Lower limit: $Q_1 - 1.5(\mathrm{IQR})$

    • Upper limit: $Q_3 + 1.5(\mathrm{IQR})$

    • Values outside these limits are outliers.
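
Both detection rules can be sketched in a few lines of Python. The function names and the data values below are hypothetical, chosen only for illustration, and the quartiles come from the default method of `statistics.quantiles`, which may differ from a textbook's convention.

```python
import statistics

def z_score_outliers(data, threshold=3.0):
    """Values whose z-score is below -threshold or above +threshold."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs((x - mean) / sd) > threshold]

def iqr_outliers(data):
    """Values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    return [x for x in data if x < lower_limit or x > upper_limit]

# Hypothetical data with one suspiciously large value.
values = [10, 12, 12, 13, 14, 15, 15, 16, 40]

print("z-score rule:", z_score_outliers(values))
print("IQR rule:    ", iqr_outliers(values))
```

In this small made-up sample the IQR limits flag 40 while the z-score rule flags nothing, because the extreme value itself inflates the standard deviation. This is one reason the z-score rule is reserved for bell-shaped data.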

Five-Number Summaries and Boxplots

Five-Number Summary

The five-number summary consists of:

  1. Minimum

  2. First quartile (Q1)

  3. Median (Q2)

  4. Third quartile (Q3)

  5. Maximum
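
A five-number summary can be assembled from the minimum, the three quartiles, and the maximum. This is a minimal sketch with a hypothetical helper name, using the class-size data from earlier in the guide and the default quartile method of `statistics.quantiles`, which is an assumption and may not match every textbook.

```python
import statistics

def five_number_summary(data):
    """Return the minimum, Q1, median, Q3, and maximum of the data."""
    q1, median, q3 = statistics.quantiles(data, n=4)
    return {
        "min": min(data),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(data),
    }

class_sizes = [46, 54, 42, 46, 32]
summary = five_number_summary(class_sizes)

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```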

Box Plot

A box plot is a graphical summary based on the five-number summary, useful for visualizing the spread and detecting outliers:

  • The box spans Q1 to Q3 (middle 50% of data).

  • A line marks the median; an X may indicate the mean.

  • Whiskers extend to the smallest/largest values within 1.5(IQR) of Q1/Q3.

  • Outliers are plotted as individual points.

  • Box plots can indicate skewness by the position of the median and the length of whiskers.

[Figure: Box plot with outlier]

Measures of Association Between Two Variables

Covariance

Covariance measures the direction of the linear relationship between two variables X and Y:

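  • Formula (sample covariance): $s_{xy} = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

  • Interpretation:

    • $s_{xy} > 0$: Positive linear relationship

    • $s_{xy} < 0$: Negative linear relationship

    • $s_{xy} = 0$: No linear relationship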

  • Limitation: Covariance does not indicate the strength of the relationship due to dependence on units.
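
The sketch below computes the sample covariance directly from its definition and then rescales one variable to show the unit dependence described above. The function name and the data are hypothetical.

```python
def sample_covariance(x, y):
    """Sum of cross-deviations from the means, divided by n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cross / (n - 1)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_xy = sample_covariance(x, y)

# Same relationship, but x expressed in units 100 times smaller.
x_rescaled = [xi * 100 for xi in x]
s_xy_rescaled = sample_covariance(x_rescaled, y)

print(f"covariance:                 {s_xy}")
print(f"covariance after rescaling: {s_xy_rescaled}")
```

The sign stays positive in both cases, but the magnitude changes by a factor of 100 even though the relationship is the same, so the size of a covariance says little about strength.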

Correlation Coefficient

The correlation coefficient (r) measures the strength and direction of the linear relationship between two variables, standardized to be unitless:

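  • Formula (sample correlation): $r_{xy} = \dfrac{s_{xy}}{s_x s_y}$, where $s_x$ and $s_y$ are the sample standard deviations of X and Y

  • Range: $-1 \le r_{xy} \le 1$

  • Interpretation:

    • $r_{xy} = 1$: Perfect positive linear relationship

    • $r_{xy} = -1$: Perfect negative linear relationship

    • $r_{xy} = 0$: No linear relationship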

[Figures: Correlation coefficient scale; perfect negative linear relationship; no linear relationship; nonlinear relationship]
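
A minimal sketch of the correlation coefficient, computed from sums of deviations (the $n - 1$ divisors in the covariance and the standard deviations cancel). The function name and data are hypothetical. Rescaling one variable leaves the result unchanged, which illustrates why the correlation is unitless.

```python
import math

def pearson_r(x, y):
    """Sample correlation: covariance divided by the product of standard deviations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_xx = sum((xi - mean_x) ** 2 for xi in x)
    sum_yy = sum((yi - mean_y) ** 2 for yi in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_rescaled = pearson_r([xi * 100 for xi in x], y)   # change of units for x
r_perfect_negative = pearson_r(x, [10 - 2 * xi for xi in x])

print(f"r = {r:.4f}")
print(f"r after rescaling x = {r_rescaled:.4f}")
print(f"r for an exact decreasing line = {r_perfect_negative:.4f}")
```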

The Weighted Mean and Grouped Data

Weighted Mean

When observations have different levels of importance, the weighted mean is used:

  • Formula: $\bar{x} = \dfrac{\sum w_i x_i}{\sum w_i}$, where $w_i$ is the weight of observation $x_i$

  • Example: Calculating the mean cost per pound for purchases with different quantities.
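
The cost-per-pound idea can be sketched as follows. The prices and quantities are made-up values for illustration, and `weighted_mean` is a hypothetical helper; the pounds purchased serve as the weights.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Hypothetical purchases: price paid per pound and pounds bought at that price.
cost_per_pound = [3.00, 2.00]
pounds = [100, 300]

mean_cost = weighted_mean(cost_per_pound, pounds)
simple_mean = sum(cost_per_pound) / len(cost_per_pound)

print(f"weighted mean cost per pound:   {mean_cost:.2f}")
print(f"unweighted mean of the prices: {simple_mean:.2f}")
```

The weighted mean sits closer to the cheaper price because most of the pounds were bought at that price; the unweighted mean ignores the quantities.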

Grouped Data

For data presented in frequency distributions, the mean and variance can be estimated using group midpoints:

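  • Sample mean: $\bar{x} = \dfrac{\sum f_i M_i}{n}$

  • Sample variance: $s^2 = \dfrac{\sum f_i (M_i - \bar{x})^2}{n - 1}$

  • Where: $M_i$ is the midpoint of the ith group, $f_i$ is the frequency of the ith group, and $n = \sum f_i$ is the total number of observations.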
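
The grouped-data formulas can be sketched as below. The midpoints and frequencies are hypothetical, and the results are approximations because every observation in a group is treated as if it sat at the group midpoint.

```python
def grouped_mean_and_variance(midpoints, frequencies):
    """Approximate sample mean and variance from class midpoints and frequencies."""
    if len(midpoints) != len(frequencies):
        raise ValueError("midpoints and frequencies must have the same length")
    n = sum(frequencies)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(f * m for m, f in zip(midpoints, frequencies)) / n
    variance = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, frequencies)) / (n - 1)
    return mean, variance

# Hypothetical frequency distribution: three groups with midpoints 5, 15, 25.
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]

grouped_mean, grouped_variance = grouped_mean_and_variance(midpoints, frequencies)

print(f"grouped mean     = {grouped_mean}")
print(f"grouped variance = {grouped_variance:.3f}")
```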

Summary Table: Key Measures and Their Formulas

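| Measure | Formula | Interpretation |
|---|---|---|
| Interquartile Range (IQR) | $Q_3 - Q_1$ | Middle 50% spread |
| Variance (Sample) | $s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$ | Average squared deviation |
| Standard Deviation (Sample) | $s = \sqrt{s^2}$ | Typical deviation from the mean |
| Coefficient of Variation | $\left( \dfrac{s}{\bar{x}} \times 100 \right)\%$ | Relative variability (%) |
| Z-score | $z_i = \dfrac{x_i - \bar{x}}{s}$ | Relative position in SD units |
| Covariance | $s_{xy} = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$ | Direction of linear relationship |
| Correlation Coefficient | $r_{xy} = \dfrac{s_{xy}}{s_x s_y}$ | Strength and direction of linear relationship |
| Weighted Mean | $\bar{x} = \dfrac{\sum w_i x_i}{\sum w_i}$ | Mean with weights |
| Grouped Data Mean | $\bar{x} = \dfrac{\sum f_i M_i}{n}$ | Mean for grouped data |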
