Measures of Central Tendency and Variability in Biostatistics

Study Guide - Smart Notes


Measures of Central Tendency

Introduction to Central Tendency

Measures of central tendency are statistical tools used to describe the typical value in a dataset. They help summarize quantitative data with a single number, providing researchers and clinicians with a concise representation of the 'typical' subject or observation.

Key Measures: Mean, Median, Mode
Applications: Used to describe characteristics such as age, socioeconomic status, or general health in public health datasets.

Mean

The mean is the arithmetic average of a set of values. It is calculated by summing all values and dividing by the number of observations.

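Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ is the number of observations.
Example: For the dataset (7, 4, 4, 5): $\bar{x} = (7 + 4 + 4 + 5)/4 = 20/4 = 5$.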
Properties: Sensitive to extreme values (outliers); not robust.
Application in R: x <- c(7,4,4,5); mean(x) returns 5.
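The sensitivity of the mean to outliers can be seen with a short R sketch. The first vector is the dataset from the example above; the added value of 100 is a hypothetical outlier chosen only for illustration.

```r
# Dataset from the example above
x <- c(7, 4, 4, 5)

# Mean computed from the definition: sum of values / number of observations
mean_manual <- sum(x) / length(x)   # 20 / 4 = 5

# Same result with the built-in function
mean_builtin <- mean(x)             # 5

# Hypothetical outlier (not from the notes) to show sensitivity
x_outlier <- c(x, 100)
mean_outlier <- mean(x_outlier)     # 120 / 5 = 24
```

A single extreme value moves the mean from 5 to 24, even though four of the five observations are still between 4 and 7.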

Median

The median is the midpoint of a dataset, such that half the values are smaller and half are larger. It is less affected by outliers and skewed data.

Calculation: Arrange data in order and find the middle value. If the number of observations ($n$) is odd, the median is the middle value. If $n$ is even, the median is the mean of the two middle values.
Example: For (1, 4, 3, 2), sorted as (1, 2, 3, 4), the median is $(2 + 3)/2 = 2.5$.
Robustness: Median is robust to extreme values.
Application in R: x <- c(1,4,3,2); median(x) returns 2.5.
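The even and odd cases, and the robustness of the median, can be sketched in R as follows. The value 100 appended to the example data is a hypothetical outlier used only for illustration.

```r
# Even number of observations: median is the mean of the two middle values
x_even <- c(1, 4, 3, 2)
sorted_even <- sort(x_even)          # 1 2 3 4
median_even <- median(x_even)        # (2 + 3) / 2 = 2.5

# Odd number of observations, with a hypothetical outlier added
x_odd <- c(1, 4, 3, 2, 100)
median_odd <- median(x_odd)          # sorted: 1 2 3 4 100, middle value = 3
mean_odd <- mean(x_odd)              # 110 / 5 = 22
```

The outlier shifts the median only from 2.5 to 3, while the mean jumps to 22.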

Mode

The mode is the value that occurs most frequently in a dataset. It can be used for both quantitative and categorical data.

Example: For ("M", "F", "M", "F", "F"), the mode is "F" (occurs three times).
Application in R: table(Sex) provides frequency counts for categorical variables.
Note: Mode is less commonly used for quantitative data but is useful for categorical data.
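Base R has no built-in function for the statistical mode; the function named mode() returns the storage type of an object instead. One possible way to extract the mode from the frequency table is sketched below. The variable names sex, counts, and mode_value are chosen here for illustration.

```r
# Categorical data from the example above
sex <- c("M", "F", "M", "F", "F")

# Frequency counts for each category
counts <- table(sex)                 # F: 3, M: 2

# Category (or categories, if tied) with the highest count
mode_value <- names(counts)[as.vector(counts) == max(counts)]   # "F"

# Note: mode() is NOT the statistical mode; it reports the storage type
storage_type <- mode(sex)            # "character"
```

Comparing against max(counts) rather than using which.max() keeps all tied categories when a dataset has more than one mode.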

Comparing Mean and Median

When a dataset is skewed, the mean and median can differ significantly. In such cases, the median is preferred to describe the typical value.

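Symmetric Distribution: Mean ≈ Median
Left-Skewed: Mean < Median (typically; the long left tail pulls the mean down)
Right-Skewed: Mean > Median (typically; the long right tail pulls the mean up)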
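The relationships above can be checked on small made-up datasets. Both vectors below are hypothetical (for example, hospital length of stay in days) and are not taken from the notes.

```r
# Hypothetical right-skewed data: most values small, one long stay
right_skewed <- c(1, 2, 2, 3, 3, 4, 30)
mean_right <- mean(right_skewed)      # 45 / 7, about 6.43
median_right <- median(right_skewed)  # 3

# Hypothetical left-skewed data: most values large, one very small value
left_skewed <- c(1, 27, 28, 28, 29, 29, 30)
mean_left <- mean(left_skewed)        # 172 / 7, about 24.57
median_left <- median(left_skewed)    # 28
```

In both cases the median stays close to where most of the data lie, which is why it is preferred for skewed data.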

Measures of Variability

Introduction to Variability

Measures of variability describe the spread or dispersion of data. They complement measures of central tendency by indicating how much the data varies.

Key Measures: Range, Standard Deviation, Variance, Interquartile Range (IQR)
Example: Two patients with the same mean systolic blood pressure (SBP) may have different variability, affecting clinical decisions.

Range

The range is the difference between the largest and smallest values in a dataset.

Formula: $\text{Range} = x_{\max} - x_{\min}$
Example: For (7, 4, 4, 5): $7 - 4 = 3$.
Application in R: max(x) - min(x)
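One point that is easy to miss: R's range() function returns the minimum and maximum as a pair, not their difference. A brief sketch using the example dataset:

```r
x <- c(7, 4, 4, 5)

# Range as defined above: largest value minus smallest value
range_value <- max(x) - min(x)   # 7 - 4 = 3

# range() returns the two endpoints, not the difference
endpoints <- range(x)            # 4 7

# diff() of the endpoints gives the same single number
range_alt <- diff(range(x))      # 3
```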

Standard Deviation and Variance

Standard deviation measures how far data points typically lie from the mean. It is the square root of the variance, which is the average squared deviation from the mean (computed with $n - 1$ in the denominator for a sample).

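Variance Formula: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (sample variance; R's var() and sd() use this $n - 1$ denominator)
Standard Deviation Formula: $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$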
Properties: Standard deviation is widely used in biostatistics; a value of 0 indicates no variability.
Application in R: sd(y) for a vector y; var(y) returns the variance.
Note: Neither variance nor standard deviation is robust to outliers.
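As a worked check, the sample variance and standard deviation can be computed step by step and compared with R's built-in functions. The vector y below reuses the values (7, 4, 4, 5) from the earlier examples; that reuse is an assumption made for illustration.

```r
# Illustrative data (same values as the mean example)
y <- c(7, 4, 4, 5)
n <- length(y)

# Deviations from the mean: 2, -1, -1, 0
deviations <- y - mean(y)

# Sample variance: sum of squared deviations divided by (n - 1)
var_manual <- sum(deviations^2) / (n - 1)   # 6 / 3 = 2

# Standard deviation: square root of the variance
sd_manual <- sqrt(var_manual)               # sqrt(2), about 1.414

# Built-in equivalents (both use the n - 1 denominator)
var_builtin <- var(y)
sd_builtin <- sd(y)
```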

Quartiles and Interquartile Range (IQR)

Quartiles divide ordered data into four equal parts. The first quartile (Q1) is the value below which 25% of the data fall, and the third quartile (Q3) is the value below which 75% of the data fall.

IQR Formula: $\text{IQR} = Q_3 - Q_1$
Application: IQR is a robust measure of variability, less affected by outliers.
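Quartiles and the IQR can be obtained in R with quantile() and IQR(). The data below are hypothetical. Note that quantile() supports several interpolation conventions (its type argument), so other software or hand calculations may give slightly different quartiles for the same data.

```r
# Hypothetical data: the integers 1 through 9
x <- 1:9

# First and third quartiles (R's default quantile method)
q1 <- quantile(x, 0.25, names = FALSE)   # 3
q3 <- quantile(x, 0.75, names = FALSE)   # 7

# IQR from the definition, and from the built-in function
iqr_manual <- q3 - q1                    # 4
iqr_builtin <- IQR(x)                    # 4
```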

Boxplot and Outliers

A boxplot is a graphical summary displaying the minimum, Q1, median, Q3, and maximum. Outliers are typically defined as values less than $Q_1 - 1.5 \times \text{IQR}$ or greater than $Q_3 + 1.5 \times \text{IQR}$.

Interpretation: Outliers should be investigated, as they may indicate errors or true extreme values.
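The outlier rule can be applied directly in R. The sketch below uses hypothetical data with one extreme value and the conventional 1.5 x IQR fences. R's boxplot() computes its box edges slightly differently from quantile() (it uses hinges), so the two approaches can disagree for borderline values, although they agree here.

```r
# Hypothetical data with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Quartiles and IQR (R's default quantile method)
q1 <- quantile(x, 0.25, names = FALSE)   # 3.25
q3 <- quantile(x, 0.75, names = FALSE)   # 7.75
iqr <- q3 - q1                           # 4.5

# Fences: values outside this interval are flagged as outliers
lower_fence <- q1 - 1.5 * iqr            # -3.5
upper_fence <- q3 + 1.5 * iqr            # 14.5
outliers <- x[x < lower_fence | x > upper_fence]   # 30

# Points that boxplot() would draw individually
boxplot_outliers <- boxplot.stats(x)$out           # 30
```

Calling boxplot(x) would draw the same data, with 30 plotted as a separate point beyond the upper whisker.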

Summary Table: Measures of Central Tendency and Variability

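Measure	Definition	Formula	Robustness
Mean	Arithmetic average	$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$	Not robust
Median	Middle value	Middle of ordered data	Robust
Mode	Most frequent value	N/A	Robust
Range	Difference between largest and smallest values	$x_{\max} - x_{\min}$	Not robust
Standard Deviation	Typical deviation from the mean	$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$	Not robust
IQR	Spread of the middle 50% of the data	$Q_3 - Q_1$	Robust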

Additional info:

Examples and R code snippets are provided to illustrate calculations.
Boxplots and histograms are useful for visualizing distributions and identifying outliers.
In biostatistics, understanding both central tendency and variability is crucial for interpreting health data and making informed decisions.