Biostatistics Week 3: Random Variables, Binomial and Normal Distributions

Study Guide - Smart Notes

Random Variables

Definition of Random Variables

A random variable is a variable whose value is determined by the outcome of a random event or experiment. The value cannot be predicted with certainty in advance.

Example: If you take someone's blood pressure, the result is a random variable because the outcome is not known until measured.

Types of Random Variables

It is important to distinguish between discrete and continuous random variables, as different statistical techniques are used for each type.

Discrete random variable: Takes on a countable number of distinct values. Examples: Number of speeding tickets you have had; number of songs on your phone.
Continuous random variable: Can take on any value within a given range. Examples: How old you are; how long it takes you to drive home.

Binomial Random Variable

Definition and Application

A binomial random variable counts the number of successes in a fixed number of independent trials, where each trial has only two possible outcomes (success or failure) and the same probability of success.

Example: Suppose there is an infectious disease with a probability of 0.25 that a person who has had contact with an infected person gets the disease. In a sample of 50 independent patients, the number who get the disease is a binomial random variable.
Questions of interest: What is the probability that exactly 10 have the disease? What is the probability that at most 15 have the disease?

Binomial Distribution

Characteristics of a Binomial Experiment

The experiment consists of n identical and independent trials.
Each trial results in one of two outcomes: Success or Failure.
The probability of success (p) is the same for each trial; probability of failure is 1 - p.
The binomial random variable X is the number of successes in n trials.

Example: n = 50, p = 0.25, X = number of patients who have the disease.

Calculating Binomial Probabilities

To find the probability of exactly k successes, P(X = k), use the binomial formula $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$. In R, use dbinom(k, n, p). Example: for n = 50, p = 0.25: dbinom(10, 50, 0.25) ≈ 0.0985.
To find the probability of at most k successes, P(X ≤ k): In R, use pbinom(k, n, p). Example: for n = 50, p = 0.25: pbinom(15, 50, 0.25) ≈ 0.8369.
To find the probability of at least k successes, P(X ≥ k): In R, use 1 - pbinom(k - 1, n, p). Example: at least 10 successes with n = 50, p = 0.25: 1 - pbinom(9, 50, 0.25) ≈ 0.8363.
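
The three kinds of binomial question (exactly, at most, at least) can be sketched in R for the infectious disease example. This is a minimal sketch under the notes' assumptions (n = 50 independent patients, p = 0.25); the variable names are illustrative only.

```r
# Binomial example from the notes: n = 50 contacts, p = 0.25 chance of infection
n <- 50
p <- 0.25

# P(X = 10): exactly 10 infected, via the built-in probability mass function
p_exactly_10 <- dbinom(10, size = n, prob = p)

# Same quantity written out as choose(n, k) * p^k * (1 - p)^(n - k)
p_exactly_10_manual <- choose(n, 10) * p^10 * (1 - p)^(n - 10)

# P(X <= 15): at most 15 infected, via the cumulative distribution function
p_at_most_15 <- pbinom(15, size = n, prob = p)

# P(X >= 10): at least 10 infected, as the complement of P(X <= 9)
p_at_least_10 <- 1 - pbinom(9, size = n, prob = p)

# Equivalent upper-tail call: lower.tail = FALSE returns P(X > 9)
p_at_least_10_alt <- pbinom(9, size = n, prob = p, lower.tail = FALSE)

print(round(c(exactly_10  = p_exactly_10,
              at_most_15  = p_at_most_15,
              at_least_10 = p_at_least_10), 4))
```

Note the off-by-one in the "at least" case: because pbinom(k, n, p) includes k itself, P(X ≥ 10) needs pbinom(9, ...), not pbinom(10, ...).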

Normal Distribution

Definition and Properties

The normal distribution (also known as the bell curve) is the most important continuous probability distribution in statistics. Many naturally occurring variables are approximately normally distributed.

Defined by two parameters: mean (μ) and standard deviation (σ).
The mean determines the center of the distribution; the standard deviation determines the spread.
The normal distribution is symmetric about the mean, and the mean, median, and mode are equal.

Standard Normal Distribution: A normal distribution with μ = 0 and σ = 1, denoted as $Z \sim N(0, 1)$.

Normal Distribution Formula

The probability density function (PDF) of the normal distribution is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
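
As a quick check on the density, the sketch below writes the normal PDF out by hand and compares it with R's built-in dnorm(). The function name normal_pdf and the evaluation points are illustrative choices, not part of the course materials.

```r
# Normal density written out by hand; mu is the mean, sigma the standard deviation
normal_pdf <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

# Evaluate at a few illustrative points around a mean of 45 with SD 10
x <- c(25, 35, 45, 55, 65)
manual  <- normal_pdf(x, mu = 45, sigma = 10)
builtin <- dnorm(x, mean = 45, sd = 10)

# all.equal() compares the two vectors up to floating-point tolerance
print(all.equal(manual, builtin))
```

The density is highest at the mean and takes equal values at points the same distance above and below it, which is the symmetry property listed earlier.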

Using the Normal Distribution

To find the probability that a value is less than a given x, use the cumulative distribution function (CDF): $F(x) = P(X \le x)$.
In R, use pnorm(x, mu, sigma) to find the area to the left of x.
To find the value corresponding to a given percentile, use the quantile function: qnorm(p, mu, sigma).

Example: If ages are normally distributed with mean 45 and standard deviation 10, the 25th percentile is found by qnorm(0.25, 45, 10) ≈ 38.26.
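
The following sketch shows how pnorm() and qnorm() fit together for the age example. The cutoffs 38, 40, 50, and 60 are arbitrary values chosen for illustration.

```r
# Ages assumed normal with mean 45 and SD 10, as in the example
mu <- 45
sigma <- 10

# Area to the left of 38: P(X < 38)
p_below_38 <- pnorm(38, mean = mu, sd = sigma)

# Area to the right of 60: P(X > 60), using the upper tail
p_above_60 <- pnorm(60, mean = mu, sd = sigma, lower.tail = FALSE)

# Area between 40 and 50: difference of two left-tail areas
p_between <- pnorm(50, mu, sigma) - pnorm(40, mu, sigma)

# 25th percentile: the age with 25% of the distribution below it
q25 <- qnorm(0.25, mean = mu, sd = sigma)

# qnorm() inverts pnorm(), so feeding q25 back in recovers 0.25
round_trip <- pnorm(q25, mean = mu, sd = sigma)

print(c(p_below_38 = p_below_38, p_above_60 = p_above_60,
        p_between = p_between, q25 = q25, round_trip = round_trip))
```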

Empirical Rule (68-95-99.7 Rule)

About 68% of values fall within 1 standard deviation of the mean.
About 95% of values fall within 2 standard deviations of the mean.
About 99.7% of values fall within 3 standard deviations of the mean.
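
The three percentages in the rule are rounded areas under the standard normal curve, which can be checked with pnorm(). A minimal sketch (the helper name within_k_sd is illustrative):

```r
# Area within k standard deviations of the mean on the standard normal curve
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

coverage <- sapply(1:3, within_k_sd)
names(coverage) <- c("1 SD", "2 SD", "3 SD")

print(round(coverage, 4))
```

Because any normal variable can be standardized, the same areas hold for every choice of μ and σ.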

Central Limit Theorem (CLT)

Statement and Importance

The Central Limit Theorem states that, for a sufficiently large sample size, the distribution of the sample mean (X̄) will be approximately normal, regardless of the shape of the population distribution.

The mean of the sampling distribution is equal to the population mean (μ).
The standard deviation of the sampling distribution (standard error) is $\sigma / \sqrt{n}$, where σ is the population standard deviation and n is the sample size.

Example: If a random sample of n=64 is taken from a population with mean 425 and standard deviation 5, the standard error is $\sigma / \sqrt{n} = 5 / \sqrt{64} = 0.625$.
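
A small simulation can make the theorem concrete. The sketch below assumes a hypothetical right-skewed population (a shifted exponential chosen to have mean 425 and standard deviation 5, matching the example); the population shape, the seed, and the number of replications are all illustrative assumptions.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

n <- 64                       # sample size from the example
sigma <- 5                    # population standard deviation from the example
se_theory <- sigma / sqrt(n)  # theoretical standard error of the sample mean

# Hypothetical skewed population: rexp(rate = 1/5) has mean 5 and SD 5,
# so adding 420 gives a population with mean 425 and SD 5.
draw_sample <- function(n) 420 + rexp(n, rate = 1 / 5)

# Draw 10,000 samples of size n and record each sample mean
sample_means <- replicate(10000, mean(draw_sample(n)))

# The mean of the sample means should be close to 425,
# and their standard deviation close to the theoretical standard error
print(c(mean_of_means = mean(sample_means),
        sd_of_means   = sd(sample_means),
        se_theory     = se_theory))
```

Plotting hist(sample_means) should show a roughly bell-shaped histogram even though individual observations from this population are strongly skewed.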

Confidence Intervals

Definition and Calculation

A confidence interval provides a range of values that is likely to contain the true population parameter (such as the mean) with a specified level of confidence (e.g., 95%).

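The general form for a confidence interval for the mean is:
$\bar{x} \pm z_{\alpha/2} \cdot \sigma / \sqrt{n}$ (when population standard deviation is known)
$\bar{x} \pm t_{\alpha/2,\, n-1} \cdot s / \sqrt{n}$ (when population standard deviation is unknown, use sample standard deviation s and t-distribution)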
Lower Confidence Limit (LCL): Lower end of the interval
Upper Confidence Limit (UCL): Upper end of the interval

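Example: For a sample mean of 13.0, sample standard deviation 5, and n=36, the 95% confidence interval using the critical value z = 1.96 is calculated as: $13.0 \pm 1.96 \times 5 / \sqrt{36} = 13.0 \pm 1.633$. So, the interval is (11.367, 14.633). Because the standard deviation here is a sample estimate, the t critical value with 35 degrees of freedom (about 2.03) would give a slightly wider interval, approximately (11.31, 14.69).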
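
The interval in this example can be reproduced in R from the summary statistics alone. The sketch below computes both a z-based and a t-based interval so the two can be compared; the variable names are illustrative.

```r
# Summary statistics from the example
xbar <- 13.0        # sample mean
s <- 5              # sample standard deviation
n <- 36             # sample size
conf_level <- 0.95

alpha <- 1 - conf_level
se <- s / sqrt(n)   # estimated standard error of the mean

# z-based interval: critical value from the standard normal distribution
z_crit <- qnorm(1 - alpha / 2)
ci_z <- xbar + c(-1, 1) * z_crit * se

# t-based interval: critical value from the t distribution with n - 1 df
t_crit <- qt(1 - alpha / 2, df = n - 1)
ci_t <- xbar + c(-1, 1) * t_crit * se

print(round(rbind(z_interval = ci_z, t_interval = ci_t), 3))
```

The t-based interval is always at least as wide as the z-based one, and the gap shrinks as n grows.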

Confidence Intervals for Two Groups

Confidence intervals can also be used to compare means between two groups (e.g., male vs. female BMI) by constructing an interval for the difference between the two means. If that interval includes 0, the difference is not statistically significant at the chosen confidence level.
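
One way to obtain such an interval in R is t.test(), which reports a confidence interval for the difference in means. The sketch below uses simulated BMI values, not real data; the group sizes, means, standard deviations, and seed are all made up for illustration.

```r
set.seed(2)  # arbitrary seed so the simulated data are reproducible

# Simulated (not real) BMI values for two hypothetical groups
bmi_male   <- rnorm(40, mean = 27, sd = 4)
bmi_female <- rnorm(40, mean = 26, sd = 4)

# Welch two-sample t-test (R's default, which does not assume equal variances);
# conf.int is the interval for mean(bmi_male) - mean(bmi_female)
result <- t.test(bmi_male, bmi_female, conf.level = 0.95)
ci_diff <- result$conf.int

# TRUE when the interval contains 0, i.e. the difference is
# not statistically significant at the 5% level
includes_zero <- ci_diff[1] <= 0 && ci_diff[2] >= 0

print(ci_diff)
print(includes_zero)
```

The interval and the test's p-value always agree: the 95% interval contains 0 exactly when the p-value is above 0.05.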

Summary Table: Discrete vs. Continuous Random Variables

Type	Definition	Examples
Discrete	Takes on countable values	Number of tickets, number of songs
Continuous	Can take any value in a range	Age, time to drive home

Summary Table: Binomial vs. Normal Distribution

Distribution	Type of Variable	Parameters	Shape	Example
Binomial	Discrete	n (number of trials), p (probability of success)	Symmetric (if p=0.5), otherwise skewed	Number of patients with disease
Normal	Continuous	μ (mean), σ (standard deviation)	Bell-shaped, symmetric	Blood pressure, height