Review of Probability Distributions, Hypothesis Testing, and Statistical Inference

Study Guide - Smart Notes


Probability Distributions

Discrete and Continuous Distributions

A probability distribution describes how probability is spread over the values a random variable can take. Discrete distributions apply to countable outcomes (such as a number of successes), while continuous distributions apply to outcomes measured on a continuous scale (such as a waiting time).

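Binomial Distribution: Models the number of successes in a fixed number $n$ of independent Bernoulli trials, each with the same success probability $p$. Formula: $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$ for $k = 0, 1, \dots, n$
Geometric Distribution: Models the number of trials needed to get the first success in repeated, independent Bernoulli trials with success probability $p$. Formula: $P(X = k) = (1-p)^{k-1} p$ for $k = 1, 2, \dots$
Poisson Distribution: Models the number of events occurring in a fixed interval of time or space, given that the events occur at a known constant mean rate and independently of the time since the last event. Formula: $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k = 0, 1, 2, \dots$, where $\lambda$ is the expected number of events in the interval
Exponential Distribution: Models the time between events in a Poisson process. Formula: $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$, where $\lambda$ is the event rate per unit time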

Example: If you pick 50 candies from a bag of M&Ms with replacement and the probability of picking a red one is 1/5, the number of red M&Ms follows a binomial distribution: $X \sim \text{Binomial}(n = 50, p = 1/5)$.
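
The M&M example can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `binomial_pmf` and `geometric_pmf` are illustrative choices, not part of the original notes. It uses the standard formulas $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$ (binomial) and $P(X = k) = (1-p)^{k-1} p$ (geometric).

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def geometric_pmf(k: int, p: float) -> float:
    """P(X = k): the first success happens on trial k (k = 1, 2, ...)."""
    return (1 - p) ** (k - 1) * p


n, p = 50, 1 / 5  # values from the M&M example

# The probabilities over k = 0..n should sum to 1 (up to rounding).
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

# Mean number of red candies computed from the pmf; theory says n * p = 10.
mean_red = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))

# Probability that the first red candy appears on the 3rd pick: 0.8 * 0.8 * 0.2.
first_red_on_third = geometric_pmf(3, p)

print(f"sum of pmf            = {total:.6f}")
print(f"mean number of red    = {mean_red:.4f}")
print(f"P(first red on pick 3) = {first_red_on_third:.4f}")
```

The same two functions cover the "number of red candies in a fixed number of picks" and "picks until the first red candy" questions, which differ only in what is being counted.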

Applications of Distributions

Binomial: Number of red candies in a fixed number of picks.
Geometric: Number of picks needed to get the first red candy.
Poisson: Number of distractions in a fixed time interval, given a constant rate.
Exponential: Time until the next distraction or event occurs.

Example: If distractions occur at a rate of 1 every 40 seconds, the number of distractions in 1 minute follows a Poisson distribution with $\lambda = 1.5$ (since $60/40 = 1.5$).
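
As a rough illustration of how the Poisson and exponential views of the same process agree, the sketch below computes the probability of no distraction in one minute in two ways: as a Poisson count of zero, and as an exponential waiting time longer than 60 seconds. The helper names are hypothetical; only the rate of 1 per 40 seconds comes from the example.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """P(N = k) for N ~ Poisson(lam), lam = expected events in the interval."""
    return lam**k * exp(-lam) / factorial(k)


def exponential_survival(t: float, rate: float) -> float:
    """P(T > t) for T ~ Exponential(rate), rate = events per unit time."""
    return exp(-rate * t)


rate_per_second = 1 / 40        # one distraction every 40 seconds
lam = rate_per_second * 60      # expected distractions per minute (1.5)

# Two descriptions of the same event: "no distraction in the next minute".
p_none = poisson_pmf(0, lam)
p_wait_over_minute = exponential_survival(60, rate_per_second)

# Probability of at least two distractions in one minute.
p_at_least_two = 1 - poisson_pmf(0, lam) - poisson_pmf(1, lam)

print(f"lambda per minute       = {lam}")
print(f"P(0 distractions)       = {p_none:.4f}")
print(f"P(wait > 60 s)          = {p_wait_over_minute:.4f}")
print(f"P(at least 2 in 1 min)  = {p_at_least_two:.4f}")
```

The first two probabilities coincide because a waiting time longer than the interval is the same event as zero arrivals in that interval.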

Random Variables and Their Properties

Normal Distribution and Sums of Random Variables

The normal distribution is a continuous probability distribution characterized by its mean $\mu$ and variance $\sigma^2$. The sum or difference of independent normal random variables is also normally distributed.

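If $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$ are independent: $X \pm Y \sim N(\mu_X \pm \mu_Y, \sigma_X^2 + \sigma_Y^2)$. The means add or subtract, but the variances always add.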

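Example (illustrative values): If $X \sim N(10, 4)$ and $Y \sim N(5, 9)$ are independent, then $X + Y \sim N(15, 13)$ and $X - Y \sim N(5, 13)$.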
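
The rule for sums and differences of independent normals can be checked with `statistics.NormalDist`, which supports `+` and `-` between two instances under the assumption that they are independent. The parameter values below are illustrative only and are not taken from the original notes.

```python
from statistics import NormalDist

# Illustrative values: NormalDist takes the standard deviation, not the variance.
X = NormalDist(mu=10, sigma=2)  # variance 4
Y = NormalDist(mu=5, sigma=3)   # variance 9

total = X + Y  # independent sum: means add, variances add
diff = X - Y   # independent difference: means subtract, variances still add

print(f"X + Y: mean = {total.mean}, variance = {total.variance:.4f}")
print(f"X - Y: mean = {diff.mean}, variance = {diff.variance:.4f}")
```

Note that the variance of the difference is 4 + 9, not 4 - 9; subtracting a random variable still adds its variability.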

Central Limit Theorem (CLT)

The CLT states that the sum (or average) of a large number of independent, identically distributed random variables with finite variance is approximately normally distributed, regardless of the shape of the original distribution.

Application: The average amount spent per grocery run after many trips will be approximately normal.
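
A small simulation can make the CLT concrete. The sketch below assumes, purely for illustration, that the amount spent on each grocery run is exponentially distributed with mean 50 (a strongly skewed distribution); that model and the numbers are assumptions, not part of the notes. If the CLT applies, averages over 100 trips should cluster around 50 with standard deviation close to $50/\sqrt{100} = 5$, and roughly 68% of them should fall within one such standard deviation of 50.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is reproducible

MEAN_SPEND = 50.0  # hypothetical mean spend per trip
N_TRIPS = 100      # trips averaged together
N_REPEATS = 2000   # number of simulated averages


def average_spend(n_trips: int) -> float:
    """Average of n_trips draws from a skewed (exponential) distribution."""
    return mean(random.expovariate(1 / MEAN_SPEND) for _ in range(n_trips))


sample_means = [average_spend(N_TRIPS) for _ in range(N_REPEATS)]

center = mean(sample_means)
spread = stdev(sample_means)
theoretical_se = MEAN_SPEND / N_TRIPS**0.5  # sigma / sqrt(n); sigma = mean for exponential

# Share of averages within one standard error of the true mean (normal theory: ~0.68).
within_one_se = sum(
    abs(m - MEAN_SPEND) <= theoretical_se for m in sample_means
) / N_REPEATS

print(f"mean of averages  = {center:.2f}")
print(f"sd of averages    = {spread:.2f} (theory: {theoretical_se:.2f})")
print(f"share within 1 SE = {within_one_se:.3f}")
```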

Statistical Inference

Hypothesis Testing

Hypothesis testing is a statistical method used to make decisions about population parameters based on sample data.

Null Hypothesis ($H_0$): The default assumption (e.g., no difference, no effect).
Alternative Hypothesis ($H_a$): The competing claim (e.g., there is a difference).
Significance Level ($\alpha$): The probability of rejecting $H_0$ when it is true (Type I error).
p-value: The probability of observing data at least as extreme as the sample, assuming $H_0$ is true.

Example: To test if the percentage of UCSD students spending Thanksgiving with family decreased from 65% in 2019, set $H_0: p = 0.65$, $H_a: p < 0.65$.
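
One common way to carry out a test like this is a one-sided, one-sample z-test for a proportion. The sketch below is written under that assumption; the survey result (120 of 200 students) is hypothetical and chosen only to show the mechanics, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def lower_tail_proportion_z_test(successes: int, n: int, p0: float):
    """z statistic and p-value for H0: p = p0 versus Ha: p < p0."""
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)    # standard error computed under H0
    z = (p_hat - p0) / se
    p_value = NormalDist().cdf(z)   # lower tail, because Ha is p < p0
    return z, p_value


# Hypothetical sample: 120 of 200 surveyed students (60%).
z, p_value = lower_tail_proportion_z_test(successes=120, n=200, p0=0.65)

alpha = 0.05
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

With these made-up numbers the p-value lands slightly above 0.05, so the observed drop from 65% to 60% would not be judged significant at that level.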

Types of Statistical Tests

One-sample t-test: Compares the mean of a single group to a known value.
Two-sample t-test: Compares the means of two independent groups.
Paired t-test: Compares means from the same group at different times or under different conditions.
Chi-square test for goodness-of-fit: Tests if observed categorical data fit a specified distribution.
Chi-square test for independence: Tests if two categorical variables are independent.
Chi-square test for homogeneity: Tests if distributions of a categorical variable are the same across different populations.
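
To show how the three t-tests above differ, here is a minimal sketch of their test statistics using the standard library. The function names and sample data are illustrative. The two-sample version uses the unpooled (Welch) form, which is one common choice; a pooled-variance version is the other. Converting a statistic to a p-value needs the t distribution, which the standard library does not provide, so only the statistics are computed here.

```python
from math import sqrt
from statistics import mean, stdev, variance


def one_sample_t(sample, mu0):
    """t = (x_bar - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))


def paired_t(before, after):
    """Paired test: a one-sample t-test on the within-pair differences."""
    diffs = [a - b for a, b in zip(after, before)]
    return one_sample_t(diffs, 0.0)


def welch_t(group_a, group_b):
    """Two independent groups, variances not assumed equal (Welch form)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Made-up measurements, used once as paired data and once as independent groups.
before = [1, 2, 3]
after = [2, 4, 6]

print(f"one-sample t (after vs. 2): {one_sample_t(after, 2):.3f}")
print(f"paired t:                   {paired_t(before, after):.3f}")
print(f"Welch two-sample t:         {welch_t(after, before):.3f}")
```

The paired statistic is larger than the two-sample one for the same numbers because pairing removes the between-subject variability from the standard error.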

Example: To test if the distribution of grades matches expected proportions, use the chi-square test for goodness-of-fit.
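
The goodness-of-fit statistic is $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$ with (number of categories $- 1$) degrees of freedom. The sketch below computes it for hypothetical grade counts and expected proportions (both invented for illustration) and compares it with 9.488, the standard 5% critical value for 4 degrees of freedom.

```python
def chi_square_statistic(observed, expected_props):
    """Sum of (O - E)^2 / E, where E = total count * expected proportion."""
    total = sum(observed)
    expected = [total * p for p in expected_props]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: counts of grades A, B, C, D, F for 100 students.
observed = [18, 30, 32, 12, 8]
expected_props = [0.20, 0.30, 0.30, 0.10, 0.10]

stat = chi_square_statistic(observed, expected_props)
df = len(observed) - 1
CRITICAL_5PCT_DF4 = 9.488  # chi-square critical value, alpha = 0.05, df = 4

print(f"chi-square = {stat:.4f} with {df} degrees of freedom")
if stat > CRITICAL_5PCT_DF4:
    print("reject H0: grades do not match the expected proportions")
else:
    print("fail to reject H0: no evidence against the expected proportions")
```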

Confidence Intervals (CI)

A confidence interval gives a range of plausible values for a population parameter, based on sample data.

Interpretation: A 95% CI means that if we repeated the sampling process many times, 95% of the intervals would contain the true parameter.
Formula for mean: $\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$ (for known $\sigma$)

Example: If the 95% CI for average calories consumed at Thanksgiving is [3000, 5000], we are 95% confident the true mean lies within this interval.
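
A z-based interval for a mean with known population standard deviation can be sketched as follows. The sample mean of 4000 is the midpoint of the interval in the example; the values of `sigma` and `n` are assumptions picked for illustration, so the result is close to, but not exactly, [3000, 5000].

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(x_bar: float, sigma: float, n: int, confidence: float = 0.95):
    """x_bar +/- z * sigma / sqrt(n), for a known population sigma."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin


# Assumed inputs: sigma = 2000 calories and n = 16 are hypothetical.
low, high = z_confidence_interval(x_bar=4000, sigma=2000, n=16)
print(f"95% CI: [{low:.0f}, {high:.0f}]")

# Quadrupling the sample size halves the margin of error.
low_64, high_64 = z_confidence_interval(x_bar=4000, sigma=2000, n=64)
print(f"95% CI with n = 64: [{low_64:.0f}, {high_64:.0f}]")
```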

Power and Type I Error

Power is the probability of correctly rejecting the null hypothesis when it is false. A Type I error is rejecting the null hypothesis when it is actually true; the probability of making one is the significance level $\alpha$.

Increasing the sample size increases power, all else being equal.
The Type I error rate is set by the significance level $\alpha$.

Example: If the power of a test is 0.90 for a sample of 40, increasing the sample to 100 will increase the power.
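
The effect of sample size on power can be illustrated with a one-sided z-test, for which power has the closed form $1 - \Phi(z_{1-\alpha} - d\sqrt{n})$, where $d$ is the true shift in standard-deviation units. The choice of test and the effect size $d = 0.46$ are assumptions, with $d$ picked so that power is close to 0.90 at $n = 40$.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(effect_size: float, n: int, alpha: float = 0.05) -> float:
    """Power of a one-sided z-test of H0: mu = mu0 vs Ha: mu > mu0.

    effect_size is (true mean - mu0) / sigma, assumed positive.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)
    return 1 - std_normal.cdf(z_crit - effect_size * sqrt(n))


EFFECT_SIZE = 0.46  # hypothetical, chosen to give power near 0.90 at n = 40

power_40 = power_one_sided_z(EFFECT_SIZE, n=40)
power_100 = power_one_sided_z(EFFECT_SIZE, n=100)

print(f"power at n = 40:  {power_40:.3f}")
print(f"power at n = 100: {power_100:.3f}")
```

With no true effect (`effect_size = 0`) the same function returns $\alpha$, which matches the statement that the Type I error rate is set by the significance level.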

Interpreting Statistical Results

p-value Interpretation

The p-value quantifies the evidence against the null hypothesis. A small p-value (typically < 0.05) suggests rejecting $H_0$.

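Contextual Example: A p-value of 0.07 means that, if the null hypothesis is true, there is a 7% chance of observing a sample mean at least as extreme as the one observed. At the 0.05 significance level, this is not enough evidence to reject the null hypothesis.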

Confidence Interval Interpretation

Confidence intervals should be interpreted in terms of repeated sampling, not as a probability statement about the parameter.

Example: In a study where 4% of parents said they never spank their children, and the 95% CI is [2.9%, 5.1%], we are 95% confident that the true proportion is between 2.9% and 5.1%.
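
An interval like this one can be produced by the normal-approximation (Wald) formula $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$. The sketch below assumes that method and a sample size of 1200; the notes do not state either, and 1200 is simply a value that gives an interval close to the one quoted.

```python
from math import sqrt
from statistics import NormalDist


def wald_proportion_ci(p_hat: float, n: int, confidence: float = 0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# p_hat = 0.04 is from the example; n = 1200 is an assumed sample size.
low, high = wald_proportion_ci(p_hat=0.04, n=1200)
print(f"95% CI: [{low:.1%}, {high:.1%}]")
```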

Summary Table: Common Distributions and Tests

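Distribution/Test	When to Use	Key Formula
Binomial	Fixed number of independent trials, two outcomes	$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$
Geometric	Number of trials until first success	$P(X = k) = (1-p)^{k-1} p$
Poisson	Number of events in fixed interval, constant rate	$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$
Exponential	Time between events in Poisson process	$f(x) = \lambda e^{-\lambda x}$, $x \ge 0$
One-sample t-test	Compare sample mean to known value	$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
Two-sample t-test	Compare means of two independent groups	$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$ (unpooled form)
Chi-square test	Compare observed and expected frequencies	$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$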

Additional info:

Some questions refer to the interpretation of confidence intervals and p-values, which are essential for understanding statistical inference.
Questions about power and error rates highlight the importance of sample size and significance level in hypothesis testing.
Test selection (t-test, chi-square, etc.) depends on the type of data (categorical vs. quantitative) and study design (independent vs. paired samples).