Core Concepts in Probability, Distributions, Estimation, Hypothesis Testing, and Categorical Data Analysis

Study Guide - Smart Notes


Probability and Probability Distributions

Definition and Properties of Probability

Probability quantifies the likelihood of an event occurring, expressed as a value between 0 and 1. It is foundational to statistical inference and is interpreted as the proportion of times an outcome would occur in a very long sequence of observations.

Probability: The proportion of times an outcome occurs in repeated trials.
Range: Probability values range from 0 (impossible event) to 1 (certain event).
Long-run frequency: Probability is best assessed over many trials; small samples may be misleading.
Subjective probability: When empirical data is unavailable, probability is based on belief and available information.
Bayesian statistics: Treats probability as a degree of belief and updates that belief with data using Bayes' theorem.

Example: The probability of flipping heads on a fair coin is 0.5.
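The long-run frequency idea can be illustrated with a small simulation. This is a minimal sketch, not part of the original notes: the function name heads_proportion, the seed, and the numbers of flips are all arbitrary choices made for illustration.

```python
import random

def heads_proportion(n_flips, seed=0):
    """Simulate n_flips fair coin flips and return the proportion of heads."""
    rng = random.Random(seed)  # fixed seed (arbitrary) so the run is reproducible
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# Small samples can stray far from 0.5; large samples settle close to it.
for n in (10, 100, 10_000, 100_000):
    print(n, heads_proportion(n))
```

The proportions for small n can be noticeably off, which is the point of the "small samples may be misleading" note above.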

Probability Distributions

A probability distribution lists all possible outcomes of a random variable and their associated probabilities. It is essential for understanding the behavior of random variables.

Random variable: A variable whose outcomes are determined by random variation.
Discrete random variable: Takes distinct values (e.g., 0, 1, 2, ...).
Continuous random variable: Takes values on a continuum (e.g., all real numbers between 0 and 1).
Probability distribution: Assigns a probability to each possible outcome (discrete case) or to each interval of outcomes (continuous case); the probabilities total 1.

Example: The probability distribution for rolling a fair die assigns 1/6 probability to each outcome (1–6).
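The fair-die example can be written out as a small table of outcomes and probabilities. The sketch below is illustrative only; the variable names are assumptions, and exact fractions are used so that the checks are not affected by rounding.

```python
from fractions import Fraction

# Discrete probability distribution for one roll of a fair six-sided die.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# The probabilities of all possible outcomes must sum to 1.
total = sum(die.values())

# Mean (expected value) of the distribution: sum of outcome * probability.
mean = sum(face * p for face, p in die.items())

# Probability of an event is the sum over the outcomes it contains.
p_even = sum(p for face, p in die.items() if face % 2 == 0)

print(total, mean, p_even)
```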

Normal Distribution and Z-Scores

Normal Distribution

The normal distribution is a symmetric, bell-shaped curve defined by its mean ($\mu$) and standard deviation ($\sigma$). It is the most important probability distribution for statistical inference.

Symmetry: The distribution is symmetric about the mean.
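Parameters: Defined by mean ($\mu$) and standard deviation ($\sigma$).
Empirical Rule: For normal distributions:
- About 68% of observations fall within 1 standard deviation of the mean
- About 95% fall within 2 standard deviations
- About 99.7% fall within 3 standard deviations
Formula: The probability density function is: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$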

Example: Heights of adult humans are approximately normally distributed.
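The empirical rule can be checked numerically from the normal distribution itself. The sketch below uses Python's standard statistics.NormalDist; the helper names normal_pdf and prob_within are assumptions introduced here, not names from the notes.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Normal density written out directly from the standard formula."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def prob_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The result does not depend on the particular mean and standard deviation, which is why the rule applies to every normal distribution.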

Z-Score

The z-score measures how many standard deviations a value is from the mean. It is used to standardize values and compare across distributions.

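Definition: $z = \frac{x - \mu}{\sigma}$, where $x$ is the observed value, $\mu$ the mean, and $\sigma$ the standard deviation.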
Interpretation: Positive z-scores are above the mean; negative are below.

Example: If a test score is 85, the mean is 80, and the standard deviation is 5, then $z = (85 - 80)/5 = 1$, so the score is one standard deviation above the mean.
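The calculation is short enough to express as a function. This is a minimal sketch; the function name z_score and the second exam (score 70, mean 60, standard deviation 8) are hypothetical and only serve to show a comparison across distributions.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd

# The example from the notes: score 85, mean 80, standard deviation 5.
z_first = z_score(85, 80, 5)

# Hypothetical second exam on a different scale, for comparison.
z_second = z_score(70, 60, 8)

print(z_first, z_second)
```

Although 85 is the higher raw score, the hypothetical second score has the larger z-score, so it is the better result relative to its own distribution.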

Sampling Distributions and Estimation

Sampling Distributions

Sampling distributions describe the distribution of a statistic (e.g., sample mean) over repeated samples from the population. They are crucial for making inferences about population parameters.

Central Limit Theorem: For large samples, the sampling distribution of the mean is approximately normal, regardless of the population distribution.
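Standard error: The standard deviation of the sampling distribution of a statistic. For the sample mean it equals $\sigma/\sqrt{n}$, so it shrinks as the sample size grows.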

Example: For samples of size n = 100, the sampling distribution of the sample mean is approximately bell-shaped even when the population itself is not normal.
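The Central Limit Theorem and the standard error can both be seen in a simulation. The sketch below is an illustration under stated assumptions: the population is taken to be exponential with mean 1 and standard deviation 1 (chosen only because it is strongly skewed), and the function name, seed, and number of repeated samples are arbitrary.

```python
import random
import statistics

def simulate_sample_means(n, n_samples=2000, seed=1):
    """Draw n_samples samples of size n from a skewed population; return their means."""
    rng = random.Random(seed)  # arbitrary fixed seed for reproducibility
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(n_samples)
    ]

means = simulate_sample_means(n=100)

# Center of the sampling distribution: close to the population mean (1.0).
center = statistics.fmean(means)

# Spread of the sampling distribution: close to sigma / sqrt(n) = 1 / 10.
standard_error = statistics.stdev(means)

print(round(center, 3), round(standard_error, 3))
```

A histogram of means would look roughly bell-shaped even though individual exponential observations are far from normal.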

Point and Interval Estimation

Estimation methods are used to infer population parameters from sample data. Two main types are point estimates and interval estimates (confidence intervals).

Point estimate: A single value used as the best guess for a parameter.
Interval estimate (confidence interval): A range of values around the point estimate within which the parameter is believed to fall, stated with a specified level of confidence.
Margin of error: The amount added/subtracted from the point estimate to form the confidence interval.
Estimator: A statistic used to estimate a parameter (e.g., sample mean for population mean).
Estimate: The value of the estimator for a particular sample.
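Confidence level: The probability that the method produces an interval containing the parameter (e.g., 0.95 or 0.99). It describes the long-run performance of the procedure, not any single computed interval.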

Example: A 95% confidence interval for the proportion of students who binge drink is 0.73 ± 0.02.
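A large-sample confidence interval for a proportion is the point estimate plus or minus a z-score times the standard error. The sketch below shows that arithmetic. The notes do not give the sample size behind 0.73 ± 0.02, so n = 1900 is a hypothetical value chosen only because it yields a margin of about 0.02; the function name proportion_ci is also an assumption.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(p_hat, n, confidence=0.95):
    """Large-sample (normal approximation) confidence interval for a proportion."""
    # z-score that leaves (1 - confidence) / 2 in each tail, about 1.96 for 95%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin, margin

# Hypothetical sample size; only the estimate 0.73 comes from the notes.
lower, upper, margin = proportion_ci(0.73, n=1900)
print(round(lower, 3), round(upper, 3), round(margin, 3))
```

Raising the confidence level to 0.99 widens the interval, and increasing n narrows it.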

Methods of Estimation

Maximum Likelihood: Estimates parameters by maximizing the likelihood function.
Bootstrap: Uses resampling with replacement to estimate the distribution of a statistic.

Example: When the data come from a normal distribution, the sample mean is the maximum likelihood estimator of the population mean.
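The bootstrap can be sketched in a few lines: resample the data with replacement many times, recompute the statistic each time, and use the spread of those values as an estimate of its standard error. The data values, function name, seed, and number of resamples below are all hypothetical, chosen only for illustration.

```python
import random
import statistics

def bootstrap_se(data, stat=statistics.fmean, n_resamples=2000, seed=2):
    """Estimate the standard error of a statistic by resampling with replacement."""
    rng = random.Random(seed)  # arbitrary fixed seed for reproducibility
    n = len(data)
    estimates = [stat(rng.choices(data, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(estimates)

# Hypothetical sample of ten observations.
data = [12, 15, 9, 20, 14, 11, 17, 13, 16, 10]

se_bootstrap = bootstrap_se(data)
se_formula = statistics.stdev(data) / len(data) ** 0.5  # usual s / sqrt(n)

print(round(se_bootstrap, 2), round(se_formula, 2))
```

For the mean, the bootstrap value should land near the formula-based standard error. The appeal of the bootstrap is that the same code works for statistics such as the median, where no simple formula exists (pass stat=statistics.median).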

Hypothesis Testing

Structure of a Statistical Test

Hypothesis testing is a formal procedure for evaluating claims about population parameters using sample data.

Assumptions: Conditions required for the test (e.g., random sampling, normality).
Hypotheses: Null hypothesis ($H_0$) and alternative hypothesis ($H_a$).
Test statistic: A function of sample data used to assess the plausibility of $H_0$.
P-value: The probability, assuming $H_0$ is true, of observing a test statistic at least as extreme as the one actually observed.
Conclusion: Decision to reject or fail to reject $H_0$ based on the p-value and significance level ($\alpha$).

Example: Testing whether the mean salary of graduates is $50,000 using a sample mean and standard deviation.

Hypotheses

Null hypothesis ($H_0$): The population parameter equals a specific value (e.g., no effect).
Alternative hypothesis ($H_a$): The parameter differs from the null value (e.g., some effect).

Example: For the mean salary test, $H_0: \mu = 50{,}000$ and $H_a: \mu \neq 50{,}000$.

P-value and Significance

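P-value: $P(\text{test statistic at least as extreme as observed} \mid H_0 \text{ true})$. The smaller the p-value, the stronger the evidence against $H_0$.
Significance level ($\alpha$): The threshold for rejecting $H_0$ (commonly 0.05).

Example: If p-value < 0.05, reject $H_0$ at the $\alpha = 0.05$ level.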
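The steps of a test (hypotheses, test statistic, p-value, conclusion) can be traced for the mean salary example. The notes give no sample data, so the summary values below (n = 100, sample mean 52,000, sample standard deviation 8,000) are hypothetical. The sketch also assumes a large-sample z approximation; with small samples a t distribution would normally be used instead.

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(sample_mean, sample_sd, n, mu0):
    """Two-sided large-sample z test of H0: mu = mu0 against Ha: mu != mu0."""
    standard_error = sample_sd / sqrt(n)
    z = (sample_mean - mu0) / standard_error
    # Two-sided p-value: probability of a z at least this far from 0 in either tail.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical summary statistics; only the null value 50,000 comes from the notes.
z, p_value = z_test_mean(sample_mean=52_000, sample_sd=8_000, n=100, mu0=50_000)

alpha = 0.05
reject_h0 = p_value < alpha

print(round(z, 2), round(p_value, 4), reject_h0)
```

With these made-up numbers the sample mean lies 2.5 standard errors above 50,000, the p-value is below 0.05, and $H_0$ would be rejected at that level.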

Categorical Data Analysis and Chi-Square Tests

Contingency Tables

Contingency tables summarize the relationship between two categorical variables, displaying frequencies for each combination of categories.

Purpose: To analyze associations between categorical variables.
Statistical independence: Variables are independent if conditional distributions are identical across categories.
Statistical dependence: Conditional distributions differ across categories.

Example: A table showing party ID (Democrat, Republican, Independent) by gender (Male, Female).
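Independence is judged by comparing conditional distributions, which can be computed directly from the cell counts. The sketch below uses the illustrative party ID by gender counts that appear in the example contingency table in these notes; the function name conditional_distribution is an assumption.

```python
# Illustrative counts (party ID by gender) from the notes' example table.
counts = {
    "Democrat": {"Male": 120, "Female": 130},
    "Republican": {"Male": 100, "Female": 110},
    "Independent": {"Male": 80, "Female": 90},
}

def conditional_distribution(counts, column):
    """Distribution of the row variable among cases in one column category."""
    column_total = sum(row[column] for row in counts.values())
    return {label: row[column] / column_total for label, row in counts.items()}

male = conditional_distribution(counts, "Male")
female = conditional_distribution(counts, "Female")

for party in counts:
    print(party, round(male[party], 3), round(female[party], 3))
```

The two conditional distributions are nearly identical here, which is what independence between party ID and gender would look like in a sample.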

Chi-Square Test of Independence

The chi-square test assesses whether two categorical variables are independent.

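Test statistic: $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$, which measures the difference between the observed frequencies ($f_o$) and the frequencies expected under independence ($f_e$).
Significance test: Determines if the association is statistically significant.
Residual analysis: Compares observed and expected frequencies cell by cell after the test to describe the nature of the association.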

Example: Testing if party ID and gender are independent using a contingency table.

Types of Categorical Variables

Nominal variables: Categories without order (e.g., preferred candidate).
Ordinal variables: Categories with a natural order (e.g., opinion levels).
Categorical scales for continuous variables: Continuous variables grouped into categories (e.g., income brackets).

Example Contingency Table

Party ID	Male	Female
Democrat	120	130
Republican	100	110
Independent	80	90

Note: The table values are illustrative, not observed data.
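The chi-square test of independence can be worked through on the table above. The sketch below computes expected frequencies as (row total × column total) / grand total, sums (observed − expected)² / expected over the cells, and uses (rows − 1) × (columns − 1) degrees of freedom. To stay within the standard library, the p-value uses the closed form exp(−x/2), which is valid only for 2 degrees of freedom; a general table would need a statistics library. The function name is an assumption.

```python
from math import exp

def chi_square_statistic(observed):
    """Chi-square statistic and degrees of freedom for a two-way table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            # Expected count if the two variables were independent.
            f_e = row_totals[i] * col_totals[j] / grand_total
            statistic += (f_o - f_e) ** 2 / f_e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df

# Rows: Democrat, Republican, Independent; columns: Male, Female.
observed = [[120, 130], [100, 110], [80, 90]]
chi2, df = chi_square_statistic(observed)

# Right-tail probability; this closed form holds only when df == 2.
p_value = exp(-chi2 / 2) if df == 2 else None

print(round(chi2, 4), df, p_value)
```

For these illustrative counts the statistic is very small and the p-value is close to 1, so the test gives no evidence against independence of party ID and gender.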

Summary Table: Key Concepts and Definitions

Concept	Definition	Example
Probability	Proportion of times an event occurs in repeated trials	Probability of heads in coin toss = 0.5
Random Variable	Variable with outcomes determined by random variation	Number of heads in 10 coin tosses
Normal Distribution	Symmetric, bell-shaped distribution defined by mean and standard deviation	Heights of adults
Z-score	Number of standard deviations from the mean	Score of 85 with mean 80 and SD 5 gives z = 1
Point Estimate	Single value as best guess for parameter	Sample mean
Confidence Interval	Range around point estimate likely to contain parameter	0.73 ± 0.02
Hypothesis Test	Procedure to evaluate claims about parameters	Test if mean salary = $50,000
Chi-Square Test	Test for independence between categorical variables	Party ID vs. Gender