Core Concepts in Probability, Distributions, Estimation, Hypothesis Testing, and Categorical Data Analysis
Study Guide - Smart Notes
Probability and Probability Distributions
Definition and Properties of Probability
Probability quantifies the likelihood of an event occurring, expressed as a value between 0 and 1. It is foundational to statistical inference and is interpreted as the proportion of times an outcome would occur in a very long sequence of observations.
Probability: The proportion of times an outcome occurs in repeated trials.
Range: Probability values range from 0 (impossible event) to 1 (certain event).
Long-run frequency: Probability is best assessed over many trials; small samples may be misleading.
Subjective probability: When empirical data is unavailable, probability is based on belief and available information.
Bayesian statistics: Treats probability as a degree of belief and uses Bayes' theorem to update that belief as data arrive.
Example: The probability of flipping heads on a fair coin is 0.5.
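The long-run frequency idea can be illustrated with a small simulation. This is a minimal sketch, assuming a fair coin modeled with Python's standard `random` module; the function name `proportion_of_heads` and the seed are illustrative choices, not part of the original notes.

```python
import random

def proportion_of_heads(n_flips, seed=0):
    """Simulate n_flips fair coin flips and return the proportion of heads."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# Small samples can stray far from 0.5; large samples settle close to it.
for n in (10, 100, 10_000, 100_000):
    print(n, proportion_of_heads(n))
```

With only 10 flips the proportion may be noticeably off 0.5, which is why small samples can mislead; with 100,000 flips it should sit very close to 0.5.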
Probability Distributions
A probability distribution lists all possible outcomes of a random variable and their associated probabilities. It is essential for understanding the behavior of random variables.
Random variable: A variable whose outcomes are determined by random variation.
Discrete random variable: Takes distinct values (e.g., 0, 1, 2, ...).
Continuous random variable: Takes values on a continuum (e.g., all real numbers between 0 and 1).
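Probability distribution: Assigns a probability to each possible outcome of a discrete variable, or to intervals of outcomes for a continuous variable. The probabilities sum to 1.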
Example: The probability distribution for rolling a fair die assigns 1/6 probability to each outcome (1–6).
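The fair-die example can be written out directly as a discrete distribution. The sketch below uses exact fractions so the checks are free of rounding; the variable names are illustrative.

```python
from fractions import Fraction

# Probability distribution of a fair six-sided die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# A valid probability distribution must sum to exactly 1.
total_probability = sum(die.values())

# Expected value: each outcome weighted by its probability.
expected_value = sum(face * p for face, p in die.items())

# Probability of an event (here, rolling an even number) is the sum over its outcomes.
p_even = sum(p for face, p in die.items() if face % 2 == 0)

print(total_probability, expected_value, p_even)
```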
Normal Distribution and Z-Scores
Normal Distribution
The normal distribution is a symmetric, bell-shaped curve defined by its mean ($\mu$) and standard deviation ($\sigma$). It is the most important probability distribution for statistical inference.
Symmetry: The distribution is symmetric about the mean.
Parameters: Defined by mean ($\mu$) and standard deviation ($\sigma$).
Empirical Rule: For normal distributions, approximately:
68% of the data fall within 1 standard deviation of the mean
95% fall within 2 standard deviations of the mean
99.7% fall within 3 standard deviations of the mean
Formula: The probability density function is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Example: Heights of adult humans are approximately normally distributed.
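The Empirical Rule percentages can be checked numerically. This sketch uses `statistics.NormalDist` from the Python standard library; the helper name `prob_within` is an illustrative choice.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def prob_within(k):
    """Probability that a normal value lies within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The three probabilities come out at roughly 0.683, 0.954, and 0.997, which is where the rounded 68%, 95%, and 99.7% figures come from. Because the rule is stated in units of standard deviations, the same values hold for any mean and standard deviation.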
Z-Score
The z-score measures how many standard deviations a value is from the mean. It is used to standardize values and compare across distributions.
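Definition: $z = \dfrac{x - \mu}{\sigma}$, where $x$ is the observed value, $\mu$ is the mean, and $\sigma$ is the standard deviation.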
Interpretation: Positive z-scores are above the mean; negative are below.
Example: If a test score is 85, the mean is 80, and the standard deviation is 5, then $z = (85 - 80)/5 = 1$, so the score is one standard deviation above the mean.
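The test-score example (score 85, mean 80, standard deviation 5) can be reproduced with a one-line function. This is a minimal sketch; the function name `z_score` is illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd

print(z_score(85, 80, 5))  # score above the mean gives a positive z
print(z_score(75, 80, 5))  # score below the mean gives a negative z
```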
Sampling Distributions and Estimation
Sampling Distributions
Sampling distributions describe the distribution of a statistic (e.g., sample mean) over repeated samples from the population. They are crucial for making inferences about population parameters.
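Central Limit Theorem: For large samples, the sampling distribution of the sample mean is approximately normal, regardless of the shape of the population distribution (provided the population has finite variance).
Standard error: The standard deviation of the sampling distribution of a statistic. For the sample mean it is $\sigma/\sqrt{n}$, so it shrinks as the sample size grows.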
Example: The sampling distribution of the sample mean for n=100 will be bell-shaped if the population is normal or n is large.
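A simulation makes the sampling distribution concrete. The sketch below assumes a strongly skewed population (exponential with mean 1 and standard deviation 1, chosen only for illustration) and draws many samples of size n = 100; the seed and the number of repeated samples are arbitrary.

```python
import math
import random
import statistics

rng = random.Random(42)   # seeded for reproducibility
n = 100                   # size of each sample
n_samples = 2000          # number of repeated samples

# Each entry is the mean of one sample of size n from an exponential population.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(n_samples)
]

observed_center = statistics.fmean(sample_means)
observed_se = statistics.stdev(sample_means)
theoretical_se = 1.0 / math.sqrt(n)  # sigma / sqrt(n), with sigma = 1

print(observed_center, observed_se, theoretical_se)
```

Even though the population is far from normal, the sample means cluster around the population mean of 1, and their spread should be close to the theoretical standard error of 0.1.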
Point and Interval Estimation
Estimation methods are used to infer population parameters from sample data. Two main types are point estimates and interval estimates (confidence intervals).
Point estimate: A single value used as the best guess for a parameter.
Interval estimate (confidence interval): A range around the point estimate that is believed to contain the parameter with a specified level of confidence.
Margin of error: The amount added/subtracted from the point estimate to form the confidence interval.
Estimator: A statistic used to estimate a parameter (e.g., sample mean for population mean).
Estimate: The value of the estimator for a particular sample.
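Confidence level: The long-run proportion of intervals, built by the same method over repeated samples, that contain the parameter (e.g., 0.95 or 0.99). It describes the method, not the chance that one particular computed interval is correct.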
Example: A 95% confidence interval for the proportion of students who binge drink is 0.73 ± 0.02.
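The margin of error in an interval like 0.73 ± 0.02 comes from the standard error of the sample proportion. The sketch below uses the usual large-sample formula; the sample size n = 1900 is a hypothetical value chosen so that the margin comes out near 0.02, since the notes do not state the actual sample size.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(p_hat, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion."""
    # z multiplier for the chosen confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin, margin

# n = 1900 is an assumed sample size, not a figure from the notes.
low, high, margin = proportion_confidence_interval(0.73, 1900)
print(round(low, 3), round(high, 3), round(margin, 3))
```

Raising the confidence level to 0.99 widens the interval, and a larger sample narrows it.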
Methods of Estimation
Maximum Likelihood: Estimates parameters by maximizing the likelihood function.
Bootstrap: Uses resampling with replacement to estimate the distribution of a statistic.
Example: For data from a normal distribution, the sample mean is the maximum likelihood estimator of the population mean.
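The bootstrap idea can be sketched in a few lines: resample the data with replacement many times, recompute the statistic each time, and use the spread of those values as an estimate of the standard error. The data values, the function name `bootstrap_standard_error`, and the number of resamples below are all hypothetical choices for illustration.

```python
import random
import statistics

def bootstrap_standard_error(data, statistic=statistics.fmean,
                             n_resamples=2000, seed=0):
    """Estimate the standard error of a statistic by resampling with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=n)  # same size as the data, with replacement
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)

# Hypothetical measurements, for illustration only.
data = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9, 6.2, 3.9]

boot_se = bootstrap_standard_error(data)
formula_se = statistics.stdev(data) / len(data) ** 0.5  # s / sqrt(n), for comparison

print(boot_se, formula_se)
```

For the sample mean the bootstrap estimate should land close to the formula-based value. The approach is most useful for statistics such as the median, where no simple formula for the standard error is available.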
Hypothesis Testing
Structure of a Statistical Test
Hypothesis testing is a formal procedure for evaluating claims about population parameters using sample data.
Assumptions: Conditions required for the test (e.g., random sampling, normality).
Hypotheses: Null hypothesis ($H_0$) and alternative hypothesis ($H_a$).
Test statistic: A function of sample data used to assess the plausibility of $H_0$.
P-value: The probability of observing a test statistic at least as extreme as the one observed, assuming $H_0$ is true.
Conclusion: Decision to reject or fail to reject $H_0$, based on the p-value and the significance level ($\alpha$).
Example: Testing whether the mean salary of graduates is $50,000 using a sample mean and standard deviation.
Hypotheses
Null hypothesis ($H_0$): The population parameter equals a specific value (e.g., no effect).
Alternative hypothesis ($H_a$): The parameter differs from the null value (e.g., some effect).
Example: $H_0: \mu = 50{,}000$, $H_a: \mu \neq 50{,}000$ (mean graduate salary in dollars).
P-value and Significance
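P-value: $P(\text{test statistic at least as extreme as the observed value} \mid H_0 \text{ true})$. Smaller p-values indicate stronger evidence against $H_0$.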
Significance level ($\alpha$): The threshold for rejecting $H_0$ (commonly 0.05).
Example: If the p-value is less than $\alpha = 0.05$, reject $H_0$.
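The steps of a test (hypotheses, test statistic, p-value, conclusion) can be traced for the salary example. This is a sketch only: the sample mean of 51,200, standard deviation of 6,000, and n = 100 are hypothetical summary statistics, and the normal approximation is assumed because the sample is large (a t distribution would be the more careful choice for small samples).

```python
from math import sqrt
from statistics import NormalDist

def two_sided_z_test(sample_mean, sample_sd, n, null_mean, alpha=0.05):
    """Large-sample two-sided test of H0: population mean equals null_mean."""
    standard_error = sample_sd / sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    # Two-sided p-value: probability of a z at least this far from 0 in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    reject_null = p_value < alpha
    return z, p_value, reject_null

# Hypothetical summary statistics, for illustration only.
z, p_value, reject_null = two_sided_z_test(
    sample_mean=51_200, sample_sd=6_000, n=100, null_mean=50_000
)
print(z, round(p_value, 4), reject_null)
```

With these assumed numbers the test statistic is 2 standard errors above the null value and the p-value is a little under 0.05, so the null hypothesis would be rejected at the 0.05 level but not at the 0.01 level.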
Categorical Data Analysis and Chi-Square Tests
Contingency Tables
Contingency tables summarize the relationship between two categorical variables, displaying frequencies for each combination of categories.
Purpose: To analyze associations between categorical variables.
Statistical independence: Variables are independent if conditional distributions are identical across categories.
Statistical dependence: Conditional distributions differ across categories.
Example: A table showing party ID (Democrat, Republican, Independent) by gender (Male, Female).
Chi-Square Test of Independence
The chi-square test assesses whether two categorical variables are independent.
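Test statistic: Measures the difference between observed and expected frequencies, $\chi^2 = \sum \dfrac{(f_o - f_e)^2}{f_e}$, where $f_o$ is an observed cell count and $f_e$ is the count expected under independence. Larger values indicate stronger evidence of association.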
Significance test: Determines if the association is statistically significant.
Residual analysis: Describes the nature of the association after the test.
Example: Testing if party ID and gender are independent using a contingency table.
Types of Categorical Variables
Nominal variables: Categories without order (e.g., preferred candidate).
Ordinal variables: Categories with a natural order (e.g., opinion levels).
Categorical scales for continuous variables: Continuous variables grouped into categories (e.g., income brackets).
Example Contingency Table
| Party ID | Male | Female |
|---|---|---|
| Democrat | 120 | 130 |
| Republican | 100 | 110 |
| Independent | 80 | 90 |
Additional info: Table values are inferred for illustration.
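The chi-square test of independence can be worked through on the illustrative table above. The sketch computes each expected count as (row total × column total) / grand total and sums (observed − expected)² / expected over all cells. The function name is illustrative, and the closed-form p-value used here is valid only because this table has 2 degrees of freedom; other table sizes need a chi-square table or a statistics library.

```python
from math import exp

# Observed counts from the illustrative table: columns are (Male, Female).
observed = [
    [120, 130],  # Democrat
    [100, 110],  # Republican
    [80, 90],    # Independent
]

def chi_square_independence(table):
    """Return the chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f_obs in enumerate(row):
            # Expected count if the two variables were independent.
            f_exp = row_totals[i] * col_totals[j] / grand_total
            chi2 += (f_obs - f_exp) ** 2 / f_exp
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

chi2, df = chi_square_independence(observed)

# For df == 2 only, the right-tail chi-square probability is exp(-chi2 / 2).
p_value = exp(-chi2 / 2) if df == 2 else None

print(round(chi2, 4), df, p_value)
```

For these counts the statistic is very small and the p-value is close to 1, so the data give no evidence against independence of party ID and gender.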
Summary Table: Key Concepts and Definitions
| Concept | Definition | Example |
|---|---|---|
| Probability | Proportion of times an event occurs in repeated trials | Probability of heads in coin toss = 0.5 |
| Random Variable | Variable with outcomes determined by random variation | Number of heads in 10 coin tosses |
| Normal Distribution | Symmetric, bell-shaped distribution defined by mean and standard deviation | Heights of adults |
| Z-score | Number of standard deviations from the mean | $z = (85 - 80)/5 = 1$ |
| Point Estimate | Single value as best guess for parameter | Sample mean |
| Confidence Interval | Range around point estimate likely to contain parameter | 0.73 ± 0.02 |
| Hypothesis Test | Procedure to evaluate claims about parameters | Test if mean salary = $50,000 |
| Chi-Square Test | Test for independence between categorical variables | Party ID vs. Gender |