Applied Statistics Exam 2 Study Guide
Data Visualization
Choosing and Interpreting Plots
Data visualization is essential for summarizing and understanding the distribution and characteristics of data. Different plots are appropriate for different types of data and analytical goals.
Bar Plots: Best for displaying categorical data. Each bar represents a category, and the height shows the frequency or proportion.
Histograms: Used for numerical (quantitative) data to show the distribution of a variable. Data is grouped into bins, and the height of each bar shows the number of observations in each bin.
Boxplots: Summarize the distribution of a dataset using the median, quartiles, and potential outliers. Useful for comparing distributions across groups.
Identifying Outliers: Boxplots and histograms can help detect outliers—values that fall far from the rest of the data.
Assessing Skewness: Visualizations can reveal if a distribution is symmetric, negatively skewed (left-skewed), or positively skewed (right-skewed).
Data Transformations: Statisticians may transform data (e.g., log, square root) to meet assumptions of statistical tests or to make patterns more apparent.
Example: A histogram of household incomes may show positive skewness: a long right tail of high earners pulls the mean above the median, so most households earn less than the mean.
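The income example can be checked numerically. The sketch below is only an illustration: the income values are made up, and the 1.5 × IQR fence used to flag outliers is a common boxplot convention that these notes do not state, so treat it as an assumption.

```python
import statistics

# Hypothetical household incomes in thousands of dollars (illustrative values only).
incomes = [28, 31, 35, 38, 40, 42, 45, 48, 52, 60, 250]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# statistics.quantiles with n=4 returns the three quartile cut points [Q1, Q2, Q3].
q1, q2, q3 = statistics.quantiles(incomes, n=4)
iqr = q3 - q1

# Assumed convention: flag values more than 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in incomes if x < lower_fence or x > upper_fence]

# A mean above the median is a rough numerical sign of positive (right) skew.
right_skewed = mean_income > median_income
share_below_mean = sum(x < mean_income for x in incomes) / len(incomes)

print(f"mean={mean_income:.1f}, median={median_income}, outliers={outliers}")
print(f"share of households below the mean: {share_below_mean:.2f}")
```

A single very large value is enough to pull the mean well above the median here, which is the pattern a right-skewed histogram shows visually.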
Point Estimation
Estimators, Bias, and Variance
Point estimation involves using sample data to estimate population parameters.
Population Parameter: A fixed, usually unknown value describing a characteristic of a population (e.g., mean μ).
Point Estimator: A statistic calculated from sample data used to estimate a population parameter (e.g., sample mean \( \bar{x} \)).
Bias: The difference between the expected value of the estimator and the true parameter value. An unbiased estimator has zero bias.
Variance: Measures the spread of the estimator's sampling distribution. Lower variance is preferred for precision.
Example: The sample mean \( \bar{x} \) is an unbiased estimator of the population mean μ.
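Bias and variance can be computed exactly for a tiny population by listing every possible sample. The sketch below assumes a made-up population of four values and samples of size 2 drawn with replacement; the function names are hypothetical helpers, not part of any library.

```python
from itertools import product

population = [2, 4, 6, 8]  # hypothetical tiny population
n = 2                      # sample size

mu = sum(population) / len(population)
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

def sample_mean(sample):
    return sum(sample) / len(sample)

def sample_variance(sample, divisor):
    m = sample_mean(sample)
    return sum((x - m) ** 2 for x in sample) / divisor

# Every equally likely ordered sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

# Expected value of each estimator = average over all possible samples.
expected_mean = sum(sample_mean(s) for s in samples) / len(samples)
expected_var_unbiased = sum(sample_variance(s, n - 1) for s in samples) / len(samples)
expected_var_biased = sum(sample_variance(s, n) for s in samples) / len(samples)

# Bias = expected value of the estimator minus the true parameter.
bias_of_mean = expected_mean - mu
bias_of_biased_var = expected_var_biased - sigma2

# Variance of the sample mean across all possible samples.
var_of_sample_mean = sum((sample_mean(s) - expected_mean) ** 2 for s in samples) / len(samples)

print(bias_of_mean, bias_of_biased_var, var_of_sample_mean)
```

In this setup the sample mean has zero bias, dividing by n instead of n − 1 makes the variance estimator too small on average, and the variance of the sample mean equals σ²/n.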
Sampling Distributions and the Central Limit Theorem (CLT)
Sampling Distribution: The probability distribution of a statistic (e.g., sample mean) over all possible samples from the population.
Central Limit Theorem: For sufficiently large samples of independent observations from a population with finite variance, the sampling distribution of the sample mean is approximately normal, regardless of the shape of the population's distribution.
Significance: The CLT justifies using normal-based inference for means when sample sizes are large.
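A quick simulation can make the CLT concrete. This sketch assumes a strongly right-skewed population (an exponential distribution with mean 1 and standard deviation 1), a sample size of 50, and 5000 repetitions; all of these choices are illustrative. If the sample mean is roughly normal, about 95% of sample means should fall within 1.96 standard errors of μ.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

mu, sigma = 1.0, 1.0  # Exponential(rate=1) has mean 1 and standard deviation 1
n = 50                # size of each sample (assumed "large enough")
reps = 5000           # number of simulated samples

# Approximate the sampling distribution of the sample mean.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

standard_error = sigma / math.sqrt(n)
center = statistics.fmean(sample_means)

# Share of sample means within 1.96 standard errors of the population mean.
within = sum(abs(m - mu) <= 1.96 * standard_error for m in sample_means)
coverage = within / reps

print(f"average of sample means: {center:.3f}")
print(f"share within 1.96 SE of mu: {coverage:.3f}")
```

Rerunning with a much smaller n (say 3) should show the normal approximation working less well for this skewed population.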
Confidence Intervals
t Distribution vs. Normal Distribution
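t Distribution: Used when estimating the mean and the population standard deviation is unknown, so it must be estimated by the sample standard deviation s. It has heavier tails than the standard normal distribution, especially for small sample sizes, and approaches the normal as the degrees of freedom (n − 1) increase.
Standard Normal (Z) Distribution: Used when the population standard deviation is known. For large samples it also closely approximates the t distribution.
When to Use: Use the t distribution whenever the population standard deviation is unknown. The choice matters most for small samples (typically n < 30), where t and Z critical values differ noticeably.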
Calculating Confidence Intervals for the Mean
Formula: \( \bar{x} \pm t^* \cdot \frac{s}{\sqrt{n}} \), where
\( \bar{x} \): sample mean
\( s \): sample standard deviation
\( n \): sample size
\( t^* \): critical value from the t distribution for the desired confidence level
Interpretation: A 95% confidence interval means that, in repeated sampling, 95% of such intervals would contain the true population mean.
Width and Confidence Level: Higher confidence levels produce wider intervals; larger samples produce narrower intervals.
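As a worked sketch of the interval x̄ ± t* · s/√n, the code below uses ten made-up measurements. The critical values 2.262 (95%) and 3.250 (99%) are the standard t-table entries for 9 degrees of freedom and are hard-coded here rather than computed; on the exam they would come from the formula sheet or calculator.

```python
import math
import statistics

# Hypothetical sample of n = 10 measurements (illustrative values only).
data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.5, 12.1]

n = len(data)
x_bar = statistics.mean(data)
s = statistics.stdev(data)  # sample standard deviation (divides by n - 1)

# Two-sided critical values from a t table with df = n - 1 = 9.
T_STAR_95 = 2.262
T_STAR_99 = 3.250

def t_interval(x_bar, s, n, t_star):
    """Return (lower, upper) for x_bar +/- t_star * s / sqrt(n)."""
    margin = t_star * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

ci95 = t_interval(x_bar, s, n, T_STAR_95)
ci99 = t_interval(x_bar, s, n, T_STAR_99)

print(f"95% CI: ({ci95[0]:.3f}, {ci95[1]:.3f})")
print(f"99% CI: ({ci99[0]:.3f}, {ci99[1]:.3f})")
```

The 99% interval comes out wider than the 95% interval for the same data, matching the width rule stated above.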
Hypothesis Testing
General Steps and Types of Errors
Steps:
State null (H₀) and alternative (H₁) hypotheses.
Choose significance level (α).
Calculate test statistic.
Determine p-value or critical value.
Make a decision: reject or fail to reject H₀.
Type I Error: Rejecting H₀ when it is true. Its probability is the significance level α.
Type II Error: Failing to reject H₀ when H₁ is true. Its probability is denoted β.
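The link between α and Type I errors can be illustrated by simulation. The sketch below walks the five steps repeatedly on data generated with H₀ true, so every rejection is a Type I error. To stay within the standard library it assumes a two-sided z-test with a known population standard deviation; the parameter values are hypothetical.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Step 1: H0: mu = 100 versus H1: mu != 100 (data are generated with H0 true).
mu0, sigma, n = 100.0, 15.0, 25

# Step 2: choose the significance level.
alpha = 0.05
z_crit = statistics.NormalDist().inv_cdf(1 - alpha / 2)

reps = 5000
rejections = 0
for _ in range(reps):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    # Step 3: test statistic (z, because sigma is assumed known here).
    z = (statistics.fmean(sample) - mu0) / (sigma / math.sqrt(n))
    # Steps 4 and 5: compare with the critical value and decide.
    if abs(z) > z_crit:
        rejections += 1  # H0 is true, so this rejection is a Type I error

type1_rate = rejections / reps
print(f"observed Type I error rate: {type1_rate:.3f} (alpha = {alpha})")
```

The observed rejection rate should land close to α, which is what it means for α to be the probability of a Type I error.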
Test Directionality
Upper-Tailed Test: H₁: parameter > value
Lower-Tailed Test: H₁: parameter < value
Two-Tailed Test: H₁: parameter ≠ value
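The direction of H₁ determines which tail area counts as the p-value. The helper below is a minimal sketch for a z statistic using the standard normal distribution; the function name and the tail labels are hypothetical, and a t statistic would use the t distribution's tail areas instead.

```python
from statistics import NormalDist

def p_value(z, tail):
    """p-value for a z statistic; tail is 'upper', 'lower', or 'two-sided'."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "upper":        # H1: parameter > value
        return 1 - std_normal.cdf(z)
    if tail == "lower":        # H1: parameter < value
        return std_normal.cdf(z)
    if tail == "two-sided":    # H1: parameter != value
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("tail must be 'upper', 'lower', or 'two-sided'")

for tail in ("upper", "lower", "two-sided"):
    print(tail, round(p_value(1.96, tail), 4))
```

For the same test statistic, the two-sided p-value is twice the smaller one-sided p-value, so a result can be significant in a one-tailed test but not in the two-tailed version.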
One-Sample t-Test
When to Use and Relationship to Confidence Intervals
Use: To test whether a population mean differs from a known or hypothesized value, using a single sample, when the population standard deviation is unknown.
Relationship: A two-sided hypothesis test at significance level α corresponds to a 100(1−α)% confidence interval; if the hypothesized value is outside the interval, reject H₀.
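The test and interval relationship can be checked on a small example. This sketch assumes ten made-up measurements, the statistic t = (x̄ − μ₀)/(s/√n), and the t-table critical value 2.262 for 9 degrees of freedom at α = 0.05 (two-sided); the two hypothesized values are arbitrary.

```python
import math
import statistics

# Hypothetical sample of n = 10 measurements (illustrative values only).
data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.5, 12.1]
T_STAR = 2.262  # t table: df = 9, two-sided alpha = 0.05

def one_sample_t(data, mu0):
    """t statistic for H0: mu = mu0."""
    se = statistics.stdev(data) / math.sqrt(len(data))
    return (statistics.mean(data) - mu0) / se

def confidence_interval(data, t_star):
    margin = t_star * statistics.stdev(data) / math.sqrt(len(data))
    return statistics.mean(data) - margin, statistics.mean(data) + margin

ci_low, ci_high = confidence_interval(data, T_STAR)

results = {}
for mu0 in (12.0, 11.8):
    t_stat = one_sample_t(data, mu0)
    reject = abs(t_stat) > T_STAR                 # two-sided decision
    outside_ci = not (ci_low <= mu0 <= ci_high)   # is mu0 outside the 95% CI?
    results[mu0] = (t_stat, reject, outside_ci)
    print(f"mu0={mu0}: t={t_stat:.3f}, reject={reject}, outside CI={outside_ci}")
```

For each hypothesized value, the reject decision and the "outside the interval" check agree, as the relationship above predicts.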
Two-Sample t-Test
When to Use and Effect Size
Use: To compare the means of two independent groups.
Effect Size: Quantifies the magnitude of the difference between groups (for example, Cohen's d, the difference in means divided by a pooled standard deviation), aiding interpretation beyond statistical significance.
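A two-group comparison might be sketched as follows. These notes do not say which version of the two-sample test the course uses, so the code assumes Welch's unequal-variance t statistic with the Welch-Satterthwaite degrees of freedom, and Cohen's d with a pooled standard deviation as the effect size. The data are made up.

```python
import math
import statistics

# Hypothetical measurements for two independent groups.
group_a = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4]
group_b = [4.4, 4.8, 4.6, 5.0, 4.5, 4.7]

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom (assumed variant)."""
    va = statistics.variance(a) / len(a)  # squared standard error, group a
    vb = statistics.variance(b) / len(b)  # squared standard error, group b
    t_stat = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t_stat, df

def cohens_d(a, b):
    """Difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

t_stat, df = welch_t(group_a, group_b)
d = cohens_d(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df:.2f}, Cohen's d = {d:.3f}")
```

The t statistic would then be compared with a t critical value for the computed degrees of freedom, while d describes how large the difference is in standard deviation units regardless of sample size.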
Joint and Conditional Probability
Calculations and Independence
Joint Probability: Probability that two events occur together: P(A ∩ B).
Marginal Probability: Probability of a single event, regardless of others.
Conditional Probability: Probability of one event given another: P(A|B) = P(A ∩ B)/P(B).
Multiplication Rule: P(A ∩ B) = P(A|B)P(B).
Independence: Events A and B are independent if P(A|B) = P(A), or equivalently if P(A ∩ B) = P(A)P(B).
Law of Total Probability: If B₁, B₂, ..., Bₙ are mutually exclusive and exhaustive, then P(A) = Σ P(A|Bᵢ)P(Bᵢ).
Bayes’ Theorem: Updates probabilities based on new information: P(A|B) = P(B|A)P(A)/P(B).
Example Table:
| Event A | Event B | Joint P(A ∩ B) | Conditional P(A \| B) | Marginal P(B) |
|---|---|---|---|---|
| Yes | Yes | 0.10 | 0.20 | 0.50 |
| Yes | No | 0.15 | 0.30 | 0.50 |
| No | Yes | 0.40 | 0.80 | 0.50 |
| No | No | 0.35 | 0.70 | 0.50 |
Additional info: Table values are illustrative.
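The rules in this section can be verified against the joint probabilities in the example table. The sketch below stores the four joint probabilities in a dictionary keyed by (outcome of A, outcome of B); the helper names are hypothetical.

```python
import math

# Joint probabilities P(A = a and B = b) taken from the example table.
joint = {
    ("Yes", "Yes"): 0.10,
    ("Yes", "No"): 0.15,
    ("No", "Yes"): 0.40,
    ("No", "No"): 0.35,
}

def marginal_a(a):
    return sum(p for (a_val, _), p in joint.items() if a_val == a)

def marginal_b(b):
    return sum(p for (_, b_val), p in joint.items() if b_val == b)

def a_given_b(a, b):
    """Conditional probability P(A = a | B = b) = joint / marginal of B."""
    return joint[(a, b)] / marginal_b(b)

p_a = marginal_a("Yes")
p_a_given_b = a_given_b("Yes", "Yes")

# Law of total probability: sum of P(A | B = b) * P(B = b) over both outcomes of B.
p_a_total = sum(a_given_b("Yes", b) * marginal_b(b) for b in ("Yes", "No"))

# Bayes' theorem: P(B = Yes | A = Yes) = P(A | B) * P(B) / P(A).
p_b_given_a = p_a_given_b * marginal_b("Yes") / p_a

# Independence would require P(A | B) = P(A).
independent = math.isclose(p_a_given_b, p_a)

print(p_a, p_a_given_b, p_a_total, p_b_given_a, independent)
```

Because P(A = Yes | B = Yes) differs from P(A = Yes), the two events in the illustrative table are not independent.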
Chi-Squared Test for Independence
When to Use
Purpose: To test whether two categorical variables are independent.
Data: Requires a contingency table of observed frequencies.
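Test Statistic: \( \chi^2 = \sum \frac{(O - E)^2}{E} \), summed over all cells of the table
O: observed frequency
E: expected frequency under independence, computed as (row total × column total) / grand total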
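The statistic sums (O − E)²/E over every cell, with each expected count computed as (row total × column total) / grand total. The sketch below applies this to a made-up 2 × 2 table; the critical value 3.841 is the standard chi-squared table entry for 1 degree of freedom at α = 0.05 and is hard-coded rather than computed.

```python
# Hypothetical 2 x 2 contingency table of observed counts (rows x columns).
observed = [
    [30, 20],
    [20, 30],
]

def chi_squared_statistic(observed):
    """Return (chi-squared statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if the two variables were independent.
            e = row_totals[i] * col_totals[j] / grand_total
            chi2 += (o - e) ** 2 / e

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

CRITICAL_VALUE_DF1 = 3.841  # chi-squared table: df = 1, alpha = 0.05

chi2, df = chi_squared_statistic(observed)
reject_independence = chi2 > CRITICAL_VALUE_DF1
print(f"chi2 = {chi2:.3f}, df = {df}, reject H0: {reject_independence}")
```

Every expected count in this table is 25, so each cell contributes (5)²/25 = 1 to the statistic.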
Covariance and Correlation
Joint Distributions and Correlation Coefficient
Jointly Distributed Random Variables: Two or more variables whose values are observed together.
Covariance: Measures the direction of the linear relationship between two variables. Its magnitude depends on the variables' units, so it does not indicate strength on its own.
Correlation Coefficient (r): Measures the strength and direction of a linear relationship, scaled between -1 and 1.
Interpretation: Correlation quantifies linear association but does not imply causation or capture non-linear relationships.
Population Correlation (ρ): Estimated using the sample correlation coefficient r.
t-Test for Lack of Correlation: Used to test H₀: ρ = 0, with test statistic \( t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \) on n − 2 degrees of freedom.
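Covariance, correlation, and the test of zero correlation can be computed by hand on a small data set. The sketch below uses five made-up (x, y) pairs, the sample covariance with divisor n − 1, and the statistic t = r√(n − 2)/√(1 − r²); the critical value 3.182 is the t-table entry for 3 degrees of freedom at a two-sided α = 0.05.

```python
import math

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Sample correlation r = cov(x, y) / (s_x * s_y)."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

cov_xy = covariance(x, y)
r = correlation(x, y)

# t statistic for H0: population correlation = 0, with n - 2 degrees of freedom.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

T_CRIT_DF3 = 3.182  # t table: df = 3, two-sided alpha = 0.05
reject = abs(t_stat) > T_CRIT_DF3

print(f"cov = {cov_xy:.2f}, r = {r:.4f}, t = {t_stat:.4f}, reject H0: {reject}")
```

Even a fairly large r does not reach significance here because the sample has only five pairs, which shows why sample size matters when interpreting a correlation.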
Designing and Interpreting Statistical Tests
Test Selection, Hypotheses, and Assumptions
Test Selection: Choose the appropriate test based on the research question and data type (e.g., one-sample t-test, two-sample t-test, chi-squared test, t-test for correlation).
Hypotheses: Clearly state null and alternative hypotheses relevant to the research question.
Assumptions: List assumptions (e.g., normality, independence, equal variances) required for the validity of the test.
Test Statistic: Calculate the appropriate statistic for the test.
Interpreting Results
Statistical Decision: Based on the p-value or confidence interval, decide whether to reject or fail to reject the null hypothesis.
Contextual Interpretation: Relate the statistical result back to the original research question.
Exam Preparation Tips
Focus on conceptual understanding, not just calculations.
Practice with a variety of problems, including those from WebAssign.
Familiarize yourself with your calculator and formula sheet.
Show all work on the exam for partial credit.
Formulate a plan for each problem, even if unsure of the final answer.