Stat 240/250 Midterm Study Guide: Describing and Analyzing Quantitative Data & Regression
Describing Data: The Five W's and Variable Types
Cases and Variables
Understanding the structure of a dataset begins with identifying the cases (the subjects or units of analysis) and the variables (the characteristics measured).
Who: Refers to the cases or subjects being studied.
What: Refers to the variables measured for each case.
Variable Types
Quantitative Variables: Variables that represent numerical values and can be measured or counted (e.g., height, weight).
Categorical Variables: Variables that represent categories or groups (e.g., gender, color).
Case: An individual unit or subject in the dataset.
Describing Quantitative Variables: Graphs and Numerical Summaries
Graphical Representations
Histogram: A graphical display of the distribution of a quantitative variable. Shows shape, center, and spread.
Measures of Center
Mean: The arithmetic average. Sensitive to outliers.
Median: The middle value when data are ordered. Resistant to outliers.
Percentiles and Five-Number Summary
Percentiles: Values below which a certain percent of data fall. The five-number summary uses specific percentiles.
Five-Number Summary: Minimum, Q1 (25th percentile), Median (50th percentile), Q3 (75th percentile), Maximum.
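As a rough illustration, the five-number summary can be computed with Python's standard `statistics` module. The `heights` values are made up, and the `inclusive` quartile method is only one of several conventions, so a calculator or other software may report slightly different values for Q1 and Q3.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # n=4 splits the data into quarters; the three cut points are Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

heights = [60, 62, 63, 65, 66, 68, 70, 72, 75]  # hypothetical heights in inches
print(five_number_summary(heights))
```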
Measures of Spread
Standard Deviation (SD): Measures the typical distance of values from the mean. Nonresistant to outliers.
Interquartile Range (IQR): Difference between Q3 and Q1. Resistant to outliers.
Choosing Statistics
Use mean and SD for symmetric distributions without outliers.
Use median and IQR for skewed distributions or with outliers.
Resistant vs. Nonresistant Statistics
Resistant: Not affected by extreme values (e.g., median, IQR).
Nonresistant: Sensitive to extreme values (e.g., mean, SD).
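A small sketch of what "resistant" means in practice: replace one value with an extreme one and see which statistics move. The `incomes` data are invented for illustration, and the quartiles use the `inclusive` convention, which is an assumption rather than the course's required method.

```python
import statistics

def summarize(values):
    """Return the mean, median, SD, and IQR of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
    }

incomes = [30, 32, 35, 36, 38, 40, 41]   # hypothetical incomes, in $1000s
with_outlier = incomes[:-1] + [400]      # largest value replaced by an extreme one

print(summarize(incomes))
print(summarize(with_outlier))
```

In this example the median and IQR are identical for both lists, while the mean and SD change substantially.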
Identifying Shape
Four ways: symmetry/skewness, modality (number of peaks), outliers, spread.
Boxplots and Outlier Detection
Boxplots
Visualize the five-number summary.
Can compare distributions side-by-side.
Show shape, center, spread, and outliers.
Outlier Identification
Outliers are values that fall outside typical boundaries.
Common boundaries: values more than 2 SD from the mean (moderate outliers) or more than 3 SD from the mean (serious outliers).
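The ±2 SD and ±3 SD boundaries can be applied mechanically, as in this minimal sketch. The function name `classify_by_sd` and the exam-score numbers are hypothetical, and a value exactly on a boundary is treated here as not exceeding it.

```python
def classify_by_sd(value, mean, sd):
    """Label a value by how many SDs it lies from the mean."""
    distance_in_sds = abs(value - mean) / sd
    if distance_in_sds > 3:
        return "serious outlier"
    if distance_in_sds > 2:
        return "moderate outlier"
    return "not an outlier"

# Hypothetical exam scores with mean 60 and SD 5.
for score in (65, 71, 76):
    print(score, classify_by_sd(score, mean=60, sd=5))
```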
Explanatory vs. Response Variables
Explanatory (Predictor): Variable used to explain or predict another.
Response: Variable being predicted or explained.
Z-Scores
Standardizes values for comparison.
Formula: $z = \frac{x - \bar{x}}{s}$, where $\bar{x}$ is the mean and $s$ is the standard deviation.
Interpretation: How many SDs a value is from the mean.
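Z-scores make values from different scales comparable. The sketch below standardizes two scores from two hypothetical tests; the means and SDs are invented for illustration.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Hypothetical tests: test A has mean 500 and SD 100; test B has mean 21 and SD 5.
z_a = z_score(650, mean=500, sd=100)
z_b = z_score(27, mean=21, sd=5)
print(z_a, z_b)
```

Whichever score has the larger z-score is further above its own mean, relative to the spread of its test.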
Empirical Rule
Applies to normal distributions.
About 68% of data within ±1 SD, 95% within ±2 SD, 99.7% within ±3 SD.
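The 68/95/99.7 percentages can be checked against the normal model directly. This sketch uses `statistics.NormalDist` from the standard library; the helper name `share_within` is just an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```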
Describing Two Quantitative Variables: Scatterplots and Correlation
Scatterplots
Graphical display of two quantitative variables.
Shows pattern, direction, strength, and outliers.
Correlation
Measures strength and direction of linear relationship.
Range: -1 to 1.
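Formula: $r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$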
Interpretation: Positive/negative, strong/weak.
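One way to see what correlation measures is to compute it as the sum of products of z-scores divided by n - 1. The sketch below does this from scratch; the data are made up, and in practice statistical software would report r directly.

```python
import statistics

def correlation(xs, ys):
    """Pearson correlation computed from products of z-scores."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

xs = [1, 2, 3, 4, 5]   # hypothetical predictor values
ys = [2, 4, 5, 4, 5]   # hypothetical response values
print(round(correlation(xs, ys), 4))
```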
Explanatory vs. Response Variables
Explanatory (predictor) variable on x-axis; response variable on y-axis.
Simple Linear Regression Model
Equation: $\hat{y} = b_0 + b_1 x$
Slope ($b_1$): Predicted change in y per one-unit increase in x.
Intercept ($b_0$): Predicted y when x = 0.
Find predicted value by substituting x.
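Finding a predicted value is plain substitution into the fitted line. The intercept and slope below are invented for a hypothetical height-to-weight model and do not come from any real dataset.

```python
def predict(b0, b1, x):
    """Predicted response from a fitted line: intercept plus slope times x."""
    return b0 + b1 * x

# Hypothetical fitted line: predicted weight (lb) = -100 + 3.5 * height (in).
print(predict(b0=-100.0, b1=3.5, x=68))
```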
Regression Model Variation and Diagnostics
Least Squares Method
Estimates slope and intercept by minimizing sum of squared residuals.
Residual: $e_i = y_i - \hat{y}_i$ (observed value minus predicted value).
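A minimal sketch of the least squares calculation for one predictor, using the standard closed-form estimates (slope = sum of cross-deviations divided by sum of squared x-deviations; intercept = mean of y minus slope times mean of x). The data and function names are illustrative.

```python
import statistics

def fit_line(xs, ys):
    """Return (b0, b1), the least squares intercept and slope."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def residuals(xs, ys, b0, b1):
    """Observed minus predicted response for each case."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

xs = [1, 2, 3, 4, 5]   # hypothetical predictor values
ys = [2, 4, 5, 4, 5]   # hypothetical response values
b0, b1 = fit_line(xs, ys)
print(b0, b1)
print(residuals(xs, ys, b0, b1))
```

For a least squares line with an intercept, the residuals sum to zero (up to rounding), which is a quick sanity check on the fit.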
Analysis of Variance (ANOVA) Table
Summarizes sources of variation in regression.
Components: Regression, Residual/Error, Total.
Generic ANOVA table:
| Source | DF | SS | MS | F |
|---|---|---|---|---|
| Regression | 1 | SSR | MSR = SSR/1 | F = MSR/MSE |
| Residual | n-2 | SSE | MSE = SSE/(n-2) | |
| Total | n-1 | SST | | |
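The entries of the ANOVA table can be computed directly from a fitted line. This is a sketch for simple linear regression only (one predictor, so the regression DF is 1); the function name and data are hypothetical.

```python
import statistics

def regression_anova(xs, ys):
    """Return the sums of squares, mean squares, and F for a one-predictor fit."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]

    ssr = sum((f - y_bar) ** 2 for f in fitted)           # explained by the line
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # left in the residuals
    sst = sum((y - y_bar) ** 2 for y in ys)               # total variation in y
    msr = ssr / 1
    mse = sse / (n - 2)
    return {"SSR": ssr, "SSE": sse, "SST": sst, "MSR": msr, "MSE": mse, "F": msr / mse}

table = regression_anova([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])  # made-up data
for name, value in table.items():
    print(name, round(value, 4))
```

Note that SSR + SSE equals SST, which is why the Total row needs no separate calculation.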
Quantity Minimized
Least squares minimizes the sum of squared residuals: $\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
R² (Coefficient of Determination)
Proportion of variance in y explained by x.
Formula: $R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}$
Interpretation: Higher R² means better fit, but does not guarantee best model.
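As a quick numeric check, R² can be computed as 1 - SSE/SST from the observed and predicted responses. The `predicted` values below are assumed to come from a hypothetical fitted line (y-hat = 2.2 + 0.6x at x = 1, ..., 5), not from real data.

```python
import statistics

def r_squared(observed, predicted):
    """Proportion of the variation in the observed values explained by the predictions."""
    y_bar = statistics.mean(observed)
    sse = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    sst = sum((y - y_bar) ** 2 for y in observed)
    return 1 - sse / sst

observed = [2, 4, 5, 4, 5]               # hypothetical responses
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]    # from the assumed line 2.2 + 0.6x
print(round(r_squared(observed, predicted), 4))
```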
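Standard Error of Residuals ($s_e$)
Measures typical size of residuals: $s_e = \sqrt{\text{SSE} / (n - 2)}$.
Compare $s_e$ to $s_y$ (SD of y) to assess prediction quality; an $s_e$ much smaller than $s_y$ means the line predicts y better than the mean of y alone.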
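The comparison between the residual standard error and the SD of y can be sketched as follows, using the square root of SSE/(n - 2) for simple linear regression. The responses and residuals are hypothetical values from an assumed fit.

```python
import math
import statistics

def residual_standard_error(residuals):
    """Typical residual size for a one-predictor regression: sqrt(SSE / (n - 2))."""
    n = len(residuals)
    return math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

ys = [2, 4, 5, 4, 5]                       # hypothetical responses
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]   # residuals from an assumed fitted line

s_e = residual_standard_error(residuals)
s_y = statistics.stdev(ys)
print(round(s_e, 4), round(s_y, 4))
```

Here the residual standard error is smaller than the SD of y, so the line's predictions are typically closer to the data than the mean of y would be.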
Regression Model Conditions and Diagnostics
Model Conditions
Check graphs for linearity, constant variance, independence, and normality of residuals.
Meeting conditions is distinct from model's predictive ability.
Regression Diagnostics: Unusual Observations
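R flags (y-direction): Observations with unusually large residuals, i.e., outliers in the response variable.
X flags (x-direction): Observations with unusual predictor values, i.e., outliers in the predictor variable.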
Highly Influential Observations: Points that greatly affect regression estimates.
Outlier Boundaries and Labels
Moderate outliers: Beyond ±2 SD.
Serious outliers: Beyond ±3 SD.
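One common way to apply the ±2 and ±3 boundaries in regression is to standardized residuals. The sketch below assumes that interpretation and uses the usual simple-regression leverage formula (1/n plus the squared x-deviation over the sum of squared x-deviations); the course software may compute or label these flags differently, and the data and function names are hypothetical.

```python
import math
import statistics

def standardized_residuals(xs, ys):
    """Residuals divided by their estimated standard deviations."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    raw = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    s_e = math.sqrt(sum(e ** 2 for e in raw) / (n - 2))
    result = []
    for x, e in zip(xs, raw):
        leverage = 1 / n + (x - x_bar) ** 2 / sxx  # larger for x far from x_bar
        result.append(e / (s_e * math.sqrt(1 - leverage)))
    return result

def flag(standardized):
    """Label a standardized residual with the ±2 / ±3 boundaries."""
    size = abs(standardized)
    if size > 3:
        return "serious"
    if size > 2:
        return "moderate"
    return "none"

std = standardized_residuals([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])  # made-up data
print([(round(s, 3), flag(s)) for s in std])
```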
Summary and Integration
Bringing Concepts Together
Apply concepts of descriptive statistics and regression to analyze real datasets.
Use worksheets and practice problems to reinforce understanding.
Example: Given a dataset of heights and weights, describe the distribution of heights using histograms, five-number summary, and boxplots. Then, analyze the relationship between height and weight using scatterplots, correlation, and linear regression.