Stat 240/250 Midterm Study Guide: Describing and Analyzing Quantitative Data & Regression


Describing Data: The Five W's and Variable Types

Cases and Variables

Understanding the structure of a dataset begins with identifying the cases (the subjects or units of analysis) and the variables (the characteristics measured).

Who: Refers to the cases or subjects being studied.
What: Refers to the variables measured for each case.

Variable Types

Quantitative Variables: Variables that represent numerical values and can be measured or counted (e.g., height, weight).
Categorical Variables: Variables that represent categories or groups (e.g., gender, color).

Case: An individual unit or subject in the dataset.

Describing Quantitative Variables: Graphs and Numerical Summaries

Graphical Representations

Histogram: A graphical display of the distribution of a quantitative variable. Shows shape, center, and spread.

Measures of Center

Mean: The arithmetic average. Sensitive to outliers.
Median: The middle value when data are ordered. Resistant to outliers.

Percentiles and Five-Number Summary

Percentiles: Values below which a certain percent of data fall. The five-number summary uses specific percentiles.
Five-Number Summary: Minimum, Q1 (25th percentile), Median (50th percentile), Q3 (75th percentile), Maximum.
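
As a rough illustration, the five-number summary can be computed with Python's standard library. The heights below are made up, and the "inclusive" quartile method is an assumption: textbooks and software use slightly different quartile conventions, so Q1 and Q3 may differ a little from your course's answers.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    data = sorted(values)
    # n=4 splits the data into quarters, giving the three quartile cut points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

heights = [60, 62, 64, 65, 66, 68, 70, 71, 75]  # hypothetical heights (inches)
summary = five_number_summary(heights)
print(summary)  # (60, 64.0, 66.0, 70.0, 75)
```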

Measures of Spread

Standard Deviation (SD): Measures the typical distance of observations from the mean. Nonresistant to outliers.
Interquartile Range (IQR): Difference between Q3 and Q1. Resistant to outliers.

Choosing Statistics

Use mean and SD for symmetric distributions without outliers.
Use median and IQR for skewed distributions or with outliers.

Resistant vs. Nonresistant Statistics

Resistant: Not affected by extreme values (e.g., median, IQR).
Nonresistant: Sensitive to extreme values (e.g., mean, SD).
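
A small sketch of what resistance means in practice: replace one value with an extreme one and see which statistics move. The data are invented, and the quartile method is an assumption (other conventions give slightly different IQRs).

```python
import statistics

def iqr(values):
    """Interquartile range: Q3 minus Q1."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

clean = [10, 12, 13, 14, 15, 16, 18]   # hypothetical data
with_outlier = clean[:-1] + [180]      # largest value replaced by an extreme one

for label, data in (("clean", clean), ("with outlier", with_outlier)):
    print(
        label,
        "mean =", round(statistics.mean(data), 2),
        "SD =", round(statistics.stdev(data), 2),
        "median =", statistics.median(data),
        "IQR =", iqr(data),
    )
```

The median and IQR are identical for both lists, while the mean and SD change sharply once the extreme value is present.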

Identifying Shape

Four ways: symmetry/skewness, modality (number of peaks), outliers, spread.

Boxplots and Outlier Detection

Boxplots

Visualize the five-number summary.
Can compare distributions side-by-side.
Show center, spread, skewness, and outliers (but not the number of peaks).

Outlier Identification

Outliers are values that fall outside typical boundaries.
Common boundaries: values more than 2 SD from the mean (moderate outliers) and more than 3 SD from the mean (serious outliers).

Explanatory vs. Response Variables

Explanatory (Predictor): Variable used to explain or predict another.
Response: Variable being predicted or explained.

Z-Scores

Standardizes values for comparison.
Formula: $z = \frac{x - \bar{x}}{s}$, where $\bar{x}$ is the mean and $s$ is the standard deviation.
Interpretation: How many SDs a value is from the mean.
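
A minimal sketch of computing z-scores and applying the ±2 SD and ±3 SD outlier labels used in this guide. The scores are invented, and the function names are only illustrative.

```python
import statistics

def z_scores(values):
    """Standardize each value: (value - mean) / sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (divides by n - 1)
    return [(v - mean) / sd for v in values]

def label_outlier(z):
    """Label a z-score using the +/-2 and +/-3 SD boundaries."""
    if abs(z) > 3:
        return "serious outlier"
    if abs(z) > 2:
        return "moderate outlier"
    return "not flagged"

scores = [50, 52, 53, 54, 55, 55, 56, 57, 58, 90]  # hypothetical exam scores
zs = z_scores(scores)
labels = [label_outlier(z) for z in zs]
for score, z, lab in zip(scores, zs, labels):
    print(score, round(z, 2), lab)
```

Here the score of 90 is about 2.8 SDs above the mean, so it is labeled a moderate outlier; every other score is within 1 SD.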

Empirical Rule

Applies to distributions that are approximately normal (symmetric and bell-shaped).
About 68% of data within ±1 SD, 95% within ±2 SD, 99.7% within ±3 SD.
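
The percentages in the empirical rule can be checked against an exact normal distribution. This sketch uses `statistics.NormalDist` from the Python standard library; the helper name `prob_within` is made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def prob_within(k):
    """Probability that a normal value lies within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {prob_within(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```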

Describing Two Quantitative Variables: Scatterplots and Correlation

Scatterplots

Graphical display of two quantitative variables.
Shows pattern, direction, strength, and outliers.

Correlation

Measures strength and direction of linear relationship.
Range: -1 to 1.
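Formula: $r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$
Interpretation: The sign gives the direction (positive or negative); values near -1 or 1 indicate a strong linear relationship, and values near 0 indicate a weak one.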
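
One way to see the correlation as an average product of z-scores is to compute it by hand. The sketch below assumes small made-up height and weight data.

```python
import statistics

def correlation(xs, ys):
    """Pearson correlation: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

heights = [60, 62, 64, 66, 68, 70]        # hypothetical heights (inches)
weights = [115, 120, 130, 140, 150, 155]  # hypothetical weights (pounds)
r = correlation(heights, weights)
print(round(r, 3))
```

For these data r is close to 1, which matches the strong, positive, roughly linear pattern a scatterplot of the same points would show.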

Explanatory vs. Response Variables

Explanatory (predictor) variable on x-axis; response variable on y-axis.

Simple Linear Regression Model

Equation: $\hat{y} = b_0 + b_1 x$
Slope ($b_1$): Change in predicted y for each one-unit increase in x.
Intercept ($b_0$): Predicted y when x = 0.
Find a predicted value by substituting x into the equation.
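
A short sketch of fitting the line and making a prediction, using the standard least-squares formulas (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x). The data and the function name `fit_line` are hypothetical.

```python
import statistics

def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx           # slope
    b0 = y_bar - b1 * x_bar  # intercept
    return b0, b1

heights = [60, 62, 64, 66, 68, 70]        # hypothetical heights (inches)
weights = [115, 120, 130, 140, 150, 155]  # hypothetical weights (pounds)
b0, b1 = fit_line(heights, weights)

# Predicted weight for a height of 65 inches: substitute x = 65.
predicted = b0 + b1 * 65
print(round(b0, 2), round(b1, 2), round(predicted, 2))
```

The slope here is about 4.29, so each extra inch of height goes with roughly 4.29 more pounds of predicted weight. The intercept is negative, a reminder that predicting at x = 0 is often extrapolation with no practical meaning.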

Regression Model Variation and Diagnostics

Least Squares Method

Estimates slope and intercept by minimizing sum of squared residuals.
Residual: $e_i = y_i - \hat{y}_i$ (observed value minus predicted value).

Analysis of Variance (ANOVA) Table

Summarizes sources of variation in regression.
Components: Regression, Residual/Error, Total.
Generic ANOVA table:

Source	DF	SS	MS	F
Regression	1	SSR	MSR = SSR/1	F = MSR/MSE
Residual	n-2	SSE	MSE = SSE/(n-2)
Total	n-1	SST
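
To make the table concrete, the sketch below fills in each entry for a small made-up dataset. The function name and the dictionary layout are illustrative choices, not the output format of any particular software.

```python
import statistics

def anova_table(xs, ys):
    """Compute the ANOVA entries for a simple linear regression."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * x for x in xs]

    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained variation
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained variation
    sst = sum((y - y_bar) ** 2 for y in ys)               # total variation
    msr = ssr / 1
    mse = sse / (n - 2)
    return {"SSR": ssr, "SSE": sse, "SST": sst, "MSR": msr, "MSE": mse, "F": msr / mse}

heights = [60, 62, 64, 66, 68, 70]        # hypothetical heights (inches)
weights = [115, 120, 130, 140, 150, 155]  # hypothetical weights (pounds)
table = anova_table(heights, weights)
for name, value in table.items():
    print(name, round(value, 3))
```

The printout also confirms the identity SST = SSR + SSE, which is why the Total row is the sum of the other two rows in both the DF and SS columns.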

Quantity Minimized

Least squares minimizes the sum of squared residuals: $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

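R² (Coefficient of Determination)

Proportion of the variation in y explained by the linear relationship with x.
Formula: $R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
Interpretation: A higher R² means the line accounts for more of the variation in y, but it does not guarantee that the model is appropriate or the best available.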

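Standard Error of Residuals ($s_e$)

Measures the typical size of the residuals: $s_e = \sqrt{\frac{SSE}{n-2}}$
Compare $s_e$ to $s_y$ (the SD of y) to assess prediction quality: the smaller $s_e$ is relative to $s_y$, the more the regression improves on simply predicting the mean of y.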
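
A sketch of both fit measures on made-up data: the coefficient of determination as 1 - SSE/SST, and the standard error of the residuals as the square root of SSE/(n - 2), compared with the SD of y. The function name `fit_summary` is hypothetical.

```python
import math
import statistics

def fit_summary(xs, ys):
    """Return (R-squared, standard error of residuals, SD of y)."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar

    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    r_squared = 1 - sse / sst
    s_e = math.sqrt(sse / (n - 2))  # typical residual size
    s_y = statistics.stdev(ys)      # typical deviation of y from its mean
    return r_squared, s_e, s_y

heights = [60, 62, 64, 66, 68, 70]        # hypothetical heights (inches)
weights = [115, 120, 130, 140, 150, 155]  # hypothetical weights (pounds)
r_squared, s_e, s_y = fit_summary(heights, weights)
print(round(r_squared, 3), round(s_e, 2), round(s_y, 2))
```

In this example the typical residual (about 1.9 pounds) is far smaller than the SD of the weights (about 16 pounds), which agrees with the high proportion of variation explained.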

Regression Model Conditions and Diagnostics

Model Conditions

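Use residual plots to check linearity, constant variance, independence, and normality of the residuals.
Meeting the conditions is separate from predictive ability: a model can satisfy every condition and still predict poorly, or predict well while violating a condition.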

Regression Diagnostics: Unusual Observations

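R flags (unusual in y): Observations with a large standardized residual, meaning the response is far from the value the line predicts.
X flags (unusual in x): Observations whose predictor value is far from the other x values (high leverage).
Highly Influential Observations: Points that substantially change the slope or intercept when removed; high-leverage points with large residuals are the usual candidates.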

Outlier Boundaries and Labels

Moderate outliers: Standardized value (z-score or standardized residual) beyond ±2.
Serious outliers: Standardized value beyond ±3.
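
The flags can be sketched by hand for a simple linear regression. This example uses the standard leverage formula $h_i = 1/n + (x_i - \bar{x})^2 / S_{xx}$ and standardized residuals $e_i / (s_e \sqrt{1 - h_i})$. The leverage cutoff of 3 × 2 / n is an assumption taken from one common rule of thumb; your course software may use a different cutoff. The data are invented so that one point is unusual in y and another is unusual in x.

```python
import math
import statistics

def unusual_observations(xs, ys):
    """Return a flag string per observation: 'R', 'X', 'RX', or ''."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

    leverage_cutoff = 3 * 2 / n  # assumed rule of thumb (2 = number of coefficients)
    flags = []
    for x, e in zip(xs, residuals):
        h = 1 / n + (x - x_bar) ** 2 / sxx         # leverage of this point
        std_resid = e / (s_e * math.sqrt(1 - h))   # standardized residual
        flag = ""
        if abs(std_resid) > 2:   # at least a moderate outlier in y
            flag += "R"
        if h > leverage_cutoff:  # unusual predictor value
            flag += "X"
        flags.append(flag)
    return flags

# Hypothetical data: the 5th point has an unusually high y,
# and the last point has an x value far from the rest.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]
ys = [2.1, 3.9, 6.2, 7.8, 18.0, 12.1, 13.9, 16.2, 17.8, 20.1, 60.0]
flags = unusual_observations(xs, ys)
for x, y, flag in zip(xs, ys, flags):
    print(x, y, flag)
```

Note that the last point has high leverage but lies close to the line, so it is flagged for its x value without having a large residual.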

Summary and Integration

Bringing Concepts Together

Apply concepts of descriptive statistics and regression to analyze real datasets.
Use worksheets and practice problems to reinforce understanding.

Example: Given a dataset of heights and weights, describe the distribution of heights using histograms, five-number summary, and boxplots. Then, analyze the relationship between height and weight using scatterplots, correlation, and linear regression.
