Stat 240/250 Midterm Study Guide: Describing and Analyzing Quantitative Data & Regression

Describing Data: The Five W's and Variable Types

Cases and Variables

Understanding the structure of a dataset begins with identifying the cases (the subjects or units of analysis) and the variables (the characteristics measured).

  • Who: Refers to the cases or subjects being studied.

  • What: Refers to the variables measured for each case.

Variable Types

  • Quantitative Variables: Variables that represent numerical values and can be measured or counted (e.g., height, weight).

  • Categorical Variables: Variables that represent categories or groups (e.g., gender, color).

Case: An individual unit or subject in the dataset.

Describing Quantitative Variables: Graphs and Numerical Summaries

Graphical Representations

  • Histogram: A graphical display of the distribution of a quantitative variable. Shows shape, center, and spread.

Measures of Center

  • Mean: The arithmetic average. Sensitive to outliers.

  • Median: The middle value when data are ordered. Resistant to outliers.

Percentiles and Five-Number Summary

  • Percentiles: Values below which a certain percent of data fall. The five-number summary uses specific percentiles.

  • Five-Number Summary: Minimum, Q1 (25th percentile), Median (50th percentile), Q3 (75th percentile), Maximum.
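As a minimal sketch of the five-number summary, the snippet below uses Python's standard `statistics` module. The `heights` values are made up for illustration, and the `"inclusive"` quartile method is an assumption: textbooks and software use several quartile conventions, so Q1 and Q3 can differ slightly from a hand calculation.

```python
import statistics

# Hypothetical heights in inches (invented for illustration).
heights = [61, 63, 64, 66, 67, 68, 70, 72, 75]

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
# method="inclusive" is one of several quartile conventions.
q1, median, q3 = statistics.quantiles(heights, n=4, method="inclusive")

five_number = (min(heights), q1, median, q3, max(heights))
print(five_number)  # (61, 64.0, 67.0, 70.0, 75)
```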

Measures of Spread

  • Standard Deviation (SD): Measures the typical distance of observations from the mean. Nonresistant to outliers.

  • Interquartile Range (IQR): Difference between Q3 and Q1. Resistant to outliers.

Choosing Statistics

  • Use mean and SD for symmetric distributions without outliers.

  • Use median and IQR for skewed distributions or with outliers.

Resistant vs. Nonresistant Statistics

  • Resistant: Not affected by extreme values (e.g., median, IQR).

  • Nonresistant: Sensitive to extreme values (e.g., mean, SD).
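To see resistance in action, the sketch below swaps the largest value of a made-up sample for an extreme one and recomputes each statistic. The data and the `summarize` helper are hypothetical, and the quartile method is an assumed convention.

```python
import statistics

# Hypothetical sample, then the same sample with its maximum replaced by 120.
data = [61, 63, 64, 66, 67, 68, 70, 72, 75]
with_outlier = data[:-1] + [120]

def summarize(values):
    """Return mean, median, SD, and IQR for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample SD (divides by n - 1)
        "iqr": q3 - q1,
    }

before = summarize(data)
after = summarize(with_outlier)

for name in ("mean", "median", "sd", "iqr"):
    print(name, round(before[name], 2), "->", round(after[name], 2))
```

In this sample the median and IQR stay the same while the mean and SD both grow, which is the reason median and IQR are preferred for skewed data or data with outliers.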

Identifying Shape

  • Four ways: symmetry/skewness, modality (number of peaks), outliers, spread.

Boxplots and Outlier Detection

Boxplots

  • Visualize the five-number summary.

  • Can compare distributions side-by-side.

  • Show center, spread, skewness, and outliers, but not modality (the number of peaks).

Outlier Identification

  • Outliers are values that fall outside typical boundaries.

  • Common boundaries: values more than 2 SD from the mean (moderate outliers) and more than 3 SD from the mean (serious outliers).

Explanatory vs. Response Variables

  • Explanatory (Predictor): Variable used to explain or predict another.

  • Response: Variable being predicted or explained.

Z-Scores

  • Standardizes values for comparison.

  • Formula: $z = \frac{x - \bar{x}}{s}$, where $\bar{x}$ is the mean and $s$ is the SD.

  • Interpretation: How many SDs a value is from the mean.
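A z-score is the value minus the mean, divided by the SD. The sketch below computes z-scores and applies the ±2 SD and ±3 SD outlier labels used in this guide. The mean of 70, SD of 10, and the scores are invented for illustration, and the function names are hypothetical.

```python
def z_score(x, mean, sd):
    """Number of SDs that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def outlier_label(z):
    """Label a z-score using the guide's 2 SD / 3 SD boundaries."""
    if abs(z) > 3:
        return "serious outlier"
    if abs(z) > 2:
        return "moderate outlier"
    return "not an outlier"

# Hypothetical exam with mean 70 and SD 10.
mean, sd = 70, 10
for score in (85, 95, 35):
    z = z_score(score, mean, sd)
    print(score, z, outlier_label(z))
# 85 1.5 not an outlier
# 95 2.5 moderate outlier
# 35 -3.5 serious outlier
```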

Empirical Rule

  • Applies to normal distributions.

  • About 68% of data fall within ±1 SD of the mean, 95% within ±2 SD, and 99.7% within ±3 SD.
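The 68/95/99.7 figures can be checked against an exact normal curve. This is a small sketch using `statistics.NormalDist` from the standard library; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal: mean 0, SD 1, so k is measured directly in SDs.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```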

Describing Two Quantitative Variables: Scatterplots and Correlation

Scatterplots

  • Graphical display of two quantitative variables.

  • Shows pattern, direction, strength, and outliers.

Correlation

  • Measures strength and direction of linear relationship.

  • Range: -1 to 1.

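  • Formula: $r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$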

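  • Interpretation: The sign gives the direction (positive or negative). Values near ±1 indicate a strong linear relationship; values near 0 indicate a weak one.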
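One common way to compute the correlation is to average the products of paired z-scores, dividing by n - 1. The sketch below does that directly; the height and weight values are invented for illustration and the `correlation` function is a hypothetical helper, not a library call.

```python
import statistics

# Hypothetical paired data: heights (inches) and weights (pounds).
heights = [60, 62, 64, 66, 68]
weights = [110, 120, 135, 140, 155]

def correlation(x, y):
    """Sum of products of paired z-scores, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    s_x, s_y = statistics.stdev(x), statistics.stdev(y)
    total = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y))
    return total / (n - 1)

r = correlation(heights, weights)
print(round(r, 3))  # 0.992
```

Here r is close to 1, so these made-up points show a strong positive linear relationship.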

Explanatory vs. Response Variables

  • Explanatory (predictor) variable on x-axis; response variable on y-axis.

Simple Linear Regression Model

  • Equation: $\hat{y} = b_0 + b_1 x$

  • Slope (b1): Predicted change in y for a one-unit increase in x.

  • Intercept (b0): Predicted y when x = 0.

  • Find predicted value by substituting x.
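Prediction is plain substitution into the line: predicted y equals the intercept plus the slope times x. In the sketch below the coefficients are hypothetical (a made-up line relating height in inches to weight in pounds), not values from the course materials.

```python
# Hypothetical fitted coefficients (invented for illustration).
b0 = -220.0  # intercept: predicted y when x = 0
b1 = 5.5     # slope: predicted change in y per one-unit increase in x

def predict(x):
    """Predicted response for a given value of the explanatory variable."""
    return b0 + b1 * x

print(predict(65))                # 137.5
print(predict(66) - predict(65))  # 5.5, the slope
```

With these numbers the intercept of -220 has no practical meaning, because a height of 0 is far outside any plausible data range.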

Regression Model Variation and Diagnostics

Least Squares Method

  • Estimates slope and intercept by minimizing sum of squared residuals.

  • Residual: $e = y - \hat{y}$ (observed value minus predicted value).
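A minimal sketch of a least squares fit, using the standard closed-form solution for one predictor: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept makes the line pass through the point of means. The data and the `least_squares` helper are hypothetical.

```python
import statistics

# Hypothetical paired data: heights (inches) and weights (pounds).
heights = [60, 62, 64, 66, 68]
weights = [110, 120, 135, 140, 155]

def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = least_squares(heights, weights)

# Residual = observed y minus predicted y.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(heights, weights)]
sse = sum(e ** 2 for e in residuals)

print(b0, b1)     # -220.0 5.5
print(residuals)  # [0.0, -1.0, 3.0, -3.0, 1.0]
print(sse)        # 20.0
```

The residuals sum to zero, which always holds for a least squares line that includes an intercept.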

Analysis of Variance (ANOVA) Table

  • Summarizes sources of variation in regression.

  • Components: Regression, Residual/Error, Total.

  • Generic ANOVA table:

Source        DF     SS     MS     F
Regression    1      SSR    MSR    F
Residual      n-2    SSE    MSE
Total         n-1    SST
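The entries of the table can be computed directly from a fitted line. The sketch below assumes the usual definitions for simple linear regression: MSR = SSR/1, MSE = SSE/(n-2), and F = MSR/MSE. The data are invented for illustration.

```python
import statistics

# Hypothetical paired data: heights (inches) and weights (pounds).
heights = [60, 62, 64, 66, 68]
weights = [110, 120, 135, 140, 155]
n = len(heights)

# Least squares fit (closed form for one predictor).
x_bar, y_bar = statistics.mean(heights), statistics.mean(weights)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights))
      / sum((x - x_bar) ** 2 for x in heights))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in heights]

# Sums of squares: total, explained by the regression, and left over.
sst = sum((y - y_bar) ** 2 for y in weights)
ssr = sum((f - y_bar) ** 2 for f in fitted)
sse = sum((y - f) ** 2 for y, f in zip(weights, fitted))

# Mean squares divide each SS by its degrees of freedom.
msr = ssr / 1
mse = sse / (n - 2)
f_stat = msr / mse

print("Regression", 1, ssr, msr, round(f_stat, 1))
print("Residual", n - 2, sse, round(mse, 2))
print("Total", n - 1, sst)
```

The regression and residual rows add up to the total row in both the DF and SS columns, which is a quick check when filling in a partially completed table.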

Quantity Minimized

  • Least Squares minimizes $\sum (y_i - \hat{y}_i)^2$, the sum of squared residuals (SSE).

R² (Coefficient of Determination)

  • Proportion of variance in y explained by x.

  • Formula: $R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$

  • Interpretation: A higher R² means the line explains more of the variation in y, but it does not guarantee that this is the best model.
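As a short worked example, the coefficient of determination can be computed from the sums of squares in two equivalent ways: SSR/SST or 1 - SSE/SST. The SSE and SST values below are hypothetical.

```python
# Hypothetical sums of squares (invented for illustration).
sse = 20.0    # residual (unexplained) variation
sst = 1230.0  # total variation in y
ssr = sst - sse  # variation explained by the regression

r_squared = ssr / sst
r_squared_alt = 1 - sse / sst  # same quantity, written a second way

print(round(r_squared, 4))  # 0.9837
```

Read as a percentage, about 98% of the variation in y is explained by the linear relationship with x in this made-up case.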

Standard Error of Residuals (se)

  • Measures typical size of residuals.

  • Compare se to sy (the SD of y): the smaller se is relative to sy, the more the regression improves on predicting every case with the mean of y.
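A sketch of that comparison, assuming the usual simple-regression formula in which the standard error of the residuals is the square root of SSE/(n-2). The weights and residuals are hypothetical values standing in for output from a fitted line.

```python
import math
import statistics

# Hypothetical observed responses and residuals from a fitted line.
weights = [110, 120, 135, 140, 155]
residuals = [0.0, -1.0, 3.0, -3.0, 1.0]
n = len(weights)

# Standard error of the residuals: typical size of a prediction error.
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# SD of y: typical error if every case were predicted with the mean of y.
s_y = statistics.stdev(weights)

print(round(s_e, 2), round(s_y, 2))  # 2.58 17.54
```

Because the residual standard error is much smaller than the SD of y here, using x to predict y cuts the typical prediction error substantially.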

Regression Model Conditions and Diagnostics

Model Conditions

  • Check graphs for linearity, constant variance, independence, and normality of residuals.

  • Meeting conditions is distinct from model's predictive ability.

Regression Diagnostics: Unusual Observations

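  • Y-axis (R flags): Observations unusual in the response (y) direction, meaning points with large residuals.

  • X-axis (X flags): Observations unusual in the predictor (x) direction, meaning x-values far from the other x-values.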

  • Highly Influential Observations: Points that greatly affect regression estimates.

Outlier Boundaries and Labels

  • Moderate outliers: Beyond ±2 SD.

  • Serious outliers: Beyond ±3 SD.

Summary and Integration

Bringing Concepts Together

  • Apply concepts of descriptive statistics and regression to analyze real datasets.

  • Use worksheets and practice problems to reinforce understanding.

Example: Given a dataset of heights and weights, describe the distribution of heights using histograms, five-number summary, and boxplots. Then, analyze the relationship between height and weight using scatterplots, correlation, and linear regression.

