Multiple Regression: Estimation, Interpretation, and Application
Study Guide - Smart Notes
Tailored notes based on your materials, expanded with key definitions, examples, and context.
Multiple Regression Analysis
Introduction to Multiple Regression
Multiple regression is a statistical technique used to model the relationship between a single response variable and two or more explanatory (predictor) variables. This method allows for improved predictions and a better understanding of how several factors simultaneously influence the response variable.
Response Variable (Y): The main variable of interest, which we aim to predict or explain (e.g., asking price of homes).
Explanatory Variables (X1, X2, X3): Variables used to predict the response (e.g., square footage, number of bedrooms, number of bathrooms).
Purpose: To account for more variation in the response variable by including multiple predictors.
Example: Predicting the asking price (in thousands of dollars) of homes in Greenville, SC, using square footage, number of bedrooms, and number of bathrooms as predictors.
Data Table: Home Prices and Features
The following table summarizes the data for 13 homes, including the response and explanatory variables:
| Home | Asking Price (Y, $1000s) | Square Footage (X1) | Bedrooms (X2) | Baths (X3) |
|---|---|---|---|---|
| 1 | 498 | 3800 | 4 | 3.5 |
| 2 | 449 | 2600 | 4 | 3.0 |
| 3 | 435 | 2600 | 5 | 3.5 |
| 4 | 400 | 2250 | 4 | 4.0 |
| 5 | 379 | 3300 | 4 | 3.0 |
| 6 | 375 | 2750 | 3 | 2.5 |
| 7 | 356 | 2200 | 3 | 2.5 |
| 8 | 350 | 3000 | 4 | 2.5 |
| 9 | 340 | 2300 | 3 | 2.0 |
| 10 | 332 | 2600 | 4 | 2.5 |
| 11 | 298 | 2300 | 4 | 2.0 |
| 12 | 280 | 2000 | 4 | 3.0 |
| 13 | 260 | 2200 | 3 | 2.5 |
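The data above can be fit by ordinary least squares. The original output came from statistical software; the sketch below uses NumPy as an assumed alternative tool:

```python
import numpy as np

# Data transcribed from the table above (13 homes).
y = np.array([498, 449, 435, 400, 379, 375, 356, 350, 340, 332, 298, 280, 260], dtype=float)
x1 = np.array([3800, 2600, 2600, 2250, 3300, 2750, 2200, 3000, 2300, 2600, 2300, 2000, 2200], dtype=float)
x2 = np.array([4, 4, 5, 4, 4, 3, 3, 4, 3, 4, 4, 4, 3], dtype=float)
x3 = np.array([3.5, 3.0, 3.5, 4.0, 3.0, 2.5, 2.5, 2.5, 2.0, 2.5, 2.0, 3.0, 2.5])

# Design matrix: a column of ones for the intercept, then the three predictors.
X = np.column_stack([np.ones_like(y), x1, x2, x3])

# Ordinary least squares fit: minimizes the sum of squared residuals.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # [b0, b1, b2, b3]
```

The coefficients printed should match the estimated equation discussed in the next section, up to rounding.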
Multiple Regression Model and Estimation
The general form of the multiple regression equation is:
Regression Equation: \( \hat{y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 \)
Estimated Equation (from data): \( \hat{y} = 25.6 + 0.0719 X_1 - 0.8 X_2 + 55.3 X_3 \)
Interpretation of Coefficients:
b1 = 0.0719: For each additional square foot (holding bedrooms and baths constant), the predicted asking price increases by 0.0719 thousand dollars ($71.90).
b2 = -0.8: For each additional bedroom (holding square footage and baths constant), the predicted asking price decreases by 0.8 thousand dollars ($800).
b3 = 55.3: For each additional bathroom (holding other variables constant), the predicted asking price increases by 55.3 thousand dollars ($55,300).
b0 = 25.6: The predicted price when all predictors are zero (not meaningful in this context).
Example: For a home with 3800 sq ft, 4 bedrooms, and 3.5 baths: \( \hat{y} = 25.6 + 0.0719(3800) - 0.8(4) + 55.3(3.5) = 489.17 \), a predicted asking price of about $489,170.
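A quick computation of this prediction, using the estimated coefficients (values in $1000s):

```python
# Estimated coefficients from the regression output (in $1000s).
b0, b1, b2, b3 = 25.6, 0.0719, -0.8, 55.3
sqft, beds, baths = 3800, 4, 3.5

y_hat = b0 + b1 * sqft + b2 * beds + b3 * baths
print(round(y_hat, 2))  # 489.17, i.e., about $489,170
```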
Regression Output Summary
| Term | Coefficient | SE Coef | T-Value | P-Value | VIF |
|---|---|---|---|---|---|
| Constant | 25.6 | 98.2 | 0.26 | 0.800 | - |
| Square Footage | 0.0719 | 0.0276 | 2.61 | 0.028 | 1.10 |
| Bedrooms | -0.8 | 27.3 | -0.03 | 0.977 | 1.50 |
| Baths | 55.3 | 27.5 | 2.02 | 0.075 | 1.50 |
Interpretation of P-Values: Lower p-values (typically < 0.05) indicate statistical significance. Here, only square footage is significant at the 0.05 level.
VIF (Variance Inflation Factor): Indicates multicollinearity; values near 1 suggest low multicollinearity.
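A VIF can be computed by regressing each predictor on the other predictors and taking \( 1/(1 - R_j^2) \). A minimal sketch on the home data (the `vif` helper is illustrative, not from the original materials):

```python
import numpy as np

# Predictors from the data table: square footage, bedrooms, baths.
x1 = np.array([3800, 2600, 2600, 2250, 3300, 2750, 2200, 3000, 2300, 2600, 2300, 2000, 2200], dtype=float)
x2 = np.array([4, 4, 5, 4, 4, 3, 3, 4, 3, 4, 4, 4, 3], dtype=float)
x3 = np.array([3.5, 3.0, 3.5, 4.0, 3.0, 2.5, 2.5, 2.5, 2.0, 2.5, 2.0, 3.0, 2.5])
P = np.column_stack([x1, x2, x3])

def vif(P, j):
    """VIF for predictor j: regress it on the other predictors, return 1 / (1 - R^2)."""
    target = P[:, j]
    A = np.column_stack([np.ones(len(P)), np.delete(P, j, axis=1)])  # intercept + others
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    sse = float(np.sum((target - A @ coef) ** 2))        # unexplained variation
    sst = float(np.sum((target - target.mean()) ** 2))   # total variation
    return sst / sse  # algebraically equal to 1 / (1 - R^2)

print([round(vif(P, j), 2) for j in range(3)])
```

A VIF of exactly 1 would mean a predictor is uncorrelated with the others; the modest values in the output table indicate multicollinearity is not a concern here.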
Analysis of Variance (ANOVA) Table
| Source | DF | Seq SS | Seq MS | F-Value | P-Value |
|---|---|---|---|---|---|
| Regression | 3 | 36675 | 12225 | 5.72 | 0.018 |
| Error | 9 | 19247 | 2139 | - | - |
| Total | 12 | 55921 | - | - | - |
F-Value and P-Value: The overall model is significant (p = 0.018), indicating that at least one predictor is useful.
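The F-statistic is the ratio of the regression mean square to the error mean square; a quick arithmetic check of the table's entries:

```python
# Sums of squares and degrees of freedom from the ANOVA table.
ss_reg, df_reg = 36675, 3
ss_err, df_err = 19247, 9

ms_reg = ss_reg / df_reg   # mean square for regression
ms_err = ss_err / df_err   # mean square error
f_stat = ms_reg / ms_err
print(ms_reg, round(ms_err), round(f_stat, 2))  # 12225.0 2139 5.72
```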
Model Summary Statistics
| S | R-sq | R-sq(adj) | R-sq(pred) |
|---|---|---|---|
| 46.2441 | 65.58% | 54.11% | 41.13% |
R-squared (R2): Proportion of variance in the response explained by the predictors. Here, 65.6% of the variability in asking price is explained by the model.
Adjusted R-squared: Adjusts for the number of predictors; useful for comparing models with different numbers of predictors.
S: Standard error of the regression (estimate of the typical size of residuals).
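These summary statistics follow directly from the ANOVA sums of squares; a quick check of the reported values (small differences from 46.2441 are due to rounding in the printed SS):

```python
import math

# From the ANOVA table: n = 13 observations, p = 3 predictors.
n, p = 13, 3
ss_err, ss_tot = 19247, 55921

r_sq = 1 - ss_err / ss_tot                          # proportion of variance explained
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - p - 1)   # penalizes extra predictors
s = math.sqrt(ss_err / (n - p - 1))                 # standard error of the regression

print(round(100 * r_sq, 2), round(100 * r_sq_adj, 2), round(s, 1))  # 65.58 54.11 46.2
```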
Correlation Coefficients
Pearson correlation between Asking Price and Square Footage: 0.665 (p = 0.013)
Correlation with Bedrooms: 0.409
Correlation with Baths: 0.626
These are simple correlations, not accounting for other variables. In multiple regression, R2 reflects the combined effect.
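The pairwise correlations can be checked with `np.corrcoef` (a sketch; the values printed should be close to those reported above):

```python
import numpy as np

# Data from the table above.
y = np.array([498, 449, 435, 400, 379, 375, 356, 350, 340, 332, 298, 280, 260], dtype=float)
x1 = np.array([3800, 2600, 2600, 2250, 3300, 2750, 2200, 3000, 2300, 2600, 2300, 2000, 2200], dtype=float)
x2 = np.array([4, 4, 5, 4, 4, 3, 3, 4, 3, 4, 4, 4, 3], dtype=float)
x3 = np.array([3.5, 3.0, 3.5, 4.0, 3.0, 2.5, 2.5, 2.5, 2.0, 2.5, 2.0, 3.0, 2.5])

for name, x in [("Square Footage", x1), ("Bedrooms", x2), ("Baths", x3)]:
    r = np.corrcoef(y, x)[0, 1]  # Pearson correlation with Asking Price
    print(name, round(r, 3))
```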
Fitted Values and Residuals
For each observation, the fitted value (prediction) and residual (difference between observed and predicted) are calculated:
Fitted Value (\( \hat{y} \)): The predicted value from the regression equation.
Residual (e): The difference between the observed and predicted value.
Example: For house 1, observed price = 498, predicted = 489.17, so residual = 8.83 (in thousands of dollars).
Residuals represent unexplained variation, possibly due to omitted variables or random noise.
The Least Squares Principle
The regression coefficients are chosen to minimize the sum of squared residuals (errors): \( \text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \)
Least Squares Solution: The set of coefficients (b0, b1, b2, b3) that minimizes SSE provides the best fit to the data.
For this data, SSE = 19247 (from the ANOVA table).
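A defining property of the least-squares solution is that the residual vector is orthogonal to every column of the design matrix (including the intercept column, so the residuals sum to zero), and no other coefficient vector gives a smaller SSE. A small demonstration on the home data, again using NumPy as an assumed tool:

```python
import numpy as np

# Data from the table above.
y = np.array([498, 449, 435, 400, 379, 375, 356, 350, 340, 332, 298, 280, 260], dtype=float)
x1 = np.array([3800, 2600, 2600, 2250, 3300, 2750, 2200, 3000, 2300, 2600, 2300, 2000, 2200], dtype=float)
x2 = np.array([4, 4, 5, 4, 4, 3, 3, 4, 3, 4, 4, 4, 3], dtype=float)
x3 = np.array([3.5, 3.0, 3.5, 4.0, 3.0, 2.5, 2.5, 2.5, 2.0, 2.5, 2.0, 3.0, 2.5])
X = np.column_stack([np.ones_like(y), x1, x2, x3])

b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
sse = float(resid @ resid)

# Orthogonality: X^T e = 0 at the least-squares solution.
print(float(np.abs(X.T @ resid).max()) < 1e-4)  # True

# Nudging any coefficient away from the least-squares fit increases SSE.
b_alt = b.copy()
b_alt[1] += 0.01
sse_alt = float(np.sum((y - X @ b_alt) ** 2))
print(sse_alt > sse)  # True
```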
Interpretation and Application
Prediction: The regression equation can be used to predict asking price for any combination of square footage, bedrooms, and baths within the range of the data.
Coefficient Interpretation: Each coefficient represents the expected change in the response variable for a one-unit increase in the predictor, holding other variables constant.
Intercept Interpretation: The intercept is the predicted value when all predictors are zero; often not meaningful if zero is outside the data range.
Comparison to Simple Regression: Coefficients in multiple regression differ from those in simple regression due to the adjustment for other variables.
Summary Table: Key Concepts in Multiple Regression
Concept | Definition/Interpretation |
|---|---|
Regression Coefficient | Change in predicted Y for a one-unit increase in X, holding other variables constant |
Intercept | Predicted Y when all X's are zero (may not be meaningful) |
Residual | Observed Y minus predicted Y |
R-squared | Proportion of variance in Y explained by the model |
SSE | Sum of squared residuals; minimized in least squares regression |
Fitted Value | Predicted value from the regression equation |
Additional info: The ANOVA table and inference for individual coefficients will be discussed in a later module. The example demonstrates the process of fitting, interpreting, and applying a multiple regression model using real data and statistical software output.