Multiple Regression: Estimation, Interpretation, and Application
Study Guide - Smart Notes
Tailored notes based on your materials, expanded with key definitions, examples, and context.
Multiple Regression Analysis
Introduction to Multiple Regression
Multiple regression is a statistical technique used to model the relationship between a single response variable and two or more explanatory (predictor) variables. This method allows for improved predictions and a better understanding of how several factors simultaneously influence the response variable.
Response Variable (Y): The main variable of interest, which we aim to predict or explain (e.g., asking price of homes).
Explanatory Variables (X1, X2, X3): Variables used to predict the response (e.g., square footage, number of bedrooms, number of bathrooms).
Purpose: To account for more variation in the response variable by including multiple predictors.
Example: Predicting the asking price (in thousands of dollars) of homes in Greenville, SC, using square footage, number of bedrooms, and number of bathrooms as predictors.
Data Table: Home Prices and Features
The following table summarizes the data for 13 homes, including the response and explanatory variables:
| Home | Asking Price (Y, $1000s) | Square Footage (X1) | Bedrooms (X2) | Baths (X3) |
|---|---|---|---|---|
| 1 | 498 | 3800 | 4 | 3.5 |
| 2 | 449 | 2600 | 4 | 3.0 |
| 3 | 435 | 2600 | 5 | 3.5 |
| 4 | 400 | 2250 | 4 | 4.0 |
| 5 | 379 | 3300 | 4 | 3.0 |
| 6 | 375 | 2750 | 3 | 2.5 |
| 7 | 356 | 2200 | 3 | 2.5 |
| 8 | 350 | 3000 | 4 | 2.5 |
| 9 | 340 | 2300 | 3 | 2.0 |
| 10 | 332 | 2600 | 4 | 2.5 |
| 11 | 298 | 2300 | 4 | 2.0 |
| 12 | 280 | 2000 | 4 | 3.0 |
| 13 | 260 | 2200 | 3 | 2.5 |
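The data above can be fit by ordinary least squares. The original output came from statistical software; the sketch below uses NumPy as an assumed alternative tool:

```python
import numpy as np

# Data transcribed from the table above (13 homes).
y = np.array([498, 449, 435, 400, 379, 375, 356, 350, 340, 332, 298, 280, 260], dtype=float)
x1 = np.array([3800, 2600, 2600, 2250, 3300, 2750, 2200, 3000, 2300, 2600, 2300, 2000, 2200], dtype=float)
x2 = np.array([4, 4, 5, 4, 4, 3, 3, 4, 3, 4, 4, 4, 3], dtype=float)
x3 = np.array([3.5, 3.0, 3.5, 4.0, 3.0, 2.5, 2.5, 2.5, 2.0, 2.5, 2.0, 3.0, 2.5])

# Design matrix: a column of ones for the intercept, then the three predictors.
X = np.column_stack([np.ones_like(y), x1, x2, x3])

# Ordinary least squares fit: minimizes the sum of squared residuals.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # [b0, b1, b2, b3]
```

The coefficients printed should match the estimated equation discussed in the next section, up to rounding.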
Multiple Regression Model and Estimation
The general form of the multiple regression equation is:
Regression Equation: \( \hat{y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 \)
Estimated Equation (from data): \( \hat{y} = 25.6 + 0.0719 X_1 - 0.8 X_2 + 55.3 X_3 \)
Interpretation of Coefficients:
b1 = 0.0719: For each additional square foot (holding bedrooms and baths constant), the predicted asking price increases by 0.0719 thousand dollars ($71.90).
b2 = -0.8: For each additional bedroom (holding square footage and baths constant), the predicted asking price decreases by 0.8 thousand dollars ($800).
b3 = 55.3: For each additional bathroom (holding other variables constant), the predicted asking price increases by 55.3 thousand dollars ($55,300).
b0 = 25.6: The predicted price when all predictors are zero (not meaningful in this context).
Example: For a home with 3800 sq ft, 4 bedrooms, and 3.5 baths: \( \hat{y} = 25.6 + 0.0719(3800) - 0.8(4) + 55.3(3.5) = 489.17 \), a predicted asking price of about $489,170.
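A quick computation of this prediction, using the estimated coefficients (values in $1000s):

```python
# Estimated coefficients from the regression output (in $1000s).
b0, b1, b2, b3 = 25.6, 0.0719, -0.8, 55.3
sqft, beds, baths = 3800, 4, 3.5

y_hat = b0 + b1 * sqft + b2 * beds + b3 * baths
print(round(y_hat, 2))  # 489.17, i.e., about $489,170
```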
Regression Output Summary
| Term | Coefficient | SE Coef | T-Value | P-Value | VIF |
|---|---|---|---|---|---|
| Constant | 25.6 | 98.2 | 0.26 | 0.800 | - |
| Square Footage | 0.0719 | 0.0276 | 2.61 | 0.028 | 1.10 |
| Bedrooms | -0.8 | 27.3 | -0.03 | 0.977 | 1.50 |
| Baths | 55.3 | 27.5 | 2.02 | 0.075 | 1.50 |
Interpretation of P-Values: Lower p-values (typically < 0.05) indicate statistical significance. Here, only square footage is significant at the 0.05 level.
VIF (Variance Inflation Factor): Indicates multicollinearity; values near 1 suggest low multicollinearity.
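A VIF can be computed by regressing each predictor on the other predictors and taking \( 1/(1 - R_j^2) \). A minimal sketch on the home data (the `vif` helper is illustrative, not from the original materials):

```python
import numpy as np

# Predictors from the data table: square footage, bedrooms, baths.
x1 = np.array([3800, 2600, 2600, 2250, 3300, 2750, 2200, 3000, 2300, 2600, 2300, 2000, 2200], dtype=float)
x2 = np.array([4, 4, 5, 4, 4, 3, 3, 4, 3, 4, 4, 4, 3], dtype=float)
x3 = np.array([3.5, 3.0, 3.5, 4.0, 3.0, 2.5, 2.5, 2.5, 2.0, 2.5, 2.0, 3.0, 2.5])
P = np.column_stack([x1, x2, x3])

def vif(P, j):
    """VIF for predictor j: regress it on the other predictors, return 1 / (1 - R^2)."""
    target = P[:, j]
    A = np.column_stack([np.ones(len(P)), np.delete(P, j, axis=1)])  # intercept + others
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    sse = float(np.sum((target - A @ coef) ** 2))        # unexplained variation
    sst = float(np.sum((target - target.mean()) ** 2))   # total variation
    return sst / sse  # algebraically equal to 1 / (1 - R^2)

print([round(vif(P, j), 2) for j in range(3)])
```

A VIF of exactly 1 would mean a predictor is uncorrelated with the others; the modest values in the output table indicate multicollinearity is not a concern here.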
Analysis of Variance (ANOVA) Table
| Source | DF | Seq SS | Seq MS | F-Value | P-Value |
|---|---|---|---|---|---|
| Regression | 3 | 36675 | 12225 | 5.72 | 0.018 |
| Error | 9 | 19247 | 2139 | - | - |
| Total | 12 | 55921 | - | - | - |
F-Value and P-Value: The overall model is significant (p = 0.018), indicating that at least one predictor is useful.
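The F-statistic is the ratio of the regression mean square to the error mean square; a quick arithmetic check of the table's entries:

```python
# Sums of squares and degrees of freedom from the ANOVA table.
ss_reg, df_reg = 36675, 3
ss_err, df_err = 19247, 9

ms_reg = ss_reg / df_reg   # mean square for regression
ms_err = ss_err / df_err   # mean square error
f_stat = ms_reg / ms_err
print(ms_reg, round(ms_err), round(f_stat, 2))  # 12225.0 2139 5.72
```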
Model Summary Statistics
| S | R-sq | R-sq(adj) | R-sq(pred) |
|---|---|---|---|
| 46.2441 | 65.58% | 54.11% | 41.13% |
R-squared (R2): Proportion of variance in the response explained by the predictors. Here, 65.6% of the variability in asking price is explained by the model.
Adjusted R-squared: Adjusts for the number of predictors; useful for comparing models with different numbers of predictors.
S: Standard error of the regression (estimate of the typical size of residuals).
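These summary statistics follow directly from the ANOVA sums of squares; a quick check of the reported values (small differences from 46.2441 are due to rounding in the printed SS):

```python
import math

# From the ANOVA table: n = 13 observations, p = 3 predictors.
n, p = 13, 3
ss_err, ss_tot = 19247, 55921

r_sq = 1 - ss_err / ss_tot                          # proportion of variance explained
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - p - 1)   # penalizes extra predictors
s = math.sqrt(ss_err / (n - p - 1))                 # standard error of the regression

print(round(100 * r_sq, 2), round(100 * r_sq_adj, 2), round(s, 1))  # 65.58 54.11 46.2
```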
Correlation Coefficients
Pearson correlation between Asking Price and Square Footage: 0.665 (p = 0.013)
Correlation with Bedrooms: 0.409
Correlation with Baths: 0.626
These are simple correlations, not accounting for other variables. In multiple regression, R2 reflects the combined effect.
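The pairwise correlations can be checked with `np.corrcoef` (a sketch; the values printed should be close to those reported above):

```python
import numpy as np

# Data from the table above.
y = np.array([498, 449, 435, 400, 379, 375, 356, 350, 340, 332, 298, 280, 260], dtype=float)
x1 = np.array([3800, 2600, 2600, 2250, 3300, 2750, 2200, 3000, 2300, 2600, 2300, 2000, 2200], dtype=float)
x2 = np.array([4, 4, 5, 4, 4, 3, 3, 4, 3, 4, 4, 4, 3], dtype=float)
x3 = np.array([3.5, 3.0, 3.5, 4.0, 3.0, 2.5, 2.5, 2.5, 2.0, 2.5, 2.0, 3.0, 2.5])

for name, x in [("Square Footage", x1), ("Bedrooms", x2), ("Baths", x3)]:
    r = np.corrcoef(y, x)[0, 1]  # Pearson correlation with Asking Price
    print(name, round(r, 3))
```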
Fitted Values and Residuals
For each observation, the fitted value (prediction) and residual (difference between observed and predicted) are calculated:
Fitted Value (\( \hat{y} \)): The predicted value from the regression equation.
Residual (e): The difference between the observed and predicted value.
Example: For house 1, observed price = 498, predicted = 489.17, so residual = 8.83 (in thousands of dollars).
Residuals represent unexplained variation, possibly due to omitted variables or random noise.
The Least Squares Principle
The regression coefficients are chosen to minimize the sum of squared residuals (errors): \( \text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \)
Least Squares Solution: The set of coefficients (b0, b1, b2, b3) that minimizes SSE provides the best fit to the data.
For this data, SSE = 19247 (from the ANOVA table).
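A defining property of the least-squares solution is that the residual vector is orthogonal to every column of the design matrix (including the intercept column, so the residuals sum to zero), and no other coefficient vector gives a smaller SSE. A small demonstration on the home data, again using NumPy as an assumed tool:

```python
import numpy as np

# Data from the table above.
y = np.array([498, 449, 435, 400, 379, 375, 356, 350, 340, 332, 298, 280, 260], dtype=float)
x1 = np.array([3800, 2600, 2600, 2250, 3300, 2750, 2200, 3000, 2300, 2600, 2300, 2000, 2200], dtype=float)
x2 = np.array([4, 4, 5, 4, 4, 3, 3, 4, 3, 4, 4, 4, 3], dtype=float)
x3 = np.array([3.5, 3.0, 3.5, 4.0, 3.0, 2.5, 2.5, 2.5, 2.0, 2.5, 2.0, 3.0, 2.5])
X = np.column_stack([np.ones_like(y), x1, x2, x3])

b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
sse = float(resid @ resid)

# Orthogonality: X^T e = 0 at the least-squares solution.
print(float(np.abs(X.T @ resid).max()) < 1e-4)  # True

# Nudging any coefficient away from the least-squares fit increases SSE.
b_alt = b.copy()
b_alt[1] += 0.01
sse_alt = float(np.sum((y - X @ b_alt) ** 2))
print(sse_alt > sse)  # True
```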
Interpretation and Application
Prediction: The regression equation can be used to predict asking price for any combination of square footage, bedrooms, and baths within the range of the data.
Coefficient Interpretation: Each coefficient represents the expected change in the response variable for a one-unit increase in the predictor, holding other variables constant.
Intercept Interpretation: The intercept is the predicted value when all predictors are zero; often not meaningful if zero is outside the data range.
Comparison to Simple Regression: Coefficients in multiple regression differ from those in simple regression due to the adjustment for other variables.
Summary Table: Key Concepts in Multiple Regression
Concept | Definition/Interpretation |
|---|---|
Regression Coefficient | Change in predicted Y for a one-unit increase in X, holding other variables constant |
Intercept | Predicted Y when all X's are zero (may not be meaningful) |
Residual | Observed Y minus predicted Y |
R-squared | Proportion of variance in Y explained by the model |
SSE | Sum of squared residuals; minimized in least squares regression |
Fitted Value | Predicted value from the regression equation |
Additional info: The ANOVA table and inference for individual coefficients will be discussed in a later module. The example demonstrates the process of fitting, interpreting, and applying a multiple regression model using real data and statistical software output.