Multiple regression is a powerful statistical method used to model the relationship between one dependent variable and multiple independent variables. Unlike simple linear regression, which examines the relationship between two variables, multiple regression allows for the analysis of how several factors simultaneously influence an outcome. For example, the monthly rent of an apartment can be affected by its floor area, the age of the building, and the apartment number. In this context, the monthly rent is the dependent variable y, while floor area, age, and apartment number serve as independent variables, often denoted as x₁, x₂, and x₃ respectively.
To construct a multiple regression model, data for all variables must be organized systematically, typically with each independent variable in its own column. Tools like Excel’s Analysis ToolPak add-in simplify the process by automating the calculation of regression coefficients. These coefficients quantify the impact of each independent variable on the dependent variable. For instance, if the coefficient for floor area (x₁) is 1.675, then each additional unit of floor area raises the predicted monthly rent by approximately 1.675 units, holding the other variables constant. Similarly, a negative coefficient for the age of the building (x₂) indicates that older buildings tend to have lower rents, while a positive coefficient for apartment number (x₃) suggests a slight increase in rent with higher apartment numbers. The regression equation takes the form:
\[ y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 \]
where b₀ is the y-intercept (baseline rent when all independent variables are zero), and b₁, b₂, and b₃ are the coefficients for each independent variable.
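The fit described above can be sketched in a few lines of NumPy using ordinary least squares. The rent figures below are hypothetical, invented purely to illustrate the mechanics; the coefficients they produce are not the ones quoted in the text.

```python
import numpy as np

# Hypothetical data: one row per apartment, columns are
# floor area (m^2), building age (years), apartment number.
X_raw = np.array([
    [40, 10,  3],
    [55,  5, 12],
    [30, 20,  1],
    [70,  2,  8],
    [45, 15,  5],
    [60,  8, 10],
], dtype=float)
y = np.array([620, 910, 410, 1150, 640, 960], dtype=float)  # monthly rent

# Prepend a column of ones so the intercept b0 is estimated as well.
X = np.column_stack([np.ones(len(y)), X_raw])

# Solve for [b0, b1, b2, b3] by ordinary least squares.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("coefficients:", b)

# Predicted rent for a 50 m^2, 7-year-old building, apartment no. 6:
y_hat = b @ np.array([1, 50, 7, 6])
print("predicted rent:", y_hat)
```

The same coefficients would come out of Excel's regression output; least squares is simply what that tool computes behind the scenes.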
Evaluating the quality of a multiple regression model involves examining the coefficient of determination, denoted as R². This statistic measures the proportion of variance in the dependent variable explained by the independent variables collectively. An R² value of 0.797, for example, indicates that approximately 79.7% of the variation in monthly rent can be explained by the combined effects of floor area, building age, and apartment number. However, a limitation of R² is that it does not penalize the addition of irrelevant independent variables: R² can never decrease when a variable is added, so it rises (or at best stays flat) even when the new variable contributes nothing meaningful to the model.
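R² is computed as one minus the ratio of unexplained to total variation. A minimal sketch, using hypothetical actual and predicted rents (not the data behind the 0.797 figure in the text):

```python
import numpy as np

# Hypothetical actual rents and the model's predictions for them.
y     = np.array([620, 910, 410, 1150, 640, 960], dtype=float)
y_hat = np.array([650, 880, 430, 1120, 660, 950], dtype=float)

ss_res = np.sum((y - y_hat) ** 2)        # residual (unexplained) variation
ss_tot = np.sum((y - y.mean()) ** 2)     # total variation around the mean
r2 = 1 - ss_res / ss_tot
print("R^2 =", round(r2, 3))
```

A value near 1 means the predictions track the actual values closely; a value near 0 means the model does little better than always predicting the mean rent.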
To address this, the adjusted R² statistic is used. Adjusted R² modifies the R² value by accounting for the number of independent variables in the model, providing a more accurate measure of model quality. It is always less than or equal to R², and for very poor models it can even be negative. For example, an adjusted R² of 0.763 suggests that after considering the number of predictors, about 76.3% of the variation in rent is explained by the model, reflecting a more realistic assessment of its explanatory power.
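The standard formula is adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of independent variables. A small sketch: the text does not state the sample size, but with the quoted R² = 0.797, p = 3 predictors, and an assumed n = 22 observations, the formula reproduces the quoted 0.763.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    for n observations and p independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n = 22 is a hypothetical sample size, not given in the text.
print(round(adjusted_r2(0.797, n=22, p=3), 3))  # 0.763
```

Because the penalty factor (n − 1)/(n − p − 1) exceeds 1 whenever p ≥ 1, adding predictors that do not reduce residual error pulls adjusted R² down rather than up.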
In summary, multiple regression enables the prediction of a dependent variable based on several independent variables, with coefficients indicating the strength and direction of each relationship. The model’s effectiveness is evaluated using R² and adjusted R², ensuring that the model is both accurate and parsimonious. This approach is essential for analyzing complex real-world data where multiple factors influence outcomes simultaneously.