16.8 Compare models - Video Tutorials & Practice Problems
Video duration: 7m
When building a model, you rarely build just one. You build many models en route to the one you decide upon. And when you judge a model, you're not really judging it against itself; you're comparing it to other models. So let's go back to the housing data and build a few models that we can compare. The first one will be our old model: value per square foot regressed onto units, plus square feet, plus Boro. Then we will build four more, so I'll copy and paste. For the second model I'll do units times square feet. For the third model I'll do square feet times Boro, plus Class. For the fourth model I'll do square feet times Boro, plus square feet times Class. And for the last model I'll do Boro plus Class. We now run each of these.

To see them all together, we use the multiplot function from the coefplot package, setting pointSize to two just to shrink the points, since we will be plotting so many of them. Now we see that the coefficients across the models tend to come together. In all the models, being in Manhattan had the biggest effect, and being in Queens had a middle effect. House one has the fewest variables, so wherever it lacks a coefficient it is simply absent from the plot.

While this plot gives us a sense of which variables are significant in which model, we really want a numeric comparison. For that we'll use the anova. While I generally do not like the anova for comparing the means of different groups, it does a good job of comparing models. So we call the anova function and feed it each of the models. This comes back with each of the fits and the residual sum of squares (RSS), which is a measure of our error. Ideally, you want the lowest RSS, and in this case that is model four. However, one issue with RSS is that whenever you add extra variables, RSS gets lower, and that can lead to overfitting. Overfitting your data can be pretty bad; it can lead to poor predictions later on.

Fortunately, there are a few metrics that take complexity into account and penalize over-complexity. Two of the most popular are AIC and BIC. AIC is the Akaike Information Criterion; I might be pronouncing that wrong, but it's pretty close. The formula is AIC = -2(log-likelihood) + 2p, where the log-likelihood is a measure of how good the fit is and p is the number of coefficients. The smaller the AIC, the better your model. So by putting in more coefficients you might improve the fit, but you're adding a penalty term, which is a way of saying the complexity is getting too high. BIC is very similar, but instead of multiplying the number of coefficients by a penalty of two, it multiplies them by the natural log of the number of rows: BIC = -2(log-likelihood) + p * log(n). Just a different way of penalizing.

In R, we compute it with the AIC function, in all capital letters. When we run this, the model with the lowest AIC is house four. We do the same thing with BIC, and once again the model with the lowest BIC is model four. So the anova, AIC, and BIC all agree that, right now, model four is the best.
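Putting those steps into code, here is a minimal sketch. It assumes a data frame named housing with columns ValuePerSqFt, Units, SqFt, Boro, and Class (the names used on screen), and the five formulas are my transcription of the models as spoken:

    # Assumes `housing` has columns ValuePerSqFt, Units, SqFt, Boro, Class.
    house1 <- lm(ValuePerSqFt ~ Units + SqFt + Boro, data = housing)
    house2 <- lm(ValuePerSqFt ~ Units * SqFt + Boro, data = housing)
    house3 <- lm(ValuePerSqFt ~ SqFt * Boro + Class, data = housing)
    house4 <- lm(ValuePerSqFt ~ SqFt * Boro + SqFt * Class, data = housing)
    house5 <- lm(ValuePerSqFt ~ Boro + Class, data = housing)

    # Plot all five sets of coefficient estimates on one chart;
    # pointSize = 2 shrinks the points since there are so many of them.
    library(coefplot)
    multiplot(house1, house2, house3, house4, house5, pointSize = 2)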
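And the numeric comparisons; anova, AIC, and BIC each accept several fitted models at once:

    # Residual sum of squares for each fit; lower is better, but RSS
    # always shrinks as variables are added.
    anova(house1, house2, house3, house4, house5)

    # AIC and BIC penalize complexity; the smallest value marks the
    # preferred model.
    AIC(house1, house2, house3, house4, house5)
    BIC(house1, house2, house3, house4, house5)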
When dealing with GLMs, you don't use RSS anymore; you use something called deviance, which is another measure of error. To illustrate this, we will create a new variable in the housing data called high value: a true or false for whether the value per square foot is at least 150. So we assign housing, dollar sign, high value to be housing, dollar sign, value per square foot greater than or equal to 150, and this returns a series of TRUEs and FALSEs.

The models I'm fitting for these GLMs will have a different response but the same predictors as the previous models, so I'm going to copy and paste them to save a little typing. A number of changes need to be made. We won't call them house one through five, because we don't want to overwrite the previous models; we'll call them high one through five. Also, the response should now be high value, not value per square foot. And instead of using lm, we are using glm. (glm is capable of running an lm: the family of generalized linear models includes the standard linear model. If you want glm to do that, you just set the family to gaussian instead of binomial. That said, you already have the lm function; you might as well stick with it.) For this logistic regression, we need to set the family to binomial, and optionally, it's a good habit to set the link to logit. We copy that into each of the models, and now we can run them. Both the fitting and the comparison steps are sketched in code at the end of this section.

To compare them, we call anova on the models, and here it reports back the deviance. The general rule of thumb for deviance is that for every additional coefficient, you want the deviance to drop by at least two, and that counts the dummy variables created from a factor variable: if, for instance, Boro creates four dummy variables, you want the deviance to drop by at least eight. This is called the drop-in-deviance test. Again, the lower the deviance, the better, and we see that model four has the lowest deviance. Now, this might just be because model four is the most complicated, so we will also check AIC and BIC to control for complexity. We copy the earlier line, since it's almost the same, to get AIC, and here, once again, the fourth model appears to be the best. For BIC, in this situation, it is actually the fifth model that is lowest. This is going to happen sometimes: because of their different complexity penalties, AIC, BIC, and the anova will occasionally disagree. That is why you have multiple tests. You can go with whichever measurement you like best, or go with the consensus; if two of them agree on one model, maybe the third one is just wrong. You really have to have a feel for your data.

Comparing models is an important part of the model-building process, and it's a good idea to have numeric summaries of which one is best. Some of the most popular are the anova, AIC, and BIC.
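As promised, here is a minimal sketch of the GLM fitting, under the same assumption about the housing data frame; the 150 cutoff comes straight from the video:

    # Binary response: TRUE when value per square foot is at least 150.
    housing$HighValue <- housing$ValuePerSqFt >= 150

    # Same predictors as before, but fit with glm and a binomial family;
    # link = "logit" is the default, written out here as a good habit.
    high1 <- glm(HighValue ~ Units + SqFt + Boro,
                 data = housing, family = binomial(link = "logit"))
    high2 <- glm(HighValue ~ Units * SqFt + Boro,
                 data = housing, family = binomial(link = "logit"))
    high3 <- glm(HighValue ~ SqFt * Boro + Class,
                 data = housing, family = binomial(link = "logit"))
    high4 <- glm(HighValue ~ SqFt * Boro + SqFt * Class,
                 data = housing, family = binomial(link = "logit"))
    high5 <- glm(HighValue ~ Boro + Class,
                 data = housing, family = binomial(link = "logit"))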
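And the comparisons for the GLMs:

    # For GLMs, anova reports deviance instead of RSS; rule of thumb:
    # each extra coefficient should drop the deviance by at least two.
    anova(high1, high2, high3, high4, high5)

    # AIC and BIC work on glm objects too, penalizing complexity.
    AIC(high1, high2, high3, high4, high5)
    BIC(high1, high2, high3, high4, high5)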