16.9 Judge accuracy using cross-validation
A popular form of model assessment is k-fold cross-validation. This is where you take your data and break it into k folds, let's say 10 folds. You fit a model on nine of those folds, then make predictions on the tenth fold, and you keep track of the errors, how well it did. Then you fit the model on a different nine folds and make predictions on the remaining fold, and again keep track of the errors. You do this again and again, until each observation has been in the testing fold one time and in the training folds nine times.

Doing this in R is fairly simple, using the cv.glm function, which is available in the boot package. So we load that. This function, while it works for linear models, requires that we fit those models with the glm function rather than the lm function. So we will go ahead and refit our housing models using this function. The only difference is the way we call the function; otherwise, the results will be the same. We go ahead and fit the model: houseG1 gets glm, value per square foot on units plus square feet plus Boro, data equals housing, and we're going to specify the family. Now by default glm fits a linear model, a Gaussian model, but I'm going to include it just for completeness.

After running that, we are ready to go ahead and do the cross-validation. So I will say houseCV1 gets cv.glm. We feed it the data set, we feed it the model that we fit, and we tell it K equals five, for five folds. There is a lot of research that goes into determining the best number of folds. Some say 10 folds, some say three, some say leave-one-out cross-validation, which is where you train on all but one data point, then predict the remaining data point, and do this again for every data point in turn. This is the most accurate, but it is also the most computationally intensive. There is research to suggest that five is a good number, because it's enough folds to get accurate results, but not so many that you're going to bog down the computer.

We run that, and we can check the results. In particular, we care about what it calls delta. This gives us the cross-validated error, and it gives us the adjusted cross-validated error. Adjusted means they have taken the error that was calculated and compensated for the fact that this wasn't a leave-one-out cross-validation; it approximates the error we would have seen had we done leave-one-out. So, since it reports a number similar to the leave-one-out result, we don't need to spend all those computational resources actually doing it.

The error being reported here is the mean squared error. Let's take a look at that. It is simply the average of the squared residuals: for each observation in the testing set, take the predicted value, subtract the actual value, square the difference, and average over all n observations, so MSE = (1/n) Σᵢ (ŷᵢ − yᵢ)². Hence, mean squared error.
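A minimal sketch of those two steps, assuming the data live in a data.frame called housing with columns named ValuePerSqFt, Units, SqFt, and Boro (the names are inferred from the narration, so adjust them to your data):

```r
library(boot)  # provides cv.glm

# glm with the (default) gaussian family fits the same model lm would;
# the family is spelled out here just for completeness, as narrated
houseG1 <- glm(ValuePerSqFt ~ Units + SqFt + Boro,
               data = housing,
               family = gaussian(link = "identity"))

# five-fold cross-validation of the fitted model
houseCV1 <- cv.glm(data = housing, glmfit = houseG1, K = 5)

# delta[1] is the raw cross-validated error; delta[2] is adjusted to
# compensate for not doing leave-one-out cross-validation
houseCV1$delta
```

Because no cost function is supplied, cv.glm falls back to its default of average squared error, which is why delta is reported as a mean squared error here.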
Using a metric like cross-validation is only relevant if you're comparing one model to another model. So let's go back and fit all of our housing models and do the comparison the best way. So houseG2 gets glm, value per square foot on units times square feet plus Boro, data equals housing. Then houseG3 gets glm; notice here we didn't specify family, because glm uses the Gaussian family by default. For the third model we say value per square foot on units plus square feet times Boro, plus class. For the fourth model we say houseG4 gets glm, value per square foot on units plus square feet times Boro, plus square feet times class. And lastly, for model five, houseG5 gets glm, value per square foot on Boro plus class.

Now that we have fit all of our models, we can run cross-validation on all of them. So houseCV2 gets cv.glm, housing, houseG2, K equals five. I'm going to copy and paste this, since it's going to be very similar each time; here we'll use model five, model four, model three, making sure we rename the objects appropriately. And we can run each of these (all five fits and cross-validations are sketched in code at the end of this section).

Now that we have them, we want to put them all into a data frame to check them out. We will do cvResults gets as.data.frame, and we are going to rbind all of these together: houseCV1 dollar delta, because that's where the information is stored, houseCV2 dollar delta, houseCV3 dollar delta, houseCV4 dollar delta, and lastly houseCV5 dollar delta. We run this line, and then we go ahead and give this good names. We say names of cvResults gets: the first column will be Error, the second column Adjusted Error. And lastly, we need to make sure we know which model each row came from, so we create a new variable: cvResults dollar Model gets sprintf, "houseG percent s", one colon five. That inserts the vector one through five into this variable.

So now we can check out the results. We can see that the fourth model seems to be the best. Remember, the lower the MSE the better, and here model four is lower than all the rest. We want to see how these results agree with what ANOVA and AIC tell us, so let's go ahead and run ANOVA for each of the models and AIC for each of the models. We will say cvANOVA gets anova of houseG1, houseG2, houseG3, houseG4, and houseG5. And we will do a similar thing for AIC: cvAIC gets AIC, and I'm just going to copy and paste this, since it's very similar.

What we will do now is put all of these into the data frame with the cross-validation results. So we say cvResults dollar ANOVA gets cvANOVA, and we want just one column, the Resid. Dev column. But this column has a space in the name, so after the dollar sign we need to enclose the name in backticks: Resid. Dev, with the space. Then for the AIC, we do cvResults dollar AIC gets cvAIC dollar AIC, since the AIC column is what we want (this comparison is also sketched in code at the end of this section).

Now if we look at the results, we can see all the models in one place. Remember, for all of these, the smaller, the better. In this case, we can see that for the ANOVA the fourth model is the lowest, and for AIC the fourth model is the lowest, all agreeing with cross-validation having the fourth model as the lowest.

The cv.glm function requires us to fit our models using glm. It's a very powerful tool, but it is a bit restrictive. For a more general framework for building a cross-validation yourself, see the Model Diagnostics chapter of the book. Modern statisticians and data scientists really like using cross-validation. Predictive power, which is what cross-validation measures, is a great way to assess model quality. And fortunately, today's computers have enough horsepower to run these k-fold cross-validations again and again without missing a beat.
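As referenced above, here is a sketch of the remaining fits and their cross-validations, under the same assumed column names plus a Class column inferred from the narration:

```r
# remaining model specifications, as narrated; glm defaults to gaussian
houseG2 <- glm(ValuePerSqFt ~ Units * SqFt + Boro, data = housing)
houseG3 <- glm(ValuePerSqFt ~ Units + SqFt * Boro + Class, data = housing)
houseG4 <- glm(ValuePerSqFt ~ Units + SqFt * Boro + SqFt * Class, data = housing)
houseG5 <- glm(ValuePerSqFt ~ Boro + Class, data = housing)

# five-fold cross-validation for each model
houseCV2 <- cv.glm(housing, houseG2, K = 5)
houseCV3 <- cv.glm(housing, houseG3, K = 5)
houseCV4 <- cv.glm(housing, houseG4, K = 5)
houseCV5 <- cv.glm(housing, houseG5, K = 5)

# stack each model's two-element delta vector into one data.frame,
# one row per model
cvResults <- as.data.frame(rbind(houseCV1$delta, houseCV2$delta,
                                 houseCV3$delta, houseCV4$delta,
                                 houseCV5$delta))
names(cvResults) <- c("Error", "Adjusted.Error")
cvResults$Model <- sprintf("houseG%s", 1:5)  # label each row by model
cvResults
```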
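And a sketch of the ANOVA and AIC comparison appended to that same data frame; anova and AIC are base R functions, and the backticked column name Resid. Dev comes straight from the narration:

```r
# residual deviance and AIC for all five models at once
cvANOVA <- anova(houseG1, houseG2, houseG3, houseG4, houseG5)
cvAIC   <- AIC(houseG1, houseG2, houseG3, houseG4, houseG5)

# `Resid. Dev` contains a space, so it must be enclosed in backticks
cvResults$ANOVA <- cvANOVA$`Resid. Dev`
cvResults$AIC   <- cvAIC$AIC

# smaller is better for every column
cvResults
```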