17.1 Select variables and improve predictions with the elastic net
In today's preponderance of big data, we often have high-dimensional data, that is, data with a lot of variables, sometimes more variables than rows. This is sometimes called the curse of dimensionality, because having too many variables can lead to ill-fitting models. Some people, like Andrew Gelman, call it the blessing of dimensionality, because it adds more structure, but most machine learning experts go with "curse." Ten to fifteen years ago, dealing with so many variables in statistics would have been nearly impossible, because having more columns than rows violates certain properties of matrix algebra. Thankfully, we have modern methods that can deal with this. One in particular is the elastic net, invented by Hastie, Tibshirani, and Friedman out of Stanford University. It is a dynamic combination of ridge and lasso regression that does a fantastic job of dealing with high-dimensional data.

The elastic net is a method of solving a formula that looks crazy at first, so we'll take it in pieces. The goal is to minimize the formula. The first part of it is just standard least squares regression: the responses minus the predicted values, squared and summed. That's all that is. The problem is that when you have too many coefficients, they can be ill-fit and blow up, getting really out of control. That's where the penalty term comes in. The penalty term prevents overfitting by forcing the coefficients to stay small or near zero. It is made up of the ridge penalty, otherwise known as L2, and the lasso penalty, otherwise known as L1. They're very similar but have subtle differences. Again, without getting too deep into the notation: ridge regression takes a number of variables that are highly correlated and divides the coefficient among them, while lasso regression takes a number of variables that are highly correlated, gets rid of all but one of them, and gives the entire coefficient to the one that remains. Because of this, ridge regression is pretty good at smoothing out predictions and making sure you have a well-fitted predictor matrix, while lasso regression is great at variable selection; it's a great way of saying, "hey, these variables are important, those variables aren't." Having a combination of the two in one algorithm is a great step forward.
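Written out, the objective the video describes looks like the following (this is the parameterization used in the glmnet documentation for the least squares case; the binomial case replaces the squared error with the negative log-likelihood):

```latex
\min_{\beta_0,\,\beta}\;
\underbrace{\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2}_{\text{least squares fit}}
\;+\;
\lambda\underbrace{\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]}_{\text{ridge (L2) and lasso (L1) penalties}}
```

Here lambda controls the overall strength of the penalty, and alpha mixes the two pieces: alpha = 1 is a pure lasso and alpha = 0 is pure ridge, which is exactly the alpha argument discussed later in this video.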
The elastic net is implemented in R in a package called glmnet. It was written by the inventors of the elastic net, Friedman, Hastie, and Tibshirani, and the underlying code is in Fortran, so it really flies; it is incredibly fast. The data we'll be working with for this example is the acs data, so let's take a look at it to remind ourselves: it's a bunch of census-like information for people in New York. Let's load up the glmnet package.

glmnet is different from other model functions in that you can't feed it a formula; you need to feed it a predictor matrix and a response matrix. This is because it's all about speed, and using the formula interface to build up the matrices slows things down. Luckily, we have some helper functions that will let us do this. So we say require(useful); useful is just a package of little tools I've accumulated over the years to make things a little easier here and there.

The first thing we will do is build an x matrix, and we have a function for doing just that: acsX gets build.x. We can put in the formula right here as if this were an lm model. We are fitting income on number of bedrooms, number of children, number of people, number of rooms, number of units, vehicles, workers, own/rent, year built, electric bill, food stamps, heating fuel, insurance, and language. There are a lot of variables in there because this is a high-dimensional problem, and we're only using a few variables compared to what glmnet can handle: you can feed 50,000 variables into glmnet, and it will take its time, but it will churn through them. Another issue with glmnet is that you cannot just feed it a matrix that has categorical data in it. You need to have the dummy variables pre-built, and that is exactly what build.x will do.

So we build out this formula, tell it the data come from acs, and say contrasts = FALSE. We did a few things here, so let's think about it all. We built up all the predictors, which is standard. Then we took away the intercept, because glmnet automatically includes an intercept, so we don't need it. We also said contrasts = FALSE. Normally, when you have a formula with categorical data and, say, a variable that has five levels, you get four dummy variables. In this case, we don't have to worry about multicollinearity: first, because we got rid of the intercept, and second, because glmnet gets rid of variables and shrinks things down. Multicollinearity is no longer an issue, since we're not dealing with standard linear algebra.

We run this and look at the class of the object: it is a matrix, like we wanted. Checking its dimensions, there are 22 thousand rows by 44 columns. Since this might be a little too much to view on the screen, let's just call topleft, another handy function in useful, on acsX and ask for the first six columns; it automatically gives the first five rows since we didn't specify. We see we have the number of bedrooms, number of children, number of people, and number of rooms; then we get to our categorical variable, number of units, where mobile home is an option and single detached is an option, and there are more. It automatically created all the dummy variables for us, so we're already ahead of the game. We can also look at the top right of acsX: over here we have things such as the language (Spanish, European, Pacific, other) and insurance, which is a numeric variable. So this is a very well-built-out matrix.

In a similar fashion, we'll build the y matrix, which takes almost exactly the same notation as build.x, so we'll just copy it and make some changes. First, we don't need the contrasts argument anymore; we make it build.y instead; and we name the variable correctly. We run this and check what it looks like: it is a bunch of TRUEs and FALSEs, because, if you recall, income was a binary variable we created that indicates whether a family has more than 150 thousand dollars in earnings. We can also look at its tail and see where the TRUEs are.
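A minimal sketch of those two calls, assuming column names along the lines of the accompanying book's acs data (NumBedrooms, FamilyIncome, and so on); treat the exact names, and the construction of the Income indicator, as illustrative:

```r
library(glmnet)
library(useful)

# Assumed: acs is already loaded and Income is the binary indicator
# described in the video, e.g. acs$Income <- with(acs, FamilyIncome >= 150000)

# Predictor matrix: "- 1" drops the intercept (glmnet adds its own) and
# contrasts = FALSE keeps every level of each factor as its own dummy column
acsX <- build.x(Income ~ NumBedrooms + NumChildren + NumPeople + NumRooms +
                    NumUnits + NumVehicles + NumWorkers + OwnRent + YearBuilt +
                    ElectricBill + FoodStamp + HeatingFuel + Insurance +
                    Language - 1,
                data = acs, contrasts = FALSE)

class(acsX)            # a matrix
dim(acsX)              # about 22,000 rows by 44 columns

topleft(acsX, c = 6)   # first 5 rows, first 6 columns
topright(acsX, c = 6)  # first 5 rows, last 6 columns

# Response: a logical vector of TRUEs and FALSEs
acsY <- build.y(Income ~ NumBedrooms + NumChildren + NumPeople + NumRooms +
                    NumUnits + NumVehicles + NumWorkers + OwnRent + YearBuilt +
                    ElectricBill + FoodStamp + HeatingFuel + Insurance +
                    Language - 1,
                data = acs)

head(acsY)
tail(acsY)
```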
An issue with the elastic net is that you have to choose the lambda parameter, which is the amount of penalty you apply. If you have more penalty, you shrink everything down more; if you have less penalty, you don't shrink as much. The best way of choosing this parameter is through cross-validation, and fortunately, glmnet has that built in.

So we will do set.seed, because this is a random process due to the way the k folds are assigned, and give it the seed 1863561. Now we can call the cv.glmnet function: acsCV1 gets cv.glmnet with x equal to acsX, y equal to acsY, and family equal to binomial, because we're doing a logistic regression here. glmnet, just as its name implies, can do the whole family of generalized linear models; that's what the "glm" stands for, and the "net" stands for elastic net. We say to do five-fold cross-validation, because that will speed things up. We run this, and we wait.

When it's done, we can check out what it found as the optimal lambda. It returns two optimal lambdas (and it also has lambda, a vector of all the lambdas it tried). lambda.min is the lambda that minimizes the cross-validated error; that is 0.00525 and onwards. Running acsCV1 again and looking at lambda.1se gives us the lambda that doesn't quite minimize the error but is the largest lambda whose error is within one standard error of the minimum. The theory goes that this will be a simpler model, and that makes things a little better: you always prefer simple over complex.

So why don't we look at a plot of this information to help us figure it out? Each of the points along the x-axis is a value of lambda (actually log lambda). Along the top it tells you, for that value of lambda, how many variables were included. Each of the points represents the deviance, not the mean squared error, because since we're doing logistic regression the error measure is the deviance, and the error bars around each point show the uncertainty. Over here, at roughly 37 variables, is where the minimum deviance occurs. We could pick that model, or we could go with the most penalized lambda that is still within one standard error of the minimum point, and that's right here, with somewhere around 19 variables, it looks like.

So we can go ahead and see what our coefficients look like. If we call coef on acsCV1 and tell it to use the lambda that is lambda.1se, it gives us back all the coefficients. The ones that have numbers are coefficients that were kept; the ones with just a dot are coefficients that got thrown out and are ignored. Now, it might seem odd at first that for something such as number of units, which had three levels, one of them got included and the others didn't. That actually makes a lot of sense: what the lasso does is find variables that are highly correlated and toss them out, and even though the units were one variable, they became three dummy variables, so two of them got tossed out.
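A sketch of the cross-validated fit and the inspection steps just described, following the calls named in the video:

```r
set.seed(1863561)  # fold assignment is random, so fix the seed for reproducibility

# Five-fold cross-validated logistic elastic net (alpha defaults to 1, the lasso)
acsCV1 <- cv.glmnet(x = acsX, y = acsY, family = "binomial", nfolds = 5)

acsCV1$lambda.min  # lambda minimizing the cross-validated deviance (~0.00525)
acsCV1$lambda.1se  # largest lambda within one standard error of that minimum

# Cross-validation curve: log(lambda) vs. binomial deviance,
# with the number of included variables along the top axis
plot(acsCV1)

# Coefficients at the simpler lambda; dots mark variables dropped by the lasso
coef(acsCV1, s = "lambda.1se")
```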
Another great plot we can see is the coefficient path as the fitting took place. We do plot of acsCV1$glmnet.fit, and then we can say xvar equals lambda, so we can watch the coefficients as lambda changes. It helps if you specify the correct slot; I tried doing auto-complete, and with a generic function like plot, the auto-complete doesn't always work so well. So we try this again, and we get this beautiful-looking plot. What it shows is which variables are included as lambda changes. Starting here with all variables included, as you move toward a less negative log lambda they all start going toward zero and getting tossed out; once they hit zero, they're out of the model for good. At any given point you can draw a vertical line indicating the lambda you chose, and where that line intersects the coefficient paths gives you the value of each coefficient at that lambda.

So why don't we actually go ahead and draw those lines in? We say abline with v equal to the log of a vector of two values, lambda.min and lambda.1se, close the vector, and make them line type 2. We forgot to close off the log, so we escape and run it again. Now if we zoom in, we can see the two chosen lambdas: the minimum and the 1se.

Let's go take a look at some of the arguments of glmnet, this one in particular: alpha. Alpha controls whether it's a lasso regression, a ridge regression, or some combination of the two. If you set alpha to 1, which is the default, it's completely a lasso, so you're really doing variable selection. If you set alpha to 0, it's completely ridge regression. If you set it somewhere in between, it's a combination of the two. There have been studies showing that an alpha somewhere between 0.7 and 1 will usually get you the best results from cross-validation.

To illustrate the difference, we will go ahead and fit this model again, this time setting alpha to 0. I'll copy the earlier call, come down here, call it acsCV2, and say alpha equals 0, meaning it's completely a ridge regression. When it is done, we draw that coefficient path plot again; we'll just steal the code we wrote earlier, making sure this time we're plotting the second model. Notice the change: instead of the lines coming in from the top and descending downwards, they arch right in and then head toward zero. That's because ridge coefficients never quite go to zero; they hover around zero and never get singled out, they just get squished down. That's the difference in this plot, and it really makes a marked difference.

While glmnet offers cross-validation for choosing the optimal lambda, it does not offer any sort of cross-validation for choosing an optimal alpha. To do that, you have to build your own cross-validation infrastructure, preferably with parallelization to make things faster. To see this done, go to the regularization and shrinkage chapter of the book. The elastic net is probably one of the most exciting tools to come out in statistics lately, and it really facilitates working in the age of big data. There are alternatives to glmnet, such as a piece of software from Dave Amad that does the same thing but does the computations on the GPU to speed things up. There is a bit of an arms race going on to see who can come up with the fastest algorithm, because when you're dealing with lots of data, speed really matters. Thankfully, the ridge, the lasso, and the elastic net are all very fast to compute.
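A sketch of the path plots and the ridge comparison; the second seed is arbitrary (it only fixes the fold assignment) and is not from the video:

```r
# Coefficient paths of the underlying lasso fit, against log(lambda)
plot(acsCV1$glmnet.fit, xvar = "lambda")

# Dashed vertical lines at the two cross-validated lambda choices
abline(v = log(c(acsCV1$lambda.min, acsCV1$lambda.1se)), lty = 2)

# Refit with alpha = 0: a pure ridge regression for comparison
set.seed(71623)  # illustrative seed, chosen arbitrarily
acsCV2 <- cv.glmnet(x = acsX, y = acsY, family = "binomial",
                    nfolds = 5, alpha = 0)

# Ridge paths arch in and hover near zero but never quite reach it
plot(acsCV2$glmnet.fit, xvar = "lambda")
abline(v = log(c(acsCV2$lambda.min, acsCV2$lambda.1se)), lty = 2)
```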