16.5 Fit Poisson regression - Video Tutorials & Practice Problems
Another glm family, and my particular favorite, is the Poisson regression. It models count data, such as number of children, number of accidents, or number of ratings. It is based on the Poisson distribution, and the regression is based on this formula: y-i, that is, an individual response, is distributed as a Poisson random variable with mean theta-i. That means each observation has its own mean. That mean is e to the x-i beta. That's our familiar x-i times beta, the linear predictor: x-i is this observation's predictors, and beta are the corresponding coefficients. So you take the dot product between x-i and beta, raise e to that power, and put that into a Poisson distribution as its mean, and that's how you get your regression. Now again, this is fit using iteratively reweighted least squares, and again, you do not need to worry about that. The glm function will take care of all of it.

We're going to use the acs data once again. We're going to look at the number of children in a household. So, let's plot this. We're gonna make just one aesthetic, and that's gonna be the number of children. We make it a histogram, and we'll give it a binwidth of one. We can see here we get a nice histogram. It's not quite Poisson, but it's going to be close enough for our modeling.

Fitting this model is done just like fitting a logistic regression, just with a different family. So we will say children1 gets glm of NumChildren, and we are regressing that on FamilyIncome, FamilyType, and OwnRent. The family this time is gonna be poisson, and the link will be log. The log link means the model is linear on the log scale, so exponentiating is how you transform it back to the original scale. You can run this and check out the summary. We get all the usual information here, particularly the coefficients and the deviance, and something called a dispersion parameter. That's gonna be very important and we'll look at it in a little bit, but first, let's look at the coefplot of this model.
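The fitting step described above can be sketched in R. Since the acs data is not reproduced here, this sketch simulates a stand-in data frame; the column names `income`, `own`, and `kids`, and the coefficient values in `beta`, are hypothetical and chosen only for illustration. It builds counts exactly as the formula says (mean equal to e raised to the dot product of predictors and coefficients) and then recovers the coefficients with `glm`.

```r
# Simulated stand-in for the acs data; all names and effect sizes are hypothetical.
set.seed(42)
n <- 2000
sim <- data.frame(
  income = rnorm(n),  # standardized income
  own    = factor(sample(c("Mortgage", "Outright", "Rented"), n, replace = TRUE))
)

# Each observation gets its own mean: theta_i = exp(x_i . beta)
beta  <- c(0.2, -0.1, -0.5, 0.1)  # intercept, income, ownOutright, ownRented
X     <- model.matrix(~ income + own, data = sim)
theta <- exp(as.vector(X %*% beta))
sim$kids <- rpois(n, lambda = theta)

# Same pattern as a logistic regression, just with a different family
fit1 <- glm(kids ~ income + own, data = sim, family = poisson(link = "log"))
summary(fit1)

# With the acs data loaded, the call described in the video would look like:
# children1 <- glm(NumChildren ~ FamilyIncome + FamilyType + OwnRent,
#                  data = acs, family = poisson(link = "log"))
```

Because the coefficients live on the log scale, `exp(coef(fit1))` gives multiplicative effects on the expected count.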
The coefplot just gives us a display showing the relative values of the coefficients, and it's just easier to look at than the numbers. We can see here that owning a house outright, that means not renting and not having a mortgage, means you are likely to have fewer children.

Now back to that dispersion parameter. A Poisson distribution is assumed to have the same mean and variance. That means there's only one parameter to estimate, just the mean, and the spread should be just about the same as the mean. However, in reality, especially in these regressions, the data is far more spread out than would be indicated by the mean. So we need to check for overdispersion. Overdispersion is defined as such: it is the sum of the squares of the standardized residuals divided by the number of rows minus the number of coefficients. Now, the standardized residuals are simply the actual responses minus the predicted responses, divided by the standard deviation of the predicted responses.

So going back to our model, we can do some quick math to test this. We'll say z gets acs NumChildren, that is the true response, minus children1 fitted values, that is the predicted response, and we'll divide that by the square root of children1 fitted values. Now again, the reason we're dividing by the square root of the fitted values is that the fitted values are taken to be the mean, and in a Poisson regression the mean is supposed to be the same as the variance. So that's how we get these standardized residuals. We run this, and that will build us a nice vector. Now we take the sum of the squared standardized residuals and divide it by the degrees of freedom, which in our formula was the number of rows minus the number of parameters. If this number is greater than two, it's considered overdispersed. So we do the sum of the squares of the standardized residuals divided by children1 df dot residual, the degrees of freedom, which is n minus p. Now we see here, the overdispersion factor is 1.46.
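The overdispersion calculation above can be sketched as follows. This is a minimal illustration on simulated data, not the acs data: the counts are drawn from a negative binomial distribution (an assumption made here purely to produce extra spread), and the names `x`, `y`, `fit`, and `od_factor` are hypothetical.

```r
# Simulated counts with more spread than a Poisson allows (illustrative assumption)
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- rnbinom(n, size = 2, mu = exp(0.2 + 0.3 * x))

fit <- glm(y ~ x, family = poisson(link = "log"))

# Standardized residuals: (actual - predicted) / sqrt(predicted),
# because a Poisson variance is assumed equal to its mean
z <- (y - fit$fitted.values) / sqrt(fit$fitted.values)

# Overdispersion factor: sum of squared residuals over n - p
od_factor <- sum(z^2) / fit$df.residual
od_factor
```

These `z` values are what R calls Pearson residuals, so `residuals(fit, type = "pearson")` should give the same vector without the manual arithmetic.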
The overdispersion factor is not quite over two, but it is greater than one, and one is considered no overdispersion. Since it is greater than one, we want to do one more check to see if we should account for overdispersion. What we're gonna do now is a quick chi-squared test, which will tell us, "Hey, should we consider these overdispersed?" So we call pchisq, the cumulative probability function for the chi-squared distribution, and in there we're going to put the sum of the squared standardized residuals. We're going to say that the degrees of freedom is children1 dollar df dot residual. We run this and we get a value of one. pchisq returns the lower-tail probability, so a value of one means the statistic sits far out in the upper tail, and this suggests that the data is overdispersed.

We're gonna play it safe. Even though our overdispersion factor wasn't greater than two, the chi-squared test indicates that there is overdispersion, so we will fit the model again, taking overdispersion into account. To do this, we again use glm, and we fit the same formula, so I will go grab that formula to make sure I get the same exact one. But this time, instead of using the poisson family, we will use the quasipoisson family. Again, the link will still be log. Now the quasi-Poisson family keeps the same model for the mean, but instead of fixing the dispersion parameter at one, it estimates it from the data and scales the variance by it. Don't worry about the details, you just need to know to use quasipoisson, and that takes care of overdispersion.

We run this, and let's check out the results compared to the standard model. To do that we will use multiplot from the coefplot package, and we'll feed it children1 and children2. We zoom in, and we see that the point estimates of the coefficients are still the same, as they're expected to be. Taking overdispersion into account increases the uncertainty, so it's a little hard to see, but all the confidence intervals are a little wider now. Over here it's easy to see, and you can sort of see it on this one and on this one.
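The chi-squared check and the quasi-Poisson refit can be sketched together. As before, this uses simulated overdispersed counts rather than the acs data, and every object name here (`pois_fit`, `quasi_fit`, `p_lower`, and so on) is hypothetical. It shows the two things the comparison plot is meant to reveal: identical point estimates and wider standard errors.

```r
# Simulated overdispersed counts (illustrative assumption, not the acs data)
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- rnbinom(n, size = 2, mu = exp(0.2 + 0.3 * x))

pois_fit <- glm(y ~ x, family = poisson(link = "log"))
z <- (y - pois_fit$fitted.values) / sqrt(pois_fit$fitted.values)

# Lower-tail chi-squared probability; a value near 1 means the statistic
# is far out in the upper tail, which points to overdispersion
p_lower <- pchisq(sum(z^2), df = pois_fit$df.residual)

# Same formula, quasipoisson family: dispersion is estimated, not fixed at 1
quasi_fit <- glm(y ~ x, family = quasipoisson(link = "log"))

# Point estimates are unchanged
same_coefs <- all.equal(coef(pois_fit), coef(quasi_fit))

# Standard errors grow by a factor of sqrt(estimated dispersion)
se_pois    <- summary(pois_fit)$coefficients[, "Std. Error"]
se_quasi   <- summary(quasi_fit)$coefficients[, "Std. Error"]
dispersion <- summary(quasi_fit)$dispersion
se_quasi / se_pois

# With coefplot installed, the visual comparison from the video would be:
# coefplot::multiplot(pois_fit, quasi_fit)
```

The estimated dispersion reported by `summary(quasi_fit)` is the same overdispersion factor computed by hand earlier, up to numerical tolerance.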
Overdispersion says, "Listen, things are a little more spread out than we intended, so we're a little less certain about them." Poisson regression has many applications, from the number of admissions to a hospital to the number of car accidents for an insurance company. The important things to remember about Poisson regression are that your data should be counts, and that you should check for overdispersion, because in reality data is quite often overdispersed.