Skip to main content

My Channels

16: Linear Models

16.6 Analyze survival data

16: Linear Models

16.6 Analyze survival data - Video Tutorials & Practice Problems

Video duration:

12m

<v Voiceover>Medicine</v> is full of statistics, ranging from clinical trials of drugs to observational studies. There's a whole branch of statistics called survival analysis. In survival analysis, you have to deal with censored data. We'll explain that in a little bit, but that has to be taken into account, and fortunately there's a coxph function in the survival package to do just that. So let's load up the package, and let's take a look at some example data. To fully understand what this data is saying, we will do ?bladder. This is a dataset about bladder cancer recurrences. It has information on a number of patients, and the type of treatment they received, and the number of tumors, the size of tumors, and whether or not they had an event. So we look here, id is the patient ID, rx is the treatment versus placebo, number is the number of initial tumors when they first entered the study, size is the size of the largest initial tumor when they first entered the study, and enum is which recurrence this is. They could have up to four recurrences because they can get cancer repeatedly. Event is whether or not there was a recurrence. Stop is the time when it happened. Now stop is tricky, because it can be a censoring time. Now with survival analysis, we have to be careful about censored time, and the stopping variable can be either when an event occurred, or when the study ended, or when a patient dropped out of the study. 'Cause let's say the study went for 60 days, and a patient dropped out after 50 days without having a recurrence, you can't just assume he was safe. During those next 10 days, he might have had a recurrence, so you need to take that into account when doing survival data. So let's look in on a place where maybe an event was a one instead of just zero. For that we will come here and look at bladder rows 100 through 105. And this shows us we have some events are one, some events are zero. Now the stopping time, for one patient it was 31, one patient it was 32, some patients it was 12. So patient 25, their event occurred on day 12. Same with patient 26. Now patient 26 had another occurrence on day 15, and again on day 24. This shows us a few patients and when their event occurred, or if they dropped out of the study. For instance, patient 25 had an event on day 12. Patient 26 had multiple events, there is day 12, day 15, day 24, and finally a drop out on day 31. Patient 27 just dropped out on day 32. So we need to take into account these drop outs versus completing the study. That's very important to do because you don't want to bias the data. This censoring data, in conjunction with the event data, is taken into account by a special Surv function. So for this part of the data, let's see what that survObject would look like. So we say with the bladder data just to make our typing a little bit easier. We just do rows 100 through 105. Then we're using the command Surv and then stop and event. Now Surv is a weird function. It could take on either two to four arguments, but you can't give them names. It can only do it based on position, so it's very important you get it right. In this case the first argument is the stopping time, or the censoring time, and the second argument is whether or not an event occurred, that's yes or no. Sort of similar to logistic regression. If we look at the survObject, we see it gives us this very special type of vector. When there's actually an event, it just gives us the time of the event, 12, 12, 15, 24. When there's no event, we have to assume it could be censored data, so it says 31+, 32+, this special data format that shows hey, this could be censored. Underneath the covers, this is just stored as a matrix. To access it, we use the Square Bracket notations, say grab columns one and two. Here we can see, it's really a two column matrix. The first column is the time, 12, 12, 15, 24, 31, 32. The second column is the status. If there was an event, there's one. If there's no event and it could be a censoring time, it's zero. Fortunately, you don't need to worry about that, R takes care of it for you. What we want to do is fit a model to see if the treatment was significant. So we say cox1 gets coxph. This uses a formula interface, but the response variable has to be one of these survObjects. So we go ahead and create it, Surv of stop Comma event. And we use for predictors rx Plus number Plus size Plus enum. And the data is coming from bladder. If we look at the summary for this, and it is cox1, not coxph1. We see it returns all sorts of information about coefficients, in fact it returns two of them. That's because with survival analysis, you're dealing with odds ratios, and risk differences, we could just look for now at the coef and the standard error of the coef. And like with lm and glm, we can look at the coefficient plot of this model. And this gives us a quick sense of the variables and their confidence intervals. But typically when doing survival analysis, you actually want to look at a plot of the survfit function. So we'll go ahead and do this, we'll say plot of survfit of cox1, and we'll give the xlabel days, and we'll say ylab equals survival rate, and we'll build confidence intervals. This is much more telling than the coefficient plot. This is your probability of surviving however many days from the start of the study. So as days go on, you have a lower probability of surviving. And this shows, as of day 50, what your probability of surviving is, and the confidence interval, because remember, it's still statistics, and we need to be able to see a confidence interval to get a spread for the data. Now this is sort of lumping both the treatment and placebo group together. It is possible to fit a cox model that will break it off for the two groups individually. To do this, we'll build cox2, it's a coxph model, and the response is still a Surv object, with stop and event, and we're going to predict it based on strata of rx, it's gonna stratify the data according to rx, Plus number Plus size Plus enum. Say data equals bladder, and fit the model. We can look at this now, and see how rx is no longer listed as a coefficient, because it's now a stratification variable. We can go ahead and plot the survfit object, here we'll put in cox2, and again we'll give it nice labels because we might as well make it look nice. And we're gonna color code it because it will be drawing two lines, one for each strata. Before we run it, I know we're going to want a legend, so I'll build that out so it all happens in one phase. We're gonna put it in the bottom left of the graph. The legend will just be one and two, group one, group two. The line type will be one, the colors will be one through two. The text color of the legend will also be one through two. And the title of the legend will be rx. We run these two lines of code, and we get this nice plot that shows your survival rate depending on which group you are in. Cox proportional hazards is all about proportionality. You have some sort of baseline risk, and the predictor should have some sort of proportional effect based on that risk. So to test proportionality, we use cox.zph. Put in cox1, to see that the p values are all small, meaning each of these predictors should be significant. Likewise, we could do the same thing for cox2, and see that they are also significant. Survival analysis traditionally has just been time to an event, time to a heart attack, time to an event, time to recovery. There is another side of survival analysis that is done through an Andersen Gill model. That's where it says number of events. It's the number of heart attacks you're going to have, the number of admissions to a hospital. While this actually involves more complex math, and actually a lot of work getting the data ready, R makes it fairly easy to build this model just by using coxph. So we'll look at the bladder data, but it's slightly different. This data has start and stop times, it is periods when you're at risk. When you walk out of the hospital, you're at risk, and the stop time is when you come back into the hospital. You're good again until you leave the hospital, and then you have another interval of risk. That is the type of data that coxph requires. So to fit the model, we'll do ag1 gets coxph. This time, the Surv object will take three variables, start, stop, and event. They have to be in that order, and you cannot provide names, just for some reason it will not work with names. And here we'll fit it on rx Plus number Plus size Plus enum Plus cluster of id, that's because you need to group the patients together. This one patient needs to be considered again and again. If he's in there again and again, he has multiple intervals, you need to take those all into account together. And the data this time comes from bladder2. And we'll do this a second time while stratifying an rx. So I will just copy this code and run it again, but this time we'll do strata of rx. And I'll break up the formula just to make it easier to see. And we've better give this model a different name, otherwise it will overwrite the first model. We run both of these, and now we can look at the plots. So let's do plot of survfit ag1, and we won't bother giving it nice fancy labels, we'll just say give us a confidence interval. These are the survival curves, that is the probability of not having an event or not being readmitted, based on this Andersen Gill model. And we'll build a similar plot for the stratified example. I'll just copy this line, paste it down, and I'll change it from ag1 to ag2, and we'll take the legend we built earlier and just reuse it. So we'll go up here and grab it, and copy it. The only difference is this time I don't want it to be on the bottom left, I want it to be in the top right. So when we run these two lines of code, we get a nice plot here that is color-coded, that we forgot to do one very important thing. The plot does not automatically put in colors for the lines, you need to specify that by coming right here and saying col equals one through two. Running it now, gets us a nice color-coded line and shows your probability of survival based on the different groups, and we can see here there is significant overlap between group one and group two. So in this case, the drug might not have had an effect on your readmission rate. When doing a medical study, great care must be taken to use the proper methods. Not only are lives on the line, but the methods are more complicated and you need to account for your data in a different fashion. Fortunately, R makes this easy with the survival package written by Terry Therneau, one of the luminaries of survival analysis.