8.8 Apply arbitrary functions with do
Video duration:
6m
The mutate and summarize functions are really great when you want to apply a single transformation or a single summary statistic to individual columns. But sometimes you want to use an arbitrary function that returns multiple rows, multiple columns, or maybe even different types of objects. For that, we can use the do function.
In order to use this, let's create our own trivial, arbitrary function. We will say, topN gets function of x, comma, n equals five. This function will simply return the rows associated with the n biggest prices. So we say right in here, using magrittr, x, pipe, arrange, descending of price, pipe, slice, one through n. Let's run this function. And then, if we do topN of dia, we see it gives us the rows with the five highest prices for the diamonds.
Instead of doing topN of dia, we could have said dia, pipe, do, topN of dot. This might seem like it's more typing, and it is, but this will be useful later on in more complicated data flows. If we look at this, we get the same answer, as expected. The dot is a placeholder that says where to place the information being piped into the function, because it could go to the first argument or the second argument. So it's important to have that. We can use multiple arguments by saying dia, pipe, do, topN of dot, comma, n equals seven. This gives us the top seven rows.
Now, let's say I want to get the top two rows for each level of cut. So we say dia, pipe, group by, cut, pipe, do, topN of dot, comma, n equals two. And now it went through, broke up the data according to cut, and returned the top two rows of each. Of course, this example could have been done straight without the function, using all regular dplyr notation, by saying dia, pipe, group by, cut, pipe, arrange, descending of price, pipe, slice, since it's still grouped, one through two. That gets us the same results, but I wanted to illustrate applying an arbitrary function.
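The steps narrated above can be sketched in R roughly as follows. This is a minimal sketch rather than the instructor's exact code: the transcript never shows how dia was created, so the block assumes it is a copy of the diamonds data from ggplot2, and it assumes dplyr is attached (which also makes the magrittr pipe available). The names top2 and top2_direct are introduced here only for illustration.

```r
library(dplyr)    # provides do(), group_by(), arrange(), slice() and the %>% pipe
library(ggplot2)  # assumed source of the diamonds data

# Assumption: dia is a copy of ggplot2's diamonds data.
dia <- diamonds

# Return the n rows of x with the highest price.
topN <- function(x, n = 5) {
  x %>% arrange(desc(price)) %>% slice(1:n)
}

topN(dia)                   # rows with the five highest prices
dia %>% do(topN(.))         # same call routed through do(); the dot is the piped data
dia %>% do(topN(., n = 7))  # extra arguments go after the dot

# Top two rows within each level of cut, using the custom function.
top2 <- dia %>% group_by(cut) %>% do(topN(., n = 2))
top2

# The same idea written with plain dplyr verbs and no custom function.
top2_direct <- dia %>% group_by(cut) %>% arrange(desc(price)) %>% slice(1:2)
top2_direct
```

In the grouped call, do() runs topN once per level of cut and stacks the returned data frames, which is why the result has two rows for each of the five cuts.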
So far, we have specified do without supplying a name. When we don't supply a name, do expects a data frame to be returned, and it combines all the data frames into one. But if you name the argument inside do, it behaves differently. We say dia, pipe, group by, cut, pipe, do, This equals topN, parenthesis, dot, comma, n equals two. And now we see it looks like a five-row data frame, but each cell of the This column is itself a data frame stored in there.
It looks a little odd, so let's illustrate this further by applying a linear model to each level of cut. We will simply break up the data by cut, and then regress price onto carat for each of them. We'll save this in the models object, and say dia, pipe, group by, cut, pipe, do, Model equals lm. We're fitting a linear model of price, tilde, carat. Don't worry if you're not familiar with lm at this point; you can learn that in other lessons. For now, this is just to see how do handles arbitrary functions. Don't forget data equals dot, because we're piping the cut-up data frames into the do function, and that passes the current set of data as a dot right here to the data argument.
Let's run this, and let's look at the result. We get a data frame with only five rows in it. The first column is the levels of cut, and the second column looks like it's storing lm objects. The class of models is a rowwise data frame, which just has to do with the fact that we used the group by operation. Let's check out the class of cut. Class of models, dollar, cut. It is an ordered factor, and that's what we expect from this type of categorical variable. But let's check out the class of Model. Class of models, dollar, Model. It appears that the entire column is being treated as a list. Let's check the class of the first element. Class of models, dollar, Model, double square bracket one. It's an lm object. An entire lm object is being stored as a cell of this column. Let's print it out. Models, dollar, Model, double square bracket one.
That gives us printed results just like any lm model. We can do summary of models, dollar, Model, double square bracket one. And you see, it is treated exactly like an lm model. The do function lets you apply arbitrary functions to a data frame, and it's very powerful when you use it with group by. If you leave the expression unnamed, do expects each call to return a data frame and stacks those data frames together. However, supplying a name for the expression in the do function causes each result to be stored as an element of a column, rather than returned as an entire data frame that gets stacked.
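The named form of do described above can be sketched as follows. Again this is a hedged sketch under assumptions, not the instructor's exact script: dia is assumed to be a copy of ggplot2's diamonds data, topN is redefined so the block runs on its own, and the exact capitalization of the column names This and Model is a guess from the spoken transcript. The object name nested is introduced here only for illustration.

```r
library(dplyr)
library(ggplot2)  # assumed source of the diamonds data

# Assumption: dia is a copy of ggplot2's diamonds data.
dia <- diamonds

# Same helper as in the lesson: the n rows of x with the highest price.
topN <- function(x, n = 5) {
  x %>% arrange(desc(price)) %>% slice(1:n)
}

# Named do(): one row per cut, with each group's data frame stored in a list-column.
nested <- dia %>% group_by(cut) %>% do(This = topN(., n = 2))
nested
nested$This[[1]]            # the two-row data frame for the first cut level

# Fit price ~ carat separately for each cut; data = . passes the current group.
models <- dia %>% group_by(cut) %>% do(Model = lm(price ~ carat, data = .))
models

class(models)               # the transcript reports a rowwise data frame
class(models$cut)           # ordered factor
class(models$Model)         # the whole column is a list
class(models$Model[[1]])    # a single cell holds an lm object

models$Model[[1]]           # prints like any fitted lm
summary(models$Model[[1]])  # double brackets extract the model itself
```

Single brackets, as in models$Model[1], would return a one-element list rather than the fitted model, which is why the double square bracket matters when calling summary.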