7.10 Get faster group operations
Video duration: 21m
People often say, or even complain, that 80% of a data scientist's job is data munging. A big part of data munging is group-by operations. People familiar with SQL will call this group by or aggregate. People familiar with R will call it split-apply-combine. It all means the same thing: you take your data set, you break it up according to some variable, you apply a function or measure to each piece, then you combine the results back together. It's sort of similar to MapReduce. For this example, let's take a look at the diamonds data set from ggplot2. To load that, we run the data function: data(diamonds, package = "ggplot2"). You can now see the head of diamonds. This is a common data set we've looked at before that contains information about diamonds, such as carat, cut, color, clarity and price. It isn't a particularly big data set, so we won't see a lot of speed difference between some of these functions, but it will be good for illustrating how they work. For our purposes, let's say we want to find the average price of a diamond for each type of cut, that is, the average price for Ideal cut, for Premium, for Good, for Very Good. A very popular way of doing this is to use the aggregate function built right into R. There are two ways to use it. We'll start with the less convenient way: aggregate(diamonds$price, by = list(diamonds$cut), mean). Note that the by argument has to be a list. Run that and you can see that it grouped the data by Fair, Good, Very Good, Premium and Ideal and calculated the mean price for each group. This can get unwieldy when you want to group by multiple variables, or aggregate multiple variables, so most people use the formula interface to aggregate, which looks like aggregate(price ~ cut, diamonds, mean). We're saying take the diamonds data set, break it up according to cut, and calculate the mean of the price.
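The two aggregate() calls just described can be sketched like this; it assumes the ggplot2 package is installed so that the diamonds data is available:

```r
# Load the diamonds data (assumes the ggplot2 package is installed)
data(diamonds, package = "ggplot2")

# Less convenient form: the `by` argument must be a list
aggregate(diamonds$price, by = list(cut = diamonds$cut), FUN = mean)

# Formula interface: break price up by cut, then take the mean
aggregate(price ~ cut, data = diamonds, mean)
```

Both calls return a data frame with one row per level of cut; only the column labels differ between the two forms.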
This is what we get. The numbers are the same and the labels come out a little differently, but it was much easier to type. This is a great, convenient way of aggregating data, but it can be kind of slow, especially once your data gets bigger. Another way built into base R, which is much faster, is to use tapply. It's quite simple, and very similar to lapply, sapply and apply, but it's meant for aggregating data. We say tapply(diamonds$price, INDEX = diamonds$cut, FUN = mean). Look at that: it gave us the exact same numbers, but instead of returning a data frame, it returned a vector. You might want a vector, you might not; it's a different type of functionality. Plus, we've now had to resort to going back to diamonds$price and diamonds$cut. It's not ideal, but it gets the job done, and it is much faster than aggregate. You couldn't see that here because the data was so small, but there is a significant difference in speed. Aggregating data is essentially implementing the split-apply-combine paradigm, which was popularized by Hadley Wickham and his plyr package. This package is covered extensively in previous LiveLessons and in the book R for Everyone, but it's good to have a refresher. To use the package we have to load it: require(plyr). In this package is a whole suite of functions, such as ddply, which goes from a data frame to a data frame; llply, which goes from a list to a list; dlply, which goes from a data frame to a list; and adply, which goes from an array to a data frame. So, to remind ourselves, let's look at the data set with head(diamonds). We want to calculate the average price for each cut; that is, we want to split it up by cut and calculate the mean price. For our purposes, we are going to use ddply to go from a data frame to a data frame. We pass in our data set, diamonds, and we're going to break it up.
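The tapply() version described above looks like this, again assuming ggplot2 is installed for the diamonds data:

```r
data(diamonds, package = "ggplot2")   # assumes ggplot2 is installed

# tapply splits price by cut and applies mean, returning a named vector
tapply(diamonds$price, INDEX = diamonds$cut, FUN = mean)
```

Unlike aggregate, the result is a named numeric vector rather than a data frame.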
We are going to split it according to the cut column, which we pass as a string. We then use the summarize function, which comes with plyr and which itself takes additional arguments. So we will say return the value Price, which is calculated as the mean of price. Notice the capitalization: we are taking the price column from the data frame, which is a lowercase p, and our results are going to come back with a capital P. I could've named it ShoeShine and it would come back as that, but Price seems to make more sense. You run this, and for each cut of the diamonds we get a price back. Nice, simple, exactly what we wanted, and compared to the aggregate function this will be much faster. Let's say we wanted a vector to come back, just like with tapply. We could've used daply, which goes from a data frame to an array, and in this case an array is similar to a vector. The rest of the call is exactly the same. It's a named array, so it doesn't quite look like a vector, but it will be treated the same way. The numbers are the same; the result just comes back in a different format. This is the power of the plyr package, which has really revolutionized the way people use R. Aggregating data, like with the diamonds data set, is an embarrassingly parallel process and is quite easy to work with. Looking at the diamonds data set, we can see that we want to separately and independently calculate the mean price for each level of cut, so there's no reason we can't spin this off and do each one in parallel. Fortunately, the plyr package makes it really easy to turn your code into parallel code. To do this, we first need to load the doParallel package: require(doParallel). This brings up a bunch of information about the package and loads some other packages as needed. Things do get a little more complicated when running in parallel, so there are just a few more steps to do. The first is to make a cluster.
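Here is a sketch of the ddply and daply calls described above, assuming the plyr and ggplot2 packages are installed (the inline function passed to daply mirrors the tapply result):

```r
library(plyr)                         # assumes plyr is installed
data(diamonds, package = "ggplot2")   # assumes ggplot2 is installed

# ddply: data frame in, data frame out; lowercase price becomes capital-P Price
ddply(diamonds, "cut", summarize, Price = mean(price))

# daply: data frame in, named array out, much like tapply's result
daply(diamonds, "cut", function(x) mean(x$price))
```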
It doesn't have to be a real cluster of computers; it can use the multiple cores you have on your individual machine. So we say cl <- makeCluster(2). The reason I'm choosing two is that my computer only has two cores. If your computer has four cores, you could use four, or eight, or 16, whatever you may have. Now that we have this cluster, we need to register the parallel backend. This has become really easy in recent years. In the past, depending on the operating system, you would need to register different backends. But now we can just use the generic function registerDoParallel: we say registerDoParallel(cl), passing it the cluster we just created. We can now use all of our cores together. Running ddply in parallel is pretty simple and more or less only requires one extra argument. We still say ddply, we still pass in the data set, and we still tell it which column to split the data by. Before, we used the summarize function to calculate the average price. We can't use the summarize function anymore, so we'll just build a quick little inline function: function(x) mean(x$price). It's maybe a little inelegant, but it works. We then just add the argument .parallel = TRUE. We run this and it comes back with the exact same results. That's very important: the same numbers get calculated, it was just done in parallel. Now, before you get excited and make all of your code parallel, it is important to think about what you're doing. Depending on the size of the data and how many cores you have on your machine, or how many machines you can put together, the overhead of parallel processing might undo any speed gains you get by parallelizing. You just need to be careful, think about when you want to use parallel and when you don't, and make sure the speed gains will be worthwhile. Realizing the need for speed, Matt Dowle, based in the UK, wrote the data.table package.
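Putting the parallel steps together, a minimal sketch, assuming plyr and doParallel are installed; the stopCluster call at the end is a cleanup step not shown in the lesson, but good practice:

```r
library(plyr)
library(doParallel)                   # assumes plyr and doParallel are installed
data(diamonds, package = "ggplot2")   # assumes ggplot2 is installed

cl <- makeCluster(2)                  # two workers; match your own core count
registerDoParallel(cl)                # register the parallel backend

# .parallel = TRUE farms the per-cut pieces out to the workers
ddply(diamonds, "cut", function(x) mean(x$price), .parallel = TRUE)

stopCluster(cl)                       # shut the cluster down when finished
```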
For years it has been providing R users fast data manipulation and aggregation. First up, let's load the package. We are once again going to be working with the diamonds data set, and we want an aggregation of the average price for each cut. Before we can use the data.table functionality, we need to create a data.table. We will say diaDT <- data.table(diamonds). If we print this out, even without saying head, it automatically prints only the first five rows and the last five rows, because it knows you probably don't want to print out an entire 50,000-row data set. In case you were thinking that you first need to have your data in a data frame and then put it into a data.table, that's not the case: you can use the fread function, which will read in your data fast. But our data was already there for us, so we can just continue on. data.table uses different syntax than apply or aggregate: you use square-bracket subsetting, like indexing, to do the aggregation. So we will say diaDT, square brackets; we leave the first argument blank because we're not doing any row selection. For the second argument we say mean(price). Then there is a third, special argument: by = cut. So we are saying take the diaDT data.table, break it up by cut, then calculate the mean of price for each of those. There you have it: the exact same numbers we got using aggregate, tapply and plyr, we get from data.table. Of course the math works. The thing about data.table is that it is incredibly fast. On such a small data set you won't notice, but data.table is one of the fastest, if not the fastest, ways to do aggregation in R. Hadley Wickham is one of the most prolific authors of R packages. He is famous for packages such as ggplot2 and plyr. His latest package is called dplyr; it's the next generation of the plyr package.
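The data.table aggregation described above can be sketched as follows; it assumes data.table and ggplot2 are installed, and naming the result column via list() is a small extra refinement not shown in the lesson:

```r
library(data.table)                   # assumes data.table is installed
data(diamonds, package = "ggplot2")   # assumes ggplot2 is installed

diaDT <- data.table(diamonds)

# blank first argument: no row selection; by = cut does the grouping
diaDT[, mean(price), by = cut]        # result column is auto-named V1

# the same aggregation with an explicit column name
diaDT[, list(Price = mean(price)), by = cut]
```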
Underneath the hood it's written mostly in C++, and it uses many techniques to provide incredibly fast data aggregation. While introducing his new package, he also helped popularize the concept of pipes in R. These pipes are a way of stringing functions together instead of nesting them, as is traditionally done in R. It's actually made possible through the magrittr package, which I know I am probably not pronouncing correctly. So, as we do for everything else, let's load up the dplyr package. It is important to note that if you will be using both the plyr and dplyr packages in the same session, load plyr first. So we will say require(plyr) and then require(dplyr). This is due to namespace issues: as you can see from the message, dplyr overwrites a number of functions from the plyr package. Before we get into dplyr itself, let's learn a little bit about pipes. Say we want to see the head of the diamonds data set; you can do that with head(diamonds). We can also check the dimensions of the diamonds data set with dim(diamonds). This is the way you traditionally do it in R. But the piping functionality allows you to pass information from the left-hand side of the pipe to the right-hand side: diamonds %>% head accomplishes the same thing as head(diamonds). Likewise, diamonds %>% dim accomplishes the same thing as dim(diamonds). Now, there's sure to be much great debate about which paradigm is better. Should we have nested function calls? Should we have piped function calls? That is not something we're here to discuss today, except to note that there are speed differences. Recent benchmarks have shown that nested function calls actually run faster than piped calls. Will that always be the case? Uncertain. But it's also important to remember that the human time spent writing the code, and reading it later, might be more valuable than the slight gains in processing time from using nested functions.
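The pipe examples above, in code; the %>% operator comes from the magrittr package, which dplyr loads as well:

```r
library(magrittr)                     # provides the %>% pipe
data(diamonds, package = "ggplot2")   # assumes ggplot2 is installed

head(diamonds)       # traditional nested-call style
diamonds %>% head    # the piped equivalent

dim(diamonds)        # traditional
diamonds %>% dim     # piped equivalent
```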
To that end, in addition to the magrittr package, which has the %>% pipe, there's also the pipeR package, which uses %>>%, and according to their benchmarks that code runs faster. We won't get into benchmarking now, but it's something important to consider. So, using dplyr, you can use nested functionality, but you might as well go ahead and use these pipes because they make life easy. The first thing you do is write the name of your data set, which for us is diamonds. To do split-apply-combine in dplyr, you group by some variable, and that group_by is its own function. So you pipe diamonds into the group_by function, which takes as its argument the column or columns that you want to group by; in our case, just cut. Then you pipe that to another function, which for us will be the summarize function. Normally you can just type summarize, but since we have both plyr and dplyr loaded, to play it safe I'll explicitly say dplyr::summarize. And here I will say that Price, the end result, is equal to the mean of price. I run this and you can see we get the same exact results yet again. You can debate whether this is harder or easier to write than plyr, or harder or easier to write than data.table, but it does come with really fast results. Is it faster than data.table? Maybe, maybe not. That's another benchmark, and benchmarks are always a little difficult to judge, but something very important to understand is that there's human time and there's computer time, and sometimes one matters more than the other. As we can see here, this is a very easy way to write some code that operates very quickly. Sometimes your data is just too big to even fit in memory, and your only real option is to leave it in a database. This used to mean you would have to go and use SQL. Now, SQL is great and all, but many people prefer staying in R.
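The dplyr pipeline described above, sketched in code; it assumes dplyr and ggplot2 are installed, and dplyr::summarize is spelled out in case plyr is also loaded:

```r
library(dplyr)                        # assumes dplyr is installed
data(diamonds, package = "ggplot2")   # assumes ggplot2 is installed

# group by cut, then compute the mean price within each group
diamonds %>%
  group_by(cut) %>%
  dplyr::summarize(Price = mean(price))
```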
Fortunately, dplyr has been written so that it can work with databases very easily; the user just writes R code and doesn't have to worry about SQL. To do this, first we need a database to work with. To illustrate running dplyr on a database, I have stored the diamonds data set in a SQLite database. Of course this example is a bit silly, because with such a small data set you don't need to leave it in a database, but it's good for illustration purposes. I have put this database up on my website, and the first thing we need to do is download it. So we say download.file, then we point to the address, which is http://www.jaredlander.com/data/diamonds.db. Our destination file is just going to be diamonds.db, and the method, this is very important, is "curl". If you use a different method, such as wget, or the default, you get a corrupted database, so make sure you use curl. We download it, as evidenced in the Files pane, because we have a new file there. For this example I used a SQLite database because it was easy, but dplyr works with a number of databases and they're constantly adding more. They definitely support PostgreSQL, MySQL, SQLite, BigQuery and MonetDB. Now that we have our file, we need to create a source so we can access it. To do that we should load the dplyr package: require(dplyr). To create the source we say diaDBsource <- src_sqlite, and here we point to the database file. This function also loads other packages needed for communicating with the database: RSQLite, DBI and RSQLite.extfuns. Now that we have the source, we need to create a table, because this table is how R sees things. So we'll say diaDB <- tbl, creating a table object; think of it sort of like a data frame. We point it to the database source. Databases can have multiple tables, and we want to grab just one, in fact we can only grab just one, because a tbl is like a data frame and holds only one logical table.
In this case our database only has one table, called diamonds, so we grab it. If we want to view it, we can just type diaDB and treat it like a data frame, except that it's intelligent and only prints out the first 10 rows. Why print out all 53,000 rows if we don't need them? To run dplyr on a database, you write code that is virtually identical to the code you would write for running it on a data frame. So you say diaDB, pipe it to group_by, because you still need to group the data by cut, and pipe that to the summarize function. For most functions you can just write the function name, but plyr and dplyr have some conflicts, so it's smart to specify dplyr::summarize. And summarize takes the argument Price = mean(price). You run that, and you see you get back the same exact results again, showing the average price for Fair, Good, Ideal, Premium and Very Good. It's important to remember that running dplyr on data that's in a database will be slower than running it on data that's in a data frame in memory. However, you would be doing this when the data is too big to fit in memory, so you couldn't get it into a data frame anyway; speed doesn't matter at that point, because otherwise you couldn't do anything at all. Having the ability to run the code on data in the database is hugely helpful when dealing with big data.
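The whole database workflow described above, sketched end to end. It assumes dplyr and RSQLite are installed and that an internet connection is available for the download; src_sqlite was the interface at the time of this lesson (newer dplyr versions connect through DBI/dbplyr instead):

```r
library(dplyr)                        # assumes dplyr and RSQLite are installed

# Download the example database; method = "curl" avoids a corrupted file
download.file("http://www.jaredlander.com/data/diamonds.db",
              destfile = "diamonds.db", method = "curl")

diaDBsource <- src_sqlite("diamonds.db")   # source pointing at the database
diaDB <- tbl(diaDBsource, "diamonds")      # the one logical table inside it

# Virtually identical to the in-memory dplyr code
diaDB %>%
  group_by(cut) %>%
  dplyr::summarize(Price = mean(price))
```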