20.2 Mine text with RTextTools
Video duration: 9m
With the explosion of social media, blogs, online news, and other forms of the written word, text mining is becoming hugely popular. It can be an intensive process, but fortunately R makes it easy. There used to be many different packages you'd have to use, one function from one package, another function from another, to get text mining done. But thankfully, many of them have been rolled up into the RTextTools package. So if you don't have it, go ahead and install it. Otherwise, let's load it right now. So we say require(RTextTools). That loads up the RTextTools package and allows us to use all the packages it depends upon.

For this example, the RTextTools package has a nice dataset based on New York Times articles. We can load that by saying data, and in quotes, NYTimes, and just be certain we say package="RTextTools". To take a look at the data, we type head(NYTimes), and we can see it has information such as the article ID, the date, the title, the subject, and the topic code.

What we want to do is create a document-term matrix. That is a matrix where each row represents a different article and each column represents a different word. While doing this, we want to remove numbers, we want to stem words so that running becomes run and cooking becomes cook, and we want to remove sparse terms. This is simple enough using the functionality of the RTextTools package. We create a new object: timesMat gets create_matrix. The information we're going to feed into it is the title of the article, because we're going to try to see if we can figure out the topic of the article based on the title. So we put in NYTimes$Title. We say we want to remove numbers, we want to stem words, and, on a new line, we put in the cutoff for when to remove sparse terms. We run this code, and the computer has to go through and do all the processing, so it does take a bit of time; depending on your computer this will be slower or faster.

If we want to look at it, we type timesMat, and instead of printing out an entire matrix, because it's a sparse matrix, it gives us summary information. For instance, there are 3,104 documents, which means there would be 3,104 rows, and there are 613 terms, or columns. A sparse matrix only stores the cells that have information; since most of the cells will be zeros, those cells are simply not stored. This is more efficient both in terms of memory and in terms of computation, so sparse matrices are incredibly important. You can see here that we have about 99% sparsity. When you're doing text mining, most documents don't use most words, so most entries will be zeros, and we take advantage of this by using sparse matrices for the previously mentioned savings in both memory and computation.
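Here is a minimal sketch of the steps so far. It assumes the bundled dataset's title column is capitalized as Title, and the removeSparseTerms cutoff of 0.998 is an illustrative value, since the video does not state the exact number used:

    # load RTextTools (install.packages("RTextTools") first if needed)
    require(RTextTools)

    # load the bundled New York Times headline data
    data(NYTimes, package = "RTextTools")
    head(NYTimes)

    # build a document-term matrix from the article titles:
    # strip numbers, stem words, and drop very sparse terms
    # (0.998 is an assumed cutoff; tune it for your data)
    timesMat <- create_matrix(NYTimes$Title,
                              removeNumbers = TRUE,
                              stemWords = TRUE,
                              removeSparseTerms = 0.998)

    # printing a sparse document-term matrix shows a summary,
    # not the full matrix: documents, terms, and sparsity
    timesMat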
To continue with the process of text mining, we need to create a container. This is something special to the text mining process. Basically, we need a training set and a testing set; this way we can train the model on one and test it on the other. Again, RTextTools makes this very easy. We simply say container gets create_container. We feed it the timesMat that we just created. The labels argument is what we're training on, the response variable: that is NYTimes$Topic.Code. The train size is maybe a bit inappropriately named, as it actually takes a vector of indices, so for this we'll say one through 2500. Then we specify the test size, which in this case will be 2501 through nrow(NYTimes). This way, whatever is after row 2500 becomes the testing set. We will also specify virgin = FALSE.

Now that we have both our training set and our testing set, it is time to fit some models. We could go ahead and pull the training set out of the container and fit a model with our favorite algorithm, or we could just use the built-in functionality to run our favorite algorithms. So let's go with that. For our purposes, we will build both a support vector machine and my favorite algorithm, the elastic net.

To start with, let's do the SVM. So SVM gets train_model, we pass it the container, and we tell it to use SVM. And it says container not found. Now, this is a very good example of a problem I see again and again when I teach R: people will write a line of code and never actually run it, and when they get to the next line of code that depends upon it, it errors out. So it's very important that if you want your object to exist, you have to run its code first. Now we can go ahead and fit the support vector machine.

Now that that is done, we will go ahead and fit the elastic net, which in R is done through the glmnet package. Fortunately, the RTextTools package has glmnet functionality included. So we will say GLMNET gets train_model, again feed it the container, and this time tell it we will be using the GLMNET model. Some of these models can take a while to fit, but their code is often written in C, C++, or FORTRAN, so they actually run quite quickly.

First up, let's do a classification based on the SVM model. We're going to create a new object, call it svmClassify, and that will be the result of running the classify_model function. We will pass it the container, which is where it will get the testing set, and we will tell it to use the SVM model that we previously built. In a similar fashion, we will do it for GLMNET, the elastic net implementation in R.

Now that both of these models have been fit and tested, let's run some analytics on them. We say analytics gets create_analytics, pass it the container object to get all of our data, and cbind the results of the SVM classification and the GLMNET classification. Again, on your computer this could take some time, because it is going through and calculating the accuracy of the models. We can now run a summary of this object and see how well our models did. We can see that, in terms of precision, the GLMNET did better, and also in terms of recall. So on this dataset, the GLMNET is better than the SVM. Now, I don't want to make a sweeping statement and say one algorithm is always better than another; it's a case-by-case basis, and we only tried two different models. But this is a nice illustration of how easy it is to do text mining, whether we're talking about creating the document-term matrix, fitting the models, or testing the models. RTextTools makes it very easy.
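For reference, the rest of the workflow can be sketched the same way. First, the container: trainSize and testSize take vectors of row indices, and the label column in the bundled dataset is Topic.Code.

    # rows 1-2500 train the models; the remaining rows are held out for testing
    container <- create_container(timesMat,
                                  labels = NYTimes$Topic.Code,
                                  trainSize = 1:2500,
                                  testSize = 2501:nrow(NYTimes),
                                  virgin = FALSE)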
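Next, training the two models: the algorithm is named by a string, and "GLMNET" fits the elastic net through the glmnet package.

    # fit a support vector machine and an elastic net on the training set
    SVM <- train_model(container, "SVM")
    GLMNET <- train_model(container, "GLMNET")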
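Finally, classification and evaluation: classify_model scores the testing set stored in the container, and summary() of the analytics object reports each model's precision and recall.

    # classify the held-out test set with each trained model
    svmClassify <- classify_model(container, SVM)
    glmnetClassify <- classify_model(container, GLMNET)

    # combine the results and compute precision, recall, and related metrics
    analytics <- create_analytics(container,
                                  cbind(svmClassify, glmnetClassify))
    summary(analytics)

Because create_analytics accepts a cbind of any number of classification results, additional algorithms supported by train_model (for example "RF" or "MAXENT") can be added to the comparison with one extra train_model and classify_model pair each.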