4.1 Read a CSV into R - Video Tutorials & Practice Problems
Video duration:
5m
<v Voiceover>Even with all its great</v> statistical capabilities, R would be useless if there wasn't an easy way to get data into it. Fortunately, there are many ways to get all types of data into R. The most common way people receive data is through a CSV, or comma-separated values, file. So let's look at loading one of those into R. I've made this file available at my website: http://www.jaredlander.com/data/Tomato%20First.csv. That %20 is there because it's a URL, so the space in "Tomato First.csv" needs to be encoded properly. You can go ahead and download that to your computer; in a second I'll also show you how to use it directly from R. Before we load in the data, let's take a look at it. We're going to open up the CSV in Excel, which is still the world's most widely used analysis tool. We see it's a nice rectangular data set. There are a few columns; some of them are numeric, some of them are character, and underneath they're stored as a text file with commas separating each of the values. There are even some blanks, and that is important. Let's go back to R and start reading it in. Let's say we want to save it to a variable named tomato. We use read.table. At this point, since people are reading a CSV, they're tempted to use the function read.csv. Truth be told, read.csv is just a wrapper around read.table, and it can actually introduce a number of annoyances, so I tend to avoid read.csv in favor of read.table. The first argument is the name of the file: the location and name of the file, including its extension. In this case, it's stored in data, then Tomato First.csv. I am doing this from a file on my hard drive, but it is entirely possible to take that URL and put it in place of the file name, and it will work perfectly fine. That's a great feature. Going back to our data set, we see the first row is all the column names, so we want to tell R, "hey, the first row is column names." We say header equals TRUE.
And even though we can't see it in Excel, we know that all the values are separated by commas, so we say sep equals comma, inside quotes. Running this line reads the data into R. We can see this by typing head of tomato. We get the first six rows and all of the columns. When it has wide data sets, R will wrap the columns around: the information down here really belongs on the right-hand side of the rest of the data, it just wouldn't fit on the screen. There are a number of options to read.table, the most important of which, in my mind at least, is stringsAsFactors equals FALSE. When R reads in a file from a CSV using one of these functions such as read.table, it automatically converts character data into factors. For instance, class of tomato, and again you can use tab completion to fill out the names of the variables, and we'll do dollar sign Tomato. And that's interesting: when you have a data frame and you do tab completion after the dollar sign, it shows you the columns that match what you have typed. We hit enter for Tomato, check it out, and we see it is indeed a factor. This process of converting all the characters into factors can be computationally expensive; if you have a very large data set, it can really slow you down. One of the best tricks for speeding up reading data into R is setting stringsAsFactors equals FALSE. Not only does this speed up computation time, it saves you headaches down the road. Let's copy line two, paste it in, and set that extra argument, stringsAsFactors equals FALSE. When we run that and then do class of tomato dollar Tomato, we see it is now a character. That makes things just a little bit easier for us. If instead of a comma-separated file we had, for instance, a tab-separated file, read.table can still handle that.
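A minimal sketch of the calls just described. The column names and values here are placeholders standing in for the Tomato First data, written to a temporary file so the example is self-contained (you could equally pass the file path or URL from the video):

```r
# Stand-in for data/TomatoFirst.csv, written to a temp file
path <- tempfile(fileext = ".csv")
writeLines(c("Tomato,Price,Sweet",
             "Simpson SM,3.99,2.8",
             "Other Tomato,2.99,3.3"), path)

# Default behavior (in R before 4.0): character columns become factors
tomato <- read.table(path, header = TRUE, sep = ",")

# Keep character data as characters: faster and fewer headaches
tomato <- read.table(path, header = TRUE, sep = ",",
                     stringsAsFactors = FALSE)
class(tomato$Tomato)  # "character"
head(tomato)          # the first rows of the data frame
```

Note that since R 4.0, stringsAsFactors defaults to FALSE, but spelling it out keeps the code unambiguous on older versions.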
The difference is, instead of saying sep equals comma, you would say sep equals backslash t. It would look like this: sep="\t", and that's how you tell R to look for tabs separating columns instead of commas. This argument can take any arbitrary separator, so if your file has semicolons as a separator, you can simply put sep equals semicolon in quotes. Whatever is separating your data, that's the symbol you want to use. While earlier I discouraged the use of read.csv, there are times when a CSV file is not formatted the way read.table expects; it's ill formed. For instance, if a file uses commas instead of periods to separate decimal numbers, that can trick R into thinking there are more columns than there really are. In that case, there's a handy function called read.csv2. This does a much better job of going through the data, picking out the rows and columns, and finding the correct relationships. It's slower than read.table, but it's useful in these situations when you have ill-formed data. read.table really is our workhorse for getting data into R, because typically, in the past at least, data came in through CSVs, and read.table is just great and easy for pulling in CSVs and similar files.
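The separator variations above can be sketched as follows, again with hypothetical file contents written to temp files. Note that read.csv2 defaults to sep=";" and dec=",", which is what lets it handle files where commas mark decimals:

```r
# Tab-separated file: same read.table call, different sep
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("Tomato\tPrice", "Simpson SM\t3.99"), tsv_path)
tsv <- read.table(tsv_path, header = TRUE, sep = "\t",
                  stringsAsFactors = FALSE)

# Semicolon-separated file with commas as decimal marks;
# read.csv2 assumes sep = ";" and dec = "," by default
eu_path <- tempfile(fileext = ".csv")
writeLines(c("Tomato;Price", "Simpson SM;3,99"), eu_path)
eu <- read.csv2(eu_path, stringsAsFactors = FALSE)
eu$Price  # parsed as the number 3.99, not split into two columns
```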