4.3 Read from databases - Video Tutorials & Practice Problems
Video duration:
5m
Play a video:
<v Voiceover>As the world is</v> generating more and more data, otherwise known as big data. These data are being stored in databases. As such, the R Programmer is increasingly being required to pull data from these databases. Now, R can do this easily, but in Windows we need to run R as administrator to be able to access all the ODBC tools we need. So to do this, we go to the start menu. We find R Studio, we right click and then select run as administrator. And this brings up R Studio just like any other time, except we now have more permissions. The first thing needed to pull from databases is the R ODBC package, which is included with the base installation of R. First things first, let's clear out the console just so we have more room. The first thing needed to read from databases in R is to load the R ODBC package. Now if you don't have that installed, go ahead and install it using the install packages interface. Once you have it we'll load it up by using require. And remember this is case sensitive. And now we're ready to go. If you're working a database, you would've already created an ODBC connection. And while that's beyond the scope of this course, we'll assume that you or your system administrator can get you set up using that. So the first thing we need to do is create a pipeline to this ODBC connection. We'll assign this to a variable called db and we'll use the function odbcConnect. I'll use the tab button to pull up the options, it's the very first option we want. And I'll give it the name of my DSN, however, the DSN is the connection string that lets you talk to your database. Putting that in, it sits there it thinks, and it creates a connection to the database. We can view this connection by typing in db and see it has the DSN to QV training, there's no user ID. Not a lot of information here. This does accept passwords, but it's not something we really need to worry about right now. So let's clear the console again. And let's go ahead and make our first query to the database. This data schema has a number of tables including an orders table and we're just going to pull every column from that orders table. So we'll assign this table to the variable orders table. And we will use the function sqlQuery. The nice thing about hitting tab to get the function name is that it shows you the arguments needed for that function. Likewise once you already have the parentheses, you could hit tab to find the arguments you need. The first argument is channel. That's what we created up here with odbcConnect. So I'll put in db. The second argument is your sqlQuery. In the R case that's select, star, from orders. And the optional argument strings as factors equals false. Make sure characters are not converted to factors, this will speed up the process. Running this lets R talk to the database and perform the query. And why don't we look at this now. Head of orders table. This shows us we have an order ID, order date, customer ID, employee ID, shipper ID, and freight. Likewise we can pull from the order details table. You'll save this to the variable details table. Again, we will use sqlQuery. The first argument is the channel, second argument is our query which is select all from order details, close. Now while this is not intended to be a SQL lesson, most often SQL is not case sensitive. If your table or column name has a space in it, you need to enclose it with some sort of delimiter, such as square brackets. And again we'll do strings as factors equals false, to speed up processing. We'll look ahead of this table. And see we have a nice table. The queries you run in your sqlQuery can be arbitrarily complex. If the database can execute that query, we can run it. So let's clear the screen and take a look at one of those. So for this we're first going to store our query into a text variable, 'cause that'll be easier to work with it rather than putting it in the middle of a function column. So we will call that long query. And we will say select all, we're going to select all the columns from both orders and order details. And now we're going to give it a where clause, so it only pulls certain rows. So where orders.orderID is equal to order details.orderID. So let's run that and take a look at it. Long query. And we see it's saying pull all the columns from both tables where the order ID from the orders table matches the order ID from the order details table. We can then go ahead and execute this command. We'll call it the joined table. And it will get sqlQuery, db, long query, and we'll say strings as factors equals false. The head of joined table looks like this. It is the combination of all the columns from both tables joined appropriately. With all of the data being stored in databases, it's important to be able to pull from that and R exposes the full capabilities of any sqlQuery so as long as you can write the query, you can pull from the database.