16.2 Explore the data
Multiple regression is an extension of simple regression. Whereas in simple regression you predict one variable from one other variable, multiple regression lets you predict one variable from many variables. It's a natural extension, and we will explore the math. But first, let's take a look at some data. We're going to use some information about the New York City housing market. The data is available on my website at www.jaredlander.com/data/housing.csv. Let's load that in. We use read.table. And remember: the separators are commas, there is a header, and stringsAsFactors equals FALSE. We run that, and now we can take a look. There's a lot of information in here, and the names aren't necessarily the best names. For instance, Building.Classification is kind of long; we can just change that to Class. So what we'll do is use the names function to clean this up a bit. It's a bit of typing, but it will make things easier later on. If we run this line and get an error saying argument seven is empty, that's not a big problem; it just means an extra little comma snuck in, and you need to be careful not to have those. And remember, in RStudio, if we want to rerun the same line of code, it's Control + Shift + P. If we look at the data now, we can see much nicer names. In fact, the names are so much shorter they take up less space on the screen. We are interested in a number of these variables, particularly the value per square foot. The value per square foot is a very important measure, and I want to see how the other variables explain it. So let's visually explore it. One of the most important parts of doing any analysis is exploratory data analysis, which essentially means visualizing your data, plotting it in many different ways, and seeing what's going on. To do this, we will load ggplot2 and then do a histogram of the value per square foot. So we do ggplot(housing), the aesthetic is x=ValuePerSqFt, and we say geom_histogram(binwidth=10).
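The steps narrated above can be sketched in R as follows. The narration only names a few of the cleaned-up columns (Class, Units, SqFt, ValuePerSqFt, Boro), so the full vector of replacement names below is an assumption about the file's column order; check it against the actual CSV before running.

```r
# Load the NYC housing data; read.table with comma separators,
# a header row, and strings kept as character (not factors)
housing <- read.table("http://www.jaredlander.com/data/housing.csv",
                      sep = ",", header = TRUE,
                      stringsAsFactors = FALSE)

# Replace the long column names with shorter ones.
# NOTE: this exact vector (and its order) is assumed, not given
# verbatim in the narration -- verify against names(housing) first.
names(housing) <- c("Neighborhood", "Class", "Units", "YearBuilt",
                    "SqFt", "Income", "IncomePerSqFt", "Expense",
                    "ExpensePerSqFt", "NetIncome", "Value",
                    "ValuePerSqFt", "Boro")

# Histogram of value per square foot
library(ggplot2)
ggplot(housing, aes(x = ValuePerSqFt)) +
  geom_histogram(binwidth = 10) +
  labs(x = "Value per Square Foot")
```

A trailing comma inside the `c(...)` call is what produces the "argument 7 is empty" style of error mentioned above.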
Now binwidth is something you need to play with. You might not get it right the first time; you need to experiment and see which value works best. There are some formulas for deducing a good binwidth, but you've really got to play around. And lastly, we will give it a label: x is "Value per Square Foot". We run both of these lines, and we get a histogram. Let's look at this for a second. We can see it definitely is not normally distributed; there are two modes, two giant peaks. That's a bit disturbing when you're doing regression, because ideally your response variable should be somewhat normally distributed. So we need to figure out what's going on and why. Now, I have a hunch, knowing the New York City housing market: it could be that the very high values per square foot are in desirable neighborhoods. So let's go ahead. We will take this line of code and copy it (these two lines, technically) and modify it slightly. In addition to saying the x-axis will be the value per square foot, we will say fill=Boro. That way, it will be broken out by Manhattan, Brooklyn, the Bronx, Staten Island, and Queens. When we run this, it becomes very clear very quickly that the first mound is the Bronx, Brooklyn, Queens, and Staten Island, and the second mode is Manhattan. That makes a good amount of sense: Manhattan typically has much higher pricing than the other boroughs, though Brooklyn is starting to catch up. I happen to like the way the data's overlaid in this display, but other people don't see it that way. So, as one other option, we can copy this line of code and add in facet_wrap on Boro. Running this changes the display a little bit: now it puts each borough in its own panel. This is called small multiples. Both displays get the point across. I personally prefer the overlaid display; some people prefer this one.
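The two variants described above, the overlaid histogram colored by borough and the faceted small-multiples version, can be sketched like this (assuming the `housing` data frame and column names from the earlier load step):

```r
library(ggplot2)

# Overlaid histogram: one plot, bars stacked and colored by borough
ggplot(housing, aes(x = ValuePerSqFt, fill = Boro)) +
  geom_histogram(binwidth = 10) +
  labs(x = "Value per Square Foot")

# Small multiples: the same histogram, one panel per borough
ggplot(housing, aes(x = ValuePerSqFt, fill = Boro)) +
  geom_histogram(binwidth = 10) +
  labs(x = "Value per Square Foot") +
  facet_wrap(~ Boro)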
So already (and I'll go back to the other graph to look at it), we can tell just visually that borough is going to be an important variable, and we're going to need to consider it. Let's continue our EDA and look at a histogram of square footage. So we will do ggplot(housing, aes(x=SqFt)) and then geom_histogram. We will just use the default binwidth, but we need to make sure we spell histogram correctly. We run this and we see a very weird graph. It's all squished up on the left-hand side with a long trail going off to the right, meaning there's some data out there that we really can't see. Something seems a bit amiss, so we've got to think about that in a little bit. Just to explore this a little more, let's make a scatterplot of square footage against value per square foot; maybe something is up with that. Again, we see all the data squished up toward the left, with a few stragglers out to the right with enormous square footage. But also bear in mind, this is building-level square footage, not square footage for individual apartments. Even so, these are still giant buildings, clearly bigger than everything else. After exploring the data, we see there are a few buildings with thousands of units in them. These are big housing complexes in Tudor City that are not representative of the rest of the city; they are, by far, extreme examples of buildings. So let's go ahead and build these plots again, but this time, let's exclude some data. We are going to exclude any buildings with a thousand units or more. So we subset housing such that Units is less than a thousand. Running this, we get a much better histogram. It is right-skewed, but it actually fills the screen, and we can see all the data. Let's do the same thing with the scatterplot: we subset housing so we are only looking at buildings with fewer than a thousand units.
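A sketch of the square-footage plots described above, before and after excluding the extreme buildings (again assuming the `housing` data frame from the earlier steps):

```r
library(ggplot2)

# Histogram of building square footage, default binwidth
ggplot(housing, aes(x = SqFt)) +
  geom_histogram()

# Scatterplot: square footage against value per square foot
ggplot(housing, aes(x = SqFt, y = ValuePerSqFt)) +
  geom_point()

# The same two plots, keeping only buildings with fewer than 1000 units
ggplot(housing[housing$Units < 1000, ], aes(x = SqFt)) +
  geom_histogram()

ggplot(housing[housing$Units < 1000, ],
       aes(x = SqFt, y = ValuePerSqFt)) +
  geom_point()
```

With the default settings, geom_histogram will also print a message suggesting you pick a better binwidth, which echoes the earlier advice to experiment with it.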
We run this, and again, our scatterplot now looks much more informative. It looks like we're already off to a good start. So let's see how many buildings we needed to remove. We do that simply by saying sum(housing$Units >= 1000). What this is doing is: housing$Units >= 1000 returns TRUE or FALSE for every single row in the dataset. Since TRUE and FALSE are treated as ones and zeros, adding them up gives the total count. And it appears there are six such buildings. That's all there are, so we can go ahead and remove them from our data. Doing this keeps just the good buildings, and we are now good to go. Visualization can be hard to do, but it is very important, especially when doing a complicated analysis; it's good to see your data and see what's going on. If we hadn't done that, we wouldn't have noticed all of these buildings that were so large that they threw off the scale completely. Exploratory data analysis can be very revealing and insightful and is a key step in the modeling process.
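The counting-and-removal step above can be sketched as follows (the count of six comes from the narration; it depends on the actual data file):

```r
# Count the extreme buildings: the comparison returns a logical
# vector, and sum() coerces TRUE/FALSE to 1/0, so this counts rows
sum(housing$Units >= 1000)

# Permanently drop those buildings, keeping only rows below the threshold
housing <- housing[housing$Units < 1000, ]
```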