5.6 Plot histograms and densities with ggplot2
Video duration:
3m
Voiceover: Much like base graphics, ggplot offers an easy way to make histograms. So again, we will look at the carat variable of the diamonds data set. These graphics are initialized with a call to the ggplot function, so that is ggplot, and then we say data equals diamonds. That's the very first thing we're going to do. Then we add the histogram layer, so that is plus geom_histogram. Since we haven't specified any aesthetic mappings yet, we need to tell ggplot which variable is going to be mapped to the x axis. We do that using aes, which is its own function, and say x equals carat. We close off aes, we close off geom_histogram, and now we can run the plot.

So a few things happened here. We did get our graph, which is amazing. We also got a little message saying that the bin width defaulted to the range of the data divided by 30. What that means is that a histogram has to break up the x axis into discrete buckets. How big those buckets are is a setting you can tune. Right now that is being done by default, by taking the maximum x value minus the minimum x value and dividing by 30. Playing with that can make a big difference.

We can adjust the bin width manually using an option that goes into the geom_histogram function. So I'm going to take this line of code, copy it, and paste it on the next line. Inside geom_histogram, but not inside aes, I'm going to add a new argument: binwidth. For now, I'll set that to 0.5 and see what we get. Running that, we get a plot that's much blockier; there is much less information, much less variation. So I'll copy this line again, and this time I'll make a plot that has a bin width of 0.1 and see what that looks like. This one is much noisier: lots of peaks, lots of valleys. It is important when making a histogram to get the bin width just right. Too small, and the plot is too noisy and doesn't really show you anything. Too big, and the data is too smoothed out, and you don't get any information either. You really want to find that sweet spot.

Let's clear out the console and look at a similar graph, the density plot. A density is sort of a more continuous version of a histogram. It starts in much the same way, with ggplot(data=diamonds), and then instead of geom_histogram, we use geom_density. Again, the aesthetic mapping is still the x axis getting carat, and we won't worry about bin width just now. Running this, we get the density plot. It's only a line and doesn't really show that much, so it might be helpful to fill in the color. So let's copy this line, paste it in, and in geom_density, let's add another argument. You might be tempted to use the color argument, but that would actually just control the color of the line. We want to fill in the color underneath the line, so we use fill, and this can take on a number of different values. I'll use the string grey50. Running this now gives us a nice filled-in density plot.

Both the histogram and the density plot are very useful tools for examining a single variable. They give you a sense of how spread out the data is. And using ggplot, we can quickly and easily build them up. Right now, it may seem like a lot of typing to accomplish what, in base graphics, could be done with much less typing, but as the graphs get more complicated, the ggplot syntax becomes much, much quicker to write.
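The calls described in the voiceover can be sketched in R roughly as follows. This is a minimal sketch that assumes the ggplot2 package is installed; the object names (p_default, p_wide, p_narrow, p_density, default_width) are illustrative and do not come from the video. Recent ggplot2 releases word the default-binning message in terms of 30 bins rather than the range divided by 30, so the default_width line only reproduces the rule of thumb as it is described here.

```r
library(ggplot2)  # provides ggplot(), the geoms, and the diamonds data set

# Histogram of carat with the default binning; drawing it prints a message
# suggesting that you pick a bin width yourself.
p_default <- ggplot(data = diamonds) + geom_histogram(aes(x = carat))

# Same histogram with explicit bin widths. binwidth sits inside
# geom_histogram() but outside aes(), because it is a fixed setting
# rather than a mapping from a variable.
p_wide   <- ggplot(data = diamonds) + geom_histogram(aes(x = carat), binwidth = 0.5)
p_narrow <- ggplot(data = diamonds) + geom_histogram(aes(x = carat), binwidth = 0.1)

# Density plot. fill shades the area under the curve; color would only
# change the outline.
p_density <- ggplot(data = diamonds) + geom_density(aes(x = carat), fill = "grey50")

# Rule of thumb from the voiceover: (maximum x - minimum x) / 30.
default_width <- diff(range(diamonds$carat)) / 30

# Draw each plot. At the interactive console, typing the object name
# has the same effect as print().
print(p_default)
print(p_wide)
print(p_narrow)
print(p_density)
```

Storing each plot in a variable is a choice made for this sketch; in the video the ggplot call is run directly, which draws the plot immediately at the console.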