5.8 Make boxplots and violin plots with ggplot2 - Video Tutorials & Practice Problems
Video duration: 4m
Even though box plots can sometimes be much maligned, they're still a good tool to have, so of course ggplot2 lets us build them. Here we look at a box plot of the carat size of diamonds. Again we initialize our plot with ggplot and diamonds, and here the aesthetic gets a little tricky. We want the y-axis to be mapped to carat. Now technically a box plot doesn't have an x-axis, but ggplot really wants one, so we will just assign it the constant 1. Close these function calls, and then add the function geom_boxplot. Run this and we see we have a nice box plot here. It just looks more attractive than the base graphics: the margins are used better, there is the gray background, and everything just looks a lot nicer.

A nice graphic to show is a box plot drawn repeatedly for different levels of a variable. So why don't we draw a box plot of carat, broken up by the cut of the diamond? We start out in a very similar fashion: ggplot, diamonds, and in the aesthetic y still gets carat, but now x gets cut. Then we just finish it off with geom_boxplot. We now have a separate box plot for each level of cut. That is a very nice feature to have, because you can see how all the cuts compare to each other. Aside from a few outliers and the Fair cut, the boxes all more or less line up with each other.

Remember, one of the common complaints against the box plot is that all you get is this rectangle of data. You don't get a lot of information there. To remedy this, there is the violin plot, built in ggplot in a very similar fashion: ggplot, diamonds, aes is still y for carat and x for cut, but this time we use geom_violin. Running that, we see that it's a similar idea to a box plot, but the box now has contours. We can see the dispersion of the data and how it changes.

Using the violin plot as our basis, let's see how we can use multiple layers in a single plot to add greater effect. First, let's start off by saving the base of the plot to a variable so we don't have to type it repeatedly.
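The three plots described so far can be sketched in R roughly as follows. This is a minimal sketch, assuming the ggplot2 package is installed and using the diamonds data that ships with it. The variable names p_single, p_by_cut and p_violin are illustrative only and do not come from the video.

```r
library(ggplot2)  # provides ggplot(), the geoms, and the diamonds data

# Single box plot of carat; x is mapped to the constant 1, as described above
p_single <- ggplot(diamonds, aes(y = carat, x = 1)) + geom_boxplot()

# One box plot of carat for each level of cut
p_by_cut <- ggplot(diamonds, aes(y = carat, x = cut)) + geom_boxplot()

# Same mapping as p_by_cut, drawn as violins instead of boxes
p_violin <- ggplot(diamonds, aes(y = carat, x = cut)) + geom_violin()

# Printing a plot object renders it on the active graphics device
print(p_single)
print(p_by_cut)
print(p_violin)
```

Storing each plot in a variable is a choice made for this sketch; typing the same expressions directly at the console draws the plots immediately.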
We see a mistake here: y was specified twice, whereas carat should go to y and cut should go to x. ggplot is good at reminding you, because you cannot have two variables mapped to the same aesthetic.

Now that this is working, let's add points to the graphic, and then the violins. So we do g + geom_point + geom_violin. What this does is plot the points and then overlay the violins. You can see the points here, peeking through the thin parts of the violins, but the violins are definitely on top. If we reverse the order, as in g + geom_violin + geom_point, the violins will be plotted first and then the points. We can see this here, where a long string of points is overlaid on top of each violin. The order in which you add the geoms determines the order in which they are drawn on the plot: the first geoms go on the bottom, and the later geoms go on top.

There is another point geom, called geom_jitter, which spreads out the data a bit. It adds a little bit of randomness to the positions of the points so they overlap less. So let's see what that would look like. We do g + geom_jitter + geom_violin. In this case, the points with a little bit of noise added will be on the bottom and the violins on top. Here we see the points really got spread out, and they make a nice silhouette plot with the violins on top.

Using box plots and violin plots, we've seen a good way to view a repeated measure of the data, and how the ordering of layers makes a difference in the way they come up in the graph. Using these to your advantage can let you make some really amazing looking graphics.
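The layering behavior described above can be sketched as follows. This is a minimal sketch, assuming ggplot2 is installed; g is the name used in the video for the saved base plot, while points_under, points_over, jitter_under and the helper layer_geoms are hypothetical names introduced here for illustration.

```r
library(ggplot2)

# Base plot saved once: carat is mapped to y and cut to x
g <- ggplot(diamonds, aes(y = carat, x = cut))

# Points added first, violins second: the violins are drawn over the points
points_under <- g + geom_point() + geom_violin()

# Violins added first, points second: the points are drawn over the violins
points_over <- g + geom_violin() + geom_point()

# Jittered points underneath the violins; the jitter is random, so the
# exact picture differs from run to run
jitter_under <- g + geom_jitter() + geom_violin()

# Hypothetical helper (not part of ggplot2): lists the geom class of each
# layer in drawing order, bottom layer first
layer_geoms <- function(p) {
  vapply(p$layers, function(layer) class(layer$geom)[1], character(1))
}

print(layer_geoms(points_under))
print(layer_geoms(points_over))

# Render the three variants
print(points_under)
print(points_over)
print(jitter_under)
```

Because each + appends a layer to the saved base plot, swapping the order of the two geoms is enough to change which one ends up on top.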