6.4 Plot the data
Video duration: 16m
And now this is my favorite part of doing data analysis: plotting the data, doing some visualizations. So, the library we're gonna use today is called Seaborn, and it's a Python data visualization library based on matplotlib that provides a higher-level interface for drawing attractive and informative statistical graphics. So if we go to the Seaborn website, this is what it looks like, and you can see some examples here, and there are example galleries, installation instructions, tutorials and the API reference. The API reference just kind of means the documentation of all the different features that are available for you to use, so we'll look at these features here. We want to do some scatter plots, so that will be under relational plots, and you could go through the tutorial, but this is your tutorial for now so we're gonna look at the API. And the types of relational plots they have are here. We're gonna use scatterplot. And this is saying that to create a scatter plot, we'll import the library called seaborn and then call the scatterplot function and pass a bunch of parameters into it. So, if a parameter has an equal sign, that means that it's not required because there's a default value, and what's on the right of the equal sign is the default value used if you don't pass something in. And each of these parameters is listed here, as well as what it is expecting and what it does. Let's go back to our notebook here. And the first thing we want to do is import the seaborn library. I'm gonna press Shift+Enter to run it. It didn't fail, so that's good because it means that our Anaconda distribution came with Seaborn installed already. And then I'm gonna create a plot. So this is gonna be seaborn.scatterplot and for now let's see what happens when I just say data equals tips. So if we look at data in the docs, it's accepting a dataframe, and that's our pandas DataFrame here. Each column is a variable and each row is an observation.
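To make the point about equal signs and default values concrete, here is a small sketch that asks Python which scatterplot parameters carry a default. It assumes seaborn is installed, as it was in the Anaconda setup used in the video; the use of the standard inspect module is an addition for illustration and is not something the video does.

```python
import inspect

import seaborn

# Collect every scatterplot parameter that has a default value,
# i.e. the ones written as name=value in the API reference.
signature = inspect.signature(seaborn.scatterplot)
defaults = {
    name: param.default
    for name, param in signature.parameters.items()
    if param.default is not inspect.Parameter.empty
}

# A default means the argument can be left out of the call.
for name in ("data", "x", "y", "hue"):
    print(f"{name} defaults to {defaults[name]!r}")
```

Because all of these have defaults, a call can pass only the ones it needs, which is why the first attempt in the notebook passes nothing but data.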
So that's exactly what we're passing in. Check that out, I'm getting an error and it says a wide-form input must only have numeric values. Not really sure what that means, but one thing to notice is that we're not specifying which columns of the dataframe we want to have as the x or y axis. So here for x and y it says names of variables in data, or vector data, they must be numeric, and we can pass data directly or reference columns in data. So we are passing data, so we can reference columns. So let's do that. Let's say x equals, what columns do we have? We've got total bill and tip and, oh, and size; those three are the only numerical columns we have, so we should use one of those. For the x axis, let's do total bill, because we're interested in the tip value and so that would be our y axis. And the tip column is just called tip, so let's put that in, run it and we get a scatter plot. That's cool. So this is all 244 data points, and we can see that as expected there's a somewhat linear pattern, so that the higher the bill was, the higher the tip amount was gonna be. And we've got some outliers here, so this person tipped quite well on a small bill and then down here we've got people who are not tipping very well on large bills. Now, if we wanted to use that other numerical column that we have, it is size, so let's see what that looks like. It looks less interesting because the size is a discrete integer number, but yeah. There is like a slight pattern where the higher the size the more the tip, but not as much as I would assume. So I don't think that one's as useful, so let's go back to this being total bill. There's a few things we can do now. One is that each of these points has only one color and one size, so we could, if we wanted to, add some parameters in here to change the color and size based on other attributes.
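The two calls described above can be sketched roughly as follows. The rows are made up for illustration and are not the lesson's 244-row dataset, and the column names (total_bill, tip, size, sex) are assumed from how they are spoken in the video. Whether the data-only call raises an error seems to depend on the seaborn version, so the sketch allows for both outcomes.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; a notebook would not need this
import matplotlib.pyplot as plt
import pandas as pd
import seaborn

# Made-up rows standing in for the lesson's tips dataframe.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 4.0, 4.5, 8.0],
    "size": [1, 2, 3, 4],
    "sex": ["Female", "Male", "Male", "Female"],
})

# Passing only data= treats the frame as wide-form. Some seaborn versions
# raise ValueError because of the text column; others may plot the numeric
# columns without complaint.
try:
    seaborn.scatterplot(data=tips)
except ValueError as err:
    print("wide-form error:", err)
plt.close("all")  # start the next plot on a fresh figure

# Naming the x and y columns tells seaborn exactly what to draw.
ax = seaborn.scatterplot(x="total_bill", y="tip", data=tips)
print(ax.get_xlabel(), "vs", ax.get_ylabel())
```

Swapping x="total_bill" for x="size" gives the second, less interesting plot mentioned above.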
But actually, I think we care more about the tip percentage as opposed to the total tip amount because that'll even this out and it might be easier to find outliers or patterns. So, we're gonna have to add a new column to our dataframe and to do that, if you watched the lesson on dictionaries, the syntax here to add a new column to the dataframe is gonna be similar to adding a new value in a dictionary. So, we're gonna take the dataframe called tips and then use these square brackets and then put in a name. So, let's put in percent. This is gonna be the tip percent, so actually maybe we want to call it tip percent. And then equals, and now we can base that new column off of other columns. So we could just make everything 0.2 and that would be the percentage, or proportion I guess. And then we can see what tips.head looks like and we've got a new column that's just 0.2 values. But we can base this off of other columns in the table, so I can say tips and then get the column for tip and divide it by the column for total bill. And so I've run both of these again and we can see the percentage here, this person only tipped 6%, this one 16, 14, 15, et cetera. So that's cool. Let's think of other columns we can add while we're at it. One we can do is bill per person. So I can add bill per person and that's gonna be based on the total bill as well as the size of the party, so total bill divided by size. And so now we've got another column here. Okay, so now let's use this to create another scatter plot. So seaborn.scatterplot, I can press Tab to finish that off, and then x equals, let's keep it at total bill for now, and then y equals percent and data equals tips. So let's run that. Oh right, tip percent. Okay, so we can see that there's an outlier here again and it's even more pronounced, now that the relationship isn't a rising line anymore, it's more like a constant. And the average does look to be about 0.15. Let's see, we can do tips.describe again and now we get some more statistics.
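The column arithmetic described above could look something like this. The numbers are invented so the results are easy to check by eye, and the exact column spellings tip_percent and bill_per_person are assumptions, since the video only says the names aloud.

```python
import pandas as pd

# Made-up rows; the real lesson uses a larger tips dataframe.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 4.0, 4.5, 8.0],
    "size": [1, 2, 3, 4],
})

# Assigning to a new name in square brackets adds a column,
# much like adding a key to a dictionary.
tips["tip_percent"] = 0.2  # a single value is repeated on every row

# Basing the column on other columns works element by element.
tips["tip_percent"] = tips["tip"] / tips["total_bill"]
tips["bill_per_person"] = tips["total_bill"] / tips["size"]

print(tips.head())
print(tips["tip_percent"].describe())
```

Note that the size column is read with tips["size"] rather than tips.size, because size is also the name of a built-in DataFrame attribute.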
So the mean is 16% and the max is 71% and the min is 3.6%. And the middle of the range, going by the quartiles, is roughly from 12% to 20%. We can now change this up. Let's also add a hue, so this is gonna be the color. So what happens if we put hue as a numerical value? We can put size, and since it's a number we get different shades, and it looks like the total bill is generally higher when there are more people eating. That all makes sense; it's not necessarily very useful information. Let's put in something else, like the sex column, and we can see the difference in tipping between males and females, and then let's do day, so the days look like Friday, Thursday, Saturday, Sunday, and then time. Okay, so nothing is popping out yet, but we can see that the values of time are lunch or dinner. And so now let's look more closely at hue with a different kind of column, a categorical one. So let's look at sex. Okay, so we can see the breakdown between male and female patrons, and it looks like we have a number of different labels here instead of just female and male: we also have lowercase male and just like f for female. So it looks like we probably want to clean some of this data up so that it's consistent, and then see if anything changes. Let's use time; lunch and dinner, that looks good. And then day, and we're just looking at Thursday through Sunday. We can also add a hue_order in here if we wanted to make these appear in a certain order. Let's see, oh yeah, we also have smoker. Okay, so one thing that's standing out a little bit is that some of these higher tippers were in smoking sections. But there are also some lower tippers there. There's another type of plot that I wanted to show you and it's in the categorical plots, so there's a few here like the box plot, and an example, yeah, it looks like this.
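As a rough sketch of the hue and hue_order step, and of how the inconsistent sex labels might be spotted, consider the following. The rows and the day labels ("Thur", "Fri", "Sat", "Sun") are made up for illustration, and counting labels with value_counts is an added suggestion rather than something shown in the video.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; a notebook would not need this
import pandas as pd
import seaborn

# Made-up rows, not the lesson's data; note the inconsistent sex labels.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 12.0, 25.0, 18.0, 50.0],
    "tip": [1.0, 4.0, 4.5, 8.0, 3.0, 2.5, 2.7, 5.0],
    "sex": ["Female", "Male", "male", "f", "Male", "Female", "Male", "Female"],
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
})
tips["tip_percent"] = tips["tip"] / tips["total_bill"]

# Counting distinct labels is a quick way to spot values that need cleaning.
print(tips["sex"].value_counts())

# hue colors the points by a column; hue_order fixes the order of its levels.
day_order = ["Thur", "Fri", "Sat", "Sun"]
ax = seaborn.scatterplot(
    x="total_bill", y="tip_percent", hue="day", hue_order=day_order, data=tips
)
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```

Changing hue to "sex" on data like this would show four legend entries instead of two, which is the symptom described above.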
So you've got a few different categories on the bottom, so this is not a linear scale, and then you also have some statistics like the median and the 25th and 75th percentiles, and then some outliers here. But what I wanted to show you was a different one and it's the violin plot. So they look something like this. They're a little more suggestive, but they are better at showing where the averages are and what the spread is. So, let's try that on some of these characteristics. So I can say plot equals seaborn.violinplot and the x axis now can be something non-numerical like day, and then the y axis, let's go back to using percent. And then data equals tips. Oh yeah, tip percent. So we're gonna get an error here. Let's just ignore it for now. And you can see that Thursday and Friday are more even tippers, whereas Saturday and Sunday have a larger spread, so you could get less and you could get more. And let's try time, so lunch is also more even. The mean is about the same, but dinner has a wider spread. And then let's also try sex, and we've got some of this data here that we need to clean up, and then some data here for female and male. Lastly we have smoker. Yeah, interesting, so I don't really know what it's saying, but it does look like non-smokers are more consistent tippers as well, and smokers are more capricious; it depends on how they're feeling that day, I don't know. So yeah, next let's work on cleaning some of that data and then looking at these plots over again.
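A minimal sketch of the violin plot call described above, again on made-up rows with assumed column names. It uses time on the x axis; any of the other categorical columns mentioned (day, sex, smoker) could be swapped in the same way.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; a notebook would not need this
import pandas as pd
import seaborn

# Made-up rows; four lunch and four dinner bills with differing tips.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 12.0, 25.0, 18.0, 50.0],
    "tip": [1.0, 4.0, 4.5, 8.0, 3.0, 2.5, 2.7, 5.0],
    "time": ["Lunch", "Lunch", "Lunch", "Lunch",
             "Dinner", "Dinner", "Dinner", "Dinner"],
})
tips["tip_percent"] = tips["tip"] / tips["total_bill"]

# A categorical column goes on x; the numeric column whose
# distribution we want to compare goes on y.
plot = seaborn.violinplot(x="time", y="tip_percent", data=tips)
print(plot.get_xlabel(), plot.get_ylabel())
```

Each violin's width shows where the values concentrate, so a taller, thinner shape corresponds to the wider spread described for dinner and for the weekend days.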