6.5 Clean some data - Video Tutorials & Practice Problems
Video duration:
16m
Play a video:
<v ->Okay, so in the last sub lesson,</v> we saw that not all of the data was exactly how we wanted it to be. So there's an important step. When you're doing data analysis, which is cleaning data. Why would you need to clean data? Well, sometimes you actually have missing data. And you have to figure out how to deal with those missing values. Do you create a default value? And say, if it's missing, then, assume this is the case. Or do you like ignore those. That row, ignore those instances? Because now they're not very useful. If there are a lot of missing data for a certain column, that you could even just delete that column, if you don't think that it will be very necessary. And very important. There's also like, kind of wrong off data. So if you've got unrealistic data in one of your columns, then maybe inserted in the wrong units. So like maybe the order of magnitude is off, maybe it's in temperature, like Celsius instead of Fahrenheit, things like that. So you also wanna look for that kind of thing. And then also, if there are some outliers, and they're just like, way out there, you may be don't want to have that skewing your results. Because that's going to affect your mean values, and whatnot. So, it might have an like too large of an impact on the kind of thing that you're looking for. Now, let's work on cleaning the data that we have in our tips.csv file. One thing that we can do is Get a column and the tips maybe the sex one, we know that that's a little off. And we can use this method called value counts. And we can see what the count of the different values there are in this file. And it's a big mess. Mostly because I edited it to make sure that there were some different values in here. And I wasn't actually expecting it to look like this. So you can see that there's like, some values that look similar, but some are in quotes, and some are not. Yeah, so we want to convert all of these into, just male and female. So now with this cleaning data, we're gonna be using pandas more to manipulate the values that we have in the data frame. And, let's look at the documentation a little bit. So, we've got this API reference. And, so tips itself is a data frame. So we can see the different things you can do on data frames. There is a lot, I will say I'm a little like overwhelmed with all of the different things that you can do. Because this isn't my field of expertise. But then also, if I get a column, this is now just a series instead of a data frame. A data frame would be like multi dimensional, so like have have more than one dimension. This one has just one dimension, just the sex and then a bunch of rows. So in that case, it would be series. So we can go in here and see what you can do on a series. And the one we're going to use let's see where did I find it, values counts. So it's in computations or descriptive stats. And, we can look and see how to use that, what parameters it takes. Yeah, we don't have to pass anything in if we don't want to. So let's quickly look, so we know we want to change this, these values and actually let's just insert a new cell above it and have some descriptive texts. So, we're now going to be cleaning some data clean the data, run it. And then let's add another one. And this is going to be like for column Sex. So there are a few ways we can do this. We can actually run some string methods on this column, so like string, and I can make everything lower. And then you'd get everything lowercase. And then I can save that. If I wanted. If I wanted to like actually save, that as the new value and change them in the data frame, then I could do this. I am instead going to do title, and that's going to make, turn it back into something that looks like this. So if they're all lowercase, then it'll capitalize the first letter. So we can look at, the head and just see, Yeah, these are the same, okay. Or I can look at this values counts again, and run that. And so now some of them are capitalized, or they're all capitalized. And then we want to search for if the string contains, a capital F, we can make it female. And if it has a capital M, we can make it male. So to do that, we can use this, LOC lock method, here to access a group of rows and columns by labels or a Boolean array. So I do find the documentation a little bit confusing. You can also search on Google and there are some good Stack Overflow responses to some of the basic questions you might have. So this you can actually access by saying tips sex and then we can put a, this can return a Boolean. And if it returns a Boolean, then it'll return just those rows. Okay, so in this case, if we were just to print this, then it'll return every row, and indicate whether it's true or false. But now if we put that inside of this lock method, then it only returns the rows that are true. And then if we wanted to change these values, we can put another value in here, that's the name of the column that you wanna change. Or, like this could be, a new column, but we want to change the same column. And then we can set the value that we want there. And we can run this again. And now all of the female ones are in the right, in the same format. So we can do the same thing. We're going to keep all of this here so that every time we run the code, it will change both of them. And we're looking for capital M. If it was a lowercase m, then we would return these as well. And we can do that and now the data is all clean. So, now we can look at this sex and, we only have two different values here. So, some other things that you can do are if there's some missing data you can fill NA which allows you to fill with a specified value. You can use this method to check for you can put in a different comparison here and check for if the values like out of a range that you would expect. Okay, and then let's also just quickly see if any of these other columns need changing. So, got day those all look good. Time it also good too, smoker. This is a little weird. We've got some quotes here. So we can do that we'll add some more markdown, column smoker. So now with the smoker field. We can change it in a different way, we can say that, that column is gonna equal the same column, but with the string replacement of a double quote. So it automatically completed this quote, but I want to remove one 'cause I want to just check for a single double quote. And I'll run that and then See what happens here. So now these no values still have a quote at the beginning. And it looks like this is actually a different kind of quote. So I'm gonna duplicate this and just copy and paste this in here. And, run that. And now that those are gone and they're added to the no pile. So now that our data is a little bit cleaner, what we can do is move them all. So I'm holding down shift, and I'm gonna move all this data cleaning to the top or not to the top, but like before the data plotting I'll add a new cell here. For plot, the data. And then if I wanted to, I could just look for a couple of stats. There are a lot of stats you can find in here, but I'm just gonna use that describe method again to, but like break it down by the different groupings. So to do that, I can look for just the columns that I want to find the stats for. So if I want to group by the day, and I want to see what the tip percentage is on those days I can do that. See what that looks like. So this just gives me the full data set with two columns. I can say, group by day. And now it creates a new data frame object. And, I can describe it. And here we can see what the mean tip percentages for the different days. And we can see that seventeen percent on Fridays, as opposed to fifteen point three percent on Saturdays. And I can do that for the different values too, and see the difference between these things. So this is like getting the kind of stats that you would in a box plot, like similar to this violin plot with the different categories on the x axis. Yeah. And then if we wanted to, we could compare and see, what the biggest difference was like what the, if like a single category made a big difference. And of course, there's like a lot more statistics, you could run on this or maybe even machine learning models, given like predicting how much a person is going to tip based on when they come in, and where they sit. And that's all that I have to show you from the data analysis. There is a lot more that you can do with and like getting into time series data, and geographical data are also things that you can do with Python. And there's a whole world out there, but hopefully this is enough to get you started.