15: Basic Statistics
15.2 Calculate averages, standard deviations and correlations
Video duration: 16m
For most people, when they think of statistics, they think of averages. Of course, R can handle averages and much more: means, standard deviations, correlations, and we'll take a look at a lot of that.

To start, we need some sample data, so let's draw a bunch of random numbers from one through 100. We'll say x gets sample. sample is a function that you feed a vector and it randomly draws from it. We'll feed it one through 100, ask for a sample of size 100, and since we don't want to just reshuffle the numbers, we actually want random draws with replacement, we say replace equals TRUE. Take a look at those: we just have randomly drawn numbers. To take the average of these numbers, we just do mean of x. This is the simple arithmetic mean we're all familiar with, shown in this equation: the average value is the sum of all the values divided by the number of values. Just your simple, regular mean.

Now, let's say we're dealing with some missing values, as happens a lot in statistics. Let's first make a copy of x into y, so we don't mess with our original x, and randomly put some NAs in there. Out of the indices one through 100, we'll choose twenty of them, and we say replace equals FALSE because we don't want to pick the same one twice, and we set those elements to NA. If we look at y, we see there are a number of NAs in here. If we were to take the mean now, we get NA, because if there are any NAs in the data, the whole result has to be NA, unless you explicitly say na.rm equals TRUE, meaning remove all the NAs and then calculate the mean.

Now let's say we want to do a weighted average. A weighted average is what expected value actually means, as shown in this equation: each observation is weighted, so it's the value you get times the probability of getting that value. It's a standard weighted mean, and it's at the heart of a lot of statistics. To do this, let's create some fake data. We'll say grades gets a 95, then a 72, an 87 and a 66. The weights for those exams are half the grade, a quarter, an eighth and an eighth. If we just did a straight-up mean of the grades, you have an 80 average, but weighted mean takes two arguments: the numbers you want to average, and the weights, which in our case are called weights. If you take the weighted mean of these scores, you have an 84.6. That's a better score; I think that will make people happy.

People can report means all they want, but if they don't report some measure of uncertainty, such as the variance, it doesn't really tell you much. You could have an average of 80 with a variance anywhere from twenty to a million, and that's not a very good result, so we want to know the variance. That's simple enough using the var function. While our average for x, which we'll pull up again to see, is 41.06, the variance is 826. The variance is defined as follows: it is the sum of each observation minus the average value, squared, divided by the number of observations minus one. This is a measure of how spread out the data are; roughly, on average, how far each point is from the center. The smaller the variance, the more certain you are about your estimate.
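As a quick reference, here is a minimal sketch of the commands narrated above. The set.seed call is an assumption added here for reproducibility (the video draws without one), so the actual numbers on screen will differ.

```r
set.seed(1234)  # assumption: not in the video; makes the draws repeatable

# Draw 100 values from 1:100, with replacement
x <- sample(1:100, size = 100, replace = TRUE)
mean(x)  # arithmetic mean: sum(x) / length(x)

# Copy x, then randomly knock out 20 of its values
y <- x
y[sample(1:100, size = 20, replace = FALSE)] <- NA
mean(y)               # NA: any missing value makes the mean NA
mean(y, na.rm = TRUE) # remove the NAs, then average the rest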
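And the weighted mean and variance from the same walkthrough, using the grades and weights given in the video:

```r
grades  <- c(95, 72, 87, 66)
weights <- c(1/2, 1/4, 1/8, 1/8)

mean(grades)                    # unweighted: 80
weighted.mean(grades, weights)  # weighted: 84.625

# Variance of the random sample: sum((x - mean(x))^2) / (n - 1)
var(x)
```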
We can confirm this formula by coding it straight into R. What we'll do is take the sum of x minus the mean of x, square that, close off the sum, and then divide by the length of x minus one. We get the exact same number. Remember, this formula works because R is vectorized: x is 100 numbers and mean of x is one number, but R subtracts that mean from each of the 100 numbers one at a time, squares all of them, then sums them up. That's how the formula works.

Standard deviation, perhaps a more understandable version of variance, is simply its square root. We can take the square root of the variance of x to find its standard deviation, or we can simply use the sd function, which gets the exact same result. Much like with mean, you have to worry about missing values, so the standard deviation of y is NA, but if you tell it to remove the missing values, sd works.

Other measures you might want for your data are the sum, the minimum and the maximum of x. Even with these, you have to be careful about missing values; just like before, if you have a missing value in your data and you want it removed, you pass na.rm equals TRUE. To get a lot of these in one shot instead of calculating each individually, you can run summary of x, which gives you the minimum, the first quartile, the median, the mean, the third quartile and the maximum: just a bit of nice summary information. Running summary on y, which remember has missing values, gives you the same basic information plus the number of NAs. That's very nice to have.

Those quartiles represent the observations at the 25th and 75th percentiles, and you can find them yourself using the quantile function: call quantile of x and ask for the 0.25 and 0.75 points. We run this, and we get the same numbers we saw in our summary. Once again, if you have missing data, use na.rm equals TRUE. And you're not limited to the 25th and 75th percentiles; we could ask for 0.1, 0.25, 0.5, 0.75 and 0.9 and get them all back. It even labels them as percentages.

Correlation is a very important concept in statistics. It tells you how related one variable is to another. To look at that, let's load the economics data set that comes with ggplot2. So, require ggplot2, and we'll look at economics just to see what we're getting into; and if we spell it correctly, it works much better. Here we're probably interested in the correlation between personal consumption expenditures (pce) and the personal savings rate (psavert); there's probably some sort of relationship there. To find the correlation between these two variables, we do cor of economics dollar pce and economics dollar psavert, and we see there's a negative 0.92 correlation, meaning that when people are saving, they're not spending, and when they're spending, they're not saving. That makes complete sense. The formula for correlation is a bit more complex than what we've seen so far: the correlation between x and y is each value of x minus the mean of x, times each value of y minus the mean of y, all summed up, divided by the number of observations minus one times the standard deviation of x times the standard deviation of y. That is what the cor function gets us in R.
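The same steps as code; the specific numbers (41.06, 826 and so on) depend on the random draw above, so yours will differ from the video's.

```r
# Variance by hand matches var(x)
sum((x - mean(x))^2) / (length(x) - 1)
var(x)

# Standard deviation: square root of the variance
sqrt(var(x))
sd(x)
sd(y, na.rm = TRUE)  # y has NAs, so strip them first

# Other one-number summaries; all accept na.rm = TRUE
sum(x)
min(x)
max(x)

summary(x)  # min, 1st quartile, median, mean, 3rd quartile, max
summary(y)  # the same, plus a count of the NAs

quantile(x, probs = c(0.25, 0.75))
quantile(x, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
```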
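And the correlation on the economics data. The with() version below is my own restatement of the equation described above, not something typed in the video, included as a check that cor matches the definition.

```r
require(ggplot2)  # the economics data set ships with ggplot2
head(economics)

cor(economics$pce, economics$psavert)

# The definition, spelled out: sum((x - mx) * (y - my)) / ((n - 1) * sx * sy)
with(economics,
     sum((pce - mean(pce)) * (psavert - mean(psavert))) /
         ((length(pce) - 1) * sd(pce) * sd(psavert)))
```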
And cor works on a lot more than just two variables; you can give it a whole matrix of variables. So we'll call cor on economics, subset it, and say we want columns two and four through six. Running this, we get a nice little correlation matrix. Along the diagonal, the correlations are all one, because every variable is perfectly correlated with itself. The off-diagonals show the correlation between pairs of variables, and the matrix is symmetric, so the correlation between psavert and pce is the same as the correlation between pce and psavert. That just makes sense.

A pile of numbers like this can be hard to interpret, so sometimes it's better to visualize it. Let's make a heat map. The first thing we'll do is save the correlation matrix as a variable, econCor, just copying the line from before. Now we want to melt this down, so we'll use the reshape2 package: require reshape2, and we'll create a melted variable, econMelt. We say melt of econCor, the varnames are x and y, and the value.name is Correlation. Melt this and take a look: it gives us the correlation between every pair of variables. Next we order it according to the correlation, so econMelt gets econMelt ordered on its Correlation column, ascending by default, and now it's ready to be plotted.

Since it's a heat map, it takes a lot more effort, but it shows off a pretty graph. We initialize ggplot as always with econMelt, and map the x axis to the variable x and the y axis to the variable y. We then add geom_tile, whose aesthetic is fill equals Correlation. We add some things to make it look a little better, such as scale_fill_gradient2, saying the low value should be muted red (muted just makes it look a little tamer), the mid should be white, and the high should be steel blue. Yes, I tested these colors beforehand; that's how I know they're good. We give it a special guide, guide_colorbar, which makes the legend look a lot better: we tell it no ticks, but make the bar height 10. We say the limits should be negative one to one, because correlation runs from negative one to one. We use the minimal theme, since we don't want anything cluttering things up, and we set the axis labels to NULL so they don't appear.

Now, if we were to run this right now, with nothing else, it might not work, because we never loaded the scales package, so we come back up here and require scales. If we run it again and it still doesn't work, it's because we have a typo in Correlation; there are two r's in Correlation. A nice thing about RStudio is that if you want to rerun the exact same line of code, you hit Control-Shift-P. As we zoom in, we have a nice heat map showing us exactly what is going on. Of course, along the diagonal there's perfect correlation, because every variable is correlated with itself. The redder a cell is, the more negatively correlated those two variables are. As we said before, the savings rate is very negatively correlated with expenditures. This is a good way to visualize things.
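Put together, the heat map code looks roughly like this. The column positions (2 and 4 through 6) follow what's narrated above; note that guide arguments such as ticks and barheight have shifted across ggplot2 releases, so treat this as a sketch against the version used in the video.

```r
require(ggplot2)
require(reshape2)
require(scales)  # provides muted()

# Correlation matrix of the selected economics columns
econCor <- cor(economics[, c(2, 4:6)])

# Melt into long form: one row per pair of variables
econMelt <- melt(econCor, varnames = c("x", "y"),
                 value.name = "Correlation")
econMelt <- econMelt[order(econMelt$Correlation), ]  # ascending

ggplot(econMelt, aes(x = x, y = y)) +
    geom_tile(aes(fill = Correlation)) +
    scale_fill_gradient2(low = muted("red"), mid = "white",
                         high = "steelblue",
                         guide = guide_colorbar(ticks = FALSE,
                                                barheight = 10),
                         limits = c(-1, 1)) +
    theme_minimal() +
    labs(x = NULL, y = NULL)
```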
Dealing with missing data in correlations is actually quite a bit more complicated than dealing with missing data in other settings. So, let's clear the console and build some vectors that illustrate this nicely. I'm going to create five vectors, some with missing values in certain places. We run all these vectors, putting them into memory, and create a matrix out of them: theMat gets cbind of m, n, p, q, r. If we look at theMat, we see what they look like. If we were to call cor on this, we get a lot of NAs. That's because correlation is a pair-wise operation: if it has one number going against a nonexistent number, it can't do the math.

The cor function offers a number of different ways to handle missing values, all accessed through the use argument. The default, which we just saw, is called everything. What it means is that any two vectors you're comparing both need to be entirely free of NAs. That's why only the diagonal has ones, because everything is perfectly correlated with itself, and the correlation between q and r exists, because neither of those vectors has any NAs.

Another option is cor of theMat with use equals all.obs. This means all observations have to be present, or it gives you an error. This is the strictest.

Another possibility is to keep only the rows that are completely clean. In this case we say cor of theMat with use equals complete.obs, and it goes through and keeps what records it can. Looking at the matrix again, it kept rows one, four, seven, nine and 10, and it computed the correlations based on those. If it can't find any rows without an NA, it returns an error.

Similar to that is cor of theMat with use equals na.or.complete. It does the same thing, narrowing the data down to complete rows; the difference is that if it can't find any complete rows, it returns NA instead of an error. To test that it is the same, we say cor of theMat taking just rows one, four, seven, nine and 10, and it gives the same result.

The last way, which is the most permissive but also the most computationally intensive, compares each column against every other column individually and keeps all rows where both columns have values; if there's a row where one is NA and the other isn't, it throws out that row for that pair. That is cor of theMat with use equals pairwise.complete.obs. As the data set grows wider, with more columns, that can get very computationally intensive.

No conversation about correlation would be complete without noting that correlation does not always imply causation. That is very important to remember, but it shouldn't get in your way. These are the basic measures most people are taught in an intro statistics class, and they make up the foundation of statistics: means, standard deviations and correlations.
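As a closing recap of that missing-data example: the exact numbers in the five vectors aren't read out in the narration, so the values below are assumptions; the NAs are placed so that rows 1, 4, 7, 9 and 10 are the complete ones and q and r have none, matching the behavior described above.

```r
# Assumed values; NAs placed to match the behavior described in the video
m <- c(9, 9, NA, 3, NA, 5, 8, 1, 10, 4)
n <- c(2, NA, 1, 6, 6, 4, 1, 1, 6, 7)
p <- c(8, 4, 3, 9, 10, NA, 3, NA, 9, 9)
q <- c(10, 10, 7, 8, 4, 2, 8, 5, 5, 2)
r <- c(1, 9, 7, 6, 5, 6, 2, 7, 9, 10)
theMat <- cbind(m, n, p, q, r)

cor(theMat)                                # use = "everything": mostly NA
# cor(theMat, use = "all.obs")             # errors because NAs are present
cor(theMat, use = "complete.obs")          # keeps rows 1, 4, 7, 9, 10 only
cor(theMat, use = "na.or.complete")        # same, but NA if no complete rows
cor(theMat[c(1, 4, 7, 9, 10), ])           # identical to the two above
cor(theMat, use = "pairwise.complete.obs") # complete rows per column pair
```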