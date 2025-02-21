In recent videos, we've talked independently about mean and median, which are measures of center. These are numbers that summarize the data in one value that's roughly around the center or the middle. Now I want to talk about these things a little bit more together in this video because clearly what we've seen is that mean and median have their own distinct pros and cons. So we're just going to jump right into this example, and I'm going to show you the advantages and disadvantages of using mean versus median in certain datasets. Alright?

Let's go ahead and get started, and we'll do a couple of examples together. We have all these numbers here that we've seen before. These are sample data, and you've already done most of the calculations in other videos. So let's just go ahead and take a look.

So when we had the sample of five numbers and calculated the mean, where you add up all the values and divide by how many there are, we got 8.8. And then all of a sudden we introduced an outlier, an extreme value relative to everything else. What that made us do is add the 76 in and divide by the new count of six, and then we got a mean that was equal to 20. Alright. So we used all of the values in the dataset to get these numbers, which is actually an advantage when it comes to the mean.

It's an advantage because you're using all of the values, and it helps you get a better picture of the whole entire dataset. However, what we can see here is that one clear disadvantage is that one extreme value, like this huge outlier of 76 over here, can change the mean by a lot. In this case, it changed it by 11.2, which is huge relative to the other numbers there. Alright? So one extreme value can change the mean by a lot.
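Here is a minimal sketch of that arithmetic in Python. The five sample values are an assumption: the actual numbers from the earlier video aren't listed here, so these are hypothetical values picked only because they give the same mean of 8.8, and the same mean of 20 once 76 is added.

```python
from statistics import mean

# Hypothetical sample: chosen only so the mean matches the 8.8 from the video.
sample = [4, 6, 10, 12, 12]
with_outlier = sample + [76]  # add the extreme value

mean_before = mean(sample)        # 44 / 5
mean_after = mean(with_outlier)   # 120 / 6

print("mean without outlier:", mean_before)
print("mean with outlier:   ", mean_after)
print("change in mean:      ", round(mean_after - mean_before, 1))
```

One value out of six moves the mean by 11.2, which is more than the mean of the original five numbers.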

Therefore, whenever you have these types of problems where they're asking you what measure of center is the best, the best situation to use the mean is where the data is symmetric and without outliers because those outliers are significantly going to change that mean. Let's move on to the median. Alright? So the median was pretty straightforward. You just organize the data from smallest to largest, and you look at the middle numbers.

With the original sample of five numbers, we found a median of 10. Then when we threw in this outlier of 76, there were two middle numbers, so we took 10 plus 12 over two, which is 11. So in other words, we threw in an outlier, and the median only increased by one. Alright.
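The same check can be sketched for the median. Again, the sample values are hypothetical stand-ins, chosen so that the two middle numbers after adding 76 are 10 and 12, as described above.

```python
from statistics import median

# Hypothetical sample, already sorted from smallest to largest.
sample = [4, 6, 10, 12, 12]
with_outlier = sorted(sample + [76])

median_before = median(sample)        # odd count: the single middle value
median_after = median(with_outlier)   # even count: average of the two middle values

print("median without outlier:", median_before)
print("median with outlier:   ", median_after)
print("change in median:      ", median_after - median_before)
```

With an odd number of values, `median` returns the single middle value. With an even number, it averages the two middle values, which is the 10 plus 12 over two step.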

So one disadvantage of using the median is that you're only really looking at maybe one or two values in the dataset. You don't get a clear picture of all the data. However, one clear advantage we can see here is that this outlier only changed the median by one. Alright? It didn't have such a drastic effect like it did on the mean.

Alright? And so what happens here is that this median is going to be the best measure of center whenever the data is roughly symmetric with outliers because it's basically just going to ignore some of those outliers. This idea that the mean and median can significantly change or not relative to outliers is called resistance. So sometimes what you may see in your books is the mean is not resistant because significant outliers can drastically change it, whereas the median is resistant because even an outlier like 76 only changed the median by one. Alright?

So that's the idea here. Those are some advantages and disadvantages to mean versus median. But let's take a look at our last problem over here, which is a little bit more of a problem that you may actually see, which is without calculating, we're gonna determine if the mean or median best represents the center of the graph data. In other words, which one of these things is the best measure of the center? Is it the mean or the median?

So let's take a look here. And again, we're not gonna calculate anything. So we've got this histogram, which tells us the salary of graduate students in the thousands. We've got most of the data that's clustered between thirty and sixty thousand, but we have a couple of drastic outliers out here. So what's going on?

Well, if I just imagine that these outliers just don't exist, then basically what we can say here is that the mean and the median would be roughly around the same point. Right? So the mean would probably be somewhere over here, that's x̄, and the median would also be somewhere around here. I'm just gonna call that a capital M. The problem is that when these outliers get introduced, they're gonna significantly pull that mean all the way to the right, such that the mean actually may end up somewhere over here.

I don't actually know because I can't calculate this. Right? So all we know here is that the median is much more resistant to extreme values, whereas the mean gets pulled a lot. So clearly, we can see here that because this data is roughly symmetric around most of the data points, but there are a couple of outliers, the median is going to be the best measure of center. Alright?
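The problem itself asks for no calculation, but if you want to see the pull in numbers, here is a rough sketch. The salaries below are made up for illustration and are not read off the histogram: twelve values clustered between 30 and 60 (in thousands), plus two invented outliers.

```python
from statistics import mean, median

# Made-up salaries in thousands, clustered between 30 and 60.
clustered = [32, 35, 38, 40, 42, 45, 45, 48, 50, 52, 55, 58]
# Two invented extreme values far to the right of the cluster.
outliers = [150, 180]
salaries = clustered + outliers

mean_clean, median_clean = mean(clustered), median(clustered)
mean_all, median_all = mean(salaries), median(salaries)

print("no outliers:   mean =", mean_clean, " median =", median_clean)
print("with outliers: mean =", round(mean_all, 1), " median =", median_all)
```

Without the outliers, the mean and median land on the same point. With them, the mean gets pulled past 60, outside the cluster where most of the data sits, while the median stays inside it.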

So the mean is definitely not going to be the best here. By the way, this is why in the real world, when we talk about salaries, oftentimes we talk about the median salary. Most people in the US make somewhere in this range of income. However, the one or few people who make millions or billions of dollars significantly affect that mean, whereas the median is much more representative of what the typical citizen of the US actually makes. Alright.

So in this case, it would be the median. Alright, folks, that's it for this one. Let me know if you have any questions.