19.2 Robustly cluster, even with categorical data, with PAM
A few drawbacks of k-means clustering are that it only handles numeric data, not categorical, and it's not very robust to outliers. Where k-means finds the actual center of the data, sort of like an average, hence "means", there's another algorithm called "partitioning around medoids", or PAM. This uses a representative center, sort of like a median: instead of finding the center of a cloud of points, which could be a numeric point that none of the observations actually occupy, it finds an actual observation to represent the center, called a medoid. So, much like a median, it's robust to outliers, and it can handle categorical data.

To take a look at this, we'll stick with the wine data, and we'll load the cluster package. We'll make winePam using the pam function: x is the wine training data, we'll tell it four clusters, we'll set keep.diss to true, because we'll want the dissimilarity measure later for plotting, and keep.data to true. We run this, and we get a result. Much like k-means, this provides a lot of information: not numeric centers, but actual points that represent the medoids, and a clustering vector to show cluster membership.

A great way to visualize this is with a silhouette plot. We call plot on winePam, with which.plots equal to two and main left blank. When we view this, we get a silhouette plot. Each distinct chunk represents a different cluster, and each line in a chunk represents an observation. The longer the line, the greater that observation's within-cluster similarity. So you want chunks with lots of long lines; that means those clusters fit together pretty well. Shorter lines mean observations that don't fit in as well, and you want fewer of those.
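The steps narrated above can be sketched as follows. This assumes a training data frame named wine_train, which is not shown in the transcript; any numeric data frame would work the same way.

```r
library(cluster)  # provides pam() and the silhouette plot method

# Fit PAM with four clusters; wine_train is an assumed name for the
# wine training data described in the video.
winePam <- pam(x = wine_train,
               k = 4,
               keep.diss = TRUE,   # keep dissimilarities, needed for plotting
               keep.data = TRUE)   # keep a copy of the data in the result

winePam$medoids     # actual observations chosen as cluster centers
winePam$clustering  # cluster membership for each observation

# Silhouette plot: which.plots = 2 selects it; main = "" blanks the title.
plot(winePam, which.plots = 2, main = "")
```

Setting keep.diss = TRUE matters here: the silhouette plot is computed from the dissimilarity matrix, so dropping it would force pam to recompute or fail to plot.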
So PAM clustering is a great alternative to k-means, especially when you have either categorical data or outliers.