How to understand the drawbacks of K-means
K-means is a widely used method in cluster analysis. In my understanding, this method does NOT require ANY assumptions, i.e., give me a dataset and a pre-specified number of clusters, k, and I just apply this algorithm, which minimizes the sum of squared errors (SSE), the within-cluster squared error.
So k-means is essentially an optimization problem.
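To make the optimization view concrete, here is a minimal sketch of Lloyd's algorithm (the usual k-means heuristic) in NumPy; this is a toy for intuition, not a production implementation, and the sample data are made up:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Toy Lloyd's algorithm: alternate between assigning each point to
    its nearest centroid and moving each centroid to the mean of its
    points. Every step can only decrease the SSE objective."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: nearest centroid by squared Euclidean distance
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # update step: each centroid becomes the mean of its cluster
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    sse = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, sse

# two well-separated toy blobs
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
labels, centroids, sse = lloyd_kmeans(X, 2)
```

Note that each of the two steps can only lower the SSE, which is why the procedure converges — but only to a local minimum.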
I read some material about the drawbacks of k-means. Most of them say that:
- k-means assumes the variance of the distribution of each attribute (variable) is spherical;
- all variables have the same variance;
- the prior probability for all k clusters is the same, i.e., each cluster has a roughly equal number of observations;
If any one of these 3 assumptions is violated, then k-means will fail.
I could not understand the logic behind this statement. I think the k-means method makes essentially no assumptions, it just minimizes the SSE, so I cannot see the link between minimizing the SSE and those 3 "assumptions".
A good question. But did you try to look for the answers on this site? These issues have been raised in various questions.
The key assumptions of k-means are: 1. there _are_ k clusters. 2. SSE is the *right objective* to minimize. 3. all clusters have the *same* SSE. 4. all variables have the same importance for every cluster. These are pretty strong assumptions...
To your second question (posted as answer, then deleted): if you want to understand k-means as an optimization problem similar to linear regression, understand it as **quantization**. It tries to find the least-squares approximation of the data using $k$ instances. I.e., as if you actually *replaced* every point by the nearest centroid.
@Anony-Mousse, I read some material and later came up with the following thought: $k$-means as a statistical model (rather than an optimization method) assumes that there are k underlying clusters and that the dispersion of the data is purely due to normal random noise with equal variance. This is analogous to the assumption of the simple linear regression model. Then (I believe, though I haven't found a paper) by some version of the Gauss-Markov theorem, $k$-means will give you a consistent estimator of the means of the k underlying clusters we assumed for our data.
With the additional assumption that the average error is smaller than the difference between the values you can expect k-means to work quite well. If you look at the classic A-sets for clustering, k-means usually fails for at least a few of the clusters. It does a decent job, but some of these clusters are too close, and unless you are really lucky, k-means will choose suboptimal initial centers, and get stuck in a local minimum (I'm not even sure the global minimum is the optimum solution!)
@Anony-Mousse, 'With the additional assumption that the average error is smaller than the difference between the values you can expect k-means to work quite well.', that makes sense. And I should say that under those assumptions, minimizing the SSE will result in a consistent estimator of the means of those k underlying clusters. But mathematically, this optimization problem is NP-hard, and $k$-means is just a heuristic for it, which, as you mentioned, will very likely get trapped in a local optimum rather than the global optimum.
While I like David Robinson's answer here a lot, here's some additional critique of k-means.
Clustering non-clustered data
Run k-means on uniform data, and you will still get clusters! It doesn't tell you when the data just does not cluster, and can take your research into a dead end this way.
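To illustrate (assuming NumPy and scikit-learn are available; the data here are made up), a small sketch: run k-means on data drawn uniformly at random, and it will dutifully report k "clusters" anyway.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 2))  # no cluster structure at all

# k-means happily partitions the uniform square into 3 "clusters"
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(np.bincount(km.labels_))
```

Nothing in the output warns you that the partition is an artifact of the algorithm rather than structure in the data.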
Sensitive to scale
Rescaling your datasets will completely change results. While this itself is not bad, not realizing that you have to spend extra attention to scaling your data is bad. Scaling factors are $d$ extra hidden parameters in k-means that "default" to 1 and thus are easily overlooked, yet have a major impact (but of course this applies to many other algorithms, too).
This is probably what you referred to as "all variables have the same variance". Except that ideally, you would also consider non-linear scaling when appropriate.
Also be aware that it is only a heuristic to scale every axis to have unit variance. This doesn't ensure that k-means works. Scaling depends on the meaning of your data set. And if you have more than one cluster, you would want every cluster (independently) to have the same variance in every variable, too.
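A small sketch of the scale sensitivity (hypothetical toy data, scikit-learn assumed): shrinking one axis flips which points end up grouped together.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0., 0.], [0., 10.], [1., 0.], [1., 10.]])

# On the raw data the y-axis dominates: points pair up by y value.
a = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Rescale y by 1/100 and the same points pair up by x value instead.
Xs = X * np.array([1., 0.01])
b = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xs)
```

Nothing about the data changed except the units of one variable, yet the resulting partition is entirely different.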
Here is a classic counterexample of data sets that k-means cannot cluster. Both axes are i.i.d. in each cluster, so it would be sufficient to do this in 1 dimension. But the clusters have varying variances, and k-means thus splits them incorrectly.
I don't think this counterexample for k-means is covered by your points:
- All clusters are spherical (i.i.d. Gaussian).
- All axes have the same distribution and thus variance.
- Both clusters have 500 elements each.
Yet, k-means still fails badly (and it gets worse if I increase the variance beyond 0.5 for the larger cluster). But it is not the algorithm that failed; it's the assumptions, which don't hold. K-means is working perfectly, it's just optimizing the wrong criterion.
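For intuition, here is a sketch of a 1-dimensional version of this counterexample (scikit-learn assumed; the specific means and variances are illustrative, not the exact ones from the figure): k-means cuts roughly at the midpoint between the two centroids, so the tails of the wide cluster get misassigned to the tight one.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two 1-D Gaussian clusters: a tight one and a wide one, 500 points each.
x = np.concatenate([rng.normal(0.0, 0.1, 500),
                    rng.normal(4.0, 2.0, 500)]).reshape(-1, 1)
truth = np.repeat([0, 1], 500)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(x)

# Count misassigned points (up to label permutation). The wide cluster's
# left tail crosses the midpoint between the centroids and is cut off.
errs = min((labels != truth).sum(), (labels == truth).sum())
```

The error is not due to a bad local minimum here; even the globally SSE-optimal partition puts the boundary in the wrong place for unequal variances.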
Even on perfect data sets, it can get stuck in a local minimum
Below is the best of 10 runs of k-means on the classic A3 data set. This is a synthetic data set, designed for k-means: 50 clusters, each of Gaussian shape, reasonably well separated. Yet, only with k-means++ and 100 iterations did I get the expected result... (below is 10 iterations of regular k-means, for illustration).
You'll quickly find many clusters in this data set where k-means failed to find the correct structure. For example, in the bottom right, a cluster was broken into three parts. But there is no way k-means is going to move one of these centroids to an entirely different place in the data set - it's trapped in a local minimum (and this was already the best of 10 runs!)
And there are many such local minima in this data set. Very often, when the initialization draws two samples from the same cluster, it will get stuck in a minimum where this cluster remains split and two other clusters are merged instead. Not always, but very often. So you need a lot of iterations to get a lucky pick. With 100 iterations of k-means, I still counted 6 errors, and with 1000 iterations I got this down to 4 errors. K-means++, by the way it weights the random samples, works much better on this data set.
Means are continuous
While you can run k-means on binary data (or one-hot encoded categorical data), the results will not be binary anymore. So you do get a result out, but you may be unable to interpret it in the end, because it has a different data type than your original data.
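A quick sketch of this (scikit-learn assumed, made-up one-hot-like data): the cluster centers come out fractional, even though every input is 0 or 1.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of binary (one-hot-like) vectors.
X = np.array([[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0],
              [1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1]], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Cluster means contain fractional values such as 1/3 - they are no
# longer valid values of the original binary variables.
print(km.cluster_centers_)
```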
Hidden assumption: SSE is worth minimizing
This is essentially already present in the above answer, nicely demonstrated with linear regression. There are some use cases where k-means makes perfect sense. When Lloyd had to decode PCM signals, he knew the number of different tones, and minimizing the least-squares error minimizes the chance of decoding errors. And in color quantization of images, you do minimize the color error when reducing the palette, too. But on your data, is the sum of squared deviations a meaningful criterion to minimize?
In the above counterexample, the variance is not worth minimizing, because it depends on the cluster. Instead, a Gaussian mixture model should be fit to the data, as in the figure below:
(But this is not the ultimate method either. It's just as easy to construct data that does not satisfy the "mixture of k Gaussian distributions" assumptions, e.g., by adding a lot of background noise)
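As a sketch of that alternative (scikit-learn's `GaussianMixture`; the means and variances here are illustrative): fitting a GMM to a tight cluster and a wide cluster recovers very different per-component variances, which the k-means objective cannot represent.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# One tight and one wide 1-D Gaussian cluster, 500 points each.
x = np.concatenate([rng.normal(0.0, 0.1, 500),
                    rng.normal(4.0, 2.0, 500)]).reshape(-1, 1)

# Unlike k-means, a GMM fits a separate variance per component,
# so the decision boundary accounts for the cluster widths.
gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
vars_ = np.sort(gmm.covariances_.ravel())
```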
Too easy to use badly
All in all, it's too easy to throw k-means on your data, and nevertheless get a result out (that is pretty much random, but you won't notice). I think it would be better to have a method which can fail if you haven't understood your data...
K-means as quantization
If you want a theoretical model of what k-means does, consider it a quantization approach, not a clustering algorithm.
The objective of k-means - minimizing the squared error - is a reasonable choice if you replace every object by its nearest centroid. (It makes a lot less sense if you inspect the groups' original data, IMHO.)
There are very good use cases for this. The original PCM use case of Lloyd comes to mind, or e.g. color quantization (see Wikipedia). If you want to reduce an image to k colors, you do want to replace every pixel with the nearest centroid. Minimizing the squared color deviation then does measure the L2 optimality of the image approximation using $k$ colors only.
This quantization is probably quite similar to the linear regression example. Linear regression finds the best linear model. And k-means finds (sometimes) the best reduction to k values of a multidimensional data set. Where "best" is the least squared error.
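A minimal quantization sketch (scikit-learn assumed; the "signal" is a made-up stand-in for PCM samples): replacing each value with its nearest centroid gives a far better L2 approximation than a single level would.

```python
import numpy as np
from sklearn.cluster import KMeans

# A 1-D "signal" with values near two levels, as in PCM decoding.
signal = np.array([0.0, 0.1, 0.2, 9.9, 10.0, 10.1]).reshape(-1, 1)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(signal)

# Quantize: replace every value by its nearest centroid.
quantized = km.cluster_centers_[km.labels_]
sse_k2 = ((signal - quantized) ** 2).sum()

# Compare against quantizing to a single level (the overall mean).
sse_k1 = ((signal - signal.mean()) ** 2).sum()
```

In this sense "best" really is least squared error, exactly as in the regression analogy.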
IMHO, k-means is a good quantization algorithm (see the first image in this post - if you want to approximate the data set with two points, this is a reasonable choice!). If you want to do cluster analysis as in *discovering structure*, then k-means is IMHO not the best choice. It tends to find clusters when there are no clusters, and it cannot recognize various structures you see a lot in real data.
Fine print: all images were generated with ELKI. Data were generated using the .xmldata generation format, but they are so basic it is not worth sharing them.
(Just to note - it's probably not a good idea to talk about the "above answer", since the answer order that a reader sees can be variable. For instance, if they set the display order to "active", then your answer is actually the one above!)
@Anony-Mousse This answer is really awesome. But until now, I kind of forgot what we usually mean by saying "k-means will work under some conditions and will fail under other conditions." What do the words "work" and "fail" mean in this context? Does "work" mean the solution generated by k-means will visually 'look reasonable'? That is kind of vague. Or does "work" mean that k-means provides a solution that matches the 'standard solution', i.e., we pre-generate a data set and then use k-means? In that context "work" makes sense, but in reality, data are not pre-generated by some distribution.
Usually people refer to some ground truth, i.e., how the data was generated, or to some label hidden from the algorithm. Comparing to generated data will prefer algorithms that optimize the model that was used for generation (e.g. GMM and k-means for Gaussians). And even on real and labeled data, this evaluation is about reproducing a *known* result. It says less about the explorative / knowledge-discovery aspect, where you want to learn something *new*. But it's all we've got.
Would it work better on the A3 data set if $k$ were adjusted to the number of effectively present clusters as determined a priori?
@TMOTTM this is with k chosen by prior knowledge. Best of 10 runs all with the "correct" k chosen a priori.
"If you want to do cluster analysis as in discover structure then k-means is IMHO not the best choice" -> Can you point out some good approaches/alternatives for structure discovery? Thanks :)