com.aliasi.cluster

## Class KMeansClusterer<E>

• Type Parameters:
`E` - the type of objects being clustered
All Implemented Interfaces:
Clusterer<E>

```public class KMeansClusterer<E>
extends Object
implements Clusterer<E>```
A `KMeansClusterer` provides an implementation of k-means(++) clustering based on vectors constructed by feature extractors. An instance fixes a specific value of `K`, the number of clusters returned. Initialization may be either by the traditional k-means random assignment, or by the k-means++ initialization strategy.

This clustering class is defined so as to be able to cluster arbitrary objects. These objects are converted to (sparse) vectors using a feature extractor specified during construction.

Feature Parsing to Cluster Objects

The elements being clustered are first converted into d-dimensional feature vectors using a feature extractor. These feature vectors are then evenly distributed among the clusters by random assignment. Feature extractors may normalize their input in any number of ways. Run time is dominated by the density of the object vectors.
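As a rough illustration of the kind of mapping a feature extractor performs (a generic sketch with hypothetical names, not LingPipe's `FeatureExtractor` interface), a string can be converted to a sparse vector represented as a map from feature name to value, here token counts:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: convert an object (here a string) into a sparse feature
// vector, represented as a map from feature name to count.
public class FeatureSketch {
    public static Map<String, Double> features(String text) {
        Map<String, Double> vec = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+"))
            vec.merge(token, 1.0, Double::sum);  // add 1.0 per occurrence
        return vec;
    }
}
```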

Centroids

In k-means, each cluster is modeled by the centroid of the feature vectors assigned to it. The centroid of a set of points is just the mean of the points, computed by dimension:

` centroid({v[0],...,v[n-1]}) = (v[0] + ... + v[n-1]) / n`
Centroids are thus located in the same vector space as the objects; namely, they are d-dimensional vectors. The code represents centroids as dense vectors and objects as sparse vectors.
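A sketch of this computation under the sparse-object/dense-centroid representation (hypothetical names, not the library's internals): the centroid is the dimension-wise mean, accumulated into a dense array:

```java
import java.util.List;
import java.util.Map;

// Sketch: centroid of sparse feature vectors (index -> value maps),
// represented as a dense array over d dimensions.
public class CentroidSketch {
    public static double[] centroid(List<Map<Integer, Double>> points, int dims) {
        double[] mean = new double[dims];          // dense representation
        for (Map<Integer, Double> p : points)
            for (Map.Entry<Integer, Double> e : p.entrySet())
                mean[e.getKey()] += e.getValue();  // sum by dimension
        for (int i = 0; i < dims; ++i)
            mean[i] /= points.size();              // divide by n
        return mean;
    }
}
```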

Euclidean Distance

Feature vectors are always compared to cluster centroids using squared Euclidean distance, defined by:

``` distance(x,y)^2 = (x - y) * (x - y)
                = Σi (x[i] - y[i])^2 ```
The centroid of a set of points is the point that minimizes the sum of squared distances from the points in the set to it.
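A sketch of this comparison under the sparse-object/dense-centroid representation (hypothetical code, not the library's): expanding (x - y) * (x - y) into x*x - 2*x*y + y*y means the sparse entries of x only need to be visited once:

```java
import java.util.Map;

// Sketch: squared Euclidean distance between a sparse object vector
// (index -> value map) and a dense centroid, via the expansion
// (x-y)*(x-y) = x*x - 2*x*y + y*y.
public class DistanceSketch {
    public static double sqDistance(Map<Integer, Double> x, double[] centroid) {
        double dist = 0.0;
        for (double yi : centroid)
            dist += yi * yi;                      // y*y over all dimensions
        for (Map.Entry<Integer, Double> e : x.entrySet()) {
            double xi = e.getValue(), yi = centroid[e.getKey()];
            dist += xi * xi - 2.0 * xi * yi;      // sparse x*x - 2*x*y terms
        }
        return dist;
    }
}
```

For example, with x = (3, 0) and centroid (1, 1), the result is (3-1)^2 + (0-1)^2 = 5.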

K-means

The k-means algorithm then iteratively improves cluster assignments. Each epoch consists of two stages: reassignment and mean recomputation.

Cluster Assignment:   In each epoch, we assign each object to the cluster represented by the closest centroid. Ties go to the lowest indexed element.

Mean Recomputation:   At the end of each epoch, the centroids are then recomputed as the means of points assigned to the cluster.

Convergence:   If no objects change cluster during an epoch, the algorithm has converged and the result is returned. The algorithm is also considered converged if the relative reduction in error from epoch to epoch falls below a set threshold, or if the maximum number of epochs has been reached.
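The epoch loop above can be sketched as follows (dense points and hypothetical names for brevity; the actual implementation works over sparse feature vectors and also applies the relative-improvement test):

```java
import java.util.Arrays;

// Sketch of the k-means epoch loop: assignment to the closest centroid
// (ties to the lowest index), then mean recomputation, until no
// assignment changes or maxEpochs is reached. Centroids are updated
// in place; an emptied cluster keeps its previous centroid.
public class KMeansLoopSketch {
    static double sq(double[] x, double[] y) {
        double d = 0.0;
        for (int i = 0; i < x.length; ++i) d += (x[i] - y[i]) * (x[i] - y[i]);
        return d;
    }
    public static int[] cluster(double[][] points, double[][] centroids, int maxEpochs) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int epoch = 0; epoch < maxEpochs; ++epoch) {
            boolean changed = false;
            // Cluster assignment: closest centroid wins, ties to lowest index
            for (int i = 0; i < points.length; ++i) {
                int best = 0;
                for (int k = 1; k < centroids.length; ++k)
                    if (sq(points[i], centroids[k]) < sq(points[i], centroids[best]))
                        best = k;
                if (best != assignment[i]) { assignment[i] = best; changed = true; }
            }
            if (!changed) break;  // converged: no objects changed cluster
            // Mean recomputation: each centroid becomes the mean of its points
            for (int k = 0; k < centroids.length; ++k) {
                double[] sum = new double[points[0].length];
                int n = 0;
                for (int i = 0; i < points.length; ++i)
                    if (assignment[i] == k) {
                        for (int d = 0; d < sum.length; ++d) sum[d] += points[i][d];
                        ++n;
                    }
                if (n > 0)
                    for (int d = 0; d < sum.length; ++d) centroids[k][d] = sum[d] / n;
            }
        }
        return assignment;
    }
}
```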

K-means as Minimization

K-means clustering may be viewed as an iterative approach to minimizing the sum of squared distances between items and their cluster centroids, which is:

` Err(cs) = Σ_{c in cs} Σ_{x in c} distance(x, centroid(x))^2`
where `cs` is the set of clusters and `centroid(x)` is the centroid (mean or average) of the cluster containing x.
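A sketch of this error computation over dense points (hypothetical names): for each cluster, compute its centroid, then accumulate the squared distances from its members:

```java
import java.util.List;

// Sketch: Err(cs) = sum over clusters of squared distances from each
// member to its cluster's centroid.
public class ErrorSketch {
    public static double error(List<List<double[]>> clusters) {
        double err = 0.0;
        for (List<double[]> c : clusters) {
            // centroid(c): dimension-wise mean of the cluster's members
            double[] mean = new double[c.get(0).length];
            for (double[] x : c)
                for (int d = 0; d < mean.length; ++d) mean[d] += x[d];
            for (int d = 0; d < mean.length; ++d) mean[d] /= c.size();
            // add squared distance from each member to the centroid
            for (double[] x : c)
                for (int d = 0; d < mean.length; ++d)
                    err += (x[d] - mean[d]) * (x[d] - mean[d]);
        }
        return err;
    }
}
```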

Convergence Guarantees

K-means clustering is guaranteed to converge to a local minimum of error because both steps of k-means reduce error. First, assigning each object to its closest centroid minimizes error given the centroids. Second, recalculating centroids as the means of the elements assigned to them minimizes error given the clustering. Given that error is bounded at zero and changes are discrete, k-means must eventually converge. While there are exponentially many possible clusterings in theory, in practice k-means converges very quickly.

Local Minima and Multiple Runs

Like the EM algorithm, k-means clustering is highly sensitive to the initial assignment of elements. In practice, it is often helpful to apply k-means clustering repeatedly to the same input data, returning the clustering with minimum error.

At the start of each iteration, the error of the previous assignment is reported (at convergence, this will be the final error).

Multiple runs may be used to provide bootstrap estimates of the relatedness of any two elements. Bootstrap estimates work by subsampling the elements to cluster with replacement and then running k-means on them. This is repeated multiple times, and the percentage of runs in which two elements fall in the same cluster forms the bootstrap estimate of the likelihood that they are in the same cluster.
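A sketch of this bootstrap procedure (hypothetical names; the clustering step is passed in as a parameter, standing in for any clusterer over the resampled elements):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.function.Function;

// Sketch: bootstrap estimate of the likelihood that elements a and b
// belong to the same cluster. Resample with replacement, cluster each
// sample, and report the fraction of runs (among those containing both
// elements) in which a and b land in the same cluster.
public class BootstrapSketch {
    public static <E> double sameClusterEstimate(
            List<E> elements, E a, E b, int runs, Random random,
            Function<List<E>, Set<Set<E>>> clusterer) {
        int together = 0, counted = 0;
        for (int r = 0; r < runs; ++r) {
            List<E> sample = new ArrayList<>();
            for (int i = 0; i < elements.size(); ++i)  // sample with replacement
                sample.add(elements.get(random.nextInt(elements.size())));
            if (!sample.contains(a) || !sample.contains(b)) continue;
            ++counted;
            for (Set<E> cluster : clusterer.apply(sample))
                if (cluster.contains(a) && cluster.contains(b)) { ++together; break; }
        }
        return counted == 0 ? 0.0 : together / (double) counted;
    }
}
```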

Degenerate Solutions

In some cases, the iterative approach taken by k-means leads to a solution in which not every cluster is populated. This happens during a step in which no feature vector is closest to a given centroid. This is most likely to happen in highly skewed data in high dimensional spaces. Sometimes, rerunning the clusterer with a different initialization will find a solution with k clusters.

Picking a good `K`

The number of clusters, k, may also be varied. In this case, new k-means clustering instances must be created, as each uses a fixed number of clusters.

By varying `k`, the maximum number of clusters, the within-cluster scatter may be compared across different choices of `k`. Typically, a value of `k` is chosen at a knee of the within-cluster scatter scores. There are automatic ways of performing this selection, though they are heuristically rather than theoretically motivated.

In practice, it is technically possible, though unlikely, for clusters to wind up with no elements in them. This implementation will simply return fewer clusters than the maximum specified in this case.

K-Means++ Initialization

K-means++ is a k-means algorithm that attempts to make a good randomized choice of initial centers. This requires choosing a diverse set of centers, but not overpopulating the initial set of centers with outliers.

K-means++ has reasonable expected performance bounds in theory, and quite good performance in practice.

Suppose we have K clusters and a set X of size at least K. K-means++ chooses a single point c[k] in X as each initial centroid, using the following strategy:

``` 1.  Sample the first centroid c[1] randomly from X.

2.  For k = 2 to K

Sample the next centroid c[k] = x
with probability proportional to D(x)2
```
where D(x) is the minimum distance to an existing centroid:
` D(x) = min_{k' < k} d(x, c[k'])`
After initialization, k-means++ proceeds just as traditional k-means clustering.
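The seeding procedure above can be sketched as follows (dense points, hypothetical names): each subsequent center is drawn with probability proportional to D(x)^2, the squared distance to the nearest center chosen so far:

```java
import java.util.Random;

// Sketch of k-means++ seeding: the first center is uniform; each later
// center is sampled with probability proportional to D(x)^2.
public class KMeansPPSketch {
    static double sq(double[] x, double[] y) {
        double d = 0.0;
        for (int i = 0; i < x.length; ++i) d += (x[i] - y[i]) * (x[i] - y[i]);
        return d;
    }
    public static double[][] init(double[][] points, int k, Random random) {
        double[][] centers = new double[k][];
        centers[0] = points[random.nextInt(points.length)];  // step 1: uniform
        for (int c = 1; c < k; ++c) {
            double[] d2 = new double[points.length];
            double total = 0.0;
            for (int i = 0; i < points.length; ++i) {
                d2[i] = Double.POSITIVE_INFINITY;
                for (int j = 0; j < c; ++j)                  // D(x): nearest center so far
                    d2[i] = Math.min(d2[i], sq(points[i], centers[j]));
                total += d2[i];
            }
            // inverse-CDF sampling proportional to D(x)^2
            double u = random.nextDouble() * total;
            int pick = 0;
            double acc = d2[0];
            while (acc < u && pick + 1 < points.length) {
                ++pick;
                acc += d2[pick];
            }
            centers[c] = points[pick];
        }
        return centers;
    }
}
```

Note that points already chosen as centers have D(x)^2 = 0 and so are (almost surely) never chosen again.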

The good expected behavior of k-means++ arises from choosing the next center in such a way that it is far away from existing centers. Many nearby points in some sense pool their behavior, because the chance that one of them will be picked is the sum of the chances that each will be picked. Outliers are, by definition, points x with high values of D(x) that are not near other points.

Relation to Gaussian Mixtures and Expectation/Maximization

The k-means clustering algorithm is implicitly based on a multi-dimensional Gaussian with independent dimensions, with means given by the centroids and equal variances across dimensions. Estimates are carried out by maximum likelihood; that is, no prior, or equivalently the fully uninformative prior, is used. Where k-means differs from standard expectation/maximization (EM) is that k-means reweights the expectations so that the closest centroid gets an expectation of 1.0 and all other centroids get an expectation of 0.0. This approach has been called "winner-take-all" EM in the EM literature.
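The difference can be sketched in one dimension (hypothetical names; the soft weights assume equal-variance Gaussians, as described above):

```java
// Sketch: soft EM responsibilities for equal-variance Gaussians versus
// the "winner-take-all" reweighting applied by k-means.
public class WinnerTakeAllSketch {
    // Soft expectations: proportional to exp(-distance^2 / (2 * var)),
    // normalized to sum to 1.0
    public static double[] softWeights(double x, double[] centroids, double var) {
        double[] w = new double[centroids.length];
        double total = 0.0;
        for (int k = 0; k < w.length; ++k) {
            double d = x - centroids[k];
            w[k] = Math.exp(-d * d / (2.0 * var));
            total += w[k];
        }
        for (int k = 0; k < w.length; ++k) w[k] /= total;
        return w;
    }
    // Winner-take-all: 1.0 for the closest centroid, 0.0 elsewhere
    public static double[] hardWeights(double x, double[] centroids) {
        double[] w = new double[centroids.length];
        int best = 0;
        for (int k = 1; k < centroids.length; ++k)
            if (Math.abs(x - centroids[k]) < Math.abs(x - centroids[best])) best = k;
        w[best] = 1.0;
        return w;
    }
}
```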

Implementation Notes

The current implementation conserves some computation relative to the brute-force approach by (1) only computing vector products when comparing two vectors, and (2) not recomputing distances that are known to be unchanged.

References

• MacQueen, J. B. 1967. Some Methods for classification and Analysis of Multivariate Observations. Proceedings of Fifth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press.
• Andrew Moore's K-Means Tutorial including most of the mathematics
• Matteo Matteucci's K-Means Tutorial including a very nice interactive servlet demo
• Hastie, T., R. Tibshirani and J.H. Friedman. 2001. The Elements of Statistical Learning. Springer-Verlag.
• Wikipedia: K-means Algorithm
• Arthur, David and Sergei Vassilvitskii. 2007. k-means++: The Advantages of Careful Seeding. SODA 2007.
Since:
LingPipe 2.0
Version:
4.0.0
Author:
Bob Carpenter
• ### Constructor Summary

Constructors
Constructor and Description
```KMeansClusterer(FeatureExtractor<E> featureExtractor, int numClusters, int maxEpochs, boolean kMeansPlusPlus, double minImprovement)```
Construct a k-means clusterer with the specified feature extractor, number of clusters, maximum number of epochs, and minimum relative improvement per epoch, using either traditional or k-means++ initialization.
• ### Method Summary

All Methods
Modifier and Type Method and Description
`Set<Set<E>>` `cluster(Set<? extends E> elementSet)`
Return a k-means clustering of the specified set of elements using a freshly generated random number generator without intermediate reporting.
`Set<Set<E>>` ```cluster(Set<? extends E> elementSet, Random random, Reporter reporter)```
Return the k-means clustering for the specified set of elements, using the specified random number generator, sending progress reports to the specified reporter.
`FeatureExtractor<E>` `featureExtractor()`
Returns the feature extractor for this clusterer.
`int` `maxEpochs()`
Returns the maximum number of epochs for this clusterer.
`double` `minRelativeImprovement()`
Returns the minimum reduction in relative error required to continue to the next epoch during clustering.
`int` `numClusters()`
Returns the maximum number of clusters this clusterer will return.
`Set<Set<E>>` ```recluster(Set<Set<E>> initialClustering, Set<E> unclusteredElements, Reporter reporter)```
Recluster the specified initial clustering, adding in the unclustered elements, reporting progress to the specified reporter.
• ### Methods inherited from class java.lang.Object

`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`
• ### Constructor Detail

• #### KMeansClusterer

```public KMeansClusterer(FeatureExtractor<E> featureExtractor,
int numClusters,
int maxEpochs,
boolean kMeansPlusPlus,
double minImprovement)```
Construct a k-means clusterer with the specified feature extractor, number of clusters, maximum number of epochs, and minimum relative improvement per epoch, using either traditional or k-means++ initialization.

If the `kMeansPlusPlus` flag is set to `true`, the k-means++ initialization strategy is used. If it is set to `false`, clusters are initialized by randomly shuffling all of the elements evenly among the clusters.

If the number of epochs is set to zero, the result will be a random balanced clustering of the specified size.

Parameters:
`featureExtractor` - Feature extractor for this clusterer.
`numClusters` - Number of clusters to return.
`maxEpochs` - Maximum number of epochs during optimization.
`kMeansPlusPlus` - Set to `true` to use k-means++ initialization.
`minImprovement` - Minimum relative improvement in squared distance scatter to keep going to the next epoch.
Throws:
`IllegalArgumentException` - If the number of clusters is less than 1, if the maximum number of epochs is less than 0, or if the minimum improvement is not a finite, non-negative number.
• ### Method Detail

• #### featureExtractor

`public FeatureExtractor<E> featureExtractor()`
Returns the feature extractor for this clusterer.
Returns:
The feature extractor for this clusterer.
• #### numClusters

`public int numClusters()`
Returns the maximum number of clusters this clusterer will return. This is the "`k`" in "k-means".

Clustering fewer elements than this number will result in fewer clusters.

Returns:
The number of clusters this clusterer will return.
• #### maxEpochs

`public int maxEpochs()`
Returns the maximum number of epochs for this clusterer.
Returns:
The maximum number of epochs.
• #### cluster

`public Set<Set<E>> cluster(Set<? extends E> elementSet)`
Return a k-means clustering of the specified set of elements using a freshly generated random number generator, without intermediate reporting. The feature extractor, maximum number of epochs, number of clusters, minimum relative improvement, and choice of k-means++ initialization are fixed by the class.

This is just a utility method implementing the `Clusterer` interface. A call to `cluster(elementSet)` produces the same result as `cluster(elementSet,new Random(),null)`.

Specified by:
`cluster` in interface `Clusterer<E>`
Parameters:
`elementSet` - Set of elements to cluster.
Returns:
Clustering of the specified elements.
• #### cluster

```public Set<Set<E>> cluster(Set<? extends E> elementSet,
Random random,
Reporter reporter)```
Return the k-means clustering for the specified set of elements, using the specified random number generator and sending progress reports to the specified reporter. The feature extractor, maximum number of epochs, number of clusters, minimum relative improvement, and choice of k-means++ initialization are fixed by the class.

The reason this is a separate method is that typical implementations of `Random` are not thread safe, and rarely should reports from different clustering runs be interleaved to a reporter.

Using a fixed random number generator (e.g., constructing `Random` with the same seed) will produce the same clustering, which can be useful for replicating tests.

Parameters:
`elementSet` - Set of elements to cluster
`random` - Random number generator
`reporter` - Reporter to which progress reports are sent, or `null` if no reporting is required.
• #### minRelativeImprovement

`public double minRelativeImprovement()`
Returns the minimum reduction in relative error required to continue to the next epoch during clustering.
Returns:
The minimum improvement per epoch.
• #### recluster

```public Set<Set<E>> recluster(Set<Set<E>> initialClustering,
Set<E> unclusteredElements,
Reporter reporter)```
Recluster the specified initial clustering, adding in the unclustered elements, reporting progress to the specified reporter. The number of clusters is set by the size of the initial clustering as measured by the number of non-empty clusters it contains. This clusterer's maximum number of epochs and minimum improvement will be used.
Parameters:
`initialClustering` - Initialization of clustering.
`unclusteredElements` - Elements that have not been clustered.
`reporter` - Reporter to which to send progress reports, or `null` for no reporting.
Returns:
The reclustering.
Throws:
`IllegalArgumentException` - If there are empty clusters in the clustering or if an element belongs to more than one cluster.