com.aliasi.cluster
Class KMeansClusterer<E>

java.lang.Object
  extended by com.aliasi.cluster.KMeansClusterer<E>
All Implemented Interfaces:
Clusterer<E>

public class KMeansClusterer<E>
extends Object
implements Clusterer<E>

A KMeansClusterer provides an implementation of standard k-means clustering based on vectors constructed by feature extractors. An instance fixes a specific value of k, the number of clusters returned.

K-Means Clustering Algorithm

If fewer than the maximum number of elements, k, is clustered, a clustering with that number of singleton clusters is returned.

The elements being clustered are first converted into feature vectors using a feature extractor. These feature vectors are then evenly distributed among the clusters using random assignment.

Each cluster is represented by the centroid of the feature vectors assigned to it. Centroids are just dimension-wise averages:

     centroid(v[0],...,v[n-1]) = (v[0] + ... + v[n-1]) / n
and are thus vectors in the same space as the feature vectors being clustered.

Feature vectors are always compared to cluster centroids using Euclidean distance:

     distance(x,y) = sqrt( (x - y) * (x - y) )
                   = sqrt(Σi (x[i] - y[i])2)
Thus any normalization should be done as part of the feature extraction phase. A common normalization is to divide each feature vector by its length to produce normal vectors, which weights each dimension equally and results in distances monotonically related to cosines.

The algorithm then iteratively improves cluster assignments. Each iteration consists of walking through each feature vector and assigning it to the cluster with the closest centroid. Ties are determined non-deterministically. The centroids are then recomputed at the end of each iteration. When no elements change clusters during an iteration, the result is returned. The algorithm will also return if the maximum number of iterations is reached without arriving at a fixed point.

These iterations have some fancy names in the literature. The whole approach is known as "Lloyd's Algorithm", "Voronoi iteration", or simply "relaxation". The final centroids forms a so-called "Voronoi Tesselation" of the feature vector space of the objects being clustered. This tesselation is similar to k-nearest neighbors classification (see KnnClassifier).

K-means as Minimization

K-means clustering may be viewed as a gradient descent solution to the minimization of distances to cluster centroids, which we define here as error:

     error(cs) = Σc in cs Σx in c distance(x,centroid(C))
where cs is the set of clusters, c is a cluster in c, centroid(c) is the vector representing the centroid of the cluster c, and distance(x,centroid(x)) is the distance between an element x of cluster c and the cluster's centroid.

Convergence Guarantees

K-means clustering is guaranteed to converge to a solution because each increment reduces error as defined in the previous section. There are only finitely many clusterings, so eventually k-means will converge. The only problem is that the size of the finite set of clusterings is exponential in the number of input elements.

Relation to Gaussian Mixtures and Expectation/Maximization

The k-means clustering algorithm is implicitly based on a multi-dimensional Gaussian with independent dimensions with their own means and variances. Estimates are carried out by maximum likelihood; that is, no prior, or equivalently the fully uninformative prior, is used. Where k-means differs from standard expectation/maximization (EM) is that k-means reweights the expectations so that the closest centroid gets an expectation of 1.0 and all other centroids get an expectation of 0.0. This approach has been called "winner-take-all" EM in the EM literature.

Local Maxima and Multiple Runs

Like the EM algorithm, k-means clustering is highly sensitive to the initial assignment of elements. In practice, it is often helpful to apply k-means clustering repeatedly to the same input data, returning the clustering which optimizes some evaluation metric. This metric is typically within-cluster scatter, because the number of dimensions is fixed. Within-cluster scatter may be computed with the static method ClusterScore.withinClusterScatter(Set,Distance).

Degenerate Solutions

In some cases, the iterative approach taken by k-means leads to a solution in which not every cluster is populated. This happens during a step in which no feature vector is closest to a given centroid. In many of these cases, rerunning the clusterer will find a solution with k clusters.

Picking a good K

The number of clusters, k, may also be varied. In this case, new k-means clustering instances must be created, as each uses a fixed number of clusters.

By varying k, the maximum number of clusters, the within-cluster scatter may be compared across different choices of k. Typically, a value of k is chosen at a knee of the within-cluster scatter scores. There are automatic ways of performing this selection, though they are heuristically rather than theoretically motivated.

In practice, it is technically possible, though unlikely, for clusters to wind up with no elements in them. This implementation will simply return fewer clusters than the maximum specified in this case.

Multiple Runs and Bootstrap Relation Estimates

Multiple runs may be used to provide bootstrap estimates of the relatedness of any two elements. Bootstrap estimates work by subsampling the elements to cluster with replacement and then running k-means on them. This is repeated multiple times, and the percentage of runs in which two elements fall in the same cluster forms the bootstrap estimate of the likelihood that they are in the same cluster.

Efficiency

The main advantage K-means has over hierarchical clustering is that it only requires linear time in the number of elements being clustered. There's a constant factor for the maximum number of iterations allowed and another constant factor for the length of the feature vectors.

Implementation Notes

The current implementation is an inefficient brute-force approach that computes the distance between every element and every centroid on every iteration. These distances are computed between maps, which itself is not very efficient. In the future, we may substitute a more efficient version of k-means implemented to this same interface.

More efficient implementations could use kd-trees to index the points, could cache scores of input feature vectors against clusters, and would only compute differential updates when elements change cluster. Threading could also be used to compute the distances.

References

Since:
LingPipe2.0
Version:
3.1.1
Author:
Bob Carpenter

Constructor Summary
KMeansClusterer(FeatureExtractor<E> featureExtractor, int numClusters, int maxIterations)
          Construct a k-means clusterer with the specified feature extractor, number of clusters and limit on number of iterations to run optimization.
 
Method Summary
 Set<Set<E>> cluster(Set<? extends E> elementSet)
          Return a k-means clustering of the specified set of elements.
 FeatureExtractor<E> featureExtractor()
          Returns the feature extractor for this clusterer.
 int numClusters()
          Returns the number of clusters this clusterer will return.
 Set<Set<E>> recluster(Set<Set<E>> clustering, int maxIterations)
          Recluster the specified clustering using up to the specified number of k-means iterations.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

KMeansClusterer

public KMeansClusterer(FeatureExtractor<E> featureExtractor,
                       int numClusters,
                       int maxIterations)
Construct a k-means clusterer with the specified feature extractor, number of clusters and limit on number of iterations to run optimization. Initialization of each cluster is with random shuffling of all elements into a cluster.

If the number of iterations is set to zero, the result will be random balanced clusterings of the specified size.

Parameters:
featureExtractor - Feature extractor for this clusterer.
numClusters - Number of clusters to return.
maxIterations - Maximum number of iterations during optimization.
Throws:
IllegalArgumentException - If the number of clusters is less than 1, or if the maximum number of iterations is less than 0.
Method Detail

featureExtractor

public FeatureExtractor<E> featureExtractor()
Returns the feature extractor for this clusterer.

Returns:
The feature extractor for this clusterer.

numClusters

public int numClusters()
Returns the number of clusters this clusterer will return. This is the "k" in "k-means".

Returns:
The number of clusters this clusterer will return.

recluster

public Set<Set<E>> recluster(Set<Set<E>> clustering,
                             int maxIterations)
Recluster the specified clustering using up to the specified number of k-means iterations. This method allows users to specify their own initial clusterings, which are then reallocated using the standard k-means algorithm.

Parameters:
clustering - Clustering to recluster.
maxIterations - Maximum number of reclustering iterations.
Returns:
New clustering of input elements.

cluster

public Set<Set<E>> cluster(Set<? extends E> elementSet)
Return a k-means clustering of the specified set of elements. Note that this method is randomized and may return different results over different runs. See the class documentation for more details.

Specified by:
cluster in interface Clusterer<E>
Parameters:
elementSet - Set of elements to cluster.
Returns:
Clustering of the specified elements.