public class KMeansClusterer<E> extends Object implements Clusterer<E>

Type Parameters:
E - the type of objects being clustered
KMeansClusterer provides an implementation of k-means(++) clustering based on vectors constructed by feature extractors. An instance fixes a specific value of K, the number of clusters returned. Initialization may be either by the traditional k-means random assignment or by the k-means++ initialization strategy.
This clustering class is defined so as to be able to cluster arbitrary objects. These objects are converted to (sparse) vectors using a feature extractor specified during construction.
Feature Parsing to Cluster Objects
The elements being clustered are first converted into d-dimensional feature vectors using a feature extractor. These feature vectors are then evenly distributed among the clusters by random assignment. Feature extractors may normalize their input in any number of ways. Run time is dominated by the density of the object vectors.
Centroids
In k-means, each cluster is modeled by the centroid of the feature vectors assigned to it. The centroid of a set of points is just the mean of the points, computed by dimension:

centroid({v[0],...,v[n-1]}) = (v[0] + ... + v[n-1]) / n

Centroids are thus located in the same vector space as the objects; namely, they are d-dimensional vectors. The code represents centroids as dense vectors, and objects as sparse vectors.
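The per-dimension mean above can be sketched in a few lines of Java. This is a minimal illustration using dense vectors, not the library's internal code:

```java
// Sketch of a centroid as the per-dimension mean of n vectors.
// Assumes all vectors share the same dimensionality d.
public class CentroidSketch {
    static double[] centroid(double[][] vectors) {
        int d = vectors[0].length;
        double[] mean = new double[d];
        for (double[] v : vectors)
            for (int i = 0; i < d; ++i)
                mean[i] += v[i];              // sum by dimension
        for (int i = 0; i < d; ++i)
            mean[i] /= vectors.length;        // divide by n
        return mean;
    }
}
```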
Euclidean Distance
Feature vectors are always compared to cluster centroids using squared Euclidean distance, defined by:

distance(x,y)^2 = (x - y) * (x - y) = Σ_i (x[i] - y[i])^2

The centroid of a set of points is the point that minimizes the sum of squared distances from the points in that set to it.
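As a sketch (again, not the library's internal code), the squared Euclidean distance is:

```java
// Squared Euclidean distance between two vectors of equal dimension.
public class SquaredDistanceSketch {
    static double squaredDistance(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; ++i) {
            double diff = x[i] - y[i];
            sum += diff * diff;   // Σ_i (x[i] - y[i])^2
        }
        return sum;
    }
}
```

Note that squared distance is used directly; since squaring is monotonic for non-negative values, the nearest centroid under squared distance is also nearest under plain Euclidean distance, and the square root is never needed.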
K-means
The k-means algorithm then iteratively improves cluster assignments. Each epoch consists of two stages: reassignment and recomputing the means.

Cluster Assignment: In each epoch, each object is assigned to the cluster represented by the closest centroid. Ties go to the lowest-indexed centroid.

Mean Recomputation: At the end of each epoch, the centroids are recomputed as the means of the points assigned to their clusters.

Convergence: If no objects change cluster during an epoch, the algorithm has converged and the results are returned. The algorithm is also considered converged if the relative reduction in error from epoch to epoch falls below a set threshold, or if the maximum number of epochs has been reached.
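The two-stage epoch loop can be sketched as follows. This is a hedged illustration using dense vectors rather than the library's sparse representation, and it uses only the no-reassignment convergence test:

```java
// Sketch of the k-means epoch loop: reassign, recompute means, repeat.
public class KMeansEpochSketch {
    static double sqDist(double[] x, double[] y) {
        double s = 0.0;
        for (int i = 0; i < x.length; ++i) { double d = x[i] - y[i]; s += d * d; }
        return s;
    }

    // Returns the cluster index assigned to each point at convergence.
    static int[] cluster(double[][] points, double[][] centroids, int maxEpochs) {
        int[] assignment = new int[points.length];
        java.util.Arrays.fill(assignment, -1);
        for (int epoch = 0; epoch < maxEpochs; ++epoch) {
            // Stage 1: assign each point to its closest centroid;
            // strict < keeps ties at the lowest-indexed centroid.
            boolean changed = false;
            for (int i = 0; i < points.length; ++i) {
                int best = 0;
                for (int k = 1; k < centroids.length; ++k)
                    if (sqDist(points[i], centroids[k]) < sqDist(points[i], centroids[best]))
                        best = k;
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            if (!changed) break; // convergence: no point moved this epoch
            // Stage 2: recompute each centroid as the mean of its points.
            int d = centroids[0].length;
            double[][] sums = new double[centroids.length][d];
            int[] counts = new int[centroids.length];
            for (int i = 0; i < points.length; ++i) {
                counts[assignment[i]]++;
                for (int j = 0; j < d; ++j) sums[assignment[i]][j] += points[i][j];
            }
            for (int k = 0; k < centroids.length; ++k)
                if (counts[k] > 0) // an empty cluster keeps its old centroid
                    for (int j = 0; j < d; ++j) centroids[k][j] = sums[k][j] / counts[k];
        }
        return assignment;
    }
}
```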
K-means as Minimization
K-means clustering may be viewed as an iterative approach to minimizing the average squared distance between items and their cluster centers:

Err(cs) = Σ_{c in cs} Σ_{x in c} distance(x, centroid(x))^2

where cs is the set of clusters and centroid(x) is the centroid (mean or average) of the cluster containing x.
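The Err(cs) objective can be computed directly from a clustering. The sketch below takes clusters as arrays of dense points; it is an illustration, not the library's code:

```java
// Sketch of Err(cs): sum of squared distances from each point to the
// centroid of its own cluster.
public class ScatterSketch {
    static double error(double[][][] clusters) {
        double err = 0.0;
        for (double[][] cluster : clusters) {
            // centroid = per-dimension mean of the cluster's points
            int d = cluster[0].length;
            double[] c = new double[d];
            for (double[] x : cluster)
                for (int j = 0; j < d; ++j)
                    c[j] += x[j] / cluster.length;
            // add this cluster's within-cluster scatter
            for (double[] x : cluster)
                for (int j = 0; j < d; ++j)
                    err += (x[j] - c[j]) * (x[j] - c[j]);
        }
        return err;
    }
}
```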
Convergence Guarantees
K-means clustering is guaranteed to converge to a local minimum of error because both steps of k-means reduce error. First, assigning each object to its closest centroid minimizes error given the centroids. Second, recalculating centroids as the means of the elements assigned to them minimizes error given the clustering. Given that error is bounded below by zero and changes are discrete, k-means must eventually converge. While there are exponentially many possible clusterings in theory, in practice k-means converges very quickly.
Local Minima and Multiple Runs
Like the EM algorithm, k-means clustering is highly sensitive to the initial assignment of elements. In practice, it is often helpful to apply k-means clustering repeatedly to the same input data, returning the clustering with minimum error.
At the start of each iteration, the error of the previous assignment is reported (at convergence, this will be the final error).
Multiple runs may be used to provide bootstrap estimates of the relatedness of any two elements. Bootstrap estimates work by subsampling the elements to cluster with replacement and then running k-means on them. This is repeated multiple times, and the percentage of runs in which two elements fall in the same cluster forms the bootstrap estimate of the likelihood that they are in the same cluster.
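The bootstrap procedure can be sketched generically. The names below (coClusterRate, the clusterer function argument) are hypothetical, not part of the library's API, and for simplicity the pair of interest is forced into every subsample:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.function.Function;

// Sketch of a bootstrap co-clustering estimate. "clusterer" is any function
// from a set of elements to a clustering (a set of disjoint clusters).
public class BootstrapSketch {
    static <E> double coClusterRate(List<E> elements, E a, E b, int runs,
                                    Function<Set<E>, Set<Set<E>>> clusterer,
                                    Random rng) {
        int together = 0;
        for (int r = 0; r < runs; ++r) {
            // subsample the elements with replacement
            Set<E> sample = new HashSet<E>();
            for (int i = 0; i < elements.size(); ++i)
                sample.add(elements.get(rng.nextInt(elements.size())));
            sample.add(a);
            sample.add(b); // simplification: ensure the pair appears in every run
            for (Set<E> cluster : clusterer.apply(sample))
                if (cluster.contains(a) && cluster.contains(b)) { ++together; break; }
        }
        return together / (double) runs; // fraction of runs clustered together
    }
}
```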
Degenerate Solutions
In some cases, the iterative approach taken by k-means leads to a solution in which not every cluster is populated. This happens during a step in which no feature vector is closest to a given centroid. It is most likely to happen with highly skewed data in high-dimensional spaces. Sometimes, rerunning the clusterer with a different initialization will find a solution with k clusters.
Picking a good K
The number of clusters, k, may also be varied. In this case, new k-means clusterer instances must be created, as each uses a fixed number of clusters.
By varying k, the maximum number of clusters, the within-cluster scatter may be compared across different choices of k. Typically, a value of k is chosen at a knee of the within-cluster scatter scores. There are automatic ways of performing this selection, though they are heuristically rather than theoretically motivated.
In practice, it is technically possible, though unlikely, for clusters to wind up with no elements in them. This implementation will simply return fewer clusters than the maximum specified in this case.
K-means++ Initialization
K-means++ is a k-means algorithm that attempts to make a good randomized choice of initial centers. This requires choosing a diverse set of centers without overpopulating the initial set with outliers. K-means++ has reasonable expected performance bounds in theory, and quite good performance in practice.
Suppose we have K clusters and a set X of size at least K. K-means++ chooses each initial centroid c[k] from X using the following strategy:

1. Sample the first centroid c[1] randomly from X.
2. For k = 2 to K, sample the next centroid c[k] = x with probability proportional to D(x)^2, where D(x) is the minimum distance to an existing centroid:

D(x) = min_{k' < k} distance(x, c[k'])

After initialization, k-means++ proceeds just as traditional k-means clustering.
The good expected behavior of k-means++ arises from choosing the next center so that it is far away from existing centers. Many nearby points in some sense pool their behavior, because the chance that one of them will be picked is the sum of the chances that each will be picked. Outliers are, by definition, points x with high values of D(x) that are not near other points.
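The seeding strategy above can be sketched as follows. This is an illustration of the D(x)^2-proportional sampling, not the library's internal implementation:

```java
import java.util.Random;

// Sketch of k-means++ seeding: the first center is uniform; each later
// center is sampled with probability proportional to D(x)^2, the squared
// distance to the nearest already-chosen center.
public class KppInitSketch {
    static double sqDist(double[] x, double[] y) {
        double s = 0.0;
        for (int i = 0; i < x.length; ++i) { double d = x[i] - y[i]; s += d * d; }
        return s;
    }

    static double[][] init(double[][] points, int K, Random rng) {
        double[][] centers = new double[K][];
        centers[0] = points[rng.nextInt(points.length)]; // step 1: uniform choice
        double[] d2 = new double[points.length];
        for (int k = 1; k < K; ++k) {
            double total = 0.0;
            for (int i = 0; i < points.length; ++i) {
                d2[i] = Double.MAX_VALUE;
                for (int j = 0; j < k; ++j)  // D(x) = min over existing centers
                    d2[i] = Math.min(d2[i], sqDist(points[i], centers[j]));
                total += d2[i];
            }
            // sample index i with probability d2[i] / total
            double target = rng.nextDouble() * total;
            double cum = 0.0;
            int pick = points.length - 1;
            for (int i = 0; i < points.length; ++i) {
                cum += d2[i];
                if (cum > target) { pick = i; break; }
            }
            centers[k] = points[pick];
        }
        return centers;
    }
}
```

Note that already-chosen centers have D(x) = 0 and so can never be sampled again; nearby points pool their probability mass exactly as described above.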
Relation to Gaussian Mixtures and Expectation/Maximization
The k-means clustering algorithm is implicitly based on a multidimensional Gaussian with independent dimensions with shared means (the centroids) and equal variances. Estimates are carried out by maximum likelihood; that is, no prior, or equivalently the fully uninformative prior, is used. Where k-means differs from standard expectation/maximization (EM) is that k-means reweights the expectations so that the closest centroid gets an expectation of 1.0 and all other centroids get an expectation of 0.0. This approach has been called "winner-take-all" EM in the EM literature.
Implementation Notes
The current implementation conserves some computations versus the brute-force approach by (1) only computing vector products in comparing two vectors, and (2) not recomputing distances if we know the
Constructor and Description
KMeansClusterer(FeatureExtractor<E> featureExtractor,
int numClusters,
int maxEpochs,
boolean kMeansPlusPlus,
double minImprovement)
Construct a k-means clusterer with the specified feature extractor, number of clusters, maximum number of epochs, and minimum improvement per epoch, using either traditional or k-means++ initialization.

Modifier and Type        Method and Description

Set<Set<E>>              cluster(Set<? extends E> elementSet)
                         Return a k-means clustering of the specified set of elements using a freshly generated random number generator without intermediate reporting.

Set<Set<E>>              cluster(Set<? extends E> elementSet, Random random, Reporter reporter)
                         Return the k-means clustering for the specified set of elements, using the specified random number generator, sending progress reports to the specified reporter.

FeatureExtractor<E>      featureExtractor()
                         Returns the feature extractor for this clusterer.

int                      maxEpochs()
                         Returns the maximum number of epochs for this clusterer.

double                   minRelativeImprovement()
                         Returns the minimum reduction in relative error required to continue to the next epoch during clustering.

int                      numClusters()
                         Returns the maximum number of clusters this clusterer will return.

Set<Set<E>>              recluster(Set<Set<E>> initialClustering, Set<E> unclusteredElements, Reporter reporter)
                         Recluster the specified initial clustering, adding in the unclustered elements, reporting progress to the specified reporter.
public KMeansClusterer(FeatureExtractor<E> featureExtractor, int numClusters, int maxEpochs, boolean kMeansPlusPlus, double minImprovement)
If the kMeansPlusPlus flag is set to true, the k-means++ initialization strategy is used. If it is set to false, initialization of each cluster is handled by a random shuffling of all elements into clusters.

If the number of epochs is set to zero, the result will be a random balanced clustering of the specified size.
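The zero-epoch behavior (a random balanced clustering) can be sketched generically. This is an illustration of the idea, not the library's code; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of a random balanced clustering: shuffle the elements and deal
// them round-robin into k clusters, so cluster sizes differ by at most one.
public class RandomBalancedSketch {
    static <E> List<List<E>> randomBalanced(List<E> elems, int k, Random rng) {
        List<E> shuffled = new ArrayList<E>(elems);
        Collections.shuffle(shuffled, rng); // random assignment order
        List<List<E>> clusters = new ArrayList<List<E>>();
        for (int i = 0; i < k; ++i)
            clusters.add(new ArrayList<E>());
        for (int i = 0; i < shuffled.size(); ++i)
            clusters.get(i % k).add(shuffled.get(i)); // round-robin keeps sizes balanced
        return clusters;
    }
}
```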
Parameters:
featureExtractor - Feature extractor for this clusterer.
numClusters - Number of clusters to return.
maxEpochs - Maximum number of epochs during optimization.
kMeansPlusPlus - Set to true to use k-means++ initialization.
minImprovement - Minimum relative improvement in squared distance scatter to keep going to the next epoch.
Throws:
IllegalArgumentException - If the number of clusters is less than 1, if the maximum number of epochs is less than 0, or if the minimum improvement is not a finite, non-negative number.

public FeatureExtractor<E> featureExtractor()
public int numClusters()
Returns the maximum number of clusters, the "k" in "k-means". Clustering fewer elements will result in fewer clusters.
public int maxEpochs()
public Set<Set<E>> cluster(Set<? extends E> elementSet)
This is just a utility method implementing the Clusterer interface. A call to cluster(elementSet) produces the same result as cluster(elementSet, new Random(), null).
See the class documentation above for more information.
public Set<Set<E>> cluster(Set<? extends E> elementSet, Random random, Reporter reporter)
The reason this is a separate method is that typical implementations of Random are not thread safe, and reports from different clustering runs should rarely be interleaved to a single reporter.

Using a fixed random number generator (e.g., by using the same seed for Random) will produce the same clustering, which can be useful for replicating tests.
See the class documentation above for more information.
Parameters:
elementSet - Set of elements to cluster.
random - Random number generator.
reporter - Reporter to which progress reports are sent, or null if no reporting is required.

public double minRelativeImprovement()
public Set<Set<E>> recluster(Set<Set<E>> initialClustering, Set<E> unclusteredElements, Reporter reporter)
Parameters:
initialClustering - Initialization of clustering.
unclusteredElements - Elements that have not been clustered.
reporter - Reporter to which to send progress reports, or null for no reporting.
Throws:
IllegalArgumentException - If there are empty clusters in the clustering or if an element belongs to more than one cluster.