|
|||||||||
| PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
| SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD | ||||||||
java.lang.Objectcom.aliasi.cluster.KMeansClusterer<E>
public class KMeansClusterer<E>
A KMeansClusterer provides an implementation of
standard k-means clustering based on vectors constructed by feature
extractors. An instance fixes a specific value of k,
the number of clusters returned.
If fewer than the maximum number of elements, k, is
clustered, a clustering with that number of singleton clusters is
returned.
The elements being clustered are first converted into feature vectors using a feature extractor. These feature vectors are then evenly distributed among the clusters using random assignment.
Each cluster is represented by the centroid of the feature vectors assigned to it. Centroids are just dimension-wise averages:
centroid(v[0],...,v[n-1]) = (v[0] + ... + v[n-1]) / n
and are thus vectors in the same space as the feature vectors
being clustered.
Feature vectors are always compared to cluster centroids using Euclidean distance:
distance(x,y) = sqrt( (x - y) * (x - y) )
= sqrt(Σi (x[i] - y[i])2)
Thus any normalization should be done as part of the feature
extraction phase. A common normalization is to divide each feature
vector by its length to produce normal vectors, which weights each
dimension equally and results in distances monotonically related to
cosines.
The algorithm then iteratively improves cluster assignments. Each iteration consists of walking through each feature vector and assigning it to the cluster with the closest centroid. Ties are determined non-deterministically. The centroids are then recomputed at the end of each iteration. When no elements change clusters during an iteration, the result is returned. The algorithm will also return if the maximum number of iterations is reached without arriving at a fixed point.
These iterations have some fancy names in the literature. The
whole approach is known as "Lloyd's Algorithm",
"Voronoi iteration", or simply "relaxation".
The final centroids forms a so-called "Voronoi Tesselation"
of the feature vector space of the objects being clustered. This
tesselation is similar to k-nearest neighbors classification (see
KnnClassifier).
K-means clustering may be viewed as a gradient descent solution to the minimization of distances to cluster centroids, which we define here as error:
error(cs) = Σc in cs Σx in c distance(x,centroid(C))
where cs is the set of clusters, c is a
cluster in c, centroid(c)
is the vector representing the centroid of the cluster c, and
distance(x,centroid(x)) is the distance between an
element x of cluster c and the cluster's
centroid.
K-means clustering is guaranteed to converge to a solution because each increment reduces error as defined in the previous section. There are only finitely many clusterings, so eventually k-means will converge. The only problem is that the size of the finite set of clusterings is exponential in the number of input elements.
The k-means clustering algorithm is implicitly based on a multi-dimensional Gaussian with independent dimensions with their own means and variances. Estimates are carried out by maximum likelihood; that is, no prior, or equivalently the fully uninformative prior, is used. Where k-means differs from standard expectation/maximization (EM) is that k-means reweights the expectations so that the closest centroid gets an expectation of 1.0 and all other centroids get an expectation of 0.0. This approach has been called "winner-take-all" EM in the EM literature.
Like the EM algorithm, k-means clustering is highly sensitive
to the initial assignment of elements. In practice, it is often
helpful to apply k-means clustering repeatedly to the same input
data, returning the clustering which optimizes some evaluation
metric. This metric is typically within-cluster scatter, because
the number of dimensions is fixed. Within-cluster scatter may
be computed with the static method ClusterScore.withinClusterScatter(Set,Distance).
In some cases, the iterative approach taken by k-means leads to a solution in which not every cluster is populated. This happens during a step in which no feature vector is closest to a given centroid. In many of these cases, rerunning the clusterer will find a solution with k clusters.
KThe number of clusters, k, may also be varied. In this case, new k-means clustering instances must be created, as each uses a fixed number of clusters.
By varying k, the maximum number of clusters,
the within-cluster scatter may be compared across different choices
of k. Typically, a value of k is chosen
at a knee of the within-cluster scatter scores. There are automatic
ways of performing this selection, though they are heuristically
rather than theoretically motivated.
In practice, it is technically possible, though unlikely, for clusters to wind up with no elements in them. This implementation will simply return fewer clusters than the maximum specified in this case.
Multiple runs may be used to provide bootstrap estimates of the relatedness of any two elements. Bootstrap estimates work by subsampling the elements to cluster with replacement and then running k-means on them. This is repeated multiple times, and the percentage of runs in which two elements fall in the same cluster forms the bootstrap estimate of the likelihood that they are in the same cluster.
The main advantage K-means has over hierarchical clustering is that it only requires linear time in the number of elements being clustered. There's a constant factor for the maximum number of iterations allowed and another constant factor for the length of the feature vectors.
The current implementation is an inefficient brute-force approach that computes the distance between every element and every centroid on every iteration. These distances are computed between maps, which itself is not very efficient. In the future, we may substitute a more efficient version of k-means implemented to this same interface.
More efficient implementations could use kd-trees to index the points, could cache scores of input feature vectors against clusters, and would only compute differential updates when elements change cluster. Threading could also be used to compute the distances.
| Constructor Summary | |
|---|---|
KMeansClusterer(FeatureExtractor<E> featureExtractor,
int numClusters,
int maxIterations)
Construct a k-means clusterer with the specified feature extractor, number of clusters and limit on number of iterations to run optimization. |
|
| Method Summary | |
|---|---|
Set<Set<E>> |
cluster(Set<? extends E> elementSet)
Return a k-means clustering of the specified set of elements. |
FeatureExtractor<E> |
featureExtractor()
Returns the feature extractor for this clusterer. |
int |
numClusters()
Returns the number of clusters this clusterer will return. |
Set<Set<E>> |
recluster(Set<Set<E>> clustering,
int maxIterations)
Recluster the specified clustering using up to the specified number of k-means iterations. |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public KMeansClusterer(FeatureExtractor<E> featureExtractor,
int numClusters,
int maxIterations)
If the number of iterations is set to zero, the result will be random balanced clusterings of the specified size.
featureExtractor - Feature extractor for this clusterer.numClusters - Number of clusters to return.maxIterations - Maximum number of iterations during
optimization.
IllegalArgumentException - If the number of clusters is
less than 1, or if the maximum number of iterations is less
than 0.| Method Detail |
|---|
public FeatureExtractor<E> featureExtractor()
public int numClusters()
k" in "k-means".
public Set<Set<E>> recluster(Set<Set<E>> clustering,
int maxIterations)
clustering - Clustering to recluster.maxIterations - Maximum number of reclustering iterations.
public Set<Set<E>> cluster(Set<? extends E> elementSet)
cluster in interface Clusterer<E>elementSet - Set of elements to cluster.
|
|||||||||
| PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
| SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD | ||||||||