E
 the type of object being clusteredpublic class SingleLinkClusterer<E> extends AbstractHierarchicalClusterer<E>
SingleLinkClusterer
implements standard singlelink
agglomerative clustering. Single link clustering is a greedy
algorithm in which the two closest clusters are always merged
up to a specified distance threshold. Distance between clustes for
single link clustering is defined to be the minimum of the distances
between the members of the clusters. See CompleteLinkClusterer
for a clusterer that takes the maximum rather than the minimum in
making clustering decisions.
For example, consider the following proximity matrix
representing distances between pairs of elements (the same example
is used in CompleteLinkClusterer
):
The result of singlelink clustering is the following dendrogram (pardon the ASCII art):
A B C D E A 0 1 2 7 5 B 0 3 8 6 C 0 5 9 D 0 4 E 0
A B C D E      1     2    3    4   5 
First, the objects A
and B
are merged
at distance one. Then object C
is merged into the
cluster at distance 2, because that is its distance from
A
. Objects D
and E
are
merged next at distance 4, and finally, the two big clusters are
merged at distance 5, the distance between A
and
E.
The various clusters at each proximity bound threshold are:
The intervals show the clusters returned for thresholds within the specified interval. As usual, square brackets denote inclusive range bounds and round brackets exclusive bounds. For instance, if the distance threshold is 1.5, four clusters are returned, whereas if the threshold is 4.0, two clusters are returned.
Threshold Range Clusters [Double.NEGATIVE_INFINITY,1) {A}, {B}, {C}, {D}, {E} [1,2) {A,B}, {C}, {D}, {E} [2,4) {A,B,C}, {D}, {E} [4,5) {A,B,C}, {D,E} [5,Double.POSITIVE_INFINITY] {A,B,C,D,E}
Note that this example has the same clusters as that in CompleteLinkClusterer
. A simple example where the results are
different is the result of clustering four points on the xaxis,
x_{1}=1
, x_{2}=3
,
x_{3}=6
, x_{4}=10
.
The distance is computed by subtraction, e.g. d(x_{2},
x_{4}) = x_{4} x_{2}.
This leads to the following distance matrix:
x_{1} x_{2} x_{3} x_{4}
x_{1} 0 2 5 9
x_{2} 0 3 7
x_{3} 0 4
x_{4} 0
For this example, the complete link clustering is
{{x_{1},x_{2}},{x_{3},x_{4}}}
whereas the singlelink clustering is:
{{{x_{1},x_{2}},x_{3}},x_{4}}
Implementation Note: This clusterer is implemented using
the minimumspanningtreebased algorithm described in:
Jain, Anil K. and Richard C. Dubes. 1988. Algorithms for Clustering
Data. PrenticeHall.
The minimum spanning tree is constructed using Kruskal's algorithm,
which is described in:
Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.
2001.
Introduction to Algorithms. MIT Press.
This algorithm requires an efficient lookup of whether two
nodes are already in the same connected component, which
is implemented by the standard disjoint set algorithm with
path compression (also described in Cormen et al.).
In brief, the algorithm sorts all pairs of objects in order of
increasing distance (with n
objects that involves
sorting n^{2}
pairs of objects,
leading to an initial step with worst case time complexity bound
O(n^{2} log n)
. The pairs are then
merged in order of decreasing distance. The sets involved in
Kruskal's algorithm translate directly into dendrograms with their
dereferencing.
 Since:
 LingPipe2.0
 Version:
 4.1.0
 Author:
 Bob Carpenter


Constructor Summary
Constructors
Constructor and Description
SingleLinkClusterer(Distance<? super E> distance)
Construct a singlelink clusterer with no distance bound.
SingleLinkClusterer(double maxDistance,
Distance<? super E> distance)
Construct a singlelink clusterer with the specified distance
bound.

Method Summary
All Methods Instance Methods Concrete Methods
Modifier and Type
Method and Description
Dendrogram<E>
hierarchicalCluster(Set<? extends E> elementSet)
Return the array of clusters derived from the specified
distance matrix by performing singlelink clustering up to the
specified maximum distance bound.

Methods inherited from class com.aliasi.cluster.AbstractHierarchicalClusterer
cluster, distance, getMaxDistance, setMaxDistance


Constructor Detail

SingleLinkClusterer
public SingleLinkClusterer(double maxDistance,
Distance<? super E> distance)
Construct a singlelink clusterer with the specified distance
bound.
 Parameters:
maxDistance
 Maximum distance for clusters.
distance
 Distance measure between objects to cluster.
 Throws:
IllegalArgumentException
 If the specified bound is not
a nonnegative number.

SingleLinkClusterer
public SingleLinkClusterer(Distance<? super E> distance)
Construct a singlelink clusterer with no distance bound.
The distance bound is set to Double.POSITIVE_INFINITY
,
which effectively removes the distance bound.
 Parameters:
distance
 Distance measure between objects to cluster.

Method Detail

hierarchicalCluster
public Dendrogram<E> hierarchicalCluster(Set<? extends E> elementSet)
Return the array of clusters derived from the specified
distance matrix by performing singlelink clustering up to the
specified maximum distance bound. Every pair of elements in
the matrix that are within the distance bound will fall in the
same dendrogram in the result. Setting the maximum distance
to Double.POSITIVE_INFINITY
results in a complete
clustering.
 Specified by:
hierarchicalCluster
in interface HierarchicalClusterer<E>
 Specified by:
hierarchicalCluster
in class AbstractHierarchicalClusterer<E>
 Parameters:
elementSet
 Set of elements to cluster.
 Returns:
 Clustering in the form of a set of sets of elements.
 Throws:
IllegalArgumentException
 If the set of elements is empty.