public class KnnClassifier<E> extends Object implements ScoredClassifier<E>, ObjectHandler<Classified<E>>, Compilable, Serializable

Type parameter: E is the type of objects being classified.
KnnClassifier implements k-nearest-neighbor classification based on feature extraction and a vector proximity or distance. K-nearest-neighbor classification is a kind of memory-based learning in which every training instance is stored along with its category. To classify an object, the k training examples nearest to that object are found, and each of these k nearest neighbors votes for its training category. The resulting classification scores are the result of voting. It is possible to weight the votes by proximity.

K-nearest-neighbor classifiers are particularly effective for highly irregular classification boundaries, where linear classifiers like the perceptron have a hard time discriminating instances. For instance, it is possible to learn checkerboard patterns in 2D space using k-nearest-neighbor classification.
A k-nearest-neighbor classifier is constructed from a feature extractor, the number of neighbors k to consider, a vector distance or proximity, and a boolean indicator of whether to weight results by proximity or to treat the nearest neighbors equally.
Training simply involves storing the feature vector for each training instance along with its category. The vectors are stored as instances of SparseFloatVector for space efficiency. They are constructed using a symbol table for features based on the feature extractor specified for this classifier.
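The symbol-table encoding described above can be sketched in plain Java. This is a standalone illustration, not LingPipe's actual implementation: the sparse vector is simplified to a `Map<Integer, Float>`, and the class name and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of encoding a feature map as a sparse vector via a
// symbol table. LingPipe's SparseFloatVector and symbol tables are
// richer; this only illustrates the idea.
public class FeatureEncoding {

    private final Map<String, Integer> symbolTable = new HashMap<>();

    // Map each feature name to a stable integer id, assigning fresh
    // ids in order of first appearance.
    int idFor(String feature) {
        return symbolTable.computeIfAbsent(feature, f -> symbolTable.size());
    }

    // Encode a feature-name-to-value map as sparse id-to-value entries.
    Map<Integer, Float> encode(Map<String, ? extends Number> features) {
        Map<Integer, Float> vector = new HashMap<>();
        for (Map.Entry<String, ? extends Number> e : features.entrySet()) {
            vector.put(idFor(e.getKey()), e.getValue().floatValue());
        }
        return vector;
    }

    public static void main(String[] args) {
        FeatureEncoding enc = new FeatureEncoding();
        System.out.println(enc.encode(Map.of("length", 3, "caps", 1)));
    }
}
```

Because the table is shared across calls, a feature seen at training time maps to the same dimension when it recurs in a classification instance.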
Nearness is defined by the proximity or distance function over vectors supplied at construction time. As objects move nearer to one another, their distance decreases and their proximity increases. Distance measures are converted into proximities behind the scenes by inversion, which leaves all proximities positive and finite:

proximity(v1, v2) = 1 / (1 + distance(v1, v2))

This scales distance functions that return results between 0 and positive infinity into proximities between 0 and 1.
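The inversion above can be sketched in plain Java (a standalone illustration of the formula, not LingPipe's actual code; the method name is hypothetical):

```java
// Sketch of the distance-to-proximity conversion described above.
public class ProximityConversion {

    // proximity(v1, v2) = 1 / (1 + distance(v1, v2))
    // Maps distances in [0, +infinity) to proximities in (0, 1].
    static double distanceToProximity(double distance) {
        return 1.0 / (1.0 + distance);
    }

    public static void main(String[] args) {
        System.out.println(distanceToProximity(0.0)); // identical vectors -> 1.0
        System.out.println(distanceToProximity(3.0)); // -> 0.25
        System.out.println(distanceToProximity(Double.POSITIVE_INFINITY)); // -> 0.0
    }
}
```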
Warning: Distance functions used for k-nearest-neighbor classification should not return negative values; any zero or negative values will be converted to Double.POSITIVE_INFINITY.
Classification involves finding the k nearest neighbors to a query point. Both training instances and instances to be classified are converted to feature mappings using the specified feature extractor, and then encoded as sparse vectors using an implicitly managed feature symbol table.
The first step in classification is simply collecting the k nearest neighbors, that is, the training examples that have the greatest proximity to the example being classified. Given the set of k training examples that are closest to the test point, one of two strategies is used for determining scores. In the simple, unweighted case, the score of a category is simply the number of vectors among the k nearest neighbors with that category. In the weighted case, each vector among the k nearest neighbors contributes its proximity, and the final score for a category is the sum of the proximities of its neighbors.
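The two scoring strategies can be sketched as follows. This is a standalone illustration, not LingPipe's internals: the `Neighbor` record pairing a training category with its proximity to the test point is a hypothetical stand-in.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Standalone sketch of the two k-NN scoring strategies described above.
public class KnnScoring {

    // Hypothetical pairing of a neighbor's category with its proximity
    // to the test point; not part of LingPipe's API.
    record Neighbor(String category, double proximity) {}

    // Score the k nearest neighbors either by straight vote (each
    // neighbor contributes 1) or by summing proximities per category.
    static Map<String, Double> score(List<Neighbor> kNearest,
                                     boolean weightByProximity) {
        Map<String, Double> scores = new HashMap<>();
        for (Neighbor n : kNearest) {
            double vote = weightByProximity ? n.proximity() : 1.0;
            scores.merge(n.category(), vote, Double::sum);
        }
        return scores;
    }

    public static void main(String[] args) {
        List<Neighbor> nearest = List.of(
            new Neighbor("spam", 0.75),
            new Neighbor("ham", 0.5),
            new Neighbor("spam", 0.25));
        System.out.println(score(nearest, false)); // spam scores 2.0, ham 1.0
        System.out.println(score(nearest, true));  // spam scores 1.0, ham 0.5
    }
}
```

Note how weighting changes the result: unweighted, "spam" wins 2 votes to 1; weighted, its two neighbors' proximities sum to 1.0 against ham's 0.5.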
In most cases, it makes sense to optimize the number of neighbors k using cross-validation or held-out data.
In the weighted case, it sometimes makes sense to take the maximum number of neighbors k to be very large, potentially even Integer.MAX_VALUE. This is because the examples are weighted by proximity, and those far away may have vanishingly small proximities.
There is no compilation required for k-nearest-neighbor classification, so serialization and compilation produce the same result. The object read back in after serialization or compilation should be identical to the one serialized or compiled.
This is a brute-force implementation of k-nearest neighbors in the sense that every training example is compared with every object being classified. Some k-nearest-neighbor implementations index the training examples, using techniques such as k-d trees, so that the search for nearest neighbors is more efficient.
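The brute-force search can be sketched as follows. This standalone illustration uses dense `double[]` vectors and Euclidean distance for simplicity, where LingPipe uses sparse vectors and pluggable Distance/Proximity implementations; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Standalone sketch of brute-force k-nearest-neighbor search: every
// stored training vector is compared with the query vector.
public class BruteForceKnn {

    // Euclidean distance over dense vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; ++i) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Return the indices of the k training vectors nearest the query,
    // scanning all of them (O(n) distance computations per query).
    static List<Integer> kNearest(List<double[]> training, double[] query, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < training.size(); ++i) indices.add(i);
        indices.sort(Comparator.comparingDouble(i -> distance(training.get(i), query)));
        return indices.subList(0, Math.min(k, indices.size()));
    }

    public static void main(String[] args) {
        List<double[]> training = List.of(
            new double[] {0.0, 0.0},
            new double[] {1.0, 1.0},
            new double[] {5.0, 5.0});
        // Nearest two training points to (0.9, 0.9): indices 1 then 0.
        System.out.println(kNearest(training, new double[] {0.9, 0.9}, 2)); // [1, 0]
    }
}
```

Sorting all n indices is the simplest correct approach; a bounded priority queue of size k would avoid the full sort but does not change the O(n) distance computations that define the brute-force strategy.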
K-nearest-neighbor classification is widely used and well described in several texts.
Constructor and Description

KnnClassifier(FeatureExtractor<? super E> featureExtractor, int k)
    Construct a k-nearest-neighbor classifier based on the specified feature extractor and using the specified number of neighbors.

KnnClassifier(FeatureExtractor<? super E> featureExtractor, int k, Distance<Vector> distance)
    Construct a k-nearest-neighbor classifier based on the specified feature extractor, specified maximum number of neighbors, and specified distance function.

KnnClassifier(FeatureExtractor<? super E> extractor, int k, Proximity<Vector> proximity, boolean weightByProximity)
    Construct a k-nearest-neighbor classifier based on the specified feature extractor, specified maximum number of neighbors, specified proximity function, and boolean flag indicating whether or not to weight nearest neighbors by proximity during classification.
Modifier and Type    Method and Description

List<String>    categories()
    Returns a copy of the current list of categories for this classifier.

ScoredClassification    classify(E in)
    Return the k-nearest-neighbor classification result for the specified input object.

void    compileTo(ObjectOutput out)
    Compiles this k-nearest-neighbor classifier to the specified object output stream.

FeatureExtractor<? super E>    featureExtractor()
    Returns the feature extractor for this KNN classifier.

void    handle(Classified<E> classifiedObject)
    Handle the specified classified object as a training instance.

int    k()
    Returns the number k of neighbors used by this k-nearest-neighbors classifier.

Proximity<Vector>    proximity()
    Returns the proximity measure for this KNN classifier.

boolean    weightByProximity()
    Returns true if responses are weighted by proximity.
public KnnClassifier(FeatureExtractor<? super E> featureExtractor, int k)

Construct a k-nearest-neighbor classifier based on the specified feature extractor and using the specified number of neighbors. The distance function used is EuclideanDistance. The nearest neighbors will not be weighted by proximity, and thus all have equal votes.

Parameters:
featureExtractor - Feature extractor for training and classification instances.
k - Maximum number of neighbors to use during classification.

public KnnClassifier(FeatureExtractor<? super E> featureExtractor, int k, Distance<Vector> distance)

Construct a k-nearest-neighbor classifier based on the specified feature extractor, specified maximum number of neighbors, and specified distance function.

Parameters:
featureExtractor - Feature extractor for training and classification instances.
k - Maximum number of neighbors to use during classification.
distance - Distance function to use to compare examples.

public KnnClassifier(FeatureExtractor<? super E> extractor, int k, Proximity<Vector> proximity, boolean weightByProximity)

Construct a k-nearest-neighbor classifier based on the specified feature extractor, specified maximum number of neighbors, specified proximity function, and boolean flag indicating whether or not to weight nearest neighbors by proximity during classification.

Parameters:
extractor - Feature extractor for training and classification instances.
k - Maximum number of neighbors to use during classification.
proximity - Proximity function to compare examples.
weightByProximity - Flag indicating whether to weight neighbors by proximity during classification.

public FeatureExtractor<? super E> featureExtractor()

Returns the feature extractor for this KNN classifier.
public Proximity<Vector> proximity()

Returns the proximity measure for this KNN classifier.

public List<String> categories()

Returns a copy of the current list of categories for this classifier.

public boolean weightByProximity()

Returns true if responses are weighted by proximity.

public int k()

Returns the number k of neighbors used by this k-nearest-neighbors classifier.
public void handle(Classified<E> classifiedObject)

Handle the specified classified object as a training instance.

Specified by: handle in interface ObjectHandler<Classified<E>>

Parameters:
classifiedObject - Classified object to use for training.

public ScoredClassification classify(E in)

Return the k-nearest-neighbor classification result for the specified input object.
If this classifier does not weight by proximity, the resulting score for a category will be the number of nearest neighbors of the specified category. That is, it will be a straight vote.
If the classifier does weight by proximity, the resulting score for a category will be the sum of the proximity scores for the nearest neighbors of a given category. Instances with no near neighbors will be scored zero (0). Thus proximities should be configured to return positive values.
Specified by: classify in interface BaseClassifier<E>
Specified by: classify in interface RankedClassifier<E>
Specified by: classify in interface ScoredClassifier<E>

Parameters:
in - Object to classify.

public void compileTo(ObjectOutput out) throws IOException

Compiles this k-nearest-neighbor classifier to the specified object output stream.
This is only a convenience method. It provides exactly the same function as standard serialization.
Specified by: compileTo in interface Compilable

Parameters:
out - Output stream to which this classifier is written.

Throws:
IOException - If there is an underlying I/O exception during compilation.