public class JaccardDistance extends TokenizedDistance

The JaccardDistance class implements a notion of distance based on token overlap. The tokens are generated from the character sequences being compared by a tokenizer factory that is supplied at construction time. A distance of zero (0) is a perfect match; a distance of one (1.0) is a perfect mismatch.

Suppose termSet(cs) is the set of tokens extracted from the character sequence cs. With these terms, the proximity underlying Jaccard distance is defined as the fraction of the distinct tokens found in either character sequence that appear in both:

proximity(cs1,cs2) = size(termSet(cs1) INTERSECT termSet(cs2)) / size(termSet(cs1) UNION termSet(cs2))

Proximities run between 0 and 1. A proximity of 0 means the character sequences share no terms in common and a proximity of 1 means the character sequences share all of their terms.

Distance is then defined in terms of proximity by subtraction:

distance(cs1,cs2) = 1 - proximity(cs1,cs2)

Distances also run between 0 and 1. A distance of 0 means the character sequences share all of their terms, whereas a distance of 1 means they have no terms in common.

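The definitions above can be sketched in plain Java without the library. This is an illustration only, not the class's actual implementation: the class name JaccardSketch is hypothetical, termSet here is a stand-in that lowercases and splits on whitespace (the real class uses whatever tokenizer factory it is given), and treating two empty term sets as a perfect match is an assumption, since the formula is undefined when the union is empty.

```java
import java.util.HashSet;
import java.util.Set;

class JaccardSketch {

    // Hypothetical stand-in for termSet(cs): lowercase, then split on whitespace.
    static Set<String> termSet(CharSequence cs) {
        Set<String> terms = new HashSet<>();
        for (String token : cs.toString().trim().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // proximity = size(intersection) / size(union)
    static double proximity(CharSequence cs1, CharSequence cs2) {
        Set<String> terms1 = termSet(cs1);
        Set<String> terms2 = termSet(cs2);

        Set<String> intersection = new HashSet<>(terms1);
        intersection.retainAll(terms2);

        Set<String> union = new HashSet<>(terms1);
        union.addAll(terms2);

        // Assumption: two empty term sets count as identical (proximity 1).
        if (union.isEmpty()) {
            return 1.0;
        }
        return (double) intersection.size() / union.size();
    }

    // distance = 1 - proximity
    static double distance(CharSequence cs1, CharSequence cs2) {
        return 1.0 - proximity(cs1, cs2);
    }

    public static void main(String[] args) {
        // Shared terms: {new, york}; all terms: {new, york, city, state}.
        System.out.println(proximity("new york city", "new york state")); // 0.5
        System.out.println(distance("new york city", "new york state"));  // 0.5
    }
}
```

In the example, two of the four distinct tokens are shared, so both the proximity and the distance come out to 0.5. Because the computation works on sets, repeated tokens and token order have no effect on the result.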
Constructor and Description

JaccardDistance(TokenizerFactory factory)
Construct an instance of Jaccard string distance using the specified tokenizer factory.

Modifier and Type  Method and Description

double  distance(CharSequence cSeq1, CharSequence cSeq2)
Returns the Jaccard distance between the specified character sequences.

double  proximity(CharSequence cSeq1, CharSequence cSeq2)
Returns the proximity between the specified character sequences.

Methods inherited from class TokenizedDistance: termFrequencyVector, tokenizerFactory, tokenSet, tokenSet

public JaccardDistance(TokenizerFactory factory)
factory - Tokenizer factory for distance.

public double distance(CharSequence cSeq1, CharSequence cSeq2)
cSeq1 - First character sequence.
cSeq2 - Second character sequence.

public double proximity(CharSequence cSeq1, CharSequence cSeq2)
cSeq1 - First character sequence.
cSeq2 - Second character sequence.