public class JaccardDistance extends TokenizedDistance
The JaccardDistance class implements a notion of distance based on token overlap. The tokens are generated from the character sequences being compared by a tokenizer factory that is supplied at construction time. A distance of zero (0) is a perfect match; a distance of one (1) is a perfect mismatch.
Suppose that termSet(cs) is the set of tokens extracted from the character sequence cs. With these terms, the proximity underlying Jaccard distance is defined as the fraction of distinct tokens that appear in both character sequences, out of all distinct tokens that appear in either:

    proximity(cs1,cs2) = size(termSet(cs1) INTERSECT termSet(cs2)) / size(termSet(cs1) UNION termSet(cs2))

Proximities run between 0 and 1. A proximity of 0 means the character sequences share no terms, and a proximity of 1 means the character sequences share all of their terms.
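As a worked illustration of the proximity formula, assume a hypothetical tokenizer that simply splits on whitespace, and take cs1 = "a b c" and cs2 = "b c d". Both the tokenizer and the inputs are assumptions chosen for the example; the actual tokens depend on the tokenizer factory supplied at construction time.

```latex
\[
\begin{aligned}
\mathrm{termSet}(cs_1) &= \{a, b, c\}, \qquad \mathrm{termSet}(cs_2) = \{b, c, d\} \\
\mathrm{proximity}(cs_1, cs_2)
  &= \frac{\lvert \{b, c\} \rvert}{\lvert \{a, b, c, d\} \rvert}
   = \frac{2}{4} = 0.5
\end{aligned}
\]
```

Two of the four distinct terms are shared, so the proximity is 0.5.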
Distance is then defined in terms of proximity by subtraction:

    distance(cs1,cs2) = 1 - proximity(cs1,cs2)

Distances also run between 0 and 1. A distance of 0 means the character sequences share all of their terms, whereas a distance of 1 means they have no terms in common.
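The two definitions can be sketched in plain Java without any library dependency. This is a minimal illustration, not this class's actual implementation: the class name JaccardSketch, the whitespaceTermSet helper, and the use of a java.util.function.Function in place of a tokenizer factory are all assumptions made for the example. Returning a proximity of 1 when both term sets are empty is likewise an assumption of the sketch.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

// Hypothetical stand-in for illustration only; not the library class.
public class JaccardSketch {

    // Plays the role of the tokenizer supplied at construction time.
    private final Function<CharSequence, Set<String>> tokenizer;

    public JaccardSketch(Function<CharSequence, Set<String>> tokenizer) {
        this.tokenizer = tokenizer;
    }

    // Assumed tokenizer: splits on runs of whitespace and keeps distinct tokens.
    public static Set<String> whitespaceTermSet(CharSequence cs) {
        Set<String> terms = new HashSet<>();
        for (String token : cs.toString().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // proximity = size(intersection) / size(union)
    public double proximity(CharSequence cs1, CharSequence cs2) {
        Set<String> terms1 = tokenizer.apply(cs1);
        Set<String> terms2 = tokenizer.apply(cs2);

        Set<String> union = new HashSet<>(terms1);
        union.addAll(terms2);
        if (union.isEmpty()) {
            // Assumption of this sketch: two empty term sets match perfectly.
            return 1.0;
        }

        Set<String> intersection = new HashSet<>(terms1);
        intersection.retainAll(terms2);

        return (double) intersection.size() / union.size();
    }

    // distance = 1 - proximity
    public double distance(CharSequence cs1, CharSequence cs2) {
        return 1.0 - proximity(cs1, cs2);
    }

    public static void main(String[] args) {
        JaccardSketch jaccard = new JaccardSketch(JaccardSketch::whitespaceTermSet);
        System.out.println(jaccard.proximity("a b c", "b c d")); // 0.5
        System.out.println(jaccard.distance("a b c", "b c d"));  // 0.5
    }
}
```

Because the tokenizer is passed in, the same sketch yields different distances for the same pair of strings under a different tokenization (for example, character n-grams instead of whitespace-separated words), which mirrors the role of the tokenizer factory in the constructor.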
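|Constructor and Description|
|JaccardDistance(TokenizerFactory factory): Construct an instance of Jaccard string distance using the specified tokenizer factory.|

|Modifier and Type|Method and Description|
|double|distance(CharSequence cSeq1, CharSequence cSeq2): Returns the Jaccard distance between the specified character sequences.|
|double|proximity(CharSequence cSeq1, CharSequence cSeq2): Returns the proximity between the specified character sequences.|

Methods inherited from TokenizedDistance: termFrequencyVector, tokenizerFactory, tokenSet, tokenSet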
public JaccardDistance(TokenizerFactory factory)
factory - Tokenizer factory for distance.
public double distance(CharSequence cSeq1, CharSequence cSeq2)
cSeq1 - First character sequence.
cSeq2 - Second character sequence.