public class TfIdfDistance extends TokenizedDistance implements ObjectHandler<CharSequence>
The TfIdfDistance class provides a string distance based on term frequency (TF) and inverse document frequency (IDF). The method
distance(CharSequence,CharSequence) returns results in the range between
0 (perfect match) and
1 (no match), inclusive; the method
proximity(CharSequence,CharSequence) runs in the opposite direction, returning
0 for no match and
1 for a perfect match. Full details are provided below.
Terms are produced from the character sequences being compared by a tokenizer factory fixed at construction time. These terms form the dimensions of vectors whose values are the counts for the terms in the strings being compared.
The raw term frequencies are adjusted in scale and by inverse
document frequency. The resulting term vectors are then compared
by one minus their cosine. Because the term vectors contain only
non-negative values, the result is a distance between zero
(0), for character-by-character identical strings, and one
(1), for completely dissimilar strings.
The inverse document frequencies are defined over a collection
of documents. The collection of documents must be provided to this
class one at a time through the handle(CharSequence) method.
Note that there is a range of different distances called "TF/IDF" distance. The one in this class is defined to be symmetric, unlike typical TF/IDF distances defined for information retrieval. It scales inverse-document frequencies by logs, and both inverse-document frequencies and term frequencies by square roots. This causes the influence of IDF to grow logarithmically, and term frequency comparison to grow linearly.
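The one-document-at-a-time training described above can be sketched as follows. This is a minimal illustration, not the class's actual source: DocFrequencySketch is a hypothetical name, and whitespace splitting is an assumed stand-in for the tokenizer factory supplied at construction time. The point it shows is that document frequency counts each term at most once per document, however often the term repeats within it.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the training side of the class; not library code.
final class DocFrequencySketch {
    private final Map<String, Integer> docFrequency = new HashMap<>();
    private int numDocuments = 0;

    // Assumption: whitespace tokenization replaces the configured tokenizer factory.
    void handle(CharSequence cSeq) {
        Set<String> seen = new HashSet<>();
        for (String token : cSeq.toString().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                seen.add(token); // a set, so repeats within one document count once
            }
        }
        for (String token : seen) {
            docFrequency.merge(token, 1, Integer::sum);
        }
        numDocuments++;
    }

    int docFrequency(String term) {
        return docFrequency.getOrDefault(term, 0);
    }

    int numDocuments() {
        return numDocuments;
    }

    int numTerms() {
        return docFrequency.size();
    }

    public static void main(String[] args) {
        DocFrequencySketch sketch = new DocFrequencySketch();
        sketch.handle("the cat sat");
        sketch.handle("the the dog");
        System.out.println("df(the) = " + sketch.docFrequency("the"));
        System.out.println("documents = " + sketch.numDocuments());
        System.out.println("terms = " + sketch.numTerms());
    }
}
```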
Suppose we have a collection docs of n
strings, which we will call documents in keeping with tradition.
Let df(t,docs) be the document frequency of term
t, that is, the number of documents in which the term
t appears. Then the inverse document frequency (IDF) of
t is defined by:
idf(t,docs) = sqrt(log(n/df(t,docs)))
If the document frequency
df(t,docs) of a term is zero, then
idf(t,docs) is set to zero. As a result,
only terms that appeared in at least one training document are
used during comparison.
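As a small sketch of the IDF definition and its zero-frequency rule (IdfSketch is a hypothetical name, and the natural logarithm is an assumption, since the text does not state the base):

```java
// Hypothetical helper illustrating the IDF formula; not library code.
final class IdfSketch {
    // Assumption: natural log. The base only rescales every IDF by the same
    // constant, which cancels out in the cosine comparison.
    static double idf(int docFrequency, int numDocuments) {
        if (docFrequency == 0) {
            return 0.0; // term never seen in training
        }
        return Math.sqrt(Math.log((double) numDocuments / docFrequency));
    }

    public static void main(String[] args) {
        System.out.println("unseen term:      " + idf(0, 4));
        System.out.println("in every doc:     " + idf(4, 4));
        System.out.println("in one doc of 4:  " + idf(1, 4));
        System.out.println("in two docs of 4: " + idf(2, 4));
    }
}
```

Note that a term occurring in every training document also receives an IDF of zero, because log(n/n) = 0, so such terms contribute nothing to the comparison either.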
The term vector for a string is then defined by its term
frequencies. If count(t,cs) is the count of term
t in character sequence cs, then
the term frequency (TF) is defined by:
tf(t,cs) = sqrt(count(t,cs))
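The square-root damping of raw counts might look like the sketch below. TermFrequencySketch is a hypothetical name, and whitespace splitting is again only an assumed substitute for the configured tokenizer factory.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating tf(t,cs) = sqrt(count(t,cs)); not library code.
final class TermFrequencySketch {
    static Map<String, Double> termFrequencies(CharSequence cs) {
        // Step 1: raw counts per term (assumed whitespace tokenization).
        Map<String, Integer> counts = new HashMap<>();
        for (String token : cs.toString().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        // Step 2: damp each count with a square root.
        Map<String, Double> tf = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            tf.put(e.getKey(), Math.sqrt(e.getValue()));
        }
        return tf;
    }

    public static void main(String[] args) {
        // "b" occurs four times but only receives twice the weight of "a".
        System.out.println(termFrequencies("a b b b b"));
    }
}
```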
The term-frequency/inverse-document frequency (TF/IDF) vector
tfIdf(cs,docs) for a character sequence cs
over a collection of documents
docs has a value
tfIdf(cs,docs)(t) for term
t defined by:
tfIdf(cs,docs)(t) = tf(t,cs) * idf(t,docs)
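Putting the two factors together, a TF/IDF vector can be built as a sparse map from terms to weights. The sketch below is illustrative only: TfIdfVectorSketch and its parameters are hypothetical names, the natural logarithm is assumed, and dropping zero-weight entries is a storage choice made here rather than something the text prescribes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating tfIdf(cs,docs)(t) = tf(t,cs) * idf(t,docs).
final class TfIdfVectorSketch {
    // counts: raw term counts for one character sequence.
    // docFrequencies: number of training documents containing each term.
    static Map<String, Double> tfIdf(Map<String, Integer> counts,
                                     Map<String, Integer> docFrequencies,
                                     int numDocuments) {
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int df = docFrequencies.getOrDefault(e.getKey(), 0);
            if (df == 0) {
                continue; // unseen term: idf is defined to be zero
            }
            double tf = Math.sqrt(e.getValue());
            double idf = Math.sqrt(Math.log((double) numDocuments / df)); // assumed natural log
            double weight = tf * idf;
            if (weight > 0.0) {
                vector.put(e.getKey(), weight); // keep the map sparse
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("a", 4);
        counts.put("b", 1);
        counts.put("z", 1); // never seen in training
        Map<String, Integer> df = new HashMap<>();
        df.put("a", 1);
        df.put("b", 2);
        System.out.println(tfIdf(counts, df, 4));
    }
}
```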
The proximity between character sequences cs1 and
cs2 is defined as the cosine of their TF/IDF vectors:
proximity(cs1,cs2) = cos(tfIdf(cs1,docs),tfIdf(cs2,docs))
Recall that the cosine of two vectors is the dot product of the vectors divided by the product of their lengths:
cos(x,y) = x . y / ( |x| * |y| )
where dot products are defined by:
x . y = Σi x[i] * y[i]
and length is defined by:
|x| = sqrt(x . x)
Distance is then just 1 minus the proximity value.
distance(cs1,cs2) = 1 - proximity(cs1,cs2)
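The cosine, proximity, and distance definitions above can be sketched over sparse term-weight maps as follows. CosineSketch is a hypothetical name. Two choices here are assumptions rather than documented behavior: returning a proximity of 0 when either vector has zero length (the formula is undefined in that case), and clamping the result to guard against floating-point rounding slightly above 1.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating cosine proximity and distance; not library code.
final class CosineSketch {
    // x . y = sum over shared terms of x[t] * y[t]
    static double dot(Map<String, Double> x, Map<String, Double> y) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : x.entrySet()) {
            sum += e.getValue() * y.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }

    // |x| = sqrt(x . x)
    static double length(Map<String, Double> x) {
        return Math.sqrt(dot(x, x));
    }

    static double proximity(Map<String, Double> x, Map<String, Double> y) {
        double denominator = length(x) * length(y);
        if (denominator == 0.0) {
            return 0.0; // assumption: treat an empty vector as matching nothing
        }
        double cosine = dot(x, y) / denominator;
        return Math.max(0.0, Math.min(1.0, cosine)); // defensive clamp
    }

    static double distance(Map<String, Double> x, Map<String, Double> y) {
        return 1.0 - proximity(x, y);
    }

    public static void main(String[] args) {
        Map<String, Double> x = new HashMap<>();
        x.put("a", 3.0);
        x.put("b", 4.0);
        Map<String, Double> y = new HashMap<>();
        y.put("a", 3.0);
        // dot = 9, |x| = 5, |y| = 3, so the cosine is 9 / 15.
        System.out.println("proximity = " + proximity(x, y));
        System.out.println("distance  = " + distance(x, y));
    }
}
```

Because the dot product is symmetric in its arguments, so are the proximity and the distance, which matches the symmetry the class description calls out.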
|Constructor and Description|
|---|
|TfIdfDistance(TokenizerFactory tokenizerFactory): Construct an instance of TF/IDF string distance based on the specified tokenizer factory.|
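|Modifier and Type|Method and Description|
|---|---|
|double|distance(CharSequence cSeq1, CharSequence cSeq2): Return the TF/IDF distance between the specified character sequences.|
|int|docFrequency(String term): Returns the number of training documents that contained the specified term.|
|void|handle(CharSequence cSeq): Add the specified character sequence as a document for training.|
|double|idf(String term): Return the inverse-document frequency for the specified term.|
|int|numDocuments(): Returns the total number of training documents.|
|int|numTerms(): Returns the number of terms that have been seen during training.|
|double|proximity(CharSequence cSeq1, CharSequence cSeq2): Returns the TF/IDF proximity between the specified character sequences.|
|Set<String>|termSet(): Returns the set of known terms for this distance.|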
Methods inherited from class TokenizedDistance: termFrequencyVector, tokenizerFactory, tokenSet, tokenSet
public TfIdfDistance(TokenizerFactory tokenizerFactory)
tokenizerFactory - Tokenizer factory for this distance.
public void handle(CharSequence cSeq)
public double distance(CharSequence cSeq1, CharSequence cSeq2)
public double proximity(CharSequence cSeq1, CharSequence cSeq2)
public int docFrequency(String term)
term - Term to test.
public double idf(String term)
term - The term whose IDF is returned.
public int numDocuments()
public int numTerms()