public class TfIdfDistance extends TokenizedDistance implements ObjectHandler<CharSequence>
The TfIdfDistance class provides a string distance based on term frequency (TF) and inverse document frequency (IDF). The method distance(CharSequence,CharSequence) returns results in the range 0 (perfect match) to 1 (no match), inclusive; the method proximity(CharSequence,CharSequence) runs in the opposite direction, returning 0 for no match and 1 for a perfect match. Full details are provided below.
Terms are produced from the character sequences being compared by a tokenizer factory fixed at construction time. These terms form the dimensions of vectors whose values are the counts for the terms in the strings being compared.
The raw term frequencies are adjusted in scale and by inverse document frequency. The resulting term vectors are then compared by one minus their cosine. Because the term vectors contain only non-negative values, the cosine lies between zero (0) and one (1), so the result is a distance between zero (0), for character-by-character identical strings, and one (1), for completely dissimilar strings.
The inverse document frequencies are defined over a collection of documents. The collection of documents must be provided to this class one at a time through the handle(CharSequence) method.
Note that there is a range of different distances called "TF/IDF" distance. The one in this class is defined to be symmetric, unlike typical TF/IDF distances defined for information retrieval. It scales inverse-document frequencies by logs, and both inverse-document frequencies and term frequencies by square roots. This causes the influence of IDF to grow logarithmically, and term frequency comparison to grow linearly.
Suppose we have a collection docs of n strings, which we will call documents in keeping with tradition. Further let df(t,docs) be the document frequency of token t, that is, the number of documents in which the token t appears. Then the inverse document frequency (IDF) of t is defined by:
idf(t,docs) = sqrt(log(n/df(t,docs)))
If the document frequency df(t,docs) of a term is zero, then idf(t,docs) is set to zero. As a result, only terms that appeared in at least one training document are used during comparison.
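The IDF definition above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the class's actual implementation: the class name IdfSketch is hypothetical, and plain whitespace splitting stands in for the tokenizer factory that the real class receives at construction time.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of the documented API.
final class IdfSketch {
    private final Map<String, Integer> docFrequency = new HashMap<>();
    private int numDocuments = 0;

    // Adds one training document; each distinct token counts once per document.
    void handle(CharSequence doc) {
        numDocuments++;
        Set<String> distinct = new HashSet<>(tokenize(doc));
        for (String token : distinct) {
            docFrequency.merge(token, 1, Integer::sum);
        }
    }

    // idf(t,docs) = sqrt(log(n/df(t,docs))), or 0 when df(t,docs) is 0.
    double idf(String term) {
        int df = docFrequency.getOrDefault(term, 0);
        if (df == 0) {
            return 0.0; // unseen term: defined as zero, avoids division by zero
        }
        return Math.sqrt(Math.log((double) numDocuments / df));
    }

    // Assumption: whitespace tokenization replaces the tokenizer factory.
    static List<String> tokenize(CharSequence cs) {
        String s = cs.toString().trim();
        if (s.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(s.split("\\s+"));
    }
}
```

Note that under this formula a term appearing in every training document also receives an IDF of zero, because log(n/n) is zero.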
The term vector for a string is then defined by its term frequencies. If count(t,cs) is the count of term t in character sequence cs, then the term frequency (TF) is defined by:
tf(t,cs) = sqrt(count(t,cs))
The term-frequency/inverse-document frequency (TF/IDF) vector tfIdf(cs,docs) for a character sequence cs over a collection of documents docs has a value tfIdf(cs,docs)(t) for term t defined by:
tfIdf(cs,docs)(t) = tf(t,cs) * idf(t,docs)
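As a rough sketch of how the TF and TF/IDF definitions combine, the following Java builds a sparse TF/IDF vector from a token array and a precomputed table of IDF values. The class name TfIdfVectorSketch and the use of a Map as the vector representation are assumptions made for illustration; the source does not describe the internal data structures.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the documented API.
final class TfIdfVectorSketch {

    // Returns a sparse vector with tfIdf(t) = sqrt(count(t,cs)) * idf(t,docs).
    // Terms missing from the idf table are treated as having idf 0 and omitted.
    static Map<String, Double> tfIdf(String[] tokens, Map<String, Double> idf) {
        // count(t,cs): raw occurrences of each token in the sequence
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            double idfValue = idf.getOrDefault(entry.getKey(), 0.0);
            if (idfValue > 0.0) {
                double tf = Math.sqrt(entry.getValue());
                vector.put(entry.getKey(), tf * idfValue);
            }
        }
        return vector;
    }
}
```

For example, a token that occurs four times and has an IDF of 0.5 contributes sqrt(4) * 0.5 = 1.0 to the vector.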
The proximity between character sequences cs1 and cs2 is defined as the cosine of their TF/IDF vectors:
proximity(cs1,cs2) = cosine(tfIdf(cs1,docs),tfIdf(cs2,docs))
Recall that the cosine of two vectors is the dot product of the vectors divided by the product of their lengths:
cosine(x,y) = (x · y) / (|x| * |y|)
where dot products are defined by:
x · y = Σ_i x[i] * y[i]
and length is defined by:
|x| = sqrt(x · x)
Distance is then just 1 minus the proximity value.
distance(cs1,cs2) = 1 - proximity(cs1,cs2)
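The cosine, proximity, and distance definitions can be sketched over sparse vectors as follows. The class name CosineSketch is hypothetical, and returning a proximity of 0 when either vector has zero length is an assumption of this sketch; the text above does not say how that case is handled.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of the documented API.
final class CosineSketch {

    // x · y: sum of products over the terms present in both vectors.
    static double dot(Map<String, Double> x, Map<String, Double> y) {
        double sum = 0.0;
        for (Map.Entry<String, Double> entry : x.entrySet()) {
            Double other = y.get(entry.getKey());
            if (other != null) {
                sum += entry.getValue() * other;
            }
        }
        return sum;
    }

    // Cosine of the two vectors: (x · y) / (|x| * |y|).
    static double proximity(Map<String, Double> x, Map<String, Double> y) {
        double lengths = Math.sqrt(dot(x, x)) * Math.sqrt(dot(y, y));
        if (lengths == 0.0) {
            return 0.0; // assumption: a zero-length vector matches nothing
        }
        return dot(x, y) / lengths;
    }

    // Distance is one minus proximity.
    static double distance(Map<String, Double> x, Map<String, Double> y) {
        return 1.0 - proximity(x, y);
    }
}
```

Because the dot product is symmetric in its arguments, swapping the two vectors leaves both values unchanged, which matches the symmetry this class is described as having.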
See also: org.apache.lucene.search.Similarity class documentation.
Constructor and Description 

TfIdfDistance(TokenizerFactory tokenizerFactory)
    Construct an instance of TF/IDF string distance based on the specified tokenizer factory.

Modifier and Type  Method and Description 

double distance(CharSequence cSeq1, CharSequence cSeq2)
    Returns the TF/IDF distance between the specified character sequences.

int docFrequency(String term)
    Returns the number of training documents that contained the specified term.

void handle(CharSequence cSeq)
    Adds the specified character sequence as a document for training.

double idf(String term)
    Returns the inverse-document frequency for the specified term.

int numDocuments()
    Returns the total number of training documents.

int numTerms()
    Returns the number of terms that have been seen during training.

double proximity(CharSequence cSeq1, CharSequence cSeq2)
    Returns the TF/IDF proximity between the specified character sequences.

Set<String> termSet()
    Returns the set of known terms for this distance.
Methods inherited from class TokenizedDistance: termFrequencyVector, tokenizerFactory, tokenSet, tokenSet
public TfIdfDistance(TokenizerFactory tokenizerFactory)
Parameters:
    tokenizerFactory - Tokenizer factory for this distance.

public void handle(CharSequence cSeq)
Specified by:
    handle in interface ObjectHandler<CharSequence>
Parameters:
    cSeq - Characters to train on.

public double distance(CharSequence cSeq1, CharSequence cSeq2)
Specified by:
    distance in interface Distance<CharSequence>
Parameters:
    cSeq1 - First character sequence.
    cSeq2 - Second character sequence.

public double proximity(CharSequence cSeq1, CharSequence cSeq2)
Specified by:
    proximity in interface Proximity<CharSequence>
Parameters:
    cSeq1 - First character sequence.
    cSeq2 - Second character sequence.

public int docFrequency(String term)
Parameters:
    term - Term to test.

public double idf(String term)
Parameters:
    term - The term whose IDF is returned.

public int numDocuments()

public int numTerms()

public Set<String> termSet()