public abstract class WeightedEditDistance extends Object implements Distance<CharSequence>, Proximity<CharSequence>
WeightedEditDistance
class implements both the
proximity and distance interfaces based on the negative proximity
weights assigned to independent atomic edit operations.
Weights on edit operations are scaled as log probabilities.
Practically speaking, this means that the larger the weight, the
more likely the edit operation; keep in mind that 1 is larger than
3, representing 2^{1} = 1/2
and
2^{3} = 1/8
respectively on a linear
probability scale.
The log probability of a sequence of independent edits is the
sum of the log probabilities of the individual edits. Proximity
between strings s1
and s2
is defined as
the maximum sum of edit weights over sequences of edits that
convert s1
to s2
.
Like the individual edit weights, proximity is scaled as a log probability of the complete edit. The larger the proximity, the closer the strings; again, keep in mind that 10 is larger than 20, representing roughly 1/1000 and 1/1,000,000 on the linear probability scale.
Distance is just negative proximity. This scales edit distances in the usual way, with distance of 3 between strings indicating they are further away from each other than strings at distance 1.25.
This class generalizes the behavior of the class
spell.EditDistance
without extending it in the inheritance
sense. Weighted edit distance agrees with edit distance (up to
arithmetic precision) as a distance assuming the following weights:
match weight is 0, substitute, insert and delete weights are
1
, and the transposition weight is 1
if
transpositions are allowed in the edit distance and
Double.NEGATIVE_INFINITY
otherwise.
If the substitution and transposition weights are symmetric and the insert and delete costs of a character are equal, then weighted edit distance will be symmetric.
If the match weight of all characters is zero, then the distance between a character sequence and itself will be zero.
If transpose weights are negative infinity so that transposition is
not allowed, and if the assignment of substitution weights forms a
metric (see Distance
for a definition), and if delete and
insert weights are nonnegative and equal for all characters, and
if match weights are all zero, then weighted edit distance will
form a proper metric. Other values may also form metrics, such as
a weight of 1 for all edits other than transpose.
A probabilistic relational model between strings is defined if
the weights are properly scaled as log probabilities. Because
probabilities are between 0 and 1, log probabilities will be
between negative infinity and zero. Proximity between two strings
in
and out
is defined by:
where the cost of the edit is defined to be:proximity(in,out) = Max_{edit(in)=out} log2 P(edit)
log2 P(edit)
= log2 P(edit_{0},...,edit_{n1})
~ log2 P(edit_{0}) + ... + log P(edit_{n1})
The last line is an approximation assuming edits are
independent.
In order to create a proper probabilistic channel, exponentiated edit weights must sum to 1.0. This is not technically possible with a local model if transposition is allowed, because of boundary conditions and independence assumptions. It is possible to define a proper channel if transposition is off, and if all edit weights for a position (including all sequences of arbitrarily long insertions) sum to 1.0. In particular, if any edits at all are allowed (have finite weights), then there must be a nonzero weight assigned to matching, otherwise exponentiated edit weight sum would exceed 1.0. It is always possible to add an offset to normalize the values to a probability model (the offset will be negative if the sum exceeds 1.0 and positive if it falls below 1.0 and zero otherwise).
A fully probabilistic model would have to take the sum over all edits rather than the maximum. This class makes the socalled Viterbi approximation, assuming the full probability is close to that of the best probability, or at least proportional to it.
Constructor and Description 

WeightedEditDistance()
Construct a weighted edit distance.

Modifier and Type  Method and Description 

abstract double 
deleteWeight(char cDeleted)
Returns the weight of deleting the specified character.

double 
distance(CharSequence csIn,
CharSequence csOut)
Returns the weighted edit distance between the specified
character sequences.

abstract double 
insertWeight(char cInserted)
Returns the weight of inserting the specified character.

abstract double 
matchWeight(char cMatched)
Returns the weight of matching the specified character.

double 
proximity(CharSequence csIn,
CharSequence csOut)
Returns the weighted proximity between the specified character
sequences.

abstract double 
substituteWeight(char cDeleted,
char cInserted)
Returns the weight of substituting the inserted character for
the deleted character.

abstract double 
transposeWeight(char cFirst,
char cSecond)
Returns the weight of transposing the specified characters.

public WeightedEditDistance()
public double distance(CharSequence csIn, CharSequence csOut)
This method is thread safe and may be accessed concurrently if the abstract weighting methods are thread safe.
distance
in interface Distance<CharSequence>
csIn
 First character sequence.csOut
 Second character sequence.public double proximity(CharSequence csIn, CharSequence csOut)
This method is thread safe and may be accessed concurrently if the abstract weighting methods are thread safe.
proximity
in interface Proximity<CharSequence>
csIn
 First character sequence.csOut
 Second character sequence.public abstract double matchWeight(char cMatched)
All weights should be less than or equal to zero, with
heavier weights being larger absolute valued negatives.
Basically, the weights may be treated as unscaled log
probabilities. Thus valid values will range between 0.0
(probablity 1) and Double.NEGATIVE_INFINITY
(probability 0). See the class documentation above for more
information.
cMatched
 Character matched.public abstract double deleteWeight(char cDeleted)
All weights should be less than or equal to zero, with
heavier weights being larger absolute valued negatives.
Basically, the weights may be treated as unscaled log
probabilities. Thus valid values will range between 0.0
(probablity 1) and Double.NEGATIVE_INFINITY
(probability 0). See the class documentation above for more
information.
cDeleted
 Character deleted.public abstract double insertWeight(char cInserted)
All weights should be less than or equal to zero, with
heavier weights being larger absolute valued negatives.
Basically, the weights may be treated as unscaled log
probabilities. Thus valid values will range between 0.0
(probablity 1) and Double.NEGATIVE_INFINITY
(probability 0). See the class documentation above for more
information.
cInserted
 Character inserted.public abstract double substituteWeight(char cDeleted, char cInserted)
All weights should be less than or equal to zero, with
heavier weights being larger absolute valued negatives.
Basically, the weights may be treated as unscaled log
probabilities. Thus valid values will range between 0.0
(probablity 1) and Double.NEGATIVE_INFINITY
(probability 0). See the class documentation above for more
information.
cDeleted
 Deleted character.cInserted
 Inserted character.public abstract double transposeWeight(char cFirst, char cSecond)
All weights should be less than or equal to zero, with
heavier weights being larger absolute valued negatives.
Basically, the weights may be treated as unscaled log
probabilities. Thus valid values will range between 0.0
(probablity 1) and Double.NEGATIVE_INFINITY
(probability 0). See the class documentation above for more
information.
cFirst
 First character in input.cSecond
 Second character in input.