public class EditDistance extends Object implements Distance<CharSequence>, Proximity<CharSequence>
EditDistance
class implements the standard notion
of edit distance, with or without transposition. The distance
without transposition is known as Levenshtein distance, and
with transposition as DamerauLevenstein distance (see below about
distancelike metric properties).
The edit distance between two strings is defined to be the minimum number of nonmatching substring edits that is required to turn one string into the other. The available edits and their corresponding input and output substrings are:
Examples of minimal edit sequences are as follows, with distances as indicated
Operation Input Output Notation Match "a" "a" m(a) Insert "" "a" i(a) Delete "a" "" d(a) Substitute "a" "b" s(b,a) Transpose "ab" "ba" t(ab)
Input
Output
Edit Sequence (w. Transp)
Dist (w. Transp)
it it m(i) m(t) 0
gage gauge m(g) m(a) i(u) m(g) m(e) 1
thhe the m(t) d(h) m(h) m(e) 1
tenser tensor m(t) m(e) m(n) m(s) s(o,e) m(r) 1
hte the d(h) m(t) i(h) m(e) [t(ht) m(e)] 2 [1]
htne then d(h) m(t) s(h,n) m(e) i(n) [t(ht) t(ne)] 3 [2]
pwn4g ownage s(o,p) m(w) m(n) s(a,4) m(g) i(e) 3
Note that, in general, there will be more than one way to edit a
string into another string. For instance, a delete and insert may
replace a transposition, so that "hte" becomes "the" through edits
"d(h) m(t) i(h) m(e)" or "i(t) m(h) d(t) m(e)", as well as many
many more, such as "d(h) d(t) d(e) i(t) i(h) i(e)".
Edit distance without transposition defines a proper metric.
Recall that a distance measure d(x,y)
forms a metric
if for all x, y, z
, we have (1) d(x,y) >=
0
, (2) d(x,y) = d(y,x)
, (3) d(x,x) =
0
, and (4) d(x,y) + d(y,z) >= d(x,z)
. All
of these properties are easy to verify. But with transposition, we
have strings such as AB
, BA
and
ACB
. With transposition, d(AB,BA)=1
,
d(BA,BCA)=1
, but d(AB,BCA)= 3 >= d(AB,BA) +
d(BA,BCA) = 1 + 1 = 2
.
Implementation Note: This class implements edit distance
using dynamic programming in time O(n*m)
where
n
and m
are the length of the sequences
being compared. Using a sliding window of three lattice slices rather
than allocating the entire lattice at once, the space required is
that for three arrays of integers as long as the shorter of the two
character sequences being compared. For details, see section
12.1.1 of:
Modifier and Type  Field and Description 

static Distance<CharSequence> 
NON_TRANSPOSING
Edit distance disallowing transposition.

static Distance<CharSequence> 
TRANSPOSING
Edit distance allowing transposition.

Constructor and Description 

EditDistance(boolean allowTransposition)
Construct an edit distance with or without transposition based
on the specified flag.

Modifier and Type  Method and Description 

double 
distance(CharSequence cSeq1,
CharSequence cSeq2)
Returns the edit distance between the specified character
sequences.

static int 
editDistance(CharSequence cSeq1,
CharSequence cSeq2,
boolean allowTransposition)
Returns the edit distance between the character sequences with
or without transpositions as specified.

double 
proximity(CharSequence cSeq1,
CharSequence cSeq2)
Returns the proximity between the character sequences.

String 
toString()
Returns a string representation of this edit distance.

public static final Distance<CharSequence> TRANSPOSING
public static final Distance<CharSequence> NON_TRANSPOSING
public EditDistance(boolean allowTransposition)
allowTransposition
 Set to true
to allow
transposition edits in the constructed distance.public double distance(CharSequence cSeq1, CharSequence cSeq2)
distance
in interface Distance<CharSequence>
cSeq1
 First character sequence.cSeq2
 Second character sequence.public double proximity(CharSequence cSeq1, CharSequence cSeq2)
and thus proximities will all be negative or zero.proximity(cs1,cs2) = distance(cs1,cs2)
proximity
in interface Proximity<CharSequence>
cSeq1
 First character sequence.cSeq2
 Second character sequence.public String toString()
public static int editDistance(CharSequence cSeq1, CharSequence cSeq2, boolean allowTransposition)
cSeq1
 First character sequence.cSeq2
 Second character sequence.allowTransposition
 Set to true
to allow
transposition edits.