public class JaroWinklerDistance extends Object implements Distance<CharSequence>, Proximity<CharSequence>
JaroWinklerDistance
class implements the original
Jaro string comparison as well as Winkler's modifications. As a
distance measure, Jaro-Winkler returns values between
0
(exact string match) and 1
(no matching
characters). Note that this is reversed from the original
definitions of Jaro and Winkler in order to produce a distance-like
ordering. The original Jaro-Winkler string comparator returned
1
for a perfect match and 0
for
complete mismatch; our method returns one minus the Jaro-Winkler
measure.
The Jaro-Winkler distance measure was developed for name comparison in the U.S. Census. It is designed to compae surnames to surnames and given names to given names, not whole names to whole names. There is no character-specific information in this implementation, but assumptions are made about typical lengths and the significance of initial matches that may not apply to all languages.
The easiest way to understand the Jaro measure and the Winkler variants is procedurally. The Jaro measure involves two steps, first to compute the number of "matches" and second to compute the number of "transpositions". The Winkler adjustment involves a final rescoring based on an exact match score for the initial characters of both strings.
Suppose we are comparing character sequences cs1
and cs2
. The Jaro-Winkler distance is defined by
the following steps. After the definitions, we consider some
examples.
Step 1: Matches: The match phase is a greedy alignment step of characters in one string against the characters in another string. The maximum distance (measured by array index) at which characters may be matched is defined by:
matchRange = max(cs1.length(), cs2.length()) / 2 - 1
The match phase is a greedy alignment that proceeds character by character through the first string, though the distance metric is symmetric (that, is reversing the order of arguments does not affect the result). For each character encountered in the first string, it is matched to the first unaligned character in the second string that is an exact character match. If there is no such character within the match range window, the character is left unaligned.
Step 2: Transpositions: After matching, the subsequence of characters actually matched in both strings is extracted. These subsequences will be the same length. The number of characters in one string that do not line up (by index in the matched subsequence) with identical characters in the other string is the number of "half transpositions". The total number of transpoisitons is the number of half transpositions divided by two, rounding down.
The Jaro distance is then defined in terms of the number
of matching characters matches
and the number
of transpositions, transposes
:
jaroProximity(cs1,cs2) = ( matches(cs1,cs2) / cs1.length() + matches(cs1,cs2) / cs2.length() + (matches(cs1,cs2) - transposes(cs1,cs2)) / matches(cs1,cs2) ) / 3 jaroDistance(cs1,cs2) = 1 - jaroProximity(cs1,cs2)
In words, the measure is the average of three values; (a) the percentage of the first string matched, (b) the percentage of the second string matched, and (c) the percentage of matches that were not transposed.
Step 3: Winkler Modification The Winkler modification to
the Jaro comparison, resulting in the Jaro-Winkler comparison,
boosts scores for strings that match character for character
initially. Let boostThreshold
be the minimum score
for a string that gets boosted. This value was set to
0.7
in Winkler's papers (see references below). If
the Jaro score is below the boost threshold, the Jaro score is
returned unadjusted. The second parameter for the Winkler
modification is the size of the initial prefix considered,
prefixSize
. The prefix size was set to 4
in Winkler's papers. Next, let
prefixMatch(cs1,cs2,prefixSize)
be the number of
characters in the prefix of cs1
and cs2
that exactly match (by original index), up to a maximum of
prefixSize
. The modified distance is then defined to
be:
jaroWinklerProximity(cs1,cs2,boostThreshold,prefixSize) = jaroMeasure(cs1,cs2) <= boostThreshold ? jaroMeasure(cs1,cs2) : jaroMeasure(cs1,cs2) + 0.1 * prefixMatch(cs1,cs2,prefixSize) * (1.0 - jaroDistance(cs1,cs2)) jaroWinklerDistance(cs1,cs2,boostThreshold,prefixSize) = 1 - jaroWinklerProximity(cs1,cs2,boostThreshold,prefixSize)
Examples: We will present the alignment steps in the form
of tables, with offsets in the second string below the first string
positions that match. For a simple example, consider comparing the
given (nick)name AL
to itself. Both strings are of
length 2. Thus the maximum match distance is max(2,2)/2 - 1
= 0
, meaning all matches must be exact. The matches are
illustrated in the following table:
cs1 | A | L |
matches | 0 | 1 |
cs2 | A | L |
The notation in the matches row is meant to indicate that the
A
at index 0
in cs1
is
matched to the A
at index 0
in
cs2
. Similarlty for the L
at index 1 in
cs1
, which matches the L
at index 1 in
cs2
. This results in matches(AL,AL) = 2
.
There are no transpositions, so the Jaro distance is just:
jaroProximity(AL,AL) = 1/3*(2/2 + 2/2 + (2-0)/2) = 1.0
Applying the Winkler modification yields the same result:
jaroWinklerProximity(AL,AL) = 1 + 0.1 * 2 * (1.0 - 1) = 1.0
Next consider a more complex case, matching MARTHA
and
MARHTA
. Here the match distance is max(6,6)/2 -
1 = 6/2 - 1 = 2
, allowing matching characters to be up to one
character away. This yields the following alignment.
cs1 |
M | A | R | T | H | A |
matches | 0 | 1 | 2 | 4 | 3 | 5 |
cs2 |
M | A | R | H | T | A |
Note that the T
at index 3 in the first string
aligns with the T
at index 4 in the second string,
whereas the H
at index 4 in the first string alings
with the H
at index 3 in the second string. The
strings that do not directly align are rendered in bold. This is
an instance of a transposition. The number of half transpositions
is determined by comparing the subsequences of the first and second
string matched, namely MARTHA
and MARHTA
.
There are two positions with mismatched characters, 3 and 4. This
results in two half transpositions, or a single transposition, for
a Jaro distance of:
jaroProximity(MARTHA,MARHTA) = 1/3 * (6/6 + 6/6 + (6 - 1)/6) = 0.944Three initial characters match,
MAR
, for a Jaro-Winkler
distance of:
jaroWinklerProximity(MARTHA,MARHTA) = 0.944 + 0.1 * 3 * (1.0 - 0.944) = 0.961
Next, consider matching strings of different lengths, such as
JONES
and JOHNSON
:
cs1 |
J | O | N | E | S | ||
matches | 0 | 1 | 3 | - | 5 | ||
cs2 |
J | O | H | N | S | O | N |
The unmatched characters are rendered in italics. Here the
subsequence of matched characters for the two strings are JONS
and
JONS
, so there are no transpositions. Thus the Jaro
distance is:
jaroProximity(JONES,JOHNSON) = 1/3 * (4/5 + 4/7 + (4 - 0)/4) = 0.790
The strings JONES
and JOHNSON
only
match on their first two characters, JO
, so the
Jaro-Winkler distance is:
jaroWinklerProximity(JONES,JOHNSON) = .790 + 0.1 * 2 * (1.0 - .790) = 0.832
We will now consider some artificial examples not drawn from
(Winkler 2006). First, compare ABCVWXYZ
and
CABVWXYZ
, which are of length 8, allowing alignments
up to 8/4 - 1 = 3
positions away. This leads
to the following alignment:
cs1 |
A | B | C | V | W | X | Y | Z |
matches | 1 | 2 | 0 | 3 | 4 | 5 | 6 | 7 |
cs2 |
C | A | B | V | W | X | Y | Z |
Here, there are 8/8 matches in both strings. There are only
three half-transpositions, in the first three characters,
because no position of CAB
has an identical character
to ABC
. This yields a total of one transposition,
for a Jaro score of:
jaroProximity(ABCVWXYZ,CABVWXYZ) = 1/3 * (8/8 + 8/8 + (8-1)/8) = .958
There is no initial prefix match, so the Jaro-Winkler comparison
produces the same result. Now consider matching
ABCVWXYZ
to CBAWXYZ
. Here, the initial
alignment is 2, 1, 0
, which yields only two half
transpositions. Thus under the Jaro distance, ABC
is
closer to CBA
than to CAB
, though due to
integer rounding in computing the number of transpositions, this
will only affect the final result if there is a further
transposition in the strings.
Now consider the 10-character string ABCDUVWXYZ
.
This allows matches up to 10/2 - 1 = 4
positions away.
If matched against DABCUVWXYZ
, the result is 10
matches, and 4 half transposes, or 2 transposes. Now consider
matching ABCDUVWXYZ
against DBCAUVWXYZ
.
Here, index 0 in the first string (A
) maps to
index 3 in the second string, and index 3 in the first string
(D
) maps to index 0 in the second string, but
positions 1 and 2 (B
and C
) map to
themselves. Thus when comparing the output, there are only two
half transpositions, thus making the second example
DBCAUVWXYZ
closer than DABCUVWXYZ
to the
first string ABCDUVWXYZ
.
Note that the transposition count cannot be determined solely by
the mapping. For instance, the string ABBBUVWXYZ
matches BBBAUVWXYZ
with alignment 4, 0, 1, 2, 5, 6, 7,
8, 9, 1
. But there are only two half-transpositions, because
only index 0 and index 3 mismatch in the subsequences of matching
characters. Contrast this with ABCDUVWXYZ
matching
DABCUVWXYZ
, which has the same alignment, but four
half transpositions.
The greedy nature of the alignment phase in the Jaro-Winkler
algorithm actually prevents the optimal alignments from being found
in some cases. Consider the alignment of ABCAWXYZ
with BCAWXYZ
:
cs1 |
A | B | C | A | W | X | Y | Z |
matches | 2 | 0 | 1 | - | 3 | 4 | 5 | 6 |
cs2 |
B | C | A | W | X | Y | Z |
Here the first pair of A
characters are matched,
leading to three half transposes (the first three matched
characters). A better scoring, though illegal, alignment would be
the following, because it has the same number of matches, but no
transposes:
cs1 |
A | B | C | A | W | X | Y | Z |
matches | - | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
cs2 |
B | C | A | W | X | Y | Z |
The illegal links are highlighted in yellow. Note that neither alignment matches in the initial character, so the Winkler adjustments do not apply.
This class's implementation is a literal translation of the C algorithm used in William E. Winkler's papers and for the 1995 U.S. Census Deduplication. The algorithm is the work of multiple authors and available from the folloiwng link:
Unlike the C version, the distance(CharSequence,CharSequence)
and proximity(CharSequence,CharSequence)
methods do not require its
inputs to be padded with spaces. In addition, spaces are treated
just like any other characters within the algorithm itself. There
is also no case normalization in this class's version.
Furthermore, the boundary conditions are changed so that two empty
strings return a score of 1.0
rather than zero, as in
the original algorithm.
Jaro's origial implementation is described in:
Winkler's modified algorithm, along with applications in record linkage, are described in the following highly readable survey article:
strcmp95.c
or the results of this class, which matches
strcmp95.c
). The description of the matching
procedure above is based on the actual strcmp95
code,
the boundary conditions of which are not obvious from the text
descriptions in the literature. An additional difference is that
strcmp95
, but not the algorithms in Winkler's papers
nor the algorithm in this class, provides the possibility of
partial matches with similar-sounding characters
(e.g. c
and k
).
We'd like to thank Bill Winkler for helping us understand the
versions of the algorithm and providing the strcmp95.c
code as a reference implementation.
Modifier and Type | Field and Description |
---|---|
static JaroWinklerDistance |
JARO_DISTANCE
A constant for the Jaro distance.
|
static JaroWinklerDistance |
JARO_WINKLER_DISTANCE
A constant for the Jaro-Winkler distance with defaults set as
in Winkler's papers.
|
Constructor and Description |
---|
JaroWinklerDistance()
Construct a basic Jaro string distance without the Winkler
modifications.
|
JaroWinklerDistance(double weightThreshold,
int numChars)
Construct a Winkler-modified Jaro string distance with the
specified weight threshold for refinement and an initial number
of characters over which to reweight.
|
Modifier and Type | Method and Description |
---|---|
double |
distance(CharSequence cSeq1,
CharSequence cSeq2)
Returns the Jaro-Winkler distance between the specified character
sequences.
|
double |
proximity(CharSequence cSeq1,
CharSequence cSeq2)
Return the Jaro-Winkler comparison value between the specified
character sequences.
|
public static final JaroWinklerDistance JARO_DISTANCE
JaroWinklerDistance()
.
Instances are thread safe, so this single distance instance may be used for all comparisons within an application.
public static final JaroWinklerDistance JARO_WINKLER_DISTANCE
JaroWinklerDistance(0.7,4)
.
Instances are thread safe, so this single distance instance may be used for all comparisons within an application.
public JaroWinklerDistance()
public JaroWinklerDistance(double weightThreshold, int numChars)
public double distance(CharSequence cSeq1, CharSequence cSeq2)
0
(perfect match) to 1
(no overlap).
See the class definition above for formal definitions.
This method is defined to be:
distance(cSeq1,cSeq2) = 1 - proximity(cSeq1,cSeq2)
distance
in interface Distance<CharSequence>
cSeq1
- First character sequence to compare.cSeq2
- Second character sequence to compare.public double proximity(CharSequence cSeq1, CharSequence cSeq2)
0
(no match) to 1
(perfect match)inclusive. See the class definition above for
an exact definition of Jaro-Winkler string comparison.
The method distance(CharSequence,CharSequence)
returns
a distance measure that is one minus the comparison value.
proximity
in interface Proximity<CharSequence>
cSeq1
- First character sequence to compare.cSeq2
- Second character sequence to compare.