public class SoundexTokenizerFactory extends ModifyTokenTokenizerFactory implements Serializable
SoundexTokenizerFactory modifies the output of a base tokenizer factory to produce tokens in Soundex representation. Soundex replaces sequences of characters with a crude four-character approximation of their pronunciation: the initial letter followed by three digits.
The process for converting an input to its Soundex representation is fairly straightforward for inputs that are all ASCII letters. Soundex is case insensitive, but is only defined for strings of ASCII letters. Thus to begin, all characters that are not Latin1 letters are removed, and all Latin1 characters are stripped of their diacritics. The algorithm then proceeds according to its standard definition.
The table of individual character encodings is as follows:
Characters                Code
B, F, P, V                1
C, G, J, K, Q, S, X, Z    2
D, T                      3
L                         4
M, N                      5
R                         6
Here are some examples of translations from the unit tests, drawn from the sources cited below.
Tokens                    Soundex Encoding    Notes
Gutierrez                 G362
Pfister                   P236
Jackson                   J250
Tymczak                   T522
Ashcraft                  A261
Robert, Rupert            R163
Euler, Ellery             E460
Gauss, Ghosh              G200
Hilbert, Heilbronn        H416
Knuth, Kant               K530
Lloyd, Liddy              L300
Lukasiewicz, Lissajous    L222
Wachs, Waugh              W200
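The encodings in the examples above can be reproduced with a small standalone sketch. This is not the code of this class: the class name SoundexSketch and the method encode are hypothetical, and the sketch assumes the National Archives rules, in which H and W do not separate letters with the same code while vowels do. Diacritics are handled with java.text.Normalizer, which is a simplification of the Latin1 handling described above.

```java
import java.text.Normalizer;

// Hypothetical standalone sketch; not part of LingPipe.
final class SoundexSketch {

    // Codes for A..Z taken from the table above; '0' marks letters
    // that carry no code (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    static String encode(String token) {
        // Decompose accented characters so the base letter survives,
        // then keep only ASCII letters, upper-cased.
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        StringBuilder letters = new StringBuilder();
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                letters.append(Character.toUpperCase(c));
            }
        }
        // Assumption: input without any letters encodes as 0000.
        if (letters.length() == 0) {
            return "0000";
        }

        StringBuilder out = new StringBuilder(4);
        out.append(letters.charAt(0));
        char last = CODES.charAt(letters.charAt(0) - 'A');
        for (int i = 1; i < letters.length() && out.length() < 4; i++) {
            char c = letters.charAt(i);
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate equal codes
            }
            char code = CODES.charAt(c - 'A');
            if (code != '0' && code != last) {
                out.append(code);
            }
            last = code; // a vowel resets this to '0', allowing a repeat
        }
        while (out.length() < 4) {
            out.append('0'); // pad short encodings with zeros
        }
        return out.toString();
    }
}
```

Under these assumptions, SoundexSketch.encode("Ashcraft") skips the C because only an H separates it from the S, while the repeated code 2 in Tymczak is kept once for C and once for K because a vowel intervenes.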
As a tokenizer filter, the SoundexFilterTokenizer simply replaces each token with its Soundex equivalent. Note that this may produce very many 0000 outputs if it is fed standard text with punctuation, numbers, etc.
Note: In order to produce a deterministic tokenizer filter, names with prefixes are coded with the prefix. Recall that Soundex considers the following set of words to be prefixes, and suggests providing both the Soundex encoding computed with the prefix and the encoding computed without it: Van, Con, De, Di, La, Le.
These are not accorded any special treatment by this implementation.
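Callers who want both encodings for prefixed names have to compute them outside the tokenizer. The following is a minimal sketch of one way to do that, not a feature of this class: PrefixCodesSketch and codes are hypothetical names, the encoder is passed in as a function, and prefix matching is a naive case-insensitive check on the start of the name, so names such as Lee or Dean will be treated as prefixed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical helper; not part of LingPipe.
final class PrefixCodesSketch {

    // The prefixes listed in the note above.
    private static final String[] PREFIXES = {"Van", "Con", "De", "Di", "La", "Le"};

    // Returns the code of the full name first, followed by the code of the
    // name with its leading prefix removed when a prefix matches and the
    // two codes differ.
    static List<String> codes(String name, UnaryOperator<String> encoder) {
        List<String> result = new ArrayList<>();
        result.add(encoder.apply(name));
        for (String prefix : PREFIXES) {
            boolean longer = name.length() > prefix.length();
            if (longer && name.regionMatches(true, 0, prefix, 0, prefix.length())) {
                String code = encoder.apply(name.substring(prefix.length()));
                if (!result.contains(code)) {
                    result.add(code);
                }
                break; // no listed prefix is a prefix of another
            }
        }
        return result;
    }
}
```

Any function from a token to its code can serve as the encoder, for example a lambda that calls the soundexEncoding(String) method of this class.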
Thread Safety
A Soundex tokenizer factory is thread safe if its base tokenizer factory is thread safe.
A SoundexTokenizerFactory is serializable if its base tokenizer factory is serializable.
References and Historical Notes
Soundex was invented and patented by Robert C. Russell in 1918. The original version involved eight categories, including one for vowels, without the initial character being treated specially as to coding. The first vowel was retained in the original Soundex. Furthermore, some positional information was added, such as the deletion of final
The version in this class is the one described by Donald Knuth in The Art of Computer Programming and by the United States National Archives and Records Administration, which has been used for the United States Census.
- Knuth, D. 1973. The Art of Computer Programming Volume 3: Sorting and Searching. Addison-Wesley. 2nd Edition. Pages 394-395.
- Wikipedia. Soundex.
- United States National Archives and Records Administration. Using the Census Soundex. General Information Leaflet 55.
- Robert C. Russell. 1918. United States Patent 1,261,167.
- Robert C. Russell. 1922. United States Patent 1,435,663.
- Bob Carpenter
- See Also:
- Serialized Form
Constructors
SoundexTokenizerFactory(TokenizerFactory factory)
Construct a Soundex-based tokenizer factory that converts tokens produced by the specified base factory into their Soundex representations.
Methods
modifyToken(String token)
Returns the Soundex encoding of the specified token.
soundexEncoding(String token)
Returns the Soundex encoding of the specified token.
Methods inherited from class com.aliasi.tokenizer.ModifyTokenTokenizerFactory
Methods inherited from class com.aliasi.tokenizer.ModifiedTokenizerFactory
SoundexTokenizerFactory
public SoundexTokenizerFactory(TokenizerFactory factory)
Construct a Soundex-based tokenizer factory that converts tokens produced by the specified base factory into their Soundex representations.
factory - Base tokenizer factory.
modifyToken
Returns the Soundex encoding of the specified token.
See the class documentation above for more information on the encoding.