public class IndoEuropeanTokenizerFactory extends Object implements TokenizerFactory, Serializable
An IndoEuropeanTokenizerFactory creates tokenizers with built-in support for alpha-numerics, numbers, and other common constructs in Indo-European languages.
The tokenization rules are roughly based on those used in MUC-6, but are necessarily finer grained, because the MUC tokenizers were based on lexical and semantic information such as whether a string was an abbreviation.
A token is any sequence of characters satisfying one of the following patterns.
|Pattern||Description|
|AlphaNumeric||Any sequence of upper or lowercase letters or digits, as defined by Character.isDigit(char) and Character.isLetter(char), and including the Devanagari characters (unicode …)|
|Numerical||Any sequence of numbers, commas, and periods.|
|Hyphen Sequence||Any number of hyphens (-)|
|Equals Sequence||Any number of equals signs (=)|
|Double Quotes||Double forward quotes (``) or double backward quotes ('')|
Whitespaces are defined as any sequence of whitespace characters, including the unicode non-breakable space (unicode 160). The tokenizer operates in a longest-leftmost fashion, returning the longest possible token starting at the current position in the underlying character array.
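The longest-leftmost rule can be illustrated with a small standalone sketch. This is not the library's implementation: the class name LongestLeftmostSketch and its helper methods are hypothetical, and it makes several assumptions that the description above does not state, namely that a numerical token starts with a digit, that a comma or period is absorbed only when a digit follows it, and that any other non-whitespace character becomes a one-character token. Devanagari handling is limited to whatever Character.isLetter(char) accepts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of longest-leftmost matching; not LingPipe code.
final class LongestLeftmostSketch {

    // Whitespace, plus the non-breakable space (unicode 160), which
    // Character.isWhitespace does not treat as whitespace.
    static boolean isSpace(char c) {
        return Character.isWhitespace(c) || c == '\u00A0';
    }

    // Length of the run of letters and digits starting at 'start'.
    static int alphaNumericLength(String s, int start) {
        int i = start;
        while (i < s.length()
                && (Character.isLetter(s.charAt(i)) || Character.isDigit(s.charAt(i)))) {
            i++;
        }
        return i - start;
    }

    // Assumption: starts with a digit; ',' or '.' only counts if a digit follows.
    static int numericalLength(String s, int start) {
        if (start >= s.length() || !Character.isDigit(s.charAt(start))) {
            return 0;
        }
        int i = start + 1;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isDigit(c)) {
                i++;
            } else if ((c == ',' || c == '.')
                    && i + 1 < s.length() && Character.isDigit(s.charAt(i + 1))) {
                i += 2;
            } else {
                break;
            }
        }
        return i - start;
    }

    // Length of the run of a single repeated character (hyphens or equals signs).
    static int runLength(String s, int start, char target) {
        int i = start;
        while (i < s.length() && s.charAt(i) == target) {
            i++;
        }
        return i - start;
    }

    // Two backticks or two apostrophes form one double-quote token.
    static int doubleQuoteLength(String s, int start) {
        return (s.startsWith("``", start) || s.startsWith("''", start)) ? 2 : 0;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            if (isSpace(text.charAt(pos))) {
                pos++;
                continue;
            }
            // Assumption: fall back to a single-character token.
            int len = 1;
            len = Math.max(len, alphaNumericLength(text, pos));
            len = Math.max(len, numericalLength(text, pos));
            len = Math.max(len, runLength(text, pos, '-'));
            len = Math.max(len, runLength(text, pos, '='));
            len = Math.max(len, doubleQuoteLength(text, pos));
            tokens.add(text.substring(pos, pos + len));
            pos += len;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Costs 1,000.50 -- see ``x=y''")) {
            System.out.println(token);
        }
    }
}
```

Every pattern is tried at the current position and the longest match wins, so "1,000.50" comes out as one numerical token even though the alphanumeric pattern also matches its first character.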
The singleton instance is available as INSTANCE. There is no public constructor provided.
The serialized versions of this class deserialize to the
same singleton as produced by INSTANCE.
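One common way to get this behavior in Java is a readResolve method that swaps the freshly deserialized object for the existing singleton. The sketch below uses a hypothetical class, SingletonFactorySketch, to show the pattern; whether this class uses readResolve or some other mechanism (such as a serialization proxy) is not stated here, so treat this as an illustration of the guarantee rather than of the actual implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a serializable singleton factory.
final class SingletonFactorySketch implements Serializable {
    private static final long serialVersionUID = 1L;

    static final SingletonFactorySketch INSTANCE = new SingletonFactorySketch();

    private SingletonFactorySketch() {
        // No public constructor: callers go through INSTANCE.
    }

    // Called by Java serialization after reading; returning INSTANCE
    // discards the newly created copy and preserves the singleton.
    private Object readResolve() {
        return INSTANCE;
    }

    // Serializes an object to bytes in memory and reads it back.
    static Object roundTrip(Object original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Object copy = roundTrip(INSTANCE);
        System.out.println(copy == INSTANCE);
    }
}
```

With readResolve in place, reference equality (==) against INSTANCE holds after deserialization, which is what allows code to compare factories by identity.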
|Modifier and Type||Field and Description|
The singleton instance of an Indo-European tokenizer factory.
|Constructor and Description|
Construct a tokenizer for Indo-European languages.
|Modifier and Type||Method and Description|
Returns an Indo-European tokenizer for the specified subsequence of characters.
Returns the name of this class.
public static final IndoEuropeanTokenizerFactory INSTANCE
Implementation Note: All Indo-European tokenizer
factories behave the same way, and they are thread safe, so the
INSTANCE may be used anywhere a freshly
constructed character tokenizer factory is used, without loss
public Tokenizer tokenizer(char[] ch, int start, int length)