Package com.aliasi.tokenizer

Classes for tokenizing character sequences.

Interface Summary
TokenCategorizer A TokenCategorizer supplies a string-based category for string-based tokens.
TokenizerFactory A TokenizerFactory constructs tokenizers from subsequences of character arrays.
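
The two interface roles summarized above can be illustrated with a minimal, self-contained sketch. It does not use the com.aliasi.tokenizer types: the names SketchTokenizerFactory, SketchTokenCategorizer, and TokenizerSketch, their method signatures, and the category labels are all assumptions made for illustration. The factory here treats each non-whitespace character in a slice of a character array as a token, and the categorizer assigns a coarse string category to each token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only. These are NOT the
// com.aliasi.tokenizer interfaces, and their signatures are assumptions.
interface SketchTokenizerFactory {
    // Tokenize the slice cs[start] .. cs[start + length - 1].
    List<String> tokenize(char[] cs, int start, int length);
}

interface SketchTokenCategorizer {
    // Map a string token to a string category.
    String categorize(String token);
}

class TokenizerSketch {

    // Emits one token per non-whitespace character in the slice.
    static final SketchTokenizerFactory CHARACTER_FACTORY = (cs, start, length) -> {
        List<String> tokens = new ArrayList<>();
        for (int i = start; i < start + length; i++) {
            if (!Character.isWhitespace(cs[i])) {
                tokens.add(String.valueOf(cs[i]));
            }
        }
        return tokens;
    };

    // Coarse categories based on character class; the labels are made up.
    static final SketchTokenCategorizer SHAPE_CATEGORIZER = token -> {
        if (token.isEmpty()) return "EMPTY";
        if (token.chars().allMatch(Character::isDigit)) return "DIGITS";
        if (token.chars().allMatch(Character::isLetter)) return "LETTERS";
        return "OTHER";
    };

    public static void main(String[] args) {
        char[] cs = "ab 12!".toCharArray();
        for (String token : CHARACTER_FACTORY.tokenize(cs, 0, cs.length)) {
            System.out.println(token + " -> " + SHAPE_CATEGORIZER.categorize(token));
        }
    }
}
```

Passing a start offset and a length rather than a whole string mirrors the "subsequences of character arrays" wording in the summary; the actual method signatures should be taken from the interface documentation itself.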
 

Class Summary
CharacterTokenCategorizer Returns a category for tokens made up out of a single character.
CharacterTokenizerFactory A CharacterTokenizerFactory considers each non-whitespace character in the input to be a distinct token.
EnglishStopListFilterTokenizer An EnglishStopListFilterTokenizer filters its input by removing words on the English stop list.
FilterTokenizer A FilterTokenizer contains a tokenizer to which it delegates the tokenizer methods.
IndoEuropeanTokenCategorizer An IndoEuropeanTokenCategorizer is a generic token categorizer for Indo-European languages that is based on character "shape".
IndoEuropeanTokenizerFactory An IndoEuropeanTokenizerFactory creates tokenizers for subsequences of character arrays.
LengthStopFilterTokenizer A LengthStopFilterTokenizer removes tokens that exceed a specified length.
LineTokenizerFactory A LineTokenizerFactory treats each line of an input as a token.
LowerCaseFilterTokenizer A LowerCaseFilterTokenizer renders all of its tokens in lower case as defined by String.toLowerCase().
NGramTokenizerFactory An NGramTokenizerFactory creates n-gram tokenizers of a specified minimum and maximum length.
NormalizeWhiteSpaceFilterTokenizer A NormalizeWhiteSpaceFilterTokenizer reduces each non-empty run of whitespace to a single space.
PorterStemmer The PorterStemmer class is Martin Porter's Java implementation of his English stemmer.
PorterStemmerFilterTokenizer A PorterStemmerFilterTokenizer returns the stemmed version of each token, as produced by PorterStemmer.stem(String).
PunctuationStopListTokenizer A PunctuationStopListTokenizer removes tokens consisting entirely of punctuation.
RegExTokenizerFactory A RegExTokenizerFactory creates a tokenizer factory out of a regular expression.
SoundexFilterTokenizer The SoundexFilterTokenizer replaces each token with its Soundex encoding.
StopFilterTokenizer A StopFilterTokenizer removes tokens from the token stream if they meet conditions specified by concrete subclasses.
StopListFilterTokenizer A StopListFilterTokenizer is a stop-list-based stop filter tokenizer that removes tokens from a tokenizer stream if they are on a specified list of so-called "stop" tokens.
TokenFeatureExtractor A TokenFeatureExtractor produces feature vectors representing token counts from character sequences.
TokenFilterTokenizer A TokenFilterTokenizer allows a sequence of tokens to be filtered a token at a time.
Tokenizer Abstract base class for tokenizers.
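
Several of the filter tokenizers listed above (lower-casing, stop-list removal, length-based removal) are naturally composed into a chain. The following is a minimal sketch of that idea over plain lists of strings. It does not use the com.aliasi.tokenizer classes: the class name FilterChainSketch, its method names, and the order in which the filters are applied are assumptions chosen for illustration, not the library's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of chained token filters; not the library's classes.
class FilterChainSketch {

    // Lower-cases every token. Locale.ROOT is used here so the sketch behaves
    // the same on every machine; String.toLowerCase() uses the default locale.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Drops every token that appears on the stop list.
    static List<String> removeStopTokens(List<String> tokens, Set<String> stopList) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!stopList.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    // Drops every token longer than maxLength characters.
    static List<String> removeLongTokens(List<String> tokens, int maxLength) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= maxLength) {
                out.add(token);
            }
        }
        return out;
    }

    // Applies the filters in a fixed order: lower-case, stop list, then length.
    static List<String> filter(List<String> tokens, Set<String> stopList, int maxLength) {
        return removeLongTokens(removeStopTokens(lowerCase(tokens), stopList), maxLength);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "Quick", "fox", "of", "Llanfairpwllgwyngyll");
        Set<String> stopList = new HashSet<>(Arrays.asList("the", "of"));
        System.out.println(filter(tokens, stopList, 10)); // prints [quick, fox]
    }
}
```

Order matters in such a chain: lower-casing before the stop-list check lets a lower-case stop list match "The", whereas the reverse order would let it through.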
 
