com.aliasi.tokenizer
Class NGramTokenizerFactory

java.lang.Object
  extended by com.aliasi.tokenizer.NGramTokenizerFactory
All Implemented Interfaces:
TokenizerFactory, Compilable, Serializable

public class NGramTokenizerFactory
extends Object
implements TokenizerFactory, Serializable, Compilable

An NGramTokenizerFactory creates n-gram tokenizers of a specified minimum and maximum length.

An NGramTokenizer is a tokenizer that returns every character n-gram of a specified sequence whose length lies between the minimum and maximum, inclusive. Whitespace handling is inherited from the default Tokenizer.nextWhitespace(), which returns a string consisting of a single space character.

For example, the result of

new NGramTokenizer("abcd".toCharArray(),0,4,2,3).tokenize()
is the string array:
{ "ab", "bc", "cd", "abc", "bcd" }
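The ordering in this example (all n-grams of the shortest length first, left to right, then the next length) can be reproduced with a short standalone sketch. This is not LingPipe's implementation; the class NGramSketch and its ngrams method are hypothetical names used only for illustration, and skipping lengths below one is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone illustration only; this is NOT the LingPipe NGramTokenizer.
class NGramSketch {

    // Collects the character n-grams of cs[start .. start+length-1],
    // shortest length first and left to right within each length.
    static List<String> ngrams(char[] cs, int start, int length,
                               int minNGram, int maxNGram) {
        List<String> out = new ArrayList<>();
        int end = start + length;
        // Assumption of this sketch: lengths below one are skipped.
        for (int n = Math.max(1, minNGram); n <= maxNGram; n++) {
            for (int i = start; i + n <= end; i++) {
                out.add(new String(cs, i, n));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Same arguments as the example above: slice [0,4), lengths 2 to 3.
        System.out.println(ngrams("abcd".toCharArray(), 0, 4, 2, 3));
        // prints [ab, bc, cd, abc, bcd]
    }
}
```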

Serialization and Compilation

N-gram tokenizer factories are serializable and compilable. Both operations write the n-gram bounds to the output stream; reading the stream back produces an instance of this class with those bounds.
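The idea of writing only the two bounds and reading back an equivalent object can be sketched with standard Java serialization. The Bounds class below is a hypothetical stand-in for the factory, and the sketch does not reproduce LingPipe's compileTo or its Compilable mechanism; it only shows a round trip through an in-memory stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Standalone illustration; Bounds is a hypothetical stand-in, not a LingPipe class.
class BoundsRoundTrip {

    // Holds only the two n-gram bounds, which is all the state that is written.
    static final class Bounds implements Serializable {
        private static final long serialVersionUID = 1L;
        final int minNGram;
        final int maxNGram;

        Bounds(int minNGram, int maxNGram) {
            this.minNGram = minNGram;
            this.maxNGram = maxNGram;
        }
    }

    // Serializes the bounds to a byte array and reads a fresh copy back.
    static Bounds roundTrip(Bounds in) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(in);
            }
            try (ObjectInputStream objIn = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Bounds) objIn.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Bounds copy = roundTrip(new Bounds(2, 3));
        System.out.println(copy.minNGram + ".." + copy.maxNGram);
    }
}
```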

Since:
LingPipe 1.0
Version:
3.1.3
Author:
Bob Carpenter
See Also:
Serialized Form

Constructor Summary
NGramTokenizerFactory(int minNGram, int maxNGram)
          Create an n-gram tokenizer factory with the specified minimum and maximum n-gram lengths.
 
Method Summary
 void compileTo(ObjectOutput objOut)
          Compiles this n-gram tokenizer factory to the specified object output stream.
 Tokenizer tokenizer(char[] cs, int start, int length)
          Returns an n-gram tokenizer for the specified characters with the minimum and maximum n-gram lengths as specified in the constructor.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

NGramTokenizerFactory

public NGramTokenizerFactory(int minNGram,
                             int maxNGram)
Create an n-gram tokenizer factory with the specified minimum and maximum n-gram lengths.

Parameters:
minNGram - Minimum n-gram length.
maxNGram - Maximum n-gram length.
Throws:
IllegalArgumentException - If the minimum is greater than the maximum or if the maximum is less than one.
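
The documented argument check can be written out as a small standalone sketch. The class BoundsCheckSketch and the method checkBounds are hypothetical names; the condition mirrors the Throws clause above, but the exception messages and the structure are assumptions, not LingPipe's code.

```java
// Standalone illustration of the documented constructor precondition.
class BoundsCheckSketch {

    // Rejects bounds where the minimum exceeds the maximum
    // or the maximum is less than one, as the Throws clause describes.
    static void checkBounds(int minNGram, int maxNGram) {
        if (maxNGram < 1) {
            throw new IllegalArgumentException(
                "Maximum n-gram length must be at least 1, found " + maxNGram);
        }
        if (minNGram > maxNGram) {
            throw new IllegalArgumentException(
                "Minimum " + minNGram + " exceeds maximum " + maxNGram);
        }
    }

    public static void main(String[] args) {
        checkBounds(2, 3); // accepted
        try {
            checkBounds(3, 2); // minimum greater than maximum
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```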
Method Detail

compileTo

public void compileTo(ObjectOutput objOut)
               throws IOException
Compiles this n-gram tokenizer factory to the specified object output stream.

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Output stream to which to write the tokenizer factory.
Throws:
IOException - If an I/O error occurs while writing the n-gram bounds.

tokenizer

public Tokenizer tokenizer(char[] cs,
                           int start,
                           int length)
Returns an n-gram tokenizer for the specified characters with the minimum and maximum n-gram lengths as specified in the constructor.

Specified by:
tokenizer in interface TokenizerFactory
Parameters:
cs - Underlying character array.
start - Index of first character in array to tokenize.
length - Number of characters to tokenize.