com.aliasi.tokenizer
Class CharacterTokenizerFactory

java.lang.Object
  extended by com.aliasi.tokenizer.CharacterTokenizerFactory
All Implemented Interfaces:
TokenizerFactory, Compilable

public class CharacterTokenizerFactory
extends Object
implements Compilable, TokenizerFactory

A CharacterTokenizerFactory treats each non-whitespace character in the input as a distinct token. This factory is useful for languages such as Chinese, which uses thousands of distinct characters, does not separate words with whitespace, and therefore presents a difficult tokenization problem for standard tokenizers.

Since:
LingPipe 1.0
Version:
2.3.0
Author:
Bob Carpenter

Field Summary
static TokenizerFactory FACTORY
          A constant instance of a character tokenizer factory.
 
Constructor Summary
CharacterTokenizerFactory()
          Construct a character tokenizer factory.
 
Method Summary
 void compileTo(ObjectOutput objOut)
          Compiles this tokenizer factory to the specified object output.
 Tokenizer tokenizer(char[] ch, int start, int length)
          Returns a character tokenizer for the specified character array slice.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

FACTORY

public static final TokenizerFactory FACTORY
A constant instance of a character tokenizer factory. Note that every compiled version, when read back in, is reference identical to this factory.

Constructor Detail

CharacterTokenizerFactory

public CharacterTokenizerFactory()
Construct a character tokenizer factory.

Implementation Note: All character tokenizer factories behave the same way, and they are thread safe, so the constant FACTORY may be used anywhere a freshly constructed character tokenizer factory is used, without loss of performance.

Method Detail

tokenizer

public Tokenizer tokenizer(char[] ch,
                           int start,
                           int length)
Returns a character tokenizer for the specified character array slice.

Specified by:
tokenizer in interface TokenizerFactory
Parameters:
ch - Characters to tokenize.
start - Index of first character to tokenize.
length - Number of characters to tokenize.
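
The slice semantics described above can be illustrated with a small stand-alone sketch. It does not use LingPipe: the class name CharSliceTokens and the method tokens are hypothetical, and the use of Character.isWhitespace to decide what counts as whitespace is an assumption rather than something this page states. It only shows the idea of producing one token per non-whitespace character in ch[start] through ch[start + length - 1].

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone helper for illustration; not part of LingPipe.
class CharSliceTokens {

    // Returns one single-character token for each non-whitespace char in
    // the slice ch[start] .. ch[start + length - 1].
    // Assumption: whitespace is whatever Character.isWhitespace reports.
    static List<String> tokens(char[] ch, int start, int length) {
        if (start < 0 || length < 0 || length > ch.length - start) {
            throw new IndexOutOfBoundsException(
                "start=" + start + " length=" + length);
        }
        List<String> result = new ArrayList<>();
        int end = start + length;
        for (int i = start; i < end; ++i) {
            if (!Character.isWhitespace(ch[i])) {
                result.add(String.valueOf(ch[i]));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        char[] cs = "ab c\td".toCharArray();
        System.out.println(tokens(cs, 0, cs.length)); // prints [a, b, c, d]
        System.out.println(tokens(cs, 1, 3));         // prints [b, c]
    }
}
```

The second call covers only indices 1 through 3 (b, a space, c), so characters outside the slice never become tokens.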

compileTo

public void compileTo(ObjectOutput objOut)
               throws IOException
Compiles this tokenizer factory to the specified object output. The tokenizer factory read back in is reference identical to the static constant FACTORY.

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which this tokenizer factory is compiled.
Throws:
IOException - If there is an I/O error during the write.
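
The guarantee that a factory read back in is reference identical to a shared constant can be sketched with plain Java serialization. The example below is not LingPipe's implementation: the class StatelessFactory and the helper roundTrip are hypothetical, and the use of readResolve is just one standard way to obtain this behavior, assumed here for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stateless factory; every instance behaves the same way.
class StatelessFactory implements Serializable {
    private static final long serialVersionUID = 1L;

    // Shared constant, analogous in spirit to FACTORY on this page.
    static final StatelessFactory FACTORY = new StatelessFactory();

    // Invoked by Java deserialization; swaps the freshly read object
    // for the shared constant.
    private Object readResolve() {
        return FACTORY;
    }

    // Serializes obj to an in-memory buffer and reads it back.
    static Object roundTrip(Object obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Object readBack = roundTrip(new StatelessFactory());
        System.out.println(readBack == FACTORY); // prints true
    }
}
```

Because the object carries no state, collapsing every deserialized copy onto one instance loses nothing and lets callers compare with == after reading.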