com.aliasi.tokenizer
Class LineTokenizerFactory

java.lang.Object
  extended by com.aliasi.tokenizer.RegExTokenizerFactory
      extended by com.aliasi.tokenizer.LineTokenizerFactory
All Implemented Interfaces:
TokenizerFactory, Compilable, Serializable

public class LineTokenizerFactory
extends RegExTokenizerFactory

A LineTokenizerFactory treats each line of its input as a token. The whitespace separating the tokens consists only of line terminators. This is useful for decoders that work at the line level.

Line terminators are as defined in java.util.regex.Pattern, and include the Windows, Unix, and Macintosh conventions, as well as some Unicode extensions.
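As a minimal sketch of what this means in practice, the following uses only java.util.regex (not LingPipe) to check which single characters the period matches under the default flags. The class name LineTerminatorSketch and the method dotMatches are hypothetical names introduced for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of LingPipe: probes java.util.regex directly.
class LineTerminatorSketch {

    // True if the period, compiled with default flags, matches the single character c.
    static boolean dotMatches(char c) {
        return Pattern.matches(".", String.valueOf(c));
    }

    public static void main(String[] args) {
        // Code points that Pattern documents as line terminators.
        char[] terminators = { '\n', '\r', (char) 0x0085, (char) 0x2028, (char) 0x2029 };
        for (char c : terminators) {
            System.out.printf("U+%04X matched by '.': %b%n", (int) c, dotMatches(c));
        }
        // Ordinary characters, including spaces and tabs, are matched.
        System.out.println("tab matched by '.': " + dotMatches('\t'));
    }
}
```

Under these assumptions, each listed terminator is rejected by the period, while a tab or space is accepted, which is why such characters can appear inside a token.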

Whitespaces will be either empty strings or strings representing one or more newlines.

Tokens may consist entirely of whitespace characters if whitespace is the only thing on a line, but tokens will never contain sequences representing newlines. Tokens will always consist of at least one character.

Examples

Input String       Tokens                 Whitespaces
""                 {}                     { "" }
"abc"              { "abc" }              { "", "" }
"abc\ndef"         { "abc", "def" }       { "", "\n", "" }
"abc\r\ndef"       { "abc", "def" }       { "", "\r\n", "" }
" abc\n def \n"    { " abc", " def " }    { "", "\n", "\n" }
" \n"              { " " }                { "", "\n" }

Compilation

A line tokenizer factory may be compiled. Upon deserialization, the resulting object will be an instance of RegExTokenizerFactory. The deserialized class may change in future versions, so it is safest to cast the result to the interface TokenizerFactory.

Implementation Note

This tokenizer factory is nothing more than a convenience wrapper around a very simple RegExTokenizerFactory, with the simplest possible regular expression:

      RegExTokenizerFactory(".+")

Because the regular expression tokenizer factory takes the default regular expression flags (see Pattern), the period (.) matches any character except a line terminator.
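The effect of this expression can be illustrated with java.util.regex alone. The sketch below is not LingPipe's implementation; it emulates the documented behavior by treating each match of ".+" as a token and the text before, between, and after matches as the whitespaces. The class name LineSplitSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of LingPipe: emulates line tokenization with a plain regex.
class LineSplitSketch {

    // The expression named in the implementation note, compiled with default flags.
    private static final Pattern LINE = Pattern.compile(".+");

    // Tokens: each maximal run of characters that are not line terminators.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = LINE.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    // Whitespaces: text before the first token, between tokens, and after the last token.
    static List<String> whitespaces(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = LINE.matcher(input);
        int previousEnd = 0;
        while (m.find()) {
            result.add(input.substring(previousEnd, m.start()));
            previousEnd = m.end();
        }
        result.add(input.substring(previousEnd));
        return result;
    }

    public static void main(String[] args) {
        String input = " abc\n def \n";
        System.out.println("tokens: " + tokens(input).size());
        System.out.println("whitespaces: " + whitespaces(input).size());
    }
}
```

With this emulation there is always one more whitespace than there are tokens, matching the rows of the Examples table, and a token is never empty because ".+" requires at least one character.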

Since:
LingPipe 3.2
Version:
3.2
Author:
Bob Carpenter
See Also:
Serialized Form

Constructor Summary
LineTokenizerFactory()
          Construct a line-based tokenizer.
 
Method Summary
 
Methods inherited from class com.aliasi.tokenizer.RegExTokenizerFactory
compileTo, pattern, tokenizer
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LineTokenizerFactory

public LineTokenizerFactory()
Construct a line-based tokenizer. See the class documentation above for a description of behavior.