com.aliasi.sentences
Class SentenceChunker

java.lang.Object
  extended by com.aliasi.sentences.SentenceChunker
All Implemented Interfaces:
Chunker

public class SentenceChunker
extends Object
implements Chunker

The SentenceChunker class uses a SentenceModel to implement sentence detection through the chunk.Chunker interface. A sentence chunker is constructed from a tokenizer factory and a sentence model. The tokenizer factory creates the tokenizers whose tokens are passed to the sentence model. Every chunk produced has the type given by the constant SENTENCE_CHUNK_TYPE.
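The flow described above (tokenize, ask a sentence model for boundaries, map the boundaries back to character offsets) can be illustrated with a self-contained sketch. This is a toy model, not LingPipe code: the class ToySentenceChunkerSketch, its Span type, the tokenize and boundaryIndices helpers, and the rule that a sentence ends at a ".", "!" or "?" token are all hypothetical stand-ins chosen for illustration. It also assumes that the whitespace list holds one more entry than the token list, and it simply drops trailing tokens that are not followed by a boundary.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only; none of these names belong to LingPipe.
final class ToySentenceChunkerSketch {

    static final String SENTENCE_CHUNK_TYPE = "S";

    /** Hypothetical stand-in for a chunk: a half-open span [start, end) plus a type. */
    static final class Span {
        final int start;
        final int end;
        final String type;
        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    static boolean isStop(char c) {
        return c == '.' || c == '!' || c == '?';
    }

    // Toy tokenizer: fills both lists so that whites.size() == tokens.size() + 1.
    static void tokenize(String text, List<String> tokens, List<String> whites) {
        int i = 0;
        int n = text.length();
        while (true) {
            int wsStart = i;
            while (i < n && Character.isWhitespace(text.charAt(i))) i++;
            whites.add(text.substring(wsStart, i));
            if (i == n) return;
            int tokStart = i;
            if (isStop(text.charAt(i))) {
                i++; // sentence-final punctuation is a token of its own
            } else {
                while (i < n && !Character.isWhitespace(text.charAt(i))
                        && !isStop(text.charAt(i))) i++;
            }
            tokens.add(text.substring(tokStart, i));
        }
    }

    // Toy sentence model: indices of the tokens that end a sentence.
    static List<Integer> boundaryIndices(List<String> tokens) {
        List<Integer> out = new ArrayList<Integer>();
        for (int k = 0; k < tokens.size(); k++) {
            String t = tokens.get(k);
            if (t.equals(".") || t.equals("!") || t.equals("?")) out.add(k);
        }
        return out;
    }

    // Toy chunker: converts token boundaries into character spans, in text order.
    static List<Span> chunk(String text) {
        List<String> tokens = new ArrayList<String>();
        List<String> whites = new ArrayList<String>();
        tokenize(text, tokens, whites);
        List<Integer> boundaries = boundaryIndices(tokens);
        List<Span> spans = new ArrayList<Span>();
        int pos = 0;          // offset into the concatenated whitespaces and tokens
        int sentStart = -1;   // start offset of the sentence being built, or -1
        int next = 0;         // index of the next boundary to match
        for (int k = 0; k < tokens.size() && next < boundaries.size(); k++) {
            pos += whites.get(k).length();
            if (sentStart < 0) sentStart = pos;
            pos += tokens.get(k).length();
            if (k == boundaries.get(next)) {
                spans.add(new Span(sentStart, pos, SENTENCE_CHUNK_TYPE));
                sentStart = -1;
                next++;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Hi there. Bye!";
        for (Span s : chunk(text)) {
            System.out.println(s.type + " [" + s.start + "," + s.end + ") "
                    + text.substring(s.start, s.end));
        }
    }
}
```

In this sketch the spans index into the original text only because the toy tokenizer reproduces every character of the input.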

The tokens and whitespaces returned by the tokenizer are concatenated to form the underlying text slice of the chunks returned by the chunker. Thus a tokenizer that drops or rewrites tokens, such as the stop list tokenizer or the Porter stemmer tokenizer, will produce a character slice that does not match the input. A whitespace-normalizing tokenizer filter can be used, for example, to produce normalized text as the basis for the chunks.
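To make the mismatch concrete, here is a minimal stand-alone sketch of the concatenation idea. It does not use LingPipe: SliceReconstructionSketch, pieces and rebuild are hypothetical names, and the stop-list behaviour (drop the token, keep the surrounding whitespace) is an assumption made for illustration rather than a description of how any particular LingPipe tokenizer filter handles whitespace.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only; not LingPipe's tokenizer API.
final class SliceReconstructionSketch {

    // Split text into maximal runs that are either all whitespace or all non-whitespace.
    static List<String> pieces(String text) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < text.length()) {
            boolean ws = Character.isWhitespace(text.charAt(i));
            int j = i;
            while (j < text.length() && Character.isWhitespace(text.charAt(j)) == ws) j++;
            out.add(text.substring(i, j));
            i = j;
        }
        return out;
    }

    // Concatenate the runs again, skipping any token found in the stop list.
    static String rebuild(String text, Set<String> stopList) {
        StringBuilder sb = new StringBuilder();
        for (String piece : pieces(text)) {
            if (!stopList.contains(piece.toLowerCase(Locale.ROOT))) sb.append(piece);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "The cat sat. The dog ran.";
        String lossless = rebuild(text, Collections.<String>emptySet());
        String filtered = rebuild(text, Collections.singleton("the"));
        System.out.println(lossless.equals(text));   // nothing dropped, so the text is reproduced
        System.out.println("[" + filtered + "]");    // shorter than the input
        System.out.println(text.indexOf("dog") + " vs " + filtered.indexOf("dog"));
    }
}
```

Once tokens are dropped, every later offset shifts, so chunk offsets computed over the rebuilt text cannot be used to index the original input.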

Since:
LingPipe 2.1
Version:
2.1
Author:
Mitzi Morris

Field Summary
static String SENTENCE_CHUNK_TYPE
          The type assigned to sentence chunks, namely "S".
 
Constructor Summary
SentenceChunker(TokenizerFactory tf, SentenceModel sm)
          Construct a sentence chunker from the specified tokenizer factory and sentence model.
 
Method Summary
 Chunking chunk(char[] cs, int start, int end)
          Return the chunking derived from the underlying sentence model over the tokenization of the specified character slice.
 Chunking chunk(CharSequence cSeq)
          Return the chunking derived from the underlying sentence model over the tokenization of the specified character sequence.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

SENTENCE_CHUNK_TYPE

public static final String SENTENCE_CHUNK_TYPE
The type assigned to sentence chunks, namely "S".

See Also:
Constant Field Values

Constructor Detail

SentenceChunker

public SentenceChunker(TokenizerFactory tf,
                       SentenceModel sm)
Construct a sentence chunker from the specified tokenizer factory and sentence model.

Parameters:
tf - Tokenizer factory for chunker.
sm - Sentence model for chunker.

Method Detail

chunk

public Chunking chunk(CharSequence cSeq)
Return the chunking derived from the underlying sentence model over the tokenization of the specified character sequence. Iterating over the chunk set of the returned chunking is guaranteed to return the sentence chunks in their original textual order.

Warning: As described in the class documentation above, a tokenizer factory that produces tokenizers that do not reproduce the original sequence may cause the underlying character slice for the chunks to differ from the slice provided as an argument.

Specified by:
chunk in interface Chunker
Parameters:
cSeq - Character sequence to chunk.
Returns:
The sentence chunking of the specified character sequence.

chunk

public Chunking chunk(char[] cs,
                      int start,
                      int end)
Return the chunking derived from the underlying sentence model over the tokenization of the specified character slice. See chunk(CharSequence) for more information.

Specified by:
chunk in interface Chunker
Parameters:
cs - Character array underlying the slice.
start - Index of first character in slice.
end - Index of one past the last character in the slice.
Returns:
The sentence chunking of the specified character slice.
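
The start and end parameters describe a half-open range: start is included and end is excluded, so the slice holds end - start characters. The following stand-alone sketch shows only that index arithmetic using java.lang.String; SliceBoundsSketch and its slice helper are hypothetical names, and no LingPipe call is made.

```java
// Illustrates the half-open [start, end) convention; not part of LingPipe.
final class SliceBoundsSketch {

    // Copy the characters cs[start] .. cs[end - 1] into a new string.
    static String slice(char[] cs, int start, int end) {
        if (start < 0 || end > cs.length || start > end) {
            throw new IndexOutOfBoundsException(
                    "start=" + start + " end=" + end + " length=" + cs.length);
        }
        // The String constructor takes an offset and a count, hence end - start.
        return new String(cs, start, end - start);
    }

    public static void main(String[] args) {
        char[] cs = "xx One. Two. yy".toCharArray();
        System.out.println(slice(cs, 3, 12));        // prints "One. Two."
        System.out.println(slice(cs, 0, cs.length)); // prints the whole array
    }
}
```

Passing 0 and cs.length selects the entire array, which is the usual way to chunk a whole char[] through a three-argument method of this shape.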