java.lang.Object
  com.aliasi.sentences.SentenceChunker
public class SentenceChunker
The SentenceChunker class uses a
SentenceModel to implement sentence detection through
the chunk.Chunker interface. A sentence chunker is
constructed from a tokenizer factory and a sentence model. The
tokenizer factory creates tokens that it sends to the sentence
model. Every chunk produced has the type given by the
constant SENTENCE_CHUNK_TYPE.
The tokens and whitespaces returned by the tokenizer are concatenated to form the underlying text slice of the chunks returned by the chunker. Thus a tokenizer that removes or rewrites tokens, such as the stop list tokenizer or the Porter stemmer tokenizer, will create a character slice that does not match the input. A whitespace-normalizing tokenizer filter can be used, for example, to produce normalized text as the basis of the chunks.
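The concatenation behavior described above can be illustrated with a minimal, library-free sketch. It does not use LingPipe; the class `SliceReconstructionSketch` and its methods `tokensAndWhitespaces` and `rebuild` are hypothetical names introduced only for this illustration, and the whitespace-based splitting is an assumption standing in for a real tokenizer. The point is only that rebuilding text from unmodified tokens and whitespaces reproduces the input, while a filter that drops or rewrites tokens does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for illustration only; not part of LingPipe.
class SliceReconstructionSketch {

    // Splits text into alternating whitespace and token strings.
    // Even indices hold whitespace (possibly empty), odd indices hold tokens.
    static List<String> tokensAndWhitespaces(String text) {
        List<String> parts = new ArrayList<>();
        int i = 0;
        while (true) {
            int wsStart = i;
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            parts.add(text.substring(wsStart, i));
            if (i == text.length()) break;
            int tokStart = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            parts.add(text.substring(tokStart, i));
        }
        return parts;
    }

    // Concatenates whitespaces and (possibly filtered) tokens back into one string.
    static String rebuild(List<String> parts, UnaryOperator<String> tokenFilter) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < parts.size(); k++) {
            boolean isToken = (k % 2 == 1);
            sb.append(isToken ? tokenFilter.apply(parts.get(k)) : parts.get(k));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "Call the doctor. He is in.";
        List<String> parts = tokensAndWhitespaces(text);

        // Unmodified tokens: the rebuilt text matches the input.
        String faithful = rebuild(parts, t -> t);
        System.out.println(faithful.equals(text)); // prints "true"

        // Crude stop-list analogue (assumption): drop the token "the".
        String lossy = rebuild(parts, t -> t.equals("the") ? "" : t);
        System.out.println(lossy.equals(text)); // prints "false"
    }
}
```

When the rebuilt text differs from the input, chunk offsets refer to the rebuilt text, so they should not be used to index into the original input.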
| Field Summary | |
|---|---|
| static String | SENTENCE_CHUNK_TYPE: The type assigned to sentence chunks, namely "S". |
| Constructor Summary | |
|---|---|
| SentenceChunker(TokenizerFactory tf, SentenceModel sm) | Construct a sentence chunker from the specified tokenizer factory and sentence model. |

| Method Summary | |
|---|---|
| Chunking | chunk(char[] cs, int start, int end): Return the chunking derived from the underlying sentence model over the tokenization of the specified character slice. |
| Chunking | chunk(CharSequence cSeq): Return the chunking derived from the underlying sentence model over the tokenization of the specified character sequence. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public static final String SENTENCE_CHUNK_TYPE
The type assigned to sentence chunks, namely "S".
| Constructor Detail |
|---|
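public SentenceChunker(TokenizerFactory tf,
                       SentenceModel sm)
Construct a sentence chunker from the specified tokenizer factory and sentence model.
Parameters:
tf - Tokenizer factory for chunker.
sm - Sentence model for chunker.

| Method Detail |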
|---|
public Chunking chunk(CharSequence cSeq)
Return the chunking derived from the underlying sentence model over the tokenization of the specified character sequence.
Warning: As described in the class documentation above, a tokenizer factory whose tokenizers do not reproduce the original sequence may cause the underlying character slice of the chunks to differ from the sequence provided as an argument.
Specified by:
chunk in interface Chunker
Parameters:
cSeq - Character sequence underlying the slice.
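public Chunking chunk(char[] cs,
                      int start,
                      int end)
Return the chunking derived from the underlying sentence model over the tokenization of the specified character slice. See chunk(CharSequence) for more information.
Specified by:
chunk in interface Chunker
Parameters:
cs - Underlying character sequence.
start - Index of first character in slice.
end - Index of one past the last character in the slice.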
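The start and end parameters describe a half-open slice: start is the index of the first character and end is one past the last, so the slice holds end - start characters. The following library-free sketch shows that convention with plain Java; `CharSliceSketch` and `sliceText` are hypothetical names used only for illustration and are not part of LingPipe.

```java
// Hypothetical helper for illustration only; not part of LingPipe.
class CharSliceSketch {

    // Returns the text of the slice cs[start..end), where end is exclusive.
    static String sliceText(char[] cs, int start, int end) {
        return new String(cs, start, end - start);
    }

    public static void main(String[] args) {
        char[] cs = "Hi. Bye. Ok.".toCharArray();
        // Indices 4 through 7 hold "Bye."; 8 is one past the last character.
        System.out.println(sliceText(cs, 4, 8)); // prints "Bye."
    }
}
```

Passing 0 and cs.length as start and end selects the whole array.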