B - the type of the underlying n-best chunker being rescored
O - the type of the process language model for non-entities
C - the type of the sequence language model for entities
public class AbstractCharLmRescoringChunker<B extends NBestChunker,O extends LanguageModel.Process,C extends LanguageModel.Sequence> extends RescoringChunker<B>
AbstractCharLmRescoringChunker provides the basic character language-model rescoring model used by the trainable
CharLmRescoringChunker and its compiled version.
The exact model used is most easily described through an example. Consider the sentence John J. Smith lives in Washington. with John J. Smith as a person-type chunk and Washington as a location-type chunk. The probability of this analysis derives from alternating chunk/non-chunk spans, starting and ending with non-chunk spans:

POUT(CPER|CBOS) * PPER(John J. Smith) * POUT( lives in CLOC|CPER) * PLOC(Washington) * POUT(.CEOS|CLOC)

Note that the chunk models PPER and PLOC are bounded models, and thus predict the first letter given the fact that it's the first letter, and also encode an end-of-string probability to model the end. See
NGramBoundaryLM for more information on bounded models.
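The alternating factorization above can be sketched in plain Java. This is a toy stand-in, not LingPipe code: the uniform minus-4-bits-per-character estimate, the marker characters, and the helper names are all hypothetical. It only illustrates how the analysis score is a sum of log estimates from the non-chunk model (conditioned on the previous type marker) and the per-type bounded chunk models.

```java
public class RescoreSketch {

    // Hypothetical uniform model: -4 bits per character. A stand-in for a
    // real character language model's log2Estimate.
    static double log2Estimate(String cs) {
        return -4.0 * cs.length();
    }

    // log2 P(generated | context) = log2 P(context + generated) - log2 P(context).
    static double condLog2(String context, String generated) {
        return log2Estimate(context + generated) - log2Estimate(context);
    }

    // Hypothetical distinguished characters for boundaries and types.
    static final char BOS = '\u0001';  // begin-of-sentence marker
    static final char EOS = '\u0002';  // end-of-sentence marker
    static final char PER = '\u0003';  // person-type marker
    static final char LOC = '\u0004';  // location-type marker

    static double exampleLog2Prob() {
        double logProb = 0.0;
        // POUT(CPER|CBOS): empty initial non-chunk span, then the PER marker
        logProb += condLog2("" + BOS, "" + PER);
        // PPER(John J. Smith): person-type chunk model
        logProb += log2Estimate("John J. Smith");
        // POUT( lives in CLOC|CPER): non-chunk span, then the LOC marker
        logProb += condLog2("" + PER, " lives in " + LOC);
        // PLOC(Washington): location-type chunk model
        logProb += log2Estimate("Washington");
        // POUT(.CEOS|CLOC): final non-chunk span and end marker
        logProb += condLog2("" + LOC, "." + EOS);
        return logProb;
    }

    public static void main(String[] args) {
        System.out.println(exampleLog2Prob()); // -148.0 with the toy model
    }
}
```

With the toy uniform model the score is just minus 4 bits times the 37 generated characters; a trained model would assign different estimates per span, but the sum over the same five factors is unchanged.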
The POUT model is a process language model, but uses distinguished characters in much the same way as the bounded models do internally. In particular, we have distinguished characters for each type (e.g. CPER), and for the begin-of-sentence and end-of-sentence markers (CBOS and CEOS). These must be chosen so as not to conflict with any input characters in training or decoding. With this encoding, the non-chunk model bears the brunt of the burden in predicting types. To start, it conditions the text it generates on the previous type, encoded as a character. To end, it generates the next chunk type, also encoded as a character. This allows the models to be sensitive to the fact that phrases like lives in (including the spaces on either side) are conditioned on following a person. The following chunk type, location, is generated conditional on the context CPER lives in. The only constraint on the length of these dependencies is the length of the n-gram models (and the size of the chunk/non-chunk spans).
The resulting model generates a properly normalized probability distribution over chunkings.
The character BOS is reserved for use by the system for encoding document start/end positions.
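Because the distinguished type and boundary characters must not conflict with any input characters, a caller might sanity-check training text before use. The following check is a hypothetical sketch, not part of this class's API:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class EncodingCheck {

    // Throws if any distinguished marker character occurs in the text.
    static void validateEncoding(Collection<Character> distinguished, String text) {
        for (char marker : distinguished) {
            if (text.indexOf(marker) >= 0) {
                throw new IllegalArgumentException(
                    "distinguished char U+" + Integer.toHexString(marker)
                    + " occurs in input text");
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical markers: BOS, EOS, and one per chunk type.
        List<Character> markers = Arrays.asList('\u0001', '\u0002', '\u0003', '\u0004');
        validateEncoding(markers, "John J. Smith lives in Washington.");
        System.out.println("no conflicts");
    }
}
```

Low control characters such as U+0001 are a common choice for markers precisely because they rarely occur in natural-language text.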
Constructor and Description
Construct a rescoring chunker based on the specified underlying chunker, with the specified number of underlying chunkings rescored, based on the models and type encodings provided in the last three arguments.
Modifier and Type | Method and Description
Returns the sequence language model for chunks of the specified type.
Returns the process language model for non-chunks.
Performs rescoring of the base chunking output using character language models.
Returns the character used to encode the specified type in the model.
Inherited from RescoringChunker: baseChunker, chunk, chunk, nBest, nBestChunks, numChunkingsRescored, setNumChunkingsRescored
public AbstractCharLmRescoringChunker(B baseNBestChunker, int numChunkingsRescored, O outLM, Map<String,Character> typeToChar, Map<String,C> typeToLM)
baseNBestChunker - Underlying chunker to rescore.
numChunkingsRescored - Number of underlying chunkings rescored by this chunker.
outLM - The process language model for non-chunks.
typeToChar - A mapping from chunk types to the characters that encode them.
typeToLM - A mapping from chunk types to the language models used to model them.
public char typeToChar(String chunkType)
chunkType - Type of chunk.
IllegalArgumentException - If the specified chunk type does not exist.
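The documented contract of typeToChar can be illustrated with a minimal stand-alone sketch. The class name, the map contents, and the marker characters here are hypothetical, not LingPipe's internals; only the lookup-or-throw behavior mirrors the description above:

```java
import java.util.Map;

public class TypeToCharSketch {

    private final Map<String,Character> typeToChar;

    TypeToCharSketch(Map<String,Character> typeToChar) {
        this.typeToChar = typeToChar;
    }

    // Return the encoding character for a chunk type, or throw
    // IllegalArgumentException for an unknown type.
    char typeToChar(String chunkType) {
        Character c = typeToChar.get(chunkType);
        if (c == null)
            throw new IllegalArgumentException("unknown chunk type=" + chunkType);
        return c;
    }

    public static void main(String[] args) {
        TypeToCharSketch sketch =
            new TypeToCharSketch(Map.of("PER", '\u0003', "LOC", '\u0004'));
        System.out.println((int) sketch.typeToChar("PER")); // prints 3
    }
}
```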
public O outLM()
public double rescore(Chunking chunking)
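The role of rescore(Chunking) in the overall chunker can be pictured as re-ranking the base chunker's n-best output by the new scores. The following is a toy illustration, not LingPipe code; the analysis labels and scores are made up, and a real Chunking carries spans and types rather than strings:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

public class RerankSketch {

    // Re-rank the base chunker's n-best list by the rescorer's log score,
    // highest score first.
    static <T> List<T> rerank(List<T> nBest, ToDoubleFunction<T> rescore) {
        List<T> out = new ArrayList<>(nBest);
        out.sort(Comparator.comparingDouble(rescore).reversed());
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical rescored log probabilities for three analyses.
        Map<String,Double> rescored = Map.of("A", -12.0, "B", -7.5, "C", -9.0);
        List<String> baseOrder = Arrays.asList("A", "B", "C");
        System.out.println(rerank(baseOrder, rescored::get)); // [B, C, A]
    }
}
```

Only the top numChunkingsRescored analyses from the underlying chunker are rescored, so the quality of the final chunking depends on the correct analysis appearing somewhere in that n-best list.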