java.lang.Object
  com.aliasi.classify.LMClassifier<L,MultivariateEstimator>
    com.aliasi.classify.DynamicLMClassifier<L>
public class DynamicLMClassifier<L extends LanguageModel.Dynamic>
A DynamicLMClassifier is a language model classifier
that accepts training events of categorized character sequences.
Training is based on a multivariate estimator for the category
distribution and dynamic language models for the per-category
character sequence estimators. These models also form the basis of
the superclass's implementation of classification.
Because this class implements training and classification, it may be used in tag-a-little, learn-a-little supervised learning without retraining epochs. This makes it ideal for active learning applications, for instance.
At any point after adding training events, the classifier may be
compiled to an object output. The classifier read back in will be
a non-dynamic instance of LMClassifier. It will be based
on the compiled version of the multivariate estimator and the
compiled version of the dynamic language models for the categories.
Instances of this class allow concurrent read operations but
require writes to run exclusively. Reads in this context are
either calculating estimates or compiling; writes are training.
Extensions to LingPipe's classes may impose tighter restrictions.
For instance, a subclass of MultivariateEstimator
might be used that does not allow concurrent estimates; in that
case, its restrictions are passed on to this classifier. The same
goes for the language models and, in the case of token language
models, the tokenizer factories.
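One way to honor the concurrent-read, exclusive-write contract described above is to route calls through a read/write lock. The following is a minimal sketch, not part of LingPipe: `ReadWriteGuard` and its `read` and `write` methods are hypothetical names, and the sketch assumes that training calls are wrapped in `write` while estimate and compile calls are wrapped in `read`.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

/** Hypothetical helper: reads may overlap, writes run exclusively. */
final class ReadWriteGuard {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /** Runs a read operation (estimating or compiling) under the shared lock. */
    <T> T read(Supplier<T> op) {
        lock.readLock().lock();
        try {
            return op.get();
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Runs a write operation (training) under the exclusive lock. */
    void write(Runnable op) {
        lock.writeLock().lock();
        try {
            op.run();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

A caller might wrap a training call in `guard.write(...)` and a classification call in `guard.read(...)`. Operations that throw checked exceptions, such as compiling to an object output, would need a variant that accepts a throwing callback.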
| Constructor Summary | |
|---|---|
| DynamicLMClassifier(String[] categories, L[] languageModels) | Construct a dynamic language model classifier over the specified categories with specified language models per category and an overall category estimator. |
| Method Summary | | |
|---|---|---|
| MultivariateEstimator | categoryEstimator() | Deprecated. As of 3.0, use general method LMClassifier.categoryDistribution(). |
| void | compileTo(ObjectOutput objOut) | Writes a compiled version of this classifier to the specified object output. |
| static DynamicLMClassifier<NGramBoundaryLM> | createNGramBoundary(String[] categories, int maxCharNGram) | Construct a dynamic classifier over the specified categories, using boundary character n-gram models of the specified order. |
| static DynamicLMClassifier<NGramProcessLM> | createNGramProcess(String[] categories, int maxCharNGram) | Construct a dynamic classifier over the specified categories, using process character n-gram models of the specified order. |
| static DynamicLMClassifier<TokenizedLM> | createTokenized(String[] categories, TokenizerFactory tokenizerFactory, int maxTokenNGram) | Construct a dynamic language model classifier over the specified categories using token n-gram language models of the specified order and the specified tokenizer factory for tokenization. |
| void | handle(CharSequence charSequence, Classification classification) | Provides a training instance for the specified character sequence using the best category from the specified classification. |
| L | lmForCategory(String category) | Deprecated. As of 3.0, use general LMClassifier.languageModel(String). |
| void | resetCategory(String category, L lm, int newCount) | Resets the specified category to the specified language model. |
| void | train(String category, char[] cs, int start, int end) | Provide a training instance for the specified category consisting of the sequence of characters in the specified character slice. |
| void | train(String category, CharSequence sampleCSeq) | Provide a training instance for the specified category consisting of the specified sample character sequence. |
| void | train(String category, CharSequence sampleCSeq, int count) | Provide a training instance for the specified category consisting of the specified sample character sequence with the specified count. |
| static <L extends LanguageModel.Dynamic> DynamicLMClassifier<L> | trainEm(Factory<DynamicLMClassifier<L>> classifierFactory, Corpus<ClassificationHandler<CharSequence,Classification>> labeledData, Corpus<ObjectHandler<CharSequence>> unlabeledData, int numEpochs, double trainingInstanceMultiple) | Train a dynamic language model classifier using the specified labeled and unlabeled corpora with the expectation maximization (EM) algorithm run for the specified number of epochs with the specified instance multiple, creating a dynamic classifier for each epoch using the specified factory. |
| Methods inherited from class com.aliasi.classify.LMClassifier |
|---|
categories, categoryDistribution, classify, classifyJoint, languageModel |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public DynamicLMClassifier(String[] categories,
L[] languageModels)
The multivariate estimator over categories is initialized
with one count for each category. Technically, initializing
counts involves a uniform Dirichlet prior with
α=1, which is often called Laplace
smoothing.
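The effect of starting every category with one count can be sketched numerically. The class below is an illustration only, not LingPipe's MultivariateEstimator: `LaplaceCategoryEstimate` and `estimate` are hypothetical names, and the sketch assumes the category estimate is the relative frequency of the observed training events plus the one initial count per category.

```java
/** Hypothetical sketch of add-one (Laplace) smoothed category estimates. */
final class LaplaceCategoryEstimate {
    /**
     * Returns smoothed probabilities (count + 1) / (total + numCategories),
     * where trainingCounts[i] is the number of training events seen for
     * category i, excluding the initial count.
     */
    static double[] estimate(long[] trainingCounts) {
        long total = 0;
        for (long c : trainingCounts) {
            total += c;
        }
        // One extra count per category comes from the initialization.
        double denominator = total + trainingCounts.length;
        double[] probs = new double[trainingCounts.length];
        for (int i = 0; i < probs.length; i++) {
            probs[i] = (trainingCounts[i] + 1) / denominator;
        }
        return probs;
    }
}
```

Under these assumptions, a classifier with two categories and no training events assigns each category probability 1/2, and no category ever receives probability zero.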
Parameters:
categories - Categories used for classification.
languageModels - Dynamic language models for categories.
Throws:
IllegalArgumentException - If there are not at least two categories, or if the length of the category and language model arrays is not the same.

| Method Detail |
|---|
public void train(String category,
char[] cs,
int start,
int end)
No modeling of the begin or end of the sequence is carried out. If such a behavior is desired, it should be reflected in the training instances supplied to this method.
The component models for this classifier may be accessed and
trained independently using categoryEstimator() and
lmForCategory(String).
Parameters:
category - Category of this training sequence.
cs - Characters used for training.
start - Index of first character to use for training.
end - Index of one past the last character to use for training.
Throws:
IllegalArgumentException - If the category is not known.
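Because begin and end of sequence are not modeled, a caller who wants boundary effects has to encode them in the characters passed to training. The following is a minimal sketch of one way to do that, assuming a marker character that never occurs in the data. `BoundaryMarkers`, `wrap`, and `slice` are hypothetical helpers, not LingPipe methods.

```java
/** Hypothetical helper for adding explicit boundary markers to training text. */
final class BoundaryMarkers {
    /** Returns the text wrapped in the marker character at both ends. */
    static char[] wrap(CharSequence text, char marker) {
        char[] cs = new char[text.length() + 2];
        cs[0] = marker;
        for (int i = 0; i < text.length(); i++) {
            cs[i + 1] = text.charAt(i);
        }
        cs[cs.length - 1] = marker;
        return cs;
    }

    /** Copies the slice from start (inclusive) to end (exclusive). */
    static String slice(char[] cs, int start, int end) {
        return new String(cs, start, end - start);
    }
}
```

The wrapped array could then be supplied as `cs` with `start` 0 and `end` equal to `cs.length`. The `slice` helper only shows which characters a given `start` and `end` pair selects, given that `end` is one past the last character used.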
public void train(String category,
CharSequence sampleCSeq)
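Provide a training instance for the specified category consisting of the specified sample character sequence. See train(String,char[],int,int).
Parameters:
category - Category of this training sequence.
sampleCSeq - Character sequence for training.
Throws:
IllegalArgumentException - If the category is not known.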
public void train(String category,
CharSequence sampleCSeq,
int count)
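Provide a training instance for the specified category consisting of the specified sample character sequence with the specified count. See train(String,char[],int,int).
Counts of zero are ignored, whereas counts less than zero raise an exception.
Parameters:
category - Category of this training sequence.
sampleCSeq - Character sequence for training.
count - Number of training instances.
Throws:
IllegalArgumentException - If the category is not known or if the count is negative.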
public static <L extends LanguageModel.Dynamic> DynamicLMClassifier<L> trainEm(Factory<DynamicLMClassifier<L>> classifierFactory,
Corpus<ClassificationHandler<CharSequence,Classification>> labeledData,
Corpus<ObjectHandler<CharSequence>> unlabeledData,
int numEpochs,
double trainingInstanceMultiple)
throws IOException
The training instance multiple parameter specifies the quantization of conditional probabilities into integer counts: each conditional probability is multiplied by this value and cast to an integer. The higher the value, the more categories receive a nonzero count for each unlabeled instance.
The exact form of the EM algorithm as used by this method is:
1. create classifier using factory
2. train on labeled data
3. for each epoch:
   A. create a new classifier
   B. train the new classifier on labeled data
   C. for each unlabeled datum
      i. classify using last classifier
      ii. for each output category in result
          a. multiply conditional prob by multiple, cast to int
          b. train new classifier on datum using category plus count
Parameters:
classifierFactory - Factory for creating the dynamic language model classifiers needed by EM.
labeledData - A corpus of labeled data.
unlabeledData - A corpus of unlabeled data.
numEpochs - Number of epochs to run EM.
trainingInstanceMultiple - Amount to multiply each conditional probability by to generate an integer count for training.
Throws:
IOException
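The quantization in step 3.C.ii of the algorithm above can be sketched in isolation. This is an illustration under stated assumptions, not the library's code: `EmCountQuantizer` and `quantize` are hypothetical names, the conditional probabilities are supplied as a plain map from category to probability, and categories whose count truncates to zero are dropped, in line with the documented behavior that training counts of zero are ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of turning conditional probabilities into integer training counts. */
final class EmCountQuantizer {
    /**
     * Multiplies each conditional probability by the multiple and casts to int.
     * Categories whose count truncates to zero are omitted from the result.
     */
    static Map<String, Integer> quantize(Map<String, Double> conditionalProbs,
                                         double multiple) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : conditionalProbs.entrySet()) {
            int count = (int) (entry.getValue() * multiple); // truncates toward zero
            if (count > 0) {
                counts.put(entry.getKey(), count);
            }
        }
        return counts;
    }
}
```

With a multiple of 10, conditional probabilities of 0.72, 0.27, and 0.01 yield counts of 7 and 2 for the first two categories, and the third contributes nothing. With a multiple of 100 the third category would contribute one count, which is how a higher multiple brings more outcomes into play for each unlabeled instance.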
public void handle(CharSequence charSequence,
Classification classification)
Provides a training instance for the specified character sequence using the best category from the specified classification. The character sequence is passed along with the first-best category to train(String,CharSequence).
Specified by:
handle in interface ClassificationHandler<CharSequence,Classification>
Parameters:
charSequence - Character sequence for training.
classification - Classification to use for training.
Throws:
ClassCastException - If the specified object does not implement CharSequence.
public MultivariateEstimator categoryEstimator()
Deprecated. As of 3.0, use general method LMClassifier.categoryDistribution().
public L lmForCategory(String category)
Deprecated. As of 3.0, use general LMClassifier.languageModel(String).
IllegalArgumentException - If the category is not known.
public void compileTo(ObjectOutput objOut)
throws IOException
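Writes a compiled version of this classifier to the specified object output. The classifier read back in will be a non-dynamic instance of LMClassifier.
Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which this classifier is written.
Throws:
IOException - If there is an I/O exception writing to the output stream.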
public void resetCategory(String category,
L lm,
int newCount)
Resets the specified category to the specified language model.
Parameters:
category - Category to reset.
lm - New dynamic language model for category.
newCount - New count for category.
Throws:
IllegalArgumentException - If the category is not known.
public static DynamicLMClassifier<NGramProcessLM> createNGramProcess(String[] categories,
int maxCharNGram)
See the documentation for the constructor DynamicLMClassifier(String[], LanguageModel.Dynamic[]) for
information on the category multivariate estimate for priors.
Parameters:
categories - Categories used for classification.
maxCharNGram - Maximum length of character sequence counted in model.
Throws:
IllegalArgumentException - If there are not at least two categories.
public static DynamicLMClassifier<NGramBoundaryLM> createNGramBoundary(String[] categories,
int maxCharNGram)
See the documentation for the constructor DynamicLMClassifier(String[], LanguageModel.Dynamic[]) for
information on the category multivariate estimate for priors.
Parameters:
categories - Categories used for classification.
maxCharNGram - Maximum length of character sequence counted in model.
Throws:
IllegalArgumentException - If there are not at least two categories.
public static DynamicLMClassifier<TokenizedLM> createTokenized(String[] categories,
TokenizerFactory tokenizerFactory,
int maxTokenNGram)
The multivariate estimator over categories is initialized with one count for each category.
The unknown token and whitespace models are uniform sequence models.
Parameters:
categories - Categories used for classification.
maxTokenNGram - Maximum length of token n-grams used.
tokenizerFactory - Tokenizer factory for tokenization.
Throws:
IllegalArgumentException - If there are not at least two categories.