com.aliasi.classify
Class DynamicLMClassifier<L extends LanguageModel.Dynamic>

java.lang.Object
  extended by com.aliasi.classify.LMClassifier<L,MultivariateEstimator>
      extended by com.aliasi.classify.DynamicLMClassifier<L>
All Implemented Interfaces:
Classifier<CharSequence,JointClassification>, ClassificationHandler<CharSequence,Classification>, Handler, Compilable
Direct Known Subclasses:
BinaryLMClassifier, NaiveBayesClassifier

public class DynamicLMClassifier<L extends LanguageModel.Dynamic>
extends LMClassifier<L,MultivariateEstimator>
implements ClassificationHandler<CharSequence,Classification>, Compilable

A DynamicLMClassifier is a language model classifier that accepts training events of categorized character sequences. Training is based on a multivariate estimator for the category distribution and dynamic language models for the per-category character sequence estimators. These models also form the basis of the superclass's implementation of classification.

Because this class implements training and classification, it may be used in tag-a-little, learn-a-little supervised learning without retraining epochs. This makes it ideal for active learning applications, for instance.

At any point after adding training events, the classifier may be compiled to an object output. The classifier read back in will be a non-dynamic instance of LMClassifier. It will be based on the compiled version of the multivariate estimator and the compiled version of the dynamic language models for the categories.

Instances of this class allow concurrent read operations but require writes to run exclusively. Reads in this context are either calculating estimates or compiling; writes are training. Extensions to LingPipe's classes may impose tighter restrictions. For instance, a subclass of MultivariateEstimator might be used that does not allow concurrent estimates; in that case, its restrictions are passed on to this classifier. The same goes for the language models and, in the case of token language models, the tokenizer factories.
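
One way for client code to honor this contract is to guard the classifier with a read-write lock. The following is a minimal sketch using only java.util.concurrent; the ReadWriteGuard class is hypothetical and not part of LingPipe.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

// Hypothetical wrapper, not part of LingPipe: lets reads (estimating,
// compiling) run concurrently while writes (training) run exclusively.
final class ReadWriteGuard {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Run a read operation; many readers may hold the lock at once.
    <T> T read(Supplier<T> op) {
        lock.readLock().lock();
        try {
            return op.get();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Run a write operation; it excludes all readers and other writers.
    void write(Runnable op) {
        lock.writeLock().lock();
        try {
            op.run();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Assuming a variable named classifier holds the shared instance, training calls would go through guard.write(() -> classifier.train(category, text)) and classification calls through guard.read(() -> classifier.classify(text)).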

Since:
LingPipe 2.0
Version:
3.3.1
Author:
Bob Carpenter

Constructor Summary
DynamicLMClassifier(String[] categories, L[] languageModels)
          Construct a dynamic language model classifier over the specified categories with specified language models per category and an overall category estimator.
 
Method Summary
 MultivariateEstimator categoryEstimator()
          Deprecated. As of 3.0, use general method LMClassifier.categoryDistribution().
 void compileTo(ObjectOutput objOut)
          Writes a compiled version of this classifier to the specified object output.
static DynamicLMClassifier<NGramBoundaryLM> createNGramBoundary(String[] categories, int maxCharNGram)
          Construct a dynamic classifier over the specified categories, using boundary character n-gram models of the specified order.
static DynamicLMClassifier<NGramProcessLM> createNGramProcess(String[] categories, int maxCharNGram)
          Construct a dynamic classifier over the specified categories, using process character n-gram models of the specified order.
static DynamicLMClassifier<TokenizedLM> createTokenized(String[] categories, TokenizerFactory tokenizerFactory, int maxTokenNGram)
          Construct a dynamic language model classifier over the specified categories using token n-gram language models of the specified order and the specified tokenizer factory for tokenization.
 void handle(CharSequence charSequence, Classification classification)
          Provides a training instance for the specified character sequence using the best category from the specified classification.
 L lmForCategory(String category)
          Deprecated. As of 3.0, use general LMClassifier.languageModel(String).
 void resetCategory(String category, L lm, int newCount)
          Resets the specified category to the specified language model.
 void train(String category, char[] cs, int start, int end)
          Provide a training instance for the specified category consisting of the sequence of characters in the specified character slice.
 void train(String category, CharSequence sampleCSeq)
          Provide a training instance for the specified category consisting of the specified sample character sequence.
 void train(String category, CharSequence sampleCSeq, int count)
          Provide a training instance for the specified category consisting of the specified sample character sequence with the specified count.
static <L extends LanguageModel.Dynamic> DynamicLMClassifier<L> trainEm(Factory<DynamicLMClassifier<L>> classifierFactory, Corpus<ClassificationHandler<CharSequence,Classification>> labeledData, Corpus<ObjectHandler<CharSequence>> unlabeledData, int numEpochs, double trainingInstanceMultiple)
          Train a dynamic language model classifier using the specified labeled and unlabeled corpora with the expectation maximization (EM) algorithm run for the specified number of epochs with the specified instance multiple, creating a dynamic classifier for each epoch using the specified factory.
 
Methods inherited from class com.aliasi.classify.LMClassifier
categories, categoryDistribution, classify, classifyJoint, languageModel
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DynamicLMClassifier

public DynamicLMClassifier(String[] categories,
                           L[] languageModels)
Construct a dynamic language model classifier over the specified categories with specified language models per category and an overall category estimator.

The multivariate estimator over categories is initialized with one count for each category. Technically, initializing counts involves a uniform Dirichlet prior with α=1, which is often called Laplace smoothing.
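
The effect of the initial counts can be sketched as follows. This is an illustration of add-one smoothing over category counts only, not LingPipe's MultivariateEstimator; the class and method names are hypothetical.

```java
// Illustrative sketch, not LingPipe's MultivariateEstimator: category
// probabilities when every category starts with one initial count.
final class AddOnePrior {
    // trainingCounts[i] is the number of training events seen for category i.
    static double[] probabilities(long[] trainingCounts) {
        long total = 0;
        for (long c : trainingCounts) {
            total += c + 1;  // one initial count per category
        }
        double[] probs = new double[trainingCounts.length];
        for (int i = 0; i < probs.length; ++i) {
            probs[i] = (trainingCounts[i] + 1) / (double) total;
        }
        return probs;
    }
}
```

Before any training, two categories each receive probability 1/2. After three training events for the first category and none for the second, the estimates are 4/5 and 1/5, so no category is ever assigned zero probability.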

Parameters:
categories - Categories used for classification.
languageModels - Dynamic language models for categories.
Throws:
IllegalArgumentException - If there are not at least two categories, or if the length of the category and language model arrays is not the same.
Method Detail

train

public void train(String category,
                  char[] cs,
                  int start,
                  int end)
Provide a training instance for the specified category consisting of the sequence of characters in the specified character slice. A call to this method increments the count of the category in the maximum likelihood estimator and also trains the language model for the specified category. Because every call updates the category counts, the proportion of training calls per category should match the proportion of categories expected in the test set.

No modeling of the begin or end of the sequence is carried out. If such a behavior is desired, it should be reflected in the training instances supplied to this method.

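The component models for this classifier may be accessed and trained independently using categoryEstimator() and lmForCategory(String), or their non-deprecated replacements LMClassifier.categoryDistribution() and LMClassifier.languageModel(String).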

Parameters:
category - Category of this training sequence.
cs - Characters used for training.
start - Index of first character to use for training.
end - Index of one past the last character to use for training.
Throws:
IllegalArgumentException - If the category is not known.
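
The slice convention for start and end is the usual half-open one: start is inclusive and end is exclusive. The following sketch shows which characters a given slice covers; the CharSlice helper is hypothetical and only illustrates the indexing.

```java
// Sketch of the slice convention: start is inclusive, end is exclusive,
// so the slice covers end - start characters.
final class CharSlice {
    static String toText(char[] cs, int start, int end) {
        if (start < 0 || end > cs.length || start > end) {
            throw new IndexOutOfBoundsException(
                "start=" + start + " end=" + end + " length=" + cs.length);
        }
        return new String(cs, start, end - start);
    }
}
```

For the array holding "hello world", the slice with start 6 and end 11 covers the five characters of "world".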

train

public void train(String category,
                  CharSequence sampleCSeq)
Provide a training instance for the specified category consisting of the specified sample character sequence. Training behavior is as described in train(String,char[],int,int).

Parameters:
category - Category of this training sequence.
sampleCSeq - Character sequence for training.
Throws:
IllegalArgumentException - If the category is not known.

train

public void train(String category,
                  CharSequence sampleCSeq,
                  int count)
Provide a training instance for the specified category consisting of the specified sample character sequence with the specified count. Training behavior is as described in train(String,char[],int,int).

Counts of zero are ignored, whereas counts less than zero raise an exception.

Parameters:
category - Category of this training sequence.
sampleCSeq - Character sequence for training.
count - Number of training instances.
Throws:
IllegalArgumentException - If the category is not known or if the count is negative.

trainEm

public static <L extends LanguageModel.Dynamic> DynamicLMClassifier<L> trainEm(Factory<DynamicLMClassifier<L>> classifierFactory,
                                                                               Corpus<ClassificationHandler<CharSequence,Classification>> labeledData,
                                                                               Corpus<ObjectHandler<CharSequence>> unlabeledData,
                                                                               int numEpochs,
                                                                               double trainingInstanceMultiple)
                                                                    throws IOException
Train a dynamic language model classifier using the specified labeled and unlabeled corpora with the expectation maximization (EM) algorithm run for the specified number of epochs with the specified instance multiple, creating a dynamic classifier for each epoch using the specified factory.

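The training instance multiple parameter specifies the quantization of conditional probabilities into integer counts. Each conditional probability is multiplied by this value and truncated to an integer, so the higher the value, the more categories receive a nonzero count for each unlabeled instance.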
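
The quantization can be sketched as follows. This is an illustration of the multiply-and-truncate step only, not LingPipe source; the CountQuantizer class is hypothetical.

```java
// Sketch of the quantization step, not LingPipe source: each conditional
// probability is scaled by the multiple and truncated toward zero.
final class CountQuantizer {
    static int[] quantize(double[] conditionalProbs, double multiple) {
        int[] counts = new int[conditionalProbs.length];
        for (int i = 0; i < counts.length; ++i) {
            counts[i] = (int) (multiple * conditionalProbs[i]);
        }
        return counts;
    }
}
```

With conditional probabilities 0.75, 0.125 and 0.125, a multiple of 10 yields counts 7, 1 and 1, whereas a multiple of 4 yields 3, 0 and 0. Counts of zero are ignored by train(String,CharSequence,int), so with the smaller multiple only the first category is trained on that instance.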

The exact form of the EM algorithm as used by this method is:

 1. create classifier using factory
 2. train on labeled data
 3. for each epoch:
    A. create a new classifier
    B. train the new classifier on labeled data
    C. for each unlabeled datum
       i. classify using last classifier
       ii. for each output category in result
           a. multiply conditional prob by multiple, cast to int
           b. train new classifier on datum using category plus count
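
The outline above can be sketched end to end with a toy model. Everything in this block is an assumption made for illustration: ToyClassifier is a hypothetical stand-in using smoothed character unigrams, categories are plain integer indexes, and corpora are in-memory lists rather than Corpus instances. Only the control flow mirrors the numbered steps.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Toy stand-in for a dynamic classifier (hypothetical, not LingPipe):
// add-one category counts plus add-one smoothed character unigrams.
final class ToyClassifier {
    private static final int ALPHABET = 128;  // toy assumption: chars folded into 128 bins
    private final double[] categoryCounts;
    private final double[][] charCounts;
    private final double[] charTotals;

    ToyClassifier(int numCategories) {
        categoryCounts = new double[numCategories];
        charCounts = new double[numCategories][ALPHABET];
        charTotals = new double[numCategories];
        Arrays.fill(categoryCounts, 1.0);
        Arrays.fill(charTotals, ALPHABET);
        for (double[] row : charCounts) {
            Arrays.fill(row, 1.0);
        }
    }

    void train(int category, CharSequence text, int count) {
        if (count < 0) {
            throw new IllegalArgumentException("negative count: " + count);
        }
        if (count == 0) {
            return;  // zero counts are ignored
        }
        categoryCounts[category] += count;
        for (int j = 0; j < text.length(); ++j) {
            charCounts[category][text.charAt(j) % ALPHABET] += count;
            charTotals[category] += count;
        }
    }

    // Conditional probabilities p(category | text), normalized in log space.
    double[] conditionalProbs(CharSequence text) {
        double[] probs = new double[categoryCounts.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < probs.length; ++i) {
            probs[i] = Math.log(categoryCounts[i]);
            for (int j = 0; j < text.length(); ++j) {
                int c = text.charAt(j) % ALPHABET;
                probs[i] += Math.log(charCounts[i][c] / charTotals[i]);
            }
            max = Math.max(max, probs[i]);
        }
        double sum = 0.0;
        for (int i = 0; i < probs.length; ++i) {
            probs[i] = Math.exp(probs[i] - max);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; ++i) {
            probs[i] /= sum;
        }
        return probs;
    }
}

final class EmSketch {
    static ToyClassifier trainEm(Supplier<ToyClassifier> factory,
                                 List<String> labeledTexts, int[] labels,
                                 List<String> unlabeledTexts,
                                 int numEpochs, double multiple) {
        ToyClassifier last = factory.get();                     // step 1
        trainLabeled(last, labeledTexts, labels);               // step 2
        for (int epoch = 0; epoch < numEpochs; ++epoch) {       // step 3
            ToyClassifier next = factory.get();                 // 3.A
            trainLabeled(next, labeledTexts, labels);           // 3.B
            for (String text : unlabeledTexts) {                // 3.C
                double[] probs = last.conditionalProbs(text);   // 3.C.i
                for (int cat = 0; cat < probs.length; ++cat) {  // 3.C.ii
                    int count = (int) (multiple * probs[cat]);  // 3.C.ii.a
                    next.train(cat, text, count);               // 3.C.ii.b
                }
            }
            last = next;
        }
        return last;
    }

    private static void trainLabeled(ToyClassifier c, List<String> texts, int[] labels) {
        for (int i = 0; i < texts.size(); ++i) {
            c.train(labels[i], texts.get(i), 1);
        }
    }
}
```

Note that each epoch starts from a fresh classifier built by the factory, so the unlabeled counts from earlier epochs do not accumulate; only the previous epoch's classifier is used to label the unlabeled data.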
 

Parameters:
classifierFactory - Factory for creating the dynamic language model classifiers needed by EM.
labeledData - A corpus of labeled data.
unlabeledData - A corpus of unlabeled data.
numEpochs - Number of epochs to run EM.
trainingInstanceMultiple - Amount to multiply each conditional probability by to generate an integer count for training.
Throws:
IOException

handle

public void handle(CharSequence charSequence,
                   Classification classification)
Provides a training instance for the specified character sequence using the best category from the specified classification. Only the first-best category from the classification is used. The object is cast to CharSequence, and the result passed along with the first-best category to train(String,CharSequence).

Specified by:
handle in interface ClassificationHandler<CharSequence,Classification>
Parameters:
charSequence - Character sequence for training.
classification - Classification to use for training.
Throws:
ClassCastException - If the specified object does not implement CharSequence.

categoryEstimator

public MultivariateEstimator categoryEstimator()
Deprecated. As of 3.0, use general method LMClassifier.categoryDistribution().

Returns the maximum likelihood estimator for categories in this classifier. Changes to the returned model will be reflected in this classifier; thus it may be used to train the category estimator without affecting the language models for any category.

Returns:
The maximum likelihood estimator for categories in this classifier.

lmForCategory

public L lmForCategory(String category)
Deprecated. As of 3.0, use general LMClassifier.languageModel(String).

Returns the language model for the specified category. Changes to the returned model will be reflected in this classifier; thus it may be used to train a language model without affecting the category estimates.

Returns:
The language model for the specified category.
Throws:
IllegalArgumentException - If the category is not known.

compileTo

public void compileTo(ObjectOutput objOut)
               throws IOException
Writes a compiled version of this classifier to the specified object output. The object read back in will be an instance of LMClassifier.

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which this classifier is written.
Throws:
IOException - If there is an I/O exception writing to the output stream.
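
A write-then-read-back round trip through an in-memory buffer can be sketched with the standard java.io streams. The CompileRoundTrip class and its Writer callback are hypothetical; the callback stands in for a call such as compileTo(objOut), and the example assumes the compiled form is read back with a single readObject() call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// Sketch of the write-then-read-back pattern around an ObjectOutput.
// The Writer callback stands in for a call such as compileTo(objOut).
final class CompileRoundTrip {
    interface Writer {
        void writeTo(ObjectOutput out) throws IOException;
    }

    static Object roundTrip(Writer writer)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            writer.writeTo(out);  // e.g. classifier.compileTo(out)
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }
}
```

With a classifier in hand, the callback would be out -> classifier.compileTo(out), and the returned object would be cast to LMClassifier.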

resetCategory

public void resetCategory(String category,
                          L lm,
                          int newCount)
Resets the specified category to the specified language model. This also resets the count for the category in the multivariate estimator of categories to the specified new count.

Parameters:
category - Category to reset.
lm - New dynamic language model for category.
newCount - New count for category.
Throws:
IllegalArgumentException - If the category is not known.

createNGramProcess

public static DynamicLMClassifier<NGramProcessLM> createNGramProcess(String[] categories,
                                                                     int maxCharNGram)
Construct a dynamic classifier over the specified categories, using process character n-gram models of the specified order.

See the documentation for the constructor DynamicLMClassifier(String[], LanguageModel.Dynamic[]) for information on the category multivariate estimate for priors.

Parameters:
categories - Categories used for classification.
maxCharNGram - Maximum length of character sequence counted in model.
Throws:
IllegalArgumentException - If there are not at least two categories.

createNGramBoundary

public static DynamicLMClassifier<NGramBoundaryLM> createNGramBoundary(String[] categories,
                                                                       int maxCharNGram)
Construct a dynamic classifier over the specified categories, using boundary character n-gram models of the specified order.

See the documentation for the constructor DynamicLMClassifier(String[], LanguageModel.Dynamic[]) for information on the category multivariate estimate for priors.

Parameters:
categories - Categories used for classification.
maxCharNGram - Maximum length of character sequence counted in model.
Throws:
IllegalArgumentException - If there are not at least two categories.

createTokenized

public static DynamicLMClassifier<TokenizedLM> createTokenized(String[] categories,
                                                               TokenizerFactory tokenizerFactory,
                                                               int maxTokenNGram)
Construct a dynamic language model classifier over the specified categories using token n-gram language models of the specified order and the specified tokenizer factory for tokenization.

The multivariate estimator over categories is initialized with one count for each category.

The unknown token and whitespace models are uniform sequence models.

Parameters:
categories - Categories used for classification.
maxTokenNGram - Maximum length of token n-grams used.
tokenizerFactory - Tokenizer factory for tokenization.
Throws:
IllegalArgumentException - If there are not at least two categories.