public class NaiveBayesClassifier extends DynamicLMClassifier<TokenizedLM>
NaiveBayesClassifier provides a trainable naive Bayes
text classifier, with tokens as features. A classifier is
constructed from a set of categories and a tokenizer factory. The
token estimator is a unigram token language model with a uniform
whitespace model and an optional n-gram character language model
for smoothing unknown tokens.
Naive Bayes applied to tokenized text results in a so-called "bag of words" model where the tokens (words) are assumed to be independent of one another:
P(tokens|cat)
= Π_{i<tokens.length} P(tokens[i]|cat)
This class implements this assumption by plugging unigram token
language models into a dynamic language model classifier. The
unigram token language model makes the naive Bayes assumption by
virtue of having no tokens of context.
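The independence assumption can be sketched in a few lines of plain Java. This is an illustration only, not LingPipe's implementation: the class name BagOfWordsSketch, the method log2TokensGivenCat, and the use of a plain map of per-token probabilities are all assumptions made for the example. It sums log probabilities rather than multiplying raw probabilities, which avoids underflow on long texts.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names and data structures are hypothetical.
class BagOfWordsSketch {

    // Returns log2 P(tokens|cat) as the sum of the per-token log2 estimates,
    // i.e. the log of the product over i of P(tokens[i]|cat).
    static double log2TokensGivenCat(String[] tokens, Map<String, Double> tokenProbs) {
        double sum = 0.0;
        for (String token : tokens) {
            Double p = tokenProbs.get(token);
            if (p == null) {
                // A real model would back off to a smoothing estimate here.
                throw new IllegalArgumentException("no estimate for token: " + token);
            }
            sum += Math.log(p) / Math.log(2.0);
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Double> probs = new HashMap<>();
        probs.put("the", 0.5);
        probs.put("cat", 0.25);
        // Token order does not matter: only the multiset of tokens affects the score.
        System.out.println(log2TokensGivenCat(new String[] {"the", "cat", "the"}, probs));
        System.out.println(log2TokensGivenCat(new String[] {"cat", "the", "the"}, probs));
    }
}
```

Because each token contributes an independent term, reordering the tokens leaves the score unchanged, which is what "bag of words" refers to.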
The unigram model smooths maximum likelihood token estimates with a character-level model. Unfolding the general definition of that class to the unigram case yields the model:
P(token|cat)
= P_{tokenLM(cat)}(token)
= λ * count(token,cat) / totalCount(cat)
+ (1 - λ) * P_{charLM(cat)}(token)
where tokenLM(cat) is the token language model defined
for the specified category and charLM(cat) is the
character-level language model it uses for smoothing. The unigram
token model is based on counts count(token,cat) of a
token in the category and an overall count totalCount(cat)
of tokens in the category. The interpolation factor λ
is computed as per the Witten-Bell model C with hyperparameter one:
λ = totalCount(cat) / (totalCount(cat) + numTokens(cat))
Roughly, the probability mass smoothed from the token model is
equal to the number of first sightings of tokens in the training
data.
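The interpolation above can be worked through numerically with a small sketch. It is not the library's code: the class WittenBellSketch and its method names are hypothetical, numTokens is assumed to mean the number of distinct tokens seen in the category, and the character model probability is passed in as a plain number rather than computed.

```java
// Illustrative sketch only; names are hypothetical and the character
// language model is replaced by a caller-supplied probability.
class WittenBellSketch {

    // lambda = totalCount / (totalCount + numTokens)
    static double lambda(long totalCount, long numTokens) {
        if (totalCount + numTokens == 0) {
            return 0.0; // no training data: rely entirely on the back-off model
        }
        return (double) totalCount / (double) (totalCount + numTokens);
    }

    // P(token|cat) = lambda * count/totalCount + (1 - lambda) * P_charLM(token)
    static double smoothedEstimate(long tokenCount, long totalCount,
                                   long numTokens, double charLmProb) {
        double lambda = lambda(totalCount, numTokens);
        double maxLikelihood = totalCount == 0 ? 0.0 : (double) tokenCount / totalCount;
        return lambda * maxLikelihood + (1.0 - lambda) * charLmProb;
    }

    public static void main(String[] args) {
        // A category with 8 token instances over 2 distinct tokens.
        System.out.println(lambda(8, 2));
        // A token seen 6 times, with an assumed character-model probability of 0.01.
        System.out.println(smoothedEstimate(6, 8, 2, 0.01));
        // An unseen token receives only the back-off share.
        System.out.println(smoothedEstimate(0, 8, 2, 0.01));
    }
}
```

With few distinct tokens relative to the total count, λ is close to one and the maximum likelihood estimate dominates; a category with many first sightings shifts more mass to the character model.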
If this character smoothing model is uniform, there are two extremes that need to be balanced, especially when there is little training data per category. If it is initialized with the true number of characters, it will return a proper uniform character estimate. In practice, this will probably underestimate unknown tokens, so categories in which they are unknown will pay a high penalty. If the token smoothing model is initialized with zero as the max number of characters, the token backoff cost will always be zero and thus not contribute to the classification scores. This will overestimate unknown tokens for classification, with probabilities summing to more than one. In practice, it will probably not penalize unknown words in categories enough. If the cost is greater than zero, it will be linear in the length of the unknown token.
Another way to smooth unknown tokens is to provide each model at least one instance of each token known to every other model, so that no token is known to one model but not to another. But this adds a further smoothing bias to the maximum likelihood character estimates, which may or may not be helpful.
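The shared-vocabulary idea can be sketched on plain count tables. This is a hypothetical illustration, not something the class does for you: SharedVocabularySketch and shareVocabulary are invented names, and the per-category models are stood in for by maps from token to count. With a real classifier the same effect would presumably be obtained by training each category on the missing tokens.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative sketch only; count maps stand in for per-category models.
class SharedVocabularySketch {

    // Returns a copy of the counts in which every category has at least one
    // instance of every token seen in any category.
    static Map<String, Map<String, Integer>> shareVocabulary(
            Map<String, Map<String, Integer>> countsByCategory) {
        // Collect the union of all tokens over all categories.
        Set<String> vocabulary = new HashSet<>();
        for (Map<String, Integer> counts : countsByCategory.values()) {
            vocabulary.addAll(counts.keySet());
        }
        Map<String, Map<String, Integer>> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> entry : countsByCategory.entrySet()) {
            Map<String, Integer> padded = new TreeMap<>(entry.getValue());
            for (String token : vocabulary) {
                padded.putIfAbsent(token, 1); // add one instance only where missing
            }
            result.put(entry.getKey(), padded);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        Map<String, Integer> spam = new TreeMap<>();
        spam.put("buy", 3);
        Map<String, Integer> ham = new TreeMap<>();
        ham.put("hello", 2);
        counts.put("spam", spam);
        counts.put("ham", ham);
        System.out.println(shareVocabulary(counts));
    }
}
```

The added instances are exactly the extra smoothing bias the paragraph warns about: they change the token estimates of every category that receives them.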
The unigram model is constructed with a whitespace model that
returns a constant zero estimate, UniformBoundaryLM.ZERO_LM,
and thus contributes no probability mass to estimates.
As with the other language model classifiers, the conditional category probability ratios are determined with a category distribution and inversion:
ARGMAX_{cat} P(cat|tokens)
= ARGMAX_{cat} P(cat,tokens) / P(tokens)
= ARGMAX_{cat} P(cat,tokens)
= ARGMAX_{cat} P(tokens|cat) * P(cat)
The category probability model P(cat) is taken
to be a multivariate estimator with an initial count of one
for each category.
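The inversion and the category prior can be sketched as follows. This is a minimal illustration under stated assumptions, not the classifier's actual code: CategoryArgmaxSketch and its methods are hypothetical names, the initial count of one is read as adding one to each category's training count, and the per-category values of log2 P(tokens|cat) are supplied by the caller.

```java
// Illustrative sketch only; names are hypothetical.
class CategoryArgmaxSketch {

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // log2 P(cat), with every category starting from a count of one.
    static double log2Prior(int[] trainingCounts, int cat) {
        long total = 0;
        for (int count : trainingCounts) {
            total += count + 1L;
        }
        return log2((trainingCounts[cat] + 1.0) / total);
    }

    // Index of the category maximizing log2 P(tokens|cat) + log2 P(cat).
    // P(tokens) is the same for every category, so it is dropped.
    static int bestCategory(double[] log2TokensGivenCat, int[] trainingCounts) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int cat = 0; cat < log2TokensGivenCat.length; ++cat) {
            double score = log2TokensGivenCat[cat] + log2Prior(trainingCounts, cat);
            if (score > bestScore) {
                bestScore = score;
                best = cat;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] likelihoods = {-10.0, -10.5};
        // With equal training counts the likelihood decides.
        System.out.println(bestCategory(likelihoods, new int[] {3, 3}));
        // A much more frequent second category can outweigh a small likelihood gap.
        System.out.println(bestCategory(likelihoods, new int[] {1, 7}));
    }
}
```

The initial count keeps P(cat) nonzero for a category that has not yet received any training instances.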
For this class, the tokens are produced by a tokenizer factory. This tokenizer factory may normalize tokens to stems, to lower case, remove stop words, etc. An extreme example would be to trim the bag to a small set of salient words, as picked out by TF/IDF with categories as documents.
Instances of this class may be compiled and read back into
memory in the same way as other instances of DynamicLMClassifier
using the compileTo() method or
utilities in the class com.aliasi.util.AbstractExternalizable.
After compilation, deserialized instances of naive
Bayes classifiers should be cast to the interface JointClassifier<CharSequence>,
though they may also be cast to
LMClassifier<CompiledTokenizedLM,MultivariateEstimator>; the only
advantage to the latter cast is that you can still retrieve the
multivariate estimator over categories as well as the underlying
language model for each category. These will be compiled instances.
Constructor and Description 

NaiveBayesClassifier(String[] categories,
TokenizerFactory tokenizerFactory)
Construct a naive Bayes classifier with the specified
categories and tokenizer factory.

NaiveBayesClassifier(String[] categories,
TokenizerFactory tokenizerFactory,
int charSmoothingNGram)
Construct a naive Bayes classifier with the specified
categories, tokenizer factory and level of character n-gram for
smoothing token estimates.

NaiveBayesClassifier(String[] categories,
TokenizerFactory tokenizerFactory,
int charSmoothingNGram,
int maxObservedChars)
Construct a naive Bayes classifier with the specified
categories, tokenizer factory and level of character n-gram for
smoothing token estimates, along with a specification of the
total number of characters in test and training instances.

Methods inherited from class DynamicLMClassifier: compileTo, createNGramBoundary, createNGramProcess, createTokenized, handle, resetCategory, train
Methods inherited from class LMClassifier: categories, categoryDistribution, classify, classifyJoint, languageModel
public NaiveBayesClassifier(String[] categories, TokenizerFactory tokenizerFactory)
The character backoff models are assumed to be uniform
and there is no limit on the number of observed characters
other than Character.MAX_VALUE.
Parameters:
categories - Categories into which to classify text.
tokenizerFactory - Text tokenizer.
Throws:
IllegalArgumentException - If there are not at least two categories.

public NaiveBayesClassifier(String[] categories, TokenizerFactory tokenizerFactory, int charSmoothingNGram)
There is no limit on the number of observed characters
other than Character.MAX_VALUE.
Parameters:
categories - Categories into which to classify text.
tokenizerFactory - Text tokenizer.
charSmoothingNGram - Order of character n-gram used to smooth token estimates.
Throws:
IllegalArgumentException - If there are not at least two categories.

public NaiveBayesClassifier(String[] categories, TokenizerFactory tokenizerFactory, int charSmoothingNGram, int maxObservedChars)
As noted in the class documentation above, setting the max observed characters parameter to one effectively eliminates estimates of the string of an unknown token.
Parameters:
categories - Categories into which to classify text.
tokenizerFactory - Text tokenizer.
charSmoothingNGram - Order of character n-gram used to smooth token estimates.
maxObservedChars - The maximum number of characters found in the text of training and test sets.
Throws:
IllegalArgumentException - If there are not at least two categories or if the number of observed characters is less than 1 or more than the total number of characters.