public class TokenizedLM extends Object implements LanguageModel.Dynamic, LanguageModel.Sequence, LanguageModel.Tokenized, ObjectHandler<CharSequence>
A TokenizedLM provides a dynamic sequence language model which models token sequences with an n-gram model, and whitespace and unknown tokens with their own sequence language models.
A tokenized language model factors the probability assigned to a character sequence as follows:
P(cs) = P_{tok}(toks(cs)) \cdot \prod_{t \in unknownToks(cs)} P_{unk}(t) \cdot \prod_{w \in whitespaces(cs)} P_{whsp}(w)
where P_{tok} is the token model estimate and toks(cs) replaces known tokens with their integer identifiers, replaces unknown tokens with -1, and adds boundary symbols -2 front and back (the same adjustment is used to remove the initial boundary estimate as in NGramBoundaryLM); P_{unk} is the unknown token sequence language model and unknownToks(cs) is the list of unknown tokens in the input (with duplicates); and P_{whsp} is the whitespace sequence language model and whitespaces(cs) is the list of whitespaces in the character sequence (with duplicates).
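Because the factorization is a product of probabilities, it becomes a sum in log (base 2) space. The following is a minimal sketch of that arithmetic only; the class `FactoredEstimateSketch` and its method are hypothetical helpers written for illustration and are not part of the documented API.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of the factorization; not the library implementation. */
final class FactoredEstimateSketch {

    /**
     * Sums the three log2 components: the token sequence estimate,
     * one term per unknown token occurrence, and one term per whitespace.
     */
    static double log2Estimate(double tokenSeqLog2Prob,
                               List<Double> unknownTokenLog2Probs,
                               List<Double> whitespaceLog2Probs) {
        double sum = tokenSeqLog2Prob;
        for (double lp : unknownTokenLog2Probs) {
            sum += lp; // duplicates contribute once per occurrence
        }
        for (double lp : whitespaceLog2Probs) {
            sum += lp;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up component values, chosen only to show the arithmetic.
        double lp = log2Estimate(-10.0,
                                 Arrays.asList(-20.0),
                                 Arrays.asList(-1.0, -1.0, -1.0));
        System.out.println(lp);
    }
}
```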
The token n-gram model itself uses the same method of counting and smoothing as described in the class documentation for NGramProcessLM. Like NGramBoundaryLM, boundary tokens are inserted before and after other tokens. And like the n-gram character boundary model, the initial boundary estimate is subtracted from the overall estimate for normalization purposes.
Tokens are all converted to integer identifiers using an internal dynamic symbol table. All symbols in symbol tables get non-negative identifiers; the negative value -1 is used for the unknown token in models, just as in symbol tables. The value -2 is used for the boundary marker in the counters.
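The identifier mapping described above can be sketched as follows. This is an illustration under stated assumptions: a plain `Map` stands in for the internal symbol table, and the class `TokenIdSketch` and its method `toIds` are hypothetical names, not part of the library.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the toks(cs) mapping; a Map stands in for the symbol table. */
final class TokenIdSketch {
    static final int UNKNOWN_TOKEN = -1;   // identifier for tokens not in the table
    static final int BOUNDARY_TOKEN = -2;  // identifier added front and back

    static int[] toIds(String[] tokens, Map<String, Integer> symbolTable) {
        int[] ids = new int[tokens.length + 2];
        ids[0] = BOUNDARY_TOKEN;
        for (int i = 0; i < tokens.length; ++i) {
            Integer id = symbolTable.get(tokens[i]);
            ids[i + 1] = (id == null) ? UNKNOWN_TOKEN : id;
        }
        ids[ids.length - 1] = BOUNDARY_TOKEN;
        return ids;
    }

    public static void main(String[] args) {
        Map<String, Integer> table = new HashMap<>();
        table.put("the", 0);
        table.put("cat", 1);
        int[] ids = toIds(new String[] { "the", "zyzzyva", "cat" }, table);
        System.out.println(Arrays.toString(ids)); // prints [-2, 0, -1, 1, -2]
    }
}
```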
In order for all estimates to be nonzero, the integer sequence counter used to back the token model is initialized with a count of 1 for the end-of-stream identifier (-2). The unknown token count for any context is taken to be the number of outcomes in that context. Because unknowns are estimated directly in this manner, there is no need to interpolate the unigram model with a uniform model for unknown outcomes. Instead, the occurrence of an unknown is modeled directly and its identity is modeled by the unknown token language model.
In order to produce a properly normalized sequence model, the tokens and whitespaces returned by the tokenizer should concatenate together to produce the original input. Note that this condition is not checked at runtime. Sequences may, however, be normalized before being trained and evaluated for a language model. For instance, all alphabetic characters might be reduced to lower case, all punctuation characters removed, and all non-empty sequences of whitespace reduced to a single space character. A language model may then be defined over this normalized space of input, not the original space (and may thus use a reduced number of characters for its uniform estimates). Although this normalization may be carried out by a tokenizer in practice, for instance for use in a tokenized classifier, such a normalization is consistent with the interface specification for LanguageModel.Sequence or LanguageModel.Dynamic only if done on the outside.
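A normalization of the kind described above, applied outside the model before training and estimation, might look like the sketch below. The class `NormalizeSketch` and its `normalize` method are hypothetical; the particular rules (ASCII punctuation via `\p{Punct}`, locale-independent lower-casing) are assumptions chosen for illustration.

```java
import java.util.Locale;

/** Hypothetical pre-processing step applied before train(...) and log2Estimate(...). */
final class NormalizeSketch {

    static String normalize(CharSequence in) {
        String s = in.toString().toLowerCase(Locale.ROOT); // reduce to lower case
        s = s.replaceAll("\\p{Punct}", "");                // remove ASCII punctuation
        s = s.replaceAll("\\s+", " ");                     // collapse whitespace runs to one space
        return s;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Hello,   World!\tBye.")); // prints hello world bye
    }
}
```

The same normalization would need to be applied to every sequence passed to the model, for training and for estimation alike, so that the model is defined over the normalized space only.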
LanguageModel.Conditional, LanguageModel.Dynamic, LanguageModel.Process, LanguageModel.Sequence, LanguageModel.Tokenized
Modifier and Type  Field and Description 

static int 
BOUNDARY_TOKEN
The symbol used for boundaries in the counter, -2.

static int 
UNKNOWN_TOKEN
The symbol used for unknown symbol IDs.

Constructor and Description 

TokenizedLM(TokenizerFactory factory,
int nGramOrder)
Constructs a tokenized language model with the specified
tokenization factory and n-gram order (see warnings below on
where this simple constructor may be used).

TokenizedLM(TokenizerFactory tokenizerFactory,
int nGramOrder,
LanguageModel.Sequence unknownTokenModel,
LanguageModel.Sequence whitespaceModel,
double lambdaFactor)
Constructs a tokenized language model with the specified
tokenization factory and n-gram order, sequence models for
unknown tokens and whitespace, and an interpolation
hyperparameter.

TokenizedLM(TokenizerFactory tokenizerFactory,
int nGramOrder,
LanguageModel.Sequence unknownTokenModel,
LanguageModel.Sequence whitespaceModel,
double lambdaFactor,
boolean initialIncrementBoundary)
Constructs a tokenized language model with the specified
tokenization factory and n-gram order, sequence models for
unknown tokens and whitespace, and an interpolation
hyperparameter, as well as a flag indicating whether to
automatically increment a null input to avoid numerical
problems with zero counts.

Modifier and Type  Method and Description 

double 
chiSquaredIndependence(int[] nGram)
Returns the maximum value of Pearson's chi-squared (χ²)
independence test statistic resulting from splitting the
specified n-gram in half to derive a contingency matrix.

SortedSet<ScoredObject<String[]>> 
collocationSet(int nGram,
int minCount,
int maxReturned)
Returns a sorted set of collocations in order of confidence that
their token sequences are not independent.

void 
compileTo(ObjectOutput objOut)
Writes a compiled version of this tokenized language model to
the specified object output.

SortedSet<ScoredObject<String[]>> 
frequentTermSet(int nGram,
int maxReturned)
Returns the most frequent n-gram terms in the training data up
to the specified maximum number.

void 
handle(CharSequence cs)
Trains the language model on the specified character sequence.

void 
handleNGrams(int nGramLength,
int minCount,
ObjectHandler<String[]> handler)
Visits the n-grams of the specified length with at least the specified
minimum count stored in the underlying counter of this
tokenized language model and passes them to the specified handler.

SortedSet<ScoredObject<String[]>> 
infrequentTermSet(int nGram,
int maxReturned)
Returns the least frequent n-gram terms in the training data up
to the specified maximum number.

double 
lambdaFactor()
Returns the interpolation ratio, or lambda factor,
for interpolating in this tokenized language model.

double 
log2Estimate(char[] cs,
int start,
int end)
Returns an estimate of the log (base 2) probability of the
specified character slice.

double 
log2Estimate(CharSequence cSeq)
Returns an estimate of the log (base 2) probability of the
specified character sequence.

SortedSet<ScoredObject<String[]>> 
newTermSet(int nGram,
int minCount,
int maxReturned,
LanguageModel.Tokenized backgroundLM)
Returns a sorted set of scored n-grams ordered by the significance
of the degree to which their counts in this model exceed their
expected counts in a specified background model.

int 
nGramOrder()
Returns the order of the token n-gram model underlying this
tokenized language model.

SortedSet<ScoredObject<String[]>> 
oldTermSet(int nGram,
int minCount,
int maxReturned,
LanguageModel.Tokenized backgroundLM)
Returns a sorted set of scored n-grams ordered in reverse order
of significance with respect to the background model.

double 
processLog2Probability(String[] tokens)
Returns the probability of the specified tokens in the
underlying token n-gram distribution.

TrieIntSeqCounter 
sequenceCounter()
Returns the integer sequence counter underlying this model.

SymbolTable 
symbolTable()
Returns the symbol table underlying this tokenized language
model's token n-gram model.

TokenizerFactory 
tokenizerFactory()
Returns the tokenizer factory for this tokenized language
model.

double 
tokenLog2Probability(String[] tokens,
int start,
int end)
Returns the log (base 2) probability of the specified
token slice in the underlying token n-gram distribution.

double 
tokenLog2ProbCharSmooth(String[] tokens,
int start,
int end) 
double 
tokenLog2ProbCharSmoothNoBounds(String[] tokens,
int start,
int end) 
double 
tokenProbability(String[] tokens,
int start,
int end)
Returns the probability of the specified token slice in the
token n-gram distribution.

double 
tokenProbCharSmooth(String[] tokens,
int start,
int end) 
double 
tokenProbCharSmoothNoBounds(String[] tokens,
int start,
int end) 
String 
toString()
Returns a string-based representation of the token
counts for this language model.

void 
train(char[] cs,
int start,
int end)
Trains the token sequence model, whitespace model (if dynamic) and
unknown token model (if dynamic).

void 
train(char[] cs,
int start,
int end,
int count)
Trains the token sequence model, whitespace model (if dynamic) and
unknown token model (if dynamic).

void 
train(CharSequence cSeq)
Trains the token sequence model, whitespace model (if dynamic) and
unknown token model (if dynamic).

void 
train(CharSequence cSeq,
int count)
Trains the token sequence model, whitespace model (if dynamic) and
unknown token model (if dynamic) with the specified count number
of instances.

void 
trainSequence(CharSequence cSeq,
int count)
This method increments the count of the entire sequence
specified.

LanguageModel.Sequence 
unknownTokenLM()
Returns the unknown token sequence language model for this
tokenized language model.

LanguageModel.Sequence 
whitespaceLM()
Returns the whitespace language model for this tokenized
language model.

double 
z(int[] nGram,
int nGramSampleCount,
int totalSampleCount)
Returns the z-score of the specified n-gram with the specified
count out of a total sample count, as measured against the
expectation of this tokenized language model.

public static final int UNKNOWN_TOKEN
public static final int BOUNDARY_TOKEN
public TokenizedLM(TokenizerFactory factory, int nGramOrder)
The unknown token and whitespace models are both uniform sequence language models with default parameters as described in the documentation for the constructor UniformBoundaryLM.UniformBoundaryLM(). The default interpolation hyperparameter is equal to the n-gram order.
Warning: This constructor is probably only useful if the tokenized LM is used solely to store character n-grams. Because it uses flat constant uniform language models for smoothing tokens and whitespaces, it will provide very high entropy estimates for unseen text. The other constructors allow smoothing LMs to be supplied (which will take up more space to estimate, but will provide more reasonable estimates).
Parameters:
factory - Tokenizer factory for the model.
nGramOrder - N-gram order.
Throws:
IllegalArgumentException - If the n-gram order is less than 0.

public TokenizedLM(TokenizerFactory tokenizerFactory, int nGramOrder, LanguageModel.Sequence unknownTokenModel, LanguageModel.Sequence whitespaceModel, double lambdaFactor)
In order for this model to be serializable, the unknown token and whitespace models should be serializable. If they are not, a runtime exception will be thrown when attempting to serialize this model. If these models implement LanguageModel.Dynamic, they will be trained by calls to the training method.
Parameters:
tokenizerFactory - Tokenizer factory for the model.
nGramOrder - Length of maximum n-gram for model.
unknownTokenModel - Sequence model for unknown tokens.
whitespaceModel - Sequence model for all whitespace.
lambdaFactor - Value of the interpolation hyperparameter.
Throws:
IllegalArgumentException - If the n-gram order is less than 1 or the interpolation is not a non-negative number.

public TokenizedLM(TokenizerFactory tokenizerFactory, int nGramOrder, LanguageModel.Sequence unknownTokenModel, LanguageModel.Sequence whitespaceModel, double lambdaFactor, boolean initialIncrementBoundary)
In order for this model to be serializable, the unknown token and whitespace models should be serializable. If they are not, a runtime exception will be thrown when attempting to serialize this model. If these models implement LanguageModel.Dynamic, they will be trained by calls to the training method.
Parameters:
tokenizerFactory - Tokenizer factory for the model.
nGramOrder - Length of maximum n-gram for model.
unknownTokenModel - Sequence model for unknown tokens.
whitespaceModel - Sequence model for all whitespace.
lambdaFactor - Value of the interpolation hyperparameter.
initialIncrementBoundary - Flag indicating whether or not to increment the subsequence { BOUNDARY_TOKEN } automatically after construction to avoid NaN error states.
Throws:
IllegalArgumentException - If the n-gram order is less than 1 or the interpolation is not a non-negative number.

public double lambdaFactor()
public TrieIntSeqCounter sequenceCounter()
Symbols in this counter are the integer identifiers assigned by the symbol table returned by symbolTable(). Changes to this counter affect this tokenized language model.

public SymbolTable symbolTable()
public int nGramOrder()
public TokenizerFactory tokenizerFactory()
public LanguageModel.Sequence unknownTokenLM()
public LanguageModel.Sequence whitespaceLM()
public void compileTo(ObjectOutput objOut) throws IOException
The compiled form corresponds to CompiledTokenizedLM.
Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which a compiled version of this model is written.
Throws:
IOException - If there is an I/O error writing the output.

public void handleNGrams(int nGramLength, int minCount, ObjectHandler<String[]> handler)
Parameters:
nGramLength - Length of n-grams visited.
minCount - Minimum count of a visited n-gram.
handler - Handler whose handle method is called for each visited n-gram.

public void train(CharSequence cSeq)
Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cSeq - Character sequence to train.

public void train(CharSequence cSeq, int count)
Calling train(cs,n) is equivalent to calling train(cs) a total of n times.
Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cSeq - Character sequence to train.
count - Number of instances to train.
Throws:
IllegalArgumentException - If the count is not positive.

public void train(char[] cs, int start, int end)
Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one plus last character in slice.
Throws:
IndexOutOfBoundsException - If the indices are out of range for the character array.

public void handle(CharSequence cs)
This method delegates to the train(CharSequence,int) method.
This method implements the ObjectHandler<CharSequence> interface.
Specified by:
handle in interface ObjectHandler<CharSequence>
Parameters:
cs - Object to be handled.

public void train(char[] cs, int start, int end, int count)
Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one plus last character in slice.
count - Number of instances of sequence to train.
Throws:
IndexOutOfBoundsException - If the indices are out of range for the character array.
IllegalArgumentException - If the count is negative.

public void trainSequence(CharSequence cSeq, int count)
This method may be used to train a tokenized language model from individual character sequence counts. Because the token smoothing models are not implemented for this method, a pure token model may be constructed by calling train(CharSequence,int), rather than this method, for character sequences corresponding to unigrams in order to train token smoothing with character subsequences.
For instance, with com.aliasi.tokenizer.IndoEuropeanTokenizerFactory, calling trainSequence("the fast computer",5) would extract three tokens, the, fast and computer, and would increment the count of the three-token sequence, but not any of its subsequences.
If the number of tokens is longer than the maximum n-gram length, only the final tokens are trained. For instance, with an n-gram length of 2, and the Indo-European tokenizer factory, calling trainSequence("a slightly faster computer",93) is equivalent to calling trainSequence("faster computer",93).
All tokens trained are added to the symbol table. This does not include any initial tokens that are not used because the maximum n-gram length is too short.
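The truncation rule for over-long sequences can be sketched as follows. This is an illustration only: splitting on single spaces is a stand-in for the Indo-European tokenizer, and the class `FinalNGramSketch` and method `finalTokens` are hypothetical names.

```java
import java.util.Arrays;

/** Hypothetical sketch of keeping only the final maxNGram tokens of a sequence. */
final class FinalNGramSketch {

    /** Returns the last maxNGram tokens, or all tokens if there are fewer. */
    static String[] finalTokens(String[] tokens, int maxNGram) {
        int start = Math.max(0, tokens.length - maxNGram);
        return Arrays.copyOfRange(tokens, start, tokens.length);
    }

    public static void main(String[] args) {
        // Whitespace splitting stands in for a real tokenizer here.
        String[] tokens = "a slightly faster computer".split(" ");
        System.out.println(Arrays.toString(finalTokens(tokens, 2))); // prints [faster, computer]
    }
}
```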
Parameters:
cSeq - Character sequence to train.
count - Number of instances to train.
Throws:
IllegalArgumentException - If the count is negative.

public double log2Estimate(CharSequence cSeq)
Description copied from interface: LanguageModel
Specified by:
log2Estimate in interface LanguageModel
Parameters:
cSeq - Character sequence to estimate.

public double log2Estimate(char[] cs, int start, int end)
Description copied from interface: LanguageModel
Specified by:
log2Estimate in interface LanguageModel
Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus index of last character in slice.

public double tokenProbability(String[] tokens, int start, int end)
Description copied from interface: LanguageModel.Tokenized
Specified by:
tokenProbability in interface LanguageModel.Tokenized
Parameters:
tokens - Underlying array of tokens.
start - Index of first token in slice.
end - Index of one past the last token in the slice.

public double tokenLog2ProbCharSmooth(String[] tokens, int start, int end)
public double tokenProbCharSmooth(String[] tokens, int start, int end)
public double tokenLog2ProbCharSmoothNoBounds(String[] tokens, int start, int end)
public double tokenProbCharSmoothNoBounds(String[] tokens, int start, int end)
public double tokenLog2Probability(String[] tokens, int start, int end)
Description copied from interface: LanguageModel.Tokenized
Specified by:
tokenLog2Probability in interface LanguageModel.Tokenized
Parameters:
tokens - Underlying array of tokens.
start - Index of first token in slice.
end - Index of one past the last token in the slice.

public double processLog2Probability(String[] tokens)
Parameters:
tokens - Tokens whose probability is returned.

public SortedSet<ScoredObject<String[]>> collocationSet(int nGram, int minCount, int maxReturned)
Each returned scored object wraps a String[] containing tokens. The n-gram length, the minimum count for a result, and the maximum number of results returned are all specified. The confidence ordering is based on the result of Pearson's chi-squared (χ²) independence statistic as computed by chiSquaredIndependence(int[]).
Parameters:
nGram - Length of n-grams to search for collocations.
minCount - Minimum count for a returned n-gram.
maxReturned - Maximum number of results returned.

public SortedSet<ScoredObject<String[]>> newTermSet(int nGram, int minCount, int maxReturned, LanguageModel.Tokenized backgroundLM)
The results are ScoredObject instances whose objects are terms represented as string arrays and whose scores are the collocation score for the term. For instance, the new terms may be printed in order of significance by:
    // lm is assumed to be this TokenizedLM and bgLM the background model
    SortedSet<ScoredObject<String[]>> terms = lm.newTermSet(3, 5, 100, bgLM);
    for (ScoredObject<String[]> scored : terms) {
        String[] term = scored.getObject();
        double score = scored.score();
        ...
    }
The exact scoring used is the z-score as defined in BinomialDistribution.z(double,int,int) with the success probability defined by the n-gram's probability estimate in the background model, the number of successes being the count of the n-gram in this model and the number of trials being the total count in this model.
See oldTermSet(int,int,int,LanguageModel.Tokenized) for a method that returns the least significant terms in this model relative to a background model.
Parameters:
nGram - Length of n-grams to search for significant new terms.
minCount - Minimum count for a returned n-gram.
maxReturned - Maximum number of results returned.
backgroundLM - Background language model against which significance is measured.

public SortedSet<ScoredObject<String[]>> oldTermSet(int nGram, int minCount, int maxReturned, LanguageModel.Tokenized backgroundLM)
Note that only terms that exist in the foreground model are considered. By contrast, reversing the roles of the models in the sister method newTermSet(int,int,int,LanguageModel.Tokenized) considers every n-gram in the background model and may return slightly different results.
Parameters:
nGram - Length of n-grams to search for significant old terms.
minCount - Minimum count in background model for a returned n-gram.
maxReturned - Maximum number of results returned.
backgroundLM - Background language model from which counts are derived.

public SortedSet<ScoredObject<String[]>> frequentTermSet(int nGram, int maxReturned)
See infrequentTermSet(int,int) to retrieve the most infrequent terms.
Parameters:
nGram - Length of n-grams to search.
maxReturned - Maximum number of results returned.

public SortedSet<ScoredObject<String[]>> infrequentTermSet(int nGram, int maxReturned)
See frequentTermSet(int,int) to retrieve the most frequent terms.
Parameters:
nGram - Length of n-grams to search.
maxReturned - Maximum number of results returned.

public double chiSquaredIndependence(int[] nGram)
The input n-gram is split into two halves, Term_{1} and Term_{2}, each of which is a non-empty sequence of integers. Term_{1} consists of the tokens indexed 0 to mid-1 and Term_{2} of those indexed mid to end-1.
The contingency matrix for computing the independence statistic is:

              +Term_{2}    -Term_{2}
  +Term_{1}   Term(+,+)    Term(+,-)
  -Term_{1}   Term(-,+)    Term(-,-)

where the values for a specified integer sequence nGram and midpoint 0 < mid < end are:

  Term(+,+) = count(nGram,0,end)
  Term(+,-) = count(nGram,0,mid) - count(nGram,0,end)
  Term(-,+) = count(nGram,mid,end) - count(nGram,0,end)
  Term(-,-) = totalCount - Term(+,+) - Term(+,-) - Term(-,+)

Note that using the overall total count provides a slight over-approximation of the count of appropriate-length n-grams.
For further information on the independence test, see the documentation for Statistics.chiSquaredIndependence(double,double,double,double).
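As a worked illustration of the contingency matrix, the sketch below derives the four cells from raw counts and computes the textbook Pearson chi-squared statistic for a 2x2 table without continuity correction. The class `ContingencySketch` and its methods are hypothetical; whether the library's Statistics method applies exactly this form is an assumption, and the counts in `main` are made up.

```java
/** Hypothetical sketch: 2x2 contingency cells and the textbook Pearson statistic. */
final class ContingencySketch {

    /** Returns { Term(+,+), Term(+,-), Term(-,+), Term(-,-) } from raw counts. */
    static double[] cells(long wholeCount, long firstHalfCount,
                          long secondHalfCount, long totalCount) {
        double pp = wholeCount;                    // both halves together
        double pm = firstHalfCount - wholeCount;   // first half without second
        double mp = secondHalfCount - wholeCount;  // second half without first
        double mm = totalCount - pp - pm - mp;     // neither half
        return new double[] { pp, pm, mp, mm };
    }

    /** Pearson chi-squared for a 2x2 table, no continuity correction. */
    static double chiSquared(double pp, double pm, double mp, double mm) {
        double n = pp + pm + mp + mm;
        double diff = pp * mm - pm * mp;
        double denominator = (pp + pm) * (mp + mm) * (pp + mp) * (pm + mm);
        return n * diff * diff / denominator;
    }

    public static void main(String[] args) {
        // Made-up counts: whole n-gram 10, first half 20, second half 30, total 100.
        double[] c = cells(10, 20, 30, 100);
        System.out.println(chiSquared(c[0], c[1], c[2], c[3]));
    }
}
```

With the made-up counts above the cells are 10, 10, 20 and 60, and the statistic works out to 100/21, roughly 4.76.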
Parameters:
nGram - Array of integers whose independence statistic is returned.
Throws:
IllegalArgumentException - If the specified n-gram is not at least two elements long.

public double z(int[] nGram, int nGramSampleCount, int totalSampleCount)
Formulas for z-scores and an explanation of their scaling by deviation are given in the documentation for the static method BinomialDistribution.z(double,int,int).
Parameters:
nGram - The n-gram to test.
nGramSampleCount - The number of observations of the n-gram in the sample.
totalSampleCount - The total number of samples.
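For orientation, the standard binomial z-score measures how many standard deviations the observed number of successes lies from its expectation. The sketch below shows that textbook formula; the class `BinomialZSketch` is hypothetical, and it is an assumption that the referenced BinomialDistribution method uses exactly this form.

```java
/** Hypothetical sketch of the textbook binomial z-score. */
final class BinomialZSketch {

    /**
     * z = (successes - n*p) / sqrt(n*p*(1-p)), where p is the success
     * probability and n the number of trials.
     */
    static double z(double successProbability, int numSuccesses, int numTrials) {
        double expected = numTrials * successProbability;
        double deviation = Math.sqrt(numTrials * successProbability
                                     * (1.0 - successProbability));
        return (numSuccesses - expected) / deviation;
    }

    public static void main(String[] args) {
        // 60 successes in 100 trials at p = 0.5: expectation 50, deviation 5.
        System.out.println(z(0.5, 60, 100)); // prints 2.0
    }
}
```

In the setting of this class, the success probability would be the n-gram's estimate under the model, the successes the n-gram's sample count, and the trials the total sample count.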