public class NGramProcessLM extends Object implements Model<CharSequence>, LanguageModel.Process, LanguageModel.Conditional, LanguageModel.Dynamic, ObjectHandler<CharSequence>, Serializable
NGramProcessLM provides a dynamic conditional process language model for which training, estimation, and pruning may be interleaved. A process language model normalizes probabilities for a given length of input.
The model may be compiled to an object output stream; the compiled model read back in will be an instance of CompiledNGramProcessLM.
This class implements a generative language model based on the
chain rule, as specified by LanguageModel.Conditional
.
The maximum likelihood estimator (see CharSeqCounter) is smoothed by linear interpolation with the next lower-order context model:

P'(c_{k} | c_{j},...,c_{k-1})
  = lambda(c_{j},...,c_{k-1})
    * P_{ML}(c_{k} | c_{j},...,c_{k-1})
  + (1 - lambda(c_{j},...,c_{k-1}))
    * P'(c_{k} | c_{j+1},...,c_{k-1})
The P_{ML} terms in the above definition are maximum likelihood estimates based on frequency:

P_{ML}(c_{k} | c_{j},...,c_{k-1}) = count(c_{j},...,c_{k-1},c_{k}) / extCount(c_{j},...,c_{k-1})

The count is just the number of times a given string has been seen in the data, whereas extCount is the number of times an extension to the string has been seen in the data:

extCount(c_{1},...,c_{n}) = Σ_{d} count(c_{1},...,c_{n},d)
In the parametric Witten-Bell method, the interpolation ratio lambda is defined based on extensions of the context of estimation to be:

lambda(c_{1},...,c_{n})
  = extCount(c_{1},...,c_{n})
  / (extCount(c_{1},...,c_{n}) + L * numExtensions(c_{1},...,c_{n}))

where c_{1},...,c_{n} is the conditioning context for estimation, extCount is as defined above, and numExtensions is the number of extensions of a context:

numExtensions(c_{1},...,c_{n}) = cardinality({ d | count(c_{1},...,c_{n},d) > 0 })

and where L is a hyperparameter of the distribution (described below).
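To make the definitions above concrete, the following plain-Java sketch computes count, extCount, numExtensions, and the Witten-Bell ratio lambda over a toy map-based substring counter. This is illustrative only: the class and method names mirror the formulas, not the LingPipe API, which stores counts in a TrieCharSeqCounter.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a toy substring counter, not LingPipe's
// TrieCharSeqCounter. Method names mirror the formulas in the text.
class WittenBell {

    final Map<String,Long> counts = new HashMap<>();

    // Increment the counts of all substrings of the training text.
    void train(String text) {
        for (int i = 0; i < text.length(); ++i)
            for (int j = i + 1; j <= text.length(); ++j)
                counts.merge(text.substring(i, j), 1L, Long::sum);
    }

    long count(String s) {
        return counts.getOrDefault(s, 0L);
    }

    // extCount(c1,...,cn) = sum over d of count(c1,...,cn,d)
    long extCount(String context) {
        long sum = 0L;
        for (Map.Entry<String,Long> e : counts.entrySet())
            if (e.getKey().length() == context.length() + 1
                    && e.getKey().startsWith(context))
                sum += e.getValue();
        return sum;
    }

    // numExtensions(c1,...,cn) = number of distinct d with count(c1,...,cn,d) > 0
    long numExtensions(String context) {
        return counts.keySet().stream()
            .filter(k -> k.length() == context.length() + 1
                         && k.startsWith(context))
            .count();
    }

    // lambda = extCount / (extCount + L * numExtensions)
    double lambda(String context, double hyperparamL) {
        long ext = extCount(context);
        return ext / (ext + hyperparamL * numExtensions(context));
    }
}
```

For example, after training on "abab", the context "a" has been extended twice, always by 'b', so with L = 1 its interpolation ratio is 2/(2 + 1·1) = 2/3.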
As a base case, P(c_{k}) is interpolated with the uniform distribution P_{U}, with interpolation defined as usual with the argument to lambda being the empty (i.e., zero-length) sequence:

P(d) = lambda() * P_{ML}(d) + (1 - lambda()) * P_{U}(d)

The uniform distribution P_{U} depends only on the number of possible characters used in training and tests:

P_{U}(c) = 1/alphabetSize

where alphabetSize is the maximum number of distinct characters in this model.
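The base-case interpolation can be sketched in a few lines of plain Java. This is a standalone illustration of the equation above, not the LingPipe implementation; the parameter names are chosen to match the symbols in the text.

```java
// Illustrative sketch of the base-case interpolation:
// P(d) = lambda() * P_ML(d) + (1 - lambda()) * P_U(d)
class BaseCaseEstimate {

    // pMl: maximum likelihood unigram estimate for the character
    // lambda: interpolation ratio for the empty context
    // alphabetSize: maximum number of distinct characters in the model
    static double estimate(double pMl, double lambda, int alphabetSize) {
        double pUniform = 1.0 / alphabetSize;
        return lambda * pMl + (1.0 - lambda) * pUniform;
    }
}
```

With lambda() = 0 (no training data), the estimate reduces to the uniform distribution 1/alphabetSize; as lambda() approaches 1, it approaches the maximum likelihood estimate.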
The free hyperparameter L in the smoothing equation determines the balance between higher-order and lower-order models. A higher value for L gives more of the weight to lower-order contexts. As the amount of data grows against a fixed alphabet of characters, the impact of L is reduced.
In Witten and Bell's original paper, the hyperparameter L was set to 1.0, which is not a particularly good choice for most text sources. A value of the lambda factor that is roughly the length of the longest n-gram seems to be a good rule of thumb.
Methods are provided for computing a sample cross-entropy rate for a character sequence. The sample cross-entropy rate H(c_{1},...,c_{n}; P_{M}) for a sequence c_{1},...,c_{n} in probability model P_{M} is defined to be the average negative log (base 2) probability of the characters in the sequence according to the model. In symbols:

H(c_{1},...,c_{n}; P_{M}) = -log_{2} P_{M}(c_{1},...,c_{n}) / n

The cross-entropy rate of a distribution P' with respect to a distribution P is defined by:

H(P',P) = -Σ_{x} P(x) * log_{2} P'(x)
The Shannon-McMillan-Breiman theorem shows that as the length of the sample drawn from the true distribution P grows, the sample cross-entropy rate approaches the actual cross-entropy rate. In symbols:

H(P,P_{M}) = lim_{n→infinity} H(c_{1},...,c_{n}; P_{M})

The entropy of a distribution P is defined by its cross-entropy against itself, H(P,P). A distribution's entropy is a lower bound on its cross-entropy; in symbols, H(P',P) >= H(P,P) for all distributions P'.
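The sample cross-entropy rate is simple arithmetic once the model's log (base 2) estimate is in hand. The sketch below is plain Java, not the LingPipe API; uniformLog2Estimate is a hypothetical stand-in for a model's log2Estimate under the uniform distribution, used only to make the arithmetic checkable.

```java
class CrossEntropy {

    // Sample cross-entropy rate: H = -log2 P_M(c_1,...,c_n) / n
    static double sampleCrossEntropyRate(double log2Estimate, int length) {
        return -log2Estimate / length;
    }

    // Under a uniform model over an alphabet of k characters, each
    // character has probability 1/k, so a length-n sequence has
    // log2 probability -n * log2(k).
    static double uniformLog2Estimate(int length, int alphabetSize) {
        return -length * (Math.log(alphabetSize) / Math.log(2.0));
    }
}
```

For a uniform model over 4 characters, the sample cross-entropy rate of any sequence is log2(4) = 2 bits per character, matching the intuition that entropy measures the average number of bits needed per symbol.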
Models may be pruned by pruning the underlying substring counter for the language model. This counter is returned by the method substringCounter(). See the class documentation for the return type TrieCharSeqCounter for more information.
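Purely as an illustration of the idea (the actual pruning API is on TrieCharSeqCounter; the map-based counter here is hypothetical), pruning removes every substring count that falls below a minimum threshold, shrinking the model at the cost of discarding rare contexts:

```java
import java.util.Map;

// Illustrative only: pruning a map-based substring counter by removing
// every entry whose count falls below minCount. The real model prunes
// its underlying TrieCharSeqCounter instead.
class CounterPruning {

    static void prune(Map<String,Long> counts, long minCount) {
        counts.entrySet().removeIf(e -> e.getValue() < minCount);
    }
}
```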
Models may be serialized in the usual way by creating an object output object and writing the object:

NGramProcessLM lm = ...;
ObjectOutput out = ...;
out.writeObject(lm);

Reading just reverses the process:

ObjectInput in = ...;
NGramProcessLM lm = (NGramProcessLM) in.readObject();

Serialization is based on the methods writeTo(OutputStream) and readFrom(InputStream). These write compressed forms of the model to streams in binary format.
Warning: The object input and output used for
serialization must extend InputStream
and OutputStream
. The only implementations of ObjectInput
and
ObjectOutput
as of the 1.6 JDK do extend the streams, so
this will only be a problem with customized object input or output
objects. If you need this method to work with custom input and
output objects that do not extend the corresponding streams, drop
us a line and we can perhaps refactor the output methods to remove
this restriction.
For information on the Witten-Bell interpolation method, see:
Witten, Ian H. and Timothy C. Bell. 1991. The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory 37(4).
LanguageModel.Conditional, LanguageModel.Dynamic, LanguageModel.Process, LanguageModel.Sequence, LanguageModel.Tokenized
Constructor and Description 

NGramProcessLM(int maxNGram)
Constructs an n-gram process language model with the specified maximum n-gram length.

NGramProcessLM(int numChars,
double lambdaFactor,
TrieCharSeqCounter counter)
Constructs an n-gram process language model with the specified number of characters, interpolation parameter, and character sequence counter.

NGramProcessLM(int maxNGram,
int numChars)
Constructs an n-gram language process with the specified maximum n-gram length and maximum number of characters.

NGramProcessLM(int maxNGram,
int numChars,
double lambdaFactor)
Constructs an n-gram language process with the specified maximum n-gram length, number of characters, and interpolation ratio hyperparameter.

Modifier and Type  Method and Description 

void 
compileTo(ObjectOutput objOut)
Writes a compiled version of this process language model to the
specified object output.

double 
getLambdaFactor()
Returns the current setting of the interpolation ratio
hyperparameter.

void 
handle(CharSequence cSeq)
Implements the object handler interface over character
sequences for training.

double 
log2ConditionalEstimate(char[] cs,
int start,
int end)
Returns the log (base 2) of the probability estimate for the
conditional probability of the last character in the specified
slice given the previous characters.

double 
log2ConditionalEstimate(char[] cs,
int start,
int end,
int maxNGram,
double lambdaFactor)
Returns the log (base 2) conditional estimate for a specified character slice with a specified maximum n-gram and specified hyperparameter.

double 
log2ConditionalEstimate(CharSequence cSeq)
Returns the log (base 2) of the probability estimate for the conditional probability of the last character in the specified character sequence given the previous characters.

double 
log2ConditionalEstimate(CharSequence cSeq,
int maxNGram,
double lambdaFactor)
Returns the log (base 2) conditional estimate of the last character in the specified character sequence given the previous characters based only on counts of n-grams up to the specified maximum n-gram.

double 
log2Estimate(char[] cs,
int start,
int end)
Returns an estimate of the log (base 2) probability of the
specified character slice.

double 
log2Estimate(CharSequence cSeq)
Returns an estimate of the log (base 2) probability of the
specified character sequence.

double 
log2Prob(CharSequence cSeq)
Returns the log (base 2) of the probability of
the specified object.

int 
maxNGram()
Returns the maximum n-gram length for this model.

char[] 
observedCharacters()
Returns the array of characters that have been observed
for this model.

double 
prob(CharSequence cSeq)
Returns the probability of the specified object.

static NGramProcessLM 
readFrom(InputStream in)
Reads a language model from the specified input stream.

void 
setLambdaFactor(double lambdaFactor)
Sets the value of the interpolation ratio hyperparameter
to the specified value.

void 
setNumChars(int numChars)
Sets the number of characters for this language model.

TrieCharSeqCounter 
substringCounter()
Returns the substring counter for this language model.

String 
toString()
Returns a string-based representation of this language model.

void 
train(char[] cs,
int start,
int end)
Update the model with the training data provided by
the specified character slice.

void 
train(char[] cs,
int start,
int end,
int incr)
Update the model with the training data provided by the specified character slice with the specified count.

void 
train(CharSequence cSeq)
Update the model with the training data provided by the
specified character sequence with a count of one.

void 
train(CharSequence cSeq,
int incr)
Update the model with the training data provided by the
specified character sequence with the specified count.

void 
trainConditional(char[] cs,
int start,
int end,
int condEnd)
Trains the specified conditional outcome(s) of the specified
character slice given the background slice.

void 
writeTo(OutputStream out)
Writes this language model to the specified output stream.

public NGramProcessLM(int maxNGram)
Constructs an n-gram process language model with the specified maximum n-gram length. The number of characters is set to Character.MAX_VALUE and the interpolation parameter is set to the default of being equal to the n-gram length.
maxNGram - Maximum length n-gram for which counts are stored.

public NGramProcessLM(int maxNGram, int numChars)
maxNGram - Maximum length n-gram for which counts are stored.
numChars - Maximum number of characters in training data.
IllegalArgumentException - If the maximum n-gram is less than 1 or if the number of characters is not between 1 and the maximum number of characters.

public NGramProcessLM(int maxNGram, int numChars, double lambdaFactor)
maxNGram - Maximum length n-gram for which counts are stored.
numChars - Maximum number of characters in training data.
lambdaFactor - Central value of interpolation hyperparameter explored.
IllegalArgumentException - If the maximum n-gram is less than 1, the number of characters is not between 1 and the maximum number of characters, or if the lambda factor is not greater than or equal to 0.

public NGramProcessLM(int numChars, double lambdaFactor, TrieCharSeqCounter counter)
The counter argument allows serialized counters to be read back in and used to create an n-gram process LM.
numChars - Maximum number of characters in training and test data.
lambdaFactor - Interpolation parameter (see class doc).
counter - Character sequence counter to use.
IllegalArgumentException - If the number of characters is not between 1 and the maximum number of characters, or if the lambda factor is not greater than or equal to 0.

public void writeTo(OutputStream out) throws IOException
A language model is written using a BitOutput wrapped around the specified output stream. This bit output is used to delta encode the maximum n-gram, number of characters, lambda factor times 1,000,000, and then the underlying sequence counter using TrieCharSeqCounter.writeCounter(CharSeqCounter,TrieWriter,int). The bit output is flushed, but the output stream is not closed.
A language model can be read and written using the following code, given a file f:

NGramProcessLM lm = ...;
File f = ...;
OutputStream out = new FileOutputStream(f);
BufferedOutputStream bufOut = new BufferedOutputStream(out);
lm.writeTo(bufOut);
bufOut.close();
...
InputStream in = new FileInputStream(f);
BufferedInputStream bufIn = new BufferedInputStream(in);
NGramProcessLM lm2 = NGramProcessLM.readFrom(bufIn);
bufIn.close();

out - Output stream to which to write language model.
IOException - If there is an underlying I/O error.

public static NGramProcessLM readFrom(InputStream in) throws IOException
See writeTo(OutputStream)
for information on the
binary I/O format.
in - Input stream from which to read a language model.
IOException - If there is an underlying I/O error.

public double log2Prob(CharSequence cSeq)
log2Prob in interface Model<CharSequence>
cSeq - The object whose probability is returned.

public double prob(CharSequence cSeq)
prob in interface Model<CharSequence>
cSeq - The object whose probability is returned.

public final double log2Estimate(CharSequence cSeq)
log2Estimate in interface LanguageModel
cSeq - Character sequence to estimate.

public final double log2Estimate(char[] cs, int start, int end)
log2Estimate in interface LanguageModel
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus index of last character in slice.

public void train(CharSequence cSeq)
train in interface LanguageModel.Dynamic
cSeq - The character sequence to use as training data.

public void train(CharSequence cSeq, int incr)
Calling train(cs,n) is equivalent to calling train(cs) a total of n times.
train in interface LanguageModel.Dynamic
cSeq - The character sequence to use as training data.
incr - Number of instances to train.

public void train(char[] cs, int start, int end)
train in interface LanguageModel.Dynamic
cs - The underlying character array for the slice.
start - Index of first character in the slice.
end - Index of one plus the last character in the training slice.

public void train(char[] cs, int start, int end, int incr)
Calling train(cs,n) is equivalent to calling train(cs) a total of n times. Update the model with the training data provided by the specified character slice.
train in interface LanguageModel.Dynamic
cs - The underlying character array for the slice.
start - Index of first character in the slice.
end - Index of one plus the last character in the training slice.
incr - Number of instances to train.

public void handle(CharSequence cSeq)
Implements the object handler interface over character sequences for training; delegates to train(CharSequence).
handle in interface ObjectHandler<CharSequence>
cSeq - Character sequence on which to train.

public void trainConditional(char[] cs, int start, int end, int condEnd)
This method is just shorthand for incrementing the counts of all substrings of cs from position start to end-1 inclusive, then decrementing all of the counts of substrings from position start to condEnd-1. For instance, if cs is "abcde".toCharArray(), then calling trainConditional(cs,0,5,2) will increment the counts of cde given ab, but will not increment the counts of ab directly. This increases the following probabilities:

P('e' | "abcd"), P('e' | "bcd"), P('e' | "cd"), P('e' | "d"), P('e' | "")
P('d' | "abc"), P('d' | "bc"), P('d' | "c"), P('d' | "")
P('c' | "ab"), P('c' | "b"), P('c' | "")

but does not increase the following probabilities:

P('b' | "a"), P('b' | ""), P('a' | "")
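The increment-then-decrement bookkeeping described above can be sketched with a toy map-based counter. This is illustrative only: the real method operates on the model's trie counter and bounds substrings by the maximum n-gram length, which this hypothetical sketch does not.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of trainConditional's bookkeeping: increment the
// counts of all substrings of cs[start..end-1], then decrement the
// counts of all substrings of cs[start..condEnd-1]. The net effect is
// that only counts involving a character at or after condEnd increase.
class ConditionalTraining {

    final Map<String,Long> counts = new HashMap<>();

    void addSubstrings(char[] cs, int start, int end, long incr) {
        for (int i = start; i < end; ++i)
            for (int j = i + 1; j <= end; ++j)
                counts.merge(new String(cs, i, j - i), incr, Long::sum);
    }

    void trainConditional(char[] cs, int start, int end, int condEnd) {
        addSubstrings(cs, start, end, 1L);      // increment the full slice
        addSubstrings(cs, start, condEnd, -1L); // decrement the conditional prefix
    }

    long count(String s) {
        return counts.getOrDefault(s, 0L);
    }
}
```

Running the "abcde" example from the text through this sketch, counts of "abc" and "cde" end up incremented, while counts of "a", "b", and "ab" net out to zero, matching the lists of probabilities above.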
cs - Array of characters.
start - Start position for slice.
end - One past end position for slice.
condEnd - One past the end of the conditional portion of the slice.

public char[] observedCharacters()
observedCharacters in interface LanguageModel.Conditional
public void compileTo(ObjectOutput objOut) throws IOException
The object written will be an instance of CompiledNGramProcessLM
. It may be read in by casting the
result of ObjectInput.readObject()
.
Compilation is time consuming, because it must traverse the entire trie structure, and for each node, estimate its log probability and if it is internal, its log interpolation value. Given that time taken is proportional to the size of the trie, pruning first may greatly speed up this operation and reduce the size of the compiled object that is written.
compileTo in interface Compilable
objOut - Object output to which a compiled version of this language model will be written.
IOException - If there is an I/O exception writing the compiled object.

public double log2ConditionalEstimate(CharSequence cSeq)
log2ConditionalEstimate in interface LanguageModel.Conditional
cSeq - Character sequence to estimate.

public double log2ConditionalEstimate(char[] cs, int start, int end)
log2ConditionalEstimate in interface LanguageModel.Conditional
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus the index of the last character in the slice.

public TrieCharSeqCounter substringCounter()
public int maxNGram()
public double log2ConditionalEstimate(CharSequence cSeq, int maxNGram, double lambdaFactor)
cSeq - Character sequence to estimate.
maxNGram - Maximum length of n-gram count to use for estimate.
lambdaFactor - Value of interpolation hyperparameter for this estimate.
IllegalArgumentException - If the character sequence is not at least one character long.

public double log2ConditionalEstimate(char[] cs, int start, int end, int maxNGram, double lambdaFactor)
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one past last character in slice.
maxNGram - Maximum length of n-gram to use in estimates.
lambdaFactor - Value of interpolation hyperparameter.
IndexOutOfBoundsException - If the start index and end index minus one are out of range of the character array or if the character slice is less than one character long.

public double getLambdaFactor()
public final void setLambdaFactor(double lambdaFactor)
lambdaFactor - New value for interpolation ratio hyperparameter.
IllegalArgumentException - If the value is not greater than or equal to zero.

public final void setNumChars(int numChars)
numChars - New number of characters for this language model.
IllegalArgumentException - If the number of characters is less than 0 or more than Character.MAX_VALUE.