public class CompiledNGramProcessLM extends Object implements LanguageModel.Process, LanguageModel.Conditional, Model<CharSequence>
CompiledNGramProcessLM implements a conditional process language model. Instances are constructed by reading a serialized version of an instance of NGramProcessLM through data input.
Compiled models contain precomputed estimates and smoothing parameters. For instance, consider a 3-gram process language model trained on the string "abracadabra", which yields the following counts:
count("")=11, count("a")=5, count("ab")=2, count("abr")=2, count("br")=2, count("bra")=2, count("c")=1, count("ca")=1, count("cad")=1, count("da")=1, count("dab")=1, count("r")=2, count("ra")=2, and count("rac")=1.
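These counts can be checked with a short sketch. The `countNGrams` helper below is hypothetical (it is not part of this class's API); it mirrors how an n-gram process language model accumulates counts, with the root count count("") equal to the total number of characters.

```java
import java.util.HashMap;
import java.util.Map;

public class NGramCounts {
    // Each character position contributes one count to the root ("")
    // and one count for every n-gram (length 1..maxN) ending there.
    static Map<String, Integer> countNGrams(String text, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < text.length(); i++) {
            counts.merge("", 1, Integer::sum);
            for (int len = 1; len <= maxN && len <= i + 1; len++) {
                counts.merge(text.substring(i + 1 - len, i + 1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countNGrams("abracadabra", 3);
        System.out.println(counts.get(""));    // 11
        System.out.println(counts.get("a"));   // 5
        System.out.println(counts.get("bra")); // 2
    }
}
```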
Assuming a lambda factor hyperparameter set to 4.0, the compiled
model consists of the following five parallel arrays:
The actual data is contained in the last five columns of the table; thus internal nodes require 18 bytes and terminal nodes require 10 bytes. The indices in the first column and the implicit context represented by the node are shown for convenience. Each of these indices picks out a row in the table that corresponds to a node in the trie structure representing the counts. The nodes are arranged according to a unicode-order breadth-first traversal of the count trie. Each non-terminal node stores a character, the index of the suffix node (as in a suffix tree), a probability estimate for the character given the context of characters that led to it, an interpolation value for the context consisting of the characters leading to the node plus the node's own character, and finally a pointer to the first daughter of the node in the array. Note that the last daughter of a node is thus picked out by the first daughter of the next node minus one. For instance, the daughters of node 1 are 6, 7 and 8, and the daughter of node 9 is 16. The terminal nodes have no daughters; note that memory is not allocated for these terminal cells.
                        char(2)  int(4)   float(4)               float(4)                   int(4)
Index   Context         Char     Suffix   log2 P(char|context)   log2(1 - λ(context.char))  First Dtr
  0     n/a             n/a      n/a      n/a                    -0.6322682                  1
  1     ""              a         0       -2.6098135             -0.4150375                  6
  2     ""              b         0       -3.8987012             -0.5849625                  9
  3     ""              c         0       -4.845262              -0.32192808                10
  4     ""              d         0       -4.845262              -0.32192808                11
  5     ""              r         0       -3.8987012             -0.5849625                 12
  6     "a"             b         2       -2.5122285             -0.5849625                 13
  7     "a"             c         3       -3.4966948             -0.32192808                14
  8     "a"             d         4       -3.4966948             -0.32192808                15
  9     "b"             r         5       -1.4034244             -0.5849625                 16
 10     "c"             a         1       -1.5948515             -0.32192808                17
 11     "d"             a         1       -1.5948515             -0.32192808                18
 12     "r"             a         1       -1.1760978             -0.32192808                19
 13     "ab"            r         9       -0.77261907
 14     "ac"            a        10       -1.1051782
 15     "ad"            a        11       -1.1051782
 16     "br"            a        12       -0.6703262
 17     "ca"            d         8       -1.8843123
 18     "da"            b         6       -1.5554274
 19     "ra"            c         7       -1.8843123

The last two columns of rows 13 through 19 are intentionally left blank: maximum-length n-grams are not contexts.
The second column indicates the context derived by walking from the root node (index 0) down to the node. For instance, the path "bra" leads from the root node 0 ("") to the node 2 ("b"), to node 9 ("br"), and finally to node 16 ("bra"). The tree is traversed by starting at the root node and looking up daughters by binary search.
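The daughter lookup can be sketched against the parallel arrays in the table. The class and method names below are illustrative, not the actual implementation; the first-daughter array here is padded with a sentinel entry so that terminal nodes have an empty daughter range.

```java
public class TrieWalk {
    // Characters for nodes 0..19 of the "abracadabra" trie in the table
    // above; node 0 (the root) has no character, so slot 0 is unused.
    static final char[] CHARS = {
        0, 'a', 'b', 'c', 'd', 'r',       // nodes 0-5
        'b', 'c', 'd', 'r',               // nodes 6-9
        'a', 'a', 'a',                    // nodes 10-12
        'r', 'a', 'a', 'a', 'd', 'b', 'c' // nodes 13-19 (terminals)
    };
    // First-daughter indices; terminal nodes 13..19 get the sentinel 20,
    // and one final sentinel lets daughter() read FIRST_DTR[node + 1].
    static final int[] FIRST_DTR = {
        1, 6, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
        20, 20, 20, 20, 20, 20, 20, 20
    };

    // Binary search the (character-sorted) daughters of a node; the
    // daughters of node n occupy indices FIRST_DTR[n] .. FIRST_DTR[n+1]-1.
    static int daughter(int node, char c) {
        int low = FIRST_DTR[node];
        int high = FIRST_DTR[node + 1] - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            if (CHARS[mid] < c) low = mid + 1;
            else if (CHARS[mid] > c) high = mid - 1;
            else return mid;
        }
        return -1; // no daughter with that character
    }

    public static void main(String[] args) {
        // Walk the path "bra": root 0 -> node 2 -> node 9 -> node 16.
        int node = 0;
        for (char c : "bra".toCharArray()) node = daughter(node, c);
        System.out.println(node); // 16
    }
}
```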
If a full context and node are in the tree, the estimate can simply be looked up. For instance, log2 P(b) = -3.898, log2 P(r|b) = -1.403, and log2 P(a|br) = -0.670. If the
. If the
context is available, but not the outcome, the result is just the
addition of the log one minus lambda estimates down to the first
context for which the outcome exists. These contexts to explore
are all suffixes of the context currently being explored and may be
looked up in the suffix array. For instance, consider:
P(c|br) = λ(br) P_ML(c|br) + (1 - λ(br)) P(c|b)
In this case, the outcome c is not available for the context "br" (the only daughter is a), hence the maximum likelihood estimate is zero, and the result reduces to the second term:

P(c|br) = (1 - λ(br)) P(c|b)
and hence

log2 P(c|br)
  = log2((1 - λ(br)) P(c|b))
  = log2(1 - λ(br)) + log2 P(c|b)

This continues until the outcome is found. In this case, we continue with the next term in the same way:

  = log2(1 - λ(br)) + log2(1 - λ(b)) + log2 P(c)
  = -0.585 + -0.585 + -4.845
  = -6.015
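The same arithmetic can be replayed directly from the table's values. The helper below is a hypothetical sketch, not library code: node 9 holds log2(1 - λ("br")), node 2 holds log2(1 - λ("b")), and node 3 holds log2 P(c|"").

```java
public class BackoffArithmetic {
    // Backoff estimate of log2 P(c|br) using values read off the table.
    static double log2CGivenBR() {
        double log2OneMinusLambdaBr = -0.5849625; // node 9: context "b", char 'r'
        double log2OneMinusLambdaB  = -0.5849625; // node 2: context "",  char 'b'
        double log2PC               = -4.845262;  // node 3: P(c | "")
        // 'c' is unseen after "br" and after "b", so the maximum
        // likelihood terms vanish and only the backoff terms remain.
        return log2OneMinusLambdaBr + log2OneMinusLambdaB + log2PC;
    }

    public static void main(String[] args) {
        System.out.println(log2CGivenBR()); // approximately -6.015
    }
}
```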
Because 3-grams are the upper bound on context length, and terminal nodes have no daughters, space is saved at the end of the smoothing and first-daughter arrays, which are simply not allocated for the terminal nodes.
In practice, this smoothing may require going all the way down to the uniform model. The uniform model estimate is stored separately. The interpolation parameter for the root node works as for any other node.
For estimates of sequences, the final node used acts as the first potential context for the next estimate. This replaces what would otherwise be a number of binary character searches equal to the n-gram length with a single array lookup.
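This incremental pattern can be sketched as a loop over the class's own log2Estimate(int,char) and nextContext(int,char) methods, starting from ROOT_NODE_INDEX (0). To keep the sketch runnable, the two-method contract is captured in a hypothetical local interface with a toy uniform stand-in; only the loop itself reflects the documented API.

```java
public class IncrementalEstimate {
    // Hypothetical two-method contract mirroring log2Estimate(int,char)
    // and nextContext(int,char) of CompiledNGramProcessLM.
    interface ProcessLM {
        double log2Estimate(int contextIndex, char nextChar);
        int nextContext(int contextIndex, char nextChar);
    }

    // Carry the context index forward so each character costs one
    // estimate lookup plus one context transition, instead of a fresh
    // walk over the preceding n-1 characters.
    static double log2SequenceEstimate(ProcessLM lm, CharSequence cSeq) {
        double total = 0.0;
        int contextIndex = 0; // ROOT_NODE_INDEX
        for (int i = 0; i < cSeq.length(); i++) {
            char c = cSeq.charAt(i);
            total += lm.log2Estimate(contextIndex, c);
            contextIndex = lm.nextContext(contextIndex, c);
        }
        return total;
    }

    // Toy stand-in: uniform -8.0 per character, context always root.
    static final ProcessLM TINY = new ProcessLM() {
        public double log2Estimate(int contextIndex, char nextChar) { return -8.0; }
        public int nextContext(int contextIndex, char nextChar) { return 0; }
    };

    public static void main(String[] args) {
        System.out.println(log2SequenceEstimate(TINY, "abc")); // -24.0
    }
}
```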
Implementation Note: The suffix indices are not included in the binary serialization format. Instead, they are initialized right after the binary data is read in. This requires walking the trie structure and computing each node's suffix by doing a walk from the root. For instance, a million-node 8-gram model will require at most 8 million binary character searches during initialization.
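The initialization walk can be sketched using the parallel arrays from the table above. Names are illustrative: each suffix index is found by re-walking, from the root, the node's string minus its first character.

```java
public class SuffixInit {
    // Node characters and first-daughter indices from the table above
    // (node 0 is the root; terminal nodes 13..19 share the sentinel 20).
    static final char[] CHARS = {
        0, 'a', 'b', 'c', 'd', 'r', 'b', 'c', 'd', 'r',
        'a', 'a', 'a', 'r', 'a', 'a', 'a', 'd', 'b', 'c'
    };
    static final int[] FIRST_DTR = {
        1, 6, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
        20, 20, 20, 20, 20, 20, 20, 20
    };

    // Linear scan over the daughter range; fine for a sketch.
    static int daughter(int node, char c) {
        for (int i = FIRST_DTR[node]; i < FIRST_DTR[node + 1]; i++)
            if (CHARS[i] == c) return i;
        return -1;
    }

    // Walk a string down from the root, one daughter lookup per character.
    static int walk(String s) {
        int node = 0;
        for (int i = 0; i < s.length() && node >= 0; i++)
            node = daughter(node, s.charAt(i));
        return node;
    }

    // The suffix of the node for string s is the node for s minus its
    // first character, found by a fresh walk from the root.
    static int suffixOf(String s) {
        return walk(s.substring(1));
    }

    public static void main(String[] args) {
        System.out.println(suffixOf("bra")); // 12, matching the table
        System.out.println(suffixOf("br"));  // 5
    }
}
```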
Nested classes/interfaces inherited from interface LanguageModel: LanguageModel.Conditional, LanguageModel.Dynamic, LanguageModel.Process, LanguageModel.Sequence, LanguageModel.Tokenized
Modifier and Type    Field and Description

static int    ROOT_NODE_INDEX
    The index of the root node, namely 0.
Modifier and Type  Method and Description 

double 
log2ConditionalEstimate(char[] cs,
int start,
int end)
Returns the log (base 2) of the probability estimate for the
conditional probability of the last character in the specified
slice given the previous characters.

double 
log2ConditionalEstimate(CharSequence cSeq)
Returns the log (base 2) of the probability estimate for the
conditional probability of the last character in the specified
character sequence given the previous characters.

double 
log2Estimate(char[] cs,
int start,
int end)
Returns an estimate of the log (base 2) probability of the
specified character slice.

double 
log2Estimate(CharSequence cSeq)
Returns an estimate of the log (base 2) probability of the
specified character sequence.

double 
log2Estimate(int contextIndex,
char nextChar)
Returns the log (base 2) estimate of the specified character in
the context with the specified index.

double 
log2Prob(CharSequence cSeq)
This method is a convenience implementation of the
Model interface which delegates the call to log2Estimate(CharSequence) . 
int 
longestContextIndex(String context)
Returns the index, in the underlying parallel array structure, of the
maximum-length suffix of the specified string that is defined as a
context.

int 
maxNGram()
Returns the maximum-length n-gram used for this compiled
language model.

int 
nextContext(int contextIndex,
char nextChar)
Returns the index of the context formed by appending the
specified character to the context of the specified index.

int 
numNodes()
Returns the total number of nodes in this language model's trie
structure.

char[] 
observedCharacters()
Returns the array of characters that have been observed for this
language model.

double 
prob(CharSequence cSeq)
This method is a convenience implementation of the
Model
interface which returns the result of raising 2.0 to the
power of the result of a call to log2Estimate(CharSequence) . 
String 
toString()
Returns a string-based representation of this compiled n-gram language model.

public static final int ROOT_NODE_INDEX

The index of the root node, namely 0.

public char[] observedCharacters()

The observed characters are assumed to be those for which unigram probabilities are stored in the trie.

Specified by: observedCharacters in interface LanguageModel.Conditional
public int maxNGram()
public int numNodes()
public int longestContextIndex(String context)

Parameters: context - String of context.

public double log2Prob(CharSequence cSeq)

This method is a convenience implementation of the Model interface which delegates the call to log2Estimate(CharSequence).

Specified by: log2Prob in interface Model<CharSequence>
Parameters: cSeq - Character sequence whose probability is returned.

public double prob(CharSequence cSeq)

This method is a convenience implementation of the Model interface which returns the result of raising 2.0 to the power of the result of a call to log2Estimate(CharSequence).

Specified by: prob in interface Model<CharSequence>

Parameters: cSeq - Character sequence whose probability is returned.

public final double log2Estimate(CharSequence cSeq)
Specified by: log2Estimate in interface LanguageModel

Parameters: cSeq - Character sequence to estimate.

public final double log2Estimate(int contextIndex, char nextChar)
The main use of this method is to incrementally compute
conditional estimates and contexts, in conjunction with the
method nextContext(int,char)
.
Parameters: contextIndex - Index of context of estimate. nextChar - Character being estimated.

public int nextContext(int contextIndex, char nextChar)
This method is typically used in conjunction with log2Estimate(int,char).

Note that the index of the root node is always 0.

Parameters: contextIndex - Index of present context. nextChar - Next character.

Throws: IllegalArgumentException - If the context index is less than zero or greater than the last context index.

public final double log2Estimate(char[] cs, int start, int end)
Specified by: log2Estimate in interface LanguageModel

Parameters: cs - Underlying array of characters. start - Index of first character in slice. end - One plus index of last character in slice.

public double log2ConditionalEstimate(CharSequence cSeq)
Specified by: log2ConditionalEstimate in interface LanguageModel.Conditional

Parameters: cSeq - Character sequence to estimate.

public double log2ConditionalEstimate(char[] cs, int start, int end)
Specified by: log2ConditionalEstimate in interface LanguageModel.Conditional

Parameters: cs - Underlying array of characters. start - Index of first character in slice. end - One plus the index of the last character in the slice.

public String toString()