public class LatentDirichletAllocation extends Object implements Serializable
LatentDirichletAllocation
object represents a latent
Dirichlet allocation (LDA) model. LDA provides a Bayesian model of
document generation in which each document is generated by
a mixture of topical multinomials. An LDA model specifies the
number of topics, a Dirichlet prior over topic mixtures for a
document, and a discrete distribution over words for each topic.
A document is generated from an LDA model by first selecting a multinomial over topics given the Dirichlet prior. Then for each token in the document, a topic is generated from the documentspecific topic distribution, and then a word is generated from the discrete distribution for that topic. Note that document length is not generated by this model; a fully generative model would need a way of generating lengths (e.g. a Poisson distribution) or terminating documents (e.g. a disginuished endofdocument symbol).
An LDA model may be estimated from an unlabeled training corpus (collection of documents) using a second Dirichlet prior, this time over word distributions in a topic. This class provides a static inference method that produces (collapsed Gibbs) samples from the posterior distribution of topic assignments to words, any one of which may be used to construct an LDA model instance.
An LDA model can be used to infer the topic mixture of unseen text documents, which can be used to compare documents by topical similarity. A fixed LDA model may also be used to estimate the likelihood of a word occurring in a document given the other words in the document. A collection of LDA models may be used for fully Bayesian reasoning at the corpus (collection of documents) level.
LDA may be applied to arbitrary multinomial data. To apply it
to text, a tokenizer factory converts a text document to bag of
words and a symbol table converts these tokens to integer outcomes.
The static method tokenizeDocuments(CharSequence[],TokenizerFactory,SymbolTable,int)
is available to carry out the text to multinomial conversion with
pruning of low counts.
LDA operates over a fixed vocabulary of discrete outcomes, which we call words for convenience, and represent as a set of integers:
words = { 0, 1, ..., numWords1 }
A corpus is a ragged array of documents, each document consisting of an array of words:
A given documentint[][] words = { words[0], words[1], ..., words[numDocs1] }
words[i]
, i < numDocs
,
is itself represented as a bag of words, each word being
represented as an integer:
The documents do not all need to be the same length, so the twodimensional arrayint[] words[i] = { words[i][0], words[i][1], ..., words[i][words[i].length1] }
words
is ragged.
A particular LDA model is defined over a fixed number of topics, also represented as integers:
topics = { 0, 1, ..., numTopics1 }
For a given topic topic < numTopics
, LDA specifies a
discrete distribution φ[topic]
over words:
φ[topic][word] >= 0.0 Σ_{word < numWords} φ[topic][word] = 1.0
In an LDA model, each document in the corpus is generated from a
documentspecific mixture of topics θ[doc]
. The
distribution θ[doc]
is a discrete distribution over
topics:
θ[doc][topic] >= 0.0 Σ_{topic < numTopics} θ[doc][topic] = 1.0
Under LDA's generative model for documents, a documentspecific
topic mixture θ[doc]
is generated from a uniform
Dirichlet distribution with concentration parameter
α
. The Dirichlet is the conjugate prior for the
multinomial in which α
acts as a prior count
assigned to a topic in a document. Typically, LDA is run with a
fairly diffuse prior with concentration α <
1
, leading to skewed posterior topic distributions.
Given a topic distribution θ[doc]
for a
document, tokens are generated (conditionally) independently. For
each token in the document, a topic
topics[doc][token]
is generated according to the
topic distribution θ[doc]
, then the word instance
words[doc][token]
is
generated given the topic using the topicspecific distribution
over tokens φ[topics[doc][token]]
.
For estimation purposes, LDA places a uniform Dirichlet prior
with concentration parameter β
on each of the
topic distributions φ[topic]
. The first step in
modeling a corpus is to generate the topic distributions
φ
from a Dirichlet parameterized by
β
.
In sampling notation, the LDA generative model is expressed as follows:
φ[topic] ~ Dirichlet(β) θ[doc] ~ Dirichlet(α) topics[doc][token] ~ Discrete(θ[doc]) words[doc][token] ~ Discrete(φ[topic[doc][token]])
A generative Bayesian model is based on a the joint probablity
of all observed and unobserved variables, given the priors. Given a
text corpus, only the words are observed. The unobserved variables
include the assignment of a topic to each word, the assignment of a
topic distribution to each document, and the assignment of word
distributions to each topic. We let
topic[doc][token]
, doc < numDocs, token <
docLength[doc]
, be the topic assigned to the specified token
in the specified document. This leaves two Dirichlet priors, one
parameterized by α
for topics in documents, and
one parameterized by β
for words in topics.
These priors are treated as hyperparameters for purposes of
estimation; crossvalidation may be used to provide socalled empirical
Bayes estimates for the priors.
The full LDA probability model for a corpus follows the generative process outlined above. First, topic distributions are chosen at the corpus level for each topic given their Dirichlet prior, and then the remaining variables are generating given these topic distributions:
Note that because the multinomial parameters forp(words, topics, θ, &phi  α, &beta) = Π_{topic < numTopics} p(φ[topic]  β) * p(words, topics, θ  α, φ)
φ[topic]
are continuous,
p(φ[topic]  β)
represents a density, not a discrete
distribution. Thus it does not make sense to talk about the
probability of a given multinomial φ[topic]
; nonzero
results only arise from integrals over measurable subsets of the
multinomial simplex. It is possible to sample from a density, so
the generative model is well founded.
A document is generated by first generating its topic distribution given the Dirichlet prior, and then generating its topics and words:
The topic and word are generated from the multinomialsp(words, topics, θ  α, φ) = Π_{doc < numDocs} p(θ[doc]  α) * p(words[doc], topics[doc]  θ[doc], φ)
θ[doc]
and the topic distributions
φ
using the chain rule, first generating the topic
given the document's topic distribution and then generating the
word given the topic's word distribution.
p(words[doc], topics[doc]  θ[doc], φ) = Π_{token < words[doc].length} p(topics[doc][token]  θ[doc]) * p(words[doc][token]  φ[topics[doc][token]])
Given the topic and document multinomials, this distribution is discrete, and thus may be evaluated. It may also be marginalized by summation:
p(words[doc]  θ[doc], φ) = Π_{ token < words[doc].length} Σ_{topic < numTopics} p(topic  θ[doc]) * p(words[doc][token]  φ[topic])
Conditional probablities are computed in the usual way by marginalizing other variables through integration. Unfortunately, this simple mathematical operation is often intractable computationally.
This class uses a collapsed form of Gibbs sampling over the posterior distribution of topic assignments given the documents and Dirichlet priors:
p(topics  words, α β)
This distribution may be derived from the joint distribution by marginalizing (also known as "collapsing" or "integrating out") the contributions of the documenttopic and topicword distributions.
The Gibbs sampler used to estimate LDA models produces samples that consist of a topic assignment to every token in the corpus. The conjugacy of the Dirichlet prior for multinomials makes the sampling straightforward.
An initial sample is produced by randomly assigning topics to tokens. Then, the sampler works iteratively through the corpus, one token at a time. At each token, it samples a new topic assignment to that token given all the topic assignments to other tokens in the corpus:
The notationp(topics[doc][token]  words, topics')
topics'
represents the set of topic
assignments other than to topics[doc][token]
. This
collapsed posterior conditional is estimated directly:
The counts are all defined relative top(topics[doc][token] = topic  words, topics') = (count'(doc,topic) + α) / (docLength[doc]1 + numTopics*α) * (count'(topic,word) + β) / (count'(topic) + numWords*β)
topics'
; that
is, the current topic assignment for the token being sampled is not
considered in the counts. Note that the two factors are estimates
of θ[doc]
and φ[topic]
with all
data other than the assignment to the current token. Note how the
prior concentrations arise as additive smoothing factors in these
estimates, a result of the Dirichlet's conjugacy to the
multinomial. For the purposes of sampling, the documentlength
normalization in the denominator of the first term is not necessary,
as it remains constant across topics.
The posterior Dirichlet distributions may be computed using just the counts. For instance, the posterior distribution for topics in documents is estimated as:
p(&theta[doc]α, β, words) = Dirichlet(count(doc,0)+β, count(doc,1)+β, ..., count(doc,numTopics1)+β)
The sampling distribution is defined from the maximum a posteriori (MAP) estimates of the multinomial distribution over topics in a document:
which we know from the Dirichlet distribution is:θ^{*}[doc] = ARGMAX_{θ[doc]} p(θ[doc]  α, β, words)
By the same reasoning, the MAP word distribution in topics is:θ^{*}[doc][topic] = (count(doc,topic) + α) / (docLength[doc] + numTopics*α)
φ^{*}[topic][word] = (count(topic,word) + β) / (count(topic) + numWords*β)
A complete Gibbs sample is represented as an instance of LatentDirichletAllocation.GibbsSample
, which provides access to
the topic assignment to every token, as well as methods to compute
θ^{*}
and φ^{*}
as defined above. A sample also maintains the original priors and
word counts. Just the estimates of the topicword distributions
φ[topic]
and the prior topic concentration
α
are sufficient to define an LDA model. Note
that the imputed values of θ^{*}[doc]
used during estimation are part of a sample, but are not part of
the LDA model itself. The LDA model contains enough information to
estimate θ^{*}
for an arbitrary
document, as described in the next section.
The Gibbs sampling algorithm starts with a random assignment of topics to words, then simply iterates through the tokens in turn, sampling topics according to the distribution defined above. After each run through the entire corpus, a callback is made to a handler for the samples. This setup may be configured for an initial burnin period, essentially just discarding the first batch of samples. Then it may be configured to sample only periodically thereafter to avoid correlations between samples.
An LDA model consists of a topic distribution Dirichlet prior
α
and a word distribution
φ[topic]
for each topic. Given an LDA model and a
new document words = { words[0], ..., words[length1] }
consisting of a sequence of words, the posterior distribution over
topic weights is given by:
Although this distribution is not solvable analytically, it is easy to estimate using a simplified form of the LDA estimator's Gibbs sampler. The conditional distribution of a topic assignmentp(θ  words, α, φ)
topics[token]
to a single token given an assignment
topics'
to all other tokens is given by:
This leads to a straightforward sampler over posterior topic assignments, from which we may directly compute the Dirichlet posterior over topic distributions or a MAP topic distribution.p(topic[token]  topics', words, α, φ) ∝ p(topic[token], words[token]  topics', α φ) = p(topic[token]  topics', α) * p(words[token]  φ[topic[token]]) = (count(topic[token]) + α) / (words.length  1 + numTopics * α) * p(words[token]  φ[topic[token]])
This class provides a method to sample these topic assignments,
which may then be used to form Dirichlet distributions or MAP point
estimates of θ^{*}
for the document
words
.
An LDA model may be used to estimate the likelihood of a word given a previous bag of words:
This integral is easily evaluated using sampling over the topic distributionsp(word  words, α, φ) = ∫p(word  θ, φ) p(θ  words, α, φ) dθ
p(θ  words, α, φ)
and
averaging the word probability determined by each sample.
The word probability for a sample θ
is defined by:
Although this approach could theoretically be applied to generate the probability of a document one word at a time, the cost would be prohibitive, as there are quadratically many samples required because samples for thep(word  θ, φ) = Σ_{topic < numTopics} p(topic  θ) * p(word  φ[topic])
n
th word consist of topic
assignments to the previous n1
words.
An LDA model may be used for a variety of statistical calculations. For instance, it may be used to determine the distribution of topics to words, and using these distributions, may determine word similarity. Similarly, document similarity may be determined by the topic distributions in a document.
Point estimates are derived using a single LDA model. For Bayesian calculation, multiple samples are taken to produce multiple LDA models. The results of a calculation on these models is then averaged to produce a Bayesian estimate of the quantity of interest. The sampling methodology is effectively numerically computing the integral over the posterior.
Bayesian calculations over multiple samples are complicated by the exchangeability of topics in the LDA model. In particular, there is no guarantee that topics are the same between samples, thus it is not acceptable to combine samples in topiclevel reasoning. For instance, it does not make sense to estimate the probability of a topic in a document using multiple samples.
LDA has also been applied to collaborative filtering. Movies act as words, with users modeled as documents, the bag of words they've seen. Given an LDA model and a user's films, the user's topic distribution may be inferred and used to estimate the likelihood of seeing unseen films.
Modifier and Type  Class and Description 

static class 
LatentDirichletAllocation.GibbsSample
The
LatentDirichletAllocation.GibbsSample class
encapsulates all of the information related to a single Gibbs
sample for latent Dirichlet allocation (LDA). 
Constructor and Description 

LatentDirichletAllocation(double docTopicPrior,
double[][] topicWordProbs)
Construct a latent Dirichelt allocation (LDA) model using the
specified documenttopic prior and topicword distributions.

Modifier and Type  Method and Description 

double[] 
bayesTopicEstimate(int[] tokens,
int numSamples,
int burnin,
int sampleLag,
Random random)
Return the Bayesian point estimate of the topic distribution
for a document consisting of the specified tokens, using Gibbs
sampling with the specified parameters.

double 
documentTopicPrior()
Returns the concentration value of the uniform Dirichlet prior over
topic distributions for documents.

static Iterator<LatentDirichletAllocation.GibbsSample> 
gibbsSample(int[][] docWords,
short numTopics,
double docTopicPrior,
double topicWordPrior,
Random random)
Return an iterator over Gibbs samples for the specified
documentword corpus, number of topics, priors and randomizer.

static LatentDirichletAllocation.GibbsSample 
gibbsSampler(int[][] docWords,
short numTopics,
double docTopicPrior,
double topicWordPrior,
int burninEpochs,
int sampleLag,
int numSamples,
Random random,
ObjectHandler<LatentDirichletAllocation.GibbsSample> handler)
Run Gibbs sampling for the specified multinomial data, number
of topics, priors, search parameters, randomization and
callback sample handler.

int 
numTopics()
Returns the number of topics in this LDA model.

int 
numWords()
Returns the number of words on which this LDA model
is based.

short[][] 
sampleTopics(int[] tokens,
int numSamples,
int burnin,
int sampleLag,
Random random)
Returns the specified number of Gibbs samples of topics for the
specified tokens using the specified number of burnin epochs,
the specified lag between samples, and the specified
randomizer.

static int[] 
tokenizeDocument(CharSequence text,
TokenizerFactory tokenizerFactory,
SymbolTable symbolTable)
Tokenizes the specified text document using the specified tokenizer
factory returning only tokens that exist in the symbol table.

static int[][] 
tokenizeDocuments(CharSequence[] texts,
TokenizerFactory tokenizerFactory,
SymbolTable symbolTable,
int minCount)
Tokenize an array of text documents represented as character
sequences into a form usable by LDA, using the specified
tokenizer factory and symbol table.

double[] 
wordProbabilities(int topic)
Returns an array representing of probabilities of words in the
specified topic.

double 
wordProbability(int topic,
int word)
Returns the probability of the specified word in the specified
topic.

public LatentDirichletAllocation(double docTopicPrior, double[][] topicWordProbs)
The topicword probability array topicWordProbs
represents a collection of discrete distributions
topicwordProbs[topic]
for topics, and thus must
satisfy:
topicWordProbs[topic][word] >= 0.0 Σ_{word < numWords} topicWordProbs[topic][word] = 1.0
Warning: These requirements are not checked by the constructor.
See the class documentation above for an explanation of the parameters and what can be done with a model.
docTopicPrior
 The documenttopic prior.topicWordProbs
 Array of discrete distributions over words,
indexed by topic.IllegalArgumentException
 If the documenttopic prior is
not finite and positive, or if the topicword probabilities
arrays are not all the same length with entries between 0.0 and
1.0 inclusive.public int numTopics()
public int numWords()
public double documentTopicPrior()
public double wordProbability(int topic, int word)
topic
 Topic identifier.word
 Word identifier.public double[] wordProbabilities(int topic)
The returned result is a copy of the underlying data in the model so that changing it will not change the model.
topic
 Topic identifier.public short[][] sampleTopics(int[] tokens, int numSamples, int burnin, int sampleLag, Random random)
See the class documentation for more information on how the samples are computed.
tokens
 The tokens making up the document.numSamples
 Number of Gibbs samples to return.burnin
 The number of samples to take and throw away
during the burnin period.sampleLag
 The interval between samples after burnin.random
 The random number generator to use for this sampling
process.IndexOutOfBoundsException
 If there are tokens whose
value is less than zero, or whose value is greater than the
number of tokens in this model.IllegalArgumentException
 If the number of samples is
not positive, the sample lag is not positive, or if the burnin
period is negative. if the number of samples, burnin, and lag
are not positive numbers.public double[] bayesTopicEstimate(int[] tokens, int numSamples, int burnin, int sampleLag, Random random)
See the method sampleTopics(int[],int,int,int,Random)
and the class documentation for more information on the sampling
procedure.
tokens
 The tokens making up the document.numSamples
 Number of Gibbs samples to return.burnin
 The number of samples to take and throw away
during the burnin period.sampleLag
 The interval between samples after burnin.random
 The random number generator to use for this sampling
process.IndexOutOfBoundsException
 If there are tokens whose
value is less than zero, or whose value is greater than the
number of tokens in this model.IllegalArgumentException
 If the number of samples is
not positive, the sample lag is not positive, or if the burnin
period is negative.public static LatentDirichletAllocation.GibbsSample gibbsSampler(int[][] docWords, short numTopics, double docTopicPrior, double topicWordPrior, int burninEpochs, int sampleLag, int numSamples, Random random, ObjectHandler<LatentDirichletAllocation.GibbsSample> handler)
LatentDirichletAllocation.GibbsSample
. This method will return
the final sample and also send intermediate samples to an
optional handler.
The class documentation above explains Gibbs sampling for LDA as used in this method.
The primary input is an array of documents, where each document is represented as an array of integers representing the tokens that appear in it. These tokens should be numbered contiguously from 0 for space efficiency. The topic assignments in the Gibbs sample are aligned as parallel arrays to the array of documents.
The next three parameters are the hyperparameters of the model, specifically the number of topics, the prior count assigned to topics in a document, and the prior count assigned to words in topics. A rule of thumb for the documenttopic prior is to set it to 5 divided by the number of topics (or less if there are very few topics; 0.1 is typically the maximum value used). A good general value for the topicword prior is 0.01. Both of these priors will be diffuse and tend to lead to skewed posterior distributions.
The following three parameters specify how the sampling is
to be done. First, the sampler is "burned in" for a
number of epochs specified by the burnin parameter. After burn
in, samples are taken after fixed numbers of documents to avoid
correlation in the samples; the sampling frequency is specified
by the sample lag. Finally, the number of samples to be taken
is specified. For instance, if the burnin is 1000, the sample
lag is 250, and the number of samples is 5, then samples are
taken after 1000, 1250, 1500, 1750 and 2000 epochs. If a
nonnull handler object is specified in the method call, its
handle(GibbsSample)
method is called with each the
samples produced as above.
The final sample in the chain of samples is returned as the result. Note that this sample will also have been passed to the specified handler as the last sample for the handler.
A random number generator must be supplied as an argument.
This may just be a new instance of Random
or
a custom extension. It is used for all randomization in this
method.
docWords
 Corpus of documents to be processed.numTopics
 Number of latent topics to generate.docTopicPrior
 Prior count of topics in a document.topicWordPrior
 Prior count of words in a topic.burninEpochs
 Number of epochs to run before taking a sample.sampleLag
 Frequency between samples.numSamples
 Number of samples to take before exiting.random
 Random number generator.handler
 Handler to which the samples are sent.public static Iterator<LatentDirichletAllocation.GibbsSample> gibbsSample(int[][] docWords, short numTopics, double docTopicPrior, double topicWordPrior, Random random)
gibbsSampler(int[][],short,double,double,int,int,int,Random,ObjectHandler)
.
See that method and the class documentation for more details.docWords
 Corpus of documents to be processed.numTopics
 Number of latent topics to generate.docTopicPrior
 Prior count of topics in a document.topicWordPrior
 Prior count of words in a topic.random
 Random number generator.public static int[][] tokenizeDocuments(CharSequence[] texts, TokenizerFactory tokenizerFactory, SymbolTable symbolTable, int minCount)
Warning: With some tokenizer factories and or minimum count thresholds, there may be documents with no tokens in them.
texts
 The text corpus.tokenizerFactory
 A tokenizer factory for tokenizing the texts.symbolTable
 Symbol table used to convert tokens to identifiers.minCount
 Minimum count for a token to be included in a
document's representation.public static int[] tokenizeDocument(CharSequence text, TokenizerFactory tokenizerFactory, SymbolTable symbolTable)
text
 Character sequence to tokenize.tokenizerFactory
 Tokenizer factory for tokenization.symbolTable
 Symbol table to use for converting tokens
to symbols.