What are Character Language Models?
Character language models estimate the likelihood of a given text. They are trained on samples of text and can then assign high likelihood to new text that resembles the text they saw in training.
What's in this Tutorial?
This tutorial shows how to tune and evaluate character language models on a corpus. The tutorial may be run on data supplied with the LingPipe demos.
Online Training and Evaluation
This tutorial illustrates how language model training and evaluation may be interleaved with dynamic language models. Language models may be compiled through LingPipe's generic compilable interface; the compiled models are relatively small in memory and very fast to evaluate.
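As a minimal sketch of that workflow (the training strings are made up, and the cast assumes, per the LingPipe Javadoc, that the compiled form of NGramProcessLM implements LanguageModel.Process):

NGramProcessLM lm = new NGramProcessLM(5,256);
lm.train("first training text");      // dynamic: train and evaluate in any order
lm.train("second training text");
// compile to a compact, fast, read-only model
LanguageModel.Process compiledLm
    = (LanguageModel.Process) AbstractExternalizable.compile(lm);
double log2Prob = compiledLm.log2Estimate("some test text");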
Scaling to Large Corpora
The motivation for LingPipe's implementation is to be able to scale to large training data sizes. We show performance on two widely available multi-gigabyte corpora:
- MEDLINE: Abstracts of biomedical research articles. Distributed by the United States National Center for Biotechnology Information. 8 billion characters in abstract bodies.
- Gigaword: Newspaper articles. Distributed by the Linguistic Data Consortium at the University of Pennsylvania. 9 billion characters in article bodies.
Applications
Character language models are used in LingPipe for a broad range of applications.
Classification
For classification, character LMs model the type of text being classified. We provide two tutorials in which character language models are used implicitly for classification:
- Classification by Topic Tutorial: Character language models are trained on newsgroups covering a range of topics and then used to classify new messages by topic.
- Classification by Sentiment Tutorial: Character language models are trained on sentences which are objective (statements of fact) versus subjective (statements of opinion), and used to classify whether other statements are objective or subjective. They are also trained on movie reviews that are positive or negative, and used to classify other reviews.
They may also be used for classification by language or any other distinction.
Tagging
Two of the taggers in LingPipe use character language models to learn the forms of words used in various ways.
- Part-of-Speech Tutorial: Character language models represent the forms of words: nouns, verbs, adjectives, etc.
- Named Entity Recognition Tutorial: Character language models represent the forms of content items like person names, company names or gene or disease names.
Spelling Correction
Character language models represent the kinds of queries people are likely to make over a corpus by modeling the corpus.
Running the Demo
To run the tutorial out of the box, cd to the lingpipe/demos/tutorial/lm directory and run the command:
> ant demo
which prints the following output as well as writing it to the file lingpipe/demos/tutorial/lm/demo-out-8.txt:
RUN PARAMETERS
CORPUS NAME=medsamp
FILE EXTENSION=medsamp2005.xml
INPUT DIRECTORIES=[../../data/]
TEXT PARSER CLASS=com.aliasi.corpus.parsers.MedlineTextParser
MAX TRAINING CHARS=9223372036854775807
MAX NGRAM=8
NUM CHARS=256
LAMBDA FACTORS=[2.0, 8.0, 32.0]
SAMPLE SIZE=5000
SAMPLE FREQUENCY=1
PRINT FREQUENCY=5000
REPORT WRITTEN TO FILE=demo-out-8.txt

LEARNING CURVE
#CHARS, ELAPSED(s), TOTAL_MEM(MB), FREE_MEM(MB), TOT-FREE(MB), MEAN(2.0), DEV(2.0), MEAN(8.0), DEV(8.0), MEAN(32.0), DEV(32.0)
# Visiting directory=../../data
5000, 1, 232, 225, 7, 5.771, 3.474, 5.530, 2.916, 5.736, 2.404
10000, 2, 232, 220, 12, 3.210, 3.483, 3.041, 2.562, 3.507, 2.002
15000, 2, 232, 215, 17, 1.908, 3.066, 1.967, 2.139, 2.614, 1.632
20000, 2, 232, 212, 20, 3.307, 3.779, 2.920, 2.695, 3.148, 2.074
25000, 3, 232, 207, 25, 2.570, 3.500, 2.432, 2.537, 2.832, 2.019
30000, 3, 232, 202, 30, 3.054, 3.595, 2.705, 2.537, 2.933, 1.972
35000, 3, 232, 198, 34, 3.137, 3.565, 2.773, 2.609, 2.958, 2.086
40000, 4, 232, 195, 37, 3.207, 3.698, 2.847, 2.706, 3.035, 2.174
45000, 4, 232, 190, 42, 3.463, 3.915, 3.051, 2.903, 3.180, 2.366
50000, 4, 232, 185, 47, 2.940, 3.685, 2.660, 2.734, 2.911, 2.247
55000, 4, 232, 180, 52, 2.936, 3.524, 2.626, 2.587, 2.853, 2.130
60000, 5, 232, 176, 55, 2.735, 3.480, 2.480, 2.547, 2.724, 2.080
65000, 5, 232, 172, 60, 2.848, 3.605, 2.578, 2.692, 2.802, 2.229

N-GRAM COUNTS
N, #Unique, #Total, %
0, 1, 65752, 1.000
1, 83, 65752, 0.999
2, 1413, 65705, 0.978
3, 5960, 65658, 0.909
4, 14179, 65611, 0.784
5, 23478, 65564, 0.642
6, 31633, 65517, 0.517
7, 38508, 65470, 0.412
8, 43928, 65423, 0.329

TOP N-GRAMS
N, (N-GRAM,Count)*
0, "",65752
1, " ",9536 "e",6567 "t",4668 "i",4266 "a",4071
2, "e ",1625 "s ",1099 " t",1070 "in",1032 "th",1013
3, " th",696 "the",635 "he ",552 "ed ",465 " of",432
4, " the",548 "the ",471 " of ",427 "tion",357 " and",296
5, " the ",468 " and ",287 "ation",210 "tion ",197 " of t",122
6, " of th",121 "ation ",115 "of the",113 "f the ",104 " with ",101
7, " of the",113 "of the ",103 "patient",65 " patien",62 "ion of ",62
8, " of the ",103 " patient",62 " in the ",57 "tion of ",55 "patients",45
The first few lines print the run parameters, most of which are specified in the build.xml file. These include the name of the corpus, a path and file-extension pattern for locating input files, the name of the text parser class used to extract text, a limit on the number of training characters before stopping, the maximum n-gram length, the number of distinct characters that may show up in the training and test sets, the hyperparameters for the model in the form of lambda factors (more on these later), sample size and frequency information, and finally a path to a file to which results are concatenated. After the parameters come the output reports, which we discuss later in detail.
Getting started with AbstractCommand
The code for this example is in the single file src/TrainLM.java. The first thing to note is that the class extends AbstractCommand. The abstract command class provides methods for parsing the command-line arguments gathered through its constructor and making them available at run time in various types, with error-checking provided. The main(String[]) method simply constructs a new instance with the command-line arguments as an argument and calls the run() method on the constructed object:
public static void main(String[] args) {
    new TrainLM(args).run();
}
This is the usual pattern for AbstractCommand extensions and can be seen throughout the command tutorials. The first lines of the constructor call the superclass constructor with the static default parameters properties object:
public TrainLM(String[] args) {
    super(args,DEFAULT_PARAMS);
    ...
}
As usual with abstract commands, constants are used for flag and property parameter names, and default values are set on a Properties instance to be passed to the AbstractCommand(String[],Properties) constructor, as shown above:
static final String CORPUS_NAME_PARAM = "corpusName";
...
static final String LAMBDA_FACTORS_PARAM = "lambdaFactors";
...
static final String REPORT_FILE_PARAM = "reportFile";

static final Properties DEFAULT_PARAMS = new Properties();
static {
    DEFAULT_PARAMS.setProperty(MAX_NGRAM_PARAM,"5");
    DEFAULT_PARAMS.setProperty(SAMPLE_SIZE,"1000");
    ...
}
This just says that the default maximum n-gram is 5, the default sample size is 1000 characters, and so on.
A quick look at the build.xml file shows how the parameters are supplied through the java Ant task:
<java classname="TrainLM" fork="true"> <classpath refid="classpath.standard"/> <arg value="-corpusName=medsamp"/> ... <arg value="-lambdaFactors=2.0, 8.0, 32.0"/> ... <arg value="-reportFile=demo-out-8.txt"/> <arg value="../../data/"/> </java>
The ellipses stand in for further parameters in the build file. This translates into a single command line:
java TrainLM \
    -corpusName=medsamp ... \
    "-lambdaFactors=2.0, 8.0, 32.0" ... \
    -reportFile=demo-out-8.txt ../../data/
Lines in the code that access these values are as follows:
File outFile = getArgumentFile(REPORT_FILE_PARAM);
...
mNGram = getArgumentInt(MAX_NGRAM_PARAM);
...
String[] lambdaNames = getCSV(LAMBDA_FACTORS_PARAM);
Each of these assignments uses a different accessor inherited from AbstractCommand; of particular note is the method getCSV(String), which parses an argument as a comma-separated list. All of these methods throw an illegal argument exception if the parameter is not defined, either by default or in the command invocation.
All of the parameters above are specified on the command line as -paramName=value. The abstract command also allows an arbitrary number of unmarked arguments, such as the path ../../data/ in the above invocation. These are accessed as follows:
for (int i = 0; i < numBareArguments(); ++i) {
    File dir = new File(getBareArgument(i));
    ...
}
The only other issue of note for AbstractCommand is that it specifies an abstract method run() that does not take any arguments, return any values, or throw any checked exceptions. This calls for an implementation of the following form:
public void run() {
    try {
        train();
    } catch (Exception e) {
        println("Exception=" + e);
        e.printStackTrace(System.out);
        e.printStackTrace(mPrinter);
    } finally {
        Streams.closeWriter(mPrinter);
    }
}
All exceptions are caught and their stack traces printed both to the file (through mPrinter) and to System.out (which is the standard console output unless it has been reset). There is also a finally block to close the print writer that is writing to the file, making sure the file handle is released even if an exception is thrown during processing.
Getting started with Parser and Handler
The com.aliasi.corpus.Parser abstract class and com.aliasi.corpus.Handler interface follow the same pattern as the org.xml.sax.XMLReader and org.xml.sax.ContentHandler interfaces for parsing XML documents with SAX. Each parser contains a handler to which it supplies callbacks derived from parsing an input source. LingPipe's arrangement is more generic than SAX in that the parser interface abstracts over a range of different handler interfaces.
For this demo, we're interested in parsers that provide events for a text handler. With a com.aliasi.corpus.ObjectHandler<E>, there is a single method for performing an arbitrary operation on an object:
public void handle(E e);
In our case, we set E to CharSequence for handling text, yielding the instantiated method:
public void handle(CharSequence cSeq);
A text parser is a Parser that performs callbacks to its contained handler, in this case an instance of ObjectHandler<CharSequence>. In our running example, the handler implementation performs language model training; the exact implementation is considered in the next section.
On the command line, we specify the name of the class that implements the text parser:
-textParser=com.aliasi.corpus.parsers.MedlineTextParser
This is a parser that knows how to extract the text of abstracts from MEDLINE citations in XML format; the XML will be provided through files (found through command-line specification of directory and file suffix).
The text parser is constructed through the com.aliasi.util.Reflection helper class:
String textParserClassName = getExistingArgument(TEXT_PARSER_PARAM);
mTextParser = (Parser<ObjectHandler<CharSequence>>)
    Reflection.newInstance(textParserClassName);
...
Training the Estimator
The actual estimator is declared as a member variable:
NGramProcessLM mLM;
and set in the constructor using command-line parameters:
mLM = new NGramProcessLM(mNGram,mNumChars);
This is the basic language model estimator class; it knows how to train a language model from text samples. The parameters in its constructor are the maximum n-gram length and the maximum number of distinct characters that may appear in the training and test collections.
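A minimal usage sketch (the strings here are made up for illustration):

NGramProcessLM lm = new NGramProcessLM(8,256); // up to 8-grams, at most 256 distinct chars
lm.train("P53 and tumor suppression.");
// estimates are available at any point during training
double log2Prob = lm.log2Estimate("tumor");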
The actual work of training a language model is performed by an inner class implementing ObjectHandler<CharSequence>:
class TrainingHandler implements ObjectHandler<CharSequence> {
    public void handle(CharSequence cSeq) {
        ...
        mLM.train(cSeq);
    }
}
The ellipses are filled with code that performs a learning-curve evaluation across the various parameter settings for the model.
The training text handler is assigned to the text parser using the generic setHandler(Handler) method:
mTextHandler = new TrainingHandler();
...
mTextParser.setHandler(mTextHandler);
The actual training files are extracted from the command line into an array, and each is passed to the method trainFile(File):
File[] files = ...
for (int j = 0; j < files.length; ++j)
    trainFile(files[j]);
This method simply branches to create an input source based on whether the file's suffix indicates it's gzipped:
void trainFile(File file) throws IOException {
    String fileName = file.getName();
    if (fileName.endsWith(".gz"))
        trainGZipFile(file);
    else
        trainTextFile(file);
}
Training a text file is as simple as converting the file name to a URL (using the utility in com.aliasi.util.Files) and wrapping it in an input source to be passed to the parser:
void trainTextFile(File file) throws IOException {
    String url = Files.fileToURLName(file);
    InputSource in = new InputSource(url);
    mTextParser.parse(in);
}
Gzipped files are only a bit more complex, constructing their input source through the streamed gzipped decoder:
void trainGZipFile(File file) throws IOException {
    System.out.println("# Training gzip file=" + file
                       + " [char count=" + mCharCount + "]");
    FileInputStream fileIn = null;
    BufferedInputStream bufIn = null;
    GZIPInputStream gzipIn = null;
    try {
        fileIn = new FileInputStream(file);
        bufIn = new BufferedInputStream(fileIn);
        gzipIn = new GZIPInputStream(bufIn);
        InputSource in = new InputSource(gzipIn);
        mTextParser.parse(in);
    } finally {
        Streams.closeInputStream(gzipIn);
        Streams.closeInputStream(bufIn);
        Streams.closeInputStream(fileIn);
    }
}
Note that the streams are closed using the exception-swallowing close methods in the com.aliasi.util.Streams class.
Online Learning Curve Estimation
With more training data, our language models become more accurate in the sense of providing better estimates of the likelihood of a character given some previous characters. The example run above shows a plot of performance versus amount of training data and parameter settings. The n-gram model actually provides n-gram estimates for all n up to the maximum specified in the model's constructor. The hyperparameters are given as comma-separated values on the command line; in the demo target, -lambdaFactors=2.0, 8.0, 32.0.
Technical Note
These parameters determine the relative weight of higher-order n-grams versus lower-order n-grams. The interpolation also depends on the number of training events seen for a context (the more, the higher the weight) and the number of outcomes seen for a given context (the more, the lower the weight for that context). The Javadoc contains precise mathematical definitions. We need to run some experiments in order to tune the hyperparameter setting.
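Schematically, the interpolated estimate has roughly the following form (this is a sketch of the scheme just described; see the NGramProcessLM class documentation for the authoritative definitions). Writing d for the context, d' for the context with its first character dropped, count(d) for the number of training events observed in context d, o(d) for the number of distinct outcomes seen in context d, and L for the lambda factor:

\hat{P}(c \mid d) = \lambda(d)\,P_{\mathrm{ML}}(c \mid d) + (1 - \lambda(d))\,\hat{P}(c \mid d')

\lambda(d) = \frac{\mathrm{count}(d)}{\mathrm{count}(d) + L \cdot o(d)}

Larger lambda factors L push weight toward the lower-order estimates; more training events for a context push weight toward the higher-order maximum-likelihood estimate.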
The hyperparameters are extracted into a double[]-typed member variable mLambdas. The full handle(CharSequence) method for the training handler is:
public void handle(CharSequence cSeq) {
    char[] cs = cSeq.toString().toCharArray();
    int start = 0;
    int length = cSeq.length();
    for (int i = 1; i <= length; ++i) {
        ++mCharCount;
        if (mCharCount > mMaxTrainingCharCount)
            exit();
        if ((mCharCount % mSampleFrequency) != 0)
            continue;
        for (int j = 0; j < mLambdas.length; ++j)
            mSamples[j][mSampleIndex]
                = -mLM.log2ConditionalEstimate(cs,start,i,mNGram,mLambdas[j]);
        ++mSampleIndex;
        if (mSampleIndex == mSamples[0].length) {
            report();
            mSampleIndex = 0;
        }
    }
    mLM.train(cSeq);
}
This method simply iterates through the characters passed in before performing the training. For each character, it increments the overall character counter mCharCount. The second line checks whether we've exceeded the maximum count and calls the top-level exit() method if so. The next line simply skips ahead if the character count is not an even multiple of the sample frequency. Finally, for characters being sampled, an estimate is computed for each hyperparameter value. The method called on the language model is:
mLM.log2ConditionalEstimate(cs,start,i,mNGram,mLambdas[j])
This returns the log (base 2) of the estimate of character cs[i-1] given the characters cs[start]...cs[i-2], with the global n-gram length and the j-th hyperparameter.
These estimates are stored in the two-dimensional double array mSamples, indexed by hyperparameter and sample index. The sample index counter is incremented, and if the sample buffer is full (the sample index is equal to the length of the array), a report is generated and the sample index is reset. Finally, after all the estimation is carried out on the characters being handled, they are used for training.
void report() {
    print(Long.toString(mCharCount));
    print(", " + ((System.currentTimeMillis() - mStartTime)/1000L));
    long totalMem = mRuntime.totalMemory();
    long freeMem = mRuntime.freeMemory();
    print(", " + totalMem/1000000L);
    print(", " + freeMem/1000000L);
    print(", " + (totalMem-freeMem)/1000000L);
    for (int i = 0; i < mLambdas.length; ++i) {
        double xEntropy = Statistics.mean(mSamples[i]);
        double dev = Statistics.standardDeviation(mSamples[i]);
        print(", " + decimalFormat(xEntropy) + "," + decimalFormat(dev));
    }
    println("");
}
This method produces a single report line. First comes the number of characters, then the elapsed time in seconds. The next three entries are total, free, and used memory in megabytes. Then, for each hyperparameter, the mean and deviation of the samples are computed using the methods in com.aliasi.stats.Statistics. The mean of the negated log (base 2) sample estimates is known as the sample cross-entropy. The print(String) method called here simply prints to both standard output and the report file.
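In symbols, for N sampled characters c_1, ..., c_N with contexts d_1, ..., d_N, the reported mean is the sample cross-entropy

H = -\frac{1}{N} \sum_{i=1}^{N} \log_2 \hat{P}(c_i \mid d_i)

Lower values indicate better predictions; 2^H is the corresponding perplexity.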
Final Counts Structure Report
The n-gram language models work by keeping track of all substring counts up to the specified length. These are stored in a trie structure, TrieCharSeqCounter, for easy access. The second report in the output above gives the n-gram counts at the end of training. There are columns for the n-gram length, the number of unique n-grams seen of that length, the total number of n-grams seen of that length, and the percentage of n-grams that had been seen before. This percentage indicates how often the character being estimated had previously been observed in the same context. For instance, the 5-gram row indicates 23,478 unique 5-grams and 65,564 total 5-grams, which means that 64.2% of the 5-gram cases being estimated after 65,000 characters had previously been seen. The reason there aren't as many longer n-grams as shorter n-grams is due to boundary conditions on the text. For instance, the text "abcde" has 4 bigrams, 3 trigrams, 2 4-grams and 1 5-gram. The code to extract this depends on a single method:
TrieCharSeqCounter counter = mLM.substringCounter();
long[][] uniqueTotals = counter.uniqueTotalNGramCount();
which extracts all the counts through the substring counter. The array is indexed by n-gram length and provides pairs of unique and total counts; they're just printed in a loop:
for (int i = 0; i < uniqueTotals.length; ++i) {
    long unique = uniqueTotals[i][0];
    long total = uniqueTotals[i][1];
    double avg = 1.0 - ((double)unique)/(double)total;
    println(i + ", " + unique + ", " + total + ", " + decimalFormat(avg));
}
Note that the single character n-gram unique count is 83, indicating that there are 83 distinct characters in the model.
Top N-grams
The next report is just the top 5 n-grams of each order with their counts. These are presented in CSV form, with quotes around each entry and all contained quotes escaped. The code to extract the top n-grams is just a single call to the character sequence counter:
TrieCharSeqCounter seqCounter = mLM.substringCounter();
for (int i = 0; i <= mNGram; ++i) {
    ObjectToCounter topNGrams = seqCounter.topNGrams(i,5);
    ...
The result is returned as an instance of com.aliasi.util.ObjectToCounter, which provides a mutable mapping from objects to counts. Among its features is the ability to extract the keys (n-grams) ordered by count:
Object[] keysByCount = topNGrams.keysOrderedByCount();
It is then a simple matter to loop over the keys and print them in comma-separated-value format (quoted, with internal quotes escaped):
for (int j = 0; j < keysByCount.length; ++j) {
    String nGram = keysByCount[j].toString();
    int count = topNGrams.getCount(keysByCount[j]);
    String csvNGram = '"' + nGram.replaceAll("\"","\\\"") + '"';
    ...
Running Larger Data Sets: Gigaword and MEDLINE
The build.xml build file contains two additional targets, medline and gigaword. For both of these demos, the data should be downloaded and stored in gzipped format, and the Ant targets modified to point to the download location containing the gzipped data files.
Time and Space
With large n-gram lengths, these jobs will take hours to run and require gigabytes of memory. For example, running MEDLINE 8-grams on the full set of MEDLINE abstracts without pruning required an 8GB Java heap. They can be run for smaller samples or shorter n-grams with less memory.
Pruning
In order to train on very large amounts of data, the models may be pruned incrementally. This is done by retrieving the sequence counter for the language model and calling its prune(int) method, which removes all sequence counts below the specified value. This usually reduces the size of the model substantially, even with the smallest possible threshold, 2. Pruning's effectiveness follows from Zipf's Law, which says that the frequency of a word (or, in our case, a sequence of characters) is inversely proportional to its rank (ordered by frequency).
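In code, given the training setup above, pruning is a two-liner; substringCounter() and prune(int) are the methods just described, with the threshold 2 being the smallest useful value:

TrieCharSeqCounter counter = mLM.substringCounter();
counter.prune(2);   // remove all substring counts below 2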
Generating Pretty Graphs
The report lines are formatted as comma-separated values (CSV) so they can be exported to a spreadsheet or to our favorite free, open-source graphing package, Gnuplot. For instance, the graphs in the whitepaper referenced below were generated this way, by combining reports for different n-gram lengths into a single plot.
Scaling Token Language Models
This demo contains code to serialize token language models to files. It also contains code to merge files produced in this way. Because neither stage requires much memory, the two functions may be used together to scale tokenized language models to arbitrary amounts of data.
Serializing Token Language Models
The serialization and merge functionality is provided in src/TokenNGramFiles.java. To illustrate the three public static methods in TokenNGramFiles, we have provided a demo main method in src/DemoTokenNGramFiles.java.
Running the Demo
The demo may be run from ant using:
$ ant tok-io
The output is provided in four gzipped files:
$ ls *.gz
temp1.lm.gz  temp12.lm.gz  temp2.lm.gz  tempIo.lm.gz
where temp1.lm.gz and temp2.lm.gz contain the counts for sequences in the even- and odd-numbered lines of the input respectively, temp12.lm.gz contains the counts merged through files, and tempIo.lm.gz is the result of reading the merged file back in and writing it out again.
The input arguments for the demo are:
<arg value="A TrieCharSeqCounter stores counts of sub strings of strings."/> <arg value="zz zz zz xx"/> <arg value="A TrieCharSeqCounter stores counts of sub strings of strings."/> <arg value="aa aa aa aa aa bb bb bb cc cc dd"/>
To inspect the output, we first unzip the files:
demos/tutorial/lm> gunzip *.gz
Here are the contents of files 1, 2 and 12:
1            | 2            | 12
13           | 12           | 25
<S> 3        | <S> 3        | <S> 6
<S> aa 1     | <S> dd 2     | <S> dd 2
<S> aa bb 1  | <S> dd ee 2  | <S> dd ee 2
<S> cc 1     | dd 4         | bb 2
<S> cc dd 1  | dd <S> 1     | cc 4
aa 1         | dd dd 1      | cc dd 3
aa bb 1      | dd dd <S> 1  | cc dd cc 2
aa bb bb 1   | dd ee 2      | dd 7
bb 2         | dd ee dd 1   | dd <S> 2
bb bb 1      | dd ee ff 1   | dd cc 2
bb bb cc 1   | ee 4         | dd ee 2
bb cc 1      | ee <S> 1     | ee 4
bb cc dd 1   | ee dd 1      |
cc 4         | ee dd dd 1   |
cc dd 3      | ee ee 1      |
cc <S> 1     | ee ee <S> 1  |
cc dd <S> 1  | ee ff 1      |
cc dd cc 2   | ee ff ee 1   |
dd 3         | ff 1         |
dd <S> 1     | ff ee 1      |
dd cc 2      | ff ee ee 1   |
dd cc <S> 1  |              |
dd cc dd 1   |              |
The counts for the even-numbered rows (0 and 2) are in column 1, those for the odd-numbered rows (1 and 3) in column 2, and the merged counts, pruned to a minimum count of 2, are shown in the third column.
Code Walkthrough
The demo code for carrying all of this out is contained in a short main method in DemoTokenNGramFiles:
public static void main(String[] args) throws IOException { ... }
The first step simply defines the tokenizer factory to use and constructs a tokenized language model:
TokenizerFactory tokenizerFactory
    = IndoEuropeanTokenizerFactory.INSTANCE;
TokenizedLM lm1 = new TokenizedLM(tokenizerFactory,3);
This will store token sequences up to length 3 based on the Indo-European tokenizer factory. Of course, other tokenizer factories, such as ones based on customized regular expressions, will return different results. The use of this tokenizer factory is just for simplicity here.
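For instance, a whitespace-based tokenizer could be swapped in through LingPipe's RegExTokenizerFactory; the regex here is just an illustrative choice:

// tokens = maximal runs of non-whitespace characters
TokenizerFactory tf = new RegExTokenizerFactory("\\S+");
TokenizedLM lmAlt = new TokenizedLM(tf,3);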
Then the language model is trained on the even-numbered input lines (counting from zero):
for (int i = 0; i < args.length; i += 2)
    lm1.handle(args[i]);
Next, the model is written to the file temp1.lm.gz using a static method from TokenNGramFiles:
File file1 = new File("temp1.lm.gz");
TokenNGramFiles.writeNGrams(lm1,file1,0,3,1,"UTF-8");
The (unzipped) contents written to temp1.lm.gz are shown in column 1 of the table above. The parameters 0,3 indicate the lengths of n-grams to store, here 0-grams to 3-grams. The 1 argument indicates the minimum count for an n-gram to be stored. The string "UTF-8" specifies UTF-8 encoded Unicode as the character encoding.
In the second step, the second LM is trained on the odd-numbered lines and written to file in the same way:
TokenizedLM lm2 = new TokenizedLM(tokenizerFactory,3);
for (int i = 1; i < args.length; i += 2)
    lm2.train(args[i]);
File file2 = new File("temp2.lm.gz");
TokenNGramFiles.writeNGrams(lm2,file2,0,3,1,"UTF-8");
After writing out the n-grams computed from alternating lines of the input, they are merged using the method TokenNGramFiles.merge():
File file12 = new File("temp12.lm.gz");
TokenNGramFiles.merge(Arrays.asList(file1,file2),
                      file12, "UTF-8", 2, false);
The merged n-grams are written (in gzipped form) to temp12.lm.gz. The arguments to the merge() method consist of a list of files to merge (which may be of any size), followed by a target file and encoding, followed by a minimum count for pruning and a boolean flag indicating whether to delete the input files after the merge completes. Here, we left the input files so they could be inspected. After unzipping, the contents of the merged file are shown in the third column of the table above. Note that any string with combined count lower than 2 is removed. The string dd <S> has count 1 in both files, which gives it a combined count of 2, so it is preserved in the output.
The fourth stage of the demo reads the combined file into a new language model using TokenNGramFiles.addNGrams():
TokenizedLM lmIo
    = new TokenizedLM(tokenizerFactory,
                      3,
                      lm1.unknownTokenLM(),
                      lm1.whitespaceLM(),
                      lm1.lambdaFactor(),
                      false);
TokenNGramFiles.addNGrams(file12,"UTF-8",lmIo,0);
File fileIo = new File("tempIo.lm.gz");
TokenNGramFiles.writeNGrams(lmIo,fileIo,1,3,2,"UTF-8");
First, a tokenized language model is created using the six-argument constructor. This is done so that the final argument can be specified as false, so that boundaries are not auto-incremented during construction. This allows exactly the counts of n-grams in the files to be read. The same tokenizer factory is used as before, and the maximum n-gram length is still 3. For convenience, we use the same unknown-token, whitespace, and interpolation parameters as in the first model (these were specified by default as constants when the two-argument constructor was used).
Next, the addNGrams() method is called; it reads the counts from the specified file using the specified encoding and adds them to the specified language model after pruning at the specified level.
For inspection, we finally write the n-grams from this new model, as before, to tempIo.lm.gz. The result is identical to temp12.lm.gz (as verified by the diff command), the contents of which are shown in the third column of the table above.
Warning on Tokenizers
Tokens must not contain Unicode characters with code points at or below U+0020 (e.g. no spaces, line feeds, or carriage returns in tokens).
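A simple defensive check along these lines may be useful before training; this helper is our own, not part of LingPipe:

// reject tokens containing characters at or below U+0020
static void validateToken(String token) {
    for (int i = 0; i < token.length(); ++i)
        if (token.charAt(i) <= '\u0020')
            throw new IllegalArgumentException("illegal char in token=" + token);
}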
MEDLINE Token N-Grams
In this section, we illustrate the class that performs map-reduce-like processing for scaling indexers, TokenNGramIndexer, the code for which is in src/TokenNGramIndexer.java.
For an interface, the class implements corpus.TextHandler, which provides a single void handle() method accepting a character slice. The text is tokenized and buffered in an in-memory language model. When an in-memory character threshold is exceeded, the model is written out to a level-zero file. When the number of files at a given level exceeds a threshold, they are merged into a file at the next level (which may trigger further merges, all of which are carried out as part of the same operation). In the end, the remaining files are merged into a single output file. In this way, token n-gram counts may be scaled to arbitrary inputs, the only limit being processing time (which may be further distributed over multiple machines, leading to a final meta-merge).
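The following sketch illustrates the leveling policy; the helper methods for tracking files are hypothetical, and only the TokenNGramFiles calls reflect the actual API:

// flush the in-memory LM to a new level-0 file, then rebalance levels
void flush() throws IOException {
    File f = nextFileAtLevel(0);                          // hypothetical helper
    TokenNGramFiles.writeNGrams(mLm,f,0,mNGramOrder,1,mEncoding);
    mLm = freshLm();                                      // hypothetical helper
    mergeLevelIfFull(0);
}

// merge a full level into a single file at the next level, cascading upward
void mergeLevelIfFull(int level) throws IOException {
    List<File> files = filesAtLevel(level);               // hypothetical helper
    if (files.size() < mMaxFilesPerLevel) return;
    File merged = nextFileAtLevel(level+1);
    TokenNGramFiles.merge(files,merged,mEncoding,1,true); // true: delete inputs
    mergeLevelIfFull(level+1);
}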
MEDLINE Demo Data
Although this code was built to count n-grams in all of MEDLINE, this demo restricts attention to the reasonably sized public samples distributed by the NLM, which comprise the current year's sample files.
For 2009, there are seven sample files, named medsamp2009a.xml.gz through medsamp2009g.xml.gz, ranging in size from 4 to 20 megabytes (gzipped); most of their content is XML that is irrelevant for this demo.
For parsing MEDLINE, we use the LingPipe package com.aliasi.medline, which contains a full object-oriented MEDLINE parser, medline.MedlineParser, which sends citations to a medline.MedlineHandler. We will simply extract the text from each citation and send it to the indexer.
Running the Demo
The demo may be run from Ant using the following command (warning: it takes about half an hour on a decent, but not state-of-the-art, notebook circa 2009):
$ ant med-n
MEDLINE Sample Data Directory=c:\data\medline\samples
Index Directory=temp-medline-idx
Data File=c:\data\medline\samples\medsamp2009a.xml.gz
Data File=c:\data\medline\samples\medsamp2009b.xml.gz
...
After the task is run, the index will be in the specified index directory in gzipped format, just like the output of the previous demos. For the 2009 MEDLINE release, that's a 25,443,818-byte gzipped file of counts at level 5_1. It may be read back into models, as sketched below.
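Reading the final index back into a model uses TokenNGramFiles.addNGrams() just as in the earlier demo; the file name below is illustrative, since the actual name depends on the index directory contents:

// use the six-argument TokenizedLM constructor with a false final argument,
// as in the previous demo, if exact counts must be preserved
TokenizedLM lm = new TokenizedLM(IndoEuropeanTokenizerFactory.INSTANCE,5);
File indexFile = new File("temp-medline-idx/5_1.lm.gz");  // illustrative name
TokenNGramFiles.addNGrams(indexFile,"UTF-8",lm,0);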
Code Walkthrough
The MEDLINE-specific portion of the demo is in src/MedlineNGrams.java.
The main method starts by setting up the directories and parameters for the token n-gram indexer:
public static void main(String[] args)
    throws IOException, SAXException {

    File medlineDir = new File(args[0]);
    File indexDir = new File(args[1]);
The data directory containing the gzipped MEDLINE files and the index directory to which files are written are given as the first two command-line arguments. Next, the other parameters are assigned:
int maxCharsBuffered = 4 * 1024 * 1024;
int maxFilesPerLevel = 2;
TokenizerFactory tokenizerFactory
    = IndoEuropeanTokenizerFactory.INSTANCE;
int nGramOrder = 5;
String encoding = "UTF-8";
Then the indexer itself is created, based on all of these parameters:
final TokenNGramIndexer indexer
    = new TokenNGramIndexer(maxCharsBuffered, maxFilesPerLevel,
                            tokenizerFactory, nGramOrder,
                            indexDir, encoding);
We also create a sentence chunker before parsing; like the indexer, it is declared final so it can be used in an upcoming anonymous inner class:
final Chunker sentenceChunker
    = new SentenceChunker(tokenizerFactory,
                          new MedlineSentenceModel());
Note the use of a special model for MEDLINE sentence chunking, given in com.aliasi.sentences.MedlineSentenceModel.
The work is done in a MEDLINE citation handler, with a handle method and a do-nothing delete method, which will not be called here:
MedlineHandler medlineHandler = new MedlineHandler() {
    public void handle(MedlineCitation citation) {
        ...
    }
    public void delete(String id) {
    }
};
The handle method just grabs the text of each sentence from the title and body and sends it to the indexer:
Article article = citation.article();
String title = article.articleTitleText();
char[] cs = title.toCharArray();
indexer.handle(cs,0,cs.length);

Abstract abstrct = article.abstrct();
if (abstrct == null)
    return;
String text = abstrct.textWithoutTruncationMarker();
Chunking sentenceChunking = sentenceChunker.chunk(text);
Set<Chunk> sentenceChunks = sentenceChunking.chunkSet();
for (Chunk chunk : sentenceChunks) {
    int start = chunk.start();
    int end = chunk.end();
    String sentence = text.substring(start,end);
    cs = sentence.toCharArray();
    indexer.handle(cs,0,cs.length);
}
...
The rest of the code simply parses the files out of their gzipped distribution format and sends them to LingPipe's MEDLINE parser, with the handler defined above attached. It looks as follows, with the gory details of selecting the file, creating the input source, and cleaning up in an orderly fashion elided:
MedlineParser parser = new MedlineParser(includeRawXml);
parser.setHandler(medlineHandler);
for (File file : ...) {
    ...
    InputSource in = new InputSource(...);
    in.setEncoding("UTF-8");
    parser.parse(in);
}
indexer.close();

int minFinalCount = 2;
indexer.optimize(minFinalCount);
Token N-Gram Indexer
The rest of the work is done in the token n-gram indexer, the code for which is in src/TokenNGramIndexer.java. It contains the methods required to buffer n-grams in a tokenized LM and write them to the file system when necessary. It also carries out merges if the number of files at a given level would exceed the maximum allowed. All of these operations simply call the token n-gram file commands in TokenNGramFiles.
References
- Bob Carpenter. 2005. Scaling high-order character language models to gigabytes. Proceedings of the 2005 Association for Computational Linguistics Software Workshop. [All the gory details, with comparisons to other approaches.]
- Chris Manning and Hinrich Schuetze. 2000. Foundations of Statistical Natural Language Processing. MIT Press. Chapter 6: N-gram models over sparse data.