What is Part-of-Speech Tagging?
Part-of-speech tagging is a process whereby tokens are sequentially labeled with syntactic labels, such as "finite verb" or "gerund" or "subordinating conjunction". This tutorial shows how to train a part-of-speech tagger and compile its model to a file, how to load a compiled model from a file and perform part-of-speech tagging, and finally, how to evaluate and tune models.
What is Phrase Chunking?
Phrase chunking is the process of recovering the
phrases (typically base noun phrases and verb phrases)
constructed by the part-of-speech tags. For instance,
in the sentence John Smith will eat the beans.,
there is a proper noun phrase John Smith,
a verb phrase will eat and a
common noun phrase, the beans. Note that
this notion of phrase may not line up with any theoretically
motivated linguistic analysis.
In the second part of this tutorial, we show how to generate phrase chunkings based on a part-of-speech tagger.
Downloading Training Corpora
We use three different freely downloadable English corpora as examples, though the entire tutorial may be completed with only the MedPost corpus:
POS Corpora

| Corpus | Domain | Tags | Toks | Parser | Link >> target file(s) |
|---|---|---|---|---|---|
| Brown | Balanced | 93 | 1.1M | BrownPosParser | nltk-data-0.3.zip >> nltk-data-0.3/brown.zip |
| GENIA | Biomed | 48 | 501K | GeniaPosParser | GENIA 3.02p >> GENIAcorpus3.02p-1.tgz GENIAcorpus3.02.pos.txt |
| MedPost | Biomed | 63 | 182K | MedPostPosParser | medtag.tar.gz >> medtag/medpost/tag*.ioc |
Each corpus is listed with a link to the JavaDoc for the
corresponding parser class in com.aliasi.corpus.parsers,
such as BrownPosParser for the Brown corpus. The class
documentation for these parsers details the corpus contents including
domain and POS tag set, the corpus format, and provides links to
download and get more information about the corpus.
The last column provides a download link, as well as an
indication of the target file(s) for training. For instance,
the link to the NLTK distribution of the Brown corpus
should be unzipped and the file nltk-data-0.3/brown.zip
is the relevant file for training. These names are reflected
in the build.xml file for this tutorial.
Training Part-of-Speech Models
Like the other statistical packages in LingPipe (e.g. named entity detection, language-model classifiers, and spelling correction), part-of-speech labeling is based on statistical models that are trained from a corpus of labeled data. For this illustration, we use the MedPost data, which is available from the link above.
The Training Corpus
We downloaded the data to the directory
/data1/data/medtag/medpost:
> ls /data1/data/medtag/medpost
medpost.db   tag_mb01.ioc  tag_mb04.ioc  tag_mb07.ioc  tag_mb10.ioc
medpost.sql  tag_mb02.ioc  tag_mb05.ioc  tag_mb08.ioc  tag_mb.ioc
tag_cl.ioc   tag_mb03.ioc  tag_mb06.ioc  tag_mb09.ioc  tag_ml01.ioc
The files ending with the suffix .ioc make up the
text-formatted actual corpus of training files. For example, using
the tail command to print the last few lines of the
training file tag_mb.ioc produces (with our ellipses
to shorten lines):
> tail /data1/data/medtag/medpost/tag_mb.ioc
P12569660A06
The_DD N-terminal_JJ region_NN had_VHD high_JJ homology_NN with_II ...
P12571010A13
Several_JJ sequences_NNS were_VBD identified_VVN in_II the_DD libra...
P12576309A07
Our_PNG findings_NNS indicate_VVB that_CST CRCL_NN has_VHZ prominen...
P12582233A05
The_DD corresponding_VVGJ mRNA_NN of_II 3.5_MC kb_NN is_VBZ compose...
P12586375A07
A_DD few_JJ examples_NNS of_II heterologous_JJ expression_NN of_II ...
The data is formatted with a PubMed identifier
(e.g. P12569660) and sentence position
(e.g. A06) on their own line, followed by the text of the
sentence represented as a sequence of token/tag pairs, such as
The_DD, which indicates that the token The
is assigned the determiner part-of-speech DD.
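To make the token/tag format concrete, here is a minimal sketch of splitting one sentence line into pairs. It is not the code in MedPostPosParser; the class name TokenTagPairs is made up for illustration, and it assumes that the last underscore in each item separates the token from its tag.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenTagPairs {

    /** Splits a line such as "The_DD region_NN" into {token, tag} pairs. */
    static List<String[]> parse(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : line.trim().split("\\s+")) {
            // Assumption: the LAST underscore separates token from tag.
            int i = item.lastIndexOf('_');
            if (i < 0) continue; // not a token_tag item (e.g. an empty line)
            pairs.add(new String[] { item.substring(0, i), item.substring(i + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : parse("The_DD N-terminal_JJ region_NN had_VHD"))
            System.out.println(pair[0] + "\t" + pair[1]);
    }
}
```

The identifier lines (such as P12569660A06) contain no underscore, so this sketch would simply yield no pairs for them.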
Training a Model
Given the location of the data directory, the training program src/TrainMedPost.java can be run using the ant task:
> ant -Ddata.pos.medpost=/data1/data/medtag/medpost train-medpost
The -D option sets a system property that is picked up by
Ant to indicate the location of the MedPost data directory. The
path in the above command, /data1/data/medtag/medpost, should be
replaced with the path to where you unpacked the MedPost distribution.
The output produced is:
Buildfile: build.xml
compile:
train-medpost:
[java] Training file=/data1/data/medtag/medpost/tag_cl.ioc
[java] Training file=/data1/data/medtag/medpost/tag_mb.ioc
[java] Training file=/data1/data/medtag/medpost/tag_mb01.ioc
[java] Training file=/data1/data/medtag/medpost/tag_mb02.ioc
[java] Training file=/data1/data/medtag/medpost/tag_mb03.ioc
[java] Training file=/data1/data/medtag/medpost/tag_mb04.ioc
[java] Training file=/data1/data/medtag/medpost/tag_mb05.ioc
[java] Training file=/data1/data/medtag/medpost/tag_mb06.ioc
[java] Training file=/data1/data/medtag/medpost/tag_mb07.ioc
[java] Training file=/data1/data/medtag/medpost/tag_mb08.ioc
[java] Training file=/data1/data/medtag/medpost/tag_mb09.ioc
[java] Training file=/data1/data/medtag/medpost/tag_mb10.ioc
[java] Training file=/data1/data/medtag/medpost/tag_ml01.ioc
BUILD SUCCESSFUL
Total time: 10 seconds
and creates a model file of roughly 5MB:
> ls -l medpost.model
-rw-rw-r-- 1 carp carp 4974338 Sep 20 14:02 medpost.model
The Training Code
The actual code making up the sample is an almost trivial sequence of
lines in a single main(String[]) method. First, it
creates an estimator for a hidden Markov model (HMM):
HmmCharLmEstimator estimator
= new HmmCharLmEstimator(N_GRAM, NUM_CHARS, LAMBDA_FACTOR);
The parameters are for the HMM and determine how many characters to use as the basis for the model, the total number of characters, and an interpolation parameter for smoothing:
static int N_GRAM = 8;
static int NUM_CHARS = 256;
static double LAMBDA_FACTOR = 8.0;
These are reasonable default settings for English data. Their behavior is outlined in hmm.HmmCharLmEstimator and described in detail in lm.NGramBoundaryLM.
This estimator implements the corpus.TagHandler interface, through which it
receives its training events in the form of aligned
token/whitespace/tag arrays. The handler acts as a visitor over the
training data. It is escorted by a parser, which does the actual data
parsing and then provides the arrays to the handler. The parser for
this demo is an instance of corpus.parsers.MedPostPosParser. The code to set it
up and make sure the taggings it extracts are sent to the estimator is
just two lines:
Parser parser = new MedPostPosParser();
parser.setHandler(estimator);
The next step is to find the actual training files and walk over them.
This is done with an instance of io.FileExtensionFilter that is set to pick out just the files
ending in "ioc":
File dataDir = new File(args[0]);
File[] files = dataDir.listFiles(new FileExtensionFilter("ioc"));
We then loop over the files, parsing each one:
for (int i = 0; i < files.length; ++i) {
System.out.println("Training file=" + files[i]);
parser.parse(files[i]);
}
That's it. At this point, the estimator is trained. Because
the HmmCharLmEstimator class implements
hmm.HiddenMarkovModel, it may be used
immediately to do part-of-speech tagging. Rather than do that
in the tutorial, instead we demonstrate how to compile the
model to a file using an object output stream:
File modelFile = new File(args[1]);
FileOutputStream fileOut = new FileOutputStream(modelFile);
ObjectOutputStream objOut = new ObjectOutputStream(fileOut);
estimator.compileTo(objOut);
Streams.closeOutputStream(objOut);
HMM estimators implement the util.Compilable interface, which is used to write them
to an object output. Note that the method util.Streams.closeOutputStream(OutputStream) is used to
close the output stream; in more robust settings this would be done
inside a try/finally block to make sure the streams were
closed.
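The more robust closing pattern mentioned above can be sketched with plain java.io. This sketch assumes Java 7 or later, where try-with-resources plays the role of the try/finally block, and it serializes an ordinary HashMap as a stand-in for a compiled model; it does not call LingPipe's compileTo, and the class name ModelRoundTrip is hypothetical.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;

public class ModelRoundTrip {

    /** Writes a serializable object; the stream is closed even if writing fails. */
    static void write(File file, Serializable model) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(model);
        }
    }

    /** Reads the object back; the stream is closed even if reading fails. */
    static Object read(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, String> standIn = new HashMap<>(); // stand-in for a model
        standIn.put("The", "DD");
        File file = File.createTempFile("model", ".bin");
        file.deleteOnExit();
        write(file, standIn);
        System.out.println(read(file));
    }
}
```

The same shape applies on the reading side of the tutorial, where the compiled model is read back with readObject() and cast.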
Running Part-of-Speech Taggers
Now that we have a model file, we can use it to assign part-of-speech
tags to phrases. The code to run a compiled model is in src/RunMedPost.java. We
set it up to run interactively, which doesn't play very nicely with
Ant (it plays a bit better with a println() rather than
print() for the prompt). Instead, we can call it from
the command-line directly:
> java -cp build/classes:../../../lingpipe-3.7.0.jar RunMedPost medpost.model
The demo prompts for input sentences and then returns their part-of-speech tagging on the next line in the same form as the input was found (with line-breaks inserted for readability):
Reading model from file=medpost.model
INPUT (return)> A good correlation was found between the
grade of Barrett's esophagus dysplasia and high p53 positivity.
A_DD good_JJ correlation_NN was_VBD found_VVN between_II
the_DD grade_NN of_II Barrett's_NNP esophagus_NN dysplasia_NN
and_CC high_JJ p53_NN positivity_NN ._.
INPUT (return)> This correlation was also confirmed by
detection of early carcinoma in patients with "preventive"
extirpation of the esophagus due to a high-grade dysplasia.
This_DD correlation_NN was_VBD also_RR confirmed_VVN by_II
detection_NN of_II early_JJ carcinoma_NN in_II patients_NNS
with_II "_`` preventive_JJ "_'' extirpation_NN of_II the_DD
esophagus_NN due_II+ to_II a_DD high-grade_NN dysplasia_NN ._.
INPUT (return)> exit
Note that an empty line or the term exit will cause the
system to exit gracefully.
Tokenization
The code to do decoding is nearly as simple as that to do training.
First, our input will come in the form of lines and we thus need to
tokenize the input to break it down into chunks similar to those in the
training data. We do this by creating a tokenizer factory that
defines a token as the longest contiguous non-empty sequence of letter
characters (\p{L}), numerals (\d), hyphens
(-) and apostrophes ('); it also allows
single non-whitespace characters (\S).
static TokenizerFactory TOKENIZER_FACTORY
= new RegExTokenizerFactory("(-|'|\\d|\\p{L})+|\\S");
In general, the trick with pre-tokenized corpora is developing a tokenizer to match the corpus. The above tokenizer is only an approximate guess as to what the real MedPost tokenizer looks like. The MMTx Tokenization page points to NLM's tokenizer, hints that it's highly heuristic and context sensitive, but does not provide a grammar for it.
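To see what the regular expression above accepts, the following sketch applies the same pattern with java.util.regex directly. It assumes that tokens correspond to successive matches of the pattern, which is how we read the description above; it is not the RegExTokenizerFactory implementation, and the class name RegexTokens is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokens {

    // Same expression as passed to RegExTokenizerFactory above.
    static final Pattern TOKEN = Pattern.compile("(-|'|\\d|\\p{L})+|\\S");

    /** Returns the successive matches of the token pattern in the text. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) tokens.add(matcher.group());
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Barrett's, esophagus, ,, high-grade, p53, .]
        System.out.println(tokenize("Barrett's esophagus, high-grade p53."));
    }
}
```

Note that hyphenated and apostrophized forms stay together as single tokens, while each punctuation character becomes its own token.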
Reading the Model
Actually reading in the model and constructing the decoder requires just a bit of stream manipulation, casting and wrapping:
FileInputStream fileIn = new FileInputStream(args[0]);
ObjectInputStream objIn = new ObjectInputStream(fileIn);
HiddenMarkovModel hmm = (HiddenMarkovModel) objIn.readObject();
Streams.closeInputStream(objIn);
HmmDecoder decoder = new HmmDecoder(hmm);
An object input stream wraps a file input stream that points to the
model (args[0] on the command line). The HMM is then
just read using the standard
java.io.ObjectInput.readObject() method; this method may
throw an IOException or a
ClassNotFoundException. The object read from the stream
is then cast to an instance of
com.aliasi.hmm.HiddenMarkovModel. The input stream is
closed; again, a robust approach would do this in a
finally block. Finally, the decoder is created by
wrapping the HMM read from the input stream.
Standard Input Loop
The next part of the code just goes into a loop to read characters a line at a time from the standard input and quit if there's no input or the input is a command to stop:
InputStreamReader isReader = new InputStreamReader(System.in);
BufferedReader bufReader = new BufferedReader(isReader);
while (true) {
System.out.println("\n\nINPUT (return)> ");
System.out.flush();
String line = bufReader.readLine();
if (line == null || line.length() < 1
|| line.equalsIgnoreCase("quit") || line.equalsIgnoreCase("exit"))
break;
char[] cs = line.toCharArray();
...
The real work then happens once we have the characters cs
to tag. First, we generate the array of tokens:
...
Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(cs,0,cs.length);
String[] tokens = tokenizer.tokenize();
First-Best Results
With the tokens in hand, retrieving the first-best tags is trivial:
String[] tags = decoder.firstBest(tokens);
We then just print these out in the same format as the training data in a simple loop (actually, a bit more complicated because of pretty printing):
for (int i = 0; i < tokens.length; ++i)
System.out.print(tokens[i] + "_" + tags[i] + " ");
N-best Results
The following code will print the best analyses up to the
maximum number of analyses MAX_N_BEST (modulo
a little padding and decimal formatting):
Iterator nBestIt = decoder.nBest(tokens);
for (int n = 0; n < MAX_N_BEST && nBestIt.hasNext(); ++n) {
ScoredObject tagScores = (ScoredObject) nBestIt.next();
double score = tagScores.score();
String[] tags = (String[]) tagScores.getObject();
System.out.print(n + ") " + score + ": ");
for (int i = 0; i < tokens.length; ++i)
System.out.print(tokens[i] + "_" + tags[i] + " ");
}
Here's an example of the demo's print out for the above code with an input that's a shortened form of the one above.
INPUT> This correlation was also confirmed by detection of early carcinoma.
...
N BEST
# JointLogProb Analysis
0 -90.265 This_DD correlation_NN was_VBD also_RR confirmed_VVN by_II detection_NN of_II early_JJ carcinoma_NN ._.
1 -94.072 This_DD correlation_NN was_VBD also_RR confirmed_VVD by_II detection_NN of_II early_JJ carcinoma_NN ._.
2 -99.905 This_PND correlation_NN was_VBD also_RR confirmed_VVN by_II detection_NN of_II early_JJ carcinoma_NN ._.
3 -101.574 This_DD correlation_NN was_VBD also_RR confirmed_VVN by_II detection_NN of_II early_RR carcinoma_NN ._.
4 -102.253 This_DD correlation_NN was_VBD also_RR confirmed_VVN by_II detection_NN of_II early_NN carcinoma_NN ._.
The method HmmDecoder.nBest(String[]) returns an iterator over
the top scoring tag sequences for the specified tokens. The iterator
produces instances of LingPipe's util.ScoredObject class. This class simply
encapsulates an object with a double-based score. The
score is retrieved with the score() method and the object
with the getObject() method. The object in this case is
an array of tags. The score consists of a joint log (base 2)
probability for the tags and tokens together. Another method, HmmDecoder.nBestConditional(String[]) returns the
tag sequences in the same order but provides a conditional log (base
2) probability of the tag sequence given the tokens rather than a
joint probability.
The difference between the probabilities in the n-best analyses
gives a good first approximation of the tagger's confidence in its tag
assignments. In the example above, note that the first-best analysis
has a log (base 2) joint probability of -90.2 whereas the second
ranking analysis is at -94.1; this means that the model estimates the
first answer as being 2^3.9 (roughly 15) times more likely
than the second. For strings that are more confusable to the tagger,
the gap will be narrower. For strings in which the tagger is highly
confident of the total tagging, the gap will be wider. Further note
that the only difference in the second analysis is in the form of the
verb "confirmed". The third analysis, ranked as almost 1000
times less likely than the first, only varies from the first in
assigning "This" to a pronoun rather than a determiner.
Looking at the positions that vary also gives you a measure of
confidence on a tag-by-tag basis. In this case, it's clear the
analyzer is very sure of its analysis of all but two tokens.
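The likelihood ratios discussed above follow directly from the difference of the base-2 log scores. The sketch below works that arithmetic through; the class and method names are made up for illustration. With the three-decimal scores from the listing, the first ratio comes out near 14, slightly below the estimate obtained from the rounded scores in the text.

```java
public class LogOdds {

    /** How many times more likely a is than b, given base-2 log probabilities. */
    static double timesMoreLikely(double log2A, double log2B) {
        return Math.pow(2.0, log2A - log2B);
    }

    public static void main(String[] args) {
        // Scores of the top three analyses in the n-best listing above.
        System.out.println("1st vs 2nd: " + timesMoreLikely(-90.265, -94.072));
        System.out.println("1st vs 3rd: " + timesMoreLikely(-90.265, -99.905));
    }
}
```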
Confidence-Based Results
In addition to extracting n-best results one at a time, the entire
statistical analysis can be returned in one go through the HmmDecoder.lattice(String[]) method. This method
returns an hmm.TagWordLattice object. For those familiar with
HMM decoding, this is quite simply the lattice of forward/backward
scores with boundary conditions. Despite being one of the more
complex data structures in LingPipe, it looks even bigger than it is because
the methods come in pairs, returning probabilities and log probabilities.
The code in the demo to print out confidences for tag assignments to individual tokens is also quite simple (again modulo formatting):
TagWordLattice lattice = decoder.lattice(tokens);
for (int tokenIndex = 0; tokenIndex < tokens.length; ++tokenIndex) {
ScoredObject[] tagScores = lattice.scoredTags(tokenIndex);
System.out.print(tokenIndex + "_" + tokens[tokenIndex]);
for (int i = 0; i < 5; ++i) {
double conditionalProb = Math.pow(2.0,tagScores[i].score());
String tag = (String) tagScores[i].getObject();
System.out.print(" " + conditionalProb + ":" + tag);
}
System.out.println();
}
The conditional prob is returned in log form as a score and
thus the exponentiation Math.pow(2,tagScores[i].score())
returns it to a normal (linear) scale.
Run on our simplified demo sentence, this produces the following output, consisting of a row for each token, with its highest-confidence tags and their conditional probabilities.
INPUT> This correlation was also confirmed by detection of early carcinoma.
CONFIDENCE
# Token (Prob:Tag)*
0 This 0.999:DD 0.001:PND 0.000:PNG 0.000:NN
1 correlation 1.000:NN 0.000:RR 0.000:NNS 0.000:VVN
2 was 1.000:VBD 0.000:NNS 0.000:VVZ 0.000:II
3 also 1.000:RR 0.000:PND 0.000:VVN 0.000:JJR
4 confirmed 0.933:VVN 0.067:VVD 0.000:VVNJ 0.000:VVB
5 by 1.000:II 0.000:NN 0.000:RR 0.000:JJ
6 detection 1.000:NN 0.000:VVGN 0.000:VVI 0.000:VVB
7 of 1.000:II 0.000:VVZ 0.000:RR 0.000:MC
8 early 0.999:JJ 0.000:RR 0.000:NN 0.000:VVGJ
9 carcinoma 1.000:NN 0.000:NNS 0.000:JJ 0.000:VVGN
10 . 1.000:. 0.000:) 0.000:NN 0.000:,
The decoder is at least 99.9% sure of its estimates in all cases but for the
form of the verb "confirmed", for which it estimates a 93.3%
probability of the tag being VVN, reserving 6.7% for
the probability that it is VVD. In fact, the tagger picked up
on a fundamental ambiguity of English verbs between simple past and
past participles. This case is confusing to a bigram HMM decoder
(like ours), because the previous word is "also", with
the modifier tag RR; this doesn't disambiguate. We'd
need to go back to the auxiliary "was" with category
VBD.
Note that the ratio of probabilities from the confidence-based results (0.933/0.067, roughly 14) is very close to the estimate given by inspecting the top two full analyses in the n-best results. This is due to a deep mathematical link that says the confidences are equal to the limit of summing over the n-best analyses for an unlimited n (as opposed to just comparing the top two).
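The link between n-best scores and per-token confidences can be sketched by summing over the analyses we have. The code below takes the five joint log scores from the n-best listing, groups them by the tag each analysis assigns to "confirmed", and normalizes. It is only an approximation under the assumption that analyses beyond the top five contribute negligible mass; the class and method names are hypothetical, and this is not how TagWordLattice computes its values (it uses forward/backward scores).

```java
public class NBestConfidence {

    /**
     * Share of the listed probability mass held by analyses whose tag at the
     * position of interest equals the given tag. Scores are base-2 log joint
     * probabilities, one per analysis.
     */
    static double confidence(double[] log2Joint, String[] tagAtPosition, String tag) {
        double max = Double.NEGATIVE_INFINITY;
        for (double lp : log2Joint) max = Math.max(max, lp);
        double withTag = 0.0;
        double total = 0.0;
        for (int n = 0; n < log2Joint.length; ++n) {
            // Subtract the max before exponentiating to avoid underflow.
            double p = Math.pow(2.0, log2Joint[n] - max);
            total += p;
            if (tagAtPosition[n].equals(tag)) withTag += p;
        }
        return withTag / total;
    }

    public static void main(String[] args) {
        // Scores and the tag of "confirmed" in the five analyses listed earlier.
        double[] scores = { -90.265, -94.072, -99.905, -101.574, -102.253 };
        String[] confirmedTag = { "VVN", "VVD", "VVN", "VVN", "VVN" };
        System.out.println("VVN " + confidence(scores, confirmedTag, "VVN"));
        System.out.println("VVD " + confidence(scores, confirmedTag, "VVD"));
    }
}
```

Even with only five analyses, the two shares land close to the 0.933 and 0.067 reported in the confidence output.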
Breaking down the code, the key method to compute the confidences
is the first one called, HmmDecoder.lattice(String[]):
TagWordLattice lattice = decoder.lattice(tokens);
This returns what is known as a forward-backward lattice in the HMM decoder literature, as an instance of hmm.TagWordLattice.
To extract a confidence-ordered list of tags for a particular token index, we use:
for (int tokenIndex = 0; tokenIndex < tokens.length; ++tokenIndex) {
ScoredObject[] tagScores = lattice.scoredTags(tokenIndex);
...
Then, given the array of scored objects, we iterate over it and print the top 5 tags along with their scores:
...
for (int i = 0; i < 5; ++i) {
double logProb = tagScores[i].score();
double conditionalProb = Math.pow(2.0,logProb);
String tag = (String) tagScores[i].getObject();
...
}
}
The only issue to note is that the log probability is converted to a
normal (linear) probability using
java.lang.Math.pow(double,double).
Evaluating and Tuning Tagging Models
In the final part of this tutorial, we show how to evaluate HMM part-of-speech models and how to tune their parameters.
Running a Part-of-Speech Evaluation
In this section, we show how to dump out a large, but by no means comprehensive, set of statistics on part-of-speech tagging.
Train-a-Little, Evaluate-a-Little
The way in which we will evaluate is to train-a-little and evaluate-a-little. Specifically, as we parse the corpus we extract reference taggings. We then take the underlying text and extract taggings with our model for a response tagging and then add it to a cumulative evaluation. After evaluating on a reference tagging, we add it as training data before evaluating the next sentence of data.
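The control flow of this scheme can be sketched without any LingPipe classes. In the sketch below a toy most-frequent-tag lookup stands in for the HMM estimator and decoder, and unknown tokens are guessed to be NN; both are assumptions made purely to keep the example self-contained. The point is the order of operations: each reference tagging is scored first and only then added to the training data.

```java
import java.util.HashMap;
import java.util.Map;

public class TrainALittleEvalALittle {

    // token -> (tag -> count); a toy stand-in for the HMM estimator
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    int toks = 0, toksCorrect = 0, cases = 0, casesCorrect = 0;

    void train(String[] tokens, String[] tags) {
        for (int i = 0; i < tokens.length; ++i)
            counts.computeIfAbsent(tokens[i], k -> new HashMap<>())
                  .merge(tags[i], 1, Integer::sum);
    }

    String tag(String token) {
        Map<String, Integer> tagCounts = counts.get(token);
        if (tagCounts == null) return "NN"; // unknown token: assumed default
        String best = null;
        for (Map.Entry<String, Integer> e : tagCounts.entrySet())
            if (best == null || e.getValue() > tagCounts.get(best)) best = e.getKey();
        return best;
    }

    /** Scores the model on the reference tagging, then trains on it. */
    void evalThenTrain(String[] tokens, String[] refTags) {
        boolean allCorrect = true;
        for (int i = 0; i < tokens.length; ++i) {
            ++toks;
            if (tag(tokens[i]).equals(refTags[i])) ++toksCorrect;
            else allCorrect = false;
        }
        ++cases;
        if (allCorrect) ++casesCorrect;
        train(tokens, refTags); // only now does the model see this sentence
    }

    public static void main(String[] args) {
        TrainALittleEvalALittle e = new TrainALittleEvalALittle();
        e.train(new String[] {"the", "study"}, new String[] {"DD", "NN"});
        e.evalThenTrain(new String[] {"the", "value"}, new String[] {"DD", "NN"});
        e.evalThenTrain(new String[] {"the", "red", "value"},
                        new String[] {"DD", "JJ", "NN"});
        System.out.println("Tok Acc=" + (double) e.toksCorrect / e.toks
                           + " Case Acc=" + (double) e.casesCorrect / e.cases);
    }
}
```

Because every sentence is scored before it is learned from, the whole corpus can serve as test data without the model ever being evaluated on a sentence it has already seen.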
Running an Evaluation
For us, with the MedPost corpus in
/data1/data/medtag/medpost, we run the MedPost evaluation
with the following invocation of Ant:
ant -Dmedpost-dir=/data1/data/medtag/medpost eval-medpost
The output begins with a dump of the parameters of the evaluation:
COMMAND PARAMETERS
Sent eval rate=1
Toks before eval=170000
Max n-best eval=100
Max n-gram=8
Num chars=256
Lambda factor=8.0
and then collects data from the corpus itself in a first-pass run-through and prints it out.
CORPUS PROFILE:
Corpus class=MedPostPosCorpus
#Sentences=6700
#Tokens=182399
#Tags=63
Tags=['', (, ), ,, ., :, CC, CC+, CS, CS+, CSN,
CST, DB, DD, EX, GE, II, II+, JJ, JJ+,
JJR, JJT, MC, NN, NN+, NNP, NNS, PN,
PND, PNG, PNR, RR, RR+, RRR, RRT, SYM,
TO, VBB, VBD, VBG, VBI, VBN, VBZ, VDB,
VDD, VDN, VDZ, VHB, VHD, VHG, VHI, VHZ,
VM, VVB, VVD, VVG, VVGJ, VVGN, VVI,
VVN, VV]
It first trains on 170,000 tokens (see the Toks before
eval figure above). This takes about ten seconds on my desktop
machine.
The rest of the output consists of evaluation case reports and cumulative evaluation reports. These are printed out per evaluation case. The following example is for the seventh evaluation sentence:
Test Case 7
In II | II 0.975:II 0.021:NN 0.002:JJ 0.001:VVNJ 0.001:CS
patients NNS | NNS 1.000:NNS
with II | II 1.000:II
chronic JJ | JJ 1.000:JJ
pure JJ | NN XX 0.580:NN 0.419:JJ
red JJ | VVD XX 0.435:VVNJ 0.264:VVD 0.210:JJ 0.074:NN 0.011:VVN 0.002:VVGJ 0.002:VVB 0.001:VVZ
cell NN | NN 0.996:NN 0.004:JJ
? aplasia NN | NN 0.998:NN 0.002:NNS
the DD | DD 0.989:DD 0.005:NN 0.003:II+ 0.002:PND
in JJ+ | JJ+ 0.977:JJ+ 0.014:II 0.006:NN 0.003:JJ 0.001:VVNJ
vitro JJ | JJ 0.999:JJ 0.001:RR
study NN | NN 1.000:NN
of II | II 1.000:II
erythroid NN | NN 1.000:NN
precursors NNS | NNS 1.000:NNS
has VHZ | VHZ 0.999:VHZ
a DD | DD 0.999:DD 0.001:RR
prognostic JJ | JJ 0.999:JJ 0.001:NN
value NN | NN 1.000:NN
. . | . 1.000:.
N-Best Rank=3
Note that the correct analysis was ranked 4th on the n-best list (like
typical computer scientists, we begin counting from zero). The double
question marks (??) in front of the token
aplasia indicate that the word was not in the training
corpus. The first column of part-of-speech categories contains the
reference categories for the tokens. The column after the vertical
bar (|) contains the first-best hypothesis returned by
the model trained on the first 6240 sentences (numbered 0 to 6239, of
course). The pair of Xs (XX) next to a category
indicates that the first-best hypothesis got this tag wrong; in this
case pure and red, both adjectives
(JJ), are assigned to a nominal category (NN) and
a verbal category (VVD) instead. The remaining columns
constitute all category assignments with greater than 0.001 estimated
likelihood given the input sequence. For the first error, the
confidence of the returned tag, NN, was only 0.580.
Surprisingly, the tag returned in the first-best hypothesis for the
token "red" is not the same as that with the
highest confidence. The highest confidence tag is VVNJ
at 0.435, whereas the tag VVD assigned by the first-best
hypothesis has only a 0.264 confidence. This illustrates
the way in which the first-best sequence hypothesis (the one right
after the vertical bars) may differ from the sequence of tags
with highest confidence. The sequence tags derive from the model's
estimate of the most likely sequence of tags given the input.
The confidence tags derive from the model's independent estimates
of the most likely tag for each position. These sequences
are actually scored separately in the cumulative evaluation, to
which we turn now.
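How the best sequence and the per-token most confident tags can disagree is easiest to see on a tiny explicit distribution. The probabilities below are invented for illustration (they are not taken from the MedPost model), and the sketch enumerates whole sequences directly rather than running Viterbi or forward/backward decoding.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class SequenceVsPerToken {

    // Hypothetical distribution over complete taggings of a two-token input.
    static final String[][] SEQS = { {"VVD", "NN"}, {"VVNJ", "NN"}, {"VVNJ", "JJ"} };
    static final double[] PROBS = { 0.40, 0.30, 0.30 };

    /** The single most probable tag sequence (the first-best analysis). */
    static String[] bestSequence(String[][] seqs, double[] probs) {
        int best = 0;
        for (int n = 1; n < probs.length; ++n)
            if (probs[n] > probs[best]) best = n;
        return seqs[best];
    }

    /** For each position, the tag with the largest mass summed over sequences. */
    static String[] bestPerToken(String[][] seqs, double[] probs) {
        int length = seqs[0].length;
        String[] result = new String[length];
        for (int i = 0; i < length; ++i) {
            Map<String, Double> marginal = new HashMap<>();
            for (int n = 0; n < seqs.length; ++n)
                marginal.merge(seqs[n][i], probs[n], Double::sum);
            for (Map.Entry<String, Double> e : marginal.entrySet())
                if (result[i] == null || e.getValue() > marginal.get(result[i]))
                    result[i] = e.getKey();
        }
        return result;
    }

    public static void main(String[] args) {
        // prints sequence:  [VVD, NN]
        System.out.println("sequence:  " + Arrays.toString(bestSequence(SEQS, PROBS)));
        // prints per token: [VVNJ, NN]
        System.out.println("per token: " + Arrays.toString(bestPerToken(SEQS, PROBS)));
    }
}
```

Here VVD wins as part of the best whole sequence (0.40), yet VVNJ has the higher per-token confidence (0.60) because two weaker sequences agree on it, which mirrors the situation reported for "red" above.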
After the dump of a result for a sentence, a running cumulative total is provided:
Cumulative Evaluation
Estimator: #Train Cases=6240 #Train Toks=170178]
Evaluator: #Cases=7 #Toks=184 Tok Acc=0.957
Case Acc=0.571 Lattice Acc=0.957
Unknown Toks=13 Unknown Tok Acc=0.846
This report begins with an indication of the amount of training data seen by the estimator in number of cases and number of tokens. The following lines provide the actual cumulative statistics. After seven test sentences (computer scientists count actual things like training cases just like other people) and 184 test tokens, the accuracy on a per-token basis is 95.7%. The accuracy judging entire cases at a time is 57.1%; in other words, 4 of the first 7 test sentences were labeled perfectly. The lattice accuracy of 95.7% is the accuracy of the sequence formed of the most confident tags (the third column of tags above). As we saw in the last example, the two sequences may be different, though in practice, they usually track each other very closely in high-accuracy and highly-lexicalized arrangements like our part-of-speech HMMs. The last line reports the number of unknown tokens, that is, tokens which appeared in a test case without having appeared in a training case, along with the accuracy on them. The accuracy for the first 13 unknown tokens is 84.6%.
The Evaluation Command and Ant Task
The evaluation command may be run using the following ant
task, drawn from the
eval-medpost target
in the ant build file build.xml:
<target name="eval-medpost"
depends="compile">
<java classname="EvaluatePos"
fork="true">
<jvmarg value="-server"/>
<classpath refid="classpath.standard"/>
<arg value="1"/> <!-- sent eval rate -->
<arg value="170000"/> <!-- toks before eval -->
<arg value="100"/> <!-- max n-best -->
<arg value="8"/> <!-- n-gram size -->
<arg value="256"/> <!-- num characters -->
<arg value="8.0"/> <!-- interpolate ratio -->
<arg value="MedPostPosCorpus"/> <!-- corpus impl class -->
<arg value="${medpost-dir}"/> <!-- baseline for data -->
</java>
</target>
The arguments are all required, and simply supplied in order.
The first argument is the frequency with which to evaluate sentences.
The value of 1 means every sentence. The second argument is
the number of tokens to use for training before evaluating the first
sentence, in this case 170,000. The third argument is the maximum
n-best size to evaluate, in this case 100. The fourth argument is the
size of n-gram to use, in this case 8. The fifth argument is
the number of characters in the training and test data, in this case a
conservative estimate of 256. The sixth argument,
8.0, is for the language model interpolation factor.
Tweaking these last three numbers will affect the performance of the
tagger, and this task is defined to show you how to do that. The
final two arguments, arguments seven and eight, pick out the name of the
corpus class and the directory in which the corpus is located. We include the
property value ${medpost-dir} in order to allow users to
set it in a properties file or on the command line. We could have
also specified the other variables in the same way in order to allow
the external caller to set them; or the command can be pulled out of
ant and run standalone.
Corpus Parsing Interface and Implementations
Before turning to the code for the evaluation, we first pause to abstract the features of our corpus into a general interface with two methods:
public interface PosCorpus {
public Parser parser();
public Iterator sourceIterator() throws IOException;
}
The first method returns the parser for a corpus. The second method
returns an iterator over input sources and may throw an I/O exception
if it gets in trouble on the I/O front. The code can be found in src/PosCorpus.java.
We provide three implementations, one for each corpus described above.
Genia POS Corpus Parser
The simplest is the GENIA corpus, because it only involves reading a
single input source from a file. The following code for doing this is
drawn from src/GeniaPosCorpus.java:
public class GeniaPosCorpus implements PosCorpus {
private final File mGeniaGZipFile;
public GeniaPosCorpus(File geniaZipFile) {
mGeniaGZipFile = geniaZipFile;
}
public Iterator sourceIterator() throws IOException {
FileInputStream fileIn = new FileInputStream(mGeniaGZipFile);
InputSource in = new InputSource(fileIn);
return new Iterators.Singleton(in);
}
public Parser parser() {
return new GeniaPosParser();
}
}
The return is through a LingPipe utility for singleton
iterators in com.aliasi.util.Iterators. Singleton
iterators return a single item once, just as if iterating over a
singleton (one element) set. The parser method simply returns an
instance of the GENIA corpus part-of-speech parser.
Note that an instance is constructed from a single file, which provides the basis for a relative location of the corpus for all of the implementations. Also note that nothing ever closes the input source. This problem would have to be fixed for a robust implementation through a more sophisticated iterator implementation that knows when it's done and can close the input streams, or by reading the file into memory and closing it before wrapping it as an input source and providing it as a singleton iterator.
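The second fix suggested above, reading the file into memory and closing it before wrapping it, might look like the following sketch. It uses java.nio.file.Files (Java 7 or later) and org.xml.sax.InputSource; the class name InMemorySource is made up, and it assumes the corpus file is small enough to hold in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.xml.sax.InputSource;

public class InMemorySource {

    /** Reads the whole file, releasing the file handle, then wraps the bytes. */
    static InputSource load(File file) throws IOException {
        byte[] bytes = Files.readAllBytes(file.toPath()); // opens and closes the file
        return new InputSource(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("corpus", ".txt");
        file.deleteOnExit();
        Files.write(file.toPath(), "The_DD study_NN".getBytes(StandardCharsets.UTF_8));
        InputSource in = load(file);
        System.out.println(in.getByteStream().available() + " bytes buffered");
    }
}
```

Since the returned source is backed by a byte array, nothing is left open if the consumer never closes it.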
MedPost POS Parser
The MedPost corpus consists of a directory of files, each
of which is simply in a text-based format. The non-trivial bit of the
implementation of src/MedPostPosCorpus.java
is the iteration over input sources:
public Iterator sourceIterator() {
return new MedPostSourceIterator(mMedPostDir);
}
public static class MedPostSourceIterator
extends Iterators.Buffered {
private final File[] mFiles;
private int mNextFileIndex = 0;
public MedPostSourceIterator(File medPostDir) {
mFiles
= medPostDir
.listFiles(new FileExtensionFilter("ioc"));
}
public Object bufferNext() {
if (mNextFileIndex >= mFiles.length) return null;
try {
String url
= Files
.fileToURLName(mFiles[mNextFileIndex++]);
return new InputSource(url);
} catch (IOException e) {
return null;
}
}
}
Here the iterator stores an array of files that end in the suffix
"ioc"; these are selected using the LingPipe
utility io.FileExtensionFilter.
We then keep the variable mNextFileIndex as a pointer to
the next file to return through the iterator. The iterator itself is
implemented by extending the very handy utility class util.Iterators.Buffered.
This abstract class defines the tricky bits of the has-next and next
logic of iterators through a single method bufferNext().
This allows implementations to concentrate on returning the next
object in the iteration rather than the logic of buffering and returning
has-next information. A return of null indicates to the
buffered iterator that there are no more elements. To return an
element as an input source, the name of the file is converted to
a URL using util.Files.fileToURLName(File).
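The buffering idea can be illustrated with a small stand-alone class. This is a sketch of the pattern as described above, not the source of Iterators.Buffered; the names BufferedIteratorSketch and withSuffix are hypothetical, and the generic signature is our own choice.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class BufferedIteratorSketch {

    /** Subclasses supply bufferNext(); returning null ends the iteration. */
    abstract static class Buffered<E> implements Iterator<E> {
        private E mNext; // fetched by hasNext() but not yet handed out

        protected abstract E bufferNext();

        @Override public boolean hasNext() {
            if (mNext == null) mNext = bufferNext();
            return mNext != null;
        }

        @Override public E next() {
            if (!hasNext()) throw new NoSuchElementException();
            E result = mNext;
            mNext = null;
            return result;
        }
    }

    /** Example subclass: iterates over the names ending in the given suffix. */
    static Iterator<String> withSuffix(final String[] names, final String suffix) {
        return new Buffered<String>() {
            private int mIndex = 0;
            @Override protected String bufferNext() {
                while (mIndex < names.length) {
                    String name = names[mIndex++];
                    if (name.endsWith(suffix)) return name;
                }
                return null; // no more elements
            }
        };
    }

    public static void main(String[] args) {
        String[] names = { "tag_cl.ioc", "medpost.db", "tag_mb.ioc" };
        for (Iterator<String> it = withSuffix(names, ".ioc"); it.hasNext(); )
            System.out.println(it.next());
    }
}
```

One consequence of using null as the end marker is that such an iterator cannot return null as a legitimate element.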
Brown Corpus POS Parser
The Brown corpus, as distributed with the Natural Language Tool Kit
(NLTK), is in yet another format -- a zipped directory of files. Zip
files are a very nice way to pack a lot of files because Java supports
their unpacking. In this way, they're a better choice than the
standard unix combination of tar and gzip.
Most of src/BrownPosCorpus.java is
just like the previous classes, with the following source iterator:
static class BrownSourceIterator
extends Iterators.Buffered {
private ZipInputStream mZipIn = null;
public BrownSourceIterator(File brownZipFile)
throws IOException {
FileInputStream fileIn
= new FileInputStream(brownZipFile);
mZipIn = new ZipInputStream(fileIn);
}
public Object bufferNext() {
ZipEntry entry = null;
try {
while ((entry = mZipIn.getNextEntry())
!= null) {
if (entry.isDirectory()) continue;
String name = entry.getName();
if (name.equals("brown/CONTENTS")
|| name.equals("brown/README"))
continue;
return new InputSource(mZipIn);
}
} catch (IOException e) {
// fall through on purpose
}
Streams.closeInputStream(mZipIn);
return null;
}
}
Here the file input stream is wrapped in a zip input stream. To get
the actual input sources, we extract the files in the input stream one
by one. To do this, we use the zip iteration method
getNextEntry(). If the entry is not a directory and
does not share a name with one of the non-data read-me files, then
the whole input stream is wrapped in an input source and passed to
the iterator to return. The zip input stream provides the actual bytes
and will have an end-of-stream marker that is only reset after the
next call to getNextEntry().
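The entry-by-entry reading described above can be tried without the Brown corpus. The following sketch uses only the standard java.util.zip classes; the class ZipEntrySketch, its methods, and the sample entry contents are made up for illustration. It builds a tiny zip in memory, then walks it with getNextEntry(), skipping directories and named non-data files, and reads each remaining entry to its end-of-stream marker.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipEntrySketch {

    // Returns "name:byteCount" for each data entry in the zipped bytes.
    static List<String> dataEntries(byte[] zipBytes, Set<String> skipNames) {
        List<String> found = new ArrayList<>();
        try (ZipInputStream zipIn
                 = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            byte[] buf = new byte[1024];
            while ((entry = zipIn.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                if (skipNames.contains(entry.getName())) continue;
                // read() returns -1 at the end of the current entry only;
                // the next getNextEntry() call resets the marker
                int total = 0;
                int n;
                while ((n = zipIn.read(buf)) != -1) total += n;
                found.add(entry.getName() + ":" + total);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    // Builds a small in-memory zip with made-up contents.
    static byte[] sampleZip() {
        String[][] files = {
            { "brown/", "" },
            { "brown/README", "read me" },
            { "brown/ca01", "The/at jury/nn" }
        };
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zipOut = new ZipOutputStream(bytes)) {
            for (String[] file : files) {
                zipOut.putNextEntry(new ZipEntry(file[0]));
                zipOut.write(file[1].getBytes(StandardCharsets.UTF_8));
                zipOut.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```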
The Evaluation Code
The Parser/Handler Pattern
As evidenced by the command invocation in the last section,
the top-level main(String[]) method is located in
src/EvaluatePos.java,
and it's quite simple:
public static void main(String[] args)
throws IOException, ClassNotFoundException {
new EvaluatePos(args).run();
}
It just constructs an EvaluatePos object
out of the command-line arguments and runs it. The constructor
merely sets a bunch of member variables from the arguments:
public EvaluatePos(String[] args) {
mSentEvalRate = Integer.parseInt(args[0]);
mToksBeforeEval = Integer.parseInt(args[1]);
mNGram = Integer.parseInt(args[2]);
mNumChars = Integer.parseInt(args[3]);
mLambdaFactor = Double.parseDouble(args[4]);
String constructorName = args[5];
File corpusFile = new File(args[6]);
Object[] consArgs = new Object[] { corpusFile };
mCorpus
= (PosCorpus)
Reflection
.newInstance(constructorName,consArgs);
}
Note that LingPipe's com.aliasi.util.Reflection
utility class is being used to construct the corpus given its name and
an argument constructed out of the corpus file. This utility merely
infers the constructor's argument types from the arguments and captures
any exceptions, returning null if the instance could not
be created due to any of the half-dozen or so exceptions Java's
reflection package may throw. A more robust version would probably not
use this reflection shortcut, or would at least provide an error
message in the constructor if the corpus could not be created.
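As a rough illustration of the more robust alternative, the sketch below uses only the standard java.lang.reflect API and reports a descriptive error instead of returning null. The names ReflectiveFactorySketch and DummyCorpus are hypothetical, and the sketch assumes the target class has a public constructor taking a single File.

```java
import java.io.File;
import java.lang.reflect.Constructor;

class ReflectiveFactorySketch {

    // Hypothetical corpus class with a public single-File constructor.
    public static class DummyCorpus {
        final File mFile;
        public DummyCorpus(File file) {
            mFile = file;
        }
    }

    // Constructs className(File) reflectively, failing with a message
    // rather than returning null.
    static Object newInstance(String className, File arg) {
        try {
            Class<?> clazz = Class.forName(className);
            Constructor<?> cons = clazz.getConstructor(File.class);
            return cons.newInstance(arg);
        } catch (ReflectiveOperationException e) {
            // covers class-not-found, no-such-method, instantiation,
            // illegal-access and invocation-target failures
            String msg = "Could not construct corpus class=" + className
                + " from file=" + arg;
            throw new IllegalArgumentException(msg, e);
        }
    }
}
```

Wrapping the underlying exception keeps the original cause available in the stack trace.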
The real action begins in the run() method, which begins
by printing out the parameters. It then constructs a corpus profile
handler and passes it off to the corpus parser method which will call
the handler.
CorpusProfileHandler profileHandler
= new CorpusProfileHandler();
parseCorpus(profileHandler);
The profile handler inner class is worth noting merely as a simple example of what can be done with LingPipe's handler framework:
class CorpusProfileHandler implements TagHandler {
public void handle(String[] toks, String[] whitespaces,
String[] tags) {
++mTrainingSentenceCount;
mTrainingTokenCount += toks.length;
for (int i = 0; i < tags.length; ++i)
mTagSet.add(tags[i]);
}
}
Because it is not static, it is able to manipulate member
variables for counting in the EvaluatePos class.
The parseCorpus(TagHandler) method is a utility that
simply parses the corpus by iterating through the input
sources and applying the parser to them:
void parseCorpus(TagHandler handler) throws IOException {
Parser parser = mCorpus.parser();
parser.setHandler(handler);
Iterator it = mCorpus.sourceIterator();
while (it.hasNext()) {
InputSource in = (InputSource) it.next();
parser.parse(in);
}
}
Recall that the mCorpus variable is set to the relevant
implementation of PosCorpus. The corpus supplies a
parser through its parser() method and an iterator over
sources through its sourceIterator() method. For
MedPost, these return an instance of MedPostPosParser
and an iterator over input sources for the input files.
Parsing the corpus with the corpus profile handler simply counts the training sentences and tokens and collects the set of tags.
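For readers without LingPipe at hand, the parser/handler division of labor can be sketched in a few lines. Everything here is a stand-in: TagHandlerSketch mimics the shape of the handler callback, and parse assumes a made-up input format of space-separated token/tag pairs, one sentence per string.

```java
import java.util.Set;
import java.util.TreeSet;

class CorpusProfileSketch {

    // Hypothetical callback interface with the same shape as the handler.
    interface TagHandlerSketch {
        void handle(String[] toks, String[] whitespaces, String[] tags);
    }

    int mSentenceCount = 0;
    int mTokenCount = 0;
    final Set<String> mTagSet = new TreeSet<>();

    // Non-static inner class, so it can update the enclosing counters.
    class ProfileHandler implements TagHandlerSketch {
        @Override
        public void handle(String[] toks, String[] whitespaces,
                           String[] tags) {
            ++mSentenceCount;
            mTokenCount += toks.length;
            for (String tag : tags)
                mTagSet.add(tag);
        }
    }

    // Stand-in parser: splits "tok/tag tok/tag" sentences and calls
    // the handler once per sentence (whitespaces are not tracked here).
    static void parse(String[] taggedSentences, TagHandlerSketch handler) {
        for (String sentence : taggedSentences) {
            String[] pairs = sentence.split(" ");
            String[] toks = new String[pairs.length];
            String[] tags = new String[pairs.length];
            for (int i = 0; i < pairs.length; ++i) {
                int slash = pairs[i].lastIndexOf('/');
                toks[i] = pairs[i].substring(0, slash);
                tags[i] = pairs[i].substring(slash + 1);
            }
            handler.handle(toks, null, tags);
        }
    }
}
```

The parser knows the corpus format and the handler knows what to do with each tagged sentence, so either side can be swapped independently.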
After the tags are found, we can build the components for evaluation, namely an HMM estimator and an HMM evaluator (the latter of which requires the tags).
Set tagSet = profileHandler.mTagSet;
String[] tags = (String[]) tagSet.toArray(new String[0]);
Arrays.sort(tags);
...
mEstimator
= new HmmCharLmEstimator(mNGram,mNumChars,mLambdaFactor);
mEvaluator
= new HmmEvaluator(mEstimator,tags,mMaxNBest);
The estimator and evaluator both implement TagHandler. They are
assigned to member variables, which will be available when
the learning curve handler is run in the final
statements of EvaluatePos's run()
method:
LearningCurveHandler evaluationHandler
= new LearningCurveHandler();
parseCorpus(evaluationHandler);
The actual work is all done by the learning curve handler, which we describe next.
The Learning Curve Handler
The learning curve handler class is an inner class in src/EvaluatePos.java.
class LearningCurveHandler implements TagHandler {
public void handle(String[] toks, String[] whites,
String[] refTags) {
if (mEstimator.numTrainingTokens() > mToksBeforeEval
&& mEstimator.numTrainingCases() % mSentEvalRate == 0) {
mEvaluator.handle(toks,whites,refTags);
System.out.println("\nTest Case "
+ mEvaluator.evaluation().numCases());
System.out.println(mEvaluator.lastCaseToString());
System.out.println("Cumulative Evaluation");
System.out.print(" Estimator: #Train Cases="
+ mEstimator.numTrainingCases());
System.out.println(" #Train Toks="
+ mEstimator.numTrainingTokens() + "]");
System.out.print(" Evaluator: ");
System.out.println(mEvaluator.evaluation().toString());
}
mEstimator.handle(toks,whites,refTags);
for (int i = 0; i < toks.length; ++i)
mEvaluator.evaluation().addKnownToken(toks[i]);
}
}
The only method implemented is TagHandler's interface
method handle. The handler first does evaluation on
the input and then does training.
Evaluation is only carried out under two conditions. First, the
number of training tokens must exceed the minimum specified on the
command line. Second, the number of training cases must be evenly
divisible by the sentence evaluation rate; this ensures that the
specified rate is in fact the rate of evaluation over sentences. If
the case is being evaluated, then it is simply passed on to the
evaluator's handle method (recall that the evaluator also
implements TagHandler, thus making the learning curve handler
a composite filter in the lingo of design patterns). The evaluator
actually does the evaluation and keeps a running total accessible
through its evaluation() method. The evaluator is used
to print out the last case run through it as well as a cumulative
total from the evaluation.
After the optional evaluation, the estimator's handle
method is given the case. One tricky part of this evaluation is
that to alert the evaluator's evaluation to the set of known tokens
for the evaluation, they must be added after training as shown in
the final loop of the method.
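The evaluate-then-train ordering is the heart of the learning curve, and it can be shown with a toy model. The sketch below is not an HMM and uses no LingPipe classes: it swaps in a most-frequent-tag-per-token model purely to keep the example short, and the class and field names are hypothetical. The two gating conditions mirror the ones described above.

```java
import java.util.HashMap;
import java.util.Map;

class LearningCurveSketch {

    // token -> (tag -> count); a toy unigram model standing in for the HMM
    private final Map<String, Map<String, Integer>> mCounts = new HashMap<>();
    private final int mSentEvalRate;
    private final int mToksBeforeEval;

    int mTrainCases = 0;
    int mTrainToks = 0;
    int mTestToks = 0;
    int mCorrectToks = 0;

    LearningCurveSketch(int sentEvalRate, int toksBeforeEval) {
        mSentEvalRate = sentEvalRate;
        mToksBeforeEval = toksBeforeEval;
    }

    void handle(String[] toks, String[] refTags) {
        // evaluate first, so the case is unseen by the model when tested
        if (mTrainToks > mToksBeforeEval
            && mTrainCases % mSentEvalRate == 0) {
            for (int i = 0; i < toks.length; ++i) {
                ++mTestToks;
                if (refTags[i].equals(firstBest(toks[i])))
                    ++mCorrectToks;
            }
        }
        // then train on the same case
        for (int i = 0; i < toks.length; ++i)
            mCounts.computeIfAbsent(toks[i], k -> new HashMap<>())
                   .merge(refTags[i], 1, Integer::sum);
        ++mTrainCases;
        mTrainToks += toks.length;
    }

    // Most frequent tag seen for the token, or null for unknown tokens.
    String firstBest(String tok) {
        Map<String, Integer> tagCounts = mCounts.get(tok);
        if (tagCounts == null) return null;
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : tagCounts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

Because every case is eventually used for training, the curve shows accuracy as a function of the amount of training data seen so far.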
Warning: Slow Curves
Here's how long running 12,385 test tokens took (using less than 100MB of memory):
BUILD SUCCESSFUL Total time: 39 minutes 13 seconds
Not counting the few seconds used for training, this took a long time:
only about 5 tokens/second (12,385 tokens in 2,353 seconds). The machine
is a 3.0GHz P4 with a dual 400MHz FSB running Java 1.4.2 in -server mode,
so the hardware is not badly out of date.
No, the slow pace is because this demo presents a no-expenses-spared, time-and-memory-are-no-object evaluation. It runs every aspect of a decoder: Viterbi first-best, Viterbi/A* n-best, and lattice-based forward-backward confidence. And it does it all with the models in the estimator. If a model is compiled and read back in, it runs orders of magnitude more quickly.
In addition to the runs over the data, each confidence result array is sorted and printed, and then used to create a classification instance which is added to a scored classification evaluation.
Running only the first-best evaluation component on a compiled model will run in seconds rather than hours.
For those who are curious, here's the final result of this evaluation setup:
Cumulative Evaluation
Estimator: #Train Cases=6699 #Train Toks=182365]
Evaluator: #Cases=466 #Toks=12385 Tok Acc=0.963
Case Acc=0.487 Lattice Acc=0.962
Unknown Toks=1264 Unknown Tok Acc=0.819
Noun and Verb Chunking
Part-of-speech tagging is often used as the basis for
extracting higher-level structure, such as phrases. In this
section, we show how to create a chunk.Chunker
implementation that finds noun and verb chunks based on
part-of-speech tags.
Tags to Chunks
The usual method of deriving chunks from underlying tags is to define a pattern of tags that derives a chunk. In this tutorial, we will consider very simple patterns, but the same technique would apply to more complex patterns.
We will employ the Brown corpus part-of-speech tagger
for English as the basis of our phrasal chunker. See
the corpus.parsers.BrownPosParser
for explanations of all of the categories and links to
the original documentation.
Defining Noun Chunks
We create noun and verb patterns the same way, with a set of possible initial categories and a set of possible continuation categories. Nouns may start with determiners, adjectives, common nouns or pronouns. Nouns may be continued with any category that may start a noun, and also by adverbs or punctuation.
These sets are defined statically; here is a fragment of the set of determiner tags:
static final Set<String> DETERMINER_TAGS = new HashSet<String>();
static {
DETERMINER_TAGS.add("abn");
DETERMINER_TAGS.add("abx");
DETERMINER_TAGS.add("ap");
DETERMINER_TAGS.add("ap$");
DETERMINER_TAGS.add("at");
...
}
The start tags and continuation tags are defined similarly:
START_NOUN_TAGS.addAll(DETERMINER_TAGS);
START_NOUN_TAGS.addAll(ADJECTIVE_TAGS);
START_NOUN_TAGS.addAll(NOUN_TAGS);
START_NOUN_TAGS.addAll(PRONOUN_TAGS);
CONTINUE_NOUN_TAGS.addAll(START_NOUN_TAGS);
CONTINUE_NOUN_TAGS.addAll(ADVERB_TAGS);
CONTINUE_NOUN_TAGS.addAll(PUNCTUATION_TAGS);
Defining Verb Chunks
We allow verbs to start with verbs, auxiliaries, or adverbs; they may be continued with any of these tags, or with punctuation.
The Chunker Implementation
We provide an implementation of the
chunk.Chunker
interface in src/PhraseChunker.java.
Note that this is the same interface we use for named
entity and other chunkers; see the Named Entity Tutorial
for more information.
The constructor simply stores a part-of-speech tagger (in the form of an HMM decoder) along with a tokenizer factory:
private final HmmDecoder mPosTagger;
private final TokenizerFactory mTokenizerFactory;
public PhraseChunker(HmmDecoder posTagger,
TokenizerFactory tokenizerFactory) {
mPosTagger = posTagger;
mTokenizerFactory = tokenizerFactory;
}
The chunk method is implemented in several stages. The first step is to tokenize the input and compute the part-of-speech tags using the decoder:
public Chunking chunk(char[] cs, int start, int end) {
// tokenize
List<String> tokenList = new ArrayList<String>();
List<String> whiteList = new ArrayList<String>();
Tokenizer tokenizer = mTokenizerFactory.tokenizer(cs,start,end-start);
tokenizer.tokenize(tokenList,whiteList);
String[] tokens
= tokenList.<String>toArray(new String[tokenList.size()]);
String[] whites
= whiteList.<String>toArray(new String[whiteList.size()]);
// part-of-speech tag
String[] tags = mPosTagger.firstBest(tokens);
...
Next, we walk over the tags, keeping track of the positions of the chunks, and waiting for the start of a noun or a verb. Skeletally, this looks like:
...
ChunkingImpl chunking = new ChunkingImpl(cs,start,end);
int startChunk = 0;
for (int i = 0; i < tags.length; ) {
startChunk += whites[i].length();
if (START_NOUN_TAGS.contains(tags[i])) {
// extend noun to completion
...
} else if (START_VERB_TAGS.contains(tags[i])) {
// extend verb to completion
..
} else {
startChunk += tokens[i].length();
++i;
}
}
return chunking;
}
The real work is done in the elided blocks above. We only consider the noun case, as the verb case is analogous.
...
for (int i = 0; i < tags.length; ) {
startChunk += whites[i].length();
if (START_NOUN_TAGS.contains(tags[i])) {
int endChunk = startChunk + tokens[i].length();
++i;
while (i < tokens.length && CONTINUE_NOUN_TAGS.contains(tags[i])) {
endChunk += whites[i].length() + tokens[i].length();
++i;
}
...
Here, once we find the start of the noun at index i,
we track where it starts (always on the first character of a token,
not on whitespace). We then extend it one token at a time
if the corresponding tag is a legal noun continuation. All the while,
we keep track of the end position and the overall index.
Once we have a chunk, we work backward peeling off any final punctuation. We define a new trimmed end chunk variable and update it going backward. If the whole thing turns out to be punctuation (shouldn't actually happen), then we ignore the resulting chunk.
int trimmedEndChunk = endChunk;
for (int k = i;
PUNCTUATION_TAGS.contains(tags[--k]);
trimmedEndChunk -= (whites[k].length() + tokens[k].length())) ;
if (startChunk == trimmedEndChunk) continue;
Otherwise, we use the chunk factory to create a new chunk,
add it to our chunking, and update our position tracking
variable startChunk.
Chunk chunk
= ChunkFactory.createChunk(startChunk,trimmedEndChunk,"noun");
chunking.add(chunk);
startChunk = endChunk;
} ...
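Putting the noun-chunk fragments together, here is a self-contained sketch that runs directly on token, whitespace and tag arrays, with no tagger or LingPipe classes involved. The tag sets are small illustrative subsets of the Brown tags that appear in this tutorial, not the full sets from PhraseChunker.java, and chunks are returned as plain strings rather than Chunk objects. As in the code above, whites[i] is assumed to be the whitespace preceding tokens[i].

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TagChunkerSketch {

    // Illustrative subsets only; the real sets are larger.
    static final Set<String> START_NOUN
        = new HashSet<>(Arrays.asList("at", "jj", "nn", "nns", "np", "pp$"));
    static final Set<String> PUNCT
        = new HashSet<>(Arrays.asList(",", "."));
    static final Set<String> CONTINUE_NOUN = new HashSet<>(START_NOUN);
    static {
        CONTINUE_NOUN.add("rb");
        CONTINUE_NOUN.addAll(PUNCT);
    }

    // Returns noun chunks as "noun(start,end)" character offsets.
    static List<String> nounChunks(String[] tokens, String[] whites,
                                   String[] tags) {
        List<String> chunks = new ArrayList<>();
        int startChunk = 0;
        for (int i = 0; i < tags.length; ) {
            startChunk += whites[i].length();
            if (!START_NOUN.contains(tags[i])) {
                startChunk += tokens[i].length();
                ++i;
                continue;
            }
            // extend the chunk over legal continuation tags
            int endChunk = startChunk + tokens[i].length();
            ++i;
            while (i < tags.length && CONTINUE_NOUN.contains(tags[i])) {
                endChunk += whites[i].length() + tokens[i].length();
                ++i;
            }
            // peel trailing punctuation; the start tag is never
            // punctuation, so the loop stops at or before the start token
            int trimmedEnd = endChunk;
            for (int k = i - 1; PUNCT.contains(tags[k]); --k)
                trimmedEnd -= whites[k].length() + tokens[k].length();
            chunks.add("noun(" + startChunk + "," + trimmedEnd + ")");
            startChunk = endChunk;
        }
        return chunks;
    }
}
```

For the input John/np eats/vbz the/at beans/nns ./. with single spaces between words, this yields noun(0,4) and noun(10,19); the final period is consumed as a continuation and then trimmed off.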
Running the Program
We have implemented a main() method in
PhraseChunker.java
to allow the chunker to be tested from the command line. This
may be run using the ant target phrases in
the ant build.xml file:
> cd $LINGPIPE/demos/tutorial/posTags
> ant phrases
Buildfile: build.xml
compile:
phrases:
[java]
[java] After months of coy hints, Prime Minister Tony Blair made the announcement today as part of a closely choreographed and protracted farewell.
[java] noun(6,12) months
[java] noun(16,25) coy hints
[java] noun(27,52) Prime Minister Tony Blair
[java] verb(53,57) made
[java] noun(58,80) the announcement today
[java] noun(84,88) part
[java] noun(92,101) a closely
[java] verb(102,115) choreographed
[java] verb(120,130) protracted
[java] noun(131,139) farewell
[java]
[java] The attorney general appeared before the House Judiciary Committee to discuss the dismissals of U.S. attorneys.
[java] noun(0,20) The attorney general
[java] verb(21,29) appeared
[java] noun(37,66) the House Judiciary Committee
[java] verb(67,77) to discuss
[java] noun(78,92) the dismissals
[java] noun(96,110) U.S. attorneys
[java]
[java] Nascar's most popular driver announced that his future would not include racing for Dale Earnhardt Inc.
[java] noun(0,6) Nascar
[java] verb(7,8) s
[java] noun(14,28) popular driver
[java] verb(29,38) announced
[java] noun(44,54) his future
[java] verb(55,79) would not include racing
[java] noun(84,102) Dale Earnhardt Inc
[java]
[java] Purdue Pharma, its parent company, and three of its top executives today admitted to understating the risks of addiction to the painkiller.
[java] noun(0,13) Purdue Pharma
[java] noun(15,33) its parent company
[java] noun(39,44) three
[java] noun(48,72) its top executives today
[java] verb(73,81) admitted
[java] verb(85,97) understating
[java] noun(98,107) the risks
[java] noun(111,120) addiction
[java] noun(124,138) the painkiller
[java]
[java] After a difficult stretch for the airline, David Neeleman will give way to David Barger, the No. 2 executive.
[java] noun(6,25) a difficult stretch
[java] noun(30,41) the airline
[java] noun(43,57) David Neeleman
[java] verb(58,67) will give
[java] noun(68,71) way
[java] noun(75,87) David Barger
[java] noun(89,108) the No. 2 executive
The Main Method
The main() method driving this demo is trivial; it
just reads in the models, sets up the chunker, and then runs it
on the remaining command-line arguments:
public static void main(String[] args) {
// parse input params
File hmmFile = new File(args[0]);
int cacheSize = Integer.parseInt(args[1]);
FastCache cache = new FastCache(cacheSize);
// read HMM for pos tagging
HiddenMarkovModel posHmm;
try {
posHmm
= (HiddenMarkovModel)
AbstractExternalizable.readObject(hmmFile);
} catch (IOException e) {
System.out.println("Exception reading model=" + e);
e.printStackTrace(System.out);
return;
} catch (ClassNotFoundException e) {
System.out.println("Exception reading model=" + e);
e.printStackTrace(System.out);
return;
}
// construct chunker
HmmDecoder posTagger = new HmmDecoder(posHmm,null,cache);
TokenizerFactory tokenizerFactory = new IndoEuropeanTokenizerFactory();
PhraseChunker chunker = new PhraseChunker(posTagger,tokenizerFactory);
// apply chunker
for (int i = 2; i < args.length; ++i) {
Chunking chunking = chunker.chunk(args[i]);
CharSequence cs = chunking.charSequence();
System.out.println("\n" + cs);
for (Chunk chunk : chunking.chunkSet()) {
String type = chunk.type();
int start = chunk.start();
int end = chunk.end();
CharSequence text = cs.subSequence(start,end);
System.out.println(" " + type + "(" + start + "," + end + ") " + text);
}
}
}
Just Proper Nouns
Given the Brown corpus's tagging, it's possible to pull back
just the proper noun chunks. These would take the noun starting
categories to be just the proper noun category (np in
the Brown corpus). This may not produce the desired result, though,
considering the underlying taggings of the above examples:
After/in months/nns of/in coy/jj hints/nns ,/, Prime/jj Minister/nn Tony/np Blair/np made/vbd the/at announcement/nn today/nr as/cs part/nn of/in a/at closely/rb choreographed/vbn and/cc protracted/vbn farewell/nn ./.
The/at attorney/nn general/nn appeared/vbd before/cs the/at House/nn Judiciary/nn Committee/nn to/to discuss/vb the/at dismissals/nn of/in U/np ./. S/nrs ./. attorneys/nns ./.
Nascar/np '/' s/vbz most/ql popular/jj driver/nn announced/vbd that/cs his/pp$ future/nn would/md not/* include/vb racing/vbg for/in Dale/np Earnhardt/np Inc/np ./.
Purdue/np$ Pharma/nn ,/, its/pp$ parent/jj company/nn ,/, and/cc three/cd of/in its/pp$ top/jjs executives/nns today/nr admitted/vbd to/in understating/vbg the/at risks/nns of/in addiction/nn to/in the/at painkiller/nn ./.
After/in a/at difficult/jj stretch/nn for/in the/at airline/nn ,/, David/np Neeleman/np will/md give/vb way/nn to/in David/np Barger/np ,/, the/at No/rb ./. 2/cd executive/nn ./.
Note that Prime Minister is not considered part of
the proper noun, nor is House Judiciary Committee.
Only the U in U.S. is assigned a proper
noun tag. Proper person names, on the other hand, are usually
analyzed as category np, as were Nascar
and Purdue (though Purdue receives the possessive tag np$, and Pharma
in Purdue Pharma is not considered a proper noun
by the tagger).
Noun and Verb Chunks with Confidence
The n-best output for taggers could be used to define chunks. Rather than running over just the first-best output, run the chunker over each of the n-best taggings. Rather than returning unscored chunks, score each chunk by adding the conditional probabilities of all the taggings whose chunkings contain it. Keep in mind that this is an underestimate of the chunk's probability, one that improves as the n in the n-best list grows.
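A minimal sketch of this scoring scheme follows. It assumes the n-best chunkings have already been computed and are represented as sets of chunk strings paired with the conditional probabilities of their taggings; the class and method names are hypothetical, and no LingPipe n-best API is used.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class NBestChunkConfidenceSketch {

    // Sums, for each chunk, the conditional probabilities of the
    // n-best analyses whose chunking contains that chunk.
    static Map<String, Double> chunkConfidences(
            List<Set<String>> nBestChunkSets,
            double[] conditionalProbs) {
        Map<String, Double> confidences = new LinkedHashMap<>();
        for (int n = 0; n < nBestChunkSets.size(); ++n)
            for (String chunk : nBestChunkSets.get(n))
                confidences.merge(chunk, conditionalProbs[n], Double::sum);
        return confidences;
    }
}
```

Since probability mass outside the n-best list is never counted, each score is a lower bound on the chunk's true conditional probability.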
A more elaborate method of doing this would be to follow
the approach to named-entity chunking in the HMM-based chunker
implementations in the chunk package.
A final possibility would be to use the simple noun and verb chunker to create a large set of training data that could be used to train a rescoring chunker. Simply use the chunkings that are output to train a chunker.
References
Because HMMs are one of the premier techniques for both written and spoken language processing, there is a wealth of information available on them, including applications to part-of-speech tagging. We'd recommend the two major natural language processing texts:
- Chris Manning and Hinrich Schuetze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
- Dan Jurafsky and James H. Martin. 2000. Speech and Language Processing. Prentice-Hall.
as well as the two standard speech recognition texts:
- Larry Rabiner and Fred Juang. 1993. Fundamentals of Speech Recognition. Prentice Hall.
- Fred Jelinek. 1998. Statistical Methods for Speech Recognition. MIT Press.
Appendix: Additional Corpora
The other part-of-speech training corpora of which we are aware are:
Freely Downloadable
The following data may be downloaded over the web and used for "scientific", "non-commercial", or "evaluation" purposes:
- Chinese: Lancaster Corpus of Mandarin Chinese
- Dutch: CoNLL 2003 (ned.*.gz)
- English:
- CoNLL 2002 (train*,test*)
- CoNLL 2001 (*.txt.gz)
- Geoffrey Sampson's SUSANNE, CHRISTINE AND LUCE corpora
- German: Negra Corpus
- Spanish: CoNLL 2003 with POS Tags by Xavier Carreras
Restrictively Licensed
These range in cost from hundreds to thousands of (US) dollars:
- Arabic: LDC Arabic Treebank 3
- Chinese: LDC Chinese Treebank 5
- Czech & English Aligned: LDC Prague Czech-English Dependency Treebank 1
- Dutch: ELRA PAROLE Dutch
- English:
- Treebank 3;
- BLIIP WSJ (auto annotated)
- British National Corpus (Full & "baby" versions)
- Lancaster-Oslo-Bergen (LOB) Corpus (their site doesn't make much sense)
- English-French-German-Italian-Spanish Aligned ELRA MULTEXT JOC Corpus
- English-French-Spanish Aligned ELRA CRATER Corpus; (some Sample Files are available)
- German: ELRA MTP German Corpus
- Greek: ELRA ILSP/ELEFTHEROTYPIA Corpus
- Korean: ELRA Qualified POS Tagged Corpus
- Portuguese: ELRA PAROLE Portuguese