What is Part-of-Speech Tagging?

Part-of-speech tagging is a process whereby tokens are sequentially labeled with syntactic labels, such as "finite verb" or "gerund" or "subordinating conjunction". This tutorial shows how to train a part-of-speech tagger and compile its model to a file, how to load a compiled model from a file and perform part-of-speech tagging, and finally, how to evaluate and tune models.

What is Phrase Chunking?

Phrase chunking is the process of recovering the phrases (typically base noun phrases and verb phrases) implied by a sequence of part-of-speech tags. For instance, in the sentence "John Smith will eat the beans.", there is a proper noun phrase John Smith, a verb phrase will eat, and a common noun phrase the beans. Note that this notion of phrase may not line up with any theoretically motivated linguistic analysis.

In the second part of this tutorial, we show how to generate phrase chunkings based on a part-of-speech tagger.

Downloading Training Corpora

We use three different freely downloadable English corpora as examples, though the entire tutorial may be completed with only the MedPost corpus:

POS Corpora
Corpus    Domain    Tags   Toks   Parser              Download > target file(s)
Brown     Balanced  93     1.1M   BrownPosParser      nltk-data-0.3.zip
GENIA     Biomed    48     501K   GeniaPosParser      GENIA 3.02p
MedPost   Biomed    63     182K   MedPostPosParser    medtag.tar.gz

Each corpus is listed with a link to the JavaDoc for the corresponding parser class in com.aliasi.corpus.parsers, such as BrownPosParser for the Brown corpus. The class documentation for these parsers details the corpus contents including domain and POS tag set, the corpus format, and provides links to download and get more information about the corpus.

The last column provides a download link, as well as an indication of the target file(s) for training. For instance, the link to the NLTK distribution of the Brown corpus should be unzipped and the file nltk-data-0.3/brown.zip is the relevant file for training. These names are reflected in the build.xml file for this tutorial.

Training Part-of-Speech Models

Like the other statistical packages in LingPipe (e.g. named entity detection, language-model classifiers, spelling correction, etc.), part-of-speech labeling is based on statistical models that are trained from a corpus of labeled data. For this illustration, we use the MedPost data, which is available from the link above.

The Training Corpus

We downloaded medtag.tar.gz to this tutorial's directory /lingpipe/trunk/demos/tutorial/posTags

When we unpacked it, it created the directories medtag/medpost/.

> tar -xzf medtag.tar.gz
> ls medtag/medpost
medpost.db   tag_mb01.ioc  tag_mb04.ioc  tag_mb07.ioc  tag_mb10.ioc
medpost.sql  tag_mb02.ioc  tag_mb05.ioc  tag_mb08.ioc  tag_mb.ioc
tag_cl.ioc   tag_mb03.ioc  tag_mb06.ioc  tag_mb09.ioc  tag_ml01.ioc

The files ending with the suffix .ioc make up the actual text-formatted training corpus. For example, using the tail command to print the last few lines of the training file tag_mb.ioc produces (with our ellipses to shorten lines):

> tail /data1/data/medtag/medpost/tag_mb.ioc
The_DD N-terminal_JJ region_NN had_VHD high_JJ homology_NN with_II ...
Several_JJ sequences_NNS were_VBD identified_VVN in_II the_DD libra...
Our_PNG findings_NNS indicate_VVB that_CST CRCL_NN has_VHZ prominen...
The_DD corresponding_VVGJ mRNA_NN of_II 3.5_MC kb_NN is_VBZ compose...
A_DD few_JJ examples_NNS of_II heterologous_JJ expression_NN of_II ...

The data is formatted with a PubMed identifier (e.g. P12569660) and sentence position (e.g. A06) on their own line, followed by the text of the sentence represented as a sequence of token/tag pairs, such as The_DD, which indicates that the token The is assigned the determiner part-of-speech DD.
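Splitting such a line into aligned token and tag arrays takes only ordinary string handling. The following is an illustrative sketch, not the tutorial's MedPostPosParser; it splits on whitespace and takes the tag to be whatever follows the last underscore of each pair:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: splits a MedPost-style line of token_tag pairs
// into parallel token and tag arrays. Not LingPipe's MedPostPosParser.
public class TokenTagSplitter {
    public static String[][] split(String line) {
        List<String> tokens = new ArrayList<String>();
        List<String> tags = new ArrayList<String>();
        for (String pair : line.trim().split("\\s+")) {
            int i = pair.lastIndexOf('_'); // tag follows the last underscore
            tokens.add(pair.substring(0, i));
            tags.add(pair.substring(i + 1));
        }
        return new String[][] {
            tokens.toArray(new String[0]),
            tags.toArray(new String[0])
        };
    }
    public static void main(String[] args) {
        String[][] pair = split("The_DD corresponding_VVGJ mRNA_NN");
        System.out.println(pair[0][0] + "/" + pair[1][0]); // The/DD
    }
}
```

Splitting on the last underscore rather than the first matters for tokens that themselves contain underscores.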

Training a Model

Given the location of the data directory, the training program src/TrainMedPost.java can be run using the ant task:

     > ant -Ddata.pos.medpost=/data1/data/medtag/medpost train-medpost

The -D option sets a system property that is picked up by Ant to indicate the location of the MedPost data directory. The path in the above command should be replaced with the path to wherever you unpacked the MedPost distribution. The output produced is:

     Buildfile: build.xml


Training file=/data1/data/medtag/medpost/tag_cl.ioc
Training file=/data1/data/medtag/medpost/tag_mb.ioc
Training file=/data1/data/medtag/medpost/tag_mb01.ioc
Training file=/data1/data/medtag/medpost/tag_mb02.ioc
Training file=/data1/data/medtag/medpost/tag_mb03.ioc
Training file=/data1/data/medtag/medpost/tag_mb04.ioc
Training file=/data1/data/medtag/medpost/tag_mb05.ioc
Training file=/data1/data/medtag/medpost/tag_mb06.ioc
Training file=/data1/data/medtag/medpost/tag_mb07.ioc
Training file=/data1/data/medtag/medpost/tag_mb08.ioc
Training file=/data1/data/medtag/medpost/tag_mb09.ioc
Training file=/data1/data/medtag/medpost/tag_mb10.ioc
Training file=/data1/data/medtag/medpost/tag_ml01.ioc

Total time: 10 seconds

and creates a model file of roughly 5MB:

> ls -l ../../models/pos-en-bio-medpost.HiddenMarkovModel
-rw-rw-r--  1 carp carp 4974338 Sep 20 14:02 ../../models/pos-en-bio-medpost.HiddenMarkovModel

The Training Code

The actual code making up the sample is an almost trivial sequence of lines in a single main(String[]) method. First, it creates an estimator for a hidden Markov model (HMM):

HmmCharLmEstimator estimator
    = new HmmCharLmEstimator(N_GRAM, NUM_CHARS, LAMBDA_FACTOR);

The parameters are for the HMM's underlying character language models and determine the length of the character n-grams used as the basis for the model, the total number of distinct characters, and an interpolation parameter for smoothing:

static int N_GRAM = 8;
static int NUM_CHARS = 256;
static double LAMBDA_FACTOR = 8.0;

These are reasonable default settings for English data. Their behavior is outlined in hmm.HmmCharLmEstimator and described in detail in lm.NGramBoundaryLM.
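To give intuition for the interpolation parameter, here is an illustrative Witten-Bell-style smoothing calculation. This is a sketch of the general idea only; the exact formula LingPipe uses is documented in lm.NGramBoundaryLM and may differ in detail:

```java
// Illustrative Witten-Bell-style interpolation, showing the role of a
// lambda factor in smoothing. Sketch only; see lm.NGramBoundaryLM for
// the formula LingPipe actually uses.
public class Interpolation {
    // Blend a higher-order maximum-likelihood estimate with a backoff
    // estimate; more observations of the context shift weight to the ML side.
    public static double interpolate(double contextCount,
                                     double lambdaFactor,
                                     double mlEstimate,
                                     double backoffEstimate) {
        double lambda = contextCount / (contextCount + lambdaFactor);
        return lambda * mlEstimate + (1.0 - lambda) * backoffEstimate;
    }
    public static void main(String[] args) {
        // Rarely seen context: the estimate leans on the backoff.
        System.out.println(interpolate(2.0, 8.0, 0.9, 0.1));
        // Frequently seen context: the estimate leans on the ML estimate.
        System.out.println(interpolate(800.0, 8.0, 0.9, 0.1));
    }
}
```

A larger lambda factor pushes more probability mass toward the lower-order backoff estimates, which is why it is a useful tuning knob for sparse data.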

This estimator implements the corpus.ObjectHandler interface for generic type Tagging<String>, through which it receives its training events in the form of aligned token and tag sequences. The handler acts as a visitor over the training data, driven by a parser, which does the actual data parsing and then provides the taggings to the handler. The parser for this demo is in src/MedPostPosParser.java. The code to set it up and make sure the taggings it extracts are sent to the estimator is just two lines:

Parser<ObjectHandler<Tagging<String>>> parser = new MedPostPosParser();
parser.setHandler(estimator);

The generic type indicates that the MedPost parser provides events for a tag handler.
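The parser/handler division of labor can be mimicked in plain Java with two small classes. This is an analog of the pattern only, not LingPipe's actual corpus.Parser and ObjectHandler classes; the names below are invented for illustration:

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java analog of LingPipe's parser/handler pattern: the parser
// walks the raw data and calls handle() on each extracted item, so the
// handler (here, a toy estimator) never touches the raw format.
public class HandlerDemo {
    interface Handler<E> { void handle(E e); }

    static class LineParser {
        private final Handler<List<String>> mHandler;
        LineParser(Handler<List<String>> handler) { mHandler = handler; }
        void parse(String rawData) {
            for (String line : rawData.split("\n"))
                mHandler.handle(Arrays.asList(line.split("\\s+")));
        }
    }

    static class CountingEstimator implements Handler<List<String>> {
        int mNumTokens = 0;
        public void handle(List<String> tokens) { mNumTokens += tokens.size(); }
    }

    public static void main(String[] args) {
        CountingEstimator estimator = new CountingEstimator();
        new LineParser(estimator).parse("a b c\nd e");
        System.out.println(estimator.mNumTokens); // 5
    }
}
```

The payoff of the pattern is that the same estimator can be trained from any corpus for which a parser exists, which is exactly how this tutorial supports three different corpora.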

The next step is to find the actual training files and walk over them. This is done with an instance of io.FileExtensionFilter that is set to pick out just the files ending in "ioc":

File dataDir = new File(args[0]);
File[] files = dataDir.listFiles(new FileExtensionFilter("ioc"));

We then loop over the files, parsing each one:

for (int i = 0; i < files.length; ++i) {
    System.out.println("Training file=" + files[i]);
    parser.parse(files[i]);
}

That's it. At this point, the estimator is trained. Because the HmmCharLmEstimator class implements hmm.HiddenMarkovModel, it may be used immediately to do part-of-speech tagging. Rather than do that here, we demonstrate how to compile the model to a file using an object output stream and a utility in com.aliasi.util:

File modelFile = new File(args[1]);
FileOutputStream fileOut = new FileOutputStream(modelFile);
ObjectOutputStream objOut = new ObjectOutputStream(fileOut);
estimator.compileTo(objOut);
Streams.closeOutputStream(objOut);

HMM estimators implement the util.Compilable interface, which is used to write them to an object output. Note that the utility method util.Streams.closeOutputStream() is used to close the output stream; in more robust settings this would be done inside a try/finally block to make sure the streams were closed. That's it.
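The more robust open/close discipline mentioned above can be sketched with plain java.io serialization and try-with-resources (a sketch only; for a real LingPipe estimator, the compileTo(objOut) call would go where writeObject appears below):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Robust stream handling for model files: try-with-resources guarantees
// the streams are closed even if reading or writing throws. Plain
// java.io serialization stands in for the LingPipe compile step here.
public class ModelIo {
    public static void write(Object model, File file) throws IOException {
        try (ObjectOutputStream objOut =
                 new ObjectOutputStream(new FileOutputStream(file))) {
            objOut.writeObject(model); // estimator.compileTo(objOut) in LingPipe
        }
    }
    public static Object read(File file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream objIn =
                 new ObjectInputStream(new FileInputStream(file))) {
            return objIn.readObject();
        }
    }
    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("model", ".bin");
        write("stand-in model", f);
        System.out.println(read(f));
        f.delete();
    }
}
```

Try-with-resources (Java 7 and later) compiles down to exactly the try/finally discipline the tutorial recommends, without the boilerplate.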

Running Part-of-Speech Taggers

Now that we have a model file, we can use it to assign part-of-speech tags to phrases. The code to run a compiled model is in src/RunMedPost.java. We set it up to run interactively, which doesn't play very nicely with Ant (it plays a bit better with println() rather than print() for the prompt), so instead we call it from the command line directly:

>  java -cp build/classes:../../../lingpipe-4.1.2.jar RunMedPost ../../models/pos-en-bio-medpost.HiddenMarkovModel

The demo prompts for input sentences and then returns their part-of-speech tagging on the next line in the same form as the input was found (with line-breaks inserted for readability):

Reading model from file=../../models/pos-en-bio-medpost.HiddenMarkovModel

INPUT (return)> A good correlation was found between the
    grade of Barrett's esophagus dysplasia and high p53 positivity.

A_DD good_JJ correlation_NN was_VBD found_VVN between_II
    the_DD grade_NN of_II Barrett's_NNP esophagus_NN dysplasia_NN
    and_CC high_JJ p53_NN positivity_NN ._.

INPUT (return)> This correlation was also confirmed by
    detection of early carcinoma in patients with "preventive"
    extirpation of the esophagus due to a high-grade dysplasia.

This_DD correlation_NN was_VBD also_RR confirmed_VVN by_II
    detection_NN of_II early_JJ carcinoma_NN in_II patients_NNS
    with_II "_`` preventive_JJ "_'' extirpation_NN of_II the_DD
    esophagus_NN due_II+ to_II a_DD high-grade_NN dysplasia_NN ._.

INPUT (return)> exit

Note that an empty line or the command quit or exit will cause the program to exit gracefully.


The code to do decoding is nearly as simple as the training code. Our input arrives a line at a time, so we first need to tokenize it into the same sort of units as the training data. We do this by creating a tokenizer factory that defines a token as the longest contiguous non-empty sequence of letters (\p{L}), numerals (\d), hyphens (-) and apostrophes ('); it also allows single non-whitespace characters (\S) as tokens.

static TokenizerFactory TOKENIZER_FACTORY
    = new RegExTokenizerFactory("(-|'|\\d|\\p{L})+|\\S");

In general, the trick with pre-tokenized corpora is developing a tokenizer to match the corpus. The above tokenizer is only an approximate guess as to what the real MedPost tokenizer looks like. The MMTx Tokenization page points to NLM's tokenizer, hints that it's highly heuristic and context sensitive, but does not provide a grammar for it.
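The token definition above can be checked against examples with java.util.regex directly. This just applies the same pattern with Matcher.find(); it is not LingPipe's TokenizerFactory machinery:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Applies the tutorial's token pattern with java.util.regex directly,
// to check what the tokenizer does with hyphens, apostrophes, and
// sentence-final punctuation.
public class RegexTokens {
    static final Pattern TOKEN = Pattern.compile("(-|'|\\d|\\p{L})+|\\S");
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(text);
        while (m.find())
            tokens.add(m.group());
        return tokens;
    }
    public static void main(String[] args) {
        // Hyphens stay inside tokens; the final period splits off.
        System.out.println(tokenize("high-grade dysplasia."));
        // Apostrophes stay inside tokens.
        System.out.println(tokenize("Barrett's esophagus"));
    }
}
```

Note that "high-grade" and "Barrett's" each come out as single tokens, matching the tags MedPost assigns to them in the sample runs above.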

Reading the Model

Actually reading in the model and constructing the decoder requires just a bit of stream manipulation, casting and wrapping:

FileInputStream fileIn = new FileInputStream(args[0]);
ObjectInputStream objIn = new ObjectInputStream(fileIn);
HiddenMarkovModel hmm = (HiddenMarkovModel) objIn.readObject();
HmmDecoder decoder = new HmmDecoder(hmm);

An object input stream wraps a file input stream that points to the model (args[0] on the command line). The HMM is then read using the standard java.io.ObjectInput.readObject() method; this method may throw an IOException or a ClassNotFoundException. The object read from the stream is cast to an instance of com.aliasi.hmm.HiddenMarkovModel. The input stream is then closed; again, a robust approach would do this in a finally block. Finally, the decoder is created by wrapping the HMM read from the input stream.

Standard Input Loop

The next part of the code goes into a loop, reading a line at a time from standard input and quitting if there's no input or the input is a command to stop:

InputStreamReader isReader = new InputStreamReader(System.in);
BufferedReader bufReader = new BufferedReader(isReader);
while (true) {
    System.out.println("\n\nINPUT (return)> ");
    String line = bufReader.readLine();
    if (line == null || line.length() < 1
        || line.equalsIgnoreCase("quit") || line.equalsIgnoreCase("exit"))
        break;
    char[] cs = line.toCharArray();

The real work then happens once we have the characters cs to tag. First, we generate the array of tokens and then wrap it as a list (this could be done more efficiently with a little more work):

    Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(cs,0,cs.length);
    String[] tokens = tokenizer.tokenize();
    List<String> tokenList = Arrays.asList(tokens);

First-Best Results

With the tokens in hand, retrieving the first-best tags is trivial:

Tagging<String> tagging = decoder.tag(tokenList);

We then just print these out in the same format as the training data in a simple loop (actually, a bit more complicated because of pretty printing):

for (int i = 0; i < tagging.size(); ++i)
    System.out.print(tagging.token(i) + "_" + tagging.tag(i) + " ");

N-best Results

The following code will print the best analyses up to the maximum number of analyses MAX_N_BEST (modulo a little padding and decimal formatting):

static final int MAX_N_BEST = 5;

Iterator<ScoredTagging<String>> nBestIt = decoder.tagNBest(tokenList,MAX_N_BEST);
for (int n = 0; n < MAX_N_BEST && nBestIt.hasNext(); ++n) {
    ScoredTagging<String> scoredTagging = nBestIt.next();
    double score = scoredTagging.score();
    System.out.print(n + "   " + format(score) + "  ");
    for (int i = 0; i < tokenList.size(); ++i)
        System.out.print(scoredTagging.token(i) + "_" + pad(scoredTagging.tag(i),5));
    System.out.println();
}

Here's an example of the demo's printout for the above code, with an input that's a shortened form of the earlier one.

INPUT> This correlation was also confirmed by detection of early carcinoma.
#   JointLogProb         Analysis
0     -90.265  This_DD   correlation_NN   was_VBD  also_RR   confirmed_VVN  by_II   detection_NN   of_II   early_JJ   carcinoma_NN   ._.
1     -94.072  This_DD   correlation_NN   was_VBD  also_RR   confirmed_VVD  by_II   detection_NN   of_II   early_JJ   carcinoma_NN   ._.
2     -99.905  This_PND  correlation_NN   was_VBD  also_RR   confirmed_VVN  by_II   detection_NN   of_II   early_JJ   carcinoma_NN   ._.
3    -101.574  This_DD   correlation_NN   was_VBD  also_RR   confirmed_VVN  by_II   detection_NN   of_II   early_RR   carcinoma_NN   ._.
4    -102.253  This_DD   correlation_NN   was_VBD  also_RR   confirmed_VVN  by_II   detection_NN   of_II   early_NN   carcinoma_NN   ._.

The method HmmDecoder.tagNBest(List<String>,int) returns an iterator over the top-scoring tag sequences for the specified tokens. The iterator produces instances of LingPipe's tag.ScoredTagging class. This class simply extends a tagging with score information. The score is retrieved with the score() method, and otherwise the scored tagging works just like an ordinary tagging. The score is a joint log (base 2) probability for the tags and tokens together. Another method, HmmDecoder.tagNBestConditional(List<String>,int), returns the tag sequences in the same order but provides a conditional log (base 2) probability of the tag sequence given the tokens rather than a joint probability.

The difference between the probabilities in the n-best analyses gives a good first approximation of the tagger's confidence in its tag assignments. In the example above, note that the first-best analysis has a log (base 2) joint probability of -90.3 whereas the second-ranked analysis is at -94.1; this means that the model estimates the first answer as being 2^3.8, or roughly 14, times more likely than the second. For strings that are more confusable to the tagger, the gap will be narrower. For strings in which the tagger is highly confident of the total tagging, the gap will be wider. Further note that the only difference in the second analysis is in the form of the verb "confirmed". The third analysis, ranked as roughly 800 times less likely than the first, varies from the first only in assigning This to a pronoun rather than a determiner. Looking at the positions that vary also gives you a measure of confidence on a tag-by-tag basis. In this case, it's clear the tagger is very sure of its analysis of all but two tokens.
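The arithmetic is just exponentiating the difference of the base-2 log scores. A minimal sketch using the raw scores from the n-best output:

```java
// Converts differences of log (base 2) joint probabilities into
// likelihood ratios, using the scores from the n-best output above.
public class LogProbRatio {
    public static double ratio(double log2pA, double log2pB) {
        return Math.pow(2.0, log2pA - log2pB);
    }
    public static void main(String[] args) {
        // First-best vs. second-best analysis: about 14 times more likely.
        System.out.println(ratio(-90.265, -94.072));
        // First-best vs. third-best: roughly 800 times more likely.
        System.out.println(ratio(-90.265, -99.905));
    }
}
```

Because the scores are logs, ratios of probabilities become differences of scores, which is why eyeballing the gaps in the n-best list works as a confidence check.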

Confidence-Based Results

In addition to extracting n-best results one at a time, the entire statistical analysis can be returned in one go through the HmmDecoder.tagMarginal(List<String>) method. This method returns an instance of tag.TagLattice. For those familiar with HMM decoding, this is quite simply the lattice of forward/backward scores (including boundary conditions).

The code in the demo to print out confidences for tag assignments to individual tokens is also quite simple (again modulo formatting):

TagLattice<String> lattice = decoder.tagMarginal(tokenList);
for (int tokenIndex = 0; tokenIndex < tokenList.size(); ++tokenIndex) {
    ConditionalClassification tagScores = lattice.tokenClassification(tokenIndex);
    for (int i = 0; i < 4; ++i) {
        double conditionalProb = tagScores.score(i);
        String tag = tagScores.category(i);
        System.out.print(" " + format(conditionalProb)
                         + ":" + pad(tag,4));
    }
    System.out.println();
}

Run on our simplified demo sentence, this produces the following output, consisting of a row for each token with its top 4 tags and their conditional probabilities.

INPUT> This correlation was also confirmed by detection of early carcinoma.

#   Token               (Prob:Tag)*
0   This                0.999:DD       0.001:PND      0.000:PNG      0.000:NN
1   correlation         1.000:NN       0.000:RR       0.000:NNS      0.000:VVN
2   was                 1.000:VBD      0.000:NNS      0.000:VVZ      0.000:II
3   also                1.000:RR       0.000:PND      0.000:VVN      0.000:JJR
4   confirmed           0.933:VVN      0.067:VVD      0.000:VVNJ     0.000:VVB
5   by                  1.000:II       0.000:NN       0.000:RR       0.000:JJ
6   detection           1.000:NN       0.000:VVGN     0.000:VVI      0.000:VVB
7   of                  1.000:II       0.000:VVZ      0.000:RR       0.000:MC
8   early               0.999:JJ       0.000:RR       0.000:NN       0.000:VVGJ
9   carcinoma           1.000:NN       0.000:NNS      0.000:JJ       0.000:VVGN
10  .                   1.000:.        0.000:)        0.000:NN       0.000:,

The decoder is 99.9% sure of its estimates in all cases except the form of the verb "confirmed", for which it estimates a 93.3% probability of the tag being VVN, reserving 6.7% for the probability that it is VVD. In fact, the tagger picked up on a fundamental ambiguity of English verbs between the simple past and the past participle. This case is confusing to a bigram HMM decoder (like ours) because the previous word, also, carries the adverbial tag RR, which doesn't disambiguate; we'd need to look back to the auxiliary was, with category VBD.

Note that the ratio of probabilities from the confidence-based results (0.933/0.067 ≈ 13.9) is very close to the estimate given by inspecting the top two full analyses in the n-best results. This is no accident: the marginal confidences are what the n-best estimates converge to as n grows without bound (as opposed to stopping at the top two).

Breaking down the code, the key method for computing the confidences is the first one called, HmmDecoder.tagMarginal(List<String>):

TagLattice<String> lattice = decoder.tagMarginal(tokenList);

This returns what is known as a forward-backward lattice in the HMM decoder literature, as an instance of tag.TagLattice.

To extract a confidence-ordered list of tags for a particular token index, we use:

for (int tokenIndex = 0; tokenIndex < tokenList.size(); ++tokenIndex) {
    ConditionalClassification tagScores = lattice.tokenClassification(tokenIndex);

This returns the result as a conditional classification. This is because the result of tagging a particular token is just a classification of that token.

Given the return result, we just iterate over the tags and print them along with their scores:

    for (int i = 0; i < 4; ++i) {
        double conditionalProb = tagScores.score(i);
        String tag = tagScores.category(i);
        System.out.print(" " + format(conditionalProb) 
                         + ":" + pad(tag,4));

3. Evaluating and Tuning Tagging Models

In the final part of this tutorial, we show how to evaluate HMM part-of-speech models and how to tune their parameters.

Running a Part-of-Speech Evaluation

In this section, we show how to dump out a large, but by no means comprehensive, set of statistics on part-of-speech tagging.

Train-a-Little, Evaluate-a-Little

The way in which we evaluate is to train a little and evaluate a little. Specifically, as we parse the corpus, we extract reference taggings. We then take the underlying text, produce a response tagging with the model trained so far, and add the result to a cumulative evaluation. Only after evaluating against a reference tagging do we add it as training data, before moving on to the next sentence.
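The control flow of this protocol can be sketched with a toy stand-in model. The real evaluation uses the HMM estimator and LingPipe's tagging evaluators; here a most-frequent-tag-per-token "model" (an invention for illustration) makes the tag-then-train loop concrete:

```java
import java.util.HashMap;
import java.util.Map;

// Toy train-a-little, evaluate-a-little loop: each sentence is first
// tagged by the model-so-far and scored against the reference, then
// added as training data. A most-frequent-tag-per-token "model"
// stands in for the HMM estimator.
public class TrainEvalLoop {
    private final Map<String,Map<String,Integer>> mCounts = new HashMap<>();
    private int mCorrect = 0, mTotal = 0;

    String tag(String token) {
        Map<String,Integer> counts = mCounts.get(token);
        if (counts == null) return "NN"; // unseen-token fallback
        String best = "NN"; int bestCount = -1;
        for (Map.Entry<String,Integer> e : counts.entrySet())
            if (e.getValue() > bestCount) { best = e.getKey(); bestCount = e.getValue(); }
        return best;
    }
    void train(String token, String tag) {
        mCounts.computeIfAbsent(token, t -> new HashMap<>())
               .merge(tag, 1, Integer::sum);
    }
    public void evalThenTrain(String[] tokens, String[] refTags) {
        for (int i = 0; i < tokens.length; ++i) {   // evaluate first...
            if (tag(tokens[i]).equals(refTags[i])) ++mCorrect;
            ++mTotal;
        }
        for (int i = 0; i < tokens.length; ++i)     // ...then train
            train(tokens[i], refTags[i]);
    }
    public double accuracy() { return mCorrect / (double) mTotal; }

    public static void main(String[] args) {
        TrainEvalLoop loop = new TrainEvalLoop();
        loop.evalThenTrain(new String[]{"the","dog"}, new String[]{"DD","NN"});
        loop.evalThenTrain(new String[]{"the","cat"}, new String[]{"DD","NN"});
        System.out.println(loop.accuracy());
    }
}
```

Because every sentence is evaluated before it is trained on, the cumulative accuracy reflects performance on genuinely unseen data while still using the whole corpus.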

Running an Evaluation

For us, with the MedPost corpus in /data1/data/medtag/medpost, we run the MedPost evaluation with the following invocation of Ant:

ant -Ddata.pos.medpost=/data1/data/medtag/medpost eval-medpost

The output begins with a dump of the parameters of the evaluation:

  Sent eval rate=1
  Toks before eval=170000
  Max n-best eval=100
  Max n-gram=8
  Num chars=256
  Lambda factor=8.0

Any of these may be changed through the ant target.

It then collects data from the corpus itself in a first-pass run-through and prints it out.

  Corpus class=MedPostPosCorpus
  Tags=['', (, ), ,, ., :, CC, CC+, CS, CS+, CSN,
        CST, DB, DD, EX, GE, II, II+, JJ, JJ+,
        JJR, JJT, MC, NN, NN+, NNP, NNS, PN,
        PND, PNG, PNR, RR, RR+, RRR, RRT, SYM,
        TO, VBB, VBD, VBG, VBI, VBN, VBZ, VDB,
        VVN, VVNJ, VVZ, ``]

It first trains on 170,000 tokens (see the Toks before eval figure above). This takes two or three seconds on my desktop machine.

The rest of the output consists of evaluation case reports and cumulative evaluation reports. These are printed out per evaluation case. The following example is for the seventh evaluation sentence. First, we get the report on the first-best output from the tag() method.

Test Case 7
First Best Last Case Report
Known  Token     Reference | Response  ?correct
    In                  II    |  II
    patients            NNS   |  NNS
    with                II    |  II
    chronic             JJ    |  JJ
    pure                JJ    |  NN     XX
    red                 JJ    |  VVD    XX
    cell                NN    |  NN
  ? aplasia             NN    |  NN
    the                 DD    |  DD
    in                  JJ+   |  JJ+
    vitro               JJ    |  JJ
    study               NN    |  NN
    of                  II    |  II
    erythroid           NN    |  NN
    precursors          NNS   |  NNS
    has                 VHZ   |  VHZ
    a                   DD    |  DD
    prognostic          JJ    |  JJ
    value               NN    |  NN
    .                   .     |  .

The tokens are printed in a column on the left, with unknown tokens marked with question marks. In this case, the token "aplasia" was not seen in the training data. The next two columns contain first the reference category on the left, then the system response category on the right. System errors are marked with a double X (XX). In this case, both the tokens "pure" and "red" are assigned the wrong category, with "pure" being assigned to the common noun category (NN) instead of the adjective category (JJ) and "red" being tagged with a verb category (VVD) instead of the adjective category (JJ).

Next, we get the n-best output analysis, which shows the top N results and marks the correct one, if it is on the list, with three asterisks (***).

N-Best Last Case Report
Last case n-best reference rank=3
Last case 5-best:
     0    -214.424  In_II   patients_NNS  with_II   chronic_JJ   pure_NN   red_VVD  cell_NN   aplasia_NN   the_DD   in_JJ+  vitro_JJ   study_NN   of_II   erythroid_NN   precursors_NNS  has_VHZ  a_DD   prognostic_JJ   value_NN   ._.
     1    -214.544  In_II   patients_NNS  with_II   chronic_JJ   pure_JJ   red_VVNJ cell_NN   aplasia_NN   the_DD   in_JJ+  vitro_JJ   study_NN   of_II   erythroid_NN   precursors_NNS  has_VHZ  a_DD   prognostic_JJ   value_NN   ._.
     2    -214.837  In_II   patients_NNS  with_II   chronic_JJ   pure_NN   red_VVNJ cell_NN   aplasia_NN   the_DD   in_JJ+  vitro_JJ   study_NN   of_II   erythroid_NN   precursors_NNS  has_VHZ  a_DD   prognostic_JJ   value_NN   ._.
 *** 3    -215.299  In_II   patients_NNS  with_II   chronic_JJ   pure_JJ   red_JJ   cell_NN   aplasia_NN   the_DD   in_JJ+  vitro_JJ   study_NN   of_II   erythroid_NN   precursors_NNS  has_VHZ  a_DD   prognostic_JJ   value_NN   ._.
     4    -216.364  In_II   patients_NNS  with_II   chronic_JJ   pure_NN   red_JJ   cell_NN   aplasia_NN   the_DD   in_JJ+  vitro_JJ   study_NN   of_II   erythroid_NN   precursors_NNS  has_VHZ  a_DD   prognostic_JJ   value_NN   ._.

Here the top 5 are reported. For each of the top results, we see its rank (counting from zero, so the ranks run from 0 to 4). Three asterisks appear in front of the correct analysis, if it's on the list; here, the rank 3 (or 4th best) result is the correct one. Next, we see log joint probabilities. In this case, the top few answers have very close probabilities: the best result, at -214.4 log (base 2) joint probability, is only a factor of four more likely than the fifth-best result at -216.4. Finally, we see the tokens, each followed by an underscore and its tag. It's exactly in the places where the first-best analysis made an error, the analyses of "pure" and "red", that we see the uncertainty.

The log probabilities could be normalized to conditional probabilities by using the alternative evaluation method for n-best conditional outputs.

Next, we get an evaluation of the marginal tags assigned, as follows:

Marginal Last Case Report
Index Token  RefTag  (Prob:ResponseTag)*
0   In             II           0.907:II  *      0.063:NN         0.011:JJ         0.009:VVNJ       0.005:CS
1   patients       NNS          1.000:NNS *      0.000:VVZ        0.000:NN         0.000:JJ         0.000:VVNJ
2   with           II           1.000:II  *      0.000:NN         0.000:VVGN       0.000:JJ         0.000:RR
3   chronic        JJ           0.999:JJ  *      0.001:NN         0.000:RR         0.000:VVGJ       0.000:VVNJ
4   pure           JJ           0.553:NN         0.442:JJ  *      0.002:NNS        0.001:VVGN       0.001:RR
5   red            JJ           0.361:VVNJ       0.255:VVD        0.218:JJ  *      0.105:NN         0.029:VVN
6   cell           NN           0.977:NN  *      0.022:JJ         0.000:VVNJ       0.000:RR         0.000:VVGJ
7   aplasia        NN           0.987:NN  *      0.011:NNS        0.001:VVZ        0.000:VVD        0.000:VVGN
8   the            DD           0.931:DD  *      0.024:NN         0.016:II+        0.011:PND        0.005:NNS
9   in             JJ+          0.899:JJ+ *      0.046:II         0.025:NN         0.015:JJ         0.005:VVNJ
10  vitro          JJ           0.992:JJ  *      0.008:RR         0.000:VVNJ       0.000:NN         0.000:VVZ
11  study          NN           0.996:NN  *      0.003:VVI        0.001:VVGN       0.000:NNS        0.000:RR
12  of             II           0.999:II  *      0.000:VVZ        0.000:NN         0.000:MC         0.000:VVGN
13  erythroid      NN           1.000:NN  *      0.000:VVNJ       0.000:NNS        0.000:JJ         0.000:VVGJ
14  precursors     NNS          1.000:NNS *      0.000:NN         0.000:VVGN       0.000:VVZ        0.000:JJ
15  has            VHZ          0.990:VHZ *      0.004:CS         0.002:VVZ        0.001:II         0.001:CSN
16  a              DD           0.991:DD  *      0.005:RR         0.001:VVB        0.001:JJ         0.001:VVN
17  prognostic     JJ           0.993:JJ  *      0.007:NN         0.000:VVNJ       0.000:VVI        0.000:VVGJ
18  value          NN           0.998:NN  *      0.001:JJ         0.001:NNS        0.000:VVGN       0.000:NNP
19  .              .            1.000:.   *      0.000:)          0.000:''         0.000:(          0.000:,

Here we have the index of the token in the first column, with the token itself in the next column. Then we have the reference tag. For instance, the second token has index 1, token "patients" and reference category NNS. Then we get a ranked list of possible categories for each token, with a model-based estimate of the conditional probability of the listed tag for the token given the entire input. For instance, we see that the 7th token, "cell", has a 0.977 probability of being a common noun (NN), and a 0.022 chance of being an adjective (JJ).

There is an asterisk (*) after the correct category if it is listed. Here, all of the highest ranked categories are correct other than for "pure" and "red". Note the high uncertainty of the model in those categories; for "red", the model estimates only a 0.361 probability it is a VVNJ, and reserves a 0.218 chance that it's of the correct category, JJ.

The values of the marginal probabilities are just the normalized sum of the probabilities in the n-best analysis. That means that uncertainty in the n-best analysis is reflected as uncertainty in the marginal tag probabilities and vice-versa.
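This relationship can be checked numerically: exponentiate the n-best log (base 2) scores, normalize, and sum the mass of the analyses assigning each tag to a token. The sketch below does this for "red" using only the five analyses shown above, so it approximates the full marginals rather than reproducing them exactly:

```java
import java.util.HashMap;
import java.util.Map;

// Approximates the marginal tag distribution for one token by
// normalizing n-best log (base 2) scores and summing the mass of the
// analyses that assign each tag. Exact only in the limit of all n;
// here we truncate to the top 5 analyses from the evaluation report.
public class NBestMarginal {
    public static Map<String,Double> marginal(String[] tags, double[] log2Scores) {
        double[] probs = new double[log2Scores.length];
        double sum = 0.0;
        for (int i = 0; i < log2Scores.length; ++i) {
            probs[i] = Math.pow(2.0, log2Scores[i]);
            sum += probs[i];
        }
        Map<String,Double> marg = new HashMap<>();
        for (int i = 0; i < tags.length; ++i)
            marg.merge(tags[i], probs[i] / sum, Double::sum);
        return marg;
    }
    public static void main(String[] args) {
        // The tag of "red" in each of the five analyses shown above.
        String[] redTags = { "VVD", "VVNJ", "VVNJ", "JJ", "JJ" };
        double[] scores = { -214.424, -214.544, -214.837, -215.299, -216.364 };
        System.out.println(marginal(redTags, scores));
    }
}
```

Even truncated to five analyses, VVNJ comes out ahead of VVD for "red", matching the ordering in the marginal report, because two distinct analyses contribute mass to it.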

Note that it is not always the case that the most likely tag for a token is returned in the first-best analysis. For instance, the most likely category for the token "red" is VVNJ, but the first best analysis assigns it to VVD. This is possible because the n-best analyses involve whole sequence probabilities, not the marginalization to a single category. So while VVNJ may be most likely overall, the single best analysis involves VVD, presumably because it makes a better sequence with the preceding noun assignment. Also note that the second and third best analyses, which have probabilities very close to the first-best analysis, both assign the most likely category, VVNJ.

After the dump of a result for a sentence, a running cumulative total is provided for number of training sentences and tokens, and overall accuracy and accuracy restricted to tokens not seen in the training data:

Cumulative Evaluation
    Estimator:  #Train Cases=6240 #Train Toks=170178
    First Best Accuracy (All Tokens) = 176/184 = 0.9565217391304348
    First Best Accuracy (Unknown Tokens) = 11/13 = 0.8461538461538461

It is also possible to generate more results, such as the accuracy of the first-best guesses for each token instead of the best sequence, and the accuracy evaluated at a whole sentence level.

The Evaluation Command and Ant Task

The evaluation command may be run using the following ant task, drawn from the eval-medpost target in the ant build file build.xml:

<target name="eval-medpost">
  <java classname="EvaluatePos">
    <jvmarg value="-server"/>
    <classpath refid="classpath.standard"/>
    <arg value="1"/>                 <!-- sent eval rate -->
    <arg value="170000"/>            <!-- toks before eval -->
    <arg value="100"/>               <!-- max n-best -->
    <arg value="8"/>                 <!-- n-gram size -->
    <arg value="256"/>               <!-- num characters -->
    <arg value="8.0"/>               <!-- interpolate ratio -->
    <arg value="MedPostPosCorpus"/>  <!-- corpus impl class -->
    <arg value="${data.pos.medpost}"/>    <!-- baseline dir for data -->
  </java>
</target>
The arguments are all required and simply supplied in order. The first argument is the frequency with which to evaluate sentences; the value of 1 means every sentence. The second is the number of tokens to use for training before evaluating the first sentence, in this case 170,000. The third is the maximum number of n-best results to evaluate, here 100. The fourth is the size of n-gram to use, in this case 8. The fifth is the number of characters in the training and test data, in this case a conservative estimate of 256. The sixth, 8.0, is the language model interpolation factor. Tweaking the last three model parameters (n-gram size, number of characters, and interpolation factor) will affect the performance of the tagger, and this task is defined to show you how to do that. The final two arguments pick out the name of the corpus implementation class and the directory in which the corpus lives. We include the property value ${data.pos.medpost} in order to allow users to set it in a properties file or on the command line. We could have specified the other values the same way to allow an external caller to set them, or the command can be pulled out of Ant and run standalone.

Corpus Parsing Interface and Implementations

Before turning to the code for the evaluation, we first pause to abstract the features of our corpus into a general interface with two methods:

public interface PosCorpus {
    public Parser<ObjectHandler<Tagging<String>>> parser();
    public Iterator<InputSource> sourceIterator() throws IOException;
}

The first method returns the parser for a corpus. The second method returns an iterator over input sources and may throw an I/O exception if it gets in trouble on the I/O front. The code can be found in src/PosCorpus.java.

We provide three implementations, one for each corpus described above.

Genia POS Corpus Parser

The simplest is the GENIA corpus, because it only involves reading a single input source from a file. The following code for doing this is drawn from src/GeniaPosCorpus.java:

public class GeniaPosCorpus implements PosCorpus {
    private final File mGeniaGZipFile;
    public GeniaPosCorpus(File geniaZipFile) {
        mGeniaGZipFile = geniaZipFile;
    }
    public Iterator<InputSource> sourceIterator() throws IOException {
        FileInputStream fileIn = new FileInputStream(mGeniaGZipFile);
        InputSource in = new InputSource(fileIn);
        return Iterators.singleton(in);
    }
    public Parser<ObjectHandler<Tagging<String>>> parser() {
        return new GeniaPosParser();
    }
}

The return is through a LingPipe utility for singleton iterators in com.aliasi.util.Iterators. Singleton iterators return a single item once, just as if iterating over a singleton (one element) set. The parser method simply returns an instance of the GENIA corpus part-of-speech parser.

Note that an instance is constructed from a single file, which provides the basis for a relative location of the corpus for all of the implementations. Also note that nothing ever closes the input source. This problem would have to be fixed for a robust implementation through a more sophisticated iterator implementation that knows when it's done and can close the input streams, or by reading the file into memory and closing it before wrapping it as an input source and providing it as a singleton iterator.
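As a sketch of the in-memory alternative, using only JDK classes (the class name and demo file below are hypothetical, not part of the tutorial code), we can read the file fully, let the stream close, and only then wrap the bytes as an input source:

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Collections;
import java.util.Iterator;
import org.xml.sax.InputSource;

// Hypothetical sketch: read the corpus file fully into memory so the
// underlying file stream is closed before the input source is handed out.
public class InMemorySource {
    public static Iterator<InputSource> sourceIterator(File file)
        throws IOException {
        byte[] bytes = Files.readAllBytes(file.toPath()); // opens and closes the file
        InputSource in = new InputSource(new ByteArrayInputStream(bytes));
        return Collections.singleton(in).iterator();      // JDK singleton iterator
    }
    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("corpus", ".txt");
        Files.write(tmp.toPath(), "John/np ran/vbd".getBytes("UTF-8"));
        Iterator<InputSource> it = sourceIterator(tmp);
        System.out.println(it.hasNext());
        it.next();
        System.out.println(it.hasNext());
    }
}
```

The trade-off is memory: this is fine for a single corpus file, but not for arbitrarily large inputs.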

MedPost POS Parser

The MedPost corpus consists of a directory of files, each of which is simply in a text-based format. The non-trivial bit of the implementation of src/MedPostPosCorpus.java is the iteration over input sources:

public Iterator<InputSource> sourceIterator() {
    return new MedPostSourceIterator(mMedPostDir);
}

public static class MedPostSourceIterator
    extends Iterators.Buffered<InputSource> {

    private final File[] mFiles;
    private int mNextFileIndex = 0;
    public MedPostSourceIterator(File medPostDir) {
        mFiles
            = medPostDir
             .listFiles(new FileExtensionFilter("ioc"));
    }
    public InputSource bufferNext() {
        if (mNextFileIndex >= mFiles.length) return null;
        try {
            File file = mFiles[mNextFileIndex++];
            String url = file.toURI().toURL().toString();
            return new InputSource(url);
        } catch (IOException e) {
            return null;
        }
    }
}
Here the iterator stores an array of files that end in the suffix "ioc"; these are selected using the LingPipe utility io.FileExtensionFilter. We then keep the variable mNextFileIndex as a pointer to the next file to return through the iterator. The iterator itself is implemented by extending the very handy utility class util.Iterators.Buffered. This abstract class encapsulates the tricky bits of the has-next and next logic of iterators in a single method bufferNext(). This allows implementations to concentrate on returning the next object in the iteration rather than the logic of buffering and returning has-next information. A return of null indicates to the buffered iterator that there are no more elements. To return an element as an input source, the name of the file is converted to a URL via File.toURI().toURL().
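The buffering pattern itself is easy to reproduce with plain JDK classes. The following self-contained sketch is our own stand-in for util.Iterators.Buffered (not LingPipe's actual code), showing how a single bufferNext() drives both hasNext() and next():

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Sketch of the buffering pattern: subclasses implement bufferNext(),
// returning null when exhausted; hasNext()/next() fall out automatically.
abstract class Buffered<E> implements Iterator<E> {
    private E mBuffer = null;
    protected abstract E bufferNext();
    public boolean hasNext() {
        if (mBuffer == null) mBuffer = bufferNext();
        return mBuffer != null;
    }
    public E next() {
        if (!hasNext()) throw new NoSuchElementException();
        E result = mBuffer;
        mBuffer = null;   // consume the buffered element
        return result;
    }
    public void remove() { throw new UnsupportedOperationException(); }
}

public class BufferedDemo {
    public static void main(String[] args) {
        final Iterator<String> base
            = Arrays.asList("a.ioc", "b.txt", "c.ioc").iterator();
        // only pass through names ending in ".ioc", like the MedPost filter
        Iterator<String> filtered = new Buffered<String>() {
            protected String bufferNext() {
                while (base.hasNext()) {
                    String name = base.next();
                    if (name.endsWith(".ioc")) return name;
                }
                return null;   // signals end of iteration
            }
        };
        while (filtered.hasNext())
            System.out.println(filtered.next());
    }
}
```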

Brown Corpus POS Parser

The Brown corpus, as distributed with the Natural Language Tool Kit (NLTK), is in yet another format -- a zipped directory of files. Zip files are a very nice way to pack a lot of files because Java supports their unpacking. In this way, they're a better choice than the standard unix combination of tar and gzip. Most of src/BrownPosCorpus.java is just like the previous classes, with the following source iterator:

static class BrownSourceIterator
    extends Iterators.Buffered<InputSource> {

    private ZipInputStream mZipIn = null;

    public BrownSourceIterator(File brownZipFile)
        throws IOException {

        FileInputStream fileIn
            = new FileInputStream(brownZipFile);
        mZipIn = new ZipInputStream(fileIn);
    }
    public InputSource bufferNext() {
        ZipEntry entry = null;
        try {
            while ((entry = mZipIn.getNextEntry())
                   != null) {
                if (entry.isDirectory()) continue;
                String name = entry.getName();
                if (name.equals("brown/CONTENTS")
                    || name.equals("brown/README"))
                    continue;
                return new InputSource(mZipIn);
            }
        } catch (IOException e) {
            // fall through on purpose
        }
        return null;
    }
}

Here the file input stream is wrapped in a zip input stream. To get the actual input sources, we extract the files in the input stream one by one. To do this, we use the zip iteration method getNextEntry(). If the entry is not a directory and does not share a name with one of the non-data read-me files, then the whole input stream is wrapped in an input source and passed to the iterator to return. The zip input stream provides the actual bytes and will have an end-of-stream marker that is only reset after the next call to getNextEntry().

The right way to close the zip input stream would be in a larger try/finally block, but we kept it simple for sake of readability here.
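Here is a self-contained sketch of that pattern using only java.util.zip, with the close in a finally block as suggested; the zip is built in memory so the example runs standalone (the entry names and content are made up):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Sketch: iterate zip entries, skipping directories, closing the stream
// in a finally block rather than leaving it open.
public class ZipWalk {
    public static void main(String[] args) throws IOException {
        // build a tiny zip in memory for the demo
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        ZipOutputStream zipOut = new ZipOutputStream(bytesOut);
        zipOut.putNextEntry(new ZipEntry("brown/"));      // directory entry
        zipOut.closeEntry();
        zipOut.putNextEntry(new ZipEntry("brown/ca01"));  // data entry
        zipOut.write("The/at dog/nn".getBytes("UTF-8"));
        zipOut.closeEntry();
        zipOut.close();

        ZipInputStream zipIn
            = new ZipInputStream(new ByteArrayInputStream(bytesOut.toByteArray()));
        try {
            ZipEntry entry;
            while ((entry = zipIn.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;   // skip "brown/"
                System.out.println(entry.getName());
            }
        } finally {
            zipIn.close();   // the larger try/finally the text recommends
        }
    }
}
```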

The Evaluation Code

The Parser/Handler Pattern

As evidenced by the command invocation in the last section, the top-level main(String[]) method is located in src/EvaluatePos.java, and it's quite simple:

public static void main(String[] args)
    throws Exception {

    new EvaluatePos(args).run();
}

It just constructs an EvaluatePos object out of the command-line arguments and runs it. The constructor merely sets a bunch of local variables given the arguments:

    public EvaluatePos(String[] args) throws Exception {
        mSentEvalRate = Integer.valueOf(args[0]);
        mToksBeforeEval = Integer.valueOf(args[1]);
        mMaxNBest = Integer.valueOf(args[2]);
        mNGram = Integer.valueOf(args[3]);
        mNumChars = Integer.valueOf(args[4]);
        mLambdaFactor = Double.valueOf(args[5]);
        String constructorName = args[6];
        File corpusFile = new File(args[7]);
        Object[] consArgs = new Object[] { corpusFile };
        @SuppressWarnings("rawtypes") // req for cast
        PosCorpus corpus
            = (PosCorpus)
            Class.forName(constructorName)
            .getConstructor(new Class[] { File.class })
            .newInstance(consArgs);
        mCorpus = corpus;
    }

The use of reflection in constructing the corpus throws a range of exceptions (see the documentation), but we have just thrown a single Exception, which is sloppy but convenient.
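The reflective construction boils down to Class.forName plus getConstructor plus newInstance. The following runnable sketch applies the same recipe to a stock JDK class, with java.io.File standing in for a corpus class such as MedPostPosCorpus (the class name and argument here are stand-ins, not the tutorial's actual values):

```java
import java.lang.reflect.Constructor;

// Sketch of the reflective construction used for the corpus, applied to
// a JDK class so it runs standalone.
public class ReflectDemo {
    public static void main(String[] args) throws Exception {
        // in EvaluatePos the class name comes from the command line (args[6])
        String className = "java.io.File";   // stand-in for a corpus class
        Constructor<?> cons
            = Class.forName(className)
            .getConstructor(new Class[] { String.class });
        Object obj = cons.newInstance(new Object[] { "data/medtag" });
        System.out.println(obj.getClass().getName());
    }
}
```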

The real action begins in the run() method, which begins by printing out the parameters, then setting up the corpus profile.

    void run() throws IOException {
        ... // prints

and then sets up the corpus profile:

        CorpusProfileHandler profileHandler = new CorpusProfileHandler();

The profile handler inner class is worth noting merely as a simple example of what can be done with LingPipe's handler framework:

class CorpusProfileHandler implements ObjectHandler<Tagging<String>> {
    public void handle(Tagging<String> tagging) {
        mTrainingTokenCount += tagging.size();
        for (int i = 0; i < tagging.size(); ++i)
            mTagSet.add(tagging.tag(i));
    }
}

The parseCorpus(ObjectHandler<Tagging<String>>) method is a utility that simply parses the corpus by iterating through the input sources and applying the parser to them:

void parseCorpus(ObjectHandler<Tagging<String>> handler) throws IOException {
    Parser<ObjectHandler<Tagging<String>>> parser = mCorpus.parser();
    parser.setHandler(handler);
    Iterator<InputSource> it = mCorpus.sourceIterator();
    while (it.hasNext()) {
        InputSource in = it.next();
        parser.parse(in);
    }
}

Because the handler is not static, it is able to manipulate member variables for counting in the EvaluatePos class. Thus when we're done with the profile handler, we have access to the tag set back in the run() method:

        String[] tags = mTagSet.toArray(Strings.EMPTY_STRING_ARRAY);
        Set<String> tagSet = new HashSet<String>();
        for (String tag : tags)
            tagSet.add(tag);

Next, we create the HMM estimator and make sure it knows about all the tags up front:

        mEstimator
            = new HmmCharLmEstimator(mNGram,mNumChars,mLambdaFactor);
        for (int i = 0; i < tags.length; ++i)
            mEstimator.addState(tags[i]);

Recall the mCorpus variable is set to the relevant implementation of PosCorpus. The corpus supplies a parser through its parser() method and an iterator over sources through its sourceIterator() method. For MedPost, these return an instance of MedPostPosParser and an iterator over the input sources over the input files.

Parsing the corpus with the corpus profile handler simply records the number of training sentences and tokens and collects the set of tags.

We next set up the decoder based on the HMM and all the evaluators:

        HmmDecoder decoder
            = new HmmDecoder(mEstimator); // no caching
        boolean storeTokens = true;
        mTaggerEvaluator
            = new TaggerEvaluator<String>(decoder,storeTokens);
        mNBestTaggerEvaluator
            = new NBestTaggerEvaluator<String>(decoder,mMaxNBest,mMaxNBest);
        mMarginalTaggerEvaluator
            = new MarginalTaggerEvaluator<String>(decoder,tagSet,storeTokens);

Note that we have three different evaluators: one for first best, one for n-best, and one for marginal tags. They take arguments specifying the n-best sizes to find and to report, whether or not to store tokens in the evaluation, and so on.

The estimator and evaluators are all handlers, implementing corpus.ObjectHandler<Tagging<String>>. They are assigned to member variables which will be available when the learning curve handler is run in the final statements of EvaluatePos's run() method:

        LearningCurveHandler evaluationHandler
            = new LearningCurveHandler();
        parseCorpus(evaluationHandler);

The actual work is all done by the learning curve handler, which we describe next. After we've visited the whole corpus, we provide a final report of n-best and token results.


The final token evaluation is reported with total counts, correct counts, accuracies, and a confusion matrix for errors:

First Best Evaluation
Total Count=12385
Total Correct=11923
Total Accuracy=0.9626968106580541
95% Confidence Interval=0.9626968106580541 +/- 0.003337534892266179
Confusion Matrix
reference \ response
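The reported 95% interval appears to be the usual normal approximation, p &plusmn; 1.96 &times; sqrt(p(1-p)/n), over token-level accuracy; plugging in the counts above reproduces the reported half-width (this recomputation is ours, not code from the tutorial):

```java
// Recompute the accuracy and 95% interval half-width from the counts above.
public class ConfInterval {
    public static void main(String[] args) {
        int correct = 11923;   // Total Correct from the report
        int total = 12385;     // Total Count
        double p = correct / (double) total;
        double halfWidth = 1.96 * Math.sqrt(p * (1.0 - p) / total);
        System.out.printf(java.util.Locale.US, "%.4f%n", p);          // accuracy
        System.out.printf(java.util.Locale.US, "%.4f%n", halfWidth);  // +/- term
    }
}
```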

After some not-so-useful results for a tagging problem, there is a category-by-category report of one-versus-all behavior. For instance, for the common noun category we have:

First-Best Precision/Recall Evaluation
  True Positive=2864
  False Negative=104
  False Positive=146
  True Negative=9271
  Positive Reference=2968
  Positive Response=3010
  Negative Reference=9417
  Negative Response=9375
  Rejection Recall=0.9844961240310077
  Rejection Precision=0.9889066666666667

Here we see that there were 2968 common nouns in the reference, of which the system found 2864 (true positives), for a recall of 0.965. There were 104 instances where the system assigned the wrong category to a common noun (false negatives), and 146 instances that were other categories in the reference but assigned to common noun by mistake (false positives). We also see that the specificity is 0.984 (here reported as rejection recall); sensitivity is just recall, which is 0.965.
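These derived statistics follow directly from the four counts; the following sketch (our own recomputation, not tutorial code) reproduces the figures from the report above:

```java
// Recompute one-versus-all statistics from the common-noun report above.
public class OneVersusAll {
    public static void main(String[] args) {
        int tp = 2864, fn = 104, fp = 146, tn = 9271;
        java.util.Locale us = java.util.Locale.US;
        System.out.printf(us, "recall=%.4f%n", tp / (double) (tp + fn));
        System.out.printf(us, "precision=%.4f%n", tp / (double) (tp + fp));
        System.out.printf(us, "rejectionRecall=%.4f%n", tn / (double) (tn + fp));    // specificity
        System.out.printf(us, "rejectionPrecision=%.4f%n", tn / (double) (tn + fn));
    }
}
```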

If we go back to the confusion matrix report, we can see what happened for common nouns:

reference \ response

For instance, in one case, a token marked NN in the reference is assigned JJ+, in one other case, a token marked NN in the reference was assigned to VVGJ, and 39 times an NN token was erroneously labeled JJ (as we saw in our earlier example).

The final dump is of the n-best histogram from the n-best evaluation:

N Best Evaluation

This list provides counts of the number of times the n-best result had the correct answer at the specified rank. For instance, 0=227 says that for 227 of the sentences, the first-best answer was completely correct. The line 1=52 indicates that for 52 sentences, the second-best result was correct, and 3=11 indicates that for 11 sentences, the fourth best result was correct. The line -1=75 indicates that for 75 of the sentences, the correct response was not on the n-best list.
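The histogram converts straightforwardly into top-n accuracies. The sketch below uses only the ranks quoted in the text, so the total and the resulting percentages are illustrative rather than the full evaluation's numbers:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: turn an n-best rank histogram into top-n accuracies.
public class NBestHistogram {
    public static void main(String[] args) {
        Map<Integer,Integer> hist = new TreeMap<Integer,Integer>();
        hist.put(0, 227);   // correct answer ranked first
        hist.put(1, 52);    // correct answer ranked second
        hist.put(3, 11);    // correct answer ranked fourth
        hist.put(-1, 75);   // correct answer off the n-best list
        int total = 0;
        for (int count : hist.values()) total += count;
        System.out.println(total);
        System.out.printf(java.util.Locale.US, "%.3f%n",
                          hist.get(0) / (double) total);                  // first-best
        System.out.printf(java.util.Locale.US, "%.3f%n",
                          (hist.get(0) + hist.get(1)) / (double) total);  // within top 2
    }
}
```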

The Learning Curve Handler

The learning curve handler class is an inner class in src/EvaluatePos.java.

    class LearningCurveHandler implements ObjectHandler<Tagging<String>> {
        Set<String> mKnownTokenSet = new HashSet<String>();
        int mUnknownTokensTotal = 0;
        int mUnknownTokensCorrect = 0;

Note that the class keeps track of the known tokens, and keeps running totals for unknown token accuracy. Most of the work's in the handle() method implementing the corpus.ObjectHandler<Tagging<String>> interface:

        public void handle(Tagging<String> tagging) {
            if (mEstimator.numTrainingTokens() > mToksBeforeEval
                && mEstimator.numTrainingCases() % mSentEvalRate == 0) {


The handle method checks that we have seen enough training tokens to begin evaluating, and then only considers every mSentEvalRate-th sentence. If the sentence is to be evaluated, the body gets called. Inside the evaluation block, we simply create the reference tagging and then send it to the three evaluators to handle.

After handling the case, we print out the various reports:

                System.out.println("\nTest Case "
                                   + mTaggerEvaluator.numCases());

                System.out.println("First Best Last Case Report");

                System.out.println("N-Best Last Case Report");

                System.out.println("Marginal Last Case Report");

                System.out.println("Cumulative Evaluation");
                System.out.print("    Estimator:  #Train Cases="
                                 + mEstimator.numTrainingCases());

                System.out.println(" #Train Toks="
                                   + mEstimator.numTrainingTokens());

                ConfusionMatrix tokenEval = mTaggerEvaluator.tokenEval().confusionMatrix();
                System.out.println("    First Best Accuracy (All Tokens) = "
                                   + tokenEval.totalCorrect() 
                                   + "/" + tokenEval.totalCount()
                                   + " = " + tokenEval.totalAccuracy());


After these dumps, we calculate and print the unknown token evaluation directly:

                ConfusionMatrix unkTokenEval = mTaggerEvaluator.unknownTokenEval(mKnownTokenSet).confusionMatrix();
                mUnknownTokensTotal += unkTokenEval.totalCount();
                mUnknownTokensCorrect += unkTokenEval.totalCorrect();
                System.out.println("    First Best Accuracy (Unknown Tokens) = "
                                   + mUnknownTokensCorrect
                                   + "/" + mUnknownTokensTotal
                                   + " = " + (mUnknownTokensCorrect/(double)mUnknownTokensTotal));

This'd be more easily done once and for all if we weren't doing an online evaluation.

In all cases, the estimator is trained after the sentence is evaluated, and its tokens added to the known token set. Repeating the if/then structure, we have:

        public void handle(Tagging<String> tagging) {
            if (mEstimator.numTrainingTokens() > mToksBeforeEval
                && mEstimator.numTrainingCases() % mSentEvalRate == 0) {
                ... // evaluate and report as above
            }
            // train after eval
            mEstimator.handle(tagging);
            for (int i = 0; i < tagging.size(); ++i)
                mKnownTokenSet.add(tagging.token(i));
        }

In this sense, the evaluation is online and provides a learning curve. To get a more realistic learning curve, you should start before 170,000 tokens and only evaluate every 10th sentence or so thereafter to help with speed (or evaluate them all for more accuracy, but it'll take a few minutes).

Noun and Verb Chunking

Part-of-speech tagging is often used as the basis for extracting higher-level structure, such as phrases. In this section, we show how to create a chunk.Chunker implementation that finds noun and verb chunks based on part-of-speech tags.

Tags to Chunks

The usual method of deriving chunks from underlying tags is to define a pattern of tags that constitutes a chunk. In this tutorial, we will consider very simple patterns, but the same technique would apply to more complex patterns.

We will employ the Brown corpus part-of-speech tagger for English as the basis of our phrasal chunker. See the corpus.parsers.BrownPosParser for explanations of all of the categories and links to the original documentation.

Defining Noun Chunks

We create noun and verb patterns the same way, with a set of possible initial categories and a set of possible continuation categories. Noun chunks may start with determiners, adjectives, common nouns or pronouns. They may be continued with any category that may start a noun chunk, and also by adverbs or punctuation.

These sets are defined statically; here is a fragment of the set of determiner tags:

static final Set<String> DETERMINER_TAGS = new HashSet<String>();

static {
    DETERMINER_TAGS.add("at");   // article, e.g. the/at
    ... // further determiner tags elided
}
The start tags and continuation tags are defined similarly:



Defining Verb Chunks

We allow verbs to start with verbs, auxiliaries, or adverbs; they may be continued with any of these tags, or with punctuation.
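The start/continue logic can be illustrated standalone before turning to the real implementation. The tags and tag sets below are simplified stand-ins for the tutorial's full Brown-tag sets, and the class is a toy of ours, not PhraseChunker itself:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy sketch of tag-pattern chunking: greedily extend a chunk from a
// start tag through continuation tags. The sets here are illustrative,
// much smaller than the real START_NOUN_TAGS / CONTINUE_NOUN_TAGS.
public class TagPatternDemo {
    static final Set<String> START
        = new HashSet<String>(Arrays.asList("at", "jj", "nn"));
    static final Set<String> CONTINUE
        = new HashSet<String>(Arrays.asList("at", "jj", "nn", "rb"));

    public static void main(String[] args) {
        String[] tokens = { "the", "old", "dog", "slept", "soundly" };
        String[] tags   = { "at",  "jj",  "nn",  "vbd",   "rb" };
        for (int i = 0; i < tags.length; ) {
            if (!START.contains(tags[i])) { ++i; continue; }
            StringBuilder chunk = new StringBuilder(tokens[i]);
            for (++i; i < tags.length && CONTINUE.contains(tags[i]); ++i)
                chunk.append(' ').append(tokens[i]);
            System.out.println(chunk);
        }
    }
}
```

Note that soundly/rb is never chunked: rb may continue a chunk but not start one.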

The Chunker Implementation

We provide an implementation of the chunk.Chunker interface in src/PhraseChunker.java. Note that this is the same interface we use for named entity and other chunkers; see the Named Entity Tutorial for more information.

The constructor simply stores a part-of-speech tagger (in the form of an HMM decoder) along with a tokenizer factory:

private final HmmDecoder mPosTagger;
private final TokenizerFactory mTokenizerFactory;

public PhraseChunker(HmmDecoder posTagger,
                     TokenizerFactory tokenizerFactory) {
    mPosTagger = posTagger;
    mTokenizerFactory = tokenizerFactory;
}
The chunk method is implemented in several stages. The first step is to tokenize the input and compute the part-of-speech tags using the decoder:

public Chunking chunk(char[] cs, int start, int end) {
    // tokenize
    List<String> tokenList = new ArrayList<String>();
    List<String> whiteList = new ArrayList<String>();
    Tokenizer tokenizer = mTokenizerFactory.tokenizer(cs,start,end-start);
    tokenizer.tokenize(tokenList,whiteList);

    String[] tokens
        = tokenList.<String>toArray(new String[tokenList.size()]);
    String[] whites
        = whiteList.<String>toArray(new String[whiteList.size()]);

    // part-of-speech tag
    Tagging<String> tagging = mPosTagger.tag(tokenList);

Next, we walk over the tags, keeping track of the positions of the chunks, and waiting for the start of a noun or a verb. Skeletally, this looks like:

        ChunkingImpl chunking = new ChunkingImpl(cs,start,end);
        int startChunk = 0;
        for (int i = 0; i < tagging.size(); ) {
            startChunk += whites[i].length();

            if (START_NOUN_TAGS.contains(tagging.tag(i))) {
                // extend noun to completion and add

            } else if (START_VERB_TAGS.contains(tagging.tag(i))) {
                // extend verb to completion and add

            } else {
                startChunk += tokens[i].length();
                ++i;
            }
        }
        return chunking;
    }

The real work is done in the elided blocks above. We only consider the noun case, as the verb case is structurally identical.

                // extend noun to completion and add
                int endChunk = startChunk + tokens[i].length();
                ++i;
                while (i < tokens.length && CONTINUE_NOUN_TAGS.contains(tagging.tag(i))) {
                    endChunk += whites[i].length() + tokens[i].length();
                    ++i;
                }

Here, once we find the start of the noun at index i, we track where it starts (always on the first character of a token, not on whitespace). We then extend it one token at a time if the corresponding tag is a legal noun continuation. All the while, we keep track of the end position and the overall index.

Once we have a chunk, we work backward peeling off any final punctuation. We define a new trimmed end chunk variable and update it going backward. If the whole thing turns out to be punctuation (shouldn't actually happen), then we ignore the resulting chunk.

                int trimmedEndChunk = endChunk;
                for (int k = i;
                     --k >= 0 && PUNCTUATION_TAGS.contains(tagging.tag(k)); ) {
                    trimmedEndChunk -= (whites[k].length() + tokens[k].length());
                }
                if (startChunk >= trimmedEndChunk) {
                    startChunk = endChunk;
                    continue;
                }

Otherwise, we use the chunk factory to create a new chunk, add it to our chunking, and update our position tracking variable startChunk.

                Chunk chunk
                    = ChunkFactory.createChunk(startChunk,trimmedEndChunk,"noun");
                chunking.add(chunk);
                startChunk = endChunk;

Running the Program

We have implemented a main() method in PhraseChunker.java to allow the chunker to be tested from the command line. This may be run using the ant target phrases in the ant build.xml file:

> cd $LINGPIPE/demos/tutorial/posTags
> ant phrases

Buildfile: build.xml



After months of coy hints, Prime Minister Tony Blair made the announcement today as part of a closely choreographed and protracted farewell.
  noun(6,12) months
  noun(16,25) coy hints
  noun(27,52) Prime Minister Tony Blair
  verb(53,57) made
  noun(58,80) the announcement today
  noun(84,88) part
  noun(92,101) a closely
  verb(102,115) choreographed
  verb(120,130) protracted
  noun(131,139) farewell

The attorney general appeared before the House Judiciary Committee to discuss the dismissals of U.S. attorneys.
  noun(0,20) The attorney general
  verb(21,29) appeared
  noun(37,66) the House Judiciary Committee
  verb(67,77) to discuss
  noun(78,92) the dismissals
  noun(96,110) U.S. attorneys

Nascar's most popular driver announced that his future would not include racing for Dale Earnhardt Inc.
  noun(0,6) Nascar
  verb(7,8) s
  noun(14,28) popular driver
  verb(29,38) announced
  noun(44,54) his future
  verb(55,79) would not include racing
  noun(84,102) Dale Earnhardt Inc

Purdue Pharma, its parent company, and three of its top executives today admitted to understating the risks of addiction to the painkiller.
  noun(0,13) Purdue Pharma
  noun(15,33) its parent company
  noun(39,44) three
  noun(48,72) its top executives today
  verb(73,81) admitted
  verb(85,97) understating
  noun(98,107) the risks
  noun(111,120) addiction
  noun(124,138) the painkiller

After a difficult stretch for the airline, David Neeleman will give way to David Barger, the No. 2 executive.
  noun(6,25) a difficult stretch
  noun(30,41) the airline
  noun(43,57) David Neeleman
  verb(58,67) will give
  noun(68,71) way
  noun(75,87) David Barger
  noun(89,108) the No. 2 executive

The Main Method

The main() method driving this demo is trivial; it just reads in the models, sets up the chunker, and then runs it on the remaining command-line arguments:

public static void main(String[] args) {

    // parse input params
    File hmmFile = new File(args[0]);
    int cacheSize = Integer.parseInt(args[1]);
    FastCache<String,double[]> cache = new FastCache<String,double[]>(cacheSize);

    // read HMM for pos tagging
    HiddenMarkovModel posHmm;
    try {
        posHmm
            = (HiddenMarkovModel)
            AbstractExternalizable.readObject(hmmFile);
    } catch (IOException e) {
        System.out.println("Exception reading model=" + e);
        return;
    } catch (ClassNotFoundException e) {
        System.out.println("Exception reading model=" + e);
        return;
    }

    // construct chunker
    HmmDecoder posTagger  = new HmmDecoder(posHmm,null,cache);
    TokenizerFactory tokenizerFactory = new IndoEuropeanTokenizerFactory();
    PhraseChunker chunker = new PhraseChunker(posTagger,tokenizerFactory);

    // apply chunker
    for (int i = 2; i < args.length; ++i) {
        Chunking chunking = chunker.chunk(args[i]);
        CharSequence cs = chunking.charSequence();
        System.out.println("\n" + cs);
        for (Chunk chunk : chunking.chunkSet()) {
            String type = chunk.type();
            int start = chunk.start();
            int end = chunk.end();
            CharSequence text = cs.subSequence(start,end);
            System.out.println("  " + type + "(" + start + "," + end + ") " + text);
        }
    }
}
Just Proper Nouns

Given the Brown corpus's tagging, it's possible to pull out just the proper noun chunks. To do so, take the noun starting categories to be just the proper noun category (np in the Brown corpus). This may not produce the desired result, though, considering the underlying taggings of the above examples:

After/in months/nns of/in coy/jj hints/nns ,/, Prime/jj Minister/nn Tony/np Blair/np made/vbd the/at announcement/nn today/nr as/cs part/nn of/in a/at closely/rb choreographed/vbn and/cc protracted/vbn farewell/nn ./.

The/at attorney/nn general/nn appeared/vbd before/cs the/at House/nn Judiciary/nn Committee/nn to/to discuss/vb the/at dismissals/nn of/in U/np ./. S/nrs ./. attorneys/nns ./.

Nascar/np '/' s/vbz most/ql popular/jj driver/nn announced/vbd that/cs his/pp$ future/nn would/md not/* include/vb racing/vbg for/in Dale/np Earnhardt/np Inc/np ./.

Purdue/np$ Pharma/nn ,/, its/pp$ parent/jj company/nn ,/, and/cc three/cd of/in its/pp$ top/jjs executives/nns today/nr admitted/vbd to/in understating/vbg the/at risks/nns of/in addiction/nn to/in the/at painkiller/nn ./.

After/in a/at difficult/jj stretch/nn for/in the/at airline/nn ,/, David/np Neeleman/np will/md give/vb way/nn to/in David/np Barger/np ,/, the/at No/rb ./. 2/cd executive/nn ./.

Note that Prime Minister is not considered part of the proper noun, nor is House Judiciary Committee. Only the U in U.S. is assigned a proper noun tag. Proper names, on the other hand, are usually analyzed as category np, as were Nascar and Purdue (though note that Pharma in Purdue Pharma is not considered a proper noun by the tagger).

Noun and Verb Chunks with Confidence

The n-best output of the tagger could be used to define chunks with confidence scores. Rather than running over just the first-best output, use the n-best output; and rather than returning unscored chunks, sum the conditional probabilities of the whole chunkings containing a chunk to estimate that chunk's likelihood. Keep in mind that this will be an underestimate that gets better as the n in the n-best list grows.
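As a sketch of the bookkeeping involved (the chunkings and probabilities below are made up, chosen as exact binary fractions for reproducibility), summing per chunk over the n-best analyses looks like:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: each n-best tagging yields a chunking with a
// conditional probability; a chunk's confidence is the sum of the
// probabilities of the chunkings that contain it.
public class ChunkConfidence {
    public static void main(String[] args) {
        double[] probs = { 0.5, 0.25, 0.125 };  // made-up conditional probabilities
        List<List<String>> chunkings = Arrays.asList(
            Arrays.asList("noun(0,9)", "verb(10,14)"),
            Arrays.asList("noun(0,9)"),
            Arrays.asList("noun(0,3)", "verb(10,14)"));
        Map<String,Double> confidence = new TreeMap<String,Double>();
        for (int i = 0; i < probs.length; ++i) {
            for (String chunk : chunkings.get(i)) {
                Double prev = confidence.get(chunk);
                confidence.put(chunk, (prev == null ? 0.0 : prev) + probs[i]);
            }
        }
        System.out.println(confidence);
    }
}
```

The totals are bounded above by the full conditional probability of each chunk, which is why a longer n-best list gives a better estimate.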

A more elaborate method of doing this would be to follow the approach to named-entity chunking in the HMM-based chunker implementations in the chunk package.

A final possibility would be to use the simple noun and verb chunker to create a large set of training data that could be used to train a rescoring chunker. Simply use the chunkings that are output to train a chunker.


HMMs are one of the premier techniques for both written and spoken language processing, so there is a wealth of information available on them, including applications to part-of-speech tagging. We'd recommend the two major natural language processing texts:

as well as the two standard speech recognition texts:

Appendix: Additional Corpora

The other part-of-speech training corpora of which we are aware are:

Freely Downloadable

The following data may be downloaded over the web and used for "scientific", "non-commercial", or "evaluation" purposes:

Restrictively Licensed

These range in cost from hundreds to thousands of (US) dollars: