What is Word Sense Disambiguation?

Word sense disambiguation (WSD) is the task of determining which meaning of a polysemous word is intended in a given context.

Some words, such as English "run", are highly ambiguous. The American Heritage Dictionary, 4th Edition lists 28 intransitive verb senses, 31 transitive verb senses, 30 nominal senses and 46 adjectival senses. The word "gallop" has a mere 4 nominal senses, and the word "subroutine" only 1 nominal sense.

Where Do Senses Come From?

It would be convenient if we could trust dictionaries as the arbiter of word senses. Unfortunately, language presents harder problems than that. Words are fluid, living things that change meanings through metaphor, extension, adaptation, and just plain randomness. Attempting to carve the meaning of a word into a set of discrete categories with well-defined boundaries is doomed to fail for a number of reasons.

In practice, dictionaries can be useful. They might be good enough for practical purposes even if there are tail-of-the-distribution or boundary cases they don't adequately capture.

Supervised vs. Unsupervised WSD

We will assume for the rest of this tutorial that the words we care about will have finitely many disjoint senses. If we have training data, word sense disambiguation reduces to a classification problem. Additional training data may be supplied in the form of dictionary definitions, ontologies such as Medical Subject Headings (MeSH), or lexical resources like WordNet.

If there is no training data, word sense disambiguation is a clustering problem. Hierarchical clusterings may make sense; the dictionary cited above breaks meanings of the word "run" down into senses and sub-senses.

For this demo, we will be doing supervised word sense disambiguation. That is, we will have training data consisting of examples of words in context and their meanings. We will compare several LingPipe classifiers on this task.

Senseval & SemEval

Many teams have competed in a series of word sense disambiguation bakeoffs sponsored by the Association for Computational Linguistics' lexical special interest group, SIGLEX.

Senseval 3

Senseval 3 took place in Barcelona in 2004. It is the latest bakeoff to have posted freely available (public domain) data. There is data for English, Italian, Basque, Catalan, Chinese, Romanian and Spanish, all in the same format. The data is available from the Senseval 3 web site.

Senseval English Data

We'll begin with the English data, which is distributed as two gzipped tarballs.

Unpack both tarballs into a directory called $senseval-en. A listing of directory $senseval-en should show the following 9 files, which are all we'll need.

EnglishLS.README                  EnglishLS.sensemap  EnglishLS.train
EnglishLS.dictionary.mapping.xml  EnglishLS.test      EnglishLS.train.key
EnglishLS.dictionary.xml          EnglishLS.test.key  EnglishLS.words

Senseval Scoring

The Senseval 3 distribution also contains scoring software written in C. We'll provide output in their official format, but use LingPipe's classification scorer for purposes of this demo.

Results of the Evaluation

The entire task and a report on the 27 submitted systems, with their coarse- and fine-grained scores, are conveniently available in a single paper, The Senseval-3 English lexical sample task (Mihalcea et al.).

Given that a fairly straightforward regularized regression approach with feature selection won the bakeoff, we don't understand why the authors conclude that the evaluation "shows once again that voting schemes that combine several learning algorithms outperform the accuracy of individual classifiers".

Training, Test and Answer Formats

The Senseval data looks superficially like XML, but is not, in fact, well-formed XML data.

Even more problematic is the fact that the text has been tokenized by the task organizers, with a single space inserted between tokens.


The first file we'll consider is the English "dictionary", EnglishLS.dictionary.xml.

This file contains definitions of lexical items and their senses. There's no top-level element, just a sequence of lexelt elements. Here's the first entry for the verb activate (with line breaks inserted between attributes for readability; in the file, each sense element is on its own line):

<lexelt item="activate.v">
<sense id="38201" source="ws" 
       synset="activate actuate energize start stimulate" 
       gloss="to initiate action in; make active."/>
<sense id="38202" source="ws" 
       gloss="in chemistry, to make more reactive, as by heating."/>
<sense id="38203" source="ws" 
       synset="activate assign ready" 
       gloss="to assign (a military unit) to active status."/>
<sense id="38204" source="ws" 
       gloss="in physics, to cause radioactive properties in (a substance)."/>
<sense id="38205" source="ws" 
       synset="activate aerate oxygenate" 
       gloss="to cause decomposition in (sewage) by aerating."/>
</lexelt>

The dictionary defines the words and their senses. For instance, the verb activate has five senses, numbered 38201 to 38205.

The source of the word senses is indicated by the source attribute. The value for this entry, ws, indicates the senses are drawn from WordSmyth.

Entries for English nouns have their source attribute set to wn, with word senses drawn from WordNet. Other sources were used for the other languages.

The synset attribute provides a list of synonyms for the word, drawn from the source indicated.

The gloss attribute provides a dictionary-style definition for the term.
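Because the file is not well-formed XML, a standard parser can't be pointed at it directly. One workable approach is to pull the attribute/value pairs out of each sense line with a regular expression. The helper below is a hypothetical sketch, not the demo's actual parser:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: extract the attribute/value pairs from one
// <sense .../> line of the pseudo-XML dictionary.
class SenseLineParser {
    static final Pattern ATTR = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    static Map<String,String> parse(String senseElement) {
        Map<String,String> attrs = new HashMap<String,String>();
        Matcher m = ATTR.matcher(senseElement);
        while (m.find())
            attrs.put(m.group(1), m.group(2));  // e.g. id -> 38201
        return attrs;
    }
}
```

Optional attributes such as synset simply come back as absent keys.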

Training Data

The training data is in the file EnglishLS.train.

Like the dictionary, there is no top-level element wrapping everything, as XML requires. Instead, the data for each word is enclosed in lexelt pseudo-tags. Here's the first example, with line breaks in place of some single spaces in the context element for readability:

<lexelt item="activate.v">

<instance id="activate.v.bnc.00024693" docsrc="BNC">
<answer instance="activate.v.bnc.00024693" senseid="38201"/>
<context>
Do you know what it is , and where I can get one ?  We suspect you had
seen the Terrex Autospade , which is made by Wolf Tools .  It is quite
a hefty spade , with bicycle - type handlebars and a sprung lever at
the rear , which you step on to <head>activate</head> it
. Used correctly , you should n't have to bend your back during
general digging , although it wo n't lift out the soil and put in a
barrow if you need to move it !  If gardening tends to give you
backache , remember to take plenty of rest periods during the day ,
and never try to lift more than you can easily cope with .
</context>
</instance>



Note the tokenization in strings like wo n't and the day , . In the header, note the name of the word, in this case activate.v, the identifier for the instance, 00024693, and the sense identifier for the intended sense of the word, 38201. In the text sample, within the context element, the mention of the word in question is wrapped in a head element, namely <head>activate</head>.
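The fields called out above can be recovered with simple string patterns. The helper below is a hypothetical sketch (the demo's own parsing is more involved): it pulls the head word and the senseid out of a raw instance string:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: recover the target mention and the sense label
// from the raw text of one training instance.
class InstanceFields {
    static final Pattern HEAD = Pattern.compile("<head>([^<]+)</head>");
    static final Pattern SENSE = Pattern.compile("senseid=\"([^\"]+)\"");

    // The token wrapped by the <head> element, or null if absent.
    static String headWord(String instance) {
        Matcher m = HEAD.matcher(instance);
        return m.find() ? m.group(1) : null;
    }

    // The senseid attribute of the answer element, or null if absent.
    static String senseId(String instance) {
        Matcher m = SENSE.matcher(instance);
        return m.find() ? m.group(1) : null;
    }
}
```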

Test Data

The test data is just like the training data, only without the answer element providing the sense identification.

<instance id="eat.v.bnc.00064338" docsrc="BNC">
<context>
My 10&frac12 ; - month - old Bearded Collie is obsessed with wood .
At home he has chewed the door frame and part of the back door , as
well as the kitchen cupboards , although these were done when he was a
lot younger .  However , he is still obsessed with wood and
<head>eating</head> it in the park . He picks up sticks and sits down
to eat them .  I try to get them off him but am not always successful .
</context>
</instance>

Note, in particular, that the word to disambiguate is wrapped in a head element. Furthermore, the word here is the present participle form of "eat", namely "eating". This test case also illustrates the other issue with the not-quite-XML, namely unescaped ampersands. It looks like the data was derived from something in an SGML newswire format with the frac12 entity, which would have appeared as &frac12; in the original. In the course of tokenizing the input, the Senseval organizers separated out the semicolon. Dozens of such entity escapes remain in the corpus.
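Before handing such text to a real XML parser, the stray ampersands have to be neutralized. Here is a hypothetical cleanup sketch that escapes any ampersand not already starting a standard XML entity or character reference:

```java
// Hypothetical cleanup sketch: escape ampersands that do not start a
// standard XML entity, so the pseudo-XML can be fed to a real parser.
class AmpersandFixer {
    static String fix(String text) {
        // &amp; &lt; &gt; &quot; &apos; and numeric references are left
        // alone; any other & (like the stray &frac12 above) is escaped.
        return text.replaceAll("&(?!(amp|lt|gt|quot|apos|#\\d+|#x[0-9a-fA-F]+);)",
                               "&amp;");
    }
}
```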

Answer Key Format

The answer key is a simple line-oriented format, with each line containing a lexical element (e.g. activate.v), an instance identifier (e.g. activate.v.bnc.00008457), and a list of one or more sense identifiers (e.g. 38201 38203):

activate.v activate.v.bnc.00008457 38201
activate.v activate.v.bnc.00061340 38201
activate.v activate.v.bnc.00066993 38201 U
activate.v activate.v.bnc.00067831 38201
activate.v activate.v.bnc.00134704 38201 38203
activate.v activate.v.bnc.00146646 38201
disc.n disc.n.bnc.00184283 disc%1:06:00:: disc%1:25:00::
disc.n disc.n.bnc.00184284 disc%1:06:00:: disc%1:25:00::
disc.n disc.n.bnc.00213452 disc%1:06:03::

Note that the format of sense identifiers differs for nouns and verbs. When there is more than one sense identifier on a line, either answer is treated as equally acceptable. We're guessing that the U appearing in the third line above stands for "unknown"; it is not mentioned in the scoring documentation.
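Reading a key line is then just a matter of splitting on whitespace and setting the U marker aside. A hypothetical sketch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of parsing one line of the answer key.
class KeyLine {
    final String lexelt;
    final String instanceId;
    final List<String> senseIds = new ArrayList<String>();

    KeyLine(String line) {
        String[] fields = line.trim().split("\\s+");
        lexelt = fields[0];
        instanceId = fields[1];
        for (int i = 2; i < fields.length; ++i)
            if (!fields[i].equals("U"))  // skip the unexplained "U" marker
                senseIds.add(fields[i]);
    }
}
```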

System Answer Format

For the official evaluation, there is a description of the official format for answers. System output for each sample must be provided on a single line, as in:

brother.n 00001 501566
brother.n 00002 501566     999997     !!
brother.n 00006 501566/0.5 501573/0.4 503751/0.1
brother.n 00015 503751/94  999999/87             !! comment . . .

The elements of the example correspond to the elements of the samples above: brother.n is the lexical element, 00001 is an instance ID, and 501566, 999997, and 501573 are all instances of sense tags.

What's interesting about this format is that systems may provide weighted answers. The second line indicates that the system narrowed the choice to one of two sense tags, 501566 or 999997, whereas in the third line, three tags are given with real-valued weights. The fourth line illustrates that the weights need not be scaled. If there are no weights, a uniform distribution is assumed, with each answer equally likely. In all cases, the weights are scaled to probabilities (so that they sum to 1.0).
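That convention can be sketched directly from the sense fields of an answer line. The helper below is hypothetical (the official scorer is the C program described below); a bare tag with no /weight counts as weight 1, so unweighted answers come out uniform:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: read the sense fields of a system answer line and
// scale the weights to probabilities.
class AnswerWeights {
    static Map<String,Double> probabilities(String senseFields) {
        Map<String,Double> weights = new LinkedHashMap<String,Double>();
        double sum = 0.0;
        for (String field : senseFields.trim().split("\\s+")) {
            int slash = field.indexOf('/');
            String sense = slash < 0 ? field : field.substring(0, slash);
            double w = slash < 0 ? 1.0 : Double.parseDouble(field.substring(slash + 1));
            weights.put(sense, w);
            sum += w;
        }
        for (Map.Entry<String,Double> e : weights.entrySet())
            e.setValue(e.getValue() / sum);  // scale to probabilities
        return weights;
    }
}
```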


Although the Senseval scoring document is very confusing, Dan Melamed and Phil Resnik's scoring proposal is very clear. The score for a given test case is simply the sum of weights provided by the system response for each sense that is among the valid answers.
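Under that proposal, the credit for a single test case can be computed straight from the system's sense fields and the key's sense list. A hypothetical sketch (not the official scorer):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of Melamed and Resnik's rule: a response such as
// "38201/0.6 38202/0.4" earns the total normalized weight it assigns to
// senses that appear in the key, e.g. "38201 38203".
class FineGrainedScore {
    static double score(String systemFields, String keyFields) {
        Set<String> keySenses =
            new HashSet<String>(Arrays.asList(keyFields.trim().split("\\s+")));
        double sum = 0.0;
        double correct = 0.0;
        for (String field : systemFields.trim().split("\\s+")) {
            int slash = field.indexOf('/');
            String sense = slash < 0 ? field : field.substring(0, slash);
            double w = slash < 0 ? 1.0 : Double.parseDouble(field.substring(slash + 1));
            sum += w;
            if (keySenses.contains(sense))
                correct += w;
        }
        return correct / sum;  // weights are scaled to probabilities
    }
}
```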

There is an interesting discussion of scoring, including Senseval's, in a paper by Klein and Manning.

Klein and Manning contrast Senseval's sum of scores approach to the more usual product of scores approach. We'll present both evaluations here.

Demo Walkthrough

For the purposes of this demo, we approach word sense disambiguation as a simple text classification problem. We will use the full text in the context around the training examples to train a text classifier. We will then simply run that classifier on the new examples. Almost all of the work in the code is in I/O; there are only a handful of lines that actually involve classification.

The Source Code

The tutorial assumes you are running inside the LingPipe distribution at $lingpipe/lingpipe/demos/tutorial/wordSense/. The code is in a single file, src/Senseval3.java, which contains multiple class definitions.

The build file we use for running the tutorial is build.xml.

Everything's run statically in the code from the main() method.

Running the Demo

The Ant target run will perform a complete Senseval 3 run. It's necessary to provide a pointer to $senseval-en, which is the location of the unpacked Senseval 3 data. In our case, we put it in e:\data\senseval3\unpacked\english, so we use the following command, with an explicit property specification -Dsenseval.dir=...:

cd $demo-dir
ant -Dsenseval.dir=e:\data\senseval3\unpacked\english run

Running it out of the box produces the following (elided) output:

Building jar: C:\carp\mycvs\lingpipe\demos\tutorial\wordSense\wordSense.jar

Dictionary File=E:\data\senseval3\unpacked\english\EnglishLS.dictionary.xml
Training File=E:\data\senseval3\unpacked\english\EnglishLS.train
Testing File=E:\data\senseval3\unpacked\english\EnglishLS.test
Answer Key File=E:\data\senseval3\unpacked\english\EnglishLS.test.key
System Response File=C:\carp\mycvs\lingpipe\demos\tutorial\wordSense\systemAnswer.txt

Reading Dictionary.

Reading Training Data.
     #training words=57

Reading Test Data.
     #test cases=3944

Training Model.
    training different.a [5 senses]
    training rule.v [3 senses]
    training expect.v [3 senses]
    finished training.

Running Model over Test Data.
     finished test data.


The result is a file systemAnswer.txt that looks as follows (with most of the content elided):

activate.v activate.v.bnc.00008457 38201/1000
activate.v activate.v.bnc.00061340 38201/1000
activate.v activate.v.bnc.00061975 38201/1000
appear.v appear.v.bnc.00041843 190902/635 190901/365
write.v write.v.bnc.00007127 4753408/1000
write.v write.v.bnc.00007179 4753408/1000
write.v write.v.bnc.00007381 4753407/1000

We have illustrated one example (00041843) for which we provide multiple answers. Not all classifiers are set up to do this, as we discuss shortly.

Evaluating the Results

Retrieving the Evaluation Code

Rather than writing our own evaluation code, we use the C code supplied by the organizers. The code's a single C source file, which is easy to compile; it's distributed from the Senseval site as a gzipped tarball.

After downloading, just unpack the gzipped tarball into a convenient location, which we'll call $senseval-score-dir.

Compiling the Evaluation Code

Luckily, it's just a single C file, so it's easy to compile:

> cd $senseval-score-dir
> gcc -o scorer2 scorer2.c

There is no standard output from the compile, but it produces an executable for Windows, scorer2.exe, or scorer2 for Linux/Unix, that sits in the same directory as the source scorer2.c.

Running the Evaluator

Running the evaluator is straightforward. To get so-called "fine-grained" scores, all you need is:

> $senseval-score-dir/scorer2.exe systemAnswer.txt $senseval-en/EnglishLS.test.key

Or, on a Linux/Unix machine:

> $senseval-score-dir/scorer2 systemAnswer.txt $senseval-en/EnglishLS.test.key

The fine-grained scores treat every sense as distinct. Here's what it looks like on my machine:

> e:\data\senseval3\unpacked\scoring\scorer2.exe systemAnswer.txt e:\data\senseval3\unpacked\english\EnglishLS.test.key

Fine-grained score for "systemAnswer.txt" using key "e:\data\senseval3\unpacked\english\EnglishLS.test.key":
 precision: 0.660 (2601.46 correct of 3944.00 attempted)
 recall: 0.660 (2601.46 correct of 3944.00 in total)
 attempted: 100.00 % (3944.00 attempted of 3944.00 in total)

Note that if you ran the default classifier, this is the result for a character language model-based classifier with default parameters (n-gram length 5, 2^16 characters, interpolation parameter 5.0). This is our recommended out-of-the-box configuration for classification.

It is also possible to derive coarse-grained scores in which senses are conflated into a smaller set of senses. For this, the conflation file is needed as input, as well as a command-line parameter:

> $senseval-score-dir/scorer2.exe systemAnswer.txt $senseval-en/EnglishLS.test.key $senseval-en/EnglishLS.sensemap -g coarse

In our case, here's what it looks like:

> e:\data\senseval3\unpacked\scoring\scorer2.exe systemAnswer.txt e:\data\senseval3\unpacked\english\EnglishLS.test.key e:\data\senseval3\unpacked\english\EnglishLS.sensemap -g coarse

Coarse-grained score for "systemAnswer.txt" using key "e:\data\senseval3\unpacked\english\EnglishLS.test.key":
 precision: 0.708 (2792.18 correct of 3944.00 attempted)
 recall: 0.708 (2792.18 correct of 3944.00 in total)
 attempted: 100.00 % (3944.00 attempted of 3944.00 in total)

Our coarse-grained score puts us further down the pack of scores. In fact, looking at the results table, the coarse-grained ordering is rather different from the fine-grained one.

How'd we Do?

An F-measure score of 0.660 would've put us dead in the middle of the pack for the actual evaluation. Looking at The Senseval-3 English lexical sample task paper, which includes the results (pages 4 and 5), the best fine-grained score was 0.729. There would've been 21 submissions ahead of ours, one that tied with ours, and 24 teams behind ours. With roughly 2500 test cases and 0.7 performance, the standard deviation approximated by a binomial is:

sqrt(0.7 * (1 - 0.7) / 2500) = 0.0092

That's a 95% confidence interval half-width of roughly 0.018. Not considering the multiple-comparison aspects of the test, roughly the top 13 systems are not "significantly" pairwise distinguishable at 95% confidence. Unfortunately, the best systems are significantly better than our off-the-shelf classifiers.
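The arithmetic above is easy to check: under the normal approximation to the binomial, the standard error of accuracy p over n cases is sqrt(p(1-p)/n), and the 95% interval extends 1.96 standard errors to either side.

```java
// Normal approximation to the binomial: standard error of an accuracy
// estimate p over n test cases, and the 95% interval half-width.
class BinomialError {
    static double stdError(double p, int n) {
        return Math.sqrt(p * (1.0 - p) / n);
    }

    static double halfWidth95(double p, int n) {
        return 1.96 * stdError(p, n);
    }
}
```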

How Hard is WSD?

The most-frequent-sense baseline here is quite high, at 55.2%. System scores in the 60-70% range are thus not doing much better than chance. We'd need a better evaluation setup to compute the κ statistic, but it would certainly land in the unimpressive range.

Inter-annotator agreement on the tagging task is only 67%, which is about how well our default system performed (66%). It's not clear how the data was adjudicated to create the final training data given the low inter-annotator agreement (Mihalcea et al., section 2.4). Their computation of kappa is non-standard in that they set the expectation to be a uniform distribution rather than the data distribution. (See Carletta's CL squib or Passonneau's LREC paper on kappa variants).
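The distinction matters because kappa corrects observed agreement po for chance agreement pe via kappa = (po - pe) / (1 - pe); the variants differ only in how pe is computed. The numbers below are purely illustrative, not figures from the task:

```java
// Kappa corrects observed agreement po for chance agreement pe.
class Kappa {
    static double kappa(double po, double pe) {
        return (po - pe) / (1.0 - pe);
    }

    // The non-standard variant: uniform expectation over K senses.
    static double uniformPe(int numCategories) {
        return 1.0 / numCategories;
    }

    // The usual variant: chance agreement under the category distribution.
    static double empiricalPe(double[] categoryProbs) {
        double pe = 0.0;
        for (double p : categoryProbs)
            pe += p * p;
        return pe;
    }
}
```

A skewed sense distribution raises pe, so the empirical variant yields a lower (more honest) kappa than the uniform one.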

Command-line Parameters

There are five command-line arguments, each with its own property in the ant build file build.xml. We summarize these in the following table:

Command-line Arguments
Arg  Property               Default Value                              Description
-    senseval.dir           ${data.dir}/senseval3/unpacked/english     Directory containing all the data ($senseval-en)
0    train.dictionary.file  ${senseval.dir}/EnglishLS.dictionary.xml   Dictionary of word senses, synonyms and definitions.
1    train.file             ${senseval.dir}/EnglishLS.train            Training texts with words in context.
2    test.file              ${senseval.dir}/EnglishLS.test             File of test cases with words in context.
3    answer.file            systemAnswer.txt                           Output file produced by system run.
4    classifier.id          0                                          Number indicating which classifier to use.

Code Walkthrough

The code for the entire Senseval entry is in one file, src/Senseval3.java. We have done our best to separate the data parsing component of the exercise from the actual classification of senses.

Main Method

Everything runs from a single main() method.

public static void main(String[] args) 
    throws ClassNotFoundException, IOException {

    File dictFile = new File(args[0]);
    File trainFile = new File(args[1]);
    File testFile = new File(args[2]);
    File responseFile = new File(args[3]);
    sClassifierNumber = Integer.parseInt(args[4]);

    SenseEvalDict dict = new SenseEvalDict(dictFile);
    TrainingData trainingData = new TrainingData(trainFile);
    TestData testData = new TestData(testFile);

    SenseEvalModel model = new SenseEvalModel(dict,trainingData);


The first few lines merely read in command-line parameters. The next lines read in all of the data for the evaluation. First, the dictionary is read from the dictionary file, then the training data is read from the training file. Finally, the test data is read from the test file. The formats of these files were discussed in the previous sections.

Senseval Data Structures

Rather than get into the ugly details of parsing these data formats, we instead present the static classes that are constructed from the data. These are designed to be easy to use as the basis of further exploration into word-sense disambiguation without mixing up the data format details and the classifier and training details.

Dictionary Format

The dictionary is simply a mapping from strings consisting of words plus categories (e.g. activate.v) to an array of senses.

static class SenseEvalDict extends HashMap<String,Sense[]>

The Sense class contains all of the information in the Senseval dictionary:

static class Sense {
    String mId;
    String mSource;
    String mSynset;
    String mGloss;

Training Data Format

Like the dictionary, the text training data is represented as a map from words plus categories (e.g. activate.v) to the training data for that word.

    static class TrainingData extends HashMap<String,Map<String,List<String>>>

The training data for a word plus category is of type Map<String,List<String>>. This mapping is from a sense ID (e.g. 38202 or argument%1:10:02::, the format varies by source) to the list of textual contexts for that sense.

Test Data Format

The test data class consists of three parallel lists:

static class TestData {
    List<String> mWordsPlusCats = new ArrayList<String>();
    List<String> mInstanceIds = new ArrayList<String>();
    List<String> mTextsToClassify = new ArrayList<String>();

Each list has a length that is equal to the number of test cases. The first list contains the words plus categories of the test cases, the second contains the instance ID for the test case (e.g. difference.n.bnc.00002450), and the third the actual textual contexts containing the instance of the word to be classified.


We use the dictionary and training data to build a complete Senseval model using the model's constructor:

    static class SenseEvalModel extends HashMap<String,Classifier> {

        SenseEvalModel(SenseEvalDict dict, TrainingData trainingData)
            throws ClassNotFoundException, IOException  {

            for (String wordPlusCat : trainingData.keySet()) {
                Map<String,List<String>> senseToTextList = trainingData.get(wordPlusCat);
                String[] senseIds = senseToTextList.keySet().toArray(new String[0]);

                ObjectHandler<Classified<CharSequence>> trainer
                    = createClassifierTrainer(senseIds);

                for (String senseId : senseToTextList.keySet()) {
                    Classification classificationForSenseId = new Classification(senseId);
                    List<String> trainingTextList = senseToTextList.get(senseId);
                    for (String trainingText : trainingTextList) {
                        Classified<CharSequence> classified
                            = new Classified<CharSequence>(trainingText,classificationForSenseId);
                        trainer.handle(classified);
                    }
                }

                BaseClassifier<CharSequence> classifier
                    = (BaseClassifier<CharSequence>)
                        AbstractExternalizable.compile((Compilable) trainer);
                put(wordPlusCat,classifier);
            }
        }
    }

Although this code looks dense, most of it is simply negotiating the data structures previously introduced. The outer loop walks over all of the words plus categories in the training data, then extracts the sense-to-text mapping and the array of sense IDs for each. These sense IDs are used as the categories in constructing a trainable classifier. The trainer is typed as a handler whose classified elements are character sequences (CharSequence) with simple first-best classification results (Classification); this just means it can take classified character sequences as training data. We come back to how the classifiers are constructed in the next section.

Next, for each sense of the word, we create a classification corresponding to that sense and use it to train the classifier trainer with each training text.

Finally, the classifier is compiled and added to the mapping (using the superclass's put() method).

Constructing the Classifier

The actual classifier construction is broken off into a separate method because there are so many options. It's basically a large integer switch statement whose case bodies construct and return classifiers.

Character LM Classifiers

The first cases are for character language model-based classifiers. As the texts are long and their beginnings and ends are not special, we have used standard process LM classifiers rather than boundary models.

static ObjectHandler<Classified<CharSequence>> createClassifierTrainer(String[] senseIds) {

    switch (sClassifierNumber) {

    case 0:
        return DynamicLMClassifier.createNGramProcess(senseIds,5);

The first case is our default recommended classification model for English, a 5-gram character language model classifier with default parameters.

The second case is a configurable version of the process n-gram language model classifiers:


    LanguageModel.Dynamic[] lms5 = new LanguageModel.Dynamic[senseIds.length];
    for (int i = 0; i < lms5.length; ++i)
        lms5[i] = new NGramProcessLM(6,     // n-gram
                                     128,   // num chars
                                     1.0);  // interpolation ratio
    return new DynamicLMClassifier<LanguageModel.Dynamic>(senseIds,lms5);

This allows the n-gram process language models to be configured for n-gram length (6 here), number of characters (set to 128 for ASCII input), and the interpolation ratio (set to 1.0, which yields less smoothing than the default).

Token LM and Naive Bayes Classifiers

If the features are just token counts, naive Bayes classification is just a token unigram model. We have a special class for this case with its own constructor:


    return new NaiveBayesClassifier(senseIds,SPACE_TOKENIZER_FACTORY);

In order to construct a token-based classifier, we need a tokenizer. Here we've used one we'll explain in the next section that simply breaks on spaces. This is fine for this context as the organizers already decided on token boundaries for us and inserted spaces between them.

Further note that LingPipe's naive Bayes classifier is configured to use boundary character n-gram smoothing for the tokens.

The default token unigram and bigram models are too aggressive with unseen words. Ideally, they should be trained with some kind of explicit add-one (Laplace prior) smoothing. Here's the default version:


    return DynamicLMClassifier.createTokenized(senseIds,
                                               SPACE_TOKENIZER_FACTORY,
                                               1);  // token unigrams


    return DynamicLMClassifier.createTokenized(senseIds,
                                               SPACE_TOKENIZER_FACTORY,
                                               2);  // token bigrams

The final tokenized language model classifier is fully configurable. We set it up to use the default smoothing character language models: bounded n-gram character language models with the default n-gram length (5), the appropriate number of characters for ASCII (128), the default interpolation parameter (5.0, the length of the n-gram), and the default boundary character, the non-code-point character '\uFFFF'.


    LanguageModel.Dynamic[] lms2 = new LanguageModel.Dynamic[senseIds.length];
    for (int i = 0; i < lms2.length; ++i)
        lms2[i] = new TokenizedLM(SPACE_TOKENIZER_FACTORY,
                                  2, // n-gram length
                                  new NGramBoundaryLM(5,128,5.0,'\uFFFF'),
                                  new NGramBoundaryLM(5,128,5.0,'\uFFFF'),
                                  1.0); // interpolation param
    return new DynamicLMClassifier<LanguageModel.Dynamic>(senseIds,lms2);

Note that a tokenizer factory is required. We have included a special tokenizer factory with this class, which breaks on spaces. We return to tokenization in the next section.

TF/IDF Classifier

The term frequency (TF) and inverse document frequency (IDF) classifier requires a feature extractor in its construction, which we build from a tokenizer factory.


    FeatureExtractor<CharSequence> featureExtractor5
        = new TokenFeatureExtractor(NORM_TOKENIZER_FACTORY);
    return new TfIdfClassifierTrainer<CharSequence>(featureExtractor5);

Note that this class uses the normalized tokenizer factory. This factory lower cases, stems, and stoplists tokens, as we describe in the next section.

Nearest Neighbors Classifier

Finally, we consider the K-nearest-neighbors (KNN) classifier. This requires several parameters. First, we require a feature extractor to convert input character sequences into feature maps of token counts. We use a normalizing tokenizer factory here, which we describe more fully in the next section. The other explicit parameter is the number K of neighbors, here set to 20. For a KNN classifier, the 20 closest training examples to an input are consulted and their results averaged.

    FeatureExtractor<CharSequence> featureExtractor6
        = new TokenFeatureExtractor(NORM_TOKENIZER_FACTORY);
    return new KnnClassifier<CharSequence>(featureExtractor6, 
                                           20);  // num neighbors to average

KNN classification will have trouble with this data set because for some senses, the number of training instances is so small that it can never dominate with 20 neighbors. The problem with reducing the number of neighbors is that it greatly increases the variance of the classifier.
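The vote-counting arithmetic behind that observation is simple, assuming unweighted majority voting (a simplification; an actual KNN implementation may weight its neighbors):

```java
// With k neighbors and unweighted voting, a sense with only m training
// instances can contribute at most min(m, k) votes, so it can never win
// a majority once m <= k / 2.
class KnnVoteLimit {
    static int maxVotes(int m, int k) {
        return Math.min(m, k);
    }

    static boolean canWinMajority(int m, int k) {
        return maxVotes(m, k) > k / 2;
    }
}
```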

In general, KNN classifiers require a distance metric to find the nearest neighbors. The default is Euclidean distance, as used in the previous classifier. We may also consider other distances, such as cosine distance, which is popular for bag-of-words natural language classification:

    FeatureExtractor<CharSequence> featureExtractor6
        = new TokenFeatureExtractor(NORM_TOKENIZER_FACTORY);
    return new KnnClassifier<CharSequence>(featureExtractor6, 
                                           new CosineDistance(),
                                           20);  // num neighbors to average


For the token-sensitive models, we investigate two tokenizers, each with its own factory.

Space-based Tokenization

We can tokenize on spaces using a simple regular-expression-based tokenizer factory, which is completely thread safe and serializable.

static final TokenizerFactory SPACE_TOKENIZER_FACTORY
    = new RegExTokenizerFactory("\\S+");

Tokens are defined as maximally long sequences of non-whitespace characters (\S+, with appropriately escaped backslash).
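What the factory produces can be mimicked in plain Java; this sketch is for illustration and is not LingPipe code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of \S+ tokenization: each maximal run of non-whitespace
// characters becomes one token.
class SpaceTokens {
    static final Pattern TOKEN = Pattern.compile("\\S+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(text);
        while (m.find())
            tokens.add(m.group());
        return tokens;
    }
}
```

Note how the corpus's pre-tokenized spacing keeps units like n't as single tokens.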

Also note that we have made the tokenizer factory serializable so that it may be compiled along with the models.

Normalizing Tokenization

Normalizing tokenization applies several filter tokenizers to the result of space tokenization:

static final TokenizerFactory NORM_TOKENIZER_FACTORY
    = normTokenizerFactory();

static TokenizerFactory normTokenizerFactory() {
    TokenizerFactory factory = SPACE_TOKENIZER_FACTORY;
    factory = new LowerCaseTokenizerFactory(factory);
    // factory = new EnglishStopTokenizerFactory(factory);
    // factory = new PorterStemmerTokenizerFactory(factory);
    return factory;
}

This takes the result of the space-based tokenizer and lower cases all the tokens; optionally, it removes English stop words and applies the Porter stemmer to the output. We've commented out the stoplisting and the stemming, but they can be added back in to see the difference.

Character N-gram Tokenization

LingPipe implements a tokenizer based on taking n-grams of input characters. These are specified by a range of sizes, such as the following:

static final TokenizerFactory NGRAM_TOKENIZER_FACTORY
    = new NGramTokenizerFactory(3,4);

As specified, this tokenizer returns all length 3 and length 4 substrings of the input being tokenized.
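A plain-Java sketch of the same behavior (not LingPipe's implementation): every substring whose length falls within the configured range becomes a token.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of character n-gram tokenization over a range of sizes.
class CharNGrams {
    static List<String> tokenize(String text, int minGram, int maxGram) {
        List<String> tokens = new ArrayList<String>();
        for (int n = minGram; n <= maxGram; ++n)
            for (int i = 0; i + n <= text.length(); ++i)
                tokens.add(text.substring(i, i + n));
        return tokens;
    }
}
```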

Running the Test Cases

The code for running the test cases is as simple as usual. We just feed the text to a classifier to get a classification result.

static void respond(SenseEvalModel model, TestData testData, File file)
    throws IOException {

    // ... output writer setup elided ...
    for (int i = 0; i < testData.mWordsPlusCats.size(); ++i) {
        String wordPlusCat = testData.mWordsPlusCats.get(i);
        Classifier classifier = model.get(wordPlusCat);

        String instanceId = testData.mInstanceIds.get(i);

        String textToClassify = testData.mTextsToClassify.get(i);

        Classification classification = classifier.classify(textToClassify);

        if (classification instanceof ConditionalClassification) {
            ConditionalClassification condClassification
                = (ConditionalClassification) classification;
            for (int rank = 0; rank < condClassification.size(); ++rank) {
                int conditionalProb = (int) Math.round(1000.0
                    * condClassification.conditionalProbability(rank));
                if (rank > 0 && conditionalProb < 1) break;
                String category = condClassification.category(rank);
                // ... writing category and conditionalProb elided ...
            }
        } else {
            String category = classification.bestCategory();
            // ... writing category elided ...
        }
    }
}
The code is slightly more complex than usual, as we are given two data structures: the classifiers in the form of a SenseEvalModel, and the test data in the form of an instance of TestData. The file is where the output is written, but we have elided the formatting code here (ellipses [...] in text).

The code iterates over the test instances, which are indexed by the word plus its category for the task. For each instance, we retrieve the appropriate classifier from the model, extract the text to classify from the test data, and run the classifier on that text.

N-best Output

Finally, if the result is a conditional classification, we return all n-best results (quantized to a 0-1000 scale). This is allowed by Senseval scoring, though given their arithmetic mean scoring (probabilistic log loss scoring involves the log of the geometric average), it doesn't help much. (Without this, the score of our default model is 0.659 rather than 0.660). Given that most results of the conditional classifiers are highly skewed due to lack of intra-text dependency modeling, we don't have highly accurate confidence estimates.
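The quantization in the response loop is just rounding to a 0-1000 integer scale, with lower-ranked senses dropped once they round to zero. A self-contained sketch, using invented conditional probabilities for illustration:

```java
public class Quantize {

    // Quantize a conditional probability to Senseval's 0-1000 integer scale.
    static int quantize(double conditionalProb) {
        return (int) Math.round(1000.0 * conditionalProb);
    }

    public static void main(String[] args) {
        // hypothetical n-best conditional probabilities, sorted by rank
        double[] condProbs = { 0.8132, 0.1859, 0.0004 };
        for (int rank = 0; rank < condProbs.length; ++rank) {
            int q = quantize(condProbs[rank]);
            if (rank > 0 && q < 1) break;  // drop senses that round to zero
            System.out.println("rank " + rank + " score " + q);
        }
        // prints:
        // rank 0 score 813
        // rank 1 score 186
    }
}
```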

Confidence Thresholded Output

The Senseval evaluation allows a system to not provide an answer. No answers count against recall performance, but not against precision. Several of the classifiers provide scores that can be used for such thresholding. For instance, we can require at least n votes out of m in our KNN implementations, or we could require cross-entropy above a given threshold in our language model classifiers.
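A confidence-thresholded responder might look like the following sketch. This is not part of the demo code; the sense ids and the threshold value are invented for illustration:

```java
public class ThresholdedSense {

    // Return the best sense only if its conditional probability clears
    // the threshold; otherwise abstain (null), trading recall for precision.
    static String bestSenseOrAbstain(String[] senses, double[] condProbs,
                                     double threshold) {
        int best = 0;
        for (int i = 1; i < condProbs.length; ++i)
            if (condProbs[i] > condProbs[best])
                best = i;
        return condProbs[best] >= threshold ? senses[best] : null;
    }

    public static void main(String[] args) {
        String[] senses = { "run.v.1", "run.v.2" };  // hypothetical sense ids
        System.out.println(bestSenseOrAbstain(senses,
            new double[] { 0.55, 0.45 }, 0.7));      // prints null (abstain)
        System.out.println(bestSenseOrAbstain(senses,
            new double[] { 0.91, 0.09 }, 0.7));      // prints run.v.1
    }
}
```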


Here are the resulting scores from the scorer for the various models. Because they all try every example, precision and recall are equal, and we report them as accuracy. Note that these are the fine-grained sense evaluations. For some classifiers, we've reported results for multiple parameterizations without defining new cases (that is, we just edited the source and re-ran).

WSD Results
Accuracy (fine) | Time | ID | Classifier
0.660 |  45s | 0  | default character n-gram (5-gram, 16K chars, interp=5.0)
0.660 |  82s | 1  | character process 6-gram (128 chars, interp=1.0)
0.629 |   9s | 2  | Naive Bayes (space tokenizer)
0.638 |  12s | 2b | Naive Bayes (norm tokenizer)
0.629 |   9s | 3  | default token unigram (space tokenizer)
0.638 |  11s | 3b | default token unigram (norm tokenizer)
0.631 |  11s | 4  | default token bigram (space tokenizer)
0.638 |  13s | 4b | default token bigram (norm tokenizer)
0.653 |  24s | 5  | configurable token n-gram (norm tokenizer, bigram, n-gram boundary lm 4/128/1.0, 0.25 interp)
0.654 |  26s | 5b | configurable token n-gram (norm tokenizer, trigram, n-gram boundary lm 4/128/1.0, 0.1 interp)
0.650 |   5s | 6  | TF/IDF classifier (space tokenizer)
0.651 |   8s | 6b | TF/IDF classifier (norm tokenizer)
0.630 |  11s | 6c | TF/IDF classifier (character 3-gram tokenizer)
0.660 |  22s | 6d | TF/IDF classifier (character 4-gram tokenizer)
0.666 |  36s | 6e | TF/IDF classifier (character 5-gram tokenizer)
0.662 |  54s | 6f | TF/IDF classifier (character 6-gram tokenizer)
0.668 | 134s | 6g | TF/IDF classifier (character 4-, 5-, and 6-gram tokenizer)
0.587 |   8s | 7  | KNN classifier (space tokenizer, Euclidean distance, k=20)
0.565 |  10s | 7b | KNN classifier (norm tokenizer, Euclidean distance, k=20)
0.521 |  10s | 7c | KNN classifier (space tokenizer, Euclidean distance, k=1)
0.504 |  10s | 7d | KNN classifier (space tokenizer, Euclidean distance, k=2)
0.553 |  10s | 7e | KNN classifier (space tokenizer, Euclidean distance, k=4)
0.580 |  10s | 7f | KNN classifier (space tokenizer, Euclidean distance, k=8)
0.581 |  10s | 7g | KNN classifier (space tokenizer, Euclidean distance, k=16)
0.585 |  10s | 7h | KNN classifier (space tokenizer, Euclidean distance, k=32)
0.569 |  10s | 7i | KNN classifier (space tokenizer, Euclidean distance, k=64)
0.587 |   8s | 8  | KNN classifier (space tokenizer, cosine distance, k=20)
0.581 |   9s | 8b | KNN classifier (norm tokenizer, cosine distance, k=20)
0.598 |  41s | 8c | KNN classifier (character 5-gram tokenizer, cosine distance, k=20)
0.611 |  41s | 8d | KNN classifier (character 5-gram tokenizer, cosine distance, k=10)
0.618 |  43s | 8e | KNN classifier (character 5-gram tokenizer, cosine distance, k=10, distance weighted)

Overall, the best accuracy (66.8%) is achieved with a TF/IDF classifier over character 4,5,6-grams. Unfortunately, this model also required 2 gigabytes of memory (on a 64-bit Java) to hold the entire set of training models in memory. The more modest time and memory requirements of the simple 5-gram character tokenizer make it more attractive. The TF/IDF classifier with the simple space tokenizer is the fastest, and is only 1.8% less accurate than the best classifier while being more than 20 times as fast and using much less memory.

In other text classification experiments, we've found the character language model classifiers to work better than any choice of token n-gram for TF/IDF.

We have also reported rough timings as measured by Ant. The test machine is a dual AMD Opteron 1.8GHz (512MB RAM, PC2700 ECC memory, JDK 1.6 in -server mode, Windows XP x64 SP2).

Improving WSD Performance

The systems in the Senseval 3 bakeoff that performed better than our approach imported much richer feature sets. They used everything from part-of-speech taggers to full syntactic parsers, dictionaries and thesauri (such as WordNet), and large corpora over which they trained using semi-supervised methods. On top of this, there was a lot of work on feature selection and weighting.

The better-scoring systems also tended to use discriminative classification methods, such as SVMs, perceptrons or various forms of linear regression.

Unsupervised Word Sense Disambiguation

Unsupervised WSD is essentially a clustering problem. From the performance of KNN classifiers, it'd seem that a character 5-gram tokenizer and cosine distance are the best bets for basic textual features.
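As a concrete starting point for such features, here is a plain-Java sketch (not the LingPipe implementation) of cosine distance over character 5-gram count vectors:

```java
import java.util.HashMap;
import java.util.Map;

public class CosineDistance {

    // Character n-gram count vector for a text.
    static Map<String,Integer> nGramCounts(String text, int n) {
        Map<String,Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); ++i)
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        return counts;
    }

    // One minus cosine similarity: 0 for identical vectors, 1 for
    // vectors sharing no n-grams.
    static double cosineDistance(Map<String,Integer> a, Map<String,Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String,Integer> e : a.entrySet()) {
            Integer bCount = b.get(e.getKey());
            if (bCount != null)
                dot += e.getValue() * (double) bCount;
            normA += e.getValue() * (double) e.getValue();
        }
        for (int v : b.values())
            normB += v * (double) v;
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String,Integer> x = nGramCounts("run a bank", 5);
        Map<String,Integer> y = nGramCounts("run a marathon", 5);
        System.out.printf("%.3f%n", cosineDistance(x, y));
    }
}
```

Clustering over such distances then groups occurrences of a word whose contexts share many character n-grams, which is the intuition behind the 8-series KNN results above.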

Then it is simply a matter of taking the input texts and running them through clustering. Rather than doing that here, we refer you to the clustering tutorial, where we work through an example of proper name clustering:


Further Reading

The best place to look is the proceedings of the Senseval workshops:

Here's the overview paper for Senseval 3 and a link to the table of contents (which has links to the PDFs).

Both of the textbooks in the field have whole chapters devoted to word sense disambiguation:

Here's a link to a PowerPoint presentation of a tutorial on WSD: