What is Word Sense Disambiguation?
Word sense disambiguation (WSD) is the task of determining which meaning of a polysemous word is intended in a given context.
Some words, such as English "run", are highly ambiguous. For "run", the American Heritage Dictionary, 4th Edition, lists 28 intransitive verb senses, 31 transitive verb senses, 30 nominal senses and 46 adjectival senses. By contrast, "gallop" has a mere 4 nominal senses, and "subroutine" only 1 nominal sense.
Where Do Senses Come From?
It would be convenient if we could trust dictionaries as the arbiter of word senses. Unfortunately, language presents harder problems than that. Words are fluid, living things that change meanings through metaphor, extension, adaptation, and just plain randomness. Attempting to carve the meaning of a word into a set of discrete categories with well-defined boundaries is doomed to fail for a number of reasons.
- Words do not have well-defined boundaries between their senses. Dictionary definitions attempt to carve out a discrete set of meanings with definitions and examples, which are themselves vague. Luckily, humans deal with vagueness in language quite well, so this is not much of a problem for humans using dictionaries.
- A related problem with dictionaries is that they don't agree. A quick glance at more than one dictionary (look up "run" in a couple of them, for example) will show that disagreement is not only possible, it's the norm. There is often overlap of meanings, with subtle distinctions at the boundaries that are, in practice, vague.
- Another problem with dictionaries is that they are incomplete. Today's newspaper or e-mail is likely to contain words or word senses that are not present in today's dictionary.
In practice, dictionaries can be useful. They might be good enough for practical purposes even if there are tail-of-the-distribution or boundary cases they don't adequately capture.
Supervised vs. Unsupervised WSD
We will assume for the rest of this tutorial that the words we care about will have finitely many disjoint senses. If we have training data, word sense disambiguation reduces to a classification problem. Additional training data may be supplied in the form of dictionary definitions, ontologies such as Medical Subject Headings (MeSH), or lexical resources like WordNet.
If there is no training data, word sense disambiguation is a clustering problem. Hierarchical clusterings may make sense; the dictionaries cited above break meanings of the word "run" down into senses and sub-senses.
For this demo, we will be doing supervised word sense disambiguation. That is, we will have training data consisting of examples of words in context and their meanings. We will compare several LingPipe classifiers on this task.
Senseval & SemEval
Many teams have competed in a series of word sense disambiguation bakeoffs sponsored by the Association for Computational Linguistics' lexical special interest group, SIGLEX.
Senseval 3
Senseval 3 took place in Barcelona in 2004. It is the latest bakeoff to have posted freely available (public domain) data. There is data for English, Italian, Basque, Catalan, Chinese, Romanian and Spanish, all in the same format, linked from the top-level Senseval 3 data page.
Senseval English Data
We'll begin with the English data, which is distributed as two gzipped tarballs. Unpack both tarballs into a directory called $senseval-en. A listing of $senseval-en should then show the following 9 files, which is all we'll need:

EnglishLS.README
EnglishLS.dictionary.mapping.xml
EnglishLS.dictionary.xml
EnglishLS.sensemap
EnglishLS.test
EnglishLS.test.key
EnglishLS.train
EnglishLS.train.key
EnglishLS.words
Senseval Scoring
The Senseval 3 distribution also contains scoring software written in C. We'll produce output in the official format and score it with that software, as described below.
Results of the Evaluation
The entire task, along with a report on the 27 participating teams' submissions and their coarse- and fine-grained scores, is conveniently available in a single paper:
- Rada Mihalcea, Timothy Chklovski, and Adam Kilgarriff. 2004. The Senseval-3 English lexical sample task. In Senseval-3 Workshop.
Given that a fairly straightforward regularized regression approach with feature selection won the bakeoff, we don't understand why the authors conclude that the evaluation "shows once again that voting schemes that combine several learning algorithms outperform the accuracy of individual classifiers".
Training, Test and Answer Formats
The Senseval data looks superficially like XML, but is not, in fact, well-formed XML data.
Even more problematic is the fact that it's been tokenized by the task organizers with single whitespaces inserted between tokens.
Dictionary
The first file we'll consider is the English "dictionary":
$senseval-en/EnglishLS.dictionary.xml
This file contains definitions of lexical items and their senses. There's no top-level element, just a sequence of lexelt elements. Here's the first entry, for the verb activate (with line breaks inserted between attributes for readability; in the file, each sense element is on its own line):
...
<lexelt item="activate.v">
  <sense id="38201" source="ws"
         synset="activate actuate energize start stimulate"
         gloss="to initiate action in; make active."/>
  <sense id="38202" source="ws"
         synset="activate"
         gloss="in chemistry, to make more reactive, as by heating."/>
  <sense id="38203" source="ws"
         synset="activate assign ready"
         gloss="to assign (a military unit) to active status."/>
  <sense id="38204" source="ws"
         synset="activate"
         gloss="in physics, to cause radioactive properties in (a substance)."/>
  <sense id="38205" source="ws"
         synset="activate aerate oxygenate"
         gloss="to cause decomposition in (sewage) by aerating."/>
</lexelt>
...
The dictionary defines the words and their senses. For instance, the verb activate has five senses, numbered 38201 to 38205. The source of the word senses is indicated by the source attribute. The value for this entry, ws, indicates that the senses are drawn from WordSmyth, which has its own entry for "activate". Entries for English nouns have their source attribute set to wn, with word senses drawn from WordNet. Other sources were used for the other languages.

The synset attribute provides a list of synonyms for the word, drawn from the indicated source. The gloss attribute provides a dictionary-style definition for the term.
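The demo code hides the details of reading this pseudo-XML, but if you want to poke at the dictionary yourself, a couple of regular expressions go a long way. The following is a minimal, hypothetical sketch (the class and pattern names are ours, not part of the demo) that pulls the sense IDs and glosses out of a lexelt entry like the one above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DictSketch {
    // Matches one sense element; captures the id and gloss attributes.
    // Assumes the attributes appear in the order shown in the sample (id ... gloss).
    private static final Pattern SENSE = Pattern.compile(
        "<sense\\s+id=\"([^\"]+)\"[^>]*gloss=\"([^\"]*)\"\\s*/>");

    public static void main(String[] args) {
        // A trimmed-down lexelt entry in the format shown above.
        String lexelt =
            "<lexelt item=\"activate.v\">\n"
            + "<sense id=\"38201\" source=\"ws\" synset=\"activate actuate energize start stimulate\" gloss=\"to initiate action in; make active.\"/>\n"
            + "<sense id=\"38202\" source=\"ws\" synset=\"activate\" gloss=\"in chemistry, to make more reactive, as by heating.\"/>\n"
            + "</lexelt>";
        Matcher m = SENSE.matcher(lexelt);
        while (m.find())
            System.out.println(m.group(1) + " => " + m.group(2));
    }
}
```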
Training Data
The training data is in the following file:
$senseval-en/EnglishLS.train
Like the dictionary, the training data has no top-level element wrapping everything, as required in XML. Instead, each word for which there is training data is enclosed in lexelt pseudo-tags. Here's the first example, with line breaks in place of some single spaces in the context element for readability:
<lexelt item="activate.v">
<instance id="activate.v.bnc.00024693" docsrc="BNC">
<answer instance="activate.v.bnc.00024693" senseid="38201"/>
<context>
Do you know what it is , and where I can get one ? We suspect you had seen
the Terrex Autospade , which is made by Wolf Tools . It is quite a hefty
spade , with bicycle - type handlebars and a sprung lever at the rear ,
which you step on to <head>activate</head> it . Used correctly , you
should n't have to bend your back during general digging , although it
wo n't lift out the soil and put in a barrow if you need to move it ! If
gardening tends to give you backache , remember to take plenty of rest
periods during the day , and never try to lift more than you can easily
cope with .
</context>
</instance>
...
</lexelt>
Note the tokenization in strings like "wo n't" and "the day ,". In the header, the name of the word is activate.v, the identifier for the instance is 00024693, and the sense identifier for the intended sense of the word is 38201. In the text sample, within the context element, the mention of the word in question is wrapped in a head element, namely <head>activate</head>.
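For readers who want to experiment outside the demo's own I/O classes, here is a small, hypothetical sketch (class and variable names are ours) showing how the pieces just described might be pulled out of a single training instance: the instance ID, the sense ID from the answer element, and the context text with the head tags stripped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrainInstanceSketch {
    public static void main(String[] args) {
        // A trimmed-down training instance in the pseudo-XML Senseval format.
        String instance =
            "<instance id=\"activate.v.bnc.00024693\" docsrc=\"BNC\">\n"
            + "<answer instance=\"activate.v.bnc.00024693\" senseid=\"38201\"/>\n"
            + "<context>\n"
            + "... a sprung lever at the rear , which you step on to <head>activate</head> it .\n"
            + "</context>\n"
            + "</instance>";

        Matcher id = Pattern.compile("<instance id=\"([^\"]+)\"").matcher(instance);
        Matcher sense = Pattern.compile("senseid=\"([^\"]+)\"").matcher(instance);
        Matcher ctx = Pattern.compile("<context>(.*?)</context>", Pattern.DOTALL).matcher(instance);
        if (id.find() && sense.find() && ctx.find()) {
            // Strip the <head>...</head> markers to get plain classifier text.
            String text = ctx.group(1).replaceAll("</?head>", "").trim();
            System.out.println("instance=" + id.group(1));
            System.out.println("sense=" + sense.group(1));
            System.out.println("text=" + text);
        }
    }
}
```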
Test Data
The test data is just like the training data, only without the answer element providing the sense identification.
<instance id="eat.v.bnc.00064338" docsrc="BNC">
<context>
My 10½ ; - month - old Bearded Collie is obsessed with wood . At home he
has chewed the door frame and part of the back door , as well as the
kitchen cupboards , although these were done when he was a lot younger .
However , he is still obsessed with wood and <head>eating</head> it in
the park . He picks up sticks and sits down to eat them . I try to get
them off him but am not always successful .
</context>
</instance>
Note, in particular, that the word to disambiguate is wrapped in a head element. Furthermore, note that the word here is the present participle form of "eat", namely "eating". This test case also illustrates the other issue with the not-quite-XML, namely that there are unescaped ampersands. It looks like the data was derived from something in SGML newswire format using the frac12 entity, which would have rendered as ½ in the original. In the course of tokenizing the input, the Senseval organizers separated out the semicolon. There are dozens of such entity escapes remaining in the corpus.
Answer Key Format
The answer key is a simple line-oriented format, with each line containing a lexical element (e.g. activate.v), an instance identifier (e.g. activate.v.bnc.00008457), and a list of one or more sense identifiers (e.g. 38201 38203):
activate.v activate.v.bnc.00008457 38201
activate.v activate.v.bnc.00061340 38201
activate.v activate.v.bnc.00066993 38201 U
activate.v activate.v.bnc.00067831 38201
activate.v activate.v.bnc.00134704 38201 38203
activate.v activate.v.bnc.00146646 38201
...
disc.n disc.n.bnc.00184283 disc%1:06:00:: disc%1:25:00::
disc.n disc.n.bnc.00184284 disc%1:06:00:: disc%1:25:00::
disc.n disc.n.bnc.00213452 disc%1:06:03::
....
Note that the format of sense identifiers is different for nouns and verbs. When there is more than one sense identifier on a line, either answer is treated as equally acceptable. We're guessing that the U appearing in the third line above is for "unknown"; it is not mentioned in the scoring documentation.
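As a concrete illustration of the key format, here is a minimal sketch (our own helper class, not part of the demo) that parses one key line into an instance ID and its set of acceptable sense IDs:

```java
import java.util.*;

public class KeyLineSketch {
    // Parses one line of the answer key into (instance id, acceptable senses).
    // "U" entries are kept here as ordinary labels; how to treat them is up to the scorer.
    static void parseLine(String line, Map<String,Set<String>> goldSenses) {
        String[] fields = line.trim().split("\\s+");
        String instanceId = fields[1];   // fields[0] is the lexical element, e.g. activate.v
        Set<String> senses = new HashSet<String>(
            Arrays.asList(fields).subList(2, fields.length));
        goldSenses.put(instanceId, senses);
    }

    public static void main(String[] args) {
        Map<String,Set<String>> gold = new HashMap<String,Set<String>>();
        parseLine("activate.v activate.v.bnc.00134704 38201 38203", gold);
        System.out.println(gold); // e.g. {activate.v.bnc.00134704=[38201, 38203]}
    }
}
```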
System Answer Format
For the official evaluation, there is a description of the official format for answers. System output for each sample must be provided on a single line, as in:
brother.n 00001 501566
brother.n 00002 501566 999997 !!
brother.n 00006 501566/0.5 501573/0.4 503751/0.1
brother.n 00015 503751/94 999999/87 !! comment . . .
The elements of the example correspond to the elements of the answer key described above: brother.n is the lexical element, 00001 is an instance ID, and 501566, 999997, and 501573 are all sense tags.
What's interesting about this format is that systems may provide weighted answers. The second line indicates that the system reduced the choices to one of two sense tags, 501566 or 999997, whereas in the third line, three tags are given with real-valued weights. The fourth line illustrates that the weights need not be scaled. If there are no weights, each answer is assumed to be equally likely. In all cases, the weights are scaled to probabilities (so that they sum to 1.0).
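To make the weighting rule concrete, here is a small, hypothetical sketch (our own helper, not part of the demo or the official scorer) that turns the sense tags on one answer line into a probability distribution, treating missing weights as 1 and ignoring the !! comment:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AnswerWeightsSketch {
    // Converts the sense tags on one system-answer line into a probability distribution.
    static Map<String,Double> toDistribution(String line) {
        String answerPart = line.split("!!")[0].trim();  // drop the optional comment
        String[] fields = answerPart.split("\\s+");
        Map<String,Double> weights = new LinkedHashMap<String,Double>();
        double total = 0.0;
        for (int i = 2; i < fields.length; ++i) {  // fields 0,1 = lexical element, instance id
            String[] senseAndWeight = fields[i].split("/");
            double w = senseAndWeight.length > 1 ? Double.parseDouble(senseAndWeight[1]) : 1.0;
            weights.put(senseAndWeight[0], w);
            total += w;
        }
        for (Map.Entry<String,Double> e : weights.entrySet())
            e.setValue(e.getValue() / total);      // rescale so the weights sum to 1.0
        return weights;
    }

    public static void main(String[] args) {
        System.out.println(toDistribution("brother.n 00015 503751/94 999999/87 !! comment"));
        // prints roughly {503751=0.519..., 999999=0.480...}
    }
}
```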
Scoring
Although the Senseval scoring document is very confusing, Dan Melamed and Phil Resnik's scoring proposal is very clear. The score for a given test case is simply the sum of weights provided by the system response for each sense that is among the valid answers.
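Here is that scoring rule as a minimal sketch (our own helper, not the official scorer2 code): sum the normalized weights the system assigned to senses that appear among the key's acceptable answers.

```java
import java.util.*;

public class ScoreSketch {
    // Fine-grained score for one test instance, following the sum-of-weights rule:
    // add up the normalized weight the system put on each sense the key accepts.
    static double score(Map<String,Double> systemDistribution, Set<String> goldSenses) {
        double total = 0.0;
        for (Map.Entry<String,Double> e : systemDistribution.entrySet())
            if (goldSenses.contains(e.getKey()))
                total += e.getValue();
        return total;
    }

    public static void main(String[] args) {
        Map<String,Double> dist = new HashMap<String,Double>();
        dist.put("38201", 0.635);
        dist.put("38203", 0.365);
        Set<String> gold = new HashSet<String>(Arrays.asList("38201"));
        System.out.println(score(dist, gold)); // 0.635
    }
}
```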
There is an interesting discussion of scoring, including Senseval, in the following paper:
- Dan Klein and Chris Manning. 2002. Conditional structure versus conditional estimation in NLP models. In EMNLP 2002.
Klein and Manning contrast Senseval's sum of scores approach to the more usual product of scores approach. We'll present both evaluations here.
Demo Walkthrough
For the purposes of this demo, we approach word sense disambiguation as a simple text classification problem. We will use the full text in the context around the training examples to train a text classifier. We will then simply run that classifier on the new examples. Almost all of the work in the code is in I/O; there are only a handful of lines that actually involve classification.
The Source Code
The tutorial assumes you are running inside the LingPipe distribution at $lingpipe/lingpipe/demos/tutorial/wordSense/. The code is in a single file, src/Senseval3.java, which contains multiple class definitions. The build file we use for running the tutorial is build.xml. Everything's run statically in the code from the main() method.
Running the Demo
The Ant target run will perform a complete Senseval 3 run. It's necessary to provide a pointer to $senseval-en, which is the location of the unpacked Senseval 3 data. In our case, we put it in e:\data\senseval3\unpacked\english, so we run with an explicit property specification -Dsenseval.dir=...:
cd $demo-dir
ant -Dsenseval.dir=e:\data\senseval3\unpacked\english run
Running it out of the box produces the following (elided) output:
Building jar: C:\carp\mycvs\lingpipe\demos\tutorial\wordSense\wordSense.jar
Dictionary File=E:\data\senseval3\unpacked\english\EnglishLS.dictionary.xml
Training File=E:\data\senseval3\unpacked\english\EnglishLS.train
Testing File=E:\data\senseval3\unpacked\english\EnglishLS.test
Answer Key File=E:\data\senseval3\unpacked\english\EnglishLS.test.key
System Response File=C:\carp\mycvs\lingpipe\demos\tutorial\wordSense\systemAnswer.txt
Reading Dictionary. #entries=57
Reading Training Data. #training words=57
Reading Test Data. #test cases=3944
Training Model.
training different.a [5 senses]
training rule.v [3 senses]
...
training expect.v [3 senses]
finished training.
Running Model over Test Data.
finished test data.
FINISHED.
The result is a file systemAnswer.txt that looks as follows (with most of the content elided):
activate.v activate.v.bnc.00008457 38201/1000
activate.v activate.v.bnc.00061340 38201/1000
activate.v activate.v.bnc.00061975 38201/1000
...
appear.v appear.v.bnc.00041843 190902/635 190901/365
...
write.v write.v.bnc.00007127 4753408/1000
write.v write.v.bnc.00007179 4753408/1000
write.v write.v.bnc.00007381 4753407/1000
We have illustrated one example (00041843) for which we provide multiple answers. Not all classifiers are set up to do this, as we discuss shortly.
Evaluating the Results
Retrieving the Evaluation Code
Rather than writing our own evaluation code, we use the C code supplied by the organizers. The code's a single C source file which is easy to compile. Here's a link for the tarball download:
- Download Senseval 3 Scoring Code
After downloading, just unpack the gzipped tarball into a convenient location, which we'll call $senseval-score-dir.
Compiling the Evaluation Code
Luckily, it's just a single C file, so it's easy to compile:
> cd $senseval-score-dir
> gcc -o scorer2 scorer2.c
There is no standard output from the compile, but it produces an executable, scorer2.exe on Windows or scorer2 on Linux/Unix, that sits in the same directory as the source scorer2.c.
Running the Evaluator
Running the evaluator is straightforward. To get so-called "fine-grained" scores, all you need is:
> $senseval-score-dir/scorer2.exe systemAnswer.txt $senseval-en/EnglishLS.test.key
Or, on a Linux/Unix machine:
> $senseval-score-dir/scorer2 systemAnswer.txt $senseval-en/EnglishLS.test.key
The fine-grained scores treat every sense as distinct. Here's what it looks like on my machine:
> e:\data\senseval3\unpacked\scoring\scorer2.exe systemAnswer.txt e:\data\senseval3\unpacked\english\EnglishLS.test.key

Fine-grained score for "systemAnswer.txt" using key "e:\data\senseval3\unpacked\english\EnglishLS.test.key":
 precision: 0.660 (2601.46 correct of 3944.00 attempted)
 recall: 0.660 (2601.46 correct of 3944.00 in total)
 attempted: 100.00 % (3944.00 attempted of 3944.00 in total)
Note that if you ran the default classifier, this is the result for a character language model-based classifier with default parameters (n-gram length 5, 2^16 characters, interpolation parameter 5.0). This is our recommended out-of-the-box configuration for classification.
It is also possible to derive coarse-grained scores in which senses are conflated into a smaller set of senses. For this, the conflation file is needed as input, as well as a command-line parameter:
> $senseval-score-dir/scorer2.exe systemAnswer.txt $senseval-en/EnglishLS.test.key $senseval-en/EnglishLS.sensemap -g coarse
In our case, here's what it looks like:
> e:\data\senseval3\unpacked\scoring\scorer2.exe systemAnswer.txt e:\data\senseval3\unpacked\english\EnglishLS.test.key e:\data\senseval3\unpacked\english\EnglishLS.sensemap -g coarse

Coarse-grained score for "systemAnswer.txt" using key "e:\data\senseval3\unpacked\english\EnglishLS.test.key":
 precision: 0.708 (2792.18 correct of 3944.00 attempted)
 recall: 0.708 (2792.18 correct of 3944.00 in total)
 attempted: 100.00 % (3944.00 attempted of 3944.00 in total)
Our coarse-grained score puts us further down the pack of scores. In fact, looking at the results table, the coarse-grained ordering is rather different from the fine-grained one.
How'd we Do?
An F-measure score of 0.660 would've put us squarely in the middle of the pack for the actual evaluation. Looking at The Senseval-3 English lexical sample task paper, which includes the results (pages 4 and 5), the best fine-grained score was 0.729. There would've been 21 submissions ahead of ours, one tied with ours, and 24 behind ours. With roughly 2500 test cases and performance around 0.7, the standard deviation approximated by a binomial is:
sqrt(0.7 * (1 - 0.7) / 2500) ≈ 0.0092
That's a 95% confidence interval of roughly ±0.018. Not considering the multi-way test aspects, roughly the top 13 systems or so are not "significantly" pairwise distinguishable at 95% confidence. Unfortunately, the best systems are significantly better than our off-the-shelf classifiers.
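For readers who want to reproduce the arithmetic, here is the calculation as a tiny sketch (the 1.96 multiplier is the usual normal approximation for a 95% interval):

```java
public class ConfidenceSketch {
    public static void main(String[] args) {
        double p = 0.7;    // approximate accuracy
        int n = 2500;      // approximate number of test cases
        double sd = Math.sqrt(p * (1 - p) / n);  // binomial standard deviation, ~0.0092
        double halfWidth = 1.96 * sd;            // 95% interval half-width, ~0.018
        System.out.printf("sd=%.4f, 95%% interval=+/-%.3f%n", sd, halfWidth);
    }
}
```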
How Hard is WSD?
The most-frequent sense baseline here is quite high, at 55.2%. System scores in the 60-70% range are not doing much better than this baseline. We'd actually need a better evaluation setup in order to compute the κ statistic, but it would certainly be low.
Inter-annotator agreement on the tagging task is only 67%, which is about how well our default system performed (66%). It's not clear how the data was adjudicated to create the final training data given the low inter-annotator agreement (Mihalcea et al., section 2.4). Their computation of kappa is non-standard in that they set the expectation to be a uniform distribution rather than the data distribution. (See Carletta's CL squib or Passonneau's LREC paper on kappa variants).
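To see why the choice of expected agreement matters, here is a small sketch of the kappa formula with purely illustrative expectation figures (the 0.20 and 0.50 values are our own placeholders, not computed from the corpus); a uniform expectation yields a much more flattering kappa than an expectation based on the skewed data distribution:

```java
public class KappaSketch {
    // kappa = (observed agreement - expected agreement) / (1 - expected agreement)
    static double kappa(double observed, double expected) {
        return (observed - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        double observed = 0.67;          // reported inter-annotator agreement

        double uniformExpected = 0.20;   // e.g. five equiprobable senses (illustrative)
        double empiricalExpected = 0.50; // dominant-sense distribution (illustrative)

        System.out.println("kappa, uniform expectation   = " + kappa(observed, uniformExpected));
        System.out.println("kappa, empirical expectation = " + kappa(observed, empiricalExpected));
    }
}
```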
Command-line Parameters
There are five command-line arguments, each with its own property in the Ant build file build.xml. We summarize these in the following table:
Command-line Arguments

Arg | Property | Default Value | Description |
---|---|---|---|
n/a | senseval.dir | ${data.dir}/senseval3/unpacked/english | Directory containing all the data ($senseval-en); used to define the defaults below. |
0 | train.dictionary.file | ${senseval.dir}/EnglishLS.dictionary.xml | Dictionary of word senses, synonyms and definitions. |
1 | train.file | ${senseval.dir}/EnglishLS.train | Training texts with words in context. |
2 | test.file | ${senseval.dir}/EnglishLS.test | File of test cases with words in context. |
3 | answer.file | systemAnswer.txt | Output file produced by system run. |
4 | classifier.id | 0 | Number indicating which classifier to use. |
Code Walkthrough
The code for the entire Senseval entry is in one file, src/Senseval3.java. We have done our best to separate the data-parsing component of the exercise from the actual classification of senses.
Main Method
Everything runs from a single main() method.
public static void main(String[] args) throws ClassNotFoundException, IOException {
    File dictFile = new File(args[0]);
    File trainFile = new File(args[1]);
    File testFile = new File(args[2]);
    File responseFile = new File(args[3]);
    sClassifierNumber = Integer.parseInt(args[4]);

    SenseEvalDict dict = new SenseEvalDict(dictFile);
    TrainingData trainingData = new TrainingData(trainFile);
    TestData testData = new TestData(testFile);

    SenseEvalModel model = new SenseEvalModel(dict,trainingData);

    respond(model,testData,responseFile);
}
The first few lines merely read in command-line parameters. The next lines read in all of the data for the evaluation. First, the dictionary is read from the dictionary file, then the training data is read from the training file. Finally, the test data is read from the test file. The formats of these files were discussed in the previous sections.
Senseval Data Structures
Rather than get into the ugly details of parsing these data formats, we instead present the static classes that are constructed from the data. These are designed to be easy to use as the basis of further exploration into word-sense disambiguation without mixing up the data format details and the classifier and training details.
Dictionary Format
The dictionary is simply a mapping from strings consisting of words plus categories (e.g. activate.v) to an array of senses.
static class SenseEvalDict extends HashMap<String,Sense[]>
The sense class contains all of the information in the Senseval dictionary:
static class Sense {
    String mId;
    String mSource;
    String mSynset;
    String mGloss;
    ...
Training Data Format
Like the dictionary, the text training data is represented as a map from words plus categories (e.g. activate.v) to the training data for that word.
static class TrainingData extends HashMap<String,Map<String,List<String>>>
The training data for a word plus category is of type Map<String,List<String>>. This mapping is from a sense ID (e.g. 38202 or argument%1:10:02::, the format varies by source) to the list of textual contexts for that sense.
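As a small usage sketch of this structure (assuming the TrainingData class as declared above, and meant to sit alongside the other static methods in Senseval3.java), here is one way to count how many training contexts each sense has; it's also a quick way to see how sparse some senses are:

```java
// Hypothetical helper (ours, not in the demo): report the number of training
// contexts per sense for every word plus category in the training data.
static void reportTrainingCounts(TrainingData trainingData) {
    for (String wordPlusCat : trainingData.keySet()) {
        Map<String,List<String>> senseToTexts = trainingData.get(wordPlusCat);
        for (Map.Entry<String,List<String>> entry : senseToTexts.entrySet())
            System.out.println(wordPlusCat + "\t" + entry.getKey()
                               + "\t" + entry.getValue().size() + " contexts");
    }
}
```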
Test Data Format
The test data class consists of three parallel lists:
static class TestData {
    List<String> mWordsPlusCats = new ArrayList<String>();
    List<String> mInstanceIds = new ArrayList<String>();
    List<String> mTextsToClassify = new ArrayList<String>();
Each list has a length equal to the number of test cases. The first list contains the words plus categories of the test cases, the second contains the instance IDs for the test cases (e.g. difference.n.bnc.00002450), and the third the actual textual contexts containing the instance of the word to be classified.
Training
We use the dictionary and training data to build a complete Senseval model using the model's constructor:
static class SenseEvalModel extends HashMap<String,Classifier> {
    SenseEvalModel(SenseEvalDict dict, TrainingData trainingData)
        throws ClassNotFoundException, IOException {

        for (String wordPlusCat : trainingData.keySet()) {
            Map<String,List<String>> senseToTextList = trainingData.get(wordPlusCat);
            String[] senseIds = senseToTextList.keySet().<String>toArray(new String[0]);
            ObjectHandler<Classified<CharSequence>> trainer = createClassifierTrainer(senseIds);
            for (String senseId : senseToTextList.keySet()) {
                Classification classificationForSenseId = new Classification(senseId);
                List<String> trainingTextList = senseToTextList.get(senseId);
                for (String trainingText : trainingTextList) {
                    Classified<CharSequence> classified =
                        new Classified<CharSequence>(trainingText,classificationForSenseId);
                    trainer.handle(classified);
                }
            }
            BaseClassifier<CharSequence> classifier =
                (BaseClassifier<CharSequence>) AbstractExternalizable.compile((Compilable)trainer);
            put(wordPlusCat,classifier);
        }
    }
}
Although this code looks dense, most of it is simply negotiating the data structures previously introduced. The outer loop walks over all of the words plus categories in the training data. It then extracts the sense-to-text-list map and the array of sense IDs for that word. These sense IDs are used as the categories in constructing a trainable classifier. We have assigned it to the type of classification handler whose classified elements are character sequences (CharSequence) and whose classification is a simple first-best classification result (Classification). This just means it can take classified character sequences as training data. We come back to how the classifiers are constructed in the next section.
Next, for each sense of the word, we take a classification corresponding to that sense and use it to train the classifier trainer with each training text.
Finally, the classifier is compiled and added to the mapping (using the superclass's put() method).
Constructing the Classifier
The actual classifier construction is broken off into a separate method, because it allows so many options. It's basically a large integer switch statement, the bodies of the cases of which construct and return classifiers.
Character LM Classifiers
The first cases are for character language-model based classifiers. As the texts are long and their beginnings and ends are not special, we have used standard process LM classifiers rather than boundary models.
static ObjectHandler<Classified<CharSequence>> createClassifierTrainer(String[] senseIds) {
    switch (sClassifierNumber) {

    case 0: // DEFAULT CHARACTER LM CLASSIFIER
        return DynamicLMClassifier.createNGramProcess(senseIds,5);
    ...
The first case is our default recommended classification model for English, a 5-gram character language model classifier with default parameters.
The second case is a configurable version of the process n-gram language model classifiers:
    case 1: // CONFIGURABLE CHARACTER LM CLASSIFIER
        LanguageModel.Dynamic[] lms5 = new LanguageModel.Dynamic[senseIds.length];
        for (int i = 0; i < lms5.length; ++i)
            lms5[i] = new NGramProcessLM(6,    // n-gram
                                         128,  // num chars
                                         1.0); // interpolation ratio
        return new DynamicLMClassifier<LanguageModel.Dynamic>(senseIds,lms5);
This allows the n-gram process language models to be configured for n-gram length (6 here), number of characters (set to 128 for ASCII input), and the interpolation ratio (set to 1.0, which means less smoothing).
Token LM and Naive Bayes Classifiers
If the features are just token counts, naive Bayes classification is just a token unigram model. We have a special class for this case with its own constructor:
    case 2: // DEFAULT NAIVE BAYES CLASSIFIER
        return new NaiveBayesClassifier(senseIds,SPACE_TOKENIZER_FACTORY);
In order to construct a token-based classifier, we need a tokenizer. Here we've used one we'll explain in the next section that simply breaks on spaces. This is fine for this context as the organizers already decided on token boundaries for us and inserted spaces between them.
Further note that LingPipe's naive Bayes classifier is configured to use boundary character n-gram smoothing for the tokens.
The default token unigram and bigram classifiers are too aggressive with unseen words. Ideally, they should be trained with some kind of explicit add-one (Laplace prior) smoothing. Here are the default versions:
    case 3: // DEFAULT TOKEN UNIGRAM LM CLASSIFIER
        return DynamicLMClassifier.createTokenized(senseIds,
                                                   SPACE_TOKENIZER_FACTORY,
                                                   1);

    case 4: // DEFAULT TOKEN BIGRAM LM CLASSIFIER
        return DynamicLMClassifier.createTokenized(senseIds,
                                                   SPACE_TOKENIZER_FACTORY,
                                                   2);
The final tokenized language model classifier is fully configurable. We set it up to use the default smoothing character language models, the bounded n-gram character language models, with the default n-gram length (5), the appropriate number of characters for ASCII (128), the default smoothing parameter (5.0, the length of the n-gram), and the default boundary character, the non-code-point character '\uFFFF'.
    case 5: // CONFIGURABLE TOKENIZED LM CLASSIFIER W. CHARACTER BOUNDARY LM SMOOTHING
        LanguageModel.Dynamic[] lms2 = new LanguageModel.Dynamic[senseIds.length];
        for (int i = 0; i < lms2.length; ++i)
            lms2[i] = new TokenizedLM(SPACE_TOKENIZER_FACTORY,
                                      2,    // n-gram length
                                      new NGramBoundaryLM(5,128,5.0,'\uFFFF'),
                                      new NGramBoundaryLM(5,128,5.0,'\uFFFF'),
                                      1.0); // interpolation param
        return new DynamicLMClassifier<LanguageModel.Dynamic>(senseIds,lms2);
Note that a tokenizer factory is required. We have included a special tokenizer factory with this class, which breaks on spaces. We return to tokenization in the next section.
TF/IDF Classifier
The term frequency (TF) / inverse document frequency (IDF) classifier requires a feature extractor in its construction, which we build here from a tokenizer factory.
    case 6: // TF-IDF CLASSIFIER
        FeatureExtractor<CharSequence> featureExtractor5
            = new TokenFeatureExtractor(NORM_TOKENIZER_FACTORY);
        return new TfIdfClassifierTrainer<CharSequence>(featureExtractor5);
Note that this case uses the normalized tokenizer factory, which lower cases tokens and can also stoplist and stem them, as we describe in the next section.
Nearest Neighbors Classifier
Finally, we consider the K nearest neighbors (KNN) classifier. This requires several parameters. First, we require a feature extractor to convert input character sequences into tokens. We use a normalizing tokenizer factory here, which we describe more fully in the next section. The other explicit parameter is the number K of neighbors, here set to 20. For a KNN classifier, the 20 closest neighbors to an input are consulted and their results averaged.
    case 7: // K-NEAREST NEIGHBORS DEFAULT CLASSIFIER
        FeatureExtractor<CharSequence> featureExtractor6
            = new TokenFeatureExtractor(NORM_TOKENIZER_FACTORY);
        return new KnnClassifier<CharSequence>(featureExtractor6,
                                               20); // num neighbors to average
KNN classification will have trouble with this data set because for some senses, the number of training instances is so small that it can never dominate with 20 neighbors. The problem with reducing the number of neighbors is that it greatly increases the variance of the classifier.
In general, KNN classifiers require a specified distance metric to find the nearest neighbors. The default is Euclidean distance, as used in the previous classifier. We may also consider other distances, such as cosine distance, which is popular for bag-of-words natural language classification:
    case 8: // K-NEAREST NEIGHBORS DEFAULT CLASSIFIER (COSINE DISTANCE)
        FeatureExtractor<CharSequence> featureExtractor7
            = new TokenFeatureExtractor(NORM_TOKENIZER_FACTORY);
        return new KnnClassifier<CharSequence>(featureExtractor7,
                                               new CosineDistance(),
                                               20); // num neighbors to average
Tokenizers
For the token-sensitive models, we investigate two tokenizers, each with its own factory.
Space-based Tokenization
We can tokenize on spaces using a simple regular-expression-based tokenizer factory, which is completely thread safe and serializable.
static final TokenizerFactory SPACE_TOKENIZER_FACTORY = new RegExTokenizerFactory("\\S+");
Tokens are defined as maximally long sequences of non-whitespace characters (the regular expression \S+, with an appropriately escaped backslash in the Java string).
Also note that we have made the tokenizer factory serializable so that it may be compiled along with the models.
Normalizing Tokenization
Normalizing tokenization applies several filter tokenizers to the result of space tokenization:
static final TokenizerFactory NORM_TOKENIZER_FACTORY = normTokenizerFactory();

static TokenizerFactory normTokenizerFactory() {
    TokenizerFactory factory = SPACE_TOKENIZER_FACTORY;
    factory = new LowerCaseTokenizerFactory(factory);
    // factory = new EnglishStopTokenizerFactory(factory);
    // factory = new PorterStemmerTokenizerFactory(factory);
    return factory;
}
This takes the result of the space-based tokenizer, lower cases all the tokens, removes English stop words, then applies the Porter stemmer to the output. We've commented out the stoplisting and the stemming, but these can be added back in to see the difference.
Character N-gram Tokenization
LingPipe implements a tokenizer based on taking n-grams of input characters. These are specified by a range of sizes, such as the following:
static final TokenizerFactory NGRAM_TOKENIZER_FACTORY = new NGramTokenizerFactory(3,4);
As specified, this tokenizer returns all length 3 and length 4 substrings of the input being tokenized.
Running the Test Cases
The code for running the test cases is straightforward. We just feed the text to a classifier to get a classification result.
static void respond(SenseEvalModel model, TestData testData, File file)
    ...
    for (int i = 0; i < testData.mWordsPlusCats.size(); ++i) {
        String wordPlusCat = testData.mWordsPlusCats.get(i);
        Classifier classifier = model.get(wordPlusCat);
        String instanceId = testData.mInstanceIds.get(i);
        String textToClassify = testData.mTextsToClassify.get(i);
        Classification classification = classifier.classify(textToClassify);
        if (classification instanceof ConditionalClassification) {
            ConditionalClassification condClassification
                = (ConditionalClassification) classification;
            for (int rank = 0; rank < condClassification.size(); ++rank) {
                int conditionalProb
                    = (int) java.lang.Math.round(1000.0 * condClassification.conditionalProbability(rank));
                if (rank > 0 && conditionalProb < 1) break;
                String category = condClassification.category(rank);
                ...
            }
        } else {
            String category = classification.bestCategory();
            ...
        }
    ...
The code is a bit more complex than the bare classification call because we are given two data structures: the classifiers, in the form of a SenseEvalModel, and the test data, in the form of an instance of TestData. The file argument is where the output is written; we have elided the output-formatting code here (ellipses [...] in the listing).
The code iterates over the test cases, which are indexed by word plus category. For each case, it retrieves the classifier from the model, extracts the text to classify from the test data, and runs the classifier on that text.
N-best Output
Finally, if the result is a conditional classification, we return all n-best results (quantized to a 0-1000 scale). This is allowed by Senseval scoring, though given their arithmetic mean scoring (probabilistic log loss scoring involves the log of the geometric average), it doesn't help much. (Without this, the score of our default model is 0.659 rather than 0.660). Given that most results of the conditional classifiers are highly skewed due to lack of intra-text dependency modeling, we don't have highly accurate confidence estimates.
Confidence Thresholded Output
The Senseval evaluation allows a system to not provide an answer. Unanswered instances count against recall performance, but not against precision. Several of the classifiers provide scores that can be used for such thresholding. For instance, we could require at least n votes out of m in our KNN implementations, or we could require cross-entropy below a given threshold in our language model classifiers.
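None of this thresholding is implemented in the demo, but here is a hypothetical sketch of one simple variant using the conditional classifications we already have: answer only when the first-best conditional probability clears a minimum (the minConfidence parameter is ours, not part of the demo).

```java
// Hypothetical sketch: returning null means the instance is left unanswered,
// trading recall for precision.
static String thresholdedAnswer(Classification classification, double minConfidence) {
    if (classification instanceof ConditionalClassification) {
        ConditionalClassification cond = (ConditionalClassification) classification;
        if (cond.conditionalProbability(0) < minConfidence)
            return null;                  // skip this instance
        return cond.category(0);          // rank 0 is the best category
    }
    return classification.bestCategory(); // no confidence estimate; always answer
}
```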
Results
Here are the resulting scores from the scorer for the various models. Because they all try every example, precision and recall are equal, and we report them as accuracy. Note that these are the fine-grained sense evaluations. For some classifiers, we've reported results for multiple parameterizations without defining new cases (that is, we just edited the source and re-ran).
WSD Results

Accuracy (fine) | Time | ID | Classifier |
---|---|---|---|
0.660 | 45s | 0 | default character n-gram (5-gram, 16K chars, interp=5.0) |
0.660 | 82s | 1 | character process 6-gram, 128 chars, interp=1.0 |
0.629 | 9s | 2 | Naive Bayes (space tokenizer) |
0.638 | 12s | 2b | Naive Bayes (norm tokenizer) |
0.629 | 9s | 3 | Default Token Unigram (space tokenizer) |
0.638 | 11s | 3b | Default Token Unigram (norm tokenizer) |
0.631 | 11s | 4 | Default Token bigram (space tokenizer) |
0.638 | 13s | 4b | Default Token bigram (norm tokenizer) |
0.653 | 24s | 5 | Configurable token n-gram (norm tokenizer, bigram, n-gram boundary lm 4/128/1.0, 0.25 interp) |
0.654 | 26s | 5b | Configurable token n-gram (norm tokenizer, trigram, n-gram boundary lm 4/128/1.0, 0.1 interp) |
0.650 | 5s | 6 | TF/IDF classifier (space tokenizer) |
0.651 | 8s | 6b | TF/IDF classifier (norm tokenizer) |
0.630 | 11s | 6c | TF/IDF classifier (character 3-gram tokenizer) |
0.660 | 22s | 6d | TF/IDF classifier (character 4-gram tokenizer) |
0.666 | 36s | 6e | TF/IDF classifier (character 5-gram tokenizer) |
0.662 | 54s | 6f | TF/IDF classifier (character 6-gram tokenizer) |
0.668 | 134s | 6g | TF/IDF classifier (character 4-, 5-, and 6-gram tokenizer) |
0.587 | 8s | 7 | KNN classifier (space tokenizer, Euclidean distance, k=20) |
0.565 | 10s | 7b | KNN classifier (norm tokenizer, Euclidean distance, k=20) |
0.521 | 10s | 7c | KNN classifier (space tokenizer, Euclidean distance, k=1) |
0.504 | 10s | 7c | KNN classifier (space tokenizer, Euclidean distance, k=2) |
0.553 | 10s | 7d | KNN classifier (space tokenizer, Euclidean distance, k=4) |
0.580 | 10s | 7e | KNN classifier (space tokenizer, Euclidean distance, k=8) |
0.581 | 10s | 7f | KNN classifier (space tokenizer, Euclidean distance, k=16) |
0.585 | 10s | 7g | KNN classifier (space tokenizer, Euclidean distance, k=32) |
0.569 | 10s | 7g | KNN classifier (space tokenizer, Euclidean distance, k=64) |
0.587 | 8s | 8 | KNN classifier (space tokenizer, cosine distance, k=20) |
0.581 | 9s | 8b | KNN classifier (norm tokenizer, cosine distance, k=20) |
0.598 | 41s | 8c | KNN classifier (character 5-gram tokenizer, cosine distance, k=20) |
0.611 | 41s | 8d | KNN classifier (character 5-gram tokenizer, cosine distance, k=10) |
0.618 | 43s | 8d | KNN classifier (character 5-gram tokenizer, cosine distance, k=10, distance weighted) |
Overall, the best accuracy (66.8%) is achieved with a TF/IDF classifier over character 4-, 5-, and 6-grams. Unfortunately, this model also required 2 gigabytes of memory (on a 64-bit Java) to hold the entire set of training models in memory. The more modest time and memory requirements of the TF/IDF classifier with the simple character 5-gram tokenizer (66.6%) make it more attractive. The TF/IDF classifier with the simple space tokenizer is the fastest of all; it is only 1.8% less accurate than the best classifier while running more than 20 times as fast and using much less memory.
In other text classification experiments, we've found the character language model classifiers to work better than any choice of token n-gram for TF/IDF.
We have also reported rough timings as measured by Ant. The test machine is a dual AMD Opteron 1.8GHz with 512MB of PC2700 ECC RAM, running JDK 1.6 in -server mode on WinXP x64 SP2.
Improving WSD Performance
The systems in the Senseval 3 bakeoff that performed better than our approach brought in much richer feature sets. They used everything from part-of-speech taggers to full syntactic parsers, dictionaries and thesauri (such as WordNet), and large corpora over which they trained using semi-supervised methods. On top of this, there was a lot of work on feature selection and weighting.
The better-scoring systems also tended to use discriminative classification methods, such as SVMs, perceptrons or various forms of linear regression.
Unsupervised Word Sense Disambiguation
Unsupervised WSD is essentially a clustering problem. From the performance of KNN classifiers, it'd seem that a character 5-gram tokenizer and cosine distance are the best bets for basic textual features.
Then it is simply a matter of taking the input texts and running them through clustering. Rather than doing that here, we refer you to the clustering tutorial, where we work through an example of proper name clustering:
- John Smith Clustering (in the clustering tutorial)
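To make the suggested representation concrete, here is a small, self-contained sketch (ours, not LingPipe's clustering API) of cosine distance over character 5-gram counts; a clusterer like the ones in the clustering tutorial could be run over the test texts using exactly this kind of distance.

```java
import java.util.HashMap;
import java.util.Map;

public class CharNGramCosineSketch {
    // Character 5-gram counts for a text.
    static Map<String,Integer> counts(String text) {
        Map<String,Integer> grams = new HashMap<String,Integer>();
        for (int i = 0; i + 5 <= text.length(); ++i) {
            String gram = text.substring(i, i + 5);
            Integer c = grams.get(gram);
            grams.put(gram, c == null ? 1 : c + 1);
        }
        return grams;
    }

    // Cosine distance = 1 - cosine similarity between two sparse count vectors.
    static double cosineDistance(Map<String,Integer> x, Map<String,Integer> y) {
        double dot = 0.0, xx = 0.0, yy = 0.0;
        for (Map.Entry<String,Integer> e : x.entrySet()) {
            xx += e.getValue() * e.getValue();
            Integer c = y.get(e.getKey());
            if (c != null) dot += e.getValue() * c;
        }
        for (int c : y.values()) yy += c * c;
        return 1.0 - dot / (Math.sqrt(xx) * Math.sqrt(yy));
    }

    public static void main(String[] args) {
        String a = "he picks up sticks and sits down to eat them";
        String b = "step on the lever to activate the spade";
        System.out.println(cosineDistance(counts(a), counts(b)));
    }
}
```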
References
The best place to look is the proceedings of the Senseval workshops:
Here's the overview paper for Senseval 3 and a link to the table of contents (which has links to the PDFs).
- Rada Mihalcea, Timothy Chklovski, and Adam Kilgarriff. 2004. The Senseval-3 English lexical sample task. In Senseval-3 Workshop.
- Senseval 3 Proceedings (with hyperlinks).
Both of the textbooks in the field have whole chapters devoted to word sense disambiguation:
- Chris Manning and Hinrich Schuetze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
- Dan Jurafsky and James Martin. 2000. Speech and Language Processing. Prentice-Hall.
Here's a link to a PowerPoint presentation of a tutorial on WSD:
- Ted Pedersen and Rada Mihalcea. 2005. Advances in Word Sense Disambiguation [.ppt] ACL Tutorial.