Why are Chinese Words Hard?
Unlike Western languages, Chinese is written without spaces between words. Thus to run any word- or token-based linguistic processing on Chinese, it is first necessary to determine word boundaries. This tutorial shows how to segment Chinese into words based on LingPipe's spelling corrector.
How's it Done?
The basic idea is to treat the lack of space between tokens as spelling "mistakes" which the spelling corrector will "correct" with the insertion of spaces.
Who Thought of Doing it This Way?
This is just another way of looking at the compression-based approach of Bill Teahan et al. See the references for more details.
1. Downloading Training Corpora
Luckily for us, there are three publicly available training corpora for Chinese segmentation made available as part of the First International Chinese Word Segmentation Bakeoff. (The Second Bakeoff, held in 2005, is covered in a later section.) All further discussion in this section is of the first bakeoff. These bakeoffs are sponsored by SigHan, the Chinese special interest group (SIG) of the Association for Computational Linguistics (ACL).
Step one for the tutorial is to download the test and training data from the six links below (after noting that the data are made available for research purposes only as stated on this page):
First International Chinese Word Segmentation Bakeoff Data Links & Content

| Corpus Creator | Training | Testing | Encoding | # Train Words | # Test Words |
|---|---|---|---|---|---|
| Academia Sinica (AS) | Training Data (11.8M) (mirror) | Testing Data (60K) (mirror) | CP950 | 5.8M | 12K |
| HK CityU (HK) | Training Data (500K) (mirror) | Testing Data (150K) (mirror) | Big5_HKSCS | 240K | 35K |
| Peking University (PK) | Training Data (2.3M) [no longer live] | Testing Data (90K) [no longer live] | CP936 | 1.1M | 17K |
Place all six of these files (without unzipping the
.zip
files) into a directory. We'll call the directory
containing the data dataDir
after the Ant property we
will use to specify it.
2. Running the Evaluations
Once the code is compiled, there are three Ant tasks that can be used to run the evaluations. Running these tasks produces standard output as well as a file of official evaluation results.
To run the Hong Kong City University training sets, first cd to the demo directory:
cd lingpipe/demos/tutorial/chineseTokens
Then you can either run the evaluation from Ant by specifying the location of the data directory on the command line
ant -DdataDir=dataDir run-cityu
or directly via the following command (with the name you chose for
your data directory substituted for dataDir
, and replacing the semicolons ";" with colons ":" if you are not using Windows):
java -cp "../../../lingpipe-4.1.2.jar;zhToksDemo.jar" ChineseTokens dataDir cityu hk cityu.out Big5_HKSCS 5 5.0 5000 256 0.0 0.0
For example, we downloaded the six files to
e:\data\chineseWordSegBakeoff03
, so we can run as follows
(please be patient during compilation -- it takes eight minutes or so on our
desktop):
> java -cp "../../../lingpipe-4.1.2.jar;zhToksDemo.jar" ChineseTokens e:\data\chineseWordSegBakeoff03 cityu hk cityu.out Big5_HKSCS 5 5.0 5000 256 0.0 0.0
CHINESE TOKENS DEMO
Data Directory=e:\data\chineseWordSegBakeoff03
Train Corpus Name=cityu
Test Corpus Name=hk
Output File Name=e:\data\chineseWordSegBakeoff03\cityu.out.segments
Known Tokens File Name=e:\data\chineseWordSegBakeoff03\cityu.out.knownWords
Char Encoding=Big5_HKSCS
Max N-gram=5
Lambda factor=5.0
Num chars=5000
Max n-best=256
Continue weight=0.0
Break weight=0.0
Training Zip File=e:\data\chineseWordSegBakeoff03\cityu_training.zip
Compiling Spell Checker
Testing Results. File=e:\data\chineseWordSegBakeoff03\hk-testref.txt
# Training Toks=23747
# Unknown Test Toks=1855
# Training Chars=3649
# Unknown Test Chars=89
Token Length, #REF, #RESP, Diff
  1, 16867, 17267, 400
  2, 15058, 14740, -318
  3, 2126, 2072, -54
  4, 703, 721, 18
  5, 82, 112, 30
  6, 71, 85, 14
  7, 19, 29, 10
  8, 12, 15, 3
  9, 5, 5, 0
Scores
  EndPoint: P=0.9748424085113665 R=0.9777148415150475 F=0.9762765121759623
  Chunk: P=0.935963260881967 R=0.9387212129881276 F=0.9373402082470399
Reading the Output
Hopefully this output is fairly easy to interpret. The first few lines just parrot back the input parameters. We will describe these as we go through the code in the demo. Then there's a note to say that the training is being done using the specified zip file. Training the language model is fairly quick. There's a bit of a wait after the message that says the spell checker is being compiled. That's because highly branching character language models like those for Chinese are slow to compile in LingPipe (this may be optimized in a later version -- the slowness derives from repeatedly summing the counts of the daughters of a node). Then there's a note saying that testing is under way, echoing the test file name.
Descriptive Token Statistics
The next two lines provide a report on the number of training tokens and characters, along with the number of unknown test tokens and test characters. A token is said to be "unknown" if it appears in the test data without appearing in the training data. There were 89 unknown characters and 1855 unknown tokens in the Hong Kong City University test data.
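For concreteness, a count like "# Unknown Test Toks=1855" could be computed with a simple set difference. This is only a sketch: the member name mTestTokenSet is our own invention, and the demo's actual bookkeeping may differ.

Set<String> unknownToks = new HashSet<String>(mTestTokenSet); // tokens seen in the test data
unknownToks.removeAll(mTrainingTokenSet);                     // drop those also seen in training
System.out.println("# Unknown Test Toks=" + unknownToks.size());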
A file is also populated with the known tokens, one per line. These
are put in the file indicated in the output, which goes in the data
directory with a suffix .knownWords
.
The next few lines provide histograms of token length in the reference (the gold-standard test data) and response (system output), as well as the difference. For instance, the reference contained 15,058 tokens of length two, whereas the output produced only 14,740, a difference of -318. Our system is producing too many tokens of length 1, too few of lengths 2 and 3, and then too many again at lengths longer than 3.
Precision and Recall Results
In addition to all of these descriptive statistics, two sets of precision, recall, and f-measure scores are presented for the run. The first measures precision and recall of endpoints. The second measures precision and recall of the words themselves. (This is the same pair of evaluations as we used in the sentence demo.) Our chunk scores are computed the same way as the official scoring script from the bakeoff, for which the top-scoring system on this corpus had scores of P=0.934, R=0.947, F=0.940 (vs. our P=0.936, R=0.939, F=0.937). Interestingly, computing binomial confidence intervals for these results yields a 95% confidence interval of +/-0.003. Thus we conclude that our approach is a reasonable one (though we also knew about Bill Teahan's paper cited in the references below).
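For reference, the f-measure reported here and throughout is the balanced harmonic mean of precision and recall:

F = 2 * P * R / (P + R)

For example, plugging in the endpoint scores from the run above gives 2 × 0.9748 × 0.9777 / (0.9748 + 0.9777) ≈ 0.9763.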
Official Scoring Script
The run also produces an output file, cityu.out.segments, based on the name cityu.out
specified on the command line. This file acts as the official output;
it's what would be sent back to the organizers had we been in time
to actually enter the bakeoff.
We've included the original scoring script with this distribution.
It can be run on the output relative to a dictionary of known and
unknown words. To run it, the following invocation works, assuming
you have the Perl scripting language installed along with the command
diff
(these are typically installed with Linux
distributions; we'd recommend the CygWin distribution of unix tools for
MS Windows users).
With Perl installed, it's easy to run the official script. It's just:
perl bin\score.pl knownWords testFile responseSegments
which for our output named cityu.out
and data directory
e:\data\chineseWordSegBakeoff03
yields the command:
perl bin\score.pl e:\data\chineseWordSegBakeoff03\cityu.out.knownWords e:\data\chineseWordSegBakeoff03\hk-testref.txt e:\data\chineseWordSegBakeoff03\cityu.out.segments
This prints an analysis per test sentence with the actual diff of the response and reference segments. We don't speak Chinese, so it's all Greek to us. Looking at the tail of the file shows us this:
SUMMARY:
TOTAL INSERTIONS: 738
TOTAL DELETIONS: 635
TOTAL SUBSTITUTIONS: 1507
TOTAL NCHANGE: 2880
TOTAL TRUE WORD COUNT: 34955
TOTAL TEST WORD COUNT: 35058
TOTAL TRUE WORDS RECALL: 0.939
TOTAL TEST WORDS PRECISION: 0.936
F MEASURE: 0.937
OOV Rate: 0.071
OOV Recall Rate: 0.542
IV Recall Rate: 0.969
In particular, note that the recall and precision figures reported here match our own chunk-level precision, recall, and f-measure, namely P=0.936, R=0.939 and F=0.937. The script goes on to calculate performance on out-of-vocabulary words; the out-of-vocabulary rate is 7 percent (the same as what we calculated), and the recall on out-of-vocabulary tokens is only 54.2%. The top-scoring system for the bakeoff had an out-of-vocabulary recall of 62.5%.
Running Other Corpora
The other corpora can be run in exactly the same way. All that needs to change in the command are the corpus names, the character encoding, and the output file name. Note that the other corpora are larger and take more time to process. Here are the results of running these corpora with zero edit costs, a large enough n-best not to make search errors, and length-5 n-grams (the best-performing n-gram size in the evaluation run by Bill Teahan; see the references). In other words, these are completely "out of the box" settings. We'll discuss tuning later.
Chunk-Level Scoring

| Corpus | LingPipe Prec | LingPipe Rec | LingPipe F | Winner Prec | Winner Rec | Winner F | Winning Site |
|---|---|---|---|---|---|---|---|
| HK City Uni | **0.936** | 0.937 | 0.937 | 0.934 | **0.947** | **0.940** | Ac Sinica |
| Beijing U | 0.930 | 0.926 | 0.928 | **0.940** | **0.962** | **0.951** | Inst. of Comp. Tech, CAS |
| Academia Sinica | 0.960 | **0.969** | **0.964** | **0.966** | 0.956 | 0.961 | UC Berkeley |
The results in bold are the best scores for the respective category. The results have a 95 percent confidence interval of roughly +/-0.003 (differing slightly by performance and amount of training data, as described in the Sproat and Emerson paper cited below). For two of the three corpora, Hong Kong City University's and Academia Sinica's, LingPipe's F-score was not significantly different from that of the bakeoff winner.
The official bakeoff also had an "open" category that allowed external resources to be used for training. No open system submitted for the Academia Sinica corpus performed better than the closed submissions. The best open-system f-measure for the HK corpus was 0.956, and the best for the PK corpus was 0.959, both significantly better than the closed entries.
The Academia Sinica corpus is the largest, at 5.8M words of training data, and the results on that corpus are similar to what is reported in Bill Teahan's paper (cited in the references) for the proprietary RocLing corpus. Our conclusion is that LingPipe's out-of-the-box performance is state of the art for purely learning-based systems.
3. Inspecting The Code
The code for the demo is contained in a single file: src/ChineseTokens.java
.
Main and Run
The main program simply creates a new instance from the arguments and calls its run method:
public static void main(String[] args) {
    try {
        new ChineseTokens(args).run();
    } catch (Throwable t) {
        System.out.println("EXCEPTION IN RUN:");
        t.printStackTrace(System.out);
    }
}
Throwables are caught and their stack traces dumped for debugging.
Rather than using a more complex command-line framework, such
as LingPipe's util.AbstractCommand
, we just pass all
the arguments to the constructor for parsing which just sets
a bunch of member variables of the appropriate type:
public ChineseTokens(String[] args) {
    mDataDir = new File(args[0]);
    mTrainingCorpusName = args[1];
    mTestCorpusName = args[2];
    mOutputFile = new File(mDataDir,args[3]+".segments");
    mKnownToksFile = new File(mDataDir,args[3]+".knownWords");
    mCharEncoding = args[4];
    mMaxNGram = Integer.parseInt(args[5]);
    mLambdaFactor = Double.parseDouble(args[6]);
    mNumChars = Integer.parseInt(args[7]);
    mMaxNBest = Integer.parseInt(args[8]);
    mContinueWeight = Double.parseDouble(args[9]);
    mBreakWeight = Double.parseDouble(args[10]);
}
The run method just calls the three worker methods in order:
void run() throws ClassNotFoundException, IOException {
    compileSpellChecker();
    testSpellChecker();
    printResults();
}
Training and Compiling
The first worker method encapsulates the training and compilation of a spell checker.
Constructing a Trainer
In order to train and compile the spelling checker, we first construct a training instance out of an n-gram process language model and a weighted edit distance:
void compileSpellChecker() throws IOException, ClassNotFoundException {
    NGramProcessLM lm
        = new NGramProcessLM(mMaxNGram,mNumChars,mLambdaFactor);
    WeightedEditDistance distance
        = new ChineseTokenizing(mContinueWeight,mBreakWeight);
    TrainSpellChecker trainer
        = new TrainSpellChecker(lm,distance,null);
    ...
The n-gram process language model represents the source model for the noisy-channel spelling decoder. It is parameterized by the n-gram size, the number of characters in the underlying training and test set, and an interpolation factor. These are all described in the Language Modeling Tutorial. Each of them may be used to tune performance as indicated below.
The spell checking trainer is constructed from the language model
and a weighted edit distance. In this case, the edit distance is an
instance of the inner class
ChineseTokens.ChineseTokenizing
. This is just a
generalization of the LingPipe constant
CompiledSpellChecker.TOKENIZING
that allows for non-zero
insert and match weights. Until we consider tuning in the last
section, we will use an instance of ChineseTokenizing
that is identical to CompiledSpellChecker.TOKENIZING
.
That is, the cost of matching is zero, the cost of inserting a single
space character is zero, and all other edit costs are negative
infinity. In the generalized edit distance, the weights for matching
(continuing a token) and inserting a space (ending a token) may be
non-zero negative numbers.
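For concreteness, here is a minimal sketch of what such a generalized edit distance might look like. This is our own reconstruction based on the description above, not the demo's actual inner class; it assumes LingPipe's abstract spell.WeightedEditDistance base class and its five weight methods.

static class ChineseTokenizing extends WeightedEditDistance {
    private final double mContinueWeight; // weight for matching (continuing a token)
    private final double mBreakWeight;    // weight for inserting a space (ending a token)
    ChineseTokenizing(double continueWeight, double breakWeight) {
        mContinueWeight = continueWeight;
        mBreakWeight = breakWeight;
    }
    public double matchWeight(char cMatched) {
        return mContinueWeight;
    }
    public double insertWeight(char cInserted) {
        return cInserted == ' ' ? mBreakWeight : Double.NEGATIVE_INFINITY;
    }
    public double deleteWeight(char cDeleted) {
        return Double.NEGATIVE_INFINITY; // deletion never allowed
    }
    public double substituteWeight(char cDeleted, char cInserted) {
        return Double.NEGATIVE_INFINITY; // substitution never allowed
    }
    public double transposeWeight(char cFirst, char cSecond) {
        return Double.NEGATIVE_INFINITY; // transposition never allowed
    }
}

With both weights set to 0.0, this behaves identically to CompiledSpellChecker.TOKENIZING.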
The final argument to the TrainSpellChecker
constructor is null
, meaning that the edits are not going
to be restricted to producing tokens in the training data.
Providing Training Instances
The training process itself is just a matter of looping through the lines of the entries in the zip file:
FileInputStream fileIn = new FileInputStream(trainingFile);
ZipInputStream zipIn = new ZipInputStream(fileIn);
ZipEntry entry = null;
while ((entry = zipIn.getNextEntry()) != null) {
    String[] lines = extractLines(zipIn,mTrainingCharSet,mTrainingTokenSet);
    for (int i = 0; i < lines.length; ++i)
        trainer.handle(lines[i]);
}
Streams.closeInputStream(zipIn);
The extractLines(InputStream,Set,Set) method takes the input
stream from which to read the lines and two sets. The sets are used
to accumulate the characters and tokens found in the training sets
(and later in the test sets). The extractor is also responsible for
normalizing the whitespace to single space characters between tokens
and a single line-final space character:
while ((refLine = bufReader.readLine()) != null) {
    String trimmedLine = refLine.trim() + " ";
    String normalizedLine = trimmedLine.replaceAll("\\s+"," ");
The point is to get the normalized lines to the trainer while accumulating some statistics.
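To make the surrounding context concrete, here is a hedged sketch of a complete extractLines method consistent with the description; the demo's actual implementation may differ in detail. It assumes the usual java.io and java.util classes and the mCharEncoding member set from the command line.

String[] extractLines(InputStream in, Set<Character> charSet, Set<String> tokenSet)
        throws IOException {
    List<String> lineList = new ArrayList<String>();
    // do not close this reader; closing it would close the underlying zip stream
    BufferedReader bufReader
        = new BufferedReader(new InputStreamReader(in,mCharEncoding));
    String refLine;
    while ((refLine = bufReader.readLine()) != null) {
        String trimmedLine = refLine.trim() + " ";
        String normalizedLine = trimmedLine.replaceAll("\\s+"," ");
        lineList.add(normalizedLine);
        for (int i = 0; i < normalizedLine.length(); ++i)
            charSet.add(Character.valueOf(normalizedLine.charAt(i)));
        for (String token : normalizedLine.split(" "))
            if (token.length() > 0)
                tokenSet.add(token);
    }
    return lineList.toArray(new String[0]);
}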
Compiling and Configuring the Spell Checker
After the trainer has been trained on all the lines, the spell
checker is compiled in-memory in one line using the
compile(Compilable)
method in
util.AbstractExternalizable
:
mSpellChecker = (CompiledSpellChecker) AbstractExternalizable.compile(trainer);
The spell checker is tuned by the following series of set method calls:
mSpellChecker.setAllowInsert(true);
mSpellChecker.setAllowMatch(true);
mSpellChecker.setAllowDelete(false);
mSpellChecker.setAllowSubstitute(false);
mSpellChecker.setAllowTranspose(false);
mSpellChecker.setNumConsecutiveInsertionsAllowed(1);
mSpellChecker.setNBest(mMaxNBest);
This tells the spell checker that only insert and match edits are allowed, thus saving it the time of inspecting other edits. The second-to-last method call limits the number of consecutive insertions to one; this is because we only care about single-character insertions of spaces. The last method call establishes the maximum number of hypotheses carried over after finishing the processing of a character. Higher values cause fewer search errors, whereas lower values are faster. This value is typically tuned empirically to be as low as possible without causing search errors.
Compiling to and Reading from a File
If memory is at a premium or if the model is going to be reused, it may be written to a file rather than compiled in memory. To write a model to a file, it must be wrapped in an object output stream:
File compiledModelFile = ...;
OutputStream out = new FileOutputStream(compiledModelFile);
ObjectOutput objOut = new ObjectOutputStream(out);
trainer.compileTo(objOut);
The model may then be read back in by reversing the process:
InputStream in = new FileInputStream(compiledModelFile);
ObjectInput objIn = new ObjectInputStream(in);
mSpellChecker = (CompiledSpellChecker) objIn.readObject();
After it is read back in, it can have its runtime parameters set as illustrated above.
Tokenizing
The single execution of the main in ChineseTokens
runs a
performance evaluation after training the models. The original SigHan
bakeoff data is divided into a zip file of training data files and a
single test file in the same format. The lines are extracted from the
test file in the same way as the training files and then handed off
one-by-one to the method test(String)
. The test method
starts as follows:
void test(String reference) throws IOException {
    String testInput = reference.replaceAll(" ","");
    String response = mSpellChecker.didYouMean(testInput);
    response += ' ';
    ...
This simply removes all the spaces from the test input using the Java
string method replaceAll
. It is then supplied to the
spell checker and the first-best "correction" is returned
and set into a variable. A final space is appended to match
the input format and make evaluating simpler.
Evaluation
The following code is a repetition of the first three lines of the
test(String)
method:
String testInput = reference.replaceAll(" ","");
String response = mSpellChecker.didYouMean(testInput);
response += ' ';
Bakeoff Output
The next two lines simply write output in the "official" output format.
mOutputWriter.write(response);
mOutputWriter.write("\n");
This is the format that will serve as input to the official scoring script. Note that the output writer was allocated to use the same character encoding as the corpus, a requirement of the bakeoff format.
Break Point Evaluation
The first evaluation in the demo is of break points.
Set<Integer> refSpaces = getSpaces(reference);
Set<Integer> responseSpaces = getSpaces(response);
prEval("Break Points",refSpaces,responseSpaces,mBreakEval);
These three lines just get a set of Integer
indices
of token-final characters in the original input or output. For
example:
getSpaces("XXX X XXXX XX") = { 2, 3, 7, 9 } getSpaces("XXXXXX XX XX") = { 5, 7, 9}
The call to the prEval
method in the third line
adds the number of true positives, false positives and
false negatives to the break evaluation. Here's the method:
<E> void prEval(String evalName,
                Set<E> refSet,
                Set<E> responseSet,
                PrecisionRecallEvaluation eval) {
    for (E e : refSet)
        eval.addCase(true,responseSet.contains(e));
    for (E e : responseSet)
        if (!refSet.contains(e))
            eval.addCase(false,true);
}
This first loops over the reference cases, testing whether or not the
case is in the response set. It either calls
eval.addCase(true,true)
, adding a true positive case
appearing in the reference and response, or it calls
eval.addCase(true,false)
, adding a false negative case
appearing in the reference but not the response. The last loop is
through the response set, and it adds a case
eval.addCase(false,true)
for a false positive for a
case that is in the response set but not in the reference set.
At the end of the run, the precision-recall evaluation object can be queried for the precision, recall and f-measure (among other statistics):
System.out.println(" EndPoint:" + " P=" + mBreakEval.precision() + " R=" + mBreakEval.recall() + " F=" + mBreakEval.fMeasure());
This evaluation result tends to be much higher than the chunk evaluation. The reason for this is that chunks that mismatch can lead to multiple false positives and false negatives.
Chunk Evaluation
The evaluation of chunking proceeds in the same way:
Set<Tuple<Integer>> refChunks
    = getChunks(reference,mReferenceLengthHistogram);
Set<Tuple<Integer>> responseChunks
    = getChunks(response,mResponseLengthHistogram);
prEval("Chunks",refChunks,responseChunks,mChunkEval);
The method to extract the chunks is a little trickier because it also computes the histogram of token lengths for the reference and response, as seen above in the method calls:
static Set<Tuple<Integer>> getChunks(String xs,
                                     ObjectToCounter<Integer> lengthCounter) {
    Set<Tuple<Integer>> chunkSet = new HashSet<Tuple<Integer>>();
    String[] chunks = xs.split(" ");
    int index = 0;
    for (int i = 0; i < chunks.length; ++i) {
        int len = chunks[i].length();
        Tuple<Integer> chunk = Tuple.create(new Integer(index),
                                            new Integer(index+len));
        chunkSet.add(chunk);
        index += len;
        lengthCounter.increment(new Integer(len));
    }
    return chunkSet;
}
Here we just split the original input on single spaces, and then add to the return set tuples (ordered pairs of objects) whose values are the start and end indices of the chunk. For instance:
ref = "XXX X XXXX XX" resp = "XXXXXX XX XX" getChunks(ref) = { (0,2), (2,3), (3,7), (7,9) } getChunks(resp) = { (0,5), (5,7), (7,9)}
In this case, there is one true positive, (7,9)
,
three false negatives, (0,2)
, (2,3)
,
and (3,7)
, and
two false positives, (0,5)
and (5,7)
.
Note that the index variable keeps the index into the original character sequence without spaces.
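Plugging these counts into the precision-recall evaluation for this toy example gives:

P = TP / (TP + FP) = 1 / (1 + 2) ≈ 0.33
R = TP / (TP + FN) = 1 / (1 + 3) = 0.25
F = 2PR / (P + R) = 2 / 7 ≈ 0.29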
Token Length Histogram
Finally note the increment of the length counter, which provides the final histogram output of token lengths. This is used in the final print out to print the token length histograms using the following code:
System.out.println("Token Length, #REF, #RESP, Diff"); for (int i = 1; i < 10; ++i) { Integer iObj = new Integer(i); int refCount = mReferenceLengthHistogram.getCount(iObj); int respCount = mResponseLengthHistogram.getCount(iObj); int diff = respCount-refCount; System.out.println(" " + i + ", " + refCount + ", " + respCount + ", " + diff); }
This prints the reference counts, response counts, and the error in terms of a difference.
A Statistical Tokenizer Factory
The demo up to this point has just been concerned with an in-memory evaluation. The file src/StatisticalTokenizerFactory.java contains a simple implementation of a tokenizer factory based on a compiled spell checker. The implementation is simple, but not very efficient, because of its reliance on the regular-expression based tokenizer factory. The code is only a few lines:
public class StatisticalTokenizerFactory extends RegexTokenizerFactory {

    private final CompiledSpellChecker mSpellChecker;

    public StatisticalTokenizerFactory(CompiledSpellChecker spellChecker) {
        super("\\s+"); // break on spaces
        mSpellChecker = spellChecker;
    }

    public Tokenizer tokenizer(char[] cs, int start, int length) {
        String input = new String(cs,start,length);
        String output = mSpellChecker.didYouMean(input);
        char[] csOut = output.toCharArray();
        return super.tokenizer(csOut,0,csOut.length);
    }
}
It holds a compiled spell checker in a member variable that's assigned
in the constructor. The class extends
RegexTokenizerFactory
, and the call
super("\\s+")
in the constructor tells the
parent to construct tokens by breaking on non-empty sequences of
whitespaces. The actual tokenizer just converts the input to a
string, runs the spell checker on it, converts the output to a
character array, and returns the result of the parent tokenizer
factory. This result is a tokenizer that separates on the spaces
inserted by the spell checker as a part of the output.
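Here is a hedged usage sketch; the variable names and the unsegmented input string are our own, and the spell checker would be compiled or read from a file as shown earlier.

CompiledSpellChecker segmenter = ...;   // compiled or deserialized as above
StatisticalTokenizerFactory factory
    = new StatisticalTokenizerFactory(segmenter);
String unsegmentedText = ...;           // raw Chinese text without spaces
char[] cs = unsegmentedText.toCharArray();
Tokenizer tokenizer = factory.tokenizer(cs,0,cs.length);
String token;
while ((token = tokenizer.nextToken()) != null)
    System.out.println(token);          // one segmented word per line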
The character offsets in the tokenizer will refer to positions in the output variable; this could be changed by a tighter implementation of a statistical tokenizer factory that also avoided regular expressions by breaking directly on whitespace. The output is guaranteed to contain only single spaces.
A word of warning is in order about using this tokenizer for tasks like information retrieval. Because it relies on statistical context, the same sequence of characters might not always be tokenized the same way. This can have dire consequences in tasks such as information retrieval if a query and corpus have different tokenizations.
Tuning Statistical Tokenizers
There are a number of performance tuning options that control both speed and accuracy.
N-best Size
The most important speed tuning factor is the size of the n-best list. This should be tuned to where it is as small as possible without causing too many search errors.
Pruning Language Models
With large training data sets, the models get very large. The character language models underlying the spell checker may be pruned just as other language models are.
Language Model n-gram
The most significant tuning parameter affecting both accuracy and performance is the size of the n-grams stored in the source language model. Five seems to be a good setting for this parameter: longer n-grams are not more accurate, and shorter ones are less accurate. Shorter n-grams do result in smaller model files, which can substantially reduce run-time memory consumption.
Language Model Interpolation
The interpolation parameter in the language model affects the degree to which longer contexts are weighted against shorter contexts during language model interpolation. This number is just a parameter in the Witten-Bell smoothing formula, which also considers the number of possible outcomes and the number of instances seen. In general, the lower this value, the less smoothing; with less smoothing, the training corpus dominates the statistics. With a higher value there is more smoothing, and more weight is given to possibilities that were not seen in the training data.
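As a rough guide only (this is our recollection of the general form; see the Language Modeling Tutorial and the NGramProcessLM documentation for the exact definition), the interpolation ratio for a context looks something like:

lambda(context) ≈ count(context) / (count(context) + lambdaFactor * numOutcomes(context))

so larger values of the lambda factor shift more weight toward shorter contexts and the uniform base distribution.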
Edit Weights
It is most tempting to try to tune edit weights. By making space
insertion more costly than 0.0, we can force breaks to be relatively
more expensive than continuing (matching) and thus favor longer
tokens. Similarly, by making matching more costly than 0.0, breaks
are relatively less expensive than continuing, and thus we would favor
shorter tokens. These are fairly easy to implement by following the
pattern provided by
spell.CompiledSpellChecker.TOKENIZING
.
As an example, we have added a general such implementation as an
embedded class called ChineseTokenizing
in the demo. The
demo is configured so that the insert and match weights may be
configured with the last two command-line arguments.
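For example, the following command is a sketch of such a run on the City University data; the -0.5 break weight is an arbitrary illustrative value, not a tuned setting.

java -cp "../../../lingpipe-4.1.2.jar;zhToksDemo.jar" ChineseTokens dataDir cityu hk cityu.out Big5_HKSCS 5 5.0 5000 256 0.0 -0.5

This makes ending a token (inserting a space) slightly more costly than continuing one, which should favor longer tokens.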
Unfortunately, our token length errors tend to overestimate one-character tokens, underestimate two- and three-character tokens, and then overestimate tokens longer than three characters. A less naive length model might help here, but such a model is tricky to integrate with the decoder as is.
Another issue arguing against modifying the edit weights significantly is the endpoint precision and recall, which are roughly balanced. By increasing the insert (break) cost, end point recall would go down, even if precision increased. Similarly, by increasing the match cost (continue), the end point precision is likely to increase at the cost of recall.
Dictionary Training
Given a dictionary of tokens, they may be added (followed by a single
space) as training data just like the training data from the corpus,
by using the method handle(String)
. The normalization here
should be the same as that for the other lines, reducing all spaces
to single spaces and ensuring there is no initial space and a single
final space.
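Here is a hedged sketch of what feeding a dictionary to the trainer might look like, assuming a plain-text dictionary file (dictFile, a name of our own choosing) with one token per line in the corpus character encoding.

BufferedReader dictReader
    = new BufferedReader(new InputStreamReader(new FileInputStream(dictFile),
                                               mCharEncoding));
String token;
while ((token = dictReader.readLine()) != null) {
    String normalized = token.trim().replaceAll("\\s+"," ");
    if (normalized.length() > 0)
        trainer.handle(normalized + " "); // single final space, as in the corpus lines
}
dictReader.close();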
E-mail us with Better Settings
If you find settings that work better than ours, please let us know at lingpipe@alias-i.com.
SigHan 2005 Bakeoff
A week after we wrote the 2003 SigHan demo, the Second International Chinese Word Segmentation Bakeoff was held. The organizers again distributed the data for research purposes after the bakeoff. This section describes running LingPipe on that data.
Segmentation Standards
The segmentation standards for the four groups are linked from the following table.
Corpus Creator | Word Segmentation Standards |
---|---|
Academia Sinica | Segmentation Standard (pdf) |
City University Hong Kong | Segmentation Standard (pdf) |
Peking University | Segmentation Standard (pdf) |
Microsoft Research | Segmentation Standard (pdf) |
Downloading Data
The data's available as a single .zip
file:
icwb2-data.zip [50MB]
This time, the organizers provided UTF-8 transcoded versions of the input files. Our code runs straight off the zip, so you don't even need to unpack it.
The zip file contains the following corpora:
2005 SigHan Bakeoff Data Zip File

| Creator | Train Sentences | Train Uniq Words | Train Uniq Chars | Test Sentences | Test Uniq Unknown Words | Test Uniq Unknown Chars |
|---|---|---|---|---|---|---|
| Academia Sinica | 708,953 | 141,338 | 6115 | 14,432 | 3227 | 85 |
| Microsoft | 86,924 | 88,119 | 5167 | 3985 | 1991 | 12 |
| HK City Uni | 54,019 | 69,085 | 4923 | 1493 | 1670 | 60 |
| Peking Uni | 19,056 | 55,303 | 4698 | 1945 | 2863 | 91 |
Source Code
The source code to run the 2005 examples is in a separate source file; it only differs from the earlier code in the way it constructs input streams from which to read the training and test data.
Running the Tests
There's an Ant task for each corpus. They're distinguished from the 2003 tasks by the suffix 05.
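For example, assuming the target names simply append the suffix to the 2003 names (run-cityu05 is our guess; check the build file for the actual target names):

ant -DdataDir=dataDir run-cityu05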
The Results
The following table presents the LingPipe results as achieved by training character 5-grams in LingPipe. These results would've put us in the "closed" category for the competition, meaning that the only linguistic information used to build the system was the training data (e.g. no dictionaries, no heuristic morphology, no POS taggers trained on other corpora).
2005 SigHan Bakeoff Chunk-Level Scoring

| Corpus | LingPipe Prec | LingPipe Rec | LingPipe F | Bakeoff Prec | Bakeoff Rec | Bakeoff F | Closed | Winning Site |
|---|---|---|---|---|---|---|---|---|
| Academia Sinica | 0.956 | 0.979 | 0.968 | 0.951 | 0.952 | 0.952 | Yes | Nara Inst |
| | | | | 0.950 | 0.962 | 0.956 | No | Nat Uni Singapore |
| Microsoft Res | 0.962 | 0.967 | 0.965 | 0.966 | 0.962 | 0.964 | Yes | Stanford |
| | | | | 0.965 | 0.980 | 0.972 | No | Harbin Inst |
| HK City Uni | 0.927 | 0.928 | 0.928 | 0.946 | 0.941 | 0.943 | Yes | Stanford |
| | | | | 0.956 | 0.967 | 0.962 | No | Nat Uni Singapore |
| Peking Uni | 0.935 | 0.925 | 0.930 | 0.946 | 0.953 | 0.950 | Yes | Yahoo |
| | | | | 0.969 | 0.968 | 0.969 | No | Nat Uni Singapore |
These results show a substantial amount of variation across corpora. Because most systems were applied to most corpora, this also represents a very diverse range of "best" approaches. With more training data, the statistical confidence intervals are much smaller, especially for the larger corpora.
This was a nice bakeoff in that many of the cooks can add to their trophy cases. The best overall system was Wei Jang's open entry for Harbin Institute on the Microsoft corpus, with an F-measure of 0.972 (which also represents a large error reduction over Jang's own closed submission for that corpus). Hwee Tou Ng's team from the National University of Singapore swept the open category for the other three corpora. Huihsin Tseng, a U. Colorado student, made an excellent showing as well, taking two of the closed categories while playing for the Stanford team.
LingPipe would've placed first in the closed category for two of
the corpora: Academia Sinica and Microsoft Research. Perhaps not
coincidentally, these are the two largest corpora. Surprisingly,
LingPipe's closed results for the AS corpus are better than the best
open results submitted to the bakeoff. I wonder if some of the other
systems may have been confused by the mixture of Unicode full-width
ideographic spaces (0x3000) and regular ASCII single spaces
(0x0020) in the AS corpus? It required us to generalize
our inter-token whitespace regular expression to
"(\\s|\u3000)+"
.
Official Results
The official results page is:
References
- Teahan, William J., Yingying Wen, Rodger McNab and Ian H. Witten. 2000. A compression-based algorithm for Chinese word segmentation. Computational Linguistics 26(3):375-393.
  Bill Teahan's paper motivated our approach, which is mathematically equivalent. Unfortunately, their data isn't available for a direct comparison.
- Sproat, Richard and Thomas Emerson. 2003. The first international Chinese word segmentation bakeoff. In Proceedings of the Second SigHan Workshop on Chinese Language Processing. Sapporo, Japan.
  This is the official report on the bakeoff, including descriptions of corpora and all system results.