What is Hyphenation and Syllabification?
The simplest way to understand hyphenation and syllabification is through some examples.
Some Examples
Here are three examples taken from dictionary.reference.com, which federates several popular dictionaries, including their own. In the following table, the top row here is the word, with a hyphenation and syllabification below it:
| | side | landed | publisher |
|---|---|---|---|
| Hyphenation | side | land - ed | pub - lish - er |
| Syllabification | saɪd | læn + dɪd | pʌb + lɪ + ʃər |
Hyphenation
The hyphenation process inserts hyphens at various points in the word itself. These hyphenations are used primarily in typesetting, where words may be broken over a line and a hyphen inserted at a legal hyphenation point. Thus hyphenation is an orthographic process, meaning it works over the spellings of words in a language.
Syllabification
Syllables, on the other hand, break the sequence of spoken sounds into units. The sounds of a language are here written in the International Phonetic Alphabet (IPA), a widely used system for writing down the distinctive sounds of the world's languages. The number of syllables in a word roughly corresponds to the number of spoken vowel clusters in the word. For instance, the word "side" has two written vowels, but only the first is pronounced, so it has just a single syllable.
Boundary Ambiguity
Exactly where the boundaries go for hyphenation and syllabification is another matter. There are no "official" definitions of syllables or hyphens. Hyphenation may vary by dictionary or publishing standard (e.g. The Chicago Manual of Style). Hyphenation tends to follow both the word form and the syllabification. But consider "landed", where the hyphenation and syllabification disagree about where the first "d" should go. As is typical with syllabification, consonants tend to pile up at the front of syllables, whereas in hyphenation, words tend to get split at morphological boundaries (that is, between stems and affixes in English). Some linguists admit so-called ambisyllabicity, which involves sounds at boundaries that belong to two syllables at once.
Words vs. Usages
It's not actually possible to solve the hyphenation problem for words, per se, because hyphenation really depends on the form of the word used. Consider the word "invalid", which when used as a noun is hyphenated in-va-lid and when used as an adjective is hyphenated in-val-id. Note that there's lexical stress involved at the pronunciation level, with the noun stressing the first syllable and the adjective the second.
For the purposes of this tutorial, we'll just fix a dictionary and take its definitions as the reference standard, and then build systems that can recreate the dictionary's decisions.
How is it Done?
Rules and Taggers
There are several approaches to hyphenation and syllabification in the literature, ranging from traditional rule-based transducers to k-nearest neighbors approaches up through structured support vector machines.
Cast as a tagging problem, each space between letters must be assigned as either a break point or a non-break point. A tagger should be able to produce the best results, as it can use all of the information we include in the noisy channel model and also emphasize the discriminative role of the boundaries. Having said that, the model we present here is state-of-the-art in terms of accuracy.
The Noisy Channel Model
Our approach follows that of our Chinese word tokenization, which in turn is based on spelling correction. Specifically, we use what's known as a noisy channel model. A noisy channel model has two components, a source and a channel. The source represents a probabilistic message generator; in this case, it generates hyphenated words. The channel represents noise in transmission. In our case, we use a deterministic channel that simply deletes the hyphenations. The recipient's job is then to decode the original message with hyphens from the received message, which has no hyphens.
Given a source model p(h) for hyphenated words, we define a channel model p(w|h) so that p(w|h) = 1 if w is equal to h with the hyphens removed, and p(w|h) = 0 otherwise. We then seek to find the most likely source message h to have produced message w by:

    ARGMAXh p(h|w) = ARGMAXh p(w|h) p(h) / p(w)
                   = ARGMAXh p(w|h) p(h)
                   = ARGMAXh s.t. strip(h)=w p(h)
where we use strip(h) = w to mean that w is equal to h with the hyphenations stripped out (in Java terms, h.replaceAll(" ","").equals(w)). Thus with a deterministic channel, we wind up looking for the most likely hyphenation h according to p(h), restricting our search to those h that produce w when the hyphens are stripped out.
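The decoding step above can be sketched directly, though only for illustration: the scoring function below is a hypothetical stand-in for the n-gram language model, and the brute-force candidate enumeration is exponential in the word length rather than the efficient search the spell checker performs.

```java
import java.util.*;

// Illustrative-only sketch of the deterministic-channel decoder: enumerate
// every h with strip(h) = w and keep the one a toy source model scores
// highest. The score() function is a hypothetical placeholder, not the LM.
public class NoisyChannelSketch {

    // strip(h): remove the space-encoded hyphens, as in h.replaceAll(" ","")
    static String strip(String h) { return h.replaceAll(" ", ""); }

    // Toy stand-in for log p(h): prefers chunks of about three letters.
    static double score(String h) {
        double logProb = 0.0;
        for (String chunk : h.split(" "))
            logProb -= Math.abs(chunk.length() - 3); // hypothetical scoring
        return logProb;
    }

    // All hyphenations h such that strip(h) = w; 2^(|w|-1) candidates.
    static List<String> candidates(String w) {
        List<String> result = new ArrayList<String>();
        if (w.length() <= 1) { result.add(w); return result; }
        String first = w.substring(0, 1);
        for (String rest : candidates(w.substring(1))) {
            result.add(first + rest);        // no break after first char
            result.add(first + " " + rest);  // break after first char
        }
        return result;
    }

    // ARGMAX over candidates under the toy score.
    static String decode(String w) {
        String best = w;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String h : candidates(w)) {
            double s = score(h);
            if (s > bestScore) { bestScore = s; best = h; }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(decode("publisher"));
        // every candidate strips back to the observed word
        for (String h : candidates("landed"))
            if (!strip(h).equals("landed")) throw new AssertionError();
        System.out.println("strip check passed");
    }
}
```

The real system replaces score() with the compiled n-gram language model and replaces the enumeration with the spell checker's pruned search over the same candidate space.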
The Moby Corpus
For our quick getting started example, we'll use a free corpus of English hyphenations.
Downloading the Corpus
Grady Ward produced a free corpus of English hyphenations, which is available from Project Gutenberg:
- Project Gutenberg: Moby Corpus Home
- Project Gutenberg: Download Moby Corpus
We'll assume you've downloaded the corpus and unzipped it into a directory $MOBY_RAW_DATA (the data unpacks into a file called mhyph.txt):
    > cd $MOBY_RAW_DATA
    > unzip mhyph10.zip
Munging the Corpus
The corpus is encoded as bytes, with the byte sequence 0xFF 0xFD used to indicate hyphens.
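As a rough sketch of what such a munger has to do (this is not the actual MobyHyphenCorpus.java; it assumes only the 0xFF 0xFD marker convention just described), the marker bytes can be rewritten as spaces while reading raw bytes:

```java
// Minimal sketch (not the real MobyHyphenCorpus.java) of converting the
// Moby hyphen marker bytes 0xFF 0xFD into the space-delimited format used
// in this tutorial. Works over raw bytes so the marker survives untouched.
public class MobyMungeSketch {

    static String mungeLine(byte[] line) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < line.length; ++i) {
            if (i + 1 < line.length
                && (line[i] & 0xFF) == 0xFF
                && (line[i + 1] & 0xFF) == 0xFD) {
                sb.append(' ');   // hyphenation point
                ++i;              // skip second marker byte
            } else {
                sb.append((char) (line[i] & 0xFF));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = { 'a', (byte) 0xFF, (byte) 0xFD,
                       'b', 'i', 'd', 'e' };
        System.out.println(mungeLine(raw)); // prints "a bide"
    }
}
```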
There's a simple program src/MobyHyphenCorpus.java which we wrote to munge the corpus into our official input format: one hyphenated word per line, with hyphenation points marked by spaces. It splits space-separated compounds out into their own entries.
The munger can be run through the ant target moby-hyphens (be sure to replace $MOBY_RAW_DATA as described above):
    > ant -Dmoby.raw.file=$MOBY_RAW_DATA/mhyph.txt -Dhyphens.file=moby-hyphens.txt moby-hyphens
    Raw Input=e:\data\moby-hyphens\unpacked\mhyph.txt
    Hyphenation Output=moby-hyphens.txt
    Found #input lines=187175
    Found #hyphenations=173781
We don't discuss this corpus in much detail, as it seems to be very noisy, including all sorts of non-English words and non-words with fairly inconsistent hyphenations. Here's a small sample of the resulting file moby-hyphens.txt:
    ...
    a bid ing ness
    a bid jan
    a bide
    a bie
    a bil i ty
    a bim e lech
    a bin o am
    a bin o em
    ...
Running Hyphenation Cross-Validation
We can then take this corpus and run it through our hyphenation cross-validation program (fully explained below), using the ant target xval-hyphens (it'll take a while to run without spitting out any feedback in non-verbose mode):
    > ant -Dhyphens.file=moby-hyphens.txt -Dverbose=FALSE xval-hyphens
    Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\moby-hyphens.txt
    #words=173732 #hyphens=396978 #hyphen insertion points=1439791
    ACCURACY ARGS: NGRAM=8 NUM_CHARS=64 INTERPOLATION_RATIO=4.0 HYPHENATION_N_BEST=1024 VERBOSE=false
    OVERALL ACCURACIES ACROSS FOLDS
    PER HYPHENATION: PREC=0.949 RECALL=0.950 F(1)=0.949
    PER TAGGING DECISION: ACC=0.9721 #DECISIONS=1439791
    WHOLE WORD: ACCURACY=0.877 DEV=0.002
We describe the code used for training and testing the model below, including a description of what all the evaluation numbers mean. We also discuss a couple of modes of operation beyond the simple language-model-based noisy channel.
Webster's Unabridged Dictionary
A quirky encoding of the 1890 version of Webster's Unabridged Dictionary is also available from Project Gutenberg:
- Project Gutenberg: Webster's Unabridged Dictionary Home
- Project Gutenberg: Webster's Unabridged Dictionary Download
We'll assume you've downloaded the distribution and unzipped it.
Munging the Corpus
A program to extract the hyphenations may be found in WebstersHyphensCorpus.java. It's just a bunch of scraping code that's not worth discussing, other than to mention that we decompound compounds and hyphenated forms, and remove any entries containing residual non-ASCII characters.
The munging program may be run from the ant target websters-hyphens (be sure to replace $MOBY_RAW_DATA as described above):
    > ant -Dhyphens.file=websters-hyphens.txt -Dwebsters.raw.file=$MOBY_RAW_DATA/pgwht04.txt websters-hyphens
    BAD: .
    BAD: 1
    BAD: ab b\'82
    BAD: ac a le ph\'91
    BAD: a ch\'91 an
    BAD: ach ro \'94 dex trin
    ...
    BAD: zo \'94 troph ic
    BAD: zu &ntil;is
    BAD: zyg o dac ty l\'91
    # rejected entries=1433
    # retained entries=95349
    DONE.
It prints out the entries that were rejected, reporting that a total of 1433 entries were rejected, while 95,349 were retained. A better version of this program would figure out which characters the numerical escapes referred to (they appear to be code page 437 code points) and do a replacement. For instance, \'91 is clearly Unicode character U+00E6 (æ).
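A sketch of such a replacement, assuming the escapes are code page 437 code points; the class name and the three mappings shown are our own illustration, and a full version would cover every escape seen in the data:

```java
import java.util.*;

// Hedged sketch of the suggested improvement: map \'NN numerical escapes
// (hexadecimal code page 437 code points, by our reading of the data) to
// Unicode and substitute them in. Only a few mappings are shown.
public class EscapeFixSketch {

    static final Map<String,String> CP437 = new HashMap<String,String>();
    static {
        CP437.put("\\'82", "é"); // 0x82 in code page 437
        CP437.put("\\'91", "æ"); // 0x91, i.e. U+00E6
        CP437.put("\\'94", "ö"); // 0x94
    }

    static String fix(String entry) {
        for (Map.Entry<String,String> e : CP437.entrySet())
            entry = entry.replace(e.getKey(), e.getValue());
        return entry;
    }

    public static void main(String[] args) {
        System.out.println(fix("a ch\\'91 an")); // prints "a chæ an"
    }
}
```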
Evaluating Hyphenation
We can cross-validate on the Webster's dictionary data the same way as for the Moby corpus, with the xval-hyphens ant target:
    > ant -Dhyphens.file=websters-hyphens.txt -Dverbose=FALSE xval-hyphens
    Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\websters-hyphens.txt
    #words=93750 #hyphens=191118 #hyphen insertion points=708320
    ACCURACY ARGS: NGRAM=8 NUM_CHARS=64 INTERPOLATION_RATIO=4.0 HYPHENATION_N_BEST=1024 VERBOSE=false
    NGRAM= 8 INTERP= 4.0 ACC=0.880 P=0.940 R=0.941 F=0.941
    OVERALL ACCURACIES ACROSS FOLDS
    PER HYPHENATION: PREC=0.940 RECALL=0.941 F(1)=0.941
    PER TAGGING DECISION: ACC=0.9679 #DECISIONS= 708320
    WHOLE WORD: ACCURACY=0.880 DEV=0.003
This doesn't look very good from the whole word perspective, though the per-hyphenation scores are OK. This may be due to the number of non-English words and names in Webster's dictionary.
    > ant -Dhyphens.file=websters-hyphens.txt -Dverbose=TRUE xval-hyphens
    Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\websters-hyphens.txt
    Rejecting word=fellowship
    Rejecting word=griev'ance
    Rejecting word=thermometer
    Rejecting word=worthwhile
    #words=93750 #hyphens=191118 #hyphen insertion points=708320
    ACCURACY ARGS: NGRAM=8 NUM_CHARS=64 INTERPOLATION_RATIO=4.0 HYPHENATION_N_BEST=1024 VERBOSE=true
    EVALUATING FOLD=0/10
    Training Hyphenator
    Compiling Hyphenator
    REFERENCE            RESPONSE             CORRECT
    @duc ti ble@         @duct i ble@         false
    @deu sed@            @deused@             false
    @ble ton ism@        @ble to nism@        false
    @lil y@              @lily@               false
    @dil u ent@          @di lu ent@          false
    @sau te@             @saute@              false
    @dem o ni a cism@    @de mo ni a cism@    false
    @secant@             @se cant@            false
    @in flat ed@         @in fla ted@         false
    @var den@            @vard en@            false
    @mon te@             @monte@              false
    @je ron y mite@      @jer on y mite@      false
    @sec u rif e ra@     @se cu rif e ra@     false
    @bur let ta@         @burl et ta@         false
    @non med ul la ted@  @non me dul la ted@  false
    @fe rie@             @fer ie@             false
    @cist ic@            @cis tic@            false
    @piste@              @pis te@             false
    @ruhm korff's@       @ruhm korff 's@      false
On the left is the reference value from the dictionary and on the right the system response. Note that four words were rejected because they were very long and had no hyphenation. The rejection heuristic obviously misses errors like the word "secant" above, whose reference entry should have at least one hyphen.
The CELEX2 Corpus
Rather than spidering and scraping a web site, for the purposes of this tutorial we use the CELEX2 corpus from the Linguistic Data Consortium. CELEX2 contains data for English, German and Dutch, including spelling, syllabification and hyphenation, created by a consortium of Dutch research institutes. The CELEX2 corpus is only US$300 for non-members, but even for LDC members, comes with a very restrictive research-only license.
We will assume CELEX2 has been downloaded and then unpacked from its gzipped tarball into a directory we'll call $CELEX2.

CELEX2 provides various annotations, each in their own files. For English hyphenation, the relevant file is $CELEX2/english/eow/eow.cd. Each entry is on its own numbered line. Here's a sample:
    ...
    147\abolish\27\78\1\B\27\78\a-bol-ish
    148\abolished\31\78\1\B\31\125\a-bol-ished
    149\abolishes\3\78\1\B\3\0\a-bol-ish-es
    ...
It's very easy to pull out the hyphenated versions of the words for our data set. There is a small complication in that both hyphenated words and compounds are allowed as inputs, with double-hyphens and spaces in the output, respectively:
    ...
    75889\run-around\2\39595\1\B\2\0\run--a-round
    75890\run around\0\39594\1\B\0\0\run a-round
    ...
To handle this issue, we'll just collect all the single words, splitting on double-hyphens and spaces.
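A minimal sketch of this collection step might look as follows; the method name decompound and the exact regular expression are our own here, not necessarily what CelexHyphenCorpus.java uses:

```java
import java.util.*;

// Hedged sketch of the decompounding step: given a CELEX hyphenation field,
// split compounds apart on double-hyphens and spaces, then rewrite the
// remaining single-hyphen marks as the spaces used in our data format.
public class CelexSplitSketch {

    static List<String> decompound(String hyphenationField) {
        List<String> words = new ArrayList<String>();
        // "--" joins compound members; " " separates phrasal entries
        for (String word : hyphenationField.split("--| ")) {
            if (word.length() == 0) continue;
            words.add(word.replace('-', ' ')); // hyphen -> space encoding
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(decompound("run--a-round")); // [run, a round]
        System.out.println(decompound("run a-round"));  // [run, a round]
    }
}
```

Both forms of the entry collapse to the same two single-word entries, which is exactly why the corpus counts distinct words rather than distinct input lines.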
The final complication is that there can be more than one annotation per line, as in:
    ...
    235\abridgments\0\124\2\B\0\0\a-bridg-ments\abridgements\B\0\0\a-bridge-ments
    ...
In this case, there's an alternate spelling ("abridgments" vs. "abridgements"), each with its corresponding hyphenation.
CELEX Guidelines for English, Dutch and German
The tagging guides for Dutch, English, and German may be found at the top-level directory for each language:
$CELEX2/dutch/dug_let.ps
$CELEX2/english/eug_let.pdf
$CELEX2/german/gug_let.ps
There's also an online description of the contents of the corpus.
Among other resources, these guides (at least the English one) contain a mapping between IPA symbols and the symbols used in CELEX for phonemes.
Creating the Input Data
Running the Extractor
We pull the input data out of the CELEX2 file through the ant target celex-hyphens, with input file and output file specified with -D property declarations:
    > ant -Dcelex.ow.file=e:\data\celex2\unpacked\celex2\english\eow\eow.cd -Dhyphens.file=celex-english-hyphens.txt celex-hyphens
    # hyphenations=71476
    # unique words=71363
    [naafi, naaf i]
    [de serts, des erts]
    [coun ci lors, coun cil ors]
    [reb el, re bel]
    [cursed, curs ed]
    ...
    [in valid, in val id, in va lid]
    ...
    RESIDUAL ESCAPES=[]
    BUILD SUCCESSFUL
    Total time: 6 seconds
Multiple Hyphenations
Note that there are 71,476 unique hyphenations, but only 71,363 words from which they were drawn. The difference is made up by words with multiple hyphenations, which are listed below the counts (truncated here). The word "invalid" is unique in having three hyphenations, though the first one appears to be an error (unless that last syllable's pronounced like "vlid", but in general, reductions like that aren't listed in the corpus). The system we will produce will generate a first-best hyphenation for each word; under this scheme, the system will make at least 113 mistakes on the ambiguous English data.
Diacritics and Special Symbols
The CELEX2 corpus uses ASCII encodings of characters, employing their own custom diacritic encodings, listed in their README file(s) [this combines the Dutch, English and German accent sets, which are consistent]:
| diacritic symbol | diacritic sign |
|---|---|
| # | ' (acute accent) |
| ` | ` (grave accent) |
| " | " (diaeresis) |
| ^ | ^ (circumflex accent) |
| , | , (cedilla) |
| ~ | ~ (tilde) |
| @ | o (ring) |
The German Eszett symbol (ß, Unicode U+00DF) is encoded by the dollar-sign character '$'.
Multiple Forms of the Same Word
There are also multiple entries and entries with apostrophes for possessives and contractions:
    ...
    8378\boatswains\1\4518\4\B\0\0\boat-swains\bosuns\B\0\0\bo-suns\bo's'ns\B\0\0\bo's'ns\bo'suns\B\1\0\bo'suns
    ...
In this example, the word "boatswains" is hyphenated as boat-swains, and then additional variants are supplied on the same line, such as "bosuns", "bo's'ns", and "bo'suns", each with its own hyphenation.
Commas in Phrasal Entries
There's an error in some of the English encodings, which have commas in the entries, even though commas are reserved for cedillas. We just clean up by breaking on comma plus space to split these entries into multiple words.
    42215\hop, skip, and jump\0\21578\1\B\0\0\hop, skip, and jump
    42216\hop, step, and jump\0\21579\1\B\0\0\hop, step, and jump
Finished Data
The finished product, the file for which was specified as celex-english-hyphens.txt in the ant target invocation above, is UTF-8 encoded, with one entry per line and spaces representing hyphens:
    'd
    'em
    'll
    ...
    a
    a back
    a baft
    a ban don
    a ban don ing
    ...
    é pée
    é pées
There are remaining entries with apostrophes for contractions and possessives.
Hyphen Extractor Source Code
The source code for the extractor is in src/CelexHyphenCorpus.java. It's not particularly interesting, as it merely walks through the data stripping out pronunciations, breaking compounds into single words, collecting counts, and reinserting the diacritics.
Cross-Validating Hyphenation
Like our other evaluations, we set up hyphenation to run through ten-fold cross-validation. This means we will divide the data up into ten random sets, known as folds, and then for each fold in turn, train on the other nine folds and test on the given fold. Results get reported as averages.
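The fold assignment used throughout the code follows the i % numFolds convention; a tiny sketch (with made-up sizes, not the corpus counts) shows how items divide up:

```java
// Sketch of the fold bookkeeping: item i belongs to test fold (i % numFolds),
// so each fold trains on the other nine tenths of a shuffled corpus.
public class FoldSketch {

    // number of items assigned to the given test fold under i % numFolds
    static int testFoldSize(int fold, int numFolds, int numItems) {
        int count = 0;
        for (int i = 0; i < numItems; ++i)
            if (i % numFolds == fold) ++count;
        return count;
    }

    public static void main(String[] args) {
        int numFolds = 10, numItems = 25; // hypothetical corpus size
        for (int fold = 0; fold < numFolds; ++fold)
            System.out.printf("fold=%d #test=%d #train=%d%n",
                              fold,
                              testFoldSize(fold, numFolds, numItems),
                              numItems - testFoldSize(fold, numFolds, numItems));
    }
}
```

Because the corpus is permuted before folding (see the parsing code below), this modular split behaves like a random partition.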
    > ant -Dverbose=false -Dhyphens.file=celex-english-hyphens.txt xval-hyphens
    ACCURACY ARGS: NGRAM=8 NUM_CHARS=64 INTERPOLATION_RATIO=4.0 HYPHENATION_N_BEST=1024 VERBOSE=false
    Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\celex-english-hyphens.txt
    #words=71476 #hyphens=121794 #hyphen insertion points=524910
    FOLD = 0 ACCURACY=0.955
    FOLD = 1 ACCURACY=0.957
    FOLD = 2 ACCURACY=0.957
    FOLD = 3 ACCURACY=0.958
    FOLD = 4 ACCURACY=0.955
    FOLD = 5 ACCURACY=0.956
    FOLD = 6 ACCURACY=0.960
    FOLD = 7 ACCURACY=0.952
    FOLD = 8 ACCURACY=0.958
    FOLD = 9 ACCURACY=0.956
    OVERALL ACCURACIES ACROSS FOLDS
    PER HYPHENATION: PREC=0.975 RECALL=0.976 F(1)=0.975
    PER TAGGING DECISION: ACC=0.9886 #DECISIONS= 524910
    WHOLE WORD: ACCURACY=0.956 DEV=0.002
Evaluations
Before going into the code, we'll describe the three distinct lines of evaluation reported above.
Whole Word Accuracy
The simplest evaluation is whole word accuracy, which is simply the number of words that were hyphenated correctly divided by the number of words. The whole word accuracy for this run was 0.956, with a deviation of 0.002 derived from the ten runs, which varied in accuracy between 0.952 and 0.960.
It's instructive to compute the binomial confidence interval (a kind of Z-test, which is appropriate here due to the large sample size), which is based on a standard deviation calculation, where p is the success probability (here 0.956) and N is the number of trials (71,476):

    dev(Binomial(p,N)) = sqrt(p * (1-p) / N)

Note that the result is normalized (by dividing by N) back to the probability scale. The actual deviation in the binomial outcome is sqrt(N * p * (1-p)), which is then divided by N to yield sqrt(p * (1-p) / N).
Plugging in our numbers yields 0.000767. A 95% confidence interval is roughly two times the deviation in either direction, or, +/- .00153, for a confidence interval of (0.9545, 0.9575), which appears to be a bit too narrow given the cross-validation results (see the sidebar).
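The arithmetic above can be checked with a few lines of code, plugging in the reported p = 0.956 and N = 71,476:

```java
// Quick check of the normalized binomial deviation and the rough 95%
// interval (mean +/- two deviations) discussed above.
public class BinomialDevSketch {

    // dev(Binomial(p,N)) normalized to the probability scale
    static double binomialDev(double p, long n) {
        return Math.sqrt(p * (1.0 - p) / n);
    }

    public static void main(String[] args) {
        double dev = binomialDev(0.956, 71476L);
        System.out.printf("dev=%8.6f%n", dev); // ~0.000767
        System.out.printf("95%% interval=(%6.4f, %6.4f)%n",
                          0.956 - 2 * dev, 0.956 + 2 * dev);
    }
}
```

The printed interval is roughly (0.9545, 0.9575), matching the figure in the text and visibly narrower than the fold-to-fold spread of the cross-validation runs.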
Per-Hyphenation Precision and Recall
Treating the hyphens as objects to be found, we can report precision and recall values. There's a total of 121,794 hyphens in the corpus. As usual, precision is the percentage of hyphens the model returns that are correct, whereas recall is the percentage of the reference hyphens that the model returns. The numbers here are precision=0.975 and recall=0.976. That is, we found 97.6% of the hyphens, and 97.5% of what we found was correct. The F-measure is their harmonic mean, which is 0.975.
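The definitions can be sanity-checked with made-up counts (these are not the corpus counts):

```java
// Sanity check of the precision/recall/F-measure definitions: F(1) is the
// harmonic mean of precision and recall. Counts here are hypothetical.
public class PrfSketch {

    static double precision(long tp, long fp) { return tp / (double) (tp + fp); }
    static double recall(long tp, long fn)    { return tp / (double) (tp + fn); }
    static double f1(double p, double r)      { return 2 * p * r / (p + r); }

    public static void main(String[] args) {
        double p = precision(97, 3); // 97 true positives, 3 false positives
        double r = recall(97, 2);    // 2 false negatives
        System.out.printf("P=%5.3f R=%5.3f F=%5.3f%n", p, r, f1(p, r));
    }
}
```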
Per-Decision Accuracy
Finally, we can look at each position between a pair of characters in the input to judge whether we made the right decision, hyphen or no-hyphen. There's a total of 524,910 such decision points, and our overall accuracy is 0.9886.
Verbose Reports Per Word
Reports on all the errors made by the system on a word-by-word basis are available by setting the verbose property to true, which it is by default. Here's a sample:
    > ant -Dhyphens.file=celex-english-hyphens.txt xval-hyphens
    ....
    EVALUATING FOLD=0/10
    Training Hyphenator
    Compiling Hyphenator
    REFERENCE                RESPONSE                 CORRECT
    @lla mas@                @ll a mas@               false
    @py ja ma@               @py jam a@               false
    @fril li ness@           @frill i ness@           false
    @trag i@                 @tra gi@                 false
    @ser aphs@               @se raphs@               false
    @he do nis tic@          @he don is tic@          false
    @mo ral i ties@          @mor al i ties@          false
    @frig ates@              @fri gates@              false
    @cha ris ma@             @char is ma@             false
    ...
    @step fathers@           @step fa thers@          false
    ...
    @dis in fla tion a ry@   @dis in fla tion ar y@   false
    ...
    @vaude ville@            @vau dev ille@           false
The labeled data's hyphenation is listed on the left under REFERENCE, with the system's answer listed on the right under RESPONSE.
Most of the mistakes involve off-by-one break locations, but some of these lead to the wrong number of syllables, as for "llamas" and "vaudeville".
Some of the mistakes appear to be errors in the training data, such as "step-fathers", which clearly should have a hyphenation point in "fathers". Other errors are questionable, such as "dis-in-fla-tion-a-ry", whose hyphenation in dictionary.reference.com matches our response. In part, this indicates how sketchy the "rules" are for hyphenation in English.
Code Walkthrough
The cross-validation code is available in src/XvalHyphenation.java. The portions of the code that train, compile, and run on new data are very simple. Most of the code's tied up in creating the folds and running evaluations.
Spell Checking
Hyphenation is based on our spell checking API, which is further explained in the spell check tutorial and in the Chinese tokenization tutorial.
Problem Encoding
To cast the problem into our spelling framework, we wrap each word in a start-word and end-word symbol, both chosen to be '@' here. Using different symbols for start and end doesn't change our results, so we went with the simpler encoding. We then just train the model by feeding it these wrapped words.
Corpus Parsing
There's code to parse the corpus in method parseCorpus(). The only piece of this relevant to the API is the piece that appends the start-of-word and end-of-word symbols:
    public static String[] parseCorpus(File hyphenationDataFile)
        throws IOException {

        ...
        String[] hyphenatedWords = new String[hyphenatedWordList.size()];
        for (int i = 0; i < hyphenatedWords.length; ++i)
            hyphenatedWords[i] = "@" + hyphenatedWordList.get(i) + "@";
and the piece that permutes the array of hyphenated words (using our utility class util.Arrays):
com.aliasi.util.Arrays.<String>permute(hyphenatedWords, new Random(42));
The random seed is hard-coded here so that the results are the same. This is a re-usable trick for debugging methods with a random component. Changing the seed changes the results, as there's a truly random element to cross-validation.
Model Training
Training the spell checking model is done in the usual way:
    static CompiledSpellChecker
        compileHyphenator(String[] hyphenatedWords, int fold, int numFolds,
                          int maxNGram, int numChars,
                          double interpolationRatio)
        throws IOException, ClassNotFoundException {

        NGramProcessLM lm
            = new NGramProcessLM(maxNGram,numChars,interpolationRatio);
        WeightedEditDistance distance = CompiledSpellChecker.TOKENIZING;
        TrainSpellChecker trainer
            = new TrainSpellChecker(lm,distance,null);
        for (int i = 0; i < hyphenatedWords.length; ++i)
            if (i % numFolds != fold)
                trainer.handle(hyphenatedWords[i]);
        ...
First, a process n-gram language model is created using the specified parameters for n-gram length, interpolation ratio, and number of characters for the baseline uniform distribution.
Next, the weighted edit distance is set to the constant CompiledSpellChecker.TOKENIZING, which implements the deterministic channel model by making space insertion and matching cost zero and all other edits cost infinity.
Then, the model is compiled using the utility method compile() from util.AbstractExternalizable:
    CompiledSpellChecker hyphenator
        = (CompiledSpellChecker) AbstractExternalizable.compile(trainer);
    hyphenator.setAllowInsert(true);
    hyphenator.setAllowMatch(true);
    hyphenator.setAllowDelete(false);
    hyphenator.setAllowSubstitute(false);
    hyphenator.setAllowTranspose(false);
    hyphenator.setNumConsecutiveInsertionsAllowed(1);
    hyphenator.setFirstCharEditCost(0);
    hyphenator.setSecondCharEditCost(0);
    hyphenator.setNBest(HYPHENATION_N_BEST);
    return hyphenator;
After the model is compiled, it is configured for efficiency using methods in CompiledSpellChecker. In particular, it allows inserts (for the spaces), but limits the number of consecutive insertions to one. The code allows matches, but all other edits are disallowed.
This code also sets the first and second character edit cost penalties to 0, putting them on an equal footing with other characters. For ordinary spelling correction, there is a much lower chance of first and second characters being wrong than characters later in the word.
Finally, the n-best list is set to a constant. In general, the n-best list should be set as low as possible without hurting accuracy, because fatter n-best lists mean slower parsing time.
Compiling to File
The trained model could have been compiled to a file using the method compileTo(ObjectOutput) in the spell checker trainer or the utility method AbstractExternalizable.compileTo(Compilable,File), because TrainSpellChecker implements util.Compilable. After compilation to a file, the model may be read back in through ordinary serialization (that is, an ObjectInput), or using the utility method AbstractExternalizable.readObject(File), and cast to a CompiledSpellChecker. Don't forget to run the configuration on the compiled spell checker following the code above.
Evaluation
The evaluation happens in the method evaluateFold(), which is called for each fold value (from 0 to the number of folds minus one):
    static void evaluateFold(String[] hyphenatedWords,
                             int fold, int numFolds,
                             double[] accuracies,
                             PrecisionRecallEvaluation prEval)
        throws ClassNotFoundException, IOException {

        CompiledSpellChecker hyphenator
            = compileHyphenator(hyphenatedWords,fold,numFolds,
                                MAX_NGRAM,NUM_CHARS,INTERPOLATION_RATIO);
        ...
Note that a precision-recall evaluation is passed in, so that it may be re-used across all the folds to provide a cumulative answer for the whole evaluation. The array accuracies is filled with per-fold accuracies so that means and variances may be reported once all the folds are run.
The first call is just to compile the hyphenator. Then it loops over the word array, evaluating on a case-by-case basis:
    int numCases = 0;
    int numCorrect = 0;
    for (int i = 0; i < hyphenatedWords.length; ++i) {
        if (i % numFolds == fold) {
            String hyphenatedWord = hyphenatedWords[i];
            String unhyphenatedWord = hyphenatedWord.replaceAll(" ","");
            String rehyphenatedWord
                = tamp(hyphenator.didYouMean(unhyphenatedWord));
            boolean correct = hyphenatedWord.equals(rehyphenatedWord);
            ++numCases;
            if (correct) ++numCorrect;
            updatePrEval(prEval,hyphenatedWord,rehyphenatedWord);
        }
    }
    double accuracy = numCorrect / (double) numCases;
    accuracies[fold] = accuracy;
We keep track of overall accuracy using counters, and evaluate on the fold at hand (determined by the mod arithmetic test). The original hyphenated word is retrieved, and the spaces removed using a replace operation. Then it is fed to the didYouMean() method of the hyphenator (named after its use for spelling correction). The answer is correct if it's equal to the input, and counts are updated accordingly. Finally, the precision-recall evaluation is updated given the reference hyphenation from the corpus and the system's response hyphenation. Outside of the loop over cases, the overall accuracy is recorded in the array passed in.
The method tamp() applied to the hyphenation gets rid of any spurious hyphens between the start and end boundaries and the word:
    static String tamp(String hyphenation) {
        return "@" + hyphenation.replaceAll("@","").trim() + "@";
    }
Precision-Recall Evaluation
The precision-recall eval uses the LingPipe class
classify.PrecisionRecallEvaluation
, which is a general
confusion-matrix based scorer.
    static void updatePrEval(PrecisionRecallEvaluation prEval,
                             String reference, String response) {
        Set<Integer> referenceBoundarySet = getBoundarySet(reference);
        Set<Integer> responseBoundarySet = getBoundarySet(response);
        Set<Integer> universalSet = new HashSet<Integer>();
        universalSet.addAll(referenceBoundarySet);
        universalSet.addAll(responseBoundarySet);
        for (Integer i : universalSet) {
            boolean ref = referenceBoundarySet.contains(i);
            boolean resp = responseBoundarySet.contains(i);
            prEval.addCase(ref,resp);
        }
    }
The method getBoundarySet(String) returns the set of integers representing the number of characters preceding each hyphenation point, not counting the hyphenation characters (spaces). For instance, foo bar ly would return the set {3, 6}.
We then create the union of the reference and response boundaries, then iterate over these, calculating for each whether it's a reference and/or response tagging. These are added as cases to the precision-recall eval. For instance, adding true,true is a true positive, true,false is a false negative, and false,true a false positive.
    static Set<Integer> getBoundarySet(String hyphenatedWord) {
        Set<Integer> boundarySet = new HashSet<Integer>();
        int pos = 0;
        for (int i = 0; i < hyphenatedWord.length(); ++i) {
            if (hyphenatedWord.charAt(i) == ' ')
                boundarySet.add(pos);
            else
                ++pos;
        }
        return boundarySet;
    }
The rest of the code is just setup and print statements. Here's how the results get reported given the precision-recall eval, number of decisions (which there's a method to compute given the corpus) and accuracies with deviations:
    public static void report(PrecisionRecallEvaluation prEval,
                              int numTagDecisions,
                              double[] accuracies) {

        System.out.printf("\nPER HYPHENATION: PREC=%5.3f RECALL=%5.3f F(1)=%5.3f\n",
                          prEval.precision(), prEval.recall(),
                          prEval.fMeasure());

        double numTagErrors
            = prEval.falsePositive() + prEval.falseNegative();
        double perTagAccuracy
            = (numTagDecisions - numTagErrors) / (double) numTagDecisions;
        System.out.printf("\nPER TAGGING DECISION: ACC=%5.4f #DECISIONS=%7d\n",
                          perTagAccuracy,numTagDecisions);

        System.out.printf("\nWHOLE WORD: ACCURACY=%5.3f DEV=%5.3f\n",
                          Statistics.mean(accuracies),
                          Statistics.standardDeviation(accuracies));
    }
An Interpolated Bidirectional Model
Like many left-to-right sequence models, the language-model based noisy channel model suffers from the label bias problem.
Without resorting to conditional random fields or structured support vector machines, we can mitigate the label bias problem somewhat by running models in both directions. That is, we train once with the usual data, then we train a second model with all of the words reversed. To produce a single answer, we combine the probability estimates of the forward and backward models using the n-best lists from spelling. We provide a model that does this in src/XvalHyphenationBidi.java.
N-best Spell Checker Output
Other than a whole lot of bookkeeping to track when a string is
forward and backward, the only interesting innovation here is in
combining the n-best lists from the spell checkers. This is done
with a helper method nBest()
which converts the output
to a mapping from hyphenations to scores:
    static ObjectToDoubleMap<String> nBest(CompiledSpellChecker hyphenator,
                                           String word, boolean reverseIO) {
        Iterator<ScoredObject<String>> nBest
            = hyphenator.didYouMeanNBest(reverseIO ? reverse(word) : word);
        ObjectToDoubleMap<String> nBestMap
            = new ObjectToDoubleMap<String>();
        while (nBest.hasNext()) {
            ScoredObject<String> scoredHyphenation = nBest.next();
            String hyphenation
                = XvalHyphenation.tamp(scoredHyphenation.getObject());
            Double score = Math.pow(2.0,scoredHyphenation.score());
            String orderedHyphenation
                = reverseIO ? reverse(hyphenation) : hyphenation;
            Double currentScore = nBestMap.get(orderedHyphenation);
            if (currentScore != null) {
                // happens when two inputs tamp to same val
                score = currentScore + score;
            }
            nBestMap.set(orderedHyphenation,score);
        }
        double totalProb = 0.0;
        for (String key : nBestMap.keySet()) {
            totalProb += nBestMap.get(key);
        }
        ObjectToDoubleMap<String> conditionalNBestMap
            = new ObjectToDoubleMap<String>();
        for (String key : nBestMap.keySet()) {
            double conditionalProb = nBestMap.get(key) / totalProb;
            conditionalNBestMap.set(key,conditionalProb);
        }
        return conditionalNBestMap;
    }
This just walks over the scored object output of the n-best spell checker. Scored objects include an object, in this case a string representing a hyphenation, and a score, in this case a joint log (base 2) likelihood of the string and its hyphenation. In some cases, the result of tamping down the spaces around the boundaries leads to two or three entries on the n-best list that correspond to the same hyphenation. In these cases, we convert to a linear scale and sum the values.
The last step of the algorithm sums the linear probabilities p(h,w) to get the marginal probability p(w) of the underlying string, then divides the linear probabilities by the marginal to get the conditional, p(h|w) = p(h,w)/p(w).
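In isolation, that final normalization step can be sketched as follows. This is a self-contained illustration using plain java.util maps rather than LingPipe's ObjectToDoubleMap; the class and method names are our own, not LingPipe API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Standalone sketch of the normalization step: convert log (base 2)
// joint scores into conditional probabilities p(h|w) = p(h,w) / p(w).
public class Conditionalize {
    static Map<String,Double> conditionalize(Map<String,Double> log2Joint) {
        Map<String,Double> linear = new LinkedHashMap<String,Double>();
        double marginal = 0.0;                      // accumulates p(w)
        for (Map.Entry<String,Double> e : log2Joint.entrySet()) {
            double p = Math.pow(2.0, e.getValue()); // back to linear scale
            linear.put(e.getKey(), p);
            marginal += p;
        }
        for (Map.Entry<String,Double> e : linear.entrySet()) {
            e.setValue(e.getValue() / marginal);    // conditional p(h|w)
        }
        return linear;
    }
}
```

For instance, two hyphenations with equal log joint scores each come out with conditional probability 0.5 after normalization.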
Combining Forward and Backward Results
The hyphenation method just adds the forward and backward estimates and divides by two to get an interpolated score:
p(h|w) = 0.5 * (pF(h|w) + pB(h|w))
Interpolation ratios other than 0.5/0.5 might produce different results. For instance, 1.0/0.0 reproduces the forward results and 0.0/1.0 the backward results.
static String hyphenate(String word,
                        CompiledSpellChecker hyphenator,
                        CompiledSpellChecker hyphenatorReversed) {
    String reversed = reverse(word);
    ObjectToDoubleMap<String> fwdNBest = nBest(hyphenator,word,false);
    ObjectToDoubleMap<String> revNBest = nBest(hyphenatorReversed,word,true);
    ObjectToDoubleMap<String> nBest = new ObjectToDoubleMap<String>();
    for (String hyphenation : fwdNBest.keySet()) {
        double fwdScore = fwdNBest.getValue(hyphenation);
        double revScore = revNBest.getValue(hyphenation);
        nBest.set(hyphenation, 0.5 * fwdScore + 0.5 * revScore);
    }
    return nBest.keysOrderedByValueList().get(0);
}
Basically, this just takes the two values and interpolates them evenly, adding them to a map. The final return just takes the top scoring value from the list and returns it.
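The even mix can be generalized to an arbitrary forward weight. The sketch below is hypothetical (plain java.util maps rather than LingPipe's ObjectToDoubleMap); lambda = 1.0 reproduces the forward model, lambda = 0.0 the backward model, and lambda = 0.5 the even interpolation used here.

```java
import java.util.Map;

// Hypothetical generalization of the even interpolation: pick the
// hyphenation maximizing lambda * pF(h|w) + (1 - lambda) * pB(h|w).
// As in the original code, only hyphenations proposed by the forward
// model are considered.
public class Interpolate {
    static String best(Map<String,Double> fwd, Map<String,Double> rev,
                       double lambda) {
        String bestHyph = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String,Double> e : fwd.entrySet()) {
            Double revProb = rev.get(e.getKey());   // may be absent
            double score = lambda * e.getValue()
                + (1.0 - lambda) * (revProb == null ? 0.0 : revProb);
            if (score > bestScore) {
                bestScore = score;
                bestHyph = e.getKey();
            }
        }
        return bestHyph;
    }
}
```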
Bidirectional Results
Unfortunately, the bidirectional results look almost identical to the unidirectional forward results.
> ant -Dhyphens.file=celex-english-hyphens.txt xval-hyphens
ACCURACY ARGS: NGRAM=8 NUM_CHARS=64 INTERPOLATION_RATIO=4.0 HYPHENATION_N_BEST=256
Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\celex-english-hyphens.txt
#words=71476 #hyphens=121794 #hyphen insertion points=524910
FOLD = 0 ACCURACY=0.955
FOLD = 1 ACCURACY=0.958
FOLD = 2 ACCURACY=0.956
FOLD = 3 ACCURACY=0.959
FOLD = 4 ACCURACY=0.955
FOLD = 5 ACCURACY=0.956
FOLD = 6 ACCURACY=0.960
FOLD = 7 ACCURACY=0.953
FOLD = 8 ACCURACY=0.959
FOLD = 9 ACCURACY=0.955
OVERALL ACCURACIES ACROSS FOLDS
PER HYPHENATION: PREC=0.975 RECALL=0.975 F(1)=0.975
PER TAGGING DECISION: ACC=0.9885 #DECISIONS= 524910
WHOLE WORD: ACCURACY=0.957 DEV=0.002
The whole-word accuracy is .001 better, the recall is .001 worse, and the per-tagging accuracy is .0001 worse. This makes the bidirectional system statistically indistinguishable from the forward system.
An Intersecting Bidirectional Model
The final variant we consider is motivated by the fact that hyphenation is a precision-oriented problem. This is because of the use case, which is inserting hyphens in printed material. As long as there's a reasonable hyphenation point somewhere near the middle of long words, overall recall is less important. With that in mind, we will deliver a system that simply intersects the first-best hyphens in the forward and backward models.
The Intersection Code
Training works just as in the bidirectional interpolation model, but decoding works by intersecting (or unioning, depending on the command-line flag) the tags in the forward and backward first-best results.
The code is in src/XvalHyphenationBiDiIntersect.java.
The only interesting difference from the previous code is that it takes the first-best hyphenations from the forward and backward models and intersects (or unions) them.
static String intersect(String hyphenationFwd, String hyphenationRev) {
    boolean matched = hyphenationFwd.equals(hyphenationRev);
    if (matched)
        return hyphenationFwd;
    Set<Integer> hyphenSetFwd = XvalHyphenation.getBoundarySet(hyphenationFwd);
    Set<Integer> hyphenSetRev = XvalHyphenation.getBoundarySet(hyphenationRev);
    if (INTERSECT)
        hyphenSetFwd.retainAll(hyphenSetRev);
    else
        hyphenSetFwd.addAll(hyphenSetRev);
    int k = 0;
    int[] hyphenPositions = new int[hyphenSetFwd.size()];
    for (Integer hyphenPosition : hyphenSetFwd) {
        hyphenPositions[k++] = hyphenPosition;
    }
    Arrays.sort(hyphenPositions);
    String hyphenation = hyphenationFwd.replaceAll(" ",""); // strip hyphens
    for (int i = hyphenPositions.length; --i >= 0; ) {
        hyphenation = hyphenation.substring(0,hyphenPositions[i])
            + " "
            + hyphenation.substring(hyphenPositions[i]);
    }
    return hyphenation;
}
It's just a bunch of index fiddling. The boolean flag INTERSECT determines whether the results are intersected or unioned. It's controlled by the property intersect from the command line.
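To make the index fiddling concrete, here is a self-contained sketch of the same idea. The getBoundaries method is a hypothetical stand-in for XvalHyphenation.getBoundarySet, which we assume returns the character offsets at which hyphens fall.

```java
import java.util.TreeSet;

// Self-contained sketch of boundary-set intersection. A hyphenation like
// "pub lish er" reduces to the offsets {3, 7}; intersecting the forward
// and backward offset sets keeps only hyphens both models agree on.
public class BoundaryIntersect {
    static TreeSet<Integer> getBoundaries(String hyphenation) {
        TreeSet<Integer> bounds = new TreeSet<Integer>();
        int offset = 0;
        for (char c : hyphenation.toCharArray()) {
            if (c == ' ')
                bounds.add(offset);  // boundary before the next character
            else
                ++offset;
        }
        return bounds;
    }

    static String intersect(String fwd, String rev) {
        TreeSet<Integer> bounds = getBoundaries(fwd);
        bounds.retainAll(getBoundaries(rev));
        String word = fwd.replaceAll(" ", "");  // strip hyphens
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < word.length(); ++i) {
            if (bounds.contains(i))
                sb.append(' ');
            sb.append(word.charAt(i));
        }
        return sb.toString();
    }
}
```

For example, intersecting a forward output "pub lish er" with a (hypothetical) backward output "publ ish er" keeps only the shared boundary, yielding "publish er".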
Results from Intersection
The code can be run from the ant target xval-hyphens-bidi-int, with the verbosity and corpus specified on the command line, and additionally a flag indicating whether to intersect (a false value yields union).
> ant -Dverbose=false -Dintersect=true -Dhyphens.file=celex-english-hyphens.txt xval-hyphens-bidi-int
Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\celex-english-hyphens.txt
#words=71476 #hyphens=121794 #hyphen insertion points=524910
FOLD = 0 ACCURACY=0.948
FOLD = 1 ACCURACY=0.949
FOLD = 2 ACCURACY=0.950
FOLD = 3 ACCURACY=0.952
FOLD = 4 ACCURACY=0.949
FOLD = 5 ACCURACY=0.948
FOLD = 6 ACCURACY=0.953
FOLD = 7 ACCURACY=0.944
FOLD = 8 ACCURACY=0.949
FOLD = 9 ACCURACY=0.947
OVERALL ACCURACIES ACROSS FOLDS
PER HYPHENATION: PREC=0.981 RECALL=0.969 F(1)=0.975
PER TAGGING DECISION: ACC=0.9884 #DECISIONS= 524910
WHOLE WORD: ACCURACY=0.949 DEV=0.002
These results differ substantially from the two previous sets (forward and interpolating) in that the precision is much higher, 0.981 versus 0.975. That effectively eliminates .006/.025 = 24% of the spurious hyphens suggested by the system. Whole-word accuracy drops from 0.956 to 0.949, recall drops from 0.976 to 0.969, and the per-tagging accuracy remains roughly the same.
Running in unioning mode, we push the precision/recall balance in the opposite direction, which is not so useful other than as a theoretical exercise:
ACCURACY ARGS: NGRAM=8 NUM_CHARS=64 INTERPOLATION_RATIO=4.0 HYPHENATION_N_BEST=256 VERBOSE=false MODE=UNIONING
Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\celex-english-hyphens.txt
#words=71476 #hyphens=121794 #hyphen insertion points=524910
FOLD = 0 ACCURACY=0.945
FOLD = 1 ACCURACY=0.946
FOLD = 2 ACCURACY=0.949
FOLD = 3 ACCURACY=0.950
FOLD = 4 ACCURACY=0.948
FOLD = 5 ACCURACY=0.946
FOLD = 6 ACCURACY=0.951
FOLD = 7 ACCURACY=0.943
FOLD = 8 ACCURACY=0.949
FOLD = 9 ACCURACY=0.945
OVERALL ACCURACIES ACROSS FOLDS
PER HYPHENATION: PREC=0.968 RECALL=0.981 F(1)=0.974
PER TAGGING DECISION: ACC=0.9880 #DECISIONS= 524910
WHOLE WORD: ACCURACY=0.947 DEV=0.002
Here we have recall at 0.981, which is higher than our original 0.976, at the cost of a few points of precision. As with the precision-balanced system, overall accuracy and per-tagging accuracy also suffer compared to the simple forward or the forward/backward interpolating system.
Parameter Tuning
Like most models built on character language models, the hyphenation model is relatively insensitive to the available tuning parameters.
N-gram Length
First, let's look at n-gram length and interpolation ratio for some reasonable values. The following table contains reports for the simple forward model, evaluated over n-grams from 4 to 10 and interpolation ratios from 1/8 the n-gram length to 4 times the n-gram length.
N | Interp | Acc | Prec | Recall | F |
---|---|---|---|---|---|
4 | 0.3 | 0.819 | 0.889 | 0.886 | 0.887 |
4 | 0.5 | 0.819 | 0.889 | 0.885 | 0.887 |
4 | 1.0 | 0.819 | 0.889 | 0.885 | 0.887 |
4 | 2.0 | 0.819 | 0.889 | 0.884 | 0.887 |
4 | 4.0 | 0.817 | 0.889 | 0.883 | 0.886 |
4 | 8.0 | 0.814 | 0.888 | 0.880 | 0.884 |
4 | 16.0 | 0.806 | 0.887 | 0.874 | 0.880 |
5 | 0.3 | 0.909 | 0.945 | 0.947 | 0.946 |
5 | 0.6 | 0.910 | 0.946 | 0.947 | 0.946 |
5 | 1.3 | 0.909 | 0.946 | 0.945 | 0.945 |
5 | 2.5 | 0.907 | 0.945 | 0.944 | 0.944 |
5 | 5.0 | 0.903 | 0.943 | 0.941 | 0.942 |
5 | 10.0 | 0.895 | 0.939 | 0.935 | 0.937 |
5 | 20.0 | 0.881 | 0.931 | 0.925 | 0.928 |
6 | 0.4 | 0.944 | 0.968 | 0.970 | 0.969 |
6 | 0.8 | 0.945 | 0.968 | 0.969 | 0.969 |
6 | 1.5 | 0.944 | 0.968 | 0.969 | 0.968 |
6 | 3.0 | 0.944 | 0.967 | 0.967 | 0.967 |
6 | 6.0 | 0.940 | 0.965 | 0.965 | 0.965 |
6 | 12.0 | 0.933 | 0.961 | 0.960 | 0.961 |
6 | 24.0 | 0.921 | 0.955 | 0.952 | 0.954 |
7 | 0.4 | 0.953 | 0.973 | 0.974 | 0.974 |
7 | 0.9 | 0.953 | 0.973 | 0.974 | 0.974 |
7 | 1.8 | 0.953 | 0.974 | 0.974 | 0.974 |
7 | 3.5 | 0.953 | 0.973 | 0.974 | 0.974 |
7 | 7.0 | 0.952 | 0.972 | 0.972 | 0.972 |
7 | 14.0 | 0.946 | 0.969 | 0.969 | 0.969 |
7 | 28.0 | 0.937 | 0.964 | 0.963 | 0.964 |
8 | 0.5 | 0.954 | 0.975 | 0.975 | 0.975 |
8 | 1.0 | 0.955 | 0.975 | 0.975 | 0.975 |
8 | 2.0 | 0.956 | 0.976 | 0.976 | 0.976 |
8 | 4.0 | 0.956 | 0.975 | 0.976 | 0.975 |
8 | 8.0 | 0.955 | 0.975 | 0.975 | 0.975 |
8 | 16.0 | 0.951 | 0.972 | 0.972 | 0.972 |
8 | 32.0 | 0.942 | 0.968 | 0.967 | 0.967 |
9 | 0.6 | 0.954 | 0.975 | 0.974 | 0.974 |
9 | 1.1 | 0.955 | 0.975 | 0.975 | 0.975 |
9 | 2.3 | 0.956 | 0.976 | 0.975 | 0.976 |
9 | 4.5 | 0.958 | 0.976 | 0.976 | 0.976 |
9 | 9.0 | 0.956 | 0.975 | 0.975 | 0.975 |
9 | 18.0 | 0.952 | 0.973 | 0.973 | 0.973 |
9 | 36.0 | 0.944 | 0.969 | 0.968 | 0.968 |
10 | 0.6 | 0.953 | 0.974 | 0.973 | 0.974 |
10 | 1.3 | 0.954 | 0.975 | 0.974 | 0.975 |
10 | 2.5 | 0.956 | 0.976 | 0.976 | 0.976 |
10 | 5.0 | 0.958 | 0.976 | 0.976 | 0.976 |
10 | 10.0 | 0.957 | 0.976 | 0.976 | 0.976 |
10 | 20.0 | 0.952 | 0.973 | 0.973 | 0.973 |
10 | 40.0 | 0.944 | 0.969 | 0.968 | 0.969 |
The higher the interpolation ratio, the more smoothing is applied to higher-order character n-gram estimates.
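Concretely, under Witten-Bell-style interpolation of the general kind used by character language models, the weight on the maximum-likelihood estimate for a context shrinks as the interpolation ratio grows. The sketch below illustrates the trend only; it is not LingPipe's exact formula, and the names are our own.

```java
// Illustrative Witten-Bell-style interpolation weight: the maximum-
// likelihood estimate for a context gets weight lambda, the shorter-
// context (backoff) estimate gets 1 - lambda. A larger interpolation
// ratio shrinks lambda, i.e. applies more smoothing.
public class WittenBell {
    static double lambda(long contextCount, long numOutcomes, double ratio) {
        return contextCount / (contextCount + ratio * numOutcomes);
    }

    static double estimate(double mlProb, double backoffProb,
                           long contextCount, long numOutcomes, double ratio) {
        double lam = lambda(contextCount, numOutcomes, ratio);
        return lam * mlProb + (1.0 - lam) * backoffProb;
    }
}
```

With a context seen 100 times and 10 distinct outcomes, a ratio of 1.0 gives the high-order estimate weight 100/110, while a ratio of 8.0 drops that weight to 100/180, shifting mass toward the smoother backoff estimate.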
Taking the maximal values in each column, the highest F-measure achieved is 0.976, and the highest whole-word accuracy is 0.958. The most important thing to notice here is that the scores respond smoothly to the parameters and are insensitive to small changes. That is, 9-grams perform pretty much like 8-grams, and an interpolation ratio of 4.0 performs pretty much like 8.0.
Our default recommendation for English is to use 8-grams with the default interpolation ratio of 8.0 (the row with N=8 and Interp=8.0). The runs in the body of the tutorial use 8-grams with an interpolation ratio of 4.0.
Dutch Hyphenation
We run Dutch the same way as English.
The Dutch Corpus
Here's the ant target to create the corpus:
> ant -Dcelex.ow.file=e:\data\celex2\unpacked\celex2\dutch\dow\dow.cd -Dhyphens.file=celex-dutch-hyphens.txt celex-hyphens
# hyphenations=315957
# unique words=315897
[tim er, ti mer]
[lijs ten, lij sten]
[made, ma de]
[mi se, mise]
...
[story, sto ry]
[an ne, anne]
[wets taal, wet staal]
[coun try, country]
[raket, ra ket]
RESIDUAL ESCAPES=[]
Some of those duplicate hyphenations look highly suspect.
Dutch Cross-Validation
Here's the ant target to evaluate using cross-validation:
> ant -Dverbose=false -Dhyphens.file=celex-dutch-hyphens.txt xval-hyphens
Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\celex-dutch-hyphens.txt
#words=315957 #hyphens=811045 #hyphen insertion points=3115724
ACCURACY ARGS: NGRAM=8 NUM_CHARS=64 INTERPOLATION_RATIO=4.0 HYPHENATION_N_BEST=1024 VERBOSE=false
NGRAM= 8 INTERP= 4.0 ACC=0.994 P=0.998 R=0.998 F=0.998
OVERALL ACCURACIES ACROSS FOLDS
PER HYPHENATION: PREC=0.998 RECALL=0.998 F(1)=0.998
PER TAGGING DECISION: ACC=0.9990 #DECISIONS=3115724
WHOLE WORD: ACCURACY=0.994 DEV=0.001
First, note that there are almost five times as many words in the Dutch dictionary as in the English one, with over 300,000 unique hyphenated Dutch words in the corpus.
It looks like hyphenation in Dutch is a whole lot easier than in English. In fact, it's nearly perfect, with a mistake made only about once every 200 words, or once every 500 hyphenation points.
German Hyphenation
German Corpus Special Cases
Just to keep us on our toes, the German data is in a slightly different format, with only a single hyphenation (no ambiguities) listed in column 5. It also contains an escape for the eszett (ß) character.
So we cut-and-pasted and modified the English corpus builder to produce a German one, src/CelexHyphenCorpusGerman.java.
It runs just like the other ones, though from its own ant target, celex-hyphens-german:
> ant -Dcelex.ow.file=e:\data\celex2\unpacked\celex2\german\gow\gow.cd -Dhyphens.file=celex-german-hyphens.txt celex-hyphens-german
# hyphenations=312215
# unique words=312128
[ras tet, ra stet]
[ra sten, ras ten]
[be stem, bes tem]
...
[in di gnier test, in dig nier test]
[gro tes ke, gro te ske]
RESIDUAL ESCAPES=[]
German Cross-Validation
We don't need to do anything special for evaluating German with cross-validation:
> ant -Dverbose=false -Dhyphens.file=celex-german-hyphens.txt xval-hyphens
Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\celex-german-hyphens.txt
#words=312208 #hyphens=793598 #hyphen insertion points=3164400
ACCURACY ARGS: NGRAM=8 NUM_CHARS=64 INTERPOLATION_RATIO=4.0 HYPHENATION_N_BEST=1024 VERBOSE=false
NGRAM= 8 INTERP= 4.0 ACC=0.997 P=0.999 R=0.999 F=0.999
OVERALL ACCURACIES ACROSS FOLDS
PER HYPHENATION: PREC=0.999 RECALL=0.999 F(1)=0.999
PER TAGGING DECISION: ACC=0.9996 #DECISIONS=3164400
WHOLE WORD: ACCURACY=0.997 DEV=0.000
German is even easier to hyphenate than English. We didn't run multiple parameters here, but it's easy to imagine longer contexts might help for German, given the length of some of the compounds in the dictionary.
English Syllabification
Next, we turn to English syllabification. Syllabification is essentially the same problem as hyphenation, only with a different set of characters.
Extracting the Syllables
To extract a training corpus, we have a separate program, src/CelexSyllableCorpus.java.
It's basically just like the hyphen corpus extractor, only with a different separator and a different set of characters. English syllables can be extracted using the ant target celex-syllables:
> ant -Dcelex.pw.file=e:\data\celex2\unpacked\celex2\english\epw\epw.cd -Dcelex.sylls.file=celex-english-syllables.txt celex-syllables
# syllabifications=108224
# unique words=108123
[I nO: gjU r+l, In O: gjU r+l]
[nEs lIN, nE slIN]
[In O: gjU rl,, I nO: gjU rl,]
...
[pOp + d+mz, pO p+ d+mz]
[pOn j+dz, pO nj+dz]
[n& tSr+ lIst, n& tSr+l Ist]
There are many more syllabifications than there were hyphenations. This is primarily due to there being multiple pronunciations for words.
The phonetic symbols are encoded in a CELEX-specific encoding. Some of the symbols represent length, such as the character ':', some indicate syllabic consonants, and so on. There's a full description in the annotation guides referenced above. Rather than replacing the multi-character symbols with their IPA equivalents, we just left them as is, which probably hurts accuracy somewhat.
We had to change the input symbols for the syllabification task because we use the at-sign (@) as a boundary in the hyphenation program. We just replaced it with a plus sign ('+'), which doesn't conflict with other symbols.
Performing Syllabification
Because we output the data in the same format as for hyphenation, we do not need to recode the evaluator. We run the standard forward evaluation from the ant target xval-hyphens. The bidirectional and intersecting targets may be invoked in the same way.
> ant -Dhyphens.file=celex-english-syllables.txt -Dverbose=false xval-hyphens
Reading data from file=C:\carp\mycvs\lingpipe\demos\tutorial\hyphenation\celex-english-syllables.txt
#words=108224 #hyphens=218187 #hyphen insertion points=689124
ACCURACY ARGS: NGRAM=8 NUM_CHARS=64 INTERPOLATION_RATIO=4.0 HYPHENATION_N_BEST=1024 VERBOSE=false
NGRAM= 8 INTERP= 4.0 ACC=0.988 P=0.995 R=0.994 F=0.995
OVERALL ACCURACIES ACROSS FOLDS
PER HYPHENATION: PREC=0.995 RECALL=0.994 F(1)=0.995
PER TAGGING DECISION: ACC=0.9966 #DECISIONS= 689124
WHOLE WORD: ACCURACY=0.988 DEV=0.001
As expected, syllabification is a whole lot more predictable than hyphenation in English. That's because the language's sound system is much more regular than its spelling.