What is Text Classification?
Text classification typically involves assigning a document to a category by automated or human means. LingPipe provides a classification facility that takes examples of text classifications--typically generated by a human--and learns how to classify further documents using language models built from those examples. There are many other ways to construct classifiers, but language models are particularly good at some versions of this task.
20 Newsgroups Demo
A publicly available data set to work with is the 20 newsgroups data, available from the 20 Newsgroups home page.
4 Newsgroups Sample
We have included a sample of 4 newsgroups with the LingPipe distribution in order to allow you to run the tutorial out of the box. You may also download and run over the entire 20 newsgroup dataset. LingPipe's performance over the whole data set is state of the art.
Quick Start
Once you have downloaded and installed LingPipe, change directories to the one containing this read-me:
> cd demos/tutorial/classify
You may then run the demo from the command line (placing all of the code on one line):
On Windows:
java -cp "../../../lingpipe-4.1.2.jar; classifyNews.jar" ClassifyNews
On Linux, Mac OS X, and other Unix-like operating systems:
java -cp "../../../lingpipe-4.1.2.jar: classifyNews.jar" ClassifyNews
or through Ant:
ant classifyNews
The demo will then train on the data in demos/fourNewsGroups/4news-train/ and evaluate on demos/fourNewsGroups/4news-test/. The results of scoring are printed to the command line and explained in the rest of this tutorial.
The Code
The entire source for the example is ClassifyNews.java. We will be using the API from Classifier and its subclasses to train the classifier, and Classification to evaluate it. The code should be pretty self-explanatory in terms of how training and evaluation are done. Below I go over the API calls.
Training
We are going to train up a set of character-based language models (one per newsgroup, as named in the static array CATEGORIES) that process data in 6-character sequences, as specified by the NGRAM_SIZE constant:
private static String[] CATEGORIES = {
    "soc.religion.christian",
    "talk.religion.misc",
    "alt.atheism",
    "misc.forsale"
};

private static int NGRAM_SIZE = 6;
In general, the smaller your data set, the smaller the n-gram size should be, but you can play around with different values--reasonable ranges run from 1 to 16, with 6 being a good general starting place.
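If you want to pick the size empirically, one simple approach is to retrain with several candidate sizes and compare accuracy on held-out data. Here is a minimal sketch of such a sweep; train(...) and accuracyOf(...) are hypothetical helper names standing in for the training and evaluation code shown later in this tutorial, not LingPipe API:

for (int nGram = 1; nGram <= 12; ++nGram) {
    DynamicLMClassifier<NGramProcessLM> candidate
        = DynamicLMClassifier.createNGramProcess(CATEGORIES, nGram);
    train(candidate);                         // hypothetical: handle() each training example
    double accuracy = accuracyOf(candidate);  // hypothetical: compile, evaluate, read totalAccuracy()
    System.out.printf("n-gram=%2d accuracy=%4.2f%n", nGram, accuracy);
}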
The actual classifier involves one language model per category. In this case, we are going to use process language models (LanguageModel.Process). There is a factory method in DynamicLMClassifier to construct the actual models.
DynamicLMClassifier<NGramProcessLM> classifier
    = DynamicLMClassifier.createNGramProcess(CATEGORIES, NGRAM_SIZE);
There are two other kinds of language model classifiers that may be constructed, for bounded character language models and tokenized language models.
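For reference, here is roughly what constructing those looks like. This is a sketch based on the DynamicLMClassifier factory methods as we recall them from the Javadoc (createNGramBoundary and createTokenized, the latter taking a tokenizer factory such as IndoEuropeanTokenizerFactory.INSTANCE from com.aliasi.tokenizer), so check the signatures before relying on them:

// Boundary character language models pad each text with begin/end markers,
// which can help with short texts.
DynamicLMClassifier<NGramBoundaryLM> boundaryClassifier
    = DynamicLMClassifier.createNGramBoundary(CATEGORIES, NGRAM_SIZE);

// Tokenized language models work over tokens rather than characters and
// require a tokenizer factory.
DynamicLMClassifier<TokenizedLM> tokenClassifier
    = DynamicLMClassifier.createTokenized(CATEGORIES,
                                          IndoEuropeanTokenizerFactory.INSTANCE,
                                          NGRAM_SIZE);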
Training a classifier simply involves providing examples of text for the various categories. This is done through the handle method, after first constructing a classification from the category and a classified object from the classification and text:
Classification classification
    = new Classification(CATEGORIES[i]);
Classified<CharSequence> classified
    = new Classified<CharSequence>(text, classification);
classifier.handle(classified);
That's all you need to train up a language model classifier. Now we can see what it can do with some evaluation data.
Classifying News Articles
The DynamicLMClassifier is pretty slow when doing classification, so it is generally worth going through a compile step to produce a more efficient compiled version, which classifies character sequences into joint classification results. A simple way to do that, as in the demo code, is:
JointClassifier<CharSequence> compiledClassifier
    = (JointClassifier<CharSequence>)
      AbstractExternalizable.compile(classifier);
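If you want to reuse the compiled classifier across runs, it can also be compiled to disk rather than in memory. Here is a sketch, assuming the AbstractExternalizable.compileTo and readObject utilities behave as we recall from the Javadoc; the model file name is our own invention:

// Serialize the compiled form of the classifier to a file.
File modelFile = new File("newsClassifier.model");   // hypothetical file name
AbstractExternalizable.compileTo(classifier, modelFile);

// Later, possibly in another JVM, read the compiled classifier back in.
JointClassifier<CharSequence> loadedClassifier
    = (JointClassifier<CharSequence>)
      AbstractExternalizable.readObject(modelFile);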
Now the rubber hits the road and we can see how well the machine learning is doing. The example code both reports classifications to the console and evaluates the performance. The crucial lines of code are:
JointClassification jc = compiledClassifier.classifyJoint(text);
String bestCategory = jc.bestCategory();
String details = jc.toString();
The text is an article that was not trained on, and the JointClassification is the result of evaluating the text against all the language models. It contains a bestCategory() method that returns the name of the highest-scoring language model for the text. Just to be sure that some statistics are involved, the toString() method dumps out all the results, which are presented as:
Testing on soc.religion.christian/21417
Best Cat: soc.religion.christian
Rank Cat                      Score  P(Cat|In)  log2 P(Cat,In)
0=soc.religion.christian      -1.56  0.45       -1.56
1=talk.religion.misc          -2.68  0.20       -2.68
2=alt.atheism                 -2.70  0.20       -2.70
3=misc.forsale                -3.25  0.13       -3.25
Scoring Accuracy
The remaining API of note is how the system is scored against a gold standard, in this case our testing data. Since we know which newsgroup each article came from, we can evaluate how well the software is doing with the JointClassifierEvaluator class.
boolean storeInputs = true;
JointClassifierEvaluator<CharSequence> evaluator
    = new JointClassifierEvaluator<CharSequence>(compiledClassifier,
                                                 CATEGORIES,
                                                 storeInputs);
This class wraps the compiledClassifier in an evaluation framework that provides very rich reporting of how well the system is doing. Later in the code it is populated with data points through its handle() method, after first constructing a classified object as for training:
Classification classification
    = new Classification(CATEGORIES[i]);
Classified<CharSequence> classified
    = new Classified<CharSequence>(text, classification);
evaluator.handle(classified);
This will get a JointClassification for the text and then keep track of the results for reporting later. After all the data has been run, many methods exist to see how well the software did. In the demo code we just print out the total accuracy via the ConfusionMatrix class, but it is well worth looking at the relevant Javadoc to see what reporting is available.
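For example, here is a sketch of pulling a bit more detail out of the evaluator once the test data has been run. The totalAccuracy() and confidence95() calls are the ones the demo uses; matrix() is shown on the assumption that it returns the raw reference-by-response counts, as we recall from the ConfusionMatrix Javadoc:

ConfusionMatrix cm = evaluator.confusionMatrix();
System.out.println("Total accuracy: " + cm.totalAccuracy());
System.out.println("95% interval +/-: " + cm.confidence95());

// Raw counts: row = reference (true) category, column = response (assigned)
// category, in the same order as the CATEGORIES array.
int[][] counts = cm.matrix();
for (int i = 0; i < counts.length; ++i)
    System.out.println(CATEGORIES[i] + " " + java.util.Arrays.toString(counts[i]));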
Cross-Validation
Running Cross-Validation
There's an ant target crossValidateNews which cross-validates the news classifier over 10 folds. Here's what a run looks like:
> cd $LINGPIPE/demos/tutorial/classify
> ant crossValidateNews

Reading data.
Num instances=250.
Permuting corpus.
FOLD    ACCU
    0   1.00 +/- 0.00
    1   0.96 +/- 0.08
    2   0.84 +/- 0.14
    3   0.92 +/- 0.11
    4   1.00 +/- 0.00
    5   0.96 +/- 0.08
    6   0.88 +/- 0.13
    7   0.84 +/- 0.14
    8   0.88 +/- 0.13
    9   0.84 +/- 0.14
This reports that there are 250 training examples. With 10 folds, that makes 225 training and 25 test cases per fold. The accuracy for each fold is reported along with the 95% normal approximation to the binomial confidence interval per run (with no smoothing on the binomial estimate, hence the 0.00 intervals for folds 0 and 4, which had perfect accuracy). The moral of this story is that small training sizes lead to large variance.
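The arithmetic behind those intervals is easy to check by hand. Here is the normal approximation worked out for one fold (our own arithmetic, not a LingPipe call):

// Normal approximation to the binomial 95% confidence interval for one fold:
// half-width = z * sqrt(p * (1 - p) / n) with z = 1.96.
double p = 0.84;   // fold accuracy (21 of 25 test cases correct)
int n = 25;        // test cases per fold
double halfWidth = 1.96 * Math.sqrt(p * (1.0 - p) / n);
System.out.printf("%4.2f +/- %4.2f%n", p, halfWidth);   // prints 0.84 +/- 0.14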
Cross-validation is a means of using a single corpus to train and evaluate without deciding ahead of time how to carve the data into test and training portions. This is often used for evaluation, but more properly should be used only for development.
How Cross-Validation Works
Cross-validation divides a corpus into a number of evenly sized portions called folds. Then for each fold, the data not in the fold is used to train a classifier which is then evaluated on the current fold. The results are then pooled across the folds, which greatly reduces the variance in the evaluation, reflected in narrower confidence intervals.
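For instance, with the 250-item corpus from the run above and 10 folds, the test slices look roughly like this (the exact boundary arithmetic inside XValidatingObjectCorpus may differ slightly; this is only to illustrate the partition):

int numItems = 250;
int numFolds = 10;
for (int fold = 0; fold < numFolds; ++fold) {
    // Items [start, end) form the test fold; everything else is training data.
    int start = fold * numItems / numFolds;
    int end = (fold + 1) * numItems / numFolds;
    System.out.printf("fold %d: test items %d..%d%n", fold, start, end - 1);
}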
Implementing a Cross-Validating Corpus
LingPipe supplies a convenient corpus.Corpus class which is meant to be used for generic training and testing applications like cross-validation. The corpus class is typed based on the handler type H intended to handle its data. The basis of the corpus class is a pair of methods, visitTrain(H) and visitTest(H), which send the handler every training instance or every testing instance, respectively.

LingPipe implements cross-validation for evaluation with the class corpus.XValidatingObjectCorpus. This corpus implementation just stores the data in parallel lists and uses them to implement the corpus's visit-train and visit-test methods.
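To make the contract concrete, here is a minimal hand-rolled corpus over two in-memory lists. It is purely illustrative (TwoListCorpus is our own name, not part of LingPipe), and it assumes Corpus's visitTrain and visitTest signatures are as we recall them from the Javadoc:

// A hypothetical corpus backed by two lists; the handler type is
// ObjectHandler<Classified<CharSequence>>, the same interface the
// classifier and evaluator implement.
public class TwoListCorpus
    extends Corpus<ObjectHandler<Classified<CharSequence>>> {

    private final List<Classified<CharSequence>> mTrain;
    private final List<Classified<CharSequence>> mTest;

    public TwoListCorpus(List<Classified<CharSequence>> train,
                         List<Classified<CharSequence>> test) {
        mTrain = train;
        mTest = test;
    }

    @Override
    public void visitTrain(ObjectHandler<Classified<CharSequence>> handler) {
        for (Classified<CharSequence> c : mTrain)
            handler.handle(c);   // send every training instance to the handler
    }

    @Override
    public void visitTest(ObjectHandler<Classified<CharSequence>> handler) {
        for (Classified<CharSequence> c : mTest)
            handler.handle(c);   // send every test instance to the handler
    }
}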
Permuting Inputs
It is critical in evaluating classifiers to pay attention to correlations in the corpus. The 20 newsgroups data is organized by category, so each category's examples lie in a contiguous run; a naive 10% cross-validation fold would therefore remove most or all of a category's training data at once.
To solve this problem, the cross-validating corpus implementation includes a method to permute the corpus using a supplied instance of java.util.Random.
We implemented the randomizer with a fixed seed so that experiments would be repeatable. Change the seed to get a different set of runs. You should see the variance even more clearly after more runs.
Cross-Validation Implementation
The command-line implementation for cross-validating is in src/CrossValidateNews.java.
The code mostly repeats the simple classifier code. First, we create a cross-validating corpus, then add all of the data from both the training and test directories:
XValidatingObjectCorpus<Classified<CharSequence>> corpus
    = new XValidatingObjectCorpus<Classified<CharSequence>>(NUM_FOLDS);

for (String category : CATEGORIES) {
    Classification c = new Classification(category);

    File trainCatDir = new File(TRAINING_DIR, category);
    for (File trainingFile : trainCatDir.listFiles()) {
        String text = Files.readFromFile(trainingFile, "ISO-8859-1");
        Classified<CharSequence> classified
            = new Classified<CharSequence>(text, c);
        corpus.handle(classified);
    }

    File testCatDir = new File(TESTING_DIR, category);
    for (File testFile : testCatDir.listFiles()) {
        String text = Files.readFromFile(testFile, "ISO-8859-1");
        Classified<CharSequence> classified
            = new Classified<CharSequence>(text, c);
        corpus.handle(classified);
    }
}
The corpus is then permuted using a new random number generator, which randomizes any order-related correlations in the text:
long seed = 42L;
corpus.permuteCorpus(new Random(seed));
Note that we have fixed the seed value for the random number generator. Choosing another one would produce a different shuffling of the inputs.
Now that the corpus is created, we loop over the folds, evaluating each one using the methods supplied by the corpus:
for (int fold = 0; fold < NUM_FOLDS; ++fold) {
    corpus.setFold(fold);

    DynamicLMClassifier<NGramProcessLM> classifier
        = DynamicLMClassifier.createNGramProcess(CATEGORIES, NGRAM_SIZE);
    corpus.visitTrain(classifier);

    JointClassifier<CharSequence> compiledClassifier
        = (JointClassifier<CharSequence>)
          AbstractExternalizable.compile(classifier);

    boolean storeInputs = true;
    JointClassifierEvaluator<CharSequence> evaluator
        = new JointClassifierEvaluator<CharSequence>(compiledClassifier,
                                                     CATEGORIES,
                                                     storeInputs);
    corpus.visitTest(evaluator);

    System.out.printf("%5d %4.2f +/- %4.2f\n",
                      fold,
                      evaluator.confusionMatrix().totalAccuracy(),
                      evaluator.confusionMatrix().confidence95());
}
For each fold, the fold is first set on the corpus. Then a trainable classifier is created and the corpus is used to train it through the visitTrain() method. The classifier is then compiled and used to construct an evaluator, and the evaluator is run over the test cases by the corpus method visitTest(). Finally, the resulting accuracy and 95% confidence interval are printed.
Leave-One-Out Evaluations
The limit of cross-validation is when each fold consists of a single example. This is called "leave one out" (LOO) evaluation. It is easily achieved in the general corpus implementation by setting the number of folds equal to the number of data points. The only potential problem is rounding errors in arithmetic, so leave-one-out evaluations are typically done with specialized implementations. Also, in doing leave-one-out, there is no point in compiling the classifier before running it.
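Here is a sketch of what a leave-one-out run might look like with the cross-validating corpus. It assumes size(), setNumFolds(), and the evaluator's setClassifier() method behave as we recall from the Javadoc, so verify those calls against the API before relying on them:

// One test item per fold turns cross-validation into leave-one-out.
corpus.setNumFolds(corpus.size());
boolean storeInputs = false;
JointClassifierEvaluator<CharSequence> evaluator
    = new JointClassifierEvaluator<CharSequence>(null, CATEGORIES, storeInputs);
for (int fold = 0; fold < corpus.size(); ++fold) {
    corpus.setFold(fold);
    DynamicLMClassifier<NGramProcessLM> classifier
        = DynamicLMClassifier.createNGramProcess(CATEGORIES, NGRAM_SIZE);
    corpus.visitTrain(classifier);
    // No compile step: the dynamic classifier is used directly on one item.
    evaluator.setClassifier(classifier);
    corpus.visitTest(evaluator);
}
System.out.println("LOO accuracy: "
                   + evaluator.confusionMatrix().totalAccuracy());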
References
For a general introduction to cross-validation, see:
- Wikipedia: Cross-validation
For a survey of statistical classification and examples of classifiers using character language models, see:
- W. J. Teahan. 2000. Text classification using minimum cross-entropy. In RIAO 2000.
- Fuchun Peng, Dale Schuurmans and Shaojun Wang. 2003. Language and task independent text categorization with simple language models. In Proceedings of HLT-NAACL 2003.
- Fuchun Peng, Xiangji Huang, Dale Schuurmans, and Shaojun Wang. 2003. Text classification in Asian languages without word segmentation. In Proceedings of the Sixth International Workshop on Information Retrieval with Asian Languages.
- Fuchun Peng, Dale Schuurmans, Vlado Keselj and Shaojun Wang. 2003. Language independent authorship attribution using character level language models. In Proceedings of EACL 2003.
- F. Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys 34(1):1--47.