What is Sentiment Analysis?
Sentiment analysis involves classifying text based on its sentiment. While this may mean many things, in this tutorial, we focus on two types of classification problem:
- Subjective (opinion) vs. Objective (fact) sentences
- Positive (favorable) vs. Negative (unfavorable) movie reviews
How is it Done?
The high-level idea is to use LingPipe's language classification framework to do two classification tasks: separating subjective from objective sentences, and separating positive from negative movie reviews. In the third section, we show how to build a hierarchical classifier by composing these models.
Whose Idea was This?
This tutorial essentially reimplements the basic classifiers and then the hierarchical classification technique described in Bo Pang and Lillian Lee's 2004 ACL paper "A sentimental education."
Downloading Training Corpora
Luckily for us, Lillian Lee and Bo Pang have provided annotated slices of movie review data for polarity (both boolean and scalar), and subjectivity. These three datasets are described at:
We will be using the subjectivity and boolean polarity data:
Pang and Lee's Data

| Data Set | Data | Read Me | Description |
|---|---|---|---|
| Polarity v2.0 | Data (3.1MB) | README | 1000 positive, 1000 negative full text movie reviews. Drawn from IMDB's archive of rec.arts.movies.reviews. Heuristic scripts used to extract first review score from text. |
| Subjectivity v1.0 | Data (508KB) | README | 5000 "objective", 5000 "subjective" sentences. Objective from Internet Movie Database (IMDB) plot summaries, subjective from Rotten Tomatoes customer review "snippets". |
Each data file should be downloaded and then unpacked. They are distributed in tarred/gzipped format. For instance, here's the result of unpacking both archives in the directory to which they were downloaded:
> tar -xzf review_polarity.tar.gz
> tar -xzf rotten_imdb.tar.gz
> ls
plot.tok.gt9.5000
poldata.README.2.0
quote.tok.gt9.5000
review_polarity.tar.gz
rotten_imdb.tar.gz
subjdata.README.1.0
txt_sentoken
We will use POLARITY_DIR to refer to the directory
where the reviews were unpacked.
2. Basic Polarity Analysis
We begin with a simple classification exercise, amounting to training and testing our basic classifiers on (different slices of) the boolean polarity data. The resulting classifier is able to judge whether a whole movie review is essentially positive or negative (as defined by the data set curators).
Running the Polarity Classifier
Assuming the data is in the directory
POLARITY_DIR and the sentimentDemo.jar file
exists, the demo may be run from the command line (all one line, with
colon (:) classpath separators replacing the semicolons
(;) on Linux):
java
-cp "sentimentDemo.jar;
../../../lingpipe-3.5.1.jar"
PolarityBasic POLARITY_DIR
or through Ant with the target polarity and a system
property determining the polarity directory's location:
ant -DpolarityDir=POLARITY_DIR polarity
This produces the following output after running for about a minute and a quarter on my desktop machine:
BASIC POLARITY DEMO
Data Directory=e:\data\pang-lee-polarity\txt_sentoken
Training.
# Training Cases=1800
# Training Chars=6989652
Evaluating.
# Test Cases=200
# Correct=163
% Correct=0.815
This result is very encouraging. We've set LingPipe up with its default recommended n-gram length, 8, and the resulting classification accuracy is within .014 of the best accuracy reported in Pang, Lee and Vaithyanathan's 2002 EMNLP paper "Thumbs up?".
In Appendix 1: Confidence Intervals, we show how to compute confidence intervals for these results.
Stepping through the Code
In this section, we step through the source found in the file src/PolarityBasic.java. The program reads the training directory location from the command line, trains a classifier on the training data, then evaluates the classifier on the test data.
Main to run
As usual, our main method constructs an instance using
the command-line arguments, then runs it. If any errors are thrown, it
prints their stack traces.
public static void main(String[] args) {
    try {
        new PolarityBasic(args).run();
    } catch (Throwable t) {
        System.out.println("Thrown: " + t);
        t.printStackTrace(System.out);
    }
}
Constructor to Marshal Arguments
Also following our standard operating procedure (SOP), the constructor sets up the member variables using the command-line arguments:
File mPolarityDir;
String[] mCategories;
DynamicLMClassifier mClassifier;
PolarityBasic(String[] args) {
    mPolarityDir = new File(args[0],"txt_sentoken");
    mCategories = mPolarityDir.list();
    int nGram = 8;
    // the factory takes the category names and the n-gram length
    mClassifier
        = DynamicLMClassifier
          .createNGramProcess(mCategories,nGram);
}
First, the directory is just set to be the directory named
txt_sentoken relative to the top-level polarity data
directory given as the first command-line argument. The category
array is initialized using the directory names under
txt_sentoken, which in this case are
"pos" and "neg". We set
the n-gram length to the constant 8; this could obviously be set with
a command-line argument if desired. We use the factory to construct
an n-gram process classifier with the specified categories and n-gram
size. Recall that the process models are normalized for a given
input length and do not model boundaries of strings differently
than other positions.
Training
The run method simply calls training then evaluation:
void run() throws ClassNotFoundException,
                  IOException {
    train();
    evaluate();
}
We consider training in this section and evaluation in the next. Here's the training method without the code to count the number of cases and characters or the code to print results:
void train() throws IOException {
    for (int i = 0; i < mCategories.length; ++i) {
        String category = mCategories[i];
        File dir = new File(mPolarityDir,mCategories[i]);
        File[] trainFiles = dir.listFiles();
        for (int j = 0; j < trainFiles.length; ++j) {
            File trainFile = trainFiles[j];
            if (isTrainingFile(trainFile)) {
                String review
                    = Files.readFromFile(trainFile);
                mClassifier.train(category,review);
            }
        }
    }
}
This method runs through the categories, of which there are two in
this demo. It then creates a directory using the polarity data
directory and the name of the category. This only works for this demo
because the data is organized into directories by category. Then, the
potential training files are listed and iterated. For each training
file, a test is done to see if it is a training file. If it is, then
the text is read from the file using the LingPipe utility method
Files.readFromFile, and then used to train the classifier
for the specified category.
The only mystery is how we determine if a file is a training
file. Lee and Pang were generous enough to pre-slice the files
into ten equally-sized slices which are distinguished by the
third character of the file name. For instance, the file
pos/cv362_15341.txt is a positive training instance
in block 3, whereas pos/cv532_6522.txt is a positive
training instance in block 5. We just decided to train on blocks
0 through 8 and test on block 9, so the method is just:
boolean isTrainingFile(File file) {
    return file.getName().charAt(2) != '9'; // test on fold 9
}
If you want to see results for other slices, just swap out the
9 for any digit between 0 and
8.
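Rather than editing the constant, the held-out fold could be passed in as a parameter. The following is a minimal sketch of that idea, not part of the demo source; the class name FoldSplit and both method names are hypothetical, and it works on file names rather than File objects to stay self-contained.

```java
// Hypothetical helper; the demo itself hardwires fold 9.
final class FoldSplit {

    // Returns the fold digit stored as the third character of a
    // file name such as "cv362_15341.txt".
    static int foldOf(String fileName) {
        return Character.digit(fileName.charAt(2), 10);
    }

    // A file is used for training unless it belongs to the held-out fold.
    static boolean isTrainingFile(String fileName, int testFold) {
        return foldOf(fileName) != testFold;
    }

    public static void main(String[] args) {
        String name = "cv362_15341.txt";
        System.out.println("fold=" + foldOf(name));
        System.out.println("train when testing on 9: "
                           + isTrainingFile(name, 9));
        System.out.println("train when testing on 3: "
                           + isTrainingFile(name, 3));
    }
}
```

With this shape, looping testFold from 0 to 9 would give one train/test division per block.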
Evaluation
The evaluation code follows the same structure as the training code:
void evaluate() throws IOException {
    int numTests = 0;
    int numCorrect = 0;
    for (int i = 0; i < mCategories.length; ++i) {
        String category = mCategories[i];
        File file = new File(mPolarityDir,mCategories[i]);
        File[] testFiles = file.listFiles();
        for (int j = 0; j < testFiles.length; ++j) {
            File testFile = testFiles[j];
            if (!isTrainingFile(testFile)) {
                String review
                    = Files.readFromFile(testFile);
                ++numTests;
                Classification classification
                    = mClassifier.classify(review);
                String resultCategory
                    = classification.bestCategory();
                if (resultCategory.equals(category))
                    ++numCorrect;
            }
        }
    }
}
The code that walks the categories and files is the same as in the training loop described in the last section. The remaining code begins by setting the counters for the number of tests and the number of correct answers to zero. Then, as each review is processed, the number of tests is incremented. Then the classifier is used to produce a classification for a review string in a single line. Next, the classification's best category is extracted as the result category of classification. If the result category matches the test category, the number of correct classifications is incremented.
The final results are then printed with this code:
System.out.println(" # Test Cases="
                   + numTests);
System.out.println(" # Correct="
                   + numCorrect);
System.out.println(" % Correct="
                   + ((double)numCorrect)
                     /(double)numTests);
There are, of course, more efficient ways to write this code, and ways to refactor so that the cut-and-pasted code is also shared, but we leave those improvements as exercises to the reader.
Basic Subjectivity Analysis
This section covers a second form of sentiment analysis, namely
determining if a sentence is "objective" or
"subjective" (again, as defined by the database curators).
It follows pretty much the same pattern as the last example, with a
slightly different data format, the addition of the classifier
evaluation framework from com.aliasi.classify, and a step
to compile the model to a file for later use. The advantage of the
evaluation framework is that it can not only tell right from wrong,
but also distinguish several shades of grey.
Running the Subjectivity Classifier
Assuming the data is in the directory
POLARITY_DIR and the sentimentDemo.jar file
exists (if it doesn't, run ant jar to create it), the
demo may be run from the command line:
java
-cp "sentimentDemo.jar;
../../../lingpipe-3.5.1.jar"
SubjectivityBasic
POLARITY_DIR
or through Ant with the target subjectivity and a system
property determining the polarity directory's location:
ant -DpolarityDir=POLARITY_DIR subjectivity
This produces the following output after about 45 seconds of chugging away on my desktop:
BASIC SUBJECTIVITY DEMO
Data Directory=e:\data\pang-lee-polarity
Training.
# Sentences plot=5000
# Sentences quote=5000
Compiling.
Model file=subjectivity.model
# Training Cases=9000
# Training Chars=1160539
Evaluating.
CLASSIFIER EVALUATION
Categories=[plot, quote]
Total Count=1000
Total Correct=921
Total Accuracy=0.921
...
We'll step through the output (which runs to more than a page) a
bite-sized chunk at a time; the ellipses (...) indicate
that the report is continued. The first line of the report indicates
the name of the categories. In this case, it's plot for
"objective" sentences (drawn from plot summaries) and
quote for "subjective" sentences (drawn from
user-review snippets). Next comes the same accuracy report as we
computed by hand in the last demo. This performance is much better at
92% accuracy than the polarity classification results we saw in the
last section.
After the basic accuracy report, the confusion matrix is presented in a format to provide easy inclusion into a spreadsheet or other graphing package such as gnuplot.
...
Confusion Matrix
reference \ response
,plot,quote
plot,458,42
quote,37,463
...
This matrix represents the count of all reference/response pairs. The reference categories are read down the left and the response categories along the top. For this demo, the reference is the "gold standard" defined by the database curators and the response is the first-best category produced by the classifier we just trained. Reading the results, there are 458 test cases that were classified as plots by the reference and plots by the response. There are 42 cases that were labeled as plots in the gold standard, but were misclassified as quotes by the classifier. On the next row, there are 463 cases that were labeled quotes in the gold standard that were correctly classified as quotes by our classifier. In addition, there were 37 cases labeled as quotes in the gold standard that our classifier mislabeled as plots.
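As a quick check on how the matrix relates to the accuracy line, the total accuracy is just the diagonal of the matrix divided by the sum of all cells. Here is a minimal sketch using the counts printed above; the class and method names are made up for illustration and are not part of LingPipe.

```java
// Illustrative only: recomputes accuracy from the confusion matrix.
final class ConfusionAccuracy {

    // Accuracy is the diagonal (reference == response) over the total.
    static double accuracy(int[][] matrix) {
        int correct = 0;
        int total = 0;
        for (int i = 0; i < matrix.length; ++i) {
            for (int j = 0; j < matrix[i].length; ++j) {
                total += matrix[i][j];
                if (i == j) correct += matrix[i][j];
            }
        }
        return correct / (double) total;
    }

    public static void main(String[] args) {
        // Rows are reference categories and columns are responses,
        // both in the order plot, quote.
        int[][] matrix = { { 458, 42 }, { 37, 463 } };
        System.out.println("accuracy=" + accuracy(matrix));
    }
}
```

The 458 + 463 = 921 correct cases out of 1000 give the 0.921 reported as Total Accuracy.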
The remainder of the output is discussed in Appendix 2: Extended Classifier Evaluation.
Stepping through the Code
The code for the subjectivity classifier in src/SubjectivityBasic.java is almost identical to that of the polarity classifier, so we only focus on a few differences in this section.
Splitting the Training File
Unlike in the polarity demo, where every case was included in a separate file, here every case is on a single line within a file, so the code's a bit different:
for (int i = 0; i < mCategories.length; ++i) {
    String category = mCategories[i];
    File file = new File(mPolarityDir,
                         mCategories[i] + ".tok.gt9.5000");
    String data = Files.readFromFile(file);
    String[] sentences = data.split("\n");
    int numTraining = (sentences.length * 9) / 10;
    for (int j = 0; j < numTraining; ++j) {
        String sentence = sentences[j];
        mClassifier.train(category,sentence);
    }
}
Here, we construct a file using the specified pattern and then read all of the data from the file. We then split on newlines to derive the sentences. The number of training instances is set to 90% of the input data. We then just loop over the training data instances and train as before.
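The split itself can be tried in isolation. The sketch below applies the same integer arithmetic to a toy string; the class name SentenceSplit and the placeholder sentences are assumptions for illustration only.

```java
import java.util.Arrays;

// Illustrative only: the 90/10 split on a toy input.
final class SentenceSplit {

    // Index of the first test sentence; the first 90% (rounded down)
    // of the lines are used for training.
    static int numTraining(int numSentences) {
        return (numSentences * 9) / 10;
    }

    public static void main(String[] args) {
        String data = "s0\ns1\ns2\ns3\ns4\ns5\ns6\ns7\ns8\ns9";
        String[] sentences = data.split("\n");
        int cut = numTraining(sentences.length);
        String[] train = Arrays.copyOfRange(sentences, 0, cut);
        String[] test
            = Arrays.copyOfRange(sentences, cut, sentences.length);
        System.out.println(train.length + " training, "
                           + test.length + " test");
    }
}
```

Because the division truncates, any remainder goes to the test portion; with 5000 sentences per category the split is exactly 4500 and 500.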
Writing the Model to a File
The remaining code in the train() method simply compiles
the model to a file:
FileOutputStream fileOut = new FileOutputStream("subjectivity.model");
ObjectOutputStream objOut = new ObjectOutputStream(fileOut);
mClassifier.compileTo(objOut);
objOut.close();
This actually transforms the format, with the resulting model being
much faster at runtime. A more robust implementation would handle the
close in a finally block that made sure the file
output stream was closed to ensure no dangling file pointers were
held.
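On Java 7 or later, one way to get that robustness is a try-with-resources statement, which closes the streams even if an exception is thrown. The following is only a sketch of the pattern: it serializes a placeholder string with writeObject instead of calling compileTo, and the class name SafeModelIo is hypothetical.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Illustrative only: stream handling that always closes the file.
final class SafeModelIo {

    // Writes a serializable object; both streams are closed on exit
    // from the try block, whether or not an exception was thrown.
    static void write(File file, Object model) throws IOException {
        try (FileOutputStream fileOut = new FileOutputStream(file);
             ObjectOutputStream objOut = new ObjectOutputStream(fileOut)) {
            objOut.writeObject(model);
        }
    }

    // Reads the object back with the same guarantee.
    static Object read(File file)
        throws IOException, ClassNotFoundException {
        try (FileInputStream fileIn = new FileInputStream(file);
             ObjectInputStream objIn = new ObjectInputStream(fileIn)) {
            return objIn.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("placeholder", ".model");
        file.deleteOnExit();
        write(file, "stand-in for a compiled model");
        System.out.println(read(file));
    }
}
```

In the demo, the body of the write method would call mClassifier.compileTo(objOut) in place of writeObject.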
Evaluation
The evaluation code this time is even simpler with the use of the evaluator.
void evaluate() throws IOException {
    ClassifierEvaluator evaluator
        = new ClassifierEvaluator(mClassifier,
                                  mCategories);
    for (int i = 0;
         i < mCategories.length; ++i) {
        String category = mCategories[i];
        File file = new File(mPolarityDir,
                             mCategories[i]
                             + ".tok.gt9.5000");
        String data = Files.readFromFile(file);
        String[] sentences = data.split("\n");
        int numTraining = (sentences.length * 9) / 10;
        for (int j = numTraining;
             j < sentences.length; ++j) {
            evaluator.addCase(category,sentences[j]);
        }
    }
    System.out.println(evaluator.toString());
}
The first line simply creates an evaluator from the classifier and the
array of categories. The code that reads and splits the data is identical
to that in the training method. In particular, note that it
calculates the 10 percent of the data on which to test and then as
each sentence is encountered, it is added as a case to the evaluator
using the method ClassifierEvaluator.addCase(String,Object).
Evaluation cases consist of the reference result, in this case
category, and the input, in this case
sentences[j]. Because the evaluator has a handle on the
classifier, it just runs the classifier over the input and records the
results. When the evaluation loop is done, we just call the
toString() method on the evaluator to print out the
results. Note that the category supplied to the evaluator is only
used for evaluation purposes; the classifier will be used to perform
classification on the input sentence without reference to the
reference category. The resulting scored classification is then
added as an evaluation case with the specified reference category
for computing results.
Hierarchical Polarity Analysis
Pang and Lee (2004) introduce a hierarchical approach to classification. Specifically, they use the subjectivity classifier to extract subjective sentences from reviews to be used for polarity classification. Hierarchical models are quite common in the classification and general statistics and machine learning literatures.
Running the Hierarchical Classifier
Note: The basic subjectivity demo must be run first to create the model file that will be used by the hierarchical model.
The hierarchical classifier is run just like the other demos,
either through the Ant target hierarchical (with the same
system property setting the data directory), or by
the command:
java
-cp "sentimentDemo.jar;
../../../lingpipe-3.5.1.jar"
PolarityHierarchical
POLARITY_DIR
This produces the following output after a minute or so on my desktop:
HIERARCHICAL POLARITY DEMO
Data Directory=E:\data\pang-lee-polarity\txt_sentoken
Reading Compiled Model
Training.
# Training Cases=1800
# Training Chars=6989652
CLASSIFIER EVALUATION
Categories=[neg, pos]
Total Count=200
Total Correct=170
Total Accuracy=0.85
Confusion Matrix
reference \ response
,neg,pos
neg,82,18
pos,12,88
...
A later line in the report is very telling (see Appendix 2: Extended Classifier Evaluation for more information on these reports):
... Average Conditional Probability Reference=0.5088414431366002 ...
What this is telling us is that the classifier was not at all confident about most of its decisions. When this is the case, classifiers tend to have a very high variance with respect to parameter settings.
Although the classifier isn't very confident on average, what happens when we sort the output by confidence and return answers we are most confident in first? That's the report that we find in the conditional one-versus-all report (this time for the positive category):
...
Conditional One Versus All
Area Under PR Curve (interpolated)=0.9161714425673629
...
Average Precision=0.913313371341476
Maximum F(1) Measure=0.864321608040201
...
This is telling us that even though we're not particularly confident about our decisions on average, the ranking is in fact useful. In fact, this tells us that by setting a threshold other than 0.50 for classification, we could achieve an 86.4% f-measure on this task. This looks substantially better than the 85% reported above, but is not very significant as we indicate in the Appendix 1: Confidence Intervals. Even so, 85% is higher than our standalone simple classifier, and 86.4% is the best hierarchical performance reported by Pang and Lee using SVMs to classify polarity stacked on top of a naive Bayes subjectivity classifier.
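For comparison with the maximum F(1) figure, the F(1) measure at the default threshold can be worked out from the confusion matrix printed earlier. For the pos category there are 88 true positives, 18 false positives (neg reviews labeled pos) and 12 false negatives. The sketch below uses the standard definition of F(1) as the harmonic mean of precision and recall; the class name is made up.

```java
// Illustrative only: F(1) from first-best counts.
final class FMeasure {

    // Harmonic mean of precision and recall.
    static double f1(int truePos, int falsePos, int falseNeg) {
        double precision = truePos / (double) (truePos + falsePos);
        double recall = truePos / (double) (truePos + falseNeg);
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Counts for category pos from the hierarchical confusion matrix.
        System.out.println("F(1) for pos=" + f1(88, 18, 12));
    }
}
```

This works out to 176/206, or roughly 0.854, so the threshold tuning implied by the maximum F(1) report buys about one point of F-measure.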
Inspecting the Code
The source for this demo is in the file src/PolarityHierarchical.java. It's identical to the first demo in the way that it steps through data, so we don't repeat that code here.
Construction: Reading in the Model
The constructor in this implementation does all of the work of the one
in the basic polarity demo in setting up member variables (indicated
by ellipses (...)). It also reads in the subjectivity
model from a hardwired file named subjectivity.model:
...
Classifier mSubjectivityClassifier;
PolarityHierarchical(String[] args)
    throws ClassNotFoundException, IOException {
    ...
    File modelFile = new File("subjectivity.model");
    System.out.println("\nReading model from file="
                       + modelFile);
    FileInputStream fileIn
        = new FileInputStream(modelFile);
    ObjectInputStream objIn
        = new ObjectInputStream(fileIn);
    mSubjectivityClassifier
        = (Classifier) objIn.readObject();
    objIn.close();
}
This code raises the possibility of either an IOException
from the I/O or a ClassNotFoundException from reading in
the actual classifier from the object input stream.
Training
The next step is to train the polarity classifier. It does this in
exactly the same way as in the basic polarity demo using the method
train(). This brings up the possibility of only training
on the subjective sentences of the training data, but when we did
this, we found it hurt rather than helped performance, so we do not
include that technique here. This does illustrate another aspect of
the nearly limitless fiddling that's possible with these kinds of
models.
Evaluation
The evaluation is set up similarly to the subjectivity demo with a slight twist:
void evaluate() throws IOException {
    ClassifierEvaluator evaluator
        = new ClassifierEvaluator(null,mCategories);
    for (int i = 0; i < mCategories.length; ++i) {
        String category = mCategories[i];
        File file = new File(mPolarityDir,mCategories[i]);
        File[] trainFiles = file.listFiles();
        for (int j = 0; j < trainFiles.length; ++j) {
            File trainFile = trainFiles[j];
            if (!isTrainingFile(trainFile)) {
                String review = Files.readFromFile(trainFile);
                String subjReview = subjectiveSentences(review);
                Classification classification
                    = mClassifier.classify(subjReview);
                evaluator.addClassification(category,
                                            classification);
            }
        }
    }
    System.out.println();
    System.out.println(evaluator.toString());
}
As in previous examples, the code that walks the categories and test
files is unchanged.
The new code creates a classifier evaluator with a null
classifier. This is because we don't actually have an implementation
of the Classifier interface (see the next section for an
illustration of how to create one). Instead, we'll create
classifications on our own and add them to the evaluation. This is
illustrated in the remaining new code. Here, a string
subjReview is the result of applying the method
subjectiveSentences to the review. This extracts the
subjective sentences and returns them as a string. We then create a
classification using the filtered input subjReview.
Finally, we add it as a case to the evaluator. This illustrates how
the evaluator may be used without an embedded classifier: cases are
just added in terms of the reference category and the response
classification.
The meat of this implementation is in pulling out the subjective sentences. There are many ways this can be configured to run. We implement a technique that reduces a review to 5 to 25 sentences. This will be the five most subjective sentences as ranked by conditional probability of the subjectivity model, as well as up to 20 more sentences if they are 50% or more likely to be subjective according to the subjectivity model.
static int MIN_SENTS = 5;
static int MAX_SENTS = 25;
String subjectiveSentences(String review) {
    String[] sentences = review.split("\n");
    BoundedPriorityQueue pQueue
        = new BoundedPriorityQueue(Scored.SCORE_COMPARATOR,
                                   MAX_SENTS);
    for (int i = 0; i < sentences.length; ++i) {
        String sentence = sentences[i];
        ConditionalClassification subjClassification
            = (ConditionalClassification)
              mSubjectivityClassifier
              .classify(sentences[i]);
        double subjProb;
        if (subjClassification.category(0)
            .equals("quote"))
            subjProb = subjClassification
                       .conditionalProbability(0);
        else
            subjProb = subjClassification
                       .conditionalProbability(1);
        pQueue.add(new ScoredObject(sentence,
                                    subjProb));
    }
    StringBuffer reviewBuf = new StringBuffer();
    Iterator it = pQueue.iterator();
    for (int i = 0; it.hasNext(); ++i) {
        ScoredObject so = (ScoredObject) it.next();
        if (so.score() < 0.5 && i >= MIN_SENTS) break;
        reviewBuf.append(so.getObject() + "\n");
    }
    String result = reviewBuf.toString().trim();
    return result;
}
The first line simply breaks the review into sentences. We then
create a priority queue of objects ordered by score with a maximum
size of the maximum number of sentences returned. Then we just
iterate over the sentences and classify them with the subjectivity
classifier we created in the constructor. The result is cast to a
conditional classification, which allows us to extract conditional
probabilities. The probability that the sentence is subjective is
then set into the variable subjProb. This is a bit
tricky because we have to determine if the category quote
(meaning "subjective") is the first-best or second-best
response and pull out the correct probability. We then add a scored
object to the queue with its object set to be the current sentence and
its score the conditional probability of that sentence being
subjective. The priority queue keeps the items in ranked order up
to the specified maximum.
The next batch of code creates a buffer into which to append the
subjective sentences. We create an iterator over the elements in the
priority queue, which returns the item in order of highest estimated
probability of being subjective. If we already have five sentences
and the probability of being subjective is less than half
(0.5), we break out of the loop. Otherwise, we append the
next sentence. Finally, we return the result after trimming any
residual whitespace (probably just the final newline; not a very
efficient way to do this at all, but these data sets are minuscule).
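The selection rule can be exercised without LingPipe by supplying the subjectivity probabilities directly. The sketch below is an assumption-laden stand-in: it sorts indices with the standard library instead of using BoundedPriorityQueue, and the class name SubjectiveFilter and the toy probabilities are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: the 5-to-25 sentence rule on precomputed scores.
final class SubjectiveFilter {
    static final int MIN_SENTS = 5;
    static final int MAX_SENTS = 25;

    // sentences[i] has estimated subjectivity probability probs[i].
    static List<String> select(String[] sentences, double[] probs) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < sentences.length; ++i) order.add(i);
        // Most subjective first.
        order.sort(Comparator.comparingDouble((Integer i) -> -probs[i]));
        List<String> kept = new ArrayList<>();
        for (int rank = 0;
             rank < order.size() && rank < MAX_SENTS; ++rank) {
            int i = order.get(rank);
            // Past the minimum, stop at the first unlikely sentence.
            if (probs[i] < 0.5 && rank >= MIN_SENTS) break;
            kept.add(sentences[i]);
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] sentences = { "s0", "s1", "s2", "s3", "s4", "s5", "s6" };
        double[] probs = { 0.9, 0.8, 0.1, 0.2, 0.3, 0.4, 0.45 };
        System.out.println(select(sentences, probs));
    }
}
```

In the toy input only two sentences exceed 0.5, so the minimum of five applies and the five highest-scoring sentences are kept.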
Building a Classifier Implementation
In order to play nicely with the rest of LingPipe, we should really
define our hierarchical classifier to implement the interface
classify.Classifier. This is actually very straightforward,
but we didn't want to confuse the earlier code with tricky Java
particulars like inner class interface implementations.
We could either create a class, or we can just create one inline, as
if we were javax.swing programmers writing GUI event
handlers:
Classifier hierClassifier = new Classifier() {
    public Classification classify(Object input) {
        String review = input.toString();
        String subjReview
            = subjectiveSentences(review);
        return mClassifier.classify(subjReview);
    }
};
We could also build a class that read both models in from a file, or took both classifiers as parameters, or any number of other solutions in-between. By constructing a hierarchical classifier in any of these ways, it may be supplied to an evaluator.
Plug-and-Play Classifiers
Although we've used 8-gram character language model classifiers, it's easy to plug-and-play any of LingPipe's other classifiers.
Modifying the Existing Classes
Naive Bayes can be evaluated by importing the relevant classes and setting the classifier in the constructor:
import com.aliasi.classify.NaiveBayesClassifier;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
...
PolarityHierarchical() {
    ...
    TokenizerFactory factory
        = new IndoEuropeanTokenizerFactory();
    int charNGramLength = 0;
    int numChars = 128;
    mClassifier
        = new NaiveBayesClassifier(mCategories,
                                   factory,
                                   charNGramLength,
                                   numChars);
    ...
Run this way, the subjectivity classifier still uses character 8-grams, but the polarity classifier uses naive Bayes with no character language model smoothing. The number of characters (here set to 128) determines the amount of penalty per character for unknown words.
It's also possible to experiment with token n-grams. And, of course, each of these classifiers has a few parameters to tweak. Each model may also be pruned, for further control. Just don't be fooled by overtrained a posteriori results on a test set. You're unlikely to be called out on this in an ACL paper, but the results are rarely achievable in practice.
We'll end with some words of warning on efficiency and performance. If you use naive Bayes or token n-gram models, you should probably compile them first. Uncompiled naive Bayes takes almost ten minutes to complete the classification demo. The act of compiling them to a file actually precomputes almost all of the probabilities; reading them back in computes the suffix tree backoffs. The compiled classifiers should work in exactly the same way as the classifier from which they were compiled.
The second word of warning is that naive Bayes as implemented here isn't accurate for this task. Typically, naive Bayes as used in classifiers is smoothed using something like add-one (Laplace) smoothing. This is what Pang and Lee do for their naive Bayes baseline. The way to implement add-one smoothing over LingPipe's naive Bayes implementation is to collect all of the tokens during the first training pass in a set. Then they may be added as training data for both categories. This approach may also be helpful for the basic character-level models, but they're usually more robust with respect to smoothing than token models, which is just another reason why we prefer them for most applications.
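To see why adding every token once to every category amounts to add-one smoothing, consider a bare unigram count model. The sketch below is not LingPipe's naive Bayes; the class name, the whitespace tokenization and the toy tokens are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: add-one smoothing by augmenting the training data.
final class AddOneByAugmentation {
    private final Map<String, Integer> counts = new HashMap<>();
    private int total = 0;

    void train(String token) {
        counts.merge(token, 1, Integer::sum);
        ++total;
    }

    // Plain relative frequency; zero for unseen tokens.
    double prob(String token) {
        return counts.getOrDefault(token, 0) / (double) total;
    }

    public static void main(String[] args) {
        AddOneByAugmentation pos = new AddOneByAugmentation();
        AddOneByAugmentation neg = new AddOneByAugmentation();
        Set<String> vocabulary = new HashSet<>();
        // First pass: ordinary training, collecting every token seen.
        for (String t : "great great fun".split(" ")) {
            pos.train(t);
            vocabulary.add(t);
        }
        for (String t : "dull".split(" ")) {
            neg.train(t);
            vocabulary.add(t);
        }
        // Second pass: each vocabulary token once for both categories.
        for (String t : vocabulary) {
            pos.train(t);
            neg.train(t);
        }
        System.out.println("P(great|neg)=" + neg.prob("great"));
    }
}
```

After the second pass each estimate equals (count + 1) / (total + V), where V is the vocabulary size, so a token never seen in a category no longer gets zero probability.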
Cross-Validating Factory Implementation
For those who are comfortable reading Java code based on factory
interfaces, there's an implementation of cross-validation that
evaluates several different kinds of models in the file src/PolarityWhole.java.
Cross-validation performs multiple divisions of the data into training
and test sets and then averages the results in order to bring down
evaluation variance in order to tighten confidence intervals. It can
be run just like the other methods either from the command line using
class PolarityWhole or from Ant using the target
whole.
Appendix 1: Confidence Intervals
As usual, we compute 95% confidence intervals for a given accuracy
of p over N trials using the binomial
distribution binomial(p,N), which is just the
distribution corresponding to N independent trials each
with a p chance of success. The deviation of the binomial
distribution is:
dev(binomial(p,N)) = sqrt(p*(1-p)/N)
Consider the basic polarity evaluation, for which accuracy is 81.5%, or 0.815 over 200 cases. This leads to a deviation of
dev(binomial(0.815,200)) = sqrt(0.815 * (1.0 - 0.815)/200) = 0.0275
This is a huge deviation, primarily because there are only 200 test
cases. A 95% confidence interval is roughly plus or minus 1.6
deviations, or about +/- 0.044.
In interval terms, we're 95% confident our true performance is in
the interval (77.1, 85.9). In layman's terms, we're
not particularly confident about our results for the basic polarity
evaluation.
If we had 2000
tests rather than 200 (ten times as much data), that number would be
+/- 0.014, a factor of sqrt(10) less
due to 10 times the amount of data.
We have a much tighter bound for our basic subjectivity demo. There, we have 1000 test cases and a 92.1% accuracy, leading to:
dev(binomial(0.921,1000)) = sqrt(0.921 * (1.0 - 0.921)/1000) = 0.00853
So for those results, our 95% confidence interval is the
narrower (90.7, 93.5).
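These calculations are easy to script. The sketch below follows the formula given above and takes the multiplier as a parameter; the value 1.6 in the example is the one used in this appendix, and the class and method names are hypothetical.

```java
// Illustrative only: deviation and interval for an accuracy estimate.
final class BinomialInterval {

    // Deviation of accuracy p measured over n test cases.
    static double deviation(double p, int n) {
        return Math.sqrt(p * (1.0 - p) / n);
    }

    // Interval of plus or minus z deviations around p.
    static double[] interval(double p, int n, double z) {
        double halfWidth = z * deviation(p, n);
        return new double[] { p - halfWidth, p + halfWidth };
    }

    public static void main(String[] args) {
        double z = 1.6; // multiplier used in the text
        double[] polarity = interval(0.815, 200, z);
        double[] subjectivity = interval(0.921, 1000, z);
        System.out.println(String.format("polarity: (%.3f, %.3f)",
                                         polarity[0], polarity[1]));
        System.out.println(String.format("subjectivity: (%.3f, %.3f)",
                                         subjectivity[0], subjectivity[1]));
    }
}
```

Passing a different multiplier or test-set size shows directly how the interval narrows as the number of test cases grows.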
Pang and Lee used paired t-tests over cross-validated slices of data for significance, but we can't use that tighter technique to compare our results to theirs because we don't have access to their results for the requisite pairing.
Appendix 2: Extended Classifier Evaluation
This appendix dives more deeply into the statistical analysis of the results. It picks up where the discussion of the subjectivity evaluation report left off in the Basic Subjectivity Analysis section.
Traditional Classification Statistics
The report continues with a range of standard statistics that have been applied to classification problems:
...
Random Accuracy=0.5
Random Accuracy Unbiased=0.5000125
kappa=0.8420000000000001
kappa Unbiased=0.8419960499012477
kappa No Prevalence =0.8420000000000001
...
The most popular statistic here is the kappa statistic, which we present in all three forms: "standard", adjusted for "bias", and adjusted for "prevalence". See the class documentation for more information about these statistics. Suffice it to say here that this kappa value is well within the rule-of-thumb range expected for "reliable" classification.
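For readers who want to see where the standard kappa comes from, the sketch below recomputes it from the subjectivity confusion matrix using the usual textbook definition: accuracy corrected for the agreement expected by chance. That this matches the report is an observation about the numbers, not a statement about LingPipe's internals; the class name is made up.

```java
// Illustrative only: Cohen's kappa from a square confusion matrix.
final class KappaSketch {

    static double total(int[][] m) {
        double sum = 0.0;
        for (int[] row : m)
            for (int cell : row) sum += cell;
        return sum;
    }

    // Fraction of cases on the diagonal.
    static double accuracy(int[][] m) {
        double correct = 0.0;
        for (int c = 0; c < m.length; ++c) correct += m[c][c];
        return correct / total(m);
    }

    // Chance agreement: sum over categories of
    // P(reference = c) * P(response = c).
    static double randomAccuracy(int[][] m) {
        double total = total(m);
        double sum = 0.0;
        for (int c = 0; c < m.length; ++c) {
            double ref = 0.0;
            double resp = 0.0;
            for (int k = 0; k < m.length; ++k) {
                ref += m[c][k];  // row c holds the reference counts
                resp += m[k][c]; // column c holds the response counts
            }
            sum += (ref / total) * (resp / total);
        }
        return sum;
    }

    static double kappa(int[][] m) {
        double random = randomAccuracy(m);
        return (accuracy(m) - random) / (1.0 - random);
    }

    public static void main(String[] args) {
        int[][] m = { { 458, 42 }, { 37, 463 } };
        System.out.println("random accuracy=" + randomAccuracy(m));
        System.out.println("kappa=" + kappa(m));
    }
}
```

With an accuracy of 0.921 and a chance agreement of 0.5, this gives (0.921 - 0.5) / 0.5 = 0.842.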
The next basic statistics report on information-theoretic measures about the marginal and conditional distributions produced by the training data and the results:
...
Reference Entropy=1.0
Response Entropy=0.9999278640456615
Cross Entropy=1.0000721383590225
Joint Entropy=1.3983977819792184
Conditional Entropy=0.39839778197921816
Mutual Information=0.6015300820664435
Kullback-Liebler Divergence=7.213835902255758E-5
...
The conditional entropy and mutual information statistics are the most informative. The conditional entropy statistic tells us how many additional bits we'd need on average to encode the classification result given the reference category. This number will be 0.0 if there is perfect classification. The mutual information statistic just presents the response entropy minus the conditional entropy.
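These relationships can be checked against the subjectivity confusion matrix. The sketch below assumes the usual base-2 entropy over the cell and marginal proportions, takes conditional entropy as joint entropy minus reference entropy, and takes mutual information as response entropy minus conditional entropy; the class name is invented.

```java
// Illustrative only: entropy statistics from the confusion matrix.
final class EntropyStats {

    // Entropy in bits of a discrete distribution.
    static double entropy(double[] probs) {
        double sum = 0.0;
        for (double p : probs)
            if (p > 0.0) sum -= p * (Math.log(p) / Math.log(2.0));
        return sum;
    }

    public static void main(String[] args) {
        // Proportions from the matrix plot,458,42 / quote,37,463.
        double[] reference = { 0.500, 0.500 };
        double[] response = { 0.495, 0.505 };
        double[] joint = { 0.458, 0.042, 0.037, 0.463 };

        double conditional = entropy(joint) - entropy(reference);
        double mutual = entropy(response) - conditional;
        System.out.println("joint entropy=" + entropy(joint));
        System.out.println("conditional entropy=" + conditional);
        System.out.println("mutual information=" + mutual);
    }
}
```

The printed values should agree with the report above to within floating-point rounding.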
These are then followed by some statistical measures,
including a chi-squared independence test (Pearson's
χ² statistic) and a few other
fairly widely used statistics, which are explained in the
class documentation:
...
chi Squared=709.034903490349
chi-Squared Degrees of Freedom=1
phi Squared=0.709034903490349
Cramer's V=0.8420421031577632
lambda A=0.842
lambda B=0.8404040404040404
...
The final bit of this section of the report includes results about average performance from a ranked, scored, conditional and joint probability perspective. The average reference rank indicates the average position on the n-best list provided by a ranked classifier of the gold standard answer. The average score of the reference is just that -- the average score returned by the scored classifier for the reference category; note that because language model classification uses a joint probability classification scheme, the log2 joint probability of the reference is the same as the score (although it is expressed as a cross-entropy rate). The average conditional probability says that on average, the classifier was only 55.8% confident in its answer.
...
Average Reference Rank=0.079
Average Score Reference=-1.9029423566865242
Average Conditional Probability Reference=0.5575955442436352
Average Log2 Joint Probability Reference=-1.9029423566865242
...
One-Versus-All Evaluations
Next up are one-versus-all evaluations for each category. These indicate performance on a category-by-category basis. This isn't so interesting in our case, because both categories perform about equally well, which is typical in two-category problems with roughly balanced false positives and false negatives.
A one-versus-all evaluation is created by reducing an n-way
confusion matrix to a two-way confusion matrix between a given
category and everything else. (Like the other statistics, this is
thoroughly explained in the class documentation.) Even for a two-way
classification problem, these reports provide some interesting insight
into the classification problem. Here's the initial piece of the
report for the category of objective sentences, plot:
...
ONE VERSUS ALL EVALUATIONS BY CATEGORY
CATEGORY[0]=plot
First-Best Precision/Recall Evaluation
Total=1000
True Positive=458
False Negative=42
False Positive=37
True Negative=463
Positive Reference=500
Positive Response=495
Negative Reference=500
Negative Response=505
Accuracy=0.921
Recall=0.916
Precision=0.9252525252525252
Rejection Recall=0.926
Rejection Precision=0.9168316831683169
F(1)=0.9206030150753768
Fowlkes-Mallows=497.49371855331003
Jaccard Coefficient=0.8528864059590316
Yule's Q=0.9854499831466986
Yule's Y=0.8422896535427215
Reference Likelihood=0.5
Response Likelihood=0.495
Random Accuracy=0.5
Random Accuracy Unbiased=0.5000125
kappa=0.8420000000000001
kappa Unbiased=0.8419960499012477
kappa No Prevalence=0.8420000000000001
chi Squared=709.034903490349
phi Squared=0.709034903490349
Accuracy Deviation=0.008529888627643386
...
This is just a confusion matrix report based on the number
of true positives, false negatives, false positives and
true negatives. For the plot category, recall
was slightly lower than precision.
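Most of the figures in this report are simple ratios of the four counts. As a minimal sketch, assuming the standard definitions of each measure, the following recomputes a handful of them for the plot category. The class and method names are hypothetical and not taken from LingPipe.

```java
// Illustrative sketch; OneVersusAllSketch is a made-up name, not a LingPipe class.
class OneVersusAllSketch {

    // Fraction of positive responses that are correct.
    static double precision(long tp, long fp) {
        return (double) tp / (tp + fp);
    }

    // Fraction of positive reference cases that were found.
    static double recall(long tp, long fn) {
        return (double) tp / (tp + fn);
    }

    // Recall for the "everything else" side of the one-versus-all split.
    static double rejectionRecall(long tn, long fp) {
        return (double) tn / (tn + fp);
    }

    // Harmonic mean of precision and recall, written in terms of counts.
    static double f1(long tp, long fn, long fp) {
        return 2.0 * tp / (2.0 * tp + fp + fn);
    }

    // True positives over everything that is positive in reference or response.
    static double jaccard(long tp, long fn, long fp) {
        return (double) tp / (tp + fn + fp);
    }

    // Yule's Q: (ad - bc) / (ad + bc) on the 2x2 table.
    static double yulesQ(long tp, long fn, long fp, long tn) {
        double ad = (double) tp * tn;
        double bc = (double) fn * fp;
        return (ad - bc) / (ad + bc);
    }

    public static void main(String[] args) {
        long tp = 458, fn = 42, fp = 37, tn = 463;  // counts for category plot
        System.out.println("Recall=" + recall(tp, fn));
        System.out.println("Precision=" + precision(tp, fp));
        System.out.println("Rejection Recall=" + rejectionRecall(tn, fp));
        System.out.println("F(1)=" + f1(tp, fn, fp));
        System.out.println("Jaccard Coefficient=" + jaccard(tp, fn, fp));
        System.out.println("Yule's Q=" + yulesQ(tp, fn, fp, tn));
    }
}
```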
The one-versus-all report for the plot category
continues with histograms of rank, average rank, conditional
and joint probabilities, just as before, but broken out one-versus-all:
...
Rank Histogram=
plot,quote
458,42
Average Rank Histogram=
plot,quote
0.084,0.916
Average Score Histogram=
plot,quote
-1.9181814022265598,-2.256328242824601
Average Conditional Probability Histogram=
plot,quote
0.5579020517702641,0.442097948229736
Average Joint Probability Histogram=
plot,quote
-1.9181814022265598,-2.256328242824601
...
The report concludes with two scored precision-recall evaluations. These are the kinds of reports produced for information retrieval tasks as used, for example, in the Text Retrieval Conference (TREC).
...
Scored One Versus All
Area Under PR Curve (interpolated)=0.7906839673018342
Area Under PR Curve (uninterpolated)=0.7878864243381857
Area Under ROC Curve (interpolated)=0.7840919999999995
Area Under ROC Curve (uninterpolated)=0.7840920000000003
Average Precision=0.7878864243381849
Maximum F(1) Measure=0.7386569872958257
BEP (Precision-Recall break even point)=0.7069943289224953
...
These include cumulative statistics from the interpolated and uninterpolated precision-recall (PR) and receiver operating characteristic (ROC) curves determined as described in the scored precision-recall evaluation documentation. After the area under the curves, there is average precision, which is very similar, but only includes points which were correct in the average. The maximum F(1)-measure indicates the best possible operating point achievable by setting a threshold. The BEP indicates the best score possible when precision is equal to recall.
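To give a feel for how these ranked measures behave, here is a small sketch of average precision and maximum F(1) over a ranked list. It follows the common textbook definitions, so details such as tie handling and interpolation may differ from what LingPipe does. The class name, method names and the toy ranking are all hypothetical.

```java
// Illustrative sketch; RankedEvalSketch is a made-up name, not a LingPipe class.
class RankedEvalSketch {

    // relevant[k] is true if the k-th highest-scoring case really belongs
    // to the category. Precision is averaged over the correct positions only.
    static double averagePrecision(boolean[] relevant) {
        int hits = 0;
        double sum = 0.0;
        for (int k = 0; k < relevant.length; k++) {
            if (relevant[k]) {
                hits++;
                sum += (double) hits / (k + 1);
            }
        }
        return hits == 0 ? 0.0 : sum / hits;
    }

    // Best F(1) over all cutoffs, i.e. over all thresholds on the score.
    static double maxF1(boolean[] relevant) {
        int totalRelevant = 0;
        for (boolean r : relevant) if (r) totalRelevant++;
        int hits = 0;
        double best = 0.0;
        for (int k = 0; k < relevant.length; k++) {
            if (relevant[k]) hits++;
            if (hits == 0) continue;
            double p = (double) hits / (k + 1);
            double r = (double) hits / totalRelevant;
            best = Math.max(best, 2.0 * p * r / (p + r));
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy ranking: correct, incorrect, correct.
        boolean[] ranking = { true, false, true };
        System.out.println("Average Precision=" + averagePrecision(ranking));
        System.out.println("Maximum F(1) Measure=" + maxF1(ranking));
    }
}
```

For the toy ranking, the precision values at the two correct positions are 1/1 and 2/3, so the average precision is their mean.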
Note that the first set of results compares cases using their score,
which is just their joint log probabilities (divided by the number of
characters, and thus expressed as negative cross-entropy rates).
These results are not very good. What this means is that the joint
probabilities assigned to cases perform very well at ranking plots
versus quotes (92% accuracy), but not very well at ranking confidence
overall. For instance, in one case, there might be a score of -1.2
for plot and -1.4 for quote, whereas in
another there might be a score of -1.9 for plot and -2.1
for quote. Ranking these provides case 1 plot, then case
1 quote, then case 2 plot then case 2 quote. Even if both decisions
were right (both were indeed plots), the ranked scores suffer.
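The interleaving described above is easy to reproduce. The sketch below sorts the four hypothetical (case, category) scores from the example in descending order. The class name and labels are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch; ScoreRankingSketch is a made-up name, not a LingPipe class.
class ScoreRankingSketch {

    // Returns the labels ordered from highest to lowest score.
    static List<String> rankByScore(String[] labels, double[] scores) {
        Integer[] order = new Integer[labels.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        List<String> ranked = new ArrayList<>();
        for (int i : order) ranked.add(labels[i]);
        return ranked;
    }

    public static void main(String[] args) {
        // Scores from the example in the text, listed in arbitrary order.
        String[] labels = { "case2:quote", "case1:plot", "case2:plot", "case1:quote" };
        double[] scores = { -2.1, -1.2, -1.9, -1.4 };
        System.out.println(rankByScore(labels, scores));
    }
}
```

The wrong answer for case 1 (quote at -1.4) outranks the right answer for case 2 (plot at -1.9), which is why a single threshold on raw scores works poorly across cases.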
Usually, the conditional one-versus-all scores will be much better. These use the conditional probabilities assigned to answers to rank output. This is likely to be the better approach to setting overall thresholds in one-versus-all cases, because the scores are true conditional probabilities after the traditional Bayesian normalization.
...
Conditional One Versus All
Area Under PR Curve (interpolated)=0.9796868652411705
Area Under PR Curve (uninterpolated)=0.9792351050063459
Area Under ROC Curve (interpolated)=0.9786800000000003
Area Under ROC Curve (uninterpolated)=0.9786799999999973
Average Precision=0.9792351050063466
Maximum F(1) Measure=0.9216867469879518
BEP (Precision-Recall break even point)=0.9126984126984127
...
The average precision being much higher than the accuracy tells us that we are doing a good job ranking by confidence in that we tend to be more correct when we are more confident.
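The Bayesian normalization mentioned above just divides each category's joint probability by the sum over all categories. Here is a minimal sketch, assuming we have the total log2 joint probability for each category (not the per-character rate shown in the report). The class and method names are hypothetical, not LingPipe API.

```java
// Illustrative sketch; ConditionalSketch is a made-up name, not a LingPipe class.
class ConditionalSketch {

    // Converts log2 joint probabilities log2 P(category, text) into
    // conditional probabilities P(category | text).
    static double[] conditional(double[] log2Joint) {
        // Subtract the maximum first so that 2^x does not underflow
        // for the very negative values typical of long texts.
        double max = Double.NEGATIVE_INFINITY;
        for (double x : log2Joint) max = Math.max(max, x);
        double[] probs = new double[log2Joint.length];
        double sum = 0.0;
        for (int i = 0; i < log2Joint.length; i++) {
            probs[i] = Math.pow(2.0, log2Joint[i] - max);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) probs[i] /= sum;
        return probs;
    }

    public static void main(String[] args) {
        // Hypothetical totals: one category is one bit more likely than the other.
        double[] probs = conditional(new double[] { -10.0, -11.0 });
        System.out.println(probs[0] + ", " + probs[1]);
    }
}
```

Because the result always sums to one within a case, conditional probabilities are comparable across cases in a way that raw joint scores are not.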
Macro- and Micro-Averaged Results
The final section of the report provides macro- and micro-averaged results. These averages are over the one-versus-all results, which are broken out category-by-category at the end of the report (see below). The thing to remember is that the macro-averaged results average the one-versus-all results with each category weighted equally:
...
Macro-averaged Precision=0.9210421042104211
Macro-averaged Recall=0.921
Macro-averaged F=0.9209980249506238
...
In general, these can diverge widely from the accuracy figures if either the test data or classification results are skewed toward more populous categories (as is often the case with unbalanced training data).
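As a rough sketch of macro-averaging, the following computes per-category precision, recall and F(1) from a confusion matrix and then takes the unweighted mean of each. Averaging the per-category F(1) values, rather than combining the averaged precision and recall, is an assumption on our part, although it agrees with the figure printed above. The class name is hypothetical.

```java
// Illustrative sketch; MacroAverageSketch is a made-up name, not a LingPipe class.
class MacroAverageSketch {

    // counts[i][j] = cases with reference category i and response category j.
    // Returns {macro precision, macro recall, macro F(1)}.
    static double[] macroAverages(long[][] counts) {
        int n = counts.length;
        double sumP = 0.0, sumR = 0.0, sumF = 0.0;
        for (int k = 0; k < n; k++) {
            long tp = counts[k][k];
            long rowSum = 0, colSum = 0;
            for (int j = 0; j < n; j++) {
                rowSum += counts[k][j];  // all cases whose reference is k
                colSum += counts[j][k];  // all cases whose response is k
            }
            long fn = rowSum - tp;
            long fp = colSum - tp;
            sumP += (double) tp / (tp + fp);
            sumR += (double) tp / (tp + fn);
            sumF += 2.0 * tp / (2.0 * tp + fp + fn);
        }
        // Every category gets equal weight, however many cases it has.
        return new double[] { sumP / n, sumR / n, sumF / n };
    }

    public static void main(String[] args) {
        long[][] counts = { {458, 42}, {37, 463} };  // plot, quote
        double[] macro = macroAverages(counts);
        System.out.println("Macro-averaged Precision=" + macro[0]);
        System.out.println("Macro-averaged Recall=" + macro[1]);
        System.out.println("Macro-averaged F=" + macro[2]);
    }
}
```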
The micro-averaged results weight by case, not by category, treating all cases as equal. This is calculated by summing the one-versus-all matrices and presenting the result as a precision-recall evaluation. As noted in the classifier evaluation documentation and again in the report, micro-averaging leads to multiplying the number of cases by the number of categories and it results in a number of symmetries in counts.
...
Micro-averaged Results
the following symmetries are expected:
TP=TN, FN=FP
PosRef=PosResp=NegRef=NegResp
Acc=Prec=Rec=F
Total=2000
True Positive=921
False Negative=79
False Positive=79
True Negative=921
Positive Reference=1000
Positive Response=1000
Negative Reference=1000
Negative Response=1000
Accuracy=0.921
Recall=0.921
Precision=0.921
Rejection Recall=0.921
Rejection Precision=0.921
F(1)=0.9209999999999999
Fowlkes-Mallows=1000.0
Jaccard Coefficient=0.8535681186283596
Yule's Q=0.9853923195573459
Yule's Y=0.842
Reference Likelihood=0.5
Response Likelihood=0.5
Random Accuracy=0.5
Random Accuracy Unbiased=0.5
kappa=0.8420000000000001
kappa Unbiased=0.8420000000000001
kappa No Prevalence=0.8420000000000001
chi Squared=1417.928
phi Squared=0.708964
Accuracy Deviation=0.006031542091372652
Basically, this is just another precision-recall evaluation over aggregate data.
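The symmetries fall out directly once we sum the one-versus-all matrices. Here is a minimal sketch that reduces the confusion matrix to one two-way matrix per category and adds them up. The class and method names are hypothetical, not LingPipe API.

```java
import java.util.Arrays;

// Illustrative sketch; MicroAverageSketch is a made-up name, not a LingPipe class.
class MicroAverageSketch {

    // Reduces an n-way confusion matrix to {TP, FN, FP, TN} for category k
    // versus everything else. counts[i][j] = reference i, response j.
    static long[] oneVersusAll(long[][] counts, int k) {
        long total = 0, rowSum = 0, colSum = 0;
        for (int i = 0; i < counts.length; i++) {
            for (int j = 0; j < counts.length; j++) total += counts[i][j];
            rowSum += counts[k][i];
            colSum += counts[i][k];
        }
        long tp = counts[k][k];
        long fn = rowSum - tp;
        long fp = colSum - tp;
        long tn = total - tp - fn - fp;
        return new long[] { tp, fn, fp, tn };
    }

    // Sums the one-versus-all matrices over all categories.
    static long[] microSum(long[][] counts) {
        long[] sum = new long[4];
        for (int k = 0; k < counts.length; k++) {
            long[] m = oneVersusAll(counts, k);
            for (int i = 0; i < 4; i++) sum[i] += m[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        long[][] counts = { {458, 42}, {37, 463} };  // plot, quote
        System.out.println("plot vs. all: " + Arrays.toString(oneVersusAll(counts, 0)));
        System.out.println("micro sum:    " + Arrays.toString(microSum(counts)));
    }
}
```

With two categories, each case is counted once as a positive and once as a negative, which is why the total doubles to 2000 and the false positives and false negatives balance.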
References
- Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP Proceedings.
- Bo Pang and Lillian Lee. 2004. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. ACL Proceedings.
- Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. ACL Proceedings.