public class ConfusionMatrix extends Object
ConfusionMatrix represents a quantitative comparison between two classifiers over a fixed set of categories on a number of test cases. For convenience, one classifier is termed the "reference" and the other the "response".
Typically the reference will be determined by a human or other so-called "gold standard", whereas the response will be the result of an automatic classification. This is how confusion matrices are created from test cases. With this confusion matrix implementation, two human classifiers or two automatic classifications may also be compared. For instance, human classifiers that label corpora for training sets are often evaluated for inter-annotator agreement; the usual form of reporting for this is the kappa statistic, which is available in three varieties from the confusion matrix. A set of systems may also be compared pairwise, such as those arising from a competitive evaluation.
Confusion matrices may be initialized on construction; with no matrix argument, they will be constructed with zero values in all cells. The values can then be incremented by category name with increment(String,String) or by category index with increment(int,int). There is also an incrementByN(int,int,int) method which allows explicit control over matrix values.
Consider the following confusion matrix, which reports on the classification of 27 wines by grape variety. The reference in this case is the true variety and the response arises from the blind evaluation of a human judge.

Many-way Confusion Matrix
                        Response
             Cabernet   Syrah   Pinot
Reference
  Cabernet       9        3       0
  Syrah          3        5       1
  Pinot          1        1       4

Each row represents the results of classifying objects belonging to the category designated by that row. For instance, the first row is the result of 12 cabernet classifications. Reading across, 9 of those cabernets were correctly classified as cabernets, 3 were misclassified as syrahs, and none were misclassified as pinot noir. In the next row are the results for 9 syrahs, 3 of which were misclassified as cabernets and 1 of which was misclassified as a pinot. Similarly, the six pinots being classified are represented on the third row. In total, the classifier categorized 13 wines as cabernets, 9 wines as syrahs, and 5 wines as pinots. The sum of all numbers in the matrix is equal to the number of trials, in this case 27. Further note that the correct answers are the ones on the diagonal of the matrix. The individual entries are recoverable using the method count(int,int). The positive and negative counts per category may be recovered from the result of oneVsAll(int).
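The row, column, and total counts described above can be cross-checked outside of this class. The following standalone Python sketch (the variable names are illustrative, not part of this API) computes them from plain arrays:

```python
# Rows are indexed by reference category and columns by response
# category, mirroring count(referenceCategoryIndex, responseCategoryIndex).
matrix = [[9, 3, 0],   # reference Cabernet
          [3, 5, 1],   # reference Syrah
          [1, 1, 4]]   # reference Pinot

reference_counts = [sum(row) for row in matrix]           # row sums
response_counts = [sum(col) for col in zip(*matrix)]      # column sums
total = sum(reference_counts)                             # number of trials
correct = sum(matrix[i][i] for i in range(len(matrix)))   # diagonal sum

print(reference_counts, response_counts, total, correct)
# [12, 9, 6] [13, 9, 5] 27 18
```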
Collective results are either averaged per category (macro average) or averaged per test case (micro average). The results reported here are for a single operating point of results. Very often in the research literature, results are returned for the best possible post-hoc system settings, established either globally or per category.
The multiple outcome classification can be decomposed into a number of one-versus-all classification problems. For each category, one constructs a classifier that categorizes objects as either belonging to that category or not. From an n-way classifier, a one-versus-all classifier can be constructed automatically by treating an object to be classified as belonging to the category if the category is the result of classifying it. For the above three-way confusion matrix, the following three one-versus-all matrices are returned as instances of PrecisionRecallEvaluation through the method oneVsAll(int):
Cab-vs-All
                 Response
              Cab   Other
Reference
  Cab          9      3
  Other        4     11

Syrah-vs-All
                 Response
             Syrah  Other
Reference
  Syrah        5      4
  Other        4     14

Pinot-vs-All
                 Response
             Pinot  Other
Reference
  Pinot        4      2
  Other        1     20

Note that each has the same true-positive count as the corresponding diagonal cell of the original confusion matrix. Further note that the sum of the cells in each derived matrix is the same as in the original matrix. Finally, note that if the original classification problem were two-way, the derived matrix would be the same as the original matrix. The results of the various precision-recall evaluation methods for these matrices are shown in the class documentation for PrecisionRecallEvaluation.
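The decomposition into one-versus-all matrices can be sketched directly; this standalone Python fragment (illustrative, not this class's API) derives each (tp, fn, fp, tn) tuple from the many-way matrix:

```python
matrix = [[9, 3, 0], [3, 5, 1], [1, 1, 4]]
total = sum(sum(row) for row in matrix)

def one_vs_all(m, k):
    """Return (tp, fn, fp, tn) for category index k."""
    tp = m[k][k]                           # diagonal cell
    fn = sum(m[k]) - tp                    # reference k, response other
    fp = sum(row[k] for row in m) - tp     # response k, reference other
    tn = total - tp - fn - fp              # everything else
    return tp, fn, fp, tn

results = [one_vs_all(matrix, k) for k in range(3)]
print(results)  # [(9, 3, 4, 11), (5, 4, 4, 14), (4, 2, 1, 20)]
```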
Macro-averaged results are just the average of the per-category results. These include precision, recall and F measure. Yule's Q and Y statistics, along with the per-category chi-squared results, are also computed based on the one-versus-all matrices.
Micro-averaged results are reported based on another derived matrix: the sum of the scores in the one-versus-all matrices. For the above case, the result given as a PrecisionRecallEvaluation by the method microAverage() is:

Sum of One-vs-All Matrices
                 Response
              True   False
Reference
  True         18      9
  False         9     45

Note that the true positive cell will be the sum of the true-positive cells of the original matrix (9+5+4=18 in the running example). A little algebra shows that the false positive cell will be equal to the sum of the off-diagonal elements in the original confusion matrix (3+3+1+1+1=9); symmetry then shows that the false negative value will be the same. Finally, the true negative cell will bring the total up to the number of categories times the sum of the entries in the original matrix (here 3*27 - 18 - 9 - 9 = 45); it is also equal to two times the number of true positives plus the number of false negatives (here 2*18 + 9 = 45). Thus for one-versus-all confusion matrices derived from many-way confusion matrices, the micro-averaged precision, recall and F measure will all be the same.
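The claim that micro-averaged precision, recall, and F measure coincide can be checked numerically; this standalone Python sketch (illustrative, not this class's API) builds the summed matrix and compares the three values:

```python
matrix = [[9, 3, 0], [3, 5, 1], [1, 1, 4]]
n = len(matrix)
total = sum(sum(row) for row in matrix)

tp = sum(matrix[i][i] for i in range(n))   # 9 + 5 + 4 = 18
fp = total - tp                            # off-diagonal sum = 9
fn = fp                                    # same off-diagonal sum by symmetry
tn = n * total - tp - fp - fn              # 3*27 - 18 - 9 - 9 = 45

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)
print(tp, fp, fn, tn)                      # 18 9 9 45
print(precision, recall, f_measure)        # all three come out to tp / total
```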
For the above confusion matrix and derived matrices, the no-argument and category-indexed methods will return the values in the following tables. The hot-linked method documentation defines each statistic in detail.
Method                          Value
categories()                    { "Cabernet", "Syrah", "Pinot" }
totalCount()                    27
totalCorrect()                  18
totalAccuracy()                 0.6667
confidence95()                  0.1778
confidence99()                  0.2341
macroAvgPrecision()             0.6826
macroAvgRecall()                0.6574
macroAvgFMeasure()              0.6676
randomAccuracy()                0.3663
randomAccuracyUnbiased()        0.3669
kappa()                         0.4740
kappaUnbiased()                 0.4735
kappaNoPrevalence()             0.3333
referenceEntropy()              1.5305
responseEntropy()               1.4865
crossEntropy()                  1.5376
jointEntropy()                  2.6197
conditionalEntropy()            1.0892
mutualInformation()             0.3973
klDivergence()                  0.007129
chiSquaredDegreesOfFreedom()    4
chiSquared()                    15.5256
phiSquared()                    0.5750
cramersV()                      0.5362
lambdaA()                       0.4000
lambdaB()                       0.3571

Method                      0 (Cabernet)   1 (Syrah)   2 (Pinot)
conditionalEntropy(int)        0.8113        1.3516      1.2516
Constructor and Description 

ConfusionMatrix(String[] categories)
Construct a confusion matrix with all zero values from the
specified array of categories.

ConfusionMatrix(String[] categories,
int[][] matrix)
Construct a confusion matrix with the specified set of
categories and values.

Modifier and Type  Method and Description 

String[] 
categories()
Return a copy of the array of categories for this confusion
matrix.

double 
chiSquared()
Returns Pearson's χ^{2} independence test
statistic for this matrix.

int 
chiSquaredDegreesOfFreedom()
Return the number of degrees of freedom of this confusion
matrix for the χ^{2} statistic.

double 
conditionalEntropy()
Returns the conditional entropy of the response distribution
against the reference distribution.

double 
conditionalEntropy(int refCategoryIndex)
Returns the entropy of the distribution of categories
in the response given that the reference category was
as specified.

double 
confidence(double z)
Returns the normal approximation of half of the binomial
confidence interval for this confusion matrix for the specified
z-score.

double 
confidence95()
Returns half the width of the 95% confidence interval for this
confusion matrix.

double 
confidence99()
Returns half the width of the 99% confidence interval for this
confusion matrix.

int 
count(int referenceCategoryIndex,
int responseCategoryIndex)
Returns the value of the cell in the matrix for the specified
reference and response category indices.

double 
cramersV()
Returns the value of Cramér's V statistic for this matrix.

double 
crossEntropy()
The cross-entropy of the response distribution against the
reference distribution.

int 
getIndex(String category)
Return the index of the specified category in the list of
categories, or
-1 if it is not a category for this
confusion matrix. 
void 
increment(int referenceCategoryIndex,
int responseCategoryIndex)
Add one to the cell in the matrix for the specified reference
and response category indices.

void 
increment(String referenceCategory,
String responseCategory)
Add one to the cell in the matrix for the specified reference
and response categories.

void 
incrementByN(int referenceCategoryIndex,
int responseCategoryIndex,
int num)
Add n to the cell in the matrix for the specified reference
and response category indices.

double 
jointEntropy()
Returns the entropy of the joint reference and response
distribution as defined by the underlying matrix.

double 
kappa()
Returns the value of the kappa statistic with chance agreement
determined by the reference distribution.

double 
kappaNoPrevalence()
Returns the value of the kappa statistic adjusted for
prevalence.

double 
kappaUnbiased()
Returns the value of the kappa statistic adjusted for bias.

double 
klDivergence()
Returns the Kullback-Leibler (KL) divergence between the
reference and response distributions.

double 
lambdaA()
Returns Goodman and Kruskal's λ_{A} index
of predictive association.

double 
lambdaB()
Returns Goodman and Kruskal's λ_{B} index
of predictive association.

double 
macroAvgFMeasure()
Returns the average F measure per category.

double 
macroAvgPrecision()
Returns the average precision per category.

double 
macroAvgRecall()
Returns the average recall per category.

int[][] 
matrix()
Return a copy of the matrix values.

PrecisionRecallEvaluation 
microAverage()
Returns the micro-averaged precision-recall evaluation.

double 
mutualInformation()
Returns the mutual information between the reference and
response distributions.

int 
numCategories()
Returns the number of categories for this confusion matrix.

PrecisionRecallEvaluation 
oneVsAll(int categoryIndex)
Returns the one-versus-all precision-recall evaluation for the
specified category index.

double 
phiSquared()
Returns the value of Pearson's φ^{2} index of mean
square contingency for this matrix.

double 
randomAccuracy()
The expected accuracy from a strategy of randomly guessing
categories according to reference and response distributions.

double 
randomAccuracyUnbiased()
The expected accuracy from a strategy of randomly guessing
categories according to the average of the reference and
response distributions.

double 
referenceEntropy()
The entropy of the decision problem itself as defined by the
counts for the reference.

double 
responseEntropy()
The entropy of the response distribution.

String 
toString()
Return a string-based representation of this confusion matrix.

double 
totalAccuracy()
Returns the proportion of responses that are correct.

int 
totalCorrect()
Returns the total number of responses that matched the
reference.

int 
totalCount()
Returns the total number of classifications.

public ConfusionMatrix(String[] categories)
The categories are copied so that subsequent changes to the array passed in will not affect the confusion matrix.
categories - Array of categories for classification.

public ConfusionMatrix(String[] categories, int[][] matrix)
For example, the many-way confusion matrix shown in the class documentation above would be initialized as:
String[] categories = new String[] { "Cabernet", "Syrah", "Pinot" };
int[][] wineTastingScores = new int[][] { { 9, 3, 0 },
                                          { 3, 5, 1 },
                                          { 1, 1, 4 } };
ConfusionMatrix matrix
    = new ConfusionMatrix(categories, wineTastingScores);
categories - Array of categories for classification.
matrix - Matrix of initial values.
Throws:
IllegalArgumentException - If the categories and matrix do not agree in dimension or the matrix contains a negative value.

public String[] categories()
The categories are indexed by getIndex(String). For a category c in the set of categories:

categories()[getIndex(c)].equals(c)

and for an index i in range:

getIndex(categories()[i]) == i

See also: getIndex(String)
public int numCategories()
numCategories() is guaranteed to be the same as categories().length and thus may be used to compute iteration bounds.

public int getIndex(String category)
Returns the index of the specified category, or -1 if it is not a category for this confusion matrix. The index is the index in the array returned by categories().

category - Category whose index is returned.
See also: categories()
public int[][] matrix()
public void increment(int referenceCategoryIndex, int responseCategoryIndex)
referenceCategoryIndex - Index of reference category.
responseCategoryIndex - Index of response category.
Throws:
IllegalArgumentException - If either index is out of range.

public void incrementByN(int referenceCategoryIndex, int responseCategoryIndex, int num)
referenceCategoryIndex - Index of reference category.
responseCategoryIndex - Index of response category.
num - Number of instances to increment by.
Throws:
IllegalArgumentException - If either index is out of range, or if the result of the increment results in a negative value in a cell.

public void increment(String referenceCategory, String responseCategory)
referenceCategory - Name of reference category.
responseCategory - Name of response category.
Throws:
IllegalArgumentException - If either category is not a category for this confusion matrix.

public int count(int referenceCategoryIndex, int responseCategoryIndex)
referenceCategoryIndex - Index of reference category.
responseCategoryIndex - Index of response category.
Throws:
IllegalArgumentException - If either index is out of range.

public int totalCount()
totalCount() = Σ_{i} Σ_{j} count(i,j)
public int totalCorrect()
totalCorrect() = Σ_{i} count(i,i)

The value is the same as that of microAverage().correctResponse().

public double totalAccuracy()
totalAccuracy() = totalCorrect() / totalCount()
Note that the classification error is just one minus the accuracy, because each answer is either true or false.

public double confidence95()
Confidence is determined as described in confidence(double) with parameter z=1.96.
public double confidence99()
Confidence is determined as described in confidence(double) with parameter z=2.58.
public double confidence(double z)
A z score represents the number of standard deviations from the mean, with the following correspondence of z score and percentage confidence intervals:

Z      Confidence +/- Z
1.65        90%
1.96        95%
2.58        99%
3.30        99.9%

Thus the z-score for a 95% confidence interval is 1.96 standard deviations. The confidence interval is just the accuracy plus or minus the z score times the standard deviation. To compute the normal approximation to the deviation of the binomial distribution, assume p=totalAccuracy() and n=totalCount(). Then the confidence interval is defined in terms of the deviation of binomial(p,n), which is defined by first taking the variance of the Bernoulli (one trial) distribution with success rate p:

variance(bernoulli(p)) = p * (1-p)

and then dividing by the number n of trials in the binomial distribution to get the variance of the binomial distribution:

variance(binomial(p,n)) = p * (1-p) / n

and then taking the square root to get the deviation:

dev(binomial(p,n)) = sqrt(p * (1-p) / n)

For instance, with p=totalAccuracy()=.90 and n=totalCount()=10000:

dev(binomial(.9,10000)) = sqrt(0.9 * (1.0 - 0.9) / 10000) = 0.003

Thus to determine the 95% confidence interval, we take z = 1.96 for a half-interval width of 1.96 * 0.003 = 0.00588. The resulting interval is just 0.90 +/- 0.00588, or roughly (.894,.906).
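The arithmetic above can be packaged as a small standalone Python helper (illustrative, not this class's confidence(double) method):

```python
from math import sqrt

def confidence(p, n, z):
    """Half-width of the normal-approximation binomial confidence interval."""
    return z * sqrt(p * (1 - p) / n)

# The worked example from the text: p = 0.90, n = 10000, z = 1.96
print(round(confidence(0.90, 10000, 1.96), 5))   # 0.00588

# The wine example: p = 18/27, n = 27, 95% interval
print(round(confidence(18 / 27, 27, 1.96), 4))   # 0.1778
```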
z - The z score, or number of standard deviations.

public double referenceEntropy()
referenceEntropy() = - Σ_{i} referenceLikelihood(i) * log_{2} referenceLikelihood(i)

where referenceLikelihood(i) = oneVsAll(i).referenceLikelihood()
public double responseEntropy()
responseEntropy() = - Σ_{i} responseLikelihood(i) * log_{2} responseLikelihood(i)

where responseLikelihood(i) = oneVsAll(i).responseLikelihood()
public double crossEntropy()
crossEntropy() = - Σ_{i} referenceLikelihood(i) * log_{2} responseLikelihood(i)

where referenceLikelihood(i) = oneVsAll(i).referenceLikelihood()
and responseLikelihood(i) = oneVsAll(i).responseLikelihood()

Note that crossEntropy() >= referenceEntropy().
The entropy of a distribution is simply the cross-entropy of the distribution with itself.

Low cross-entropy does not entail good classification, though good classification entails low cross-entropy.
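The entropy and divergence values in the summary table can be reproduced with a standalone Python sketch of these formulas (illustrative names, not this class's API):

```python
from math import log2

matrix = [[9, 3, 0], [3, 5, 1], [1, 1, 4]]
total = sum(sum(row) for row in matrix)
ref = [sum(row) / total for row in matrix]         # reference likelihoods
resp = [sum(col) / total for col in zip(*matrix)]  # response likelihoods

reference_entropy = -sum(p * log2(p) for p in ref)
response_entropy = -sum(p * log2(p) for p in resp)
cross_entropy = -sum(p * log2(q) for p, q in zip(ref, resp))
kl_divergence = cross_entropy - reference_entropy  # = sum of p * log2(p/q)

print(round(reference_entropy, 4), round(cross_entropy, 4),
      round(kl_divergence, 6))
# 1.5305 1.5376 0.007129
```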
public double jointEntropy()
jointEntropy() = - Σ_{i} Σ_{j} P'(i,j) * log_{2} P'(i,j)

where P'(i,j) = count(i,j) / totalCount()

and where by convention:

0 log_{2} 0 =_{def} 0
public double conditionalEntropy(int refCategoryIndex)
conditionalEntropy(i) = - Σ_{j} P'(j|i) * log_{2} P'(j|i)

where P'(j|i) = count(i,j) / referenceCount(i)

refCategoryIndex - Index of the reference category.

public double conditionalEntropy()
conditionalEntropy() = Σ_{i} referenceLikelihood(i) * conditionalEntropy(i)

where referenceLikelihood(i) = oneVsAll(i).referenceLikelihood()
Note that this statistic is not symmetric in that if the roles of reference and response are reversed, the answer may be different.
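The per-category values in the summary table (0.8113, 1.3516, 1.2516) and their weighted combination can be cross-checked with a standalone Python sketch (illustrative, not this class's API):

```python
from math import log2

matrix = [[9, 3, 0], [3, 5, 1], [1, 1, 4]]
total = sum(sum(row) for row in matrix)

def conditional_entropy(m, i):
    """Entropy of the response distribution given reference category i."""
    n_i = sum(m[i])
    # skip zero cells, using the convention 0 log2 0 = 0
    return -sum((c / n_i) * log2(c / n_i) for c in m[i] if c > 0)

per_category = [conditional_entropy(matrix, i) for i in range(3)]
overall = sum((sum(matrix[i]) / total) * per_category[i] for i in range(3))
print([round(h, 4) for h in per_category])  # [0.8113, 1.3516, 1.2516]
print(round(overall, 4))
```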
public double kappa()
kappa() = (totalAccuracy() - randomAccuracy()) / (1 - randomAccuracy())

The kappa statistic was introduced in:
Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20:37-46.
public double kappaUnbiased()
kappaUnbiased() = (totalAccuracy() - randomAccuracyUnbiased()) / (1 - randomAccuracyUnbiased())

The unbiased version of kappa was introduced in:
Siegel, Sidney and N. John Castellan, Jr. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill.
public double kappaNoPrevalence()
kappaNoPrevalence() = 2 * totalAccuracy() - 1

The no-prevalence version of kappa was introduced in:
Byrt, Ted, Janet Bishop and John B. Carlin. 1993. Bias, prevalence, and kappa. Journal of Clinical Epidemiology 46(5):423-429.

These authors suggest reporting the three kappa statistics defined in this class: kappa, kappa adjusted for prevalence, and kappa adjusted for bias.
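All three kappa variants can be computed directly from the wine matrix; the following standalone Python sketch mirrors the documented formulas (illustrative names, not this class's API):

```python
matrix = [[9, 3, 0], [3, 5, 1], [1, 1, 4]]
total = sum(sum(row) for row in matrix)
ref = [sum(row) / total for row in matrix]         # reference likelihoods
resp = [sum(col) / total for col in zip(*matrix)]  # response likelihoods

accuracy = sum(matrix[i][i] for i in range(3)) / total
random_acc = sum(r * s for r, s in zip(ref, resp))
random_acc_unbiased = sum(((r + s) / 2) ** 2 for r, s in zip(ref, resp))

kappa = (accuracy - random_acc) / (1 - random_acc)
kappa_unbiased = (accuracy - random_acc_unbiased) / (1 - random_acc_unbiased)
kappa_no_prevalence = 2 * accuracy - 1

print(round(kappa, 4), round(kappa_unbiased, 4), round(kappa_no_prevalence, 4))
```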
public double randomAccuracy()
randomAccuracy() = Σ_{i} referenceLikelihood(i) * responseLikelihood(i)

where referenceLikelihood(i) = oneVsAll(i).referenceLikelihood()
and responseLikelihood(i) = oneVsAll(i).responseLikelihood()
public double randomAccuracyUnbiased()
randomAccuracyUnbiased() = Σ_{i} ((referenceLikelihood(i) + responseLikelihood(i))/2)^{2}

where referenceLikelihood(i) = oneVsAll(i).referenceLikelihood()
and responseLikelihood(i) = oneVsAll(i).responseLikelihood()
public int chiSquaredDegreesOfFreedom()
For a general n×m matrix, the number of degrees of freedom is equal to (n-1)*(m-1). Because this is a square matrix of dimensions equal to the number of categories, the result is defined to be:

chiSquaredDegreesOfFreedom() = (numCategories() - 1)^{2}
public double chiSquared()
The degrees of freedom for the test are given by chiSquaredDegreesOfFreedom().
See Statistics.chiSquaredIndependence(double[][]) for definitions of the statistic over matrices.
public double phiSquared()
phiSquared() = chiSquared() / totalCount()
As with our other statistics, this is the sample value; the true contingency would be defined by the true random variables underlying the reference and response.
public double cramersV()
cramersV() = (phiSquared() / (numCategories() - 1))^{(1/2)}
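Pearson's χ², φ², and Cramér's V can be reproduced from the wine matrix with a standalone Python sketch (illustrative, not this class's API):

```python
from math import sqrt

matrix = [[9, 3, 0], [3, 5, 1], [1, 1, 4]]
n = len(matrix)
total = sum(sum(row) for row in matrix)
row_sums = [sum(row) for row in matrix]
col_sums = [sum(col) for col in zip(*matrix)]

# Pearson's chi-squared statistic against the independence hypothesis;
# the expected count for a cell is rowSum * colSum / total
chi_squared = sum(
    (matrix[i][j] - row_sums[i] * col_sums[j] / total) ** 2
    / (row_sums[i] * col_sums[j] / total)
    for i in range(n) for j in range(n))

phi_squared = chi_squared / total
cramers_v = sqrt(phi_squared / (n - 1))
degrees_of_freedom = (n - 1) ** 2

print(round(phi_squared, 4), round(cramers_v, 4), degrees_of_freedom)
```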
public PrecisionRecallEvaluation oneVsAll(int categoryIndex)
categoryIndex - Index of category.

public PrecisionRecallEvaluation microAverage()
The micro-averaged evaluation is derived by summing the one-versus-all evaluations produced by oneVsAll(int) over all category indices. See the class definition above for an example.

public double macroAvgPrecision()
macroAvgPrecision() = Σ_{i} precision(i) / numCategories()

where precision(i) = oneVsAll(i).precision()
public double macroAvgRecall()
macroAvgRecall() = Σ_{i} recall(i) / numCategories()

where recall(i) = oneVsAll(i).recall()
public double macroAvgFMeasure()
macroAvgFMeasure() = Σ_{i} fMeasure(i) / numCategories()

where fMeasure(i) = oneVsAll(i).fMeasure()

Note that this is not necessarily the same value as results from computing the F measure from the macro-averaged precision and macro-averaged recall.
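The macro-averaged values in the summary table (0.6826, 0.6574, 0.6676) can be cross-checked with a standalone Python sketch (illustrative, not this class's API):

```python
matrix = [[9, 3, 0], [3, 5, 1], [1, 1, 4]]
n = len(matrix)

precisions, recalls, f_measures = [], [], []
for k in range(n):
    tp = matrix[k][k]
    precision = tp / sum(row[k] for row in matrix)  # tp over response count
    recall = tp / sum(matrix[k])                    # tp over reference count
    precisions.append(precision)
    recalls.append(recall)
    f_measures.append(2 * precision * recall / (precision + recall))

macro_precision = sum(precisions) / n
macro_recall = sum(recalls) / n
macro_f_measure = sum(f_measures) / n
print(round(macro_precision, 4), round(macro_recall, 4),
      round(macro_f_measure, 4))  # 0.6826 0.6574 0.6676
```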
public double lambdaA()
lambdaA() = ((Σ_{j} maxReferenceCount(j)) - maxReferenceCount()) / (totalCount() - maxReferenceCount())

where maxReferenceCount(j) is the maximum count in column j of the matrix:

maxReferenceCount(j) = MAX_{i} count(i,j)

and where maxReferenceCount() is the maximum reference count:

maxReferenceCount() = MAX_{i} referenceCount(i)
Note that like conditional probability and conditional entropy, the λ_{A} statistic is asymmetric; the measure λ_{B} simply reverses the rows and columns. The probabilistic interpretation of λ_{A} is like that of λ_{B}, only reversing the role of the reference and response.
public double lambdaB()
lambdaB() = ((Σ_{i} maxResponseCount(i)) - maxResponseCount()) / (totalCount() - maxResponseCount())

where maxResponseCount(i) is the maximum count in row i of the matrix:

maxResponseCount(i) = MAX_{j} count(i,j)

and where maxResponseCount() is the maximum response count:

maxResponseCount() = MAX_{j} responseCount(j)
The probabilistic interpretation of λ_{B} is the reduction in error likelihood from knowing the specified reference category in predicting the response category. It will thus take on a value between 0.0 and 1.0, with higher values being better. Perfect association yields a value of 1.0 and perfect independence a value of 0.0.
Note that the λ_{B} statistic is asymmetric; the measure λ_{A} simply reverses the rows and columns.
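Both lambda statistics can be computed from the wine matrix; this standalone Python sketch follows the documented formulas (illustrative names, not this class's API):

```python
matrix = [[9, 3, 0], [3, 5, 1], [1, 1, 4]]
total = sum(sum(row) for row in matrix)

# lambda_A: column maxima against the largest reference (row) sum
col_maxes = [max(col) for col in zip(*matrix)]
max_reference_count = max(sum(row) for row in matrix)
lambda_a = ((sum(col_maxes) - max_reference_count)
            / (total - max_reference_count))

# lambda_B: row maxima against the largest response (column) sum
row_maxes = [max(row) for row in matrix]
max_response_count = max(sum(col) for col in zip(*matrix))
lambda_b = ((sum(row_maxes) - max_response_count)
            / (total - max_response_count))

print(round(lambda_a, 4), round(lambda_b, 4))  # 0.4 0.3571
```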
public double mutualInformation()
mutualInformation() = Σ_{i} Σ_{j} P(i,j) * log_{2} ( P(i,j) / (P_{reference}(i) * P_{response}(j)) )

where P(i,j) = count(i,j) / totalCount()
P_{reference}(i) = oneVsAll(i).referenceLikelihood()
P_{response}(j) = oneVsAll(j).responseLikelihood()
A bit of algebra shows that mutual information is the reduction in entropy of the response distribution from knowing the reference distribution:

mutualInformation() = responseEntropy() - conditionalEntropy()

In this way it is similar to the λ_{B} measure.
Mutual information is symmetric. We could also subtract the conditional entropy of the reference given the response from the reference entropy to get the same result.
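The mutual information value in the summary table (0.3973) can be reproduced with a standalone Python sketch (illustrative, not this class's API):

```python
from math import log2

matrix = [[9, 3, 0], [3, 5, 1], [1, 1, 4]]
total = sum(sum(row) for row in matrix)
ref = [sum(row) / total for row in matrix]         # reference likelihoods
resp = [sum(col) / total for col in zip(*matrix)]  # response likelihoods

# skip zero cells, using the convention 0 log2 0 = 0
mutual_information = sum(
    (matrix[i][j] / total)
    * log2((matrix[i][j] / total) / (ref[i] * resp[j]))
    for i in range(3) for j in range(3) if matrix[i][j] > 0)

print(round(mutual_information, 4))  # 0.3973
```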
public double klDivergence()
klDivergence() = Σ_{i} P_{reference}(i) * log_{2} (P_{reference}(i) / P_{response}(i))

where P_{reference}(i) = oneVsAll(i).referenceLikelihood()
and P_{response}(i) = oneVsAll(i).responseLikelihood()
Note that KL divergence is not symmetric in the reference and response
distributions.