public class PrecisionRecallEvaluation extends Object
PrecisionRecallEvaluation collects and reports a
suite of descriptive statistics for binary classification tasks.
The basis of a precision-recall evaluation is a matrix of counts
of reference and response classifications. Each cell in the matrix
corresponds to a method returning a long integer count.
The most basic statistic is accuracy: the number of correct responses divided by the total number of cases.
                    Response
                    true                   false                  Reference Totals
Reference   true    truePositive()         falseNegative()        positiveReference()
                    (TP)                   (FN)                   (TP+FN)
            false   falsePositive()        trueNegative()         negativeReference()
                    (FP)                   (TN)                   (FP+TN)
Response Totals     positiveResponse()     negativeResponse()     total()
                    (TP+FP)                (FN+TN)                (TP+FN+FP+TN)

accuracy()
= correctResponse() / total()
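As a concrete illustration, the matrix cells and the accuracy statistic can be sketched in plain Java. This is a standalone sketch with illustrative counts (they match the Cab-vs-All example later on this page), not the class's own implementation:

```java
// Standalone sketch of the count matrix and accuracy; counts are illustrative.
public class AccuracyExample {
    // Cells of the 2x2 reference/response matrix.
    static long tp = 9, fn = 3, fp = 4, tn = 11;

    static long total()      { return tp + fn + fp + tn; }  // TP+FN+FP+TN
    static long correct()    { return tp + tn; }            // correct responses
    static double accuracy() { return correct() / (double) total(); }

    public static void main(String[] args) {
        System.out.println(accuracy());  // 20/27 ~ 0.7407
    }
}
```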
This class derives its name from the following four statistics,
which are illustrated in the four tables.
recall()
= truePositive() / positiveReference()
precision()
= truePositive() / positiveResponse()
rejectionRecall()
= trueNegative() / negativeReference()
rejectionPrecision()
= trueNegative() / negativeResponse()
Each measure divides the count of one "good" cell by the sum of that cell and its paired "bad" cell: recall and rejection recall divide within the reference rows of the matrix (TP/(TP+FN) and TN/(FP+TN)), while precision and rejection precision divide within the response columns (TP/(TP+FP) and TN/(FN+TN)).
This layout clearly illustrates the relevant dualities. Precision is the dual of recall if the reference and response are switched (the matrix is transposed). Similarly, rejection recall is dual to recall with the true and false labels switched (reflection around each axis in turn); rejection precision is similarly dual to precision.
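The four definitions reduce to simple ratios over the matrix cells. A minimal standalone sketch (class name and counts are illustrative):

```java
// Sketch of the four precision/recall statistics over illustrative counts.
public class PrEvalSketch {
    static long tp = 9, fn = 3, fp = 4, tn = 11;

    static double recall()             { return tp / (double) (tp + fn); }
    static double precision()          { return tp / (double) (tp + fp); }
    static double rejectionRecall()    { return tn / (double) (fp + tn); }
    static double rejectionPrecision() { return tn / (double) (fn + tn); }

    public static void main(String[] args) {
        System.out.println(recall());             // 9/12 = 0.75
        System.out.println(precision());          // 9/13 ~ 0.6923
        System.out.println(rejectionRecall());    // 11/15 ~ 0.7333
        System.out.println(rejectionPrecision()); // 11/14 ~ 0.7857
    }
}
```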
Precision and recall may be combined by weighted harmonic
averaging using the f-measure statistic, with β
between 0 and infinity controlling the relative
weight of precision versus recall; 1 is the neutral value.

fMeasure() = fMeasure(1)

fMeasure(β)
= (1 + β^{2}) * precision() * recall()
/ (recall() + β^{2} * precision())
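The formula can be checked with a small standalone version of the static helper (mirroring the three-argument fMeasure method; the class name is illustrative):

```java
// Weighted harmonic mean of precision and recall, parameterized by beta.
public class FMeasureSketch {
    static double fMeasure(double beta, double recall, double precision) {
        double b2 = beta * beta;
        return (1 + b2) * precision * recall / (recall + b2 * precision);
    }

    public static void main(String[] args) {
        // beta = 1 gives the familiar F1 = 2*P*R / (P + R).
        System.out.println(fMeasure(1.0, 0.75, 9.0 / 13.0));  // ~0.72
    }
}
```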
There are four other traditional measures of binary classification:

fowlkesMallows()
= truePositive() / (precision() * recall())^{(1/2)}

jaccardCoefficient()
= truePositive() / (total() - trueNegative())

yulesQ()
= (truePositive() * trueNegative() - falsePositive() * falseNegative())
/ (truePositive() * trueNegative() + falsePositive() * falseNegative())

yulesY()
= ((truePositive() * trueNegative())^{(1/2)} - (falsePositive() * falseNegative())^{(1/2)})
/ ((truePositive() * trueNegative())^{(1/2)} + (falsePositive() * falseNegative())^{(1/2)})
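A standalone sketch of the four measures, using the Cab-vs-All counts from the example tables later on this page (class name is illustrative):

```java
// Sketch of Fowlkes-Mallows, Jaccard, Yule's Q, and Yule's Y.
public class AgreementSketch {
    static long tp = 9, fn = 3, fp = 4, tn = 11;  // Cab-vs-All counts

    static double precision() { return tp / (double) (tp + fp); }
    static double recall()    { return tp / (double) (tp + fn); }

    static double fowlkesMallows() { return tp / Math.sqrt(precision() * recall()); }
    static double jaccard()        { return tp / (double) (tp + fp + fn); }
    static double yulesQ() {
        return (double) (tp * tn - fp * fn) / (tp * tn + fp * fn);
    }
    static double yulesY() {
        double a = Math.sqrt(tp * tn), b = Math.sqrt(fp * fn);
        return (a - b) / (a + b);
    }

    public static void main(String[] args) {
        System.out.println(fowlkesMallows());  // ~12.49
        System.out.println(jaccard());         // 9/16 = 0.5625
        System.out.println(yulesQ());          // 87/111 ~ 0.7838
        System.out.println(yulesY());          // ~0.4835
    }
}
```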
Replacing precision and recall with their definitions,
TP/(TP+FP) and TP/(TP+FN):

F_{1}
= 2 * (TP/(TP+FP)) * (TP/(TP+FN)) / (TP/(TP+FP) + TP/(TP+FN))
= 2 * (TP*TP / ((TP+FP)(TP+FN))) / ((TP*(TP+FN) + TP*(TP+FP)) / ((TP+FP)(TP+FN)))
= 2 * TP*TP / (TP*(TP+FN) + TP*(TP+FP))
= 2 * TP / ((TP+FN) + (TP+FP))
= 2*TP / (2*TP + FP + FN)

Thus the F_{1} measure is very closely related to the Jaccard coefficient,
TP/(TP+FP+FN). Like the Jaccard
coefficient, the F measure does not vary with varying true
negative counts. Rejection precision and recall do vary with
changes in true negative count.
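The reduced form is easy to check numerically; with the Cab-vs-All counts (TP=9, FP=4, FN=3), both quantities depend only on TP, FP, and FN:

```java
// F1 and Jaccard computed from TP, FP, FN only; TN plays no role.
public class F1JaccardCheck {
    static long tp = 9, fp = 4, fn = 3;

    static double f1()      { return 2.0 * tp / (2 * tp + fp + fn); }  // 18/25
    static double jaccard() { return (double) tp / (tp + fp + fn); }   // 9/16

    public static void main(String[] args) {
        System.out.println(f1() + " " + jaccard());  // 0.72 0.5625
    }
}
```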
Basic reference and response likelihoods are computed by frequency.
referenceLikelihood() = positiveReference() / total()
responseLikelihood() = positiveResponse() / total()
An algorithm that chose responses at random according to the
response likelihood would have the following accuracy against
test cases chosen at random according to the reference likelihood:
randomAccuracy()
= referenceLikelihood() * responseLikelihood()
+ (1 - referenceLikelihood()) * (1 - responseLikelihood())

The two summands arise from the likelihood of a true positive and the
likelihood of a true negative. From random accuracy, the
κ statistic is defined by dividing out the random accuracy
from the accuracy, giving a measure of performance
above a baseline expectation.

kappa()
= kappa(accuracy(), randomAccuracy())

kappa(p,e)
= (p - e) / (1 - e)
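A standalone sketch of random accuracy and the κ statistic, again using the Cab-vs-All counts (class name is illustrative):

```java
// Chance-corrected agreement: random accuracy and kappa.
public class KappaSketch {
    static long tp = 9, fn = 3, fp = 4, tn = 11;  // Cab-vs-All counts
    static double total = tp + fn + fp + tn;

    static double referenceLikelihood() { return (tp + fn) / total; }
    static double responseLikelihood()  { return (tp + fp) / total; }
    static double accuracy()            { return (tp + tn) / total; }

    static double randomAccuracy() {
        double ref = referenceLikelihood(), resp = responseLikelihood();
        return ref * resp + (1 - ref) * (1 - resp);
    }
    static double kappa(double p, double e) { return (p - e) / (1 - e); }

    public static void main(String[] args) {
        System.out.println(randomAccuracy());                     // ~0.5021
        System.out.println(kappa(accuracy(), randomAccuracy()));  // ~0.4792
    }
}
```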
There are two alternative forms of the κ statistic, both of which attempt to correct for putative bias in the estimation of random accuracy. The first computes the random accuracy by taking the average of the reference and response likelihoods as the baseline likelihood for both, yielding the so-called unbiased random accuracy and the unbiased κ statistic:

randomAccuracyUnbiased()
= avgLikelihood()^{2} + (1 - avgLikelihood())^{2}

avgLikelihood() = (referenceLikelihood() + responseLikelihood()) / 2

kappaUnbiased()
= kappa(accuracy(), randomAccuracyUnbiased())
Kappa can also be adjusted for the prevalence of positive reference cases, which leads to the following simple definition:
kappaNoPrevalence()
= (2 * accuracy()) - 1
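Both variants can be sketched in a few lines, again with the Cab-vs-All counts (class name is illustrative):

```java
// Unbiased and prevalence-adjusted kappa variants.
public class KappaVariantsSketch {
    static long tp = 9, fn = 3, fp = 4, tn = 11;  // Cab-vs-All counts
    static double total = tp + fn + fp + tn;

    static double accuracy()      { return (tp + tn) / total; }
    static double avgLikelihood() { return ((tp + fn) / total + (tp + fp) / total) / 2; }

    static double randomAccuracyUnbiased() {
        double avg = avgLikelihood();
        return avg * avg + (1 - avg) * (1 - avg);
    }
    static double kappaUnbiased() {
        double e = randomAccuracyUnbiased();
        return (accuracy() - e) / (1 - e);
    }
    static double kappaNoPrevalence() { return 2 * accuracy() - 1; }

    public static void main(String[] args) {
        System.out.println(kappaUnbiased());      // ~0.4789
        System.out.println(kappaNoPrevalence());  // 13/27 ~ 0.4815
    }
}
```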
Pearson's χ^{2} statistic is provided by the following method:

chiSquared()
= total() * phiSquared()

phiSquared()
= (truePositive()*trueNegative() - falsePositive()*falseNegative())^{2}
/ ((truePositive()+falseNegative()) * (falsePositive()+trueNegative()) * (truePositive()+falsePositive()) * (falseNegative()+trueNegative()))
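These two statistics are straightforward to compute from the counts; a standalone sketch with the Cab-vs-All numbers (class name is illustrative):

```java
// Pearson's chi-squared via the phi-squared association measure.
public class ChiSquaredSketch {
    static long tp = 9, fn = 3, fp = 4, tn = 11;  // Cab-vs-All counts

    static double phiSquared() {
        double num = tp * tn - fp * fn;  // 99 - 12 = 87
        return num * num
            / ((double) (tp + fn) * (fp + tn) * (tp + fp) * (fn + tn));
    }
    static double chiSquared() { return (tp + fn + fp + tn) * phiSquared(); }

    public static void main(String[] args) {
        System.out.println(phiSquared());  // ~0.2310
        System.out.println(chiSquared());  // ~6.2382
    }
}
```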
The accuracy deviation is the deviation of the average number of positive cases in a binomial distribution with accuracy equal to the classification accuracy and number of trials equal to the total number of cases.
accuracyDeviation()
= (accuracy() * (1 - accuracy()) / total())^{(1/2)}
This number can be used to provide error intervals around the
accuracy results.
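For example, the deviation for the Cab-vs-All evaluation works out as follows (class and method names here are illustrative):

```java
// Binomial standard deviation of the accuracy estimate.
public class AccuracyDeviationSketch {
    static double accuracyDeviation(double accuracy, long total) {
        return Math.sqrt(accuracy * (1 - accuracy) / total);
    }

    public static void main(String[] args) {
        double dev = accuracyDeviation(20.0 / 27.0, 27);  // Cab-vs-All numbers
        // A rough 95% interval on the accuracy is accuracy +/- 2 * dev.
        System.out.println(dev);  // ~0.0843
    }
}
```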
Using the following three tables as examples:

Cab-vs-All               Response
                         Cab     Other
Reference    Cab         9       3
             Other       4       11

Syrah-vs-All             Response
                         Syrah   Other
Reference    Syrah       5       4
             Other       4       14

Pinot-vs-All             Response
                         Pinot   Other
Reference    Pinot       4       2
             Other       1       20

The various statistics evaluate to the following values:
Method                     Cabernet   Syrah    Pinot
positiveReference()        12         9        6
negativeReference()        15         18       21
positiveResponse()         13         9        5
negativeResponse()         14         18       22
correctResponse()          20         19       24
total()                    27         27       27
accuracy()                 0.7407     0.7037   0.8889
recall()                   0.7500     0.5555   0.6666
precision()                0.6923     0.5555   0.8000
rejectionRecall()          0.7333     0.7778   0.9524
rejectionPrecision()       0.7858     0.7778   0.9091
fMeasure()                 0.7200     0.5555   0.7272
fowlkesMallows()           12.49      9.00     5.48
jaccardCoefficient()       0.5625     0.3846   0.5714
yulesQ()                   0.7838     0.6279   0.9512
yulesY()                   0.4835     0.3531   0.7269
referenceLikelihood()      0.4444     0.3333   0.2222
responseLikelihood()       0.4815     0.3333   0.1852
randomAccuracy()           0.5021     0.5556   0.6749
kappa()                    0.4792     0.3333   0.6583
randomAccuracyUnbiased()   0.5027     0.5556   0.6756
kappaUnbiased()            0.4789     0.3333   0.6575
kappaNoPrevalence()        0.4814     0.4074   0.7778
chiSquared()               6.2382     3.0000   11.8519
phiSquared()               0.2310     0.1111   0.4390
accuracyDeviation()        0.0843     0.0879   0.0605
Constructor and Description

PrecisionRecallEvaluation()
Constructs a precision-recall evaluation with all counts set to zero.

PrecisionRecallEvaluation(long tp, long fn, long fp, long tn)
Constructs a precision-recall evaluation initialized with the specified counts.
Modifier and Type  Method and Description 

double 
accuracy()
Returns the sample accuracy of the responses.

double 
accuracyDeviation()
Returns the standard deviation of the accuracy.

void 
addCase(boolean reference,
boolean response)
Adds a case with the specified reference and response
classifications.

<T> void 
addCases(Set<T> referencePositives,
Set<T> responsePositives)
Add the cases corresponding to the specified set of
reference positives and response positives.

double 
chiSquared()
Returns the χ^{2} value.

long 
correctResponse()
Returns the number of cases where the response is correct.

long 
falseNegative()
Returns the number of false negative cases.

long 
falsePositive()
Returns the number of false positive cases.

double 
fMeasure()
Returns the F_{1} measure.

double 
fMeasure(double beta)
Returns the F_{β} value for the specified β.
static double 
fMeasure(double beta,
double recall,
double precision)
Returns the F_{β} measure for the specified β, recall, and precision values.

double 
fowlkesMallows()
Returns the Fowlkes-Mallows score.

long 
incorrectResponse()
Returns the number of cases where the response is incorrect.

double 
jaccardCoefficient()
Returns the Jaccard coefficient.

double 
kappa()
Returns the value of the kappa statistic.

double 
kappaNoPrevalence()
Returns the value of the kappa statistic adjusted for
prevalence.

double 
kappaUnbiased()
Returns the value of the unbiased kappa statistic.

long 
negativeReference()
Returns the number of negative reference cases.

long 
negativeResponse()
Returns the number of negative response cases.

double 
phiSquared()
Returns the φ^{2} value.

long 
positiveReference()
Returns the number of positive reference cases.

long 
positiveResponse()
Returns the number of positive response cases.

double 
precision()
Returns the precision.

double 
randomAccuracy()
The probability that the reference and response are the same if
they are generated randomly according to the reference and
response likelihoods.

double 
randomAccuracyUnbiased()
The probability that the reference and the response are the same
if the reference and response likelihoods are both the average
of the sample reference and response likelihoods.

double 
recall()
Returns the recall.

double 
referenceLikelihood()
Returns the sample reference likelihood, or prevalence, which
is the number of positive references divided by the total
number of cases.

double 
rejectionPrecision()
Returns the rejection precision, or selectivity, value.

double 
rejectionRecall()
Returns the rejection recall, or specificity, value.

double 
responseLikelihood()
Returns the sample response likelihood, which is the number of
positive responses divided by the total number of cases.

String 
toString()
Returns a string-based representation of this evaluation.

long 
total()
Returns the total number of cases.

long 
trueNegative()
Returns the number of true negative cases.

long 
truePositive()
Returns the number of true positive cases.

double 
yulesQ()
Returns the value of Yule's Q statistic.

double 
yulesY()
Returns the value of Yule's Y statistic.

public PrecisionRecallEvaluation()
public PrecisionRecallEvaluation(long tp, long fn, long fp, long tn)
tp - True positive count.
fn - False negative count.
fp - False positive count.
tn - True negative count.
IllegalArgumentException - If any of the counts are negative.

public void addCase(boolean reference, boolean response)
reference - Reference classification.
response - Response classification.

public <T> void addCases(Set<T> referencePositives, Set<T> responsePositives)
TP: +reference, +response
FN: +reference, -response
FP: -reference, +response
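The three categories above can be recovered with ordinary set operations; a minimal standalone sketch (the sets and class name are illustrative; note that true negatives cannot be counted from the two positive sets alone, since the universe of cases is not given):

```java
import java.util.HashSet;
import java.util.Set;

public class AddCasesSketch {
    // Returns {TP, FN, FP} counts for the given positive sets.
    static <T> long[] counts(Set<T> referencePositives, Set<T> responsePositives) {
        Set<T> tp = new HashSet<>(referencePositives);
        tp.retainAll(responsePositives);   // +reference, +response
        Set<T> fn = new HashSet<>(referencePositives);
        fn.removeAll(responsePositives);   // +reference, -response
        Set<T> fp = new HashSet<>(responsePositives);
        fp.removeAll(referencePositives);  // -reference, +response
        return new long[] { tp.size(), fn.size(), fp.size() };
    }

    public static void main(String[] args) {
        long[] c = counts(Set.of("a", "b", "c"), Set.of("b", "c", "d"));
        System.out.println(c[0] + " " + c[1] + " " + c[2]);  // 2 1 1
    }
}
```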
Thanks to Erel Segal for suggesting this function and providing an initial implementation.
T - Type of elements in the sets.
referencePositives - Set of truly positive items.
responsePositives - Set of positive response items.

public long truePositive()
public long falsePositive()
public long trueNegative()
public long falseNegative()
public long positiveReference()
public long negativeReference()
public double referenceLikelihood()
public long positiveResponse()
public long negativeResponse()
public double responseLikelihood()
public long correctResponse()
public long incorrectResponse()
public long total()
public double accuracy()
public double recall()
public double precision()
public double rejectionRecall()
public double rejectionPrecision()
public double fMeasure()
Returns the F_{1} measure, which is equivalent to setting the β parameter of the method fMeasure(double) to 1.

public double fMeasure(double beta)
Returns the F_{β} value for the specified β.
beta - The β parameter.
Returns the F_{β} value.

public double jaccardCoefficient()
public double chiSquared()
public double phiSquared()
public double yulesQ()
public double yulesY()
public double fowlkesMallows()
public double accuracyDeviation()
public double randomAccuracy()
public double randomAccuracyUnbiased()
public double kappa()
public double kappaUnbiased()
public double kappaNoPrevalence()
public String toString()
public static double fMeasure(double beta, double recall, double precision)
beta - Relative weighting of precision.
recall - Recall value.
precision - Precision value.