public class Statistics extends Object
Statistics
class provides static utility methods
for statistical computations.Modifier and Type  Method and Description 

static double 
chiSquaredIndependence(double[][] contingencyMatrix)
Returns the value of Pearson's C_{2}
goodness of fit statistic for independence over the specified
contingency matrix.

static double 
chiSquaredIndependence(double both,
double oneOnly,
double twoOnly,
double neither)
Returns Pearson's C_{2} goodnessoffit
statistic for independence over the specified binary
contingency matrix.

static double 
correlation(double[] xs,
double[] ys)
Returns (Pearson's) correlation coefficient between two
specified arrays.

static double 
dirichletLog2Prob(double[] alphas,
double[] xs)
Returns the log (base 2) of the probability of the specified
discrete distribution given the specified Dirichlet
distribution concentration parameters.

static double 
dirichletLog2Prob(double alpha,
double[] xs)
Returns the log (base 2) of the probability of the specified
discrete distribution given the specified uniform Dirichlet
with concentration parameters equal to the specified value.

static double 
jsDivergence(double[] p,
double[] q)
Return the JensonShannon divergence between the specified
multinomial distributions.

static double 
kappa(double observedProb,
double expectedProb)
Returns the value of the kappa statistic for the specified
observed and expected probabilities.

static double 
klDivergence(double[] p,
double[] q)
Returns the KullbackLeibler divergence of the second
specified multinomial relative to the first.

static double 
klDivergenceDirichlet(double[] xs,
double[] ys)
Returns the KullbackLeibler divergence of the second specified
Dirichlet distribution relative to the first.

static double[] 
linearRegression(double[] xs,
double[] ys)
Returns a twoelement array of lineary regression coefficients
for the specified x and y values.

static double[] 
logisticRegression(double[] xs,
double[] ys,
double maxValue)
Returns a twoelement array of logistic regression coefficients
for the specified x and y values.

static double 
mean(double[] xs)
Returns the mean of the specified array of input values.

static double[] 
normalize(double[] probabilityRatios)
Return an array of probabilities resulting from normalizing the
specified probability ratios.

static int[] 
permutation(int length)
Returns a permutation of the integers between 0 and
the specified length minus one.

static int[] 
permutation(int length,
Random random)
Returns a permutation of the integers between 0 and
the specified length minus one using the specified
randomizer.

static int 
sample(double[] cumulativeProbRatios,
Random random)
Returns a sample from the discrete distribution represented by the
specified cumulative probability ratios, using the specified random
number generator.

static double 
standardDeviation(double[] xs)
Returns the standard deviation of the specified array of input
values.

static double 
symmetrizedKlDivergence(double[] p,
double[] q)
Returns the symmetrized KLdivergence between the specified distributions.

static double 
symmetrizedKlDivergenceDirichlet(double[] xs,
double[] ys)
Returns the symmetric version of the KullbackLeibler divergence
between the Dirichlet distributions determined by the specified
parameters.

static double 
variance(double[] xs)
Returns the variance of the specified array of input values.

public static double klDivergenceDirichlet(double[] xs, double[] ys)
The KL divergence between two Dirichlet distributions with
parameters xs
and ys
of dimensionality
n
is:
whereD_{KL}(Dirichlet(xs)  Dirichlet(ys)) = ∫ p(θxs) log ( p(θxs) / p(θys) ) dθ = log Γ(Σ_{i < n} xs[i])  log Γ(Σ_{i < n} ys[i])  Σ_{i < n} log Γ(xs[i]) + Σ_{i < n} log Γ(ys[i]) + Σ_{i < n} (xs[i]  ys[i]) * (Ψ(xs[i])  Ψ(Σ_{k < n} xs[i]))
Γ
is the gamma function (see Math.log2Gamma(double)
), and where
Ψ
is the digamma function defined in Math.digamma(double)
.
This method in keeping with other informationtheoretic
methods returns the answer in bits (base 2) rather than nats (base e
).
The return is rescaled from the naturallog based divergence:
klDivergenceDirichlet(xs,ys) = D_{KL}(Dirichlet(xs)  Dirichlet(ys)) / log_{2} e
Further descriptions of the computation of KLdivergence between Dirichlets may be found in:
xs
 Dirichlet parameter for the first distribution.ys
 Dirichlet parameter for the second distribution.public static double symmetrizedKlDivergenceDirichlet(double[] xs, double[] ys)
The symmetrized KL divergence for Dirichlets is just the average of both relative divergences:
D_{SKL}(Dirichlet(xs), Dirichlet(ys)) = (D_{KL}(Dirichlet(xs)  Dirichlet(ys)) + D_{KL}(Dirichlet(ys)  Dirichlet(xs))) / 2
See the method documentation for klDivergenceDirichlet(double[],double[])
for a definition of
the relative divergence for Dirichlet distributions.
xs
 Dirichlet parameter for the first distribution.ys
 Dirichlet parameter for the second distribution.public static double klDivergence(double[] p, double[] q)
The KL divergence of a multinomial q
relative
to a multinomial p
, both with n
outcomes, is:
The value is guaranteed to be nonnegative, and will be 0.0 only if the two distributions are identicial. If any outcome has zero probability inD_{KL}(pq) = Σ_{i < n} p(i) log_{2} (p(i) / q(i))
q
and nonzero probability
in p
, the result is infinite.
KL divergence is not symmetric. That is, there are
p
and q
such that
D_{KL}(pq) !=
D_{KL}(qp)
. See symmetrizedKlDivergence(double[],double[])
and jsDivergence(double[],double[])
for symmetric variants.
KL divergence is equivalent to conditional entropy, although
it is written in the opposite order. If H(p,q)
is
the joint entropy of the distributions p
and
q
, and H(p)
is the entropy of
p
, then:
D_{KL}(pq) = H(p,q)  H(p)
p
 First multinomial distribution.q
 Second multinomial distribution.IllegalArgumentException
 If the distributions are not
the same length or have entries less than zero or greater than
1.public static double symmetrizedKlDivergence(double[] p, double[] q)
D_{SKL}(p,q) = ( D_{KL}(p,q) + D_{KL}(q,p) ) / 2
p
 First multinomial.q
 Second multinomial.IllegalArgumentException
 If the distributions are not
the same length or have entries less than zero or greater than
1.public static double jsDivergence(double[] p, double[] q)
whereD_{JS}(p,q) = ( D_{KL}(p,m) + D_{KL}(q,m) ) / 2
m
is defined as the balanced linear
interpolation (that is, the average) of p
and
q
:
The JS divergence is nonzero, equal to zero only ifm[i] = (p[i] + q[i]) / 2
p
and q
are the same distribution, and symmetric.p
 First multinomial.q
 Second multinomial.IllegalArgumentException
 If the distributions are not
the same length or have entries less than zero or greater than
1.public static int[] permutation(int length)
length
 Size of permutation to return.public static int[] permutation(int length, Random random)
length
 Size of permutation to return.random
 Randomizer to use for permutation.public static double chiSquaredIndependence(double both, double oneOnly, double twoOnly, double neither)
chiSquaredIndependence(both,oneOnly,twoOnly,neither) = chiSquaredIndependence(new double[][] { {both, oneOnly}, {twoOnly, neither} });The specified values make up the following contingency matrix for boolean outcomes labeled
One
and
Two
:
+Two Two +One both
oneOnly
One twoOnly
neither
The value both
is the count of events where both
outcomes occurred, oneOnly
for where only the
first outcome occurred, twoOnly
for where only
the second outcome occurred, and neither
for
when neither occurred. Let totalCount
be
the sum of all of the cells in the matrix.
From the contingency matrix, marginal probabilities for the two outcomes may be computed:
P(One) = (both + oneOnly) / totalCount
P(Two) = (both + twoOnly) / totalCount
If these probabilities are independent, the expected values of
the matrix cells are:
E(both)= totalCount * P(One) * P(Two)
E(oneOnly) = totalCount * P(One) * (1P(Two))
E(twoOnly) = totalCount * (1P(One)) * P(Two)
E(neither) = totalCount * (1P(One)) * (1P(Two))
These are used to derive the independence test statistic, which
is the square differences between observed and expected values
under the independence assumption, normalized by the expected
values:
C_{2}
= (both  E(both))^{2} / E(both)
+ (oneOnly  E(oneOnly))^{2} / E(oneOnly)
+ (twoOnly  E(twoOnly))^{2} / E(twoOnly)
+ (neither  E(neither))^{2} / E(neither)
Unlike the higher dimensional case, this statistic applies as a
hypothesis test only in the case when all expected values are
at least 10.both
 Count of samples of both outcomes.oneOnly
 Count of samples with the first and not the
second outcome.twoOnly
 Count of samples with the second and not the
first outcome.neither
 Count of samples with with neither outcome.IllegalArgumentException
 If any of the arguments are not
nonnegative finite numbers.public static double[] linearRegression(double[] xs, double[] ys)
{ β0, β1 }
, define a linear function:
f(x) = β1 * x + β0
The coefficients returned produce the linear function f(x)
with the smallest squared error:
sqErr(f,xs,ys) =
Σ_{i}
(f(xs[i])  ys[i])^{2}
where all sums are for 0 << i < xs.length
.
The funciton requires only a single pass through the two
arrays, with β0
and β1
given by:
β1 = n * Σ_{i} x[i] * y[i]  (Σ_{i} x[i])(Σ_{i} y[i])  n * Σ_{i} x[i]*x[i]  (Σ_{i} x[i])^{2}
whereβ0 = Σ_{i} y[i]  β1 Σ_{i} x[i]  n
n = xs.length = ys.length
.xs
 Array of x values.ys
 Parallel array of y values.IllegalArgumentException
 If the arrays are of length
less than 2, or if the arrays are not of the same length.public static double[] logisticRegression(double[] xs, double[] ys, double maxValue)
{ β0, β1 }
, define the logistic function:
with the minimum squared error. SeeL f(x) =  1 + e ^{β1 * x + β0}
linearRegression(double[],double[])
for a definition of
squared error. This function takes real values in the the open
interval (0, L)
.
Logistic regression coefficients are computed using linear regression, after transforming the y values. This is possible because of the following linear relation:
log ((L  y) / y) = β1 * x + β0
xs
 Array of x values.ys
 Array of y values.maxValue
 Maximum value of function.IllegalArgumentException
 If the maximum value is not
positive and finite, if the arrays are of length less than 2,
or if the arrays are not of the same length.public static double chiSquaredIndependence(double[][] contingencyMatrix)
(numRows1)*(numCols1)
degrees of freedom. The
higher the value, the less likely the two outcomes are
independent.
Pearson's C_{2} statistic is defined as follows:
C_{2}
= Σ_{i}
Σ_{j}
(matrix[i][j]  expected(i,j,matrix))^{2} / expectedCount(i,j,matrix)
where the expected count is the total count times the max
likelihood estimates of row i
probability times
column j
probability:
expectedCount(i,j,matrix)
= totalCount(matrix)
* rowCount(i,matrix)/totalCount(matrix)
* colCount(j,matrix)/totalCount(matrix)
= rowCount(i,matrix) * colCount(j,matrix) / totalCount(matrix)
where
rowCount(i,matrix)
= Σ_{0<=j<=numCols}
matrix[i][j]
colCount(j,matrix)
= Σ_{0<=i<=numRows}
matrix[i][j]
totalCount(matrix)
= Σ_{0<=i<=numRows}
= Σ_{0<=j<=numCols}
matrix[i][j]
The χ^{2} test is a large sample test and is only valid if all of the expected counts are at least 5. This restriction is often ignored for ranking purposes.
contingencyMatrix
 The specified contingency matrix.Illegal
 argument exception if the matrix is not rectangular
or if all values are not nonnegative finite numbers.public static double[] normalize(double[] probabilityRatios)
Warning: This method is implemented by summing the
probability ratios and then dividing each element by the sum.
Because of the limited precision of double
based
arithmetic, if the largest ratio is much larger than the next
largest ratio, then the largest normalized probability will be
one and all others will be zero. Java double values follow the
IEEE 754 arithmetic standard and thus use 52 bits for their
mantissas. Thus only ratios within
2^{52}~10^{16} of
the maximum ratio will be nonzero.
 Parameters:
probabilityRatios
 Ratios of probabilities.
 Returns:
 Probabilities resulting from normalizing the ratios.
 Throws:
IllegalArgumentException
 If the input contains a value
that is not a finite nonnegative number, or if the input does
not contain at least one nonzero entry.

kappa
public static double kappa(double observedProb,
double expectedProb)
Returns the value of the kappa statistic for the specified
observed and expected probabilities. The kappa statistic
provides a kind of adjustment for the exptected (random)
difficulty of a problem. It is defined by:
kappa(p,e) = (p  e) / (1  e)
The most typical use for kappa is in evaluating
classification problems of a machine versus a gold standard or
between two human annotators to assess interannotator
agreement.
 Parameters:
observedProb
 Observed probability.
expectedProb
 Expected probability.
 Returns:
 The value of the kappa statistic for the specified
probability and expected probability.

mean
public static double mean(double[] xs)
Returns the mean of the specified array of input values. The mean
of an array is defined by:
mean(xs)
= Σ_{i < xs.length}
xs[i] / xs.length
If the array is of length zero, the result is Double.NaN
.
 Parameters:
xs
 Array of values.
 Returns:
 Mean of array of values.

variance
public static double variance(double[] xs)
Returns the variance of the specified array of input values.
The variance of an array of values is the mean of the
squared differences between the values and the mean:
variance(xs)
= Σ_{i < xs.length}
(mean(xs)  xs[i])^{2} / xs.length
If the array is of length zero, the result is Double.NaN
.
 Parameters:
xs
 Array of values.
 Returns:
 Variance of array of values.

standardDeviation
public static double standardDeviation(double[] xs)
Returns the standard deviation of the specified array of input
values. The standard deviation is just the square root of the
variance:
standardDeviation(xs) = variance(xs)^{(1/2)}
If the array is of length zero, the result is Double.NaN
.
 Parameters:
xs
 Array of values.
 Returns:
 Standard deviation of array of values.

correlation
public static double correlation(double[] xs,
double[] ys)
Returns (Pearson's) correlation coefficient between two
specified arrays. The correlation coefficient, traditionally
notated as r^{2}
, measures the
square error between a best linear fit between the two arrays
of data. Rescaling either array by a positive constant will
not affect the result.
The square root r
of the correlation
coefficient r^{2}
is the variance
in the second array explained by a linear relation with the
the first array.
The definition of the correlation coefficient is:
correlation(xs,ys)^{2}
= sumSqDiff(xs,ys)^{2}
/ (sumSqDiff(xs,xs) * sumSqDiff(xs,xs))
where
sumSqDiff(xs,ys)
= Σ_{i<xs.length}
(xs[i]  mean(xs)) * (ys[i]  mean(ys))
and thus the terms in the denominator above reduce using:
sumSqDiffs(xs,xs)
= Σ_{i<xs.length}
(xs[i]  mean(xs))^{2}
See the following for more details:
 Eric W. Weisstein. "Correlation Coefficient." From
MathWorldA Wolfram Web Resource. http://mathworld.wolfram.com/CorrelationCoefficient.html
 Wikipedia. Pearson
productmoment correlation coefficient.
 Parameters:
xs
 First array of values.
ys
 Second array of values.
 Returns:
 The correlation coefficient of the two arrays.
 Throws:
IllegalArgumentException
 If the arrays are not the same
length.

sample
public static int sample(double[] cumulativeProbRatios,
Random random)
Returns a sample from the discrete distribution represented by the
specified cumulative probability ratios, using the specified random
number generator.
The cumulative probability ratios represent unnormalized probabilities
of generating the value of their index or less, that is, unnormalized
cumulative probabilities. For instance, consider
the cumulative probability ratios { 0.5, 0.5, 3.9, 10.1}
:
Outcome Value Unnormalized Prob Prob
0 0.5 0.5 0.5/10.1
1 0.5 0.0 0.0/10.1
2 3.9 3.4 3.4/10.1
3 10.1 6.2 6.2/10.1
A sample is taken by generating a random number x between 0.0 and
the value of the last element (here 10.1). The value returned is
the index i such that:
cumulativeProbRatios[i1] < x <= cumulativeProbRatios[i]
The corresponding probabilities given the sample input are
listed in the last column in the table above.
Note that if two
elements have the same value, there is no chance of generating
the outcome with the higher index. In the example above, this
corresponds to outcome 1 having probaiblity 0.0
Warning: The cumulative probability ratios are required to meet
two conditions. First, all values must be finite and nonnegative. Second,
the values must be nondecreasing, so that
cumulativeProbRatios[i] <= cumulativeProbRatios[i+1]
.
If either of these conditions is not met, the result is undefined.
 Parameters:
cumulativeProbRatios
 Cumulative probability for outcome less than
or equal to index.
random
 Random number generator for sampling.
 Returns:
 A sample from the specified distribution.

dirichletLog2Prob
public static double dirichletLog2Prob(double alpha,
double[] xs)
Returns the log (base 2) of the probability of the specified
discrete distribution given the specified uniform Dirichlet
with concentration parameters equal to the specified value.
See dirichletLog2Prob(double[],double[])
for
more information on the Dirichlet distribution. This method
returns a result equivalent to the following (though is more
efficiently implemented):
double[] alphas = new double[xs.length];
java.util.Arrays.fill(alphas,alpha);
assert(dirichletLog2Prob(alpha,xs) == dirichletLog2Prob(alphas,xs));
For the uniform Dirichlet, the distribution simplifies to
the following form:
p(xs  Dirichlet(α)) = (1/Z(α)) Π_{i < k} xs[i]^{α1}
where
Z(α) = Γ(α)^{k} / Γ(k * α)
Warning: The probability distribution must be proper
in the sense of having values between 0 and 1 inclusive and
summing to 1.0. This property is not checked by this method.
 Parameters:
alpha
 Dirichlet concentration parameter to use for each
dimension.
xs
 The distribution whose probability is returned.
 Returns:
 The log (base 2) probability of the specified
distribution in the uniform Dirichlet distribution with concentration
parameters equal to
alpha
.
 Throws:
IllegalArgumentException
 If the values in the distribution
are not between 0 and 1 inclusive or if the concentration parameter is
not positive and finite.

dirichletLog2Prob
public static double dirichletLog2Prob(double[] alphas,
double[] xs)
Returns the log (base 2) of the probability of the specified
discrete distribution given the specified Dirichlet
distribution concentration parameters. A Dirichlet
distribution is a distribution over k
dimensional
discrete distributions.
The Dirichlet is widely used because it is the conjugate
prior for multinomials in a Bayesian setting, and thus may
be used to specify a convenient distribution over discrete
distributions.
A Dirichlet distribution is specified by a dimensionality
k
and a concentration parameter α[i] > 0
for each i < k
.
The probability of the distribution xs
in a Dirichlet distribution of dimension k
and concentration parameters α
is
given (up to a normalizing constant) by:
p(xs  Dirichlet(α))
∝ Π_{i < k} xs[i]^{α[i]1}
The full distribution is:
p(xs  Dirichlet(α))
= (1/Z(α)) * Π_{i < k} xs[i]^{α[i]1}
where the normalizing constant is given by:
Z(α) = Π_{i < k} Γ(α[i])
/ Γ(Σ_{i < k} α[i])
Warning: The probability distribution must be proper
in the sense of having values between 0 and 1 inclusive and
summing to 1.0. This property is not checked by this method.
 Parameters:
alphas
 The concentration parameters for the uniform Dirichlet.
xs
 The outcome distribution
 Returns:
 The probability of the outcome distribution given the
Dirichlet concentratioin parameter.
 Throws:
IllegalArgumentException
 If the Dirichlet parameters and
distribution are not arrays of the same length or if the distribution
parameters in xs are not between 0 and 1 inclusive, or if any of
the concentration parameters is not positive and finite.