public class Statistics extends Object

The Statistics class provides static utility methods for statistical computations.

Modifier and Type | Method and Description
---|---
static double | chiSquaredIndependence(double[][] contingencyMatrix) Returns the value of Pearson's χ² goodness-of-fit statistic for independence over the specified contingency matrix.
static double | chiSquaredIndependence(double both, double oneOnly, double twoOnly, double neither) Returns Pearson's χ² goodness-of-fit statistic for independence over the specified binary contingency matrix.
static double | correlation(double[] xs, double[] ys) Returns (Pearson's) correlation coefficient between the two specified arrays.
static double | dirichletLog2Prob(double[] alphas, double[] xs) Returns the log (base 2) of the probability of the specified discrete distribution given the specified Dirichlet distribution concentration parameters.
static double | dirichletLog2Prob(double alpha, double[] xs) Returns the log (base 2) of the probability of the specified discrete distribution given the specified uniform Dirichlet with concentration parameters equal to the specified value.
static double | jsDivergence(double[] p, double[] q) Returns the Jensen-Shannon divergence between the specified multinomial distributions.
static double | kappa(double observedProb, double expectedProb) Returns the value of the kappa statistic for the specified observed and expected probabilities.
static double | klDivergence(double[] p, double[] q) Returns the Kullback-Leibler divergence of the second specified multinomial relative to the first.
static double | klDivergenceDirichlet(double[] xs, double[] ys) Returns the Kullback-Leibler divergence of the second specified Dirichlet distribution relative to the first.
static double[] | linearRegression(double[] xs, double[] ys) Returns a two-element array of linear regression coefficients for the specified x and y values.
static double[] | logisticRegression(double[] xs, double[] ys, double maxValue) Returns a two-element array of logistic regression coefficients for the specified x and y values.
static double | mean(double[] xs) Returns the mean of the specified array of input values.
static double[] | normalize(double[] probabilityRatios) Returns an array of probabilities resulting from normalizing the specified probability ratios.
static int[] | permutation(int length) Returns a permutation of the integers between 0 and the specified length minus one.
static int[] | permutation(int length, Random random) Returns a permutation of the integers between 0 and the specified length minus one using the specified randomizer.
static int | sample(double[] cumulativeProbRatios, Random random) Returns a sample from the discrete distribution represented by the specified cumulative probability ratios, using the specified random number generator.
static double | standardDeviation(double[] xs) Returns the standard deviation of the specified array of input values.
static double | symmetrizedKlDivergence(double[] p, double[] q) Returns the symmetrized KL divergence between the specified distributions.
static double | symmetrizedKlDivergenceDirichlet(double[] xs, double[] ys) Returns the symmetric version of the Kullback-Leibler divergence between the Dirichlet distributions determined by the specified parameters.
static double | variance(double[] xs) Returns the variance of the specified array of input values.
public static double klDivergenceDirichlet(double[] xs, double[] ys)

The KL divergence between two Dirichlet distributions with parameters xs and ys of dimensionality n is:

DKL(Dirichlet(xs) || Dirichlet(ys))
  = ∫ p(θ|xs) log ( p(θ|xs) / p(θ|ys) ) dθ
  = log Γ(Σi < n xs[i]) - log Γ(Σi < n ys[i])
    - Σi < n log Γ(xs[i]) + Σi < n log Γ(ys[i])
    + Σi < n (xs[i] - ys[i]) * (Ψ(xs[i]) - Ψ(Σk < n xs[k]))

where Γ is the gamma function (see Math.log2Gamma(double)) and Ψ is the digamma function defined in Math.digamma(double).

This method, in keeping with the other information-theoretic methods, returns the answer in bits (base 2) rather than nats (base e). The return value is rescaled from the natural-log divergence by dividing by loge 2 (equivalently, multiplying by log2 e):

klDivergenceDirichlet(xs,ys) = DKL(Dirichlet(xs) || Dirichlet(ys)) / loge 2
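As a usage sketch (the parameter values here are purely illustrative):

```java
// Illustrative usage: KL divergence in bits between a skewed Dirichlet
// prior and a symmetric one over a 3-outcome simplex.
double[] xs = { 1.0, 2.0, 3.0 };
double[] ys = { 2.0, 2.0, 2.0 };
double divBits = Statistics.klDivergenceDirichlet(xs, ys);
```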
Parameters:
xs - Dirichlet parameter for the first distribution.
ys - Dirichlet parameter for the second distribution.

public static double symmetrizedKlDivergenceDirichlet(double[] xs, double[] ys)
The symmetrized KL divergence for Dirichlets is just the average of the two relative divergences:

DSKL(Dirichlet(xs), Dirichlet(ys))
  = (DKL(Dirichlet(xs) || Dirichlet(ys)) + DKL(Dirichlet(ys) || Dirichlet(xs))) / 2

See the method documentation for klDivergenceDirichlet(double[],double[]) for a definition of the relative divergence between Dirichlet distributions.
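A minimal sketch of the symmetrization, assuming xs and ys are valid Dirichlet parameter arrays:

```java
// Average of the two relative divergences; should match
// Statistics.symmetrizedKlDivergenceDirichlet(xs, ys).
double skl = (Statistics.klDivergenceDirichlet(xs, ys)
              + Statistics.klDivergenceDirichlet(ys, xs)) / 2.0;
```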
Parameters:
xs - Dirichlet parameter for the first distribution.
ys - Dirichlet parameter for the second distribution.

public static double klDivergence(double[] p, double[] q)
The KL divergence of a multinomial q relative to a multinomial p, both with n outcomes, is:

DKL(p||q) = Σi < n p(i) log2 (p(i) / q(i))

The value is guaranteed to be non-negative, and will be 0.0 only if the two distributions are identical. If any outcome has zero probability in q and non-zero probability in p, the result is infinite.
KL divergence is not symmetric. That is, there are p and q such that DKL(p||q) != DKL(q||p). See symmetrizedKlDivergence(double[],double[]) and jsDivergence(double[],double[]) for symmetric variants.
KL divergence may also be expressed in terms of cross entropy. If H(p,q) is the cross entropy of the distributions p and q, and H(p) is the entropy of p, then:

DKL(p||q) = H(p,q) - H(p)
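For concreteness, here is a minimal sketch of the summation above (an illustrative reimplementation, not the library's own code), assuming p and q are proper distributions of the same length:

```java
// Sum p(i) * log2(p(i)/q(i)); terms with p(i) == 0 contribute zero, and
// q(i) == 0 with p(i) > 0 yields positive infinity, as documented above.
static double klDivergenceSketch(double[] p, double[] q) {
    double divergence = 0.0;
    for (int i = 0; i < p.length; ++i)
        if (p[i] > 0.0)
            divergence += p[i] * (java.lang.Math.log(p[i] / q[i])
                                  / java.lang.Math.log(2.0));
    return divergence;
}
```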
Parameters:
p - First multinomial distribution.
q - Second multinomial distribution.
Throws:
IllegalArgumentException - If the distributions are not the same length or have entries less than zero or greater than 1.

public static double symmetrizedKlDivergence(double[] p, double[] q)
The symmetrized KL divergence is the average of the two relative divergences:

DSKL(p,q) = (DKL(p||q) + DKL(q||p)) / 2
Parameters:
p - First multinomial.
q - Second multinomial.
Throws:
IllegalArgumentException - If the distributions are not the same length or have entries less than zero or greater than 1.

public static double jsDivergence(double[] p, double[] q)
The Jensen-Shannon (JS) divergence between the multinomials p and q is defined by:

DJS(p,q) = (DKL(p||m) + DKL(q||m)) / 2

where m is defined as the balanced linear interpolation (that is, the average) of p and q:

m[i] = (p[i] + q[i]) / 2

The JS divergence is non-negative, equal to zero only if p and q are the same distribution, and symmetric.
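A minimal sketch in terms of klDivergence (the values of p and q here are illustrative only):

```java
// Build the average distribution m, then average the two divergences;
// should match Statistics.jsDivergence(p, q).
double[] p = { 0.5, 0.5 };
double[] q = { 0.9, 0.1 };
double[] m = new double[p.length];
for (int i = 0; i < p.length; ++i)
    m[i] = (p[i] + q[i]) / 2.0;
double js = (Statistics.klDivergence(p, m) + Statistics.klDivergence(q, m)) / 2.0;
```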
Parameters:
p - First multinomial.
q - Second multinomial.
Throws:
IllegalArgumentException - If the distributions are not the same length or have entries less than zero or greater than 1.

public static int[] permutation(int length)
Returns a permutation of the integers between 0 and the specified length minus one.

Parameters:
length - Size of permutation to return.

public static int[] permutation(int length, Random random)
Returns a permutation of the integers between 0 and the specified length minus one using the specified randomizer.

Parameters:
length - Size of permutation to return.
random - Randomizer to use for permutation.
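For example, a reproducible shuffle (seed chosen arbitrarily):

```java
// Permutation of the integers 0..9 driven by a seeded java.util.Random.
int[] perm = Statistics.permutation(10, new java.util.Random(42L));
```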

public static double chiSquaredIndependence(double both, double oneOnly, double twoOnly, double neither)
The return value is equivalent to:

chiSquaredIndependence(both,oneOnly,twoOnly,neither)
  = chiSquaredIndependence(new double[][] { {both, oneOnly}, {twoOnly, neither} })

The specified values make up the following contingency matrix for boolean outcomes labeled One and Two:

 | +Two | -Two
---|---|---
+One | both | oneOnly
-One | twoOnly | neither
The value both is the count of events where both outcomes occurred, oneOnly for where only the first outcome occurred, twoOnly for where only the second outcome occurred, and neither for when neither occurred. Let totalCount be the sum of all of the cells in the matrix.
From the contingency matrix, marginal probabilities for the two outcomes may be computed:
P(One) = (both + oneOnly) / totalCount
P(Two) = (both + twoOnly) / totalCount
If the two outcomes are independent, the expected values of the matrix cells are:

E(both) = totalCount * P(One) * P(Two)
E(oneOnly) = totalCount * P(One) * (1-P(Two))
E(twoOnly) = totalCount * (1-P(One)) * P(Two)
E(neither) = totalCount * (1-P(One)) * (1-P(Two))
These are used to derive the independence test statistic, which is the sum of squared differences between observed and expected values under the independence assumption, normalized by the expected values:

χ² = (both - E(both))² / E(both)
   + (oneOnly - E(oneOnly))² / E(oneOnly)
   + (twoOnly - E(twoOnly))² / E(twoOnly)
   + (neither - E(neither))² / E(neither)
Unlike the higher-dimensional case, this statistic applies as a hypothesis test only when all expected values are at least 10.
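A usage sketch with illustrative counts:

```java
// Counts: 20 samples with both outcomes, 5 with only the first,
// 7 with only the second, 68 with neither (totalCount = 100).
double x2 = Statistics.chiSquaredIndependence(20.0, 5.0, 7.0, 68.0);
```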
Parameters:
both - Count of samples of both outcomes.
oneOnly - Count of samples with the first and not the second outcome.
twoOnly - Count of samples with the second and not the first outcome.
neither - Count of samples with neither outcome.
Throws:
IllegalArgumentException - If any of the arguments are not non-negative finite numbers.

public static double[] linearRegression(double[] xs, double[] ys)
The returned coefficients { β0, β1 } define a linear function:

f(x) = β1 * x + β0

The coefficients returned produce the linear function f(x) with the smallest squared error:

sqErr(f,xs,ys) = Σi (f(xs[i]) - ys[i])²

where all sums are for 0 <= i < xs.length.

The function requires only a single pass through the two arrays, with β0 and β1 given by:
β1 = (n * Σi x[i]*y[i] - (Σi x[i]) * (Σi y[i])) / (n * Σi x[i]*x[i] - (Σi x[i])²)

β0 = (Σi y[i] - β1 * Σi x[i]) / n

where n = xs.length = ys.length.
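A single-pass sketch of the formulas above (an illustrative helper, not the library implementation; it assumes at least two distinct x values):

```java
static double[] linearRegressionSketch(double[] xs, double[] ys) {
    int n = xs.length;
    double sumX = 0.0, sumY = 0.0, sumXY = 0.0, sumXX = 0.0;
    for (int i = 0; i < n; ++i) {
        sumX += xs[i];
        sumY += ys[i];
        sumXY += xs[i] * ys[i];
        sumXX += xs[i] * xs[i];
    }
    // Slope and intercept from the closed-form least-squares solution.
    double beta1 = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    double beta0 = (sumY - beta1 * sumX) / n;
    return new double[] { beta0, beta1 };
}
```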
Parameters:
xs - Array of x values.
ys - Parallel array of y values.
Throws:
IllegalArgumentException - If the arrays are of length less than 2, or if the arrays are not of the same length.

public static double[] logisticRegression(double[] xs, double[] ys, double maxValue)
The returned coefficients { β0, β1 } define the logistic function:

f(x) = L / (1 + e^(β1 * x + β0))

with the minimum squared error, where L is the specified maximum value. See linearRegression(double[],double[]) for a definition of squared error. This function takes real values in the open interval (0, L).
Logistic regression coefficients are computed using linear regression after transforming the y values. This is possible because of the following linear relation:

log ((L - y) / y) = β1 * x + β0
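A sketch of that reduction, assuming xs and ys are the data arrays and the values in ys lie strictly within (0, L):

```java
// Transform y to log((L - y) / y), fit linearly, and reuse the
// resulting { β0, β1 } in the logistic function above.
double L = 1.0;  // illustrative maximum value
double[] zs = new double[ys.length];
for (int i = 0; i < ys.length; ++i)
    zs[i] = java.lang.Math.log((L - ys[i]) / ys[i]);
double[] betas = Statistics.linearRegression(xs, zs);
```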
Parameters:
xs - Array of x values.
ys - Array of y values.
maxValue - Maximum value of the function.
Throws:
IllegalArgumentException - If the maximum value is not positive and finite, if the arrays are of length less than 2, or if the arrays are not of the same length.

public static double chiSquaredIndependence(double[][] contingencyMatrix)
Returns the value of Pearson's χ² goodness-of-fit statistic for independence over the specified contingency matrix. The statistic is distributed as χ² with (numRows-1)*(numCols-1) degrees of freedom. The higher the value, the less likely the two outcomes are independent.

Pearson's χ² statistic is defined as follows:

χ² = Σi Σj (matrix[i][j] - expectedCount(i,j,matrix))² / expectedCount(i,j,matrix)
where the expected count is the total count times the maximum likelihood estimate of the row i probability times the maximum likelihood estimate of the column j probability:

expectedCount(i,j,matrix)
  = totalCount(matrix)
    * rowCount(i,matrix)/totalCount(matrix)
    * colCount(j,matrix)/totalCount(matrix)
  = rowCount(i,matrix) * colCount(j,matrix) / totalCount(matrix)

where

rowCount(i,matrix) = Σ0<=j<numCols matrix[i][j]

colCount(j,matrix) = Σ0<=i<numRows matrix[i][j]

totalCount(matrix) = Σ0<=i<numRows Σ0<=j<numCols matrix[i][j]
The χ² test is a large-sample test and is only valid if all of the expected counts are at least 5. This restriction is often ignored for ranking purposes.
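A usage sketch over a 2-by-3 matrix of illustrative counts:

```java
// Contingency counts for two outcomes with 2 and 3 categories;
// the statistic has (2-1)*(3-1) = 2 degrees of freedom.
double[][] contingencyMatrix = {
    { 10.0, 20.0, 30.0 },
    { 12.0, 18.0, 25.0 }
};
double x2 = Statistics.chiSquaredIndependence(contingencyMatrix);
```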
Parameters:
contingencyMatrix - The specified contingency matrix.
Throws:
IllegalArgumentException - If the matrix is not rectangular or if any value is not a non-negative finite number.

public static double[] normalize(double[] probabilityRatios)
Returns an array of probabilities resulting from normalizing the specified probability ratios.

Warning: This method is implemented by summing the probability ratios and then dividing each element by the sum. Because of the limited precision of double-based arithmetic, if the largest ratio is much larger than the next largest ratio, then the largest normalized probability will be one and all others will be zero. Java double values follow the IEEE 754 arithmetic standard and thus use 52 bits for their mantissas. Thus only ratios within a factor of 2^52 (approximately 10^16) of the maximum ratio will be non-zero.
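For example:

```java
// Ratios 1:2:1 normalize to the probabilities 0.25, 0.50, 0.25.
double[] probs = Statistics.normalize(new double[] { 1.0, 2.0, 1.0 });
```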
Parameters:
probabilityRatios - Ratios of probabilities.
Throws:
IllegalArgumentException - If the input contains a value that is not a finite non-negative number, or if the input does not contain at least one non-zero entry.

public static double kappa(double observedProb, double expectedProb)
The kappa statistic for an observed probability p and an expected probability e is defined by:

kappa(p,e) = (p - e) / (1 - e)

The most typical use for kappa is in evaluating classification problems of a machine versus a gold standard, or between two human annotators to assess inter-annotator agreement.
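For example, with illustrative values:

```java
// Observed agreement 0.9 against chance agreement 0.5:
// kappa = (0.9 - 0.5) / (1 - 0.5) = 0.8.
double k = Statistics.kappa(0.9, 0.5);
```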
Parameters:
observedProb - Observed probability.
expectedProb - Expected probability.

public static double mean(double[] xs)
The mean is defined by:

mean(xs) = (Σi < xs.length xs[i]) / xs.length
If the array is of length zero, the result is Double.NaN.

Parameters:
xs - Array of values.

public static double variance(double[] xs)
The variance is defined by:

variance(xs) = (Σi < xs.length (mean(xs) - xs[i])²) / xs.length
If the array is of length zero, the result is Double.NaN.

Parameters:
xs - Array of values.

public static double standardDeviation(double[] xs)
The standard deviation is the square root of the variance:

standardDeviation(xs) = variance(xs)^(1/2)
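A usage sketch tying the three methods together (values chosen so the results are exact):

```java
double[] xs = { 2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0 };
double m  = Statistics.mean(xs);               // 5.0
double v  = Statistics.variance(xs);           // 4.0
double sd = Statistics.standardDeviation(xs);  // 2.0
```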
If the array is of length zero, the result is Double.NaN.

Parameters:
xs - Array of values.

public static double correlation(double[] xs, double[] ys)
Returns (Pearson's) correlation coefficient r between the two specified arrays. The squared coefficient, r², measures how well a best linear fit explains the relation between the two arrays of data; equivalently, r² is the proportion of the variance in the second array explained by a linear relation with the first array. Rescaling either array by a positive constant will not affect the result.
The definition of the correlation coefficient is:

correlation(xs,ys)² = sumSqDiff(xs,ys)² / (sumSqDiff(xs,xs) * sumSqDiff(ys,ys))

where

sumSqDiff(xs,ys) = Σi<xs.length (xs[i] - mean(xs)) * (ys[i] - mean(ys))

and thus the terms in the denominator reduce using:

sumSqDiff(xs,xs) = Σi<xs.length (xs[i] - mean(xs))²
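A direct sketch of the definition (an illustrative helper, not the library implementation):

```java
static double correlationSketch(double[] xs, double[] ys) {
    double meanX = 0.0, meanY = 0.0;
    for (int i = 0; i < xs.length; ++i) {
        meanX += xs[i];
        meanY += ys[i];
    }
    meanX /= xs.length;
    meanY /= ys.length;
    // The sumSqDiff terms from the definition above.
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (int i = 0; i < xs.length; ++i) {
        double dx = xs[i] - meanX;
        double dy = ys[i] - meanY;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    return sxy / java.lang.Math.sqrt(sxx * syy);
}
```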
Parameters:
xs - First array of values.
ys - Second array of values.
Throws:
IllegalArgumentException - If the arrays are not the same length.

public static int sample(double[] cumulativeProbRatios, Random random)
The cumulative probability ratios represent unnormalized probabilities of generating the value of their index or less, that is, unnormalized cumulative probabilities. For instance, consider the cumulative probability ratios { 0.5, 0.5, 3.9, 10.1 }:

Outcome | Value | Unnormalized Prob | Prob
---|---|---|---
0 | 0.5 | 0.5 | 0.5/10.1
1 | 0.5 | 0.0 | 0.0/10.1
2 | 3.9 | 3.4 | 3.4/10.1
3 | 10.1 | 6.2 | 6.2/10.1

A sample is taken by generating a random number x between 0.0 and the value of the last element (here 10.1). The value returned is the index i such that:

cumulativeProbRatios[i-1] < x <= cumulativeProbRatios[i]

The corresponding probabilities given the sample input are listed in the last column of the table above. Note that if two elements have the same value, there is no chance of generating the outcome with the higher index. In the example above, this corresponds to outcome 1 having probability 0.0.
Warning: The cumulative probability ratios are required to meet two conditions. First, all values must be finite and non-negative. Second, the values must be non-decreasing, so that cumulativeProbRatios[i] <= cumulativeProbRatios[i+1]. If either of these conditions is not met, the result is undefined.
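A linear-scan sketch of the sampling rule (an illustrative helper; the library may use a different search strategy):

```java
static int sampleSketch(double[] cumulativeProbRatios, java.util.Random random) {
    int last = cumulativeProbRatios.length - 1;
    // Draw x uniformly between 0.0 and the last cumulative value.
    double x = random.nextDouble() * cumulativeProbRatios[last];
    // Return the first index whose cumulative value reaches x, so ties
    // never select the higher index (probability-zero outcomes).
    for (int i = 0; i < last; ++i)
        if (x <= cumulativeProbRatios[i])
            return i;
    return last;
}
```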
Parameters:
cumulativeProbRatios - Cumulative probability ratio for outcomes less than or equal to the index.
random - Random number generator for sampling.

public static double dirichletLog2Prob(double alpha, double[] xs)
See dirichletLog2Prob(double[],double[]) for more information on the Dirichlet distribution. This method returns a result equivalent to the following (though it is more efficiently implemented):

double[] alphas = new double[xs.length];
java.util.Arrays.fill(alphas, alpha);
assert(dirichletLog2Prob(alpha,xs) == dirichletLog2Prob(alphas,xs));
For the uniform Dirichlet, the distribution simplifies to the following form:

p(xs | Dirichlet(α)) = (1/Z(α)) Πi < k xs[i]^(α-1)

where

Z(α) = Γ(α)^k / Γ(k * α)
Warning: The probability distribution must be proper in the sense of having values between 0 and 1 inclusive and summing to 1.0. This property is not checked by this method.
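For example, under a symmetric Dirichlet with concentration α = 1, the density is constant at Γ(k); for k = 3 outcomes that is Γ(3) = 2, so the log (base 2) probability is 1.0. A sketch:

```java
// Uniform 3-outcome distribution under a symmetric Dirichlet(1,1,1).
double[] xs = { 1.0/3.0, 1.0/3.0, 1.0/3.0 };
double log2p = Statistics.dirichletLog2Prob(1.0, xs);  // 1.0
```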
Parameters:
alpha - Dirichlet concentration parameter to use for each dimension.
xs - The distribution whose probability is returned.
Returns:
The log (base 2) probability of the specified distribution under the uniform Dirichlet with concentration parameter alpha.
Throws:
IllegalArgumentException - If the values in the distribution are not between 0 and 1 inclusive or if the concentration parameter is not positive and finite.

public static double dirichletLog2Prob(double[] alphas, double[] xs)
The Dirichlet distribution is a probability distribution over k-dimensional discrete distributions.
The Dirichlet is widely used because it is the conjugate prior for multinomials in a Bayesian setting, and thus may be used to specify a convenient distribution over discrete distributions.
A Dirichlet distribution is specified by a dimensionality k and a concentration parameter α[i] > 0 for each i < k. The probability of the distribution xs in a Dirichlet distribution of dimension k and concentration parameters α is given (up to a normalizing constant) by:
p(xs | Dirichlet(α)) ∝ Πi < k xs[i]^(α[i]-1)

The full distribution is:

p(xs | Dirichlet(α)) = (1/Z(α)) * Πi < k xs[i]^(α[i]-1)

where the normalizing constant is given by:

Z(α) = Πi < k Γ(α[i]) / Γ(Σi < k α[i])
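A sketch of the log (base 2) computation implied by the formulas above, using the log2Gamma method referenced earlier (assumed here to live in the library's Math utility class) and converting natural logs to base 2:

```java
// log2 p = log2 Γ(Σ α[i]) - Σ log2 Γ(α[i]) + Σ (α[i]-1) * log2 xs[i]
// Math below refers to the library utility class providing log2Gamma,
// per the docs above; java.lang.Math is fully qualified to avoid a clash.
static double dirichletLog2ProbSketch(double[] alphas, double[] xs) {
    double alphaSum = 0.0;
    double log2p = 0.0;
    for (int i = 0; i < alphas.length; ++i) {
        alphaSum += alphas[i];
        log2p -= Math.log2Gamma(alphas[i]);
        log2p += (alphas[i] - 1.0)
                 * (java.lang.Math.log(xs[i]) / java.lang.Math.log(2.0));
    }
    log2p += Math.log2Gamma(alphaSum);
    return log2p;
}
```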
Warning: The probability distribution must be proper in the sense of having values between 0 and 1 inclusive and summing to 1.0. This property is not checked by this method.
Parameters:
alphas - The concentration parameters for the Dirichlet distribution.
xs - The outcome distribution.
Throws:
IllegalArgumentException - If the Dirichlet parameters and distribution are not arrays of the same length, if the distribution parameters in xs are not between 0 and 1 inclusive, or if any of the concentration parameters is not positive and finite.