com.aliasi.stats

## Class RegressionPrior

• All Implemented Interfaces:
Serializable

```public abstract class RegressionPrior
extends Object
implements Serializable```
A `RegressionPrior` instance represents a prior distribution on parameters for linear or logistic regression. It has methods to return the log probabilities of input parameters and compute the gradient of the log probability for estimation.

Instances of this class are used as parameters in the `LogisticRegression` class to control the regularization or lack thereof used by the stochastic gradient descent optimizers. The priors typically assume a zero mode (maximal value) for each dimension, but allow variances (or scales) to vary by input dimension. The method `shiftMeans(double[],RegressionPrior)` may be used to shift the means (and hence modes) of priors.

The behavior of a prior under stochastic gradient fitting is determined by its gradient: the partial derivative of the prior's contribution to the error function (the negative log prior probability) with respect to a coefficient `βi`.

` gradient(β,i) = - ∂ log p(β) / ∂ βi`

See the class documentation for `LogisticRegression` for more information.
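
The gradient definition above can be sanity-checked numerically. The following standalone sketch is not part of LingPipe; the class `GradientCheckSketch` and its method `numericalGradient` are hypothetical names introduced for illustration. It approximates the negated derivative of a log density with a central difference.

```java
import java.util.function.DoubleUnaryOperator;

// Hypothetical helper for illustration; not part of LingPipe.
final class GradientCheckSketch {

    // Central-difference estimate of -d/d(beta) log p(beta) at the given point.
    static double numericalGradient(DoubleUnaryOperator logDensity,
                                    double beta, double stepSize) {
        double up = logDensity.applyAsDouble(beta + stepSize);
        double down = logDensity.applyAsDouble(beta - stepSize);
        return -(up - down) / (2.0 * stepSize);
    }
}
```

For instance, passing the unit-variance Gaussian log density `b -> -b * b / 2.0` (up to an additive constant) should give a value close to the analytic gradient `βi/σi^2` at the same point.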

Priors also implement a log (base 2) probability density for the prior for a given parameter in a given dimension. The total log prior probability is defined as the sum of the log probabilities for the dimensions,

` log p(β) = Σi log p(βi)`

Priors affect gradient descent fitting of regression through their contribution to the gradient of the error function with respect to the parameter vector. The contribution of the prior to the error function is the negative log probability of the parameter vector(s) with respect to the prior distribution. The gradient of the error function is the collection of partial derivatives of the error function with respect to the components of the parameter vector. The regression prior abstract base class is defined in terms of a single method `gradient(double,int)`, which specifies the value of the gradient of the error function for a specified dimension with a specified value in that dimension.

This class implements static factory methods to construct noninformative, Gaussian and Laplace priors. The Gaussian and Laplace priors may specify a different variance for each dimension, but assume all the prior means (which are equivalent to the modes) are zero. The priors also assume the dimensions are independent, so the full covariance matrix is diagonal (that is, there is zero covariance between different dimensions).

#### Noninformative Prior & Maximum Likelihood Estimation

Using a noninformative prior for regression results in standard maximum likelihood estimation.

The noninformative prior assumes an improper uniform distribution over parameter vectors:

` p(βi) = Uniform(βi) = constant`
and thus the log probability is constant
` log p(βi) = log constant`
and therefore contributes nothing to the gradient:
` gradient(β,i) =  0.0`
A noninformative prior is constructed using the static method `noninformative()`.

#### Gaussian Prior, L2 Regularization & Ridge Regression

The Gaussian prior assumes a Gaussian (also known as normal) density over parameter vectors which results in L2-regularized regression, also known as ridge regression. Specifically, the prior allows a variance to be specified per dimension, but assumes dimensions are independent in that all off-diagonal covariances are zero. The Gaussian prior has a single mode that is the same as its mean.

The Gaussian density with variance `σi^2` is defined by:

` p(βi) = 1.0/sqrt(2 * π * σi^2) * exp(-βi^2/(2 * σi^2))`
which on a log scale is
` log p(βi) = log (1.0/sqrt(2 * π * σi^2)) - βi^2/(2 * σi^2)`

The Gaussian prior leads to the following contribution to the gradient for a dimension `i` with parameter `βi` and variance `σi^2`:

` gradient(β,i) = βi/σi^2`
As usual, the lower the variance, the steeper the gradient, and the stronger the effect on the (maximum) a posteriori estimate.

Gaussian priors are constructed using one of the static factory methods, `gaussian(double[])` or `gaussian(double,boolean)`.
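
As a minimal sketch of the Gaussian formulas above, the following standalone class evaluates the natural-log density and the gradient contribution for one dimension. It is an illustration only, not LingPipe's implementation; the class and method names are assumptions.

```java
// Hypothetical illustration of the zero-mean Gaussian prior formulas;
// not LingPipe's implementation.
final class GaussianPriorSketch {

    // Natural log of the zero-mean Gaussian density with the given variance.
    static double logDensity(double beta, double variance) {
        return -0.5 * Math.log(2.0 * Math.PI * variance)
            - beta * beta / (2.0 * variance);
    }

    // Contribution to the error-function gradient: beta / variance.
    static double gradient(double beta, double variance) {
        return beta / variance;
    }
}
```

Calling `gradient(1.0, 0.5)` and `gradient(1.0, 2.0)` shows the effect described above: the smaller variance yields the larger gradient.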

#### Laplace Prior, L1 Regularization & the Lasso

The Laplace prior assumes a Laplace density over parameter vectors which results in L1-regularized regression, also known as the lasso. The Laplace prior is called a double-exponential distribution because it looks like an exponential distribution for positive values and the reflection of this exponential distribution around zero (or more generally, around its mean parameter). The Laplace prior has the mode in the same location as the mean.

A Laplace prior allows a variance to be specified per dimension, but like the Gaussian prior, assumes means are zero and that the dimensions are independent in that all off-diagonal covariances are zero.

The Laplace density is defined by:

` p(βi) = (sqrt(2)/(2 * σi)) * exp(- sqrt(2) * abs(βi) / σi)`
which on the log scale is
` log p(βi) = log (sqrt(2)/(2 * σi)) - sqrt(2) * abs(βi) / σi`

The Laplace prior leads to the following contribution to the gradient for a dimension `i` with parameter `βi`, mean zero and variance `σi^2`:

` gradient(β,i) = sqrt(2) * signum(βi) / σi`
where the derivative of the absolute value function is the `signum` function, as defined by `Math.signum(double)`.
` signum(x) = x > 0 ? 1 : (x < 0 ? -1 : 0)`

Laplace priors are constructed using one of the static factory methods, `laplace(double[])` or `laplace(double,boolean)`.
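
The Laplace formulas above can be sketched in the same way. This standalone class is an illustration under the assumption that the prior is parameterized by its variance `σi^2`, as in the text; the names are hypothetical and it is not LingPipe's implementation.

```java
// Hypothetical illustration of the zero-mean Laplace prior formulas;
// not LingPipe's implementation.
final class LaplacePriorSketch {

    // Natural log of the zero-mean Laplace density with the given variance.
    static double logDensity(double beta, double variance) {
        double sigma = Math.sqrt(variance);
        return Math.log(Math.sqrt(2.0) / (2.0 * sigma))
            - Math.sqrt(2.0) * Math.abs(beta) / sigma;
    }

    // Contribution to the error-function gradient: sqrt(2) * signum(beta) / sigma.
    static double gradient(double beta, double variance) {
        return Math.sqrt(2.0) * Math.signum(beta) / Math.sqrt(variance);
    }
}
```

Unlike the Gaussian case, the magnitude of this gradient does not depend on the size of `βi`, only on its sign.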

#### Cauchy Prior

The Cauchy prior assumes a Cauchy density (also known as a Lorentz density) over parameter vectors. The Cauchy density allows a scale to be specified for each dimension. The mean and variance are undefined as their integrals diverge. The Cauchy distribution is symmetric and, for regression priors, we assume a mode of zero for the base distribution. The Cauchy prior has a single mode, which coincides with its median.

The Cauchy density with scale of 1 is a Student-t density with one degree of freedom.

The Cauchy density is defined by:

` p(βi,i) = (1 / π) * (λi / (βi^2 + λi^2))`
which on a log scale is
` log p(βi,i) = log (1 / π) + log (λi) - log (βi^2 + λi^2)`

The Cauchy prior leads to the following contribution to the gradient for dimension `i` with parameter `βi` and squared scale `λi^2`:

` gradient(βi, i) = 2 * βi / (βi^2 + λi^2)`

Cauchy priors are constructed using one of the static factory methods `cauchy(double[])` or `cauchy(double,boolean)`.
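
A minimal sketch of the Cauchy formulas above follows. It takes the squared scale `λi^2` as its argument, mirroring the parameter names of the factory methods; the class itself is hypothetical and is not LingPipe's implementation.

```java
// Hypothetical illustration of the zero-mode Cauchy prior formulas;
// not LingPipe's implementation.
final class CauchyPriorSketch {

    // Natural log of the Cauchy density; log(lambda) = 0.5 * log(lambda^2).
    static double logDensity(double beta, double squaredScale) {
        return -Math.log(Math.PI)
            + 0.5 * Math.log(squaredScale)
            - Math.log(beta * beta + squaredScale);
    }

    // Contribution to the error-function gradient: 2 beta / (beta^2 + lambda^2).
    static double gradient(double beta, double squaredScale) {
        return 2.0 * beta / (beta * beta + squaredScale);
    }
}
```

Because of the `βi^2` term in the denominator, this gradient shrinks again for coefficients far from zero, so large coefficients are penalized less strongly than under a Gaussian prior.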

For use in gradient-based algorithms, the gradients of two different priors may be interpolated. A special case is the elastic net, discussed in the next section. Given two priors `p1` and `p2`, and an interpolation ratio `α` between 0 and 1, the interpolated prior is defined by

` log p(βi) = α * log p1(βi) + (1 - α) * log p2(βi) - Z`
where `Z` is the normalization constant not depending on `β` that normalizes the density,
```
p(β,i) = exp(log p(βi))
       = exp(α * log p1(βi)) * exp((1 - α) * log p2(βi)) / exp(Z)
       = p1(β,i)^α * p2(β,i)^(1 - α) / exp(Z)
```

The gradient, being a derivative, will be the weighted sum of the underlying gradients `gradient1` and `gradient2`,

` gradient(β,i) = α * gradient1(β,i) + (1 - α) * gradient2(β,i)`
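
The weighted sum above can be sketched as a function combinator. The class `InterpolatedGradientSketch` below is a hypothetical, single-dimension illustration and not the code behind `logInterpolated`; it only mirrors the documented restriction that the ratio lies between 0 and 1 inclusive.

```java
import java.util.function.DoubleUnaryOperator;

// Hypothetical illustration of gradient interpolation; not LingPipe's implementation.
final class InterpolatedGradientSketch {

    // Returns beta -> alpha * gradient1(beta) + (1 - alpha) * gradient2(beta).
    static DoubleUnaryOperator interpolate(double alpha,
                                           DoubleUnaryOperator gradient1,
                                           DoubleUnaryOperator gradient2) {
        if (Double.isNaN(alpha) || alpha < 0.0 || alpha > 1.0)
            throw new IllegalArgumentException(
                "alpha must be between 0 and 1 inclusive, found " + alpha);
        return beta -> alpha * gradient1.applyAsDouble(beta)
            + (1.0 - alpha) * gradient2.applyAsDouble(beta);
    }
}
```

The normalization constant `Z` does not appear because it does not depend on `β` and therefore vanishes under differentiation.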

#### Elastic Net Prior

The elastic net prior interpolates between a Laplace prior and a Gaussian prior on the log scale uniformly for all dimensions. There are two parameters, a scale parameter for the prior variances and an interpolation parameter that determines the weight given to the Laplace prior versus the Gaussian prior. The elastic net prior with Laplace weight `α` and scale `λ` is defined by
` log p(β,i) = α * log Laplace(βi|1/sqrt(λ)) + (1 - α) * log Gaussian(βi|sqrt(2)/λ)`
where `Laplace(βi|1/sqrt(λ))` is the density of the (zero-mean) Laplace distribution with variance `1/sqrt(λ)`, and `Gaussian(βi|sqrt(2)/λ)` is the (zero-mean) Gaussian density function with variance `sqrt(2)/λ`.

Thus the gradient is an interpolation of the gradients of the Laplace with variance `σ^2 = 1/sqrt(λ)` and Gaussian with variance `σ^2 = sqrt(2)/λ`, leading to a simple gradient form,

` gradient(β,i) = α * λ * signum(βi) + (1 - α) * λ * βi`

The basic elastic net prior has zero means and modes in all dimensions, but may be shifted like other priors.
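
The simple gradient form stated for the elastic net can be written directly. The sketch below only transcribes that formula for a single coefficient; the class and parameter names are assumptions and it is not LingPipe's implementation.

```java
// Hypothetical transcription of the elastic net gradient formula;
// not LingPipe's implementation.
final class ElasticNetGradientSketch {

    // alpha * lambda * signum(beta) + (1 - alpha) * lambda * beta
    static double gradient(double beta, double laplaceWeight, double scale) {
        return laplaceWeight * scale * Math.signum(beta)
            + (1.0 - laplaceWeight) * scale * beta;
    }
}
```

Setting `laplaceWeight` to 1 leaves only the sign-based term, and setting it to 0 leaves only the term linear in `βi`.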

#### Non-Zero Means and Modes

Priors with non-zero means or modes typically arise in hierarchical or multilevel regression models or models in which informative priors are available on a dimension-by-dimension basis.

Through the method `shiftMeans(double[],RegressionPrior)` it is possible to shift the means of a prior by the specified amounts, which allows any prior to be used with non-zero means. Suppose `p2` is the density and `gradient2` the gradient of the specified prior, and `shifts` is the specified array of doubles giving the mean shifts. Probabilities and gradients are computed by shifting back,

` p(β) = p2(β - shifts)`
and
` gradient(β,i) = gradient2(β - shifts,i)`
Dimension by dimension, the value is computed by subtracting the shift from the value and plugging it into the underlying prior.

For example, to specify a Gaussian prior with means `mus` and variances `vars`, use

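```
double[] mus = ...   // prior means, indexed by dimension
double[] vars = ...  // prior variances, indexed by dimension
RegressionPrior prior
    = RegressionPrior.shiftMeans(mus, RegressionPrior.gaussian(vars));
```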
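
To make the "shifting back" computation concrete, here is a standalone sketch that wraps a per-dimension gradient function. The nested interface `Gradient` is a hypothetical stand-in that merely imitates the shape of `gradient(double,int)`; none of this is LingPipe code.

```java
// Hypothetical illustration of mean shifting; not LingPipe's implementation.
final class ShiftMeansSketch {

    // Stand-in for a prior's gradient(double,int) method.
    interface Gradient {
        double gradient(double beta, int dimension);
    }

    // Returns a gradient that evaluates the base gradient at beta - shifts[dimension].
    static Gradient shift(double[] shifts, Gradient base) {
        double[] copy = shifts.clone(); // defensive copy of the caller's array
        return (beta, dimension) -> base.gradient(beta - copy[dimension], dimension);
    }
}
```

With a Gaussian base gradient, the shifted gradient is zero exactly where the coefficient equals its shift, which is the new mean.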

#### Special Treatment of Intercept

By convention, input dimension zero (`0`) may be reserved for the intercept and set to value 1.0 in all input vectors. For regularized regression, the regularization is typically not applied to the intercept term. To match this convention, the factory methods allow a boolean parameter indicating whether the intercept parameter has a noninformative/uniform prior. If the intercept flag indicates it is noninformative, then dimension 0 will have an infinite prior variance or scale, and hence a zero gradient. The result is that the intercept will be fit by maximum likelihood.
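
The effect of an infinite variance on the intercept can be seen in a small standalone sketch. The explicit `numDimensions` argument is an assumption made so the example is self-contained (the factory methods described here take no such argument), and the class is not LingPipe code.

```java
import java.util.Arrays;

// Hypothetical illustration of a noninformative intercept; not LingPipe's implementation.
final class InterceptVarianceSketch {

    // Builds per-dimension variances, with dimension 0 infinite if requested.
    static double[] variances(int numDimensions, double priorVariance,
                              boolean noninformativeIntercept) {
        double[] vars = new double[numDimensions];
        Arrays.fill(vars, priorVariance);
        if (noninformativeIntercept && numDimensions > 0)
            vars[0] = Double.POSITIVE_INFINITY;
        return vars;
    }

    // Gaussian gradient contribution; a finite beta over an infinite variance is 0.0.
    static double gaussianGradient(double beta, double variance) {
        return beta / variance;
    }
}
```

Under IEEE floating-point arithmetic, dividing a finite coefficient by `Double.POSITIVE_INFINITY` gives zero, so the intercept receives no regularization.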

#### Serialization

All of the regression priors may be serialized.

#### References

For full details on the Gaussian, Cauchy, and Laplace distributions, see:

For explanations of how the priors are used with regression including logistic regression, see the following three textbooks:

and for non-zero means and gradient calculations, the tech reports:

For a description and evaluation of the Cauchy prior, see

For details of the elastic net prior, see

Since:
LingPipe3.5
Version:
3.9.2
Author:
Bob Carpenter
Serialized Form
• ### Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| `static RegressionPrior` | `cauchy(double[] priorSquaredScales)` | Returns the Cauchy prior for the specified squared scales. |
| `static RegressionPrior` | `cauchy(double priorSquaredScale, boolean noninformativeIntercept)` | Returns the Cauchy prior with the specified prior squared scale for each dimension and an indication of whether the intercept is given a noninformative prior. |
| `static RegressionPrior` | `elasticNet(double laplaceWeight, double scale, boolean noninformativeIntercept)` | Returns the elastic net prior with the specified weight on the Laplace prior, the specified scale parameter for the elastic net and a noninformative prior on the intercept (dimension 0) if the specified flag is set. |
| `static RegressionPrior` | `gaussian(double[] priorVariances)` | Returns the Gaussian prior with the specified prior variances for each dimension. |
| `static RegressionPrior` | `gaussian(double priorVariance, boolean noninformativeIntercept)` | Returns the Gaussian prior with the specified prior variance and indication of whether the intercept is given a noninformative prior. |
| `abstract double` | `gradient(double betaForDimension, int dimension)` | Returns the contribution to the gradient of the error function of the specified parameter value for the specified dimension. |
| `boolean` | `isUniform()` | Returns `true` if this prior is the uniform distribution. |
| `static RegressionPrior` | `laplace(double[] priorVariances)` | Returns the Laplace prior with the specified prior variances for the dimensions. |
| `static RegressionPrior` | `laplace(double priorVariance, boolean noninformativeIntercept)` | Returns the Laplace prior with the specified prior variance and an indication of whether the intercept dimension is given a noninformative prior. |
| `abstract double` | `log2Prior(double betaForDimension, int dimension)` | Returns the log (base 2) of the prior density evaluated at the specified coefficient value for the specified dimension (up to an additive constant). |
| `double` | `log2Prior(Vector beta)` | Returns the log (base 2) prior density for a specified coefficient vector (up to an additive constant). |
| `double` | `log2Prior(Vector[] betas)` | Returns the log (base 2) prior density for the specified array of coefficient vectors (up to an additive constant). |
| `static RegressionPrior` | `logInterpolated(double alpha, RegressionPrior prior1, RegressionPrior prior2)` | Returns the prior that interpolates its log probability between the specified priors with the weight going to the first prior. |
| `double` | `mode(int dimension)` | Returns the mode of the prior. |
| `static RegressionPrior` | `noninformative()` | Returns the noninformative or uniform prior to use for maximum likelihood regression fitting. |
| `static RegressionPrior` | `shiftMeans(double[] shifts, RegressionPrior prior)` | Returns the prior that shifts the means of the specified prior by the specified values. |

• ### Methods inherited from class java.lang.Object

`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`
• ### Method Detail

• #### isUniform

`public boolean isUniform()`
Returns `true` if this prior is the uniform distribution. Uniform priors reduce to maximum likelihood calculations.
Returns:
`true` if this prior is the uniform distribution.
• #### mode

`public double mode(int dimension)`
Returns the mode of the prior. The mode is used to clip gradient steps of the prior so they do not pass through the mode.
Parameters:
`dimension` - Dimension position in vector.
Returns:
The mode of the prior for the specified dimension.
• #### gradient

```public abstract double gradient(double betaForDimension,
int dimension)```
Returns the contribution to the gradient of the error function of the specified parameter value for the specified dimension.
Parameters:
`betaForDimension` - Parameter value for the specified dimension.
`dimension` - The dimension.
Returns:
The contribution to the gradient of the error function of the parameter value and dimension.
• #### log2Prior

```public abstract double log2Prior(double betaForDimension,
int dimension)```
Returns the log (base 2) of the prior density evaluated at the specified coefficient value for the specified dimension (up to an additive constant). The overall error function is the sum of the negative log likelihood of the data under the model and the negative log of the prior.
Parameters:
`betaForDimension` - Parameter value for the specified dimension.
`dimension` - The dimension.
Returns:
The log (base 2) prior density of the specified parameter value for the specified dimension.
• #### log2Prior

`public double log2Prior(Vector beta)`
Returns the log (base 2) prior density for a specified coefficient vector (up to an additive constant).
Parameters:
`beta` - Parameter vector.
Returns:
The log (base 2) prior for the specified parameter vector.
Throws:
`IllegalArgumentException` - If the specified parameter vector does not match the dimensionality of the prior (if specified).
• #### log2Prior

`public double log2Prior(Vector[] betas)`
Returns the log (base 2) prior density for the specified array of coefficient vectors (up to an additive constant).
Parameters:
`betas` - The parameter vectors.
Returns:
The log (base 2) prior density for the specified array of coefficient vectors.
Throws:
`IllegalArgumentException` - If any of the specified parameter vectors does not match the dimensionality of the prior (if specified).
• #### noninformative

`public static RegressionPrior noninformative()`
Returns the noninformative or uniform prior to use for maximum likelihood regression fitting.
Returns:
The noninformative prior.
• #### gaussian

```public static RegressionPrior gaussian(double priorVariance,
boolean noninformativeIntercept)```
Returns the Gaussian prior with the specified prior variance and indication of whether the intercept is given a noninformative prior.

If the noninformative-intercept flag is set to `true`, the prior variance for dimension zero (`0`) is set to `Double.POSITIVE_INFINITY`.

Parameters:
`priorVariance` - Variance of the Gaussian prior for each dimension.
`noninformativeIntercept` - Flag indicating if intercept is given a noninformative (uniform) prior.
Returns:
The Gaussian prior with the specified parameters.
Throws:
`IllegalArgumentException` - If the prior variance is not a non-negative number.
• #### gaussian

`public static RegressionPrior gaussian(double[] priorVariances)`
Returns the Gaussian prior with the specified prior variances for each dimension. The number of dimensions is taken to be the length of the variance array.

Parameters:
`priorVariances` - Array of prior variances for dimensions.
Returns:
The Gaussian prior with the specified variances.
Throws:
`IllegalArgumentException` - If any of the variances are not non-negative numbers.
• #### laplace

```public static RegressionPrior laplace(double priorVariance,
boolean noninformativeIntercept)```
Returns the Laplace prior with the specified prior variance and an indication of whether the intercept dimension is given a noninformative prior.

If the noninformative-intercept flag is set to `true`, the prior variance for dimension zero (`0`) is set to `Double.POSITIVE_INFINITY`.

Parameters:
`priorVariance` - Variance of the Laplace prior for each dimension.
`noninformativeIntercept` - Flag indicating if intercept is given a noninformative (uniform) prior.
Returns:
The Laplace prior with the specified parameters.
Throws:
`IllegalArgumentException` - If the variance is not a non-negative number.
• #### laplace

`public static RegressionPrior laplace(double[] priorVariances)`
Returns the Laplace prior with the specified prior variances for the dimensions.

Parameters:
`priorVariances` - Array of prior variances for dimensions.
Returns:
The Laplace prior for the specified variances.
Throws:
`IllegalArgumentException` - If any of the variances is not a non-negative number.
• #### cauchy

```public static RegressionPrior cauchy(double priorSquaredScale,
boolean noninformativeIntercept)```
Returns the Cauchy prior with the specified prior squared scale for each dimension and an indication of whether the intercept is given a noninformative prior.

Parameters:
`priorSquaredScale` - The square of the prior scale parameter.
`noninformativeIntercept` - Flag indicating if intercept is given a noninformative (uniform) prior.
Returns:
The Cauchy prior for the specified squared scale and intercept flag.
Throws:
`IllegalArgumentException` - If the squared scale is not a non-negative number.
• #### cauchy

`public static RegressionPrior cauchy(double[] priorSquaredScales)`
Returns the Cauchy prior for the specified squared scales.

Parameters:
`priorSquaredScales` - Prior squared scale parameters.
Returns:
The Cauchy prior for the specified square scales.
Throws:
`IllegalArgumentException` - If any of the prior squared scales is not a non-negative number.
• #### logInterpolated

```public static RegressionPrior logInterpolated(double alpha,
RegressionPrior prior1,
RegressionPrior prior2)```
Returns the prior that interpolates its log probability between the specified priors with the weight going to the first prior.

Parameters:
`alpha` - Weight of first prior.
`prior1` - First prior for interpolation.
`prior2` - Second prior for interpolation.
Returns:
The interpolated prior.
Throws:
`IllegalArgumentException` - If the interpolation ratio is not a number between 0 and 1 inclusive.
• #### elasticNet

```public static RegressionPrior elasticNet(double laplaceWeight,
double scale,
boolean noninformativeIntercept)```
Returns the elastic net prior with the specified weight on the Laplace prior, the specified scale parameter for the elastic net and a noninformative prior on the intercept (dimension 0) if the specified flag is set.

See the class documentation above for more information on elastic net priors.

This is a convenience method for

``` logInterpolated(laplaceWeight,
laplace(1/sqrt(scale),noninformativeIntercept),
gaussian(sqrt(2)/scale,noninformativeIntercept))
```
Parameters:
`laplaceWeight` - Weight on the Laplace prior.
`scale` - Scale parameter for the elastic net.
`noninformativeIntercept` - A flag indicating whether or not the intercept (dimension 0) should have a noninformative prior.
Returns:
The elastic net prior with the specified parameters.
Throws:
`IllegalArgumentException` - If the interpolation parameter is not between 0 and 1 inclusive, or if the scale is not positive and finite.
• #### shiftMeans

```public static RegressionPrior shiftMeans(double[] shifts,
RegressionPrior prior)```
Returns the prior that shifts the means of the specified prior by the specified values.

Parameters:
`shifts` - Mean shifts indexed by dimension.
`prior` - Prior to apply to shifted values.