Introduction
What is Logistic Regression?
Logistic regression is a discriminative probabilistic classification model that operates over real-valued vector inputs. The dimensions of the input vectors being classified are called "features" and there is no restriction against them being correlated. Logistic regression is one of the best probabilistic classifiers, measured in both log loss and first-best classification accuracy across a number of tasks.
The logistic regression implementation in LingPipe provides multinomial classification; that is, it allows more than two possible output categories.
The main drawback of logistic regression is that it's relatively slow to train compared to the other LingPipe classifiers. It also requires extensive tuning in the form of feature selection and prior parameterization to achieve state-of-the-art classification performance.
What's in the Tutorial?
This tutorial covers both the vector-based implementation in the statistics package and the use of feature extractors for classifying arbitrary objects in the classification package. The tutorial will cover basic estimation, the effects of different choices and parameterizations of priors, and tuning the estimator's search.
The tutorial will also cover the basics of feature-based classification in LingPipe. Feature extractors convert arbitrary objects into feature vectors, which may then be converted to actual vectors for use in logistic regression.
Also Known As (AKA)
For the sake of terminological clarity (and search engine optimization), here are some aliases for multinomial logistic regression.
Polytomous Logistic Regression
Multinomial logistic regression is also known as polytomous, polychotomous, or multi-class logistic regression, or just multilogit regression.
Maximum Entropy Classifier
Logistic regression estimation obeys the maximum entropy principle, and thus logistic regression is sometimes called "maximum entropy modeling", and the resulting classifier the "maximum entropy classifier".
Neural Network: Classification with a Single Neuron
Binary logistic regression is equivalent to a one-layer, single-output neural network with a logistic activation function trained under log loss. This is sometimes called classification with a single neuron.
LingPipe's stochastic gradient descent is equivalent to a stochastic back-propagation algorithm over the single-output neural network.
Ridge Regression and the Lasso
Maximum a posteriori (MAP) estimation with Gaussian priors is often referred to as "ridge regression"; with Laplace priors MAP estimation is known as the "lasso".
Shrinkage and Regularized Regression
MAP estimation with Gaussian, Laplace or Cauchy priors is known as parameter shrinkage.
Gaussian and Laplace priors are equivalent to regularized regression, with the Gaussian version being regularized with the L_{2} norm (Euclidean distance, called the Frobenius norm for matrices of parameters) and the Laplace version being regularized with the L_{1} norm (taxicab distance or Manhattan metric); other Minkowski metrics may be used for shrinkage.
Generalized Linear Model and Softmax
Logistic regression is a generalized linear model with the logit link function. The inverse link, the logistic function, is called softmax in its multinomial form; given its use of exponentiation to convert linear predictors to probabilities, logistic regression is sometimes called an exponential model.
Logistic Regression Models
Logistic regression models provide multi-category classification in cases where the categories are exhaustive and mutually exclusive. That is, every instance belongs to exactly one category.
Inputs are coded as real-valued vectors of a fixed dimensionality. The dimensions are often called predictors or features. There is no requirement that they be independent, and with regularization, they may even be highly or fully linearly correlated.
The model consists of a parameter vector of the same dimensionality as the inputs for each category. The last category does not get a parameter vector; or equivalently, it gets a constant 0 parameter vector.
More formally, if the inputs are of dimension d and there are k categories, the model consists of k-1 vectors β[0],...,β[k-2], each of dimension d. Then for a given input vector x of dimensionality d, the conditional probability of a category given the input is defined to be:
p(0 | x)   ∝ exp(β[0] * x)
p(1 | x)   ∝ exp(β[1] * x)
...
p(k-2 | x) ∝ exp(β[k-2] * x)
p(k-1 | x) ∝ exp(0 * x)
Normalizing by the sum of the exponentiated linear predictors yields the probability estimates:
p(0 | x)   = exp(β[0]*x)   / (exp(β[0]*x) + ... + exp(β[k-2]*x) + exp(0*x))
p(1 | x)   = exp(β[1]*x)   / (exp(β[0]*x) + ... + exp(β[k-2]*x) + exp(0*x))
...
p(k-2 | x) = exp(β[k-2]*x) / (exp(β[0]*x) + ... + exp(β[k-2]*x) + exp(0*x))
p(k-1 | x) = exp(0*x)      / (exp(β[0]*x) + ... + exp(β[k-2]*x) + exp(0*x))
Writing it out in summation notation, for c < k-1:
p(c | x) = exp(β[c] * x) / (1 + Σ_{i < k-1} exp(β[i]*x))
and for c = k-1:
p(k-1 | x) = 1 / (1 + Σ_{i < k-1} exp(β[i]*x))
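Written as code, the definitions above are just dot products, exponentiation, and division by the sum. The following sketch is illustrative only (the class name Softmax is not part of LingPipe); it computes the full vector of conditional probabilities from the k-1 parameter vectors, treating the final category as having an implicit zero vector:

```java
// Illustrative multi-logit (softmax) computation, matching the
// equations above; not part of the LingPipe API.
public class Softmax {
    // betas: the k-1 parameter vectors; x: an input of dimension d.
    // Returns p(0|x), ..., p(k-1|x).
    public static double[] probs(double[][] betas, double[] x) {
        int k = betas.length + 1;
        double[] exps = new double[k];
        double sum = 0.0;
        for (int c = 0; c < k; ++c) {
            double dot = 0.0;
            if (c < k - 1)
                for (int i = 0; i < x.length; ++i)
                    dot += betas[c][i] * x[i];
            exps[c] = Math.exp(dot);  // exp(0) = 1 for the last category
            sum += exps[c];
        }
        for (int c = 0; c < k; ++c)
            exps[c] /= sum;           // normalize to probabilities
        return exps;
    }
}
```

Note that with a single zero parameter vector and two categories, this reduces to a uniform 50/50 split, as expected from the definition.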
Example of Logistic Regression
Logistic regression models are estimated from training data consisting of a sequence of vectors and their reference categories. The vectors are arbitrary, with their dimensions representing features of the input objects being classified. The categories are discrete, and should be numbered contiguously from 0 to the number of categories minus one.
The Wallet Problem
The first example we consider is drawn from chapter 5 of the following book:
- Allison, Paul David. 1999. Logistic Regression Using the SAS System: Theory and Application. SAS Institute.
The data is based on a survey of 195 undergraduates, and attempts to predict their answer to the question "If you found a wallet on the street, would you...", with the following possible responses:
Wallet Problem Outcomes

Outcome | Description
---|---
0 | keep both
1 | keep the money, return the wallet
2 | return both
The input vectors are five dimensional, consisting of the following features, the descriptions of which are directly transcribed from (Allison 1999):
Wallet Problem Predictors

Dimension | Description | Values
---|---|---
0 | Intercept | 1: always
1 | Male | 1: male; 0: female
2 | Business | 1: enrolled in business school; 0: not enrolled in business school
3 | Punish | Variable describing whether the student was physically punished by parents at various ages. 1: punished in elementary school, but not in middle or high school; 2: punished in elementary and middle school, but not in high school; 3: punished at all three levels
4 | Explain | Response to the question "When you were punished, did your parents generally explain why what you did was wrong?" 1: almost always; 0: sometimes or never
LingPipe requires an explicit representation of the intercept feature, which is implicit in (Allison 1999). The intercept is treated just like other features, but is assumed to take on value 1.0 in all inputs. Thus it provides an input-independent bias term for estimation. Most problems benefit from the addition of such an intercept feature.
Where do Features Come From?
The predictors in this problem are all discrete, most defining binary variables, with the physical punishment variable taking on three ordinal values. It is also possible to include continuous inputs, such as token counts in linguistic examples or features like the width of petals in flower species classification problems.
Here's the first few training examples out of the complete set of 195:
Wallet Problem Data (Sample)

Outcome | Intercept | Male | Business | Punish | Explain
---|---|---|---|---|---
1 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0
1 | 1.0 | 0.0 | 0.0 | 2.0 | 1.0
2 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0
2 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0
0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0
... | | | | |
For example, the third training instance represents a survey response for a woman (male=0.0) who is not in business school (business=0.0), who was punished only in elementary school (punish=1.0), had her punishment explained almost always (explain=1.0), and who said she'd return both the wallet and the money (outcome=2). The fifth training example represents a man who's not in business school, was punished only in elementary school, had his punishment explained, and answered that he would keep both the money and wallet.
Our first logistic regression model is estimated from 195 of these training cases, yielding a classifier that given the five input feature values (intercept, male, business, punish and explain), assigns probabilities to the three outcomes (keep both, return only money, return both).
Coding the Problem
The source code for the wallet problem may be found in the file src/WalletProblem.java.
In order to train a logistic regression model, LingPipe requires the inputs to be coded as instances of matrix.Vector and the outputs to be coded as integers. These are presented as parallel arrays of vectors and output integers.
To keep things simple, the outputs and inputs are coded directly as static constants. Here are the outputs:
static final int[] OUTPUTS = new int[] {
    1, 1, 2, 2, 0,
    ...
}
The inputs are coded as dense vector instances:
static final Vector[] INPUTS = new Vector[] {
    new DenseVector(new double[] { 1, 0, 0, 2, 0 }),
    new DenseVector(new double[] { 1, 0, 0, 2, 1 }),
    new DenseVector(new double[] { 1, 0, 0, 1, 1 }),
    new DenseVector(new double[] { 1, 0, 0, 2, 0 }),
    new DenseVector(new double[] { 1, 1, 0, 1, 1 }),
    ...
};
Note how these two parallel arrays directly encode the sample data as presented in the previous table.
Estimating the Regression Coefficients
Running the code using the ant target wallet prints out the estimated regression coefficients:
> ant wallet
Computing Wallet Problem Logistic Regression
Outcome=0  -3.47  1.27  1.18  1.08  -1.60
Outcome=1  -1.29  1.17  0.42  0.20  -0.80
An estimated model consists of one fewer weight vectors than there are output categories. We don't need a vector for the last category, because it may be taken to be zero without loss of generality (see Carpenter 2008).
Implicit Coefficients for Final Outcome
The final outcome has all zero coefficients. Filling in for the wallet example, this gives us:
Outcome=2 0.00 0.00 0.00 0.00 0.00
These are not printed as part of the model output.
Interpreting the Regression Coefficients
Because logistic regression involves a simple linear predictor, the regression coefficients may be interpreted fairly directly.
Intercept
The values of the intercept parameter are -3.47 for outcome 0 (keep both), -1.29 for outcome 1 (keep money, return wallet), and implicitly 0.0 for outcome 2 (return both). Because the intercept feature dimension is always 1.0, the linear predictors for outcomes 0, 1, and 2 start off on an uneven footing. To make the keep-both outcome most likely, another feature or combination of features will have to contribute more than 3.47 to its linear predictor.
Computing Probabilities
The main point of fitting a model is to be able to compute probabilities for events. For instance, take male business students who were punished at all three levels without explanation. That provides an input vector of (1,1,1,3,0). The unnormalized values work out as:

p(keep-both|1,1,1,3,0)   ∝ exp(-3.47*1 + 1.27*1 + 1.18*1 + 1.08*3 + -1.60*0) = exp(2.22) = 9.21
p(keep-money|1,1,1,3,0)  ∝ exp(-1.29*1 + 1.17*1 + 0.42*1 + 0.20*3 + -0.80*0) = exp(0.90) = 2.46
p(return-both|1,1,1,3,0) ∝ exp(0.0*1 + 0.0*1 + 0.0*1 + 0.0*3 + 0.0*0) = exp(0) = 1.00

Division by the sum of exponentiated linear predictors yields the probabilities:

p(keep-both|1,1,1,3,0)   = 9.21 / (9.21 + 2.46 + 1.00) = 0.73
p(keep-money|1,1,1,3,0)  = 2.46 / (9.21 + 2.46 + 1.00) = 0.19
p(return-both|1,1,1,3,0) = 1.00 / (9.21 + 2.46 + 1.00) = 0.08

If we repeat this exercise for women (second feature = 0), we get:

p(keep-both|1,0,1,3,0)   = 2.59 / (2.59 + 0.76 + 1.00) = 0.60
p(keep-money|1,0,1,3,0)  = 0.76 / (2.59 + 0.76 + 1.00) = 0.17
p(return-both|1,0,1,3,0) = 1.00 / (2.59 + 0.76 + 1.00) = 0.23

According to this model, among business students punished at all three levels without explanation, keeping both the money and the wallet is the most likely outcome for either gender. Men are even more likely than women to keep both, while women are almost three times as likely as men to return both the wallet and the money.
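This kind of hand arithmetic is easy to check mechanically by plugging the printed coefficients into the model definition. The class name WalletProbs below is illustrative only, not part of the demo code:

```java
// Illustrative check of the wallet-problem probabilities; the
// coefficients are the estimates printed by the wallet demo.
public class WalletProbs {
    static final double[][] BETAS = {
        { -3.47, 1.27, 1.18, 1.08, -1.60 },  // outcome 0: keep both
        { -1.29, 1.17, 0.42, 0.20, -0.80 }   // outcome 1: keep money, return wallet
    };                                        // outcome 2 is implicitly all zeros

    // x = (intercept, male, business, punish, explain)
    public static double[] probs(double[] x) {
        double[] exps = new double[3];
        double sum = 0.0;
        for (int c = 0; c < 3; ++c) {
            double dot = 0.0;
            if (c < 2)
                for (int i = 0; i < x.length; ++i)
                    dot += BETAS[c][i] * x[i];
            exps[c] = Math.exp(dot);  // exp(0) = 1 for the final outcome
            sum += exps[c];
        }
        for (int c = 0; c < 3; ++c)
            exps[c] /= sum;           // normalize to probabilities
        return exps;
    }
}
```

The same routine may be applied to any input vector of the right dimensionality, such as the women's vector (1,0,1,3,0).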
Code Walk Through
The code is all in the main()
method. The
estimation is done with the following one-liner, with
the imports from the stats
package listed:
import com.aliasi.stats.AnnealingSchedule;
import com.aliasi.stats.LogisticRegression;
import com.aliasi.stats.RegressionPrior;

public static void main(String[] args) {
    LogisticRegression regression
        = LogisticRegression.estimate(INPUTS,
                                      OUTPUTS,
                                      RegressionPrior.noninformative(),
                                      AnnealingSchedule.inverse(.05,100),
                                      null,        // null reporter
                                      0.000000001, // min improve
                                      1,           // min epochs
                                      10000);      // max epochs
The parameters to the estimate() method involve the inputs and outputs, the model prior hyperparameter, parameters to control the search, and a progress monitor parameter.
Input Vectors and Output Categories
The first two arguments to the estimate() method are just the parallel arrays of input vectors and output categories; these were defined in the previous section.
Prior Hyperparameter
The hyperparameter controlling the fit of the model is the prior. The priors are defined in the stats.RegressionPrior class. This example uses a so-called noninformative prior, the upshot of which is that the estimate will be the maximum likelihood estimate (the parameters that assign the highest likelihood to the entire corpus of training data). We consider other priors in the next section.
Search Parameters
The remaining parameters all control the search for the estimate. Logistic regression has no analytic solution, so estimating parameters from data requires numerical optimization. LingPipe employs stochastic gradient descent (SGD), a general, highly-scalable online optimization algorithm. SGD makes several passes through the data, adjusting the parameters a little bit based on examples one at a time.
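The per-example adjustment is easy to see in the binary case. The following is a minimal sketch of the idea, not LingPipe's actual implementation (the class name and the maximum-likelihood-only objective are simplifications): each visit to an example nudges the coefficients along the gradient of that example's log likelihood, scaled by the learning rate.

```java
// Minimal sketch of SGD for binary logistic regression under
// maximum likelihood; illustrative, not LingPipe's implementation.
public class SgdSketch {
    // xs: input vectors; ys: 0/1 outcomes; rate: fixed learning rate.
    public static double[] fit(double[][] xs, int[] ys,
                               double rate, int epochs) {
        double[] beta = new double[xs[0].length];
        for (int t = 0; t < epochs; ++t) {
            for (int n = 0; n < xs.length; ++n) {
                double dot = 0.0;
                for (int i = 0; i < beta.length; ++i)
                    dot += beta[i] * xs[n][i];
                double p = 1.0 / (1.0 + Math.exp(-dot)); // p(y=1|x)
                double err = ys[n] - p;                  // log-likelihood gradient term
                for (int i = 0; i < beta.length; ++i)
                    beta[i] += rate * err * xs[n][i];    // small per-example adjustment
            }
        }
        return beta;
    }
}
```

A real implementation would also anneal the learning rate and fold in the gradient of the log prior, as described below.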
The first search parameter is the annealing schedule. Simulated annealing is a widely used technique in numerical optimization. It involves starting with large learning rates and gradually reducing the activity of the learner over time. The annealing schedule used in this demo is an inverse schedule, meaning that the learning rate decays inversely with the number of epochs. The parameters 0.05 and 100 are the initial learning rate and the rate at which it is annealed. There is more information about annealing in the class documentation for stats.AnnealingSchedule.
The second search parameter, 0.000000001, indicates how tight the estimate must be before stopping the search. This is measured in relative corpus log likelihood: if the corpus log likelihood in an epoch (a run through all the input/output pairs) improves by less than this relative amount, the search is terminated.
The third and fourth search parameters, 1 and 10000, indicate the minimum and maximum number of times each training example is visited.
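The stopping test just described can be sketched as a relative difference between successive corpus log likelihoods. The exact bookkeeping below is an assumption for illustration, not LingPipe's code:

```java
// Illustrative relative-improvement convergence test; the exact
// formula LingPipe uses is not reproduced here.
public class Convergence {
    // Returns true if the change between successive corpus log
    // likelihoods is small relative to their magnitudes.
    public static boolean converged(double lastLL, double currentLL,
                                    double minImprove) {
        double relativeDiff = Math.abs(currentLL - lastLL)
            / (Math.abs(currentLL) + Math.abs(lastLL));
        return relativeDiff < minImprove;
    }
}
```

The minimum-epochs parameter prevents this test from firing too early, before the learning rate has had a chance to anneal.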
Progress Monitor Parameter
The null parameter may optionally be replaced by a com.aliasi.io.Reporter to which feedback about the progress of the search will be printed. A standard value would be a reporter created with com.aliasi.io.Reporters.stdOut() to print to standard output.
Applying a Trained Model
Once a regression model is trained, it may be used to probabilistically classify new vectors of the same dimensionality as the training data.
The sample code in the wallet problem goes on to classify some randomly generated data.
...
Input Vector         Outcome Conditional Probabilities
1.0 0.0 0.0 1.0 1.0  p(0|input)=0.02 p(1|input)=0.13 p(2|input)=0.86
1.0 0.0 1.0 0.0 0.0  p(0|input)=0.07 p(1|input)=0.28 p(2|input)=0.66
1.0 0.0 1.0 3.0 1.0  p(0|input)=0.28 p(1|input)=0.18 p(2|input)=0.54
The third input represents a female business student who was physically punished through high school and had her punishment explained. The model predicts she is 28 percent likely to keep the wallet and money, and only 54 percent likely to return both.
The code to compute the outcome probabilities just feeds the input vectors to the regression model to produce an array of output conditional probabilities (omitting some of the print statements):
for (Vector testCase : TEST_INPUTS) {
    double[] conditionalProbs = regression.classify(testCase);
    for (int i = 0; i < testCase.numDimensions(); ++i)
        System.out.printf("%3.1f ",testCase.value(i));
    for (int k = 0; k < conditionalProbs.length; ++k)
        System.out.printf(" p(%d|input)=%4.2f ",k,conditionalProbs[k]);
}
The variable TEST_INPUTS is an array of vector objects of the same format as the training inputs array. The key method call is regression.classify(testCase), which applies the trained regression model to a test case. The rest just goes through the output and prints it in a readable fashion.
Regularization with Priors
Regression models have a tendency to overfit their training data, so priors are introduced to control the complexity of the fitted model.
The Overfitting Problem
Problems with Maximum Likelihood
Logistic regression models with large numbers of features and limited amounts of training data are highly prone to overfitting under maximum likelihood estimation. A model is called "overfit" if it is so closely tailored to the training data that it does not generalize well to new data. The maximum likelihood estimation procedure is at the root of the problem, because it simply fits the training data as tightly as possible.
Linearly Separable Problems
A particularly pathological case of overfitting is when the data is linearly separable. A simple case is when a feature value (dimension of the input) is positive if and only if the instance belongs to a single output category. For instance, in a study of 195 students, it might have turned out that every male kept the wallet and money. In this case, the coefficient for outcome 0 for the male feature will be unbounded; making it larger always increases the likelihood of the training data.
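The divergence is easy to see numerically. The following sketch uses a hypothetical one-feature binary problem (one positive example at x=1, one negative at x=-1, a simpler separable data set than the wallet example); the training log likelihood keeps rising as the coefficient grows, so maximum likelihood never settles:

```java
// Illustrative sketch of why separable data is pathological under
// maximum likelihood: the likelihood increases without bound in beta.
public class SeparableSketch {
    // Log likelihood of y=1 at x=1 and y=0 at x=-1 under a
    // one-coefficient binary logistic regression model.
    public static double logLik(double beta) {
        double p1 = 1.0 / (1.0 + Math.exp(-beta * 1.0));        // p(y=1|x=1)
        double p0 = 1.0 - 1.0 / (1.0 + Math.exp(-beta * -1.0)); // p(y=0|x=-1)
        return Math.log(p1) + Math.log(p0);
    }
}
```

Doubling the coefficient always improves the fit, so the search only terminates when the learning rate anneals away or the epoch limit is hit.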
Priors on Coefficients
To compensate for the tendency of regression models to overfit, it is common to establish prior expectations for the values of parameters. These prior densities are designed to favor simple models. Simplicity for regression models means small regression coefficients, so the priors tend to concentrate parameters around zero. With smaller coefficients, the change in probability for a given change in an input dimension is less and thus the overall estimate is less variable.
Varieties of Priors
LingPipe implements three priors for regression: Cauchy (Student-t with one degree of freedom), Gaussian (normal), and Laplace (double exponential). The priors are listed here in order of how fat their tails are. The Cauchy distribution is so dispersed, in fact, that it does not have a finite mean or variance. The Laplace distribution is so peaked around its mean that it tends to drive most posterior coefficient estimates to its mean.
Because we wish to push coefficients toward zero, we only consider priors with mean (or median, in the case of the Cauchy) zero. The variance (or scale, in the case of the Cauchy) determines how dispersed the distribution is; the fatness of the tails relative to the rest of the distribution is fixed by the choice of distribution family.
Priors with means of zero exert a shrinkage effect on parameters relative to maximum likelihood estimates. Applying priors is thus sometimes called "shrinkage".
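The differences among the three families show up in the gradient of the log prior density, which is the "pull" toward zero a coefficient feels during estimation. The following sketch shows the characteristic shapes; it is a simplification for illustration and ignores the exact parameterization of LingPipe's RegressionPrior class:

```java
// Gradient of the log prior density with respect to a single
// coefficient, for each prior family; illustrative only.
public class PriorGradients {
    // Gaussian N(0, variance): pull grows linearly with beta.
    public static double gaussian(double beta, double variance) {
        return -beta / variance;
    }
    // Laplace with scale b: constant pull toward zero, which is
    // what drives small coefficients exactly to zero.
    public static double laplace(double beta, double scale) {
        return -Math.signum(beta) / scale;
    }
    // Cauchy with scale lambda: pull fades for large beta, so big
    // coefficients are barely shrunk.
    public static double cauchy(double beta, double scale) {
        return -2.0 * beta / (scale * scale + beta * beta);
    }
}
```

These shapes explain the regularization demo below: the Laplace prior zeroes out small coefficients, while the Cauchy leaves large coefficients (like the punish coefficient) nearly untouched.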
Running the Demo
The ant target regularization demonstrates regularization with priors over the wallet data.
> ant regularization

VARIANCE=0.0010
Prior=LaplaceRegressionPrior(Variance=0.0010, noninformativeIntercept=true)
    0) -1.62,  0.00,  0.00,  0.00,  0.00,
    1) -0.88,  0.00,  0.00,  0.00,  0.00,
Prior=GaussianRegressionPrior(Variance=0.0010, noninformativeIntercept=true)
    0) -1.63, -0.00,  0.00,  0.01, -0.02,
    1) -0.84,  0.03,  0.01,  0.02,  0.01,
Prior=CauchyRegressionPrior(Scale=0.0010, noninformativeIntercept=true)
    0) -3.36,  0.00,  0.00,  1.08, -0.00,
    1) -0.88,  0.01,  0.00,  0.01,  0.00,
...
VARIANCE=0.512
Prior=LaplaceRegressionPrior(Variance=0.512, noninformativeIntercept=true)
    0) -3.00,  0.63,  0.57,  0.92, -0.91,
    1) -1.12,  0.88,  0.03,  0.06, -0.39,
Prior=GaussianRegressionPrior(Variance=0.512, noninformativeIntercept=true)
    0) -3.13,  0.75,  0.76,  0.96, -0.98,
    1) -1.23,  0.90,  0.28,  0.15, -0.51,
Prior=CauchyRegressionPrior(Scale=0.512, noninformativeIntercept=true)
    0) -3.14,  0.80,  0.77,  0.98, -1.12,
    1) -1.26,  0.96,  0.23,  0.16, -0.50,
...
VARIANCE=524.288
Prior=LaplaceRegressionPrior(Variance=524.288, noninformativeIntercept=true)
    0) -3.46,  1.24,  1.15,  1.08, -1.57,
    1) -1.27,  1.17,  0.42,  0.19, -0.78,
Prior=GaussianRegressionPrior(Variance=524.288, noninformativeIntercept=true)
    0) -3.48,  1.27,  1.17,  1.09, -1.60,
    1) -1.28,  1.18,  0.43,  0.20, -0.79,
Prior=CauchyRegressionPrior(Scale=524.288, noninformativeIntercept=true)
    0) -3.48,  1.27,  1.17,  1.09, -1.60,
    1) -1.28,  1.18,  0.43,  0.20, -0.79,
With very low prior variance, as shown in the first example with a prior variance of 0.001, the coefficients are driven close to zero in the posterior. As the variance increases, the results get closer and closer to the maximum likelihood estimates, with only the Laplace prior just barely shrinking a few parameters.
Also note that for a given variance (or scale for the Cauchy), the Cauchy exerts the least push toward zero and the Laplace the most push toward zero. In the natural language problems we consider in the next section, a fairly liberal Laplace prior still drives most posterior parameters to zero.
Code Walk Through
We return to the wallet example in a demo of the effects of regularization in src/RegularizationDemo.java.
The main() method just loops over variances, trying all the priors:
for (double variance = 0.001; variance <= 1000; variance *= 2.0) {
    System.out.println("\n\nVARIANCE=" + variance);
    evaluate(RegressionPrior.laplace(variance,true));
    evaluate(RegressionPrior.gaussian(variance,true));
    evaluate(RegressionPrior.cauchy(variance,true));
}
The evaluation program just fits a model and prints out the results, just as in the wallet example:
static void evaluate(RegressionPrior prior) {
    LogisticRegression regression
        = LogisticRegression.estimate(WalletProblem.INPUTS,
                                      WalletProblem.OUTPUTS,
                                      prior,
                                      AnnealingSchedule.inverse(.05,100),
                                      null,
                                      0.0000001,
                                      10,
                                      5000);
    Vector[] betas = regression.weightVectors();
    ...
Feature Extractors and Text Classification
As implemented in the LingPipe stats package, logistic regression operates over input vectors, integer outcomes, and arrays of conditional probabilities. This is the basic material required to implement a classifier that produces conditional probability classifications.
The Logistic Regression Classifier
Several classes are implicated in adapting the stats package logistic regression models to implementations of classifiers. First, a feature extractor is used to convert arbitrary objects into mappings from string-based features to values. Second, a symbol table converts these features into dimensions. Together, a feature extractor and symbol table support the conversion of arbitrary objects into vectors. Finally, another symbol table is used to convert the string-based category representations in the classification package into the integer representation required by the statistics package.
The class classify.LogisticRegressionClassifier handles all the details of this adaptation, as shown in the code examples below.
Running the Demo
There's a simple demo implementation of natural language classification based on the 4-newsgroup data distributed with LingPipe and discussed in the Topic Classification Tutorial. The data is the bodies of messages to four easily confusable newsgroups:
soc.religion.christian
talk.religion.misc
alt.atheism
misc.forsale
The demo is run with the ant target nl-topics:
> ant nl-topics
Reading data.
Num instances=250.
Permuting corpus.
EVALUATING FOLDS
Logistic Regression Progress Report
Number of dimensions=1462
Number of Outcomes=4
Number of Parameters=4386
Prior: LaplaceRegressionPrior(Variance=0.5, noninformativeIntercept=true)
Annealing Schedule=Exponential(initialLearningRate=0.0020, base=0.9975)
Minimum Epochs=100
Maximum Epochs=1000
Minimum Improvement Per Period=1.0E-7
Has Sparse Inputs=true
Has Informative Prior=true
...
The first part of the output reports back on some of the parameters set in the code. For instance, there are 1462 unique feature dimensions (yielding 4386 parameters over the three free categories), the prior is a Laplace prior with variance 0.5 and a noninformative prior on the intercept, the annealing schedule is exponential, and so on. The output then provides feedback on the epoch-by-epoch progress of the stochastic gradient descent algorithm used for estimation.
...
epoch=   0 lr=0.002000000 ll= -392.4239 lp=  -34.4435 llp= -426.8675 llp*= -426.8675  :00
epoch=   1 lr=0.001995000 ll= -342.0040 lp=  -43.1246 llp= -385.1286 llp*= -385.1286  :00
epoch=   2 lr=0.001990013 ll= -294.9343 lp=  -43.7030 llp= -338.6373 llp*= -338.6373  :00
epoch=   3 lr=0.001985037 ll= -249.8116 lp=  -44.6577 llp= -294.4693 llp*= -294.4693  :00
...
epoch= 997 lr=0.000164891 ll=  -53.4495 lp=  -54.3246 llp= -107.7740 llp*= -107.7740  :15
epoch= 998 lr=0.000164478 ll=  -53.4494 lp=  -54.3239 llp= -107.7732 llp*= -107.7732  :15
epoch= 999 lr=0.000164067 ll=  -53.4492 lp=  -54.3232 llp= -107.7725 llp*= -107.7725  :15
...
In each epoch, the algorithm visits every training instance and adjusts each coefficient based on the current model and the training instance. The reports indicate the epoch number, the learning rate for that epoch (lr), the log likelihood of the data in the model (ll), the log prior density of the current set of coefficients (lp), the sum of the two (llp) [note that this is just negative error], the best sum so far (llp*), and finally, the time, down to the second. In this case, estimation took 15 seconds, resulting in a log likelihood of -53.4 and a log prior of -54.3, for a combined objective of -107.8.
After the feedback on estimation, the demo program prints out the features by name and their coefficient weights. In this instance, features are alphabetic or numeric tokens (or the intercept). Here are the top positive and negative coefficients for each category, as well as some zero coefficients from the first category, alt.atheism:
CLASSIFIER & FEATURES

NUMBER OF CATEGORIES=4
NUMBER OF FEATURES=1462

CATEGORY=alt.atheism
  Jim                  0.542459
  some                 0.519327
  atheists             0.454521
  on                   0.346886
  they                 0.264611
  model                0.259512
  at                   0.233116
  is                   0.225087
  article              0.154355
  ICO                  0.153443
  TEK                  0.153443
  vice                 0.153192
  of                   0.137563
  The                  0.136769
  mcl                  0.134621
  timmbake             0.134621
  ...
  approach            -0.000000
  causes              -0.000000
  equally             -0.000000
  happen              -0.000000
  ignoring            -0.000000
  immediately         -0.000000
  later               -0.000000
  provided            -0.000000
  regard              -0.000000
  separate            -0.000000
  small               -0.000000
  sounds              -0.000000
  stand               -0.000000
  week                -0.000000
  willing             -0.000000
  women               -0.000000
  America             -0.000000
  cultural            -0.000000
  disciples           -0.000000
  speaking            -0.000000
  implied             -0.000000
  debate              -0.000000
  ...
  7                   -0.179450
  that                -0.237454
  2                   -0.239784
  do                  -0.269343
  Mormon              -0.291933
  very                -0.293706
  Christian           -0.339846
  ca                  -0.374287
  in                  -0.425005
  *&^INTERCEPT%$^&**  -0.947079
Here the name "Jim" is the most positively indicative of the alt.atheism topic, and the intercept is the most negative feature. For some reason the negative features include the words "in", "very", and "do", which seem unlikely features to discriminate atheism from other religious topics. This is a problem with unigram (single token) features -- they are often perplexing. Features like "Christian" may show up as negative because the alt.atheism board may refer less to Christians (or Mormons) as a class.
Compare the alt.atheism topic with the misc.forsale topic, where words like "PC" or "sale" are strongly positive and again words like "Mormon" are negative, with some perplexing entries like "that".
CATEGORY=misc.forsale
  for                  0.888025
  PC                   0.747638
  drive                0.700343
  or                   0.510954
  2                    0.377641
  edu                  0.282761
  300                  0.253986
  on                   0.251271
  would                0.194138
  sale                 0.147936
  00                   0.103214
  *&^INTERCEPT%$^&**   0.083822
  ...
  ca                  -0.000098
  Book                -0.058855
  the                 -0.084477
  In                  -0.108700
  of                  -0.120941
  Mormon              -0.246687
  to                  -0.461844
  that                -0.726345
  Re                  -0.883255
Finally, here are the top positive and negative features for soc.religion.christian:
CATEGORY=soc.religion.christian
  rutgers              1.068456
  life                 0.763778
  has                  0.700812
  May                  0.676530
  Mary                 0.565283
  athos                0.530967
  who                  0.354184
  doctrine             0.263468
  Orthodox             0.242234
  Trinity              0.193247
  s                    0.178808
  verses               0.142781
  ...
  NNTP                -0.026967
  we                  -0.046827
  edu                 -0.047974
  A                   -0.055054
  Mormon              -0.056953
  The                 -0.060776
  Robert              -0.077967
  the                 -0.079859
  were                -0.108695
  ca                  -0.120587
  it                  -0.156520
  you                 -0.157730
  Organization        -0.164869
  a                   -0.219583
  Christian           -0.294424
  Distribution        -0.389873
  *&^INTERCEPT%$^&**  -0.437988
  Posting             -0.470168
  Host                -0.486523
Oddly, this category has junk from the mail headers and signatures and what not, like "rutgers", "NNTP" and "Posting".
Variance of Coefficient Estimates
To get some feeling for the variability of the feature estimates, here are the top positive and negative features for the second fold of a four-way cross-validation of which the above reports the first fold:
CATEGORY=misc.forsale
  for                  1.215978
  drive                0.748811
  *&^INTERCEPT%$^&**   0.629428
  PC                   0.555030
  on                   0.525066
  sale                 0.515190
  2                    0.479129
  Host                 0.266548
  Posting              0.266545
  ...
  COM                 -0.000069
  been                -0.000102
  Sun                 -0.000180
  are                 -0.034722
  in                  -0.070332
  the                 -0.114678
  of                  -0.272626
  In                  -0.276476
  to                  -0.372242
  Re                  -0.459132
  that                -0.806237
Code Walkthrough
The code for generating this demo is in src/TextClassificationDemo.java. First, it builds up a corpus instance just as in the cross-validation demo in the topic classification tutorial:
public static void main(String[] args) throws Exception {
    ...
    PrintWriter progressWriter = new PrintWriter(System.out,true);
    int numFolds = 4;
    XValidatingObjectCorpus<Classified<CharSequence>> corpus
        = new XValidatingObjectCorpus<Classified<CharSequence>>(numFolds);
    ...
    corpus.permuteCorpus(new Random(7117)); // destroys runs of categories

    TokenizerFactory tokenizerFactory
        = new RegExTokenizerFactory("\\p{L}+|\\d+"); // letter+ | digit+
    FeatureExtractor<CharSequence> featureExtractor
        = new TokenFeatureExtractor(tokenizerFactory);
    int minFeatureCount = 5;
    boolean addInterceptFeature = true;
    boolean noninformativeIntercept = true;
    double priorVariance = 0.5;
    RegressionPrior prior
        = RegressionPrior.laplace(priorVariance,noninformativeIntercept);
    AnnealingSchedule annealingSchedule
        = AnnealingSchedule.exponential(0.002,0.9975);
    double minImprovement = 0.0000001;
    int minEpochs = 100;
    int maxEpochs = 1000;

    for (int fold = 0; fold < numFolds; ++fold) {
        corpus.setFold(fold);
        Reporter reporter = Reporters.writer(progressWriter);
        LogisticRegressionClassifier<CharSequence> classifier
            = LogisticRegressionClassifier.<CharSequence>train(corpus,
                                                               featureExtractor,
                                                               minFeatureCount,
                                                               addInterceptFeature,
                                                               prior,
                                                               annealingSchedule,
                                                               minImprovement,
                                                               minEpochs,
                                                               maxEpochs,
                                                               reporter);
        ...
The training method for the logistic regression classifier is almost identical to the static method for estimating logistic regression models in the stats package. The main difference is that the classifier requires an instance of util.FeatureExtractor. The feature extractor interface defines a single method:
public interface FeatureExtractor<E> {
    public Map<String,? extends Number> features(E in);
}
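To make the interface concrete, here is a hand-rolled token-count extractor in the same spirit as the adapter used in the demo. The class name TokenCounter and the inlined regular expression are illustrative stand-ins, not LingPipe classes:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative token-count feature extraction: maps a character
// sequence to counts of its letter or digit tokens. LingPipe's
// TokenFeatureExtractor plays this role in the demo.
public class TokenCounter {
    public static Map<String,Integer> features(CharSequence in) {
        Map<String,Integer> counts = new HashMap<String,Integer>();
        // crude stand-in for a tokenizer factory: letter+ or digit+ tokens
        java.util.regex.Matcher m =
            java.util.regex.Pattern.compile("\\p{L}+|\\d+").matcher(in);
        while (m.find()) {
            String token = m.group();
            Integer c = counts.get(token);
            counts.put(token, c == null ? 1 : c + 1);
        }
        return counts;
    }
}
```

Each distinct feature name is later assigned a dimension by a symbol table, turning the map into a sparse vector.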
In the code above, we use a pre-built adapter that converts a tokenizer factory into a feature extractor. The tokenizer.TokenFeatureExtractor is constructed with a tokenizer factory, which it then uses to tokenize the character sequences it receives. The resulting mapping is simply the count of each token in the input.
The features are printed out in order by the classifier itself:
...
progressWriter.println("\nCLASSIFIER & FEATURES\n");
progressWriter.println(classifier);
...
The evaluation is done in the usual way, by having the corpus walk the evaluator over the test section:
...
progressWriter.println("\nEVALUATION\n");
boolean storeInputs = false;
ConditionalClassifierEvaluator<CharSequence> evaluator
    = new ConditionalClassifierEvaluator<CharSequence>(classifier,
                                                       CATEGORIES,
                                                       storeInputs);
corpus.visitTest(evaluator);
progressWriter.printf("FOLD=%5d  ACC=%4.2f +/-%4.2f\n",
                      fold,
                      evaluator.confusionMatrix().totalAccuracy(),
                      evaluator.confusionMatrix().confidence95());
}
Discrete Choice Analysis
Discrete choice analysis models an agent's selection when presented with one or more alternative options. The alternatives are represented as vectors and the decision is probabilistically normalized using the multi-logit, just as for logistic regression. The vectors representing the alternatives may include information about the agent making the choice; their content is really up to the application.
DCA models are easy to construct directly or to estimate from a set of training data. Each training instance consists of one or more alternatives represented as vectors along with an indication of which choice was made.
Feature-Extraction for DCA
As for logistic regression, it is possible to represent choices among alternatives that are arbitrary objects. A user-supplied feature extractor is used to map the alternative objects to vectors.
DCA Examples
DCA Transportation Example
A common example is transportation, where choices might include walking, bicycling, taking a bus, taking a train, driving, etc. Each choice is represented as a set of features, such as time, cost, safety, and so on. The agent making the choice may be represented by further features, such as disposable income and impatience. A model then considers how to weight these features to represent choices.
DCA Query-Specific Document Ranking Example
Another application is to rank documents with respect to a query. If each document and query pair is represented by a feature vector, the choice probabilities may be used for ranking.
DCA Coreference Example
Coreference is essentially a clustering task on entity mentions in text. For instance, in Babe Ruth's Wikipedia entry (as downloaded on 12 March 2010 at 2:59 PM EST) the Yankee slugger is referred to with the names "Babe Ruth", "Ruth", and "the Babe", the pronoun "he", and the common noun phrase "the first true American sports celebrity superstar". On the other hand, in the same article, "Baby Ruth" refers to a candy bar, "The Babe Ruth Story" to a film, and "Babe Ruth Day" to a recurring event.
As a choice problem, coreference is often approached by working through a document in text order, making a linkage decision for each referential noun phrase. The noun phrase may be linked to any of the previously introduced entities or it may be the first mention of an entity. Obviously the first mention is a new entity. The second mention may either refer to the same entity as the first mention, or may introduce a new entity.
These options may be represented as alternative choices, with features such as gender, distance, number of intervening noun phrases, similarity of phrasing such as token overlap, edit distance, etc., syntactic structure, discourse parallelism, semantic knowledge of common nouns, and so on.
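As a sketch of how such alternatives might be encoded, each candidate antecedent can be mapped to a small real-valued feature vector. The feature set and all names below are hypothetical illustrations, not part of LingPipe's coreference machinery.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical encoding of a coreference linkage alternative as a feature
// vector: token overlap, a distance penalty, and an exact-match indicator.
public class CorefFeatureSketch {

    // fraction of the mention's tokens that also occur in the antecedent
    static double tokenOverlap(String mention, String antecedent) {
        Set<String> antTokens
            = new HashSet<String>(Arrays.asList(antecedent.toLowerCase().split("\\s+")));
        String[] toks = mention.toLowerCase().split("\\s+");
        int shared = 0;
        for (String t : toks)
            if (antTokens.contains(t)) ++shared;
        return shared / (double) toks.length;
    }

    // one alternative = one feature vector for the chooser
    static double[] encode(String mention, String antecedent, int mentionsBetween) {
        return new double[] {
            tokenOverlap(mention, antecedent),
            -Math.log(1 + mentionsBetween), // nearer antecedents score higher
            mention.equalsIgnoreCase(antecedent) ? 1.0 : 0.0
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode("Ruth", "Babe Ruth", 3)));
    }
}
```

A first-mention ("new entity") alternative would get its own vector, so that starting a new cluster competes on equal footing with every linkage.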
Cross-document coreference is often approached the same way, as a sequential choice problem. In this case, you'd want to use different features than for within-document coreference. You can use information such as string overlap, as well as meta information such document source, date, author, language, etc.
Mathematical Definition
A K-dimensional discrete choice model is parameterized by a single K-dimensional real-valued vector β. A set of N alternatives is represented by N K-dimensional vectors α[1],...,α[N], one for each alternative.
DCA is normalized in the same way as logistic regression, with the multi-logit transform. The multi-logit takes a vector of arbitrary real numbers and converts them into probabilities that sum to 1.0. This is done by simply exponentiating and normalizing. Thus the probability of choosing alternative n is proportional to exp(β'α[n]) (where x' is the transpose of an implicit column vector x). In symbols,

p(n|α,β) ∝ exp(β'α[n]),

where normalization is computed by the usual summation,

p(n|α,β) = exp(β'α[n]) / Σ_{m in 1:N} exp(β'α[m]).
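The exponentiate-and-normalize step can be sketched in plain Java, independently of LingPipe's own implementation. Subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the resulting probabilities.

```java
import java.util.Arrays;

// Multi-logit over linear scores β'α[n]: exponentiate and normalize.
public class MultiLogit {

    static double dot(double[] beta, double[] alpha) {
        double sum = 0.0;
        for (int i = 0; i < beta.length; ++i)
            sum += beta[i] * alpha[i];
        return sum;
    }

    static double[] choiceProbs(double[] beta, double[][] alternatives) {
        double[] scores = new double[alternatives.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int n = 0; n < alternatives.length; ++n) {
            scores[n] = dot(beta, alternatives[n]);
            max = Math.max(max, scores[n]);
        }
        double z = 0.0;
        for (int n = 0; n < scores.length; ++n) {
            scores[n] = Math.exp(scores[n] - max); // stable exponentiation
            z += scores[n];
        }
        for (int n = 0; n < scores.length; ++n)
            scores[n] /= z; // normalize so the probabilities sum to 1.0
        return scores;
    }

    public static void main(String[] args) {
        double[] beta = { 3.0, -2.0, 1.0 }; // the demo's simulated coefficients
        double[][] alts = { { 1, 0, 0 }, { 0, 1, 0 }, { 0, 0, 1 } };
        System.out.println(Arrays.toString(choiceProbs(beta, alts)));
    }
}
```
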
Stochastic Gradient MAP Estimator for DCA
We use the same set of tools to fit DCA models as logistic regression models. Specifically, fitting is carried out via stochastic gradient descent using a specified learning-rate annealing schedule. Priors may be placed on the coefficients, resulting in maximum a posteriori (MAP) estimates. Convergence may be monitored to within a certain tolerance, or the number of training epochs may be specified directly. A reporter may be configured and passed into the fitter to monitor the fit on an epoch-by-epoch basis.
DCA Simulation and API Demo
A good strategy for testing an estimator is to simulate a data set according to the model and see if the estimator finds the model parameters.
Here, we take a very simple model with β = (3,-2,1).
For instance, this might be a very simple model for proper name coreference
where dimension 1 represents string similarity, dimension 2 represents distance,
and dimension 3 represents discourse parallelism. Thus better choices have
higher string similarity, lower distance, and higher parallelism.
Running the Demo
The demo may be run directly from ant:
> cd $LINGPIPE/demos/tutorial/logistic-regression
> ant dca
The first output dumps the training data, consisting of 1000 samples randomly generated according to the model (we'll see the code shortly):
DCA Demo

Sample 0 random choice prob=0.011066323782317866
  0 p=0.0010942559149001196 xs= 0=0.5537208017939629 1=0.26016864313905447 2=-1.7522309679972126
* 1 p=0.9912422418199297 xs= 0=-0.20898887445894693 1=-1.8909983320493842 2=3.0424484541278125
  2 p=6.61481706344221E-7 xs= 0=-2.3236185850552404 1=-0.40191224523950836 2=-1.8554774163187067
  3 p=0.007342638209775382 xs= 0=-0.8585314472003086 1=-2.6506128889054077 2=-1.4334136804432425
  4 p=2.5835112077154026E-4 xs= 0=-1.4926364565738457 1=-0.9874988045707495 2=0.44799562310696617
  5 p=6.185145291665587E-5 xs= 0=-1.3091994188446996 1=-1.2314203046062415 2=-2.0197424935306385

Sample 1 random choice prob=0.5716203055299767
  0 p=0.4774808501121091 xs= 0=-0.34658066612216726 1=-3.7171702891729055 2=2.847613891265687
  1 p=5.3291162894215095E-6 xs= 0=-2.727452048150046 1=-3.929332197507756 2=-1.8371896878683784
  2 p=5.631246738398885E-9 xs= 0=-5.097774786769612 1=-3.261921315683861 2=-0.24400954923978324
* 3 p=0.5194675028142764 xs= 0=2.578318143603154 1=-0.5382777113119868 2=0.5149828170181778
  4 p=0.003046289030927008 xs= 0=-0.6398287520090654 1=-3.5369997185026127 2=-0.9669006257184916
  5 p=2.294939187015502E-8 xs= 0=-1.0199809307787397 1=2.233346753932399 2=-0.08189440303456576
  6 p=3.45760010680944E-10 xs= 0=-2.215343033840159 1=3.724642835294389 2=2.291482275418305

Sample 2 random choice prob=0.6571403489348165
  0 p=0.0033735721977343185 xs= 0=-2.1173544097861843 1=-0.9525961733238733 2=-1.5757085479755961
  1 p=0.0014417439480520734 xs= 0=-2.809800209893541 1=-1.313051420770636 2=-1.069400368972158
* 2 p=0.9951846838542135 xs= 0=-1.6884130661052648 1=-1.43110875271501 2=1.8673984117586233
...
Sample 998 random choice prob=0.7185102111403285
* 0 p=0.9999870403750512 xs= 0=4.94972268738353 1=-0.9018199885770234 2=1.0128061582237957
  1 p=1.2385892859586818E-5 xs= 0=0.7649756068129245 1=-0.95647020022203 2=2.1588075307465373
  2 p=3.807341515916061E-7 xs= 0=1.540848982493464 1=0.019785197111358486 2=-1.69851386624265
  3 p=1.8148517752607114E-7 xs= 0=-0.1960515695662867 1=-1.8262256057851793 2=-0.9207611987335641
  4 p=3.9743816403578186E-12 xs= 0=-2.7740831520265568 1=-0.42268913434325006 2=-1.108653597466911
  5 p=1.150878487520586E-8 xs= 0=-0.4714941289898704 1=-2.440351927961609 2=-4.080749502726419

Sample 999 random choice prob=0.979861378724821
* 0 p=0.9938278009515112 xs= 0=2.1921160621760403 1=-1.8106535867532818 2=0.5622076284200712
  1 p=8.531431715509726E-7 xs= 0=0.3150259443121377 1=0.30802222712621374 2=-3.537317523230222
  2 p=1.9513289313863527E-7 xs= 0=0.568094267556212 1=3.425238707385209 2=0.46266391914347
  3 p=0.006144525436311744 xs= 0=0.9501447084912402 1=-1.1087942598824052 2=0.6058379024682905
  4 p=2.1205752876833113E-5 xs= 0=-2.6827568717213035 1=-2.9647540869667397 2=2.123578704514057
  5 p=4.19065365895493E-7 xs= 0=-0.025821863917245638 1=1.5799428062033578 2=-0.681833406242208
  6 p=5.000517302099065E-6 xs= 0=-0.653631008901344 1=1.243981268705386 2=3.008940687919986
  7 p=5.67485413433939E-13 xs= 0=-2.5063853494611035 1=4.837516266561886 2=-0.23731845331128532
...
The training samples have varying numbers of alternatives. For instance, sample 0 and sample 998 have six alternatives, whereas sample 2 only has three alternatives. Each alternative is represented by a vector, following xs=. For instance, the first alternative in sample 0 is roughly (0.55,0.26,-1.75), and the second alternative is represented by the vector (-0.21,-1.89,3.04). Each alternative for a sample is preceded by its probability in the model computed according to the formula above.
Each training example simulates a choice by generating a random number between 0 and 1, shown after the sample number. For instance, sample 1 generated a random number of roughly 0.572. This resulted in alternative 3 being chosen. But alternative 3 only had a 0.52 chance of being chosen; had the random number been lower than alternative 0's probability of roughly 0.48, alternative 0 would have been selected instead. These choices make up the training data.
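In plain Java, this style of sampling amounts to walking the cumulative probability until it exceeds the uniform draw. The class and method names below are illustrative, not part of LingPipe.

```java
// Sample an index from a discrete distribution by accumulating probabilities
// until the cumulative sum exceeds a uniform draw in [0,1).
public class CumulativeSampler {

    static int sample(double[] probs, double uniformDraw) {
        double cumProb = 0.0;
        for (int k = 0; k < probs.length; ++k) {
            cumProb += probs[k];
            if (uniformDraw < cumProb)
                return k;
        }
        return probs.length - 1; // guard against floating-point shortfall
    }

    public static void main(String[] args) {
        // mirrors sample 1's situation: draw 0.572 exceeds the first
        // alternative's 0.48 probability, so a later alternative is chosen
        double[] probs = { 0.48, 0.0, 0.0, 0.52 };
        System.out.println(sample(probs, 0.572)); // prints 3
    }
}
```
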
The next piece of output reports the estimation parameters:
...
:00 estimate()
:00 # training cases=1000
:00 regression prior=GaussianRegressionPrior(Variance=4.0, noninformativeIntercept=true)
:00 annealing schedule=Exponential(initialLearningRate=0.1, base=0.99)
:00 min improvement=1.0E-5
:00 min epochs=5
:00 max epochs=500
We have a fairly low prior variance and use an exponentially decaying learning rate that decays fairly quickly. We have specified it to stop after relative convergence to within 0.00001, or about five decimal places.
Then, like logistic regression, we get an epoch-by-epoch report of the gradient descent:
:00 epoch= 0 lr=0.100000000 ll= -229.2331 lp= -3.7495 llp= -232.9826 llp*= -232.9826
:00 epoch= 1 lr=0.099000000 ll= -230.3207 lp= -3.8286 llp= -234.1493 llp*= -232.9826
:00 epoch= 2 lr=0.098010000 ll= -230.6148 lp= -3.8482 llp= -234.4630 llp*= -232.9826
:00 epoch= 3 lr=0.097029900 ll= -230.5728 lp= -3.8521 llp= -234.4249 llp*= -232.9826
:00 epoch= 4 lr=0.096059601 ll= -230.4288 lp= -3.8516 llp= -234.2804 llp*= -232.9826
:00 epoch= 5 lr=0.095099005 ll= -230.2566 lp= -3.8498 llp= -234.1064 llp*= -232.9826
:00 epoch= 6 lr=0.094148015 ll= -230.0782 lp= -3.8476 llp= -233.9258 llp*= -232.9826
:00 epoch= 7 lr=0.093206535 ll= -229.9002 lp= -3.8453 llp= -233.7455 llp*= -232.9826
:00 epoch= 8 lr=0.092274469 ll= -229.7245 lp= -3.8430 llp= -233.5674 llp*= -232.9826
:00 epoch= 9 lr=0.091351725 ll= -229.5515 lp= -3.8407 llp= -233.3922 llp*= -232.9826
:00 epoch= 10 lr=0.090438208 ll= -229.3815 lp= -3.8385 llp= -233.2200 llp*= -232.9826
:00 epoch= 11 lr=0.089533825 ll= -229.2144 lp= -3.8362 llp= -233.0506 llp*= -232.9826
:00 epoch= 12 lr=0.088638487 ll= -229.0501 lp= -3.8340 llp= -232.8841 llp*= -232.8841
:00 epoch= 13 lr=0.087752102 ll= -228.8885 lp= -3.8319 llp= -232.7204 llp*= -232.7204
:00 epoch= 14 lr=0.086874581 ll= -228.7297 lp= -3.8297 llp= -232.5594 llp*= -232.5594
...
:03 epoch= 252 lr=0.007944545 ll= -217.6262 lp= -3.6695 llp= -221.2957 llp*= -221.2957
:03 epoch= 253 lr=0.007865100 ll= -217.6224 lp= -3.6693 llp= -221.2917 llp*= -221.2917
:03 epoch= 254 lr=0.007786449 ll= -217.6186 lp= -3.6692 llp= -221.2878 llp*= -221.2878
:03 epoch= 255 lr=0.007708584 ll= -217.6149 lp= -3.6691 llp= -221.2840 llp*= -221.2840
:03 epoch= 256 lr=0.007631498 ll= -217.6113 lp= -3.6689 llp= -221.2802 llp*= -221.2802
:03 epoch= 257 lr=0.007555183 ll= -217.6077 lp= -3.6688 llp= -221.2765 llp*= -221.2765
:03 epoch= 258 lr=0.007479632 ll= -217.6042 lp= -3.6687 llp= -221.2729 llp*= -221.2729
:03 Converged with rollingAverageRelativeDiff=9.894823299509656E-6
...
We see that the estimate converged to within our specified bound after 259 epochs (we count from 0). The columns are the same as for logistic regression: first the time (here only 3 seconds total, including all the output), then the learning rate for the epoch (note the steep decline from epoch 0 to epoch 258), the log likelihood of the training data given the model parameters, the log prior value (note that the likelihood dominates the prior here, with only 3 parameters), and the sum of log likelihood and log prior, followed by the best log likelihood and prior found so far.
Finally, we see how well the estimator recovered the model parameters:
ACTUAL coeffs= 0=3.0 1=-2.0 2=1.0
   FIT coeffs= 0=3.017291072658265 1=-1.9885202732810667 2=1.0845623307943484
That's very close given 1000 training examples, so we can see that the API is doing the right thing.
DCA Code Walkthrough
The DCA code is in src/Dca.java
.
Simulating the Training Data
Our first job is to simulate those training examples. We actually do this by first constructing a discrete chooser with our simulated coefficient vector:
public static void main(String[] args) {
    double[] simCoeffs = new double[] { 3.0, -2.0, 1.0 };
    Vector simCoeffVector = new DenseVector(simCoeffs);
    DiscreteChooser simChooser = new DiscreteChooser(simCoeffVector);
    ...
Next, we generate the alternatives using a fixed random number generator (you can change it and other parameters to get different training data). First, we create a two-dimensional array of vectors to represent the alternative choices and a parallel one-dimensional array of integers to represent the outcomes.
...
int numDims = simCoeffs.length;
int numSamples = 1000;
Random random = new Random(42);
Vector[][] alternativess = new Vector[numSamples][];
int[] choices = new int[numSamples];
...

Next, we iterate over the samples, first generating a random number of choices between 1 and 8 and creating a vector to hold them in the alternatives array:
...
for (int i = 0; i < numSamples; ++i) {
    int numChoices = 1 + random.nextInt(8);
    alternativess[i] = new Vector[numChoices];
    ...
Next, we generate a value for each dimension of each alternative by sampling from a Gaussian distribution with mean 0 and standard deviation 2. Finally, we set the alternative to a dense vector with those coefficients:
...
for (int k = 0; k < numChoices; ++k) {
    double[] xs = new double[numDims];
    for (int d = 0; d < numDims; ++d) {
        xs[d] = 2.0 * random.nextGaussian();
    }
    alternativess[i][k] = new DenseVector(xs);
}
...
Next, we extract the choice probabilities from the simulated
discrete choice model, simChooser
. We then sample an
outcome in the usual way by generating a random number between
0 and 1 (choiceProb
). We take the first outcome where
the cumulative probability is above that value.
double[] choiceProbs = simChooser.choiceProbs(alternativess[i]);
double choiceProb = random.nextDouble();
double cumProb = 0.0;
for (int k = 0; k < numChoices; ++k) {
    cumProb += choiceProbs[k];
    if (choiceProb < cumProb || k == (numChoices - 1)) {
        choices[i] = k;
        break;
    }
}
System.out.println("\nSample " + i + " random choice prob=" + choiceProb);
for (int k = 0; k < numChoices; ++k) {
    System.out.println((choices[i] == k ? "* " : "  ")
                       + k + " p=" + choiceProbs[k]
                       + " xs=" + alternativess[i][k]);
}
}
This time we've left in the print in order to illustrate the working of the choices array, the choice probabilities, and the alternatives vector.
Estimating a DCA Model from Data
The rest of the demo consists of creating the estimator parameters and then running it over the training data. The parameters are set up just as in logistic regression, with:
...
double priorVariance = 4.0;
boolean nonInformativeIntercept = true;
RegressionPrior prior
    = RegressionPrior.gaussian(priorVariance,nonInformativeIntercept);
int priorBlockSize = 100;

double initialLearningRate = 0.1;
double decayBase = 0.99;
AnnealingSchedule annealingSchedule
    = AnnealingSchedule.exponential(initialLearningRate,decayBase);

double minImprovement = 0.00001;
int minEpochs = 5;
int maxEpochs = 500;

Reporter reporter = Reporters.stdOut().setLevel(LogLevel.DEBUG);
...
Then we call the static estimate method on the dca.DiscreteChooser
class:
...
DiscreteChooser chooser
    = DiscreteChooser.estimate(alternativess,
                               choices,
                               prior,
                               priorBlockSize,
                               annealingSchedule,
                               minImprovement,
                               minEpochs,
                               maxEpochs,
                               reporter);
...
The estimated model is then assigned to the variable chooser
.
Its parameters are printed along with the simulated value with:
...
Vector coeffVector = chooser.coefficients();
System.out.println("\nACTUAL coeffs=" + simCoeffVector);
System.out.println("FIT coeffs=" + coeffVector);
}
Serialization
A discrete chooser may be serialized.
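Assuming DiscreteChooser implements java.io.Serializable (as LingPipe models conventionally do), a standard Java serialization round-trip applies. The generic helper below sketches the pattern; it is demonstrated with a plain String so the example stands alone without the LingPipe jar.

```java
import java.io.*;

// Generic serialize/deserialize round-trip via standard Java serialization.
// Any Serializable object (e.g. a trained chooser) can be passed through it.
public class SerializationSketch {

    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T obj)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        ObjectOutputStream objOut = new ObjectOutputStream(bytesOut);
        objOut.writeObject(obj); // write the object to the byte stream
        objOut.close();
        ObjectInputStream objIn
            = new ObjectInputStream(new ByteArrayInputStream(bytesOut.toByteArray()));
        return (T) objIn.readObject(); // read an equivalent copy back
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("discrete chooser stand-in"));
    }
}
```

In practice the byte streams would be file or network streams, so a model trained once can be reloaded without re-estimation.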
Training with Objects and Feature Extractors
Object-based training data replaces the alternative vectors in the two-dimensional array alternativess with a two-dimensional array of objects and a feature extractor. A symbol table is used to convert the training data into vectors, and then the basic DCA estimator is called, again, just like for logistic regression.
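The symbol-table step can be sketched in plain Java; the class and method names here are illustrative, not LingPipe's API. Each feature name is assigned a stable integer dimension, so a feature map produced by an extractor becomes a dense vector.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative symbol table: maps feature names to dimensions so that
// feature maps from an extractor can be converted to dense vectors.
public class SymbolTableSketch {

    private final Map<String,Integer> symbolToId = new HashMap<String,Integer>();

    // return the feature's dimension, assigning a fresh one if unseen
    int idFor(String symbol) {
        Integer id = symbolToId.get(symbol);
        if (id == null) {
            id = symbolToId.size();
            symbolToId.put(symbol, id);
        }
        return id;
    }

    // convert a feature map to a dense vector; numDims must be at least
    // as large as the total number of distinct features seen
    double[] toVector(Map<String,? extends Number> features, int numDims) {
        double[] xs = new double[numDims];
        for (Map.Entry<String,? extends Number> e : features.entrySet())
            xs[idFor(e.getKey())] = e.getValue().doubleValue();
        return xs;
    }

    public static void main(String[] args) {
        SymbolTableSketch table = new SymbolTableSketch();
        Map<String,Integer> feats = new HashMap<String,Integer>();
        feats.put("overlap", 1);
        System.out.println(java.util.Arrays.toString(table.toVector(feats, 3)));
    }
}
```

Sharing one symbol table across all alternatives and training samples is what keeps a given feature in the same dimension everywhere, which the estimator relies on.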