public class SvdMatrix extends AbstractMatrix
SvdMatrix
provides a means of storing a matrix that has been factored via a singular value decomposition (SVD). This class also provides a static method for computing regularized singular value decompositions of partial matrices.
Singular value decomposition (SVD) factors an
m×n
matrix A
into a product of three matrices:

A = U * S * V^{T}

where
U
is an m×k
matrix,
V
is an n×k
matrix, and
S
is a k×k
matrix, where
k
is the rank of the matrix A
. The
multiplication (*
) is matrix multiplication and
the superscripted T
indicates matrix transposition.
The m
dimensional vectors making up the columns of
U
are called left singular vectors, whereas
the n dimensional vectors making up the columns of
V
are called right singular vectors. The values on
the diagonal of S
are called the singular values.
The combination of the q-th left singular vector, q-th right singular vector, and q-th singular value is known as the q-th factor.
The singular value matrix S
is a diagonal matrix
with positive, strictly nonincreasing values, so that
S[i][i] >= S[i+1][i+1]
, for i < k
.
The set of left and set of right singular vectors are orthonormal.
Normality means that each singular vector is of unit length (length
1
). Orthogonality means that any pair of distinct left singular vectors is orthogonal and any pair of distinct right singular vectors is orthogonal (meaning their dot product is 0).
Matrices have unique singular value decompositions up to the reordering of columns with equal singular values and up to matching sign changes in corresponding left and right singular vectors.
SvdMatrix
may be constructed out of the singular
vectors and singular values, or out of the vectors with singular
values scaled in.
Given that S
is diagonal, the value of a particular
entry in A
works out to:

A[i][j] = Σ_{k} U[i][k] * S[k][k] * V[j][k]

To save time in the application and space in the class, we factor
S
into U
and V
to produce a new pair of matrices U'
and V'
defined by:

U' = U * sqrt(S)
V'^{T} = sqrt(S) * V^{T}

with the square root performed componentwise:

sqrt(S)[i][j] = sqrt(S[i][j])

By the associativity of matrix multiplication, we have:

U * S * V^{T} = U * sqrt(S) * sqrt(S) * V^{T} = U' * V'^{T}
Thus the class implementation is able to store U' and V', reducing the amount of computation involved in returning a value (using column vectors as the default vector orientation):
A[i][j] = U'[i]^{T} * V'[j]
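This reconstruction can be sketched as a standalone Java method (an illustration of the formula above, not the class's actual implementation; the class and method names are hypothetical):

```java
// Sketch: reconstruct A[i][j] from the scaled singular vectors U' and V',
// where uPrime[i] and vPrime[j] are k-dimensional rows.
public class ValueSketch {
    static double value(double[][] uPrime, double[][] vPrime, int i, int j) {
        double sum = 0.0;
        for (int k = 0; k < uPrime[i].length; ++k)
            sum += uPrime[i][k] * vPrime[j][k];  // dot product U'[i] . V'[j]
        return sum;
    }

    public static void main(String[] args) {
        // rank-1 example with the singular value folded into the vectors
        double[][] uPrime = { {2.0}, {3.0} };  // m = 2, k = 1
        double[][] vPrime = { {1.0}, {4.0} };  // n = 2, k = 1
        System.out.println(value(uPrime, vPrime, 1, 1));  // 3*4 = 12.0
    }
}
```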
Suppose A and B are m×n matrices. The square error between them is defined by the so-called Frobenius norm, which extends the standard vector L_{2} or Euclidean norm to matrices:

squareError(A,B) = frobeniusNorm(A - B) = Σ_{i < m} Σ_{j < n} (A[i][j] - B[i][j])^{2}

The square error is sometimes referred to as the residual sum of squares (RSS), because A[i][j] - B[i][j] is the residual (difference between actual and predicted value).
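The square error definition translates directly into Java (a self-contained sketch; the class name is hypothetical):

```java
// Sketch: square error (squared Frobenius norm of A - B) between two
// equal-sized matrices, per the definition above.
public class SquareErrorSketch {
    static double squareError(double[][] a, double[][] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; ++i)
            for (int j = 0; j < a[i].length; ++j) {
                double resid = a[i][j] - b[i][j];  // residual at (i,j)
                sum += resid * resid;
            }
        return sum;
    }

    public static void main(String[] args) {
        double[][] a = { {1.0, 2.0}, {3.0, 4.0} };
        double[][] b = { {1.0, 0.0}, {3.0, 1.0} };
        System.out.println(squareError(a, b));  // 2^2 + 3^2 = 13.0
    }
}
```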
Now consider factoring a matrix A of dimension
m×n
into the product of two matrices
X * Y^{T}
, where X
is of dimension
m×k
and Y
is of dimension
n×k
. We may then measure the square error squareError(A, X * Y^{T})
to determine how well
the factorization matches A
. We know that if
we take U', V'
from the singular value
decomposition that:

squareError(A, U' * V'^{T}) = 0.0

Here
U'
and V'
have a number of columns
(called factors) equal to the rank of the matrix A
.
The singular value decomposition is such that the first q columns of U' and V' provide the best order q approximation of
A
of any X
and Y
of dimensions m×q
and n×q, respectively. In symbols:

U'_{q}, V'_{q} = argmin_{X is m×q, Y is n×q} squareError(A, X * Y^{T})

where
U'_{q}
is the restriction of U'
to its first q
columns.
Often errors are reported as means, where the mean square error (MSE) is defined by:
meanSquareError(A,B) = squareError(A,B)/(m×n)
To adjust to a linear scale, the root mean square error (RMSE), the square root of the mean square error, is often used:
rootMeanSquareError(A,B) = sqrt(meanSquareError(A,B))
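The MSE and RMSE definitions can be sketched on top of the square error (a self-contained illustration; the class name is hypothetical):

```java
// Sketch: mean square error and root mean square error derived from
// squareError, per the definitions above.
public class MseSketch {
    static double squareError(double[][] a, double[][] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; ++i)
            for (int j = 0; j < a[i].length; ++j)
                sum += (a[i][j] - b[i][j]) * (a[i][j] - b[i][j]);
        return sum;
    }

    static double meanSquareError(double[][] a, double[][] b) {
        return squareError(a, b) / (a.length * a[0].length);  // divide by m*n
    }

    static double rootMeanSquareError(double[][] a, double[][] b) {
        return Math.sqrt(meanSquareError(a, b));
    }

    public static void main(String[] args) {
        double[][] a = { {1.0, 2.0}, {3.0, 4.0} };
        double[][] b = { {0.0, 2.0}, {3.0, 2.0} };
        System.out.println(rootMeanSquareError(a, b));  // sqrt(5/4) ≈ 1.118
    }
}
```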
The class stores the approximate factorization U' * V'^{T}. Typically, the approximation is of lower order than the rank of the matrix.
Shrinkage is a general technique in least squares fitting that adds a penalty term proportional to the square of the size of the parameters. Thus the square error objective function is replaced with a regularized version:
regularizedError(A, U' * V'^{T}) = squareError(A, U' * V'^{T}) + parameterCost(U') + parameterCost(V')

where the parameter cost for a vector X of dimensionality q is the sum of its squared parameters:

parameterCost(X) = λ * Σ_{i < q} X[i]^{2} = λ * length(X)^{2}

Note that the hyperparameter
λ
scales the
parameter cost term in the overall error.
Setting the regularization parameter to zero sets the parameter costs to zero, resulting in an unregularized SVD computation.
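The parameter cost is straightforward to sketch in Java (the class name and the lambda value are illustrative):

```java
// Sketch: L2 parameter cost for one parameter vector, with the
// regularization hyperparameter lambda scaling the squared length.
public class ParameterCostSketch {
    static double parameterCost(double[] x, double lambda) {
        double sumSq = 0.0;
        for (double v : x)
            sumSq += v * v;  // squared length of the vector
        return lambda * sumSq;
    }

    public static void main(String[] args) {
        double[] x = { 3.0, 4.0 };                  // length 5, squared length 25
        System.out.println(parameterCost(x, 0.1));  // 0.1 * 25 = 2.5
        System.out.println(parameterCost(x, 0.0));  // lambda = 0: unregularized
    }
}
```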
In Bayesian terms, regularization is equivalent to a normal
prior on parameter values centered on zero with variance controlled
by λ
. The resulting minimizer of regularized
error is the maximum a posteriori (MAP) solution. The Bayesian
approach also allows covariances to be estimated (including simple
parameter variance estimates), but these are not implemented in
this class.
Algebraic SVD solvers require time on the order of O(m^{3} + n^{3}) for an m×n matrix (equal to O(max(m,n)^{3})). Arithmetic precision is
especially vexing at singular values near zero and with highly
correlated rows or columns in the input matrix.
For large partial matrices, we use a form of stochastic gradient descent which computes a single row and column singular vector (a single factor, that is) at a time. Each factor is estimated by iteratively visiting the data points and adjusting the unnormalized singular vectors making up the current factor. Each adjustment is a least-squares fitting step, in which we simultaneously adjust the left singular vectors given the right singular vectors and vice versa.
The least-squares adjustments take the following form. For each
value, we compute the current error (using the order k
approximation and the current order k+1
values) and
then move the vectors to reduce error. We use U'
and
V'
for the incremental values of the singular vectors
scaled by the singular values:

for (k = 0; k < maxOrder; ++k)
    for (i < m)
        U'[i][k] = random.nextGaussian() * initValue
    for (j < n)
        V'[j][k] = random.nextGaussian() * initValue
    for (epoch = 0; epoch < maxEpochs && not converged; ++epoch)
        for (i,j such that M[i][j] defined)
            error = M[i][j] - U'_{k}[i] * V'_{k}[j]   // * is vector dot product
            uTemp = U'[i][k]
            vTemp = V'[j][k]
            U'[i][k] += learningRate[epoch] * (error * vTemp - regularization * uTemp)
            V'[j][k] += learningRate[epoch] * (error * uTemp - regularization * vTemp)

where
initValue
, maxEpochs
,
learningRate
(see below), and
regularization
are algorithm hyperparameters. Note
that the initial values of the singular vectors are set randomly to
the result of a draw from a Gaussian (normal) distribution with
mean 0.0 and variance 1.0.
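The pseudocode above can be sketched as a standalone Java program. This is a deliberately simplified illustration (complete 2×2 rank-1 matrix, fixed illustrative hyperparameters, no convergence test or error caching), not the class's actual implementation; the class and method names are hypothetical:

```java
import java.util.Random;

// Stochastic gradient descent sketch of a rank-1 factorization of a
// complete 2x2 matrix, following the pseudocode above.
public class SgdSvdSketch {

    // dot product of the first (order+1) columns of rows i and j
    static double dot(double[][] u, double[][] v, int i, int j, int order) {
        double sum = 0.0;
        for (int k = 0; k <= order; ++k)
            sum += u[i][k] * v[j][k];
        return sum;
    }

    // trains U', V' and returns the residual at entry (1,1)
    static double trainAndGetResidual() {
        double[][] m = { {1.0, 2.0}, {2.0, 4.0} };  // rank-1 target matrix
        int maxOrder = 1, maxEpochs = 2000;
        double initValue = 0.1, regularization = 0.0;
        Random random = new Random(42);

        double[][] uPrime = new double[2][maxOrder];
        double[][] vPrime = new double[2][maxOrder];
        for (int k = 0; k < maxOrder; ++k) {
            // random Gaussian initialization, scaled by initValue
            for (int i = 0; i < 2; ++i)
                uPrime[i][k] = random.nextGaussian() * initValue;
            for (int j = 0; j < 2; ++j)
                vPrime[j][k] = random.nextGaussian() * initValue;
            for (int epoch = 0; epoch < maxEpochs; ++epoch) {
                double learningRate = 0.01 / (1.0 + epoch / 1000.0);  // annealed
                for (int i = 0; i < 2; ++i)
                    for (int j = 0; j < 2; ++j) {
                        double error = m[i][j] - dot(uPrime, vPrime, i, j, k);
                        double uTemp = uPrime[i][k];
                        double vTemp = vPrime[j][k];
                        uPrime[i][k] += learningRate * (error * vTemp - regularization * uTemp);
                        vPrime[j][k] += learningRate * (error * uTemp - regularization * vTemp);
                    }
            }
        }
        return m[1][1] - dot(uPrime, vPrime, 1, 1, maxOrder - 1);
    }

    public static void main(String[] args) {
        System.out.println(trainAndGetResidual());  // close to 0.0
    }
}
```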
Because we use the entire set of factors in the error computation, the current factor is guaranteed to have singular vectors orthogonal to the singular vectors already computed.
Note that in the actual implementation, the contribution to the error of the first k-1 factors is cached to reduce time in the inner loop. This requires a double-length floating point value for each defined entry in the matrix.
Like most linear learners, this algorithm merely moves the parameter vectors U'[i] and V'[j] in the direction that reduces the error. The gradient is the multivariate derivative of the objective function being minimized. Because our objective is squared error, the gradient is just its derivative, which is just (two times) the (linear) error itself. We roll the factor of 2 into the learning rate to derive the update in the algorithm pseudocode above.
The term (error * vTemp) is the component of the error gradient due to the current value of V'[j][k], and the term (regularization * uTemp) is the component of the gradient due to the size of the parameter U'[i][k]. The updates thus move the parameter vectors down the gradient of the regularized error.
The convergence conditions for a given factor require either hitting the maximum number of allowable epochs, or finding that the improvement from one epoch to the next is below some relative threshold:

regError^{(epoch)} = regError(M, U'_{k}^{(epoch)} * V'_{k}^{(epoch)T})
relativeImprovement^{(epoch+1)} = relativeDiff(regError^{(epoch+1)}, regError^{(epoch)})
relativeDiff(x,y) = abs(x - y) / (abs(x) + abs(y))

When the relative difference in square error is less than the hyperparameter threshold minImprovement, training of the current factor terminates and the algorithm moves on to the next factor k+1.
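The relative-difference convergence test is easy to sketch (the class name and the error values are illustrative):

```java
// Sketch: the relative-difference convergence test used to decide when
// to stop training a factor.
public class ConvergenceSketch {
    static double relativeDiff(double x, double y) {
        return Math.abs(x - y) / (Math.abs(x) + Math.abs(y));
    }

    public static void main(String[] args) {
        double prevError = 10.0, currError = 9.9;
        double minImprovement = 0.001;  // hyperparameter threshold
        // relativeDiff = 0.1/19.9 ≈ 0.005 > minImprovement, so keep training
        System.out.println(relativeDiff(currError, prevError) < minImprovement);
    }
}
```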
Note that a complete matrix is a degenerate kind of partial matrix. The gradient descent computation still works in this situation, but is not as efficient or as accurate as an algebraic SVD solver for small matrices.
The learning rate, learningRate[epoch], is lowered as the number of epochs increases. This lowering of the
learning rate has a thermodynamic interpretation in terms of free
energy, hence the name "annealing". Larger moves are
made in earlier epochs, then the temperature is gradually lowered
so that the learner may settle into a stable fixed point. The
function learningRate[epoch]
is called the annealing
schedule.
There are theoretical requirements on the annealing schedule that guarantee convergence (up to arithmetic precision and no upper bound on the number of epochs):
Σ_{epoch} learningRate[epoch] = infinity
Σ_{epoch} learningRate[epoch]^{2} < infinity

The schedule we use is the one commonly chosen to meet the above requirements:

learningRate[epoch] = initialLearningRate / (1 + epoch/annealingRate)

where initialLearningRate is the initial learning rate and annealingRate determines how quickly it shrinks.
The learning rate moves from its initial size
(initialLearningRate
) to one half (1/2
) of its
original size after annealingRate
epochs, and
moves from its initial size to one tenth (1/10
) of
its initial size after 9 * annealingRate
epochs,
and to one hundredth (1/100) of its initial size after 99 * annealingRate epochs.
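The schedule and the factor-of-two and factor-of-ten milestones can be checked with a small sketch (the class name and parameter values are illustrative):

```java
// Sketch: the annealing schedule for the learning rate described above.
public class AnnealingSketch {
    static double learningRate(double initialLearningRate, double annealingRate, int epoch) {
        return initialLearningRate / (1.0 + epoch / annealingRate);
    }

    public static void main(String[] args) {
        double init = 0.01, anneal = 10.0;
        System.out.println(learningRate(init, anneal, 0));   // initial rate
        System.out.println(learningRate(init, anneal, 10));  // half after annealingRate epochs
        System.out.println(learningRate(init, anneal, 90));  // tenth after 9 * annealingRate epochs
    }
}
```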
The previous discussion has introduced a daunting list of parameters required for gradient descent for singular value decomposition. Unfortunately, the stochastic gradient descent solver requires a fair amount of tuning to recover a low mean square error factorization. The optimal settings will also depend on the input matrix; for example, very sparse partial matrices are much easier to fit than dense matrices.
Determining the order of the decomposition is mostly a matter of determining how many orders are needed for the amount of smoothing required. Low order reconstructions are useful for most applications. One way to determine maximum order is using held out data for an application. Another is to look for a point where the singular values become (relatively) insignificant.
For low mean square error factorizations, many epochs may be necessary. Lowering the maximum number of epochs leads to what is known as early stopping. Sometimes, early stopping leads to more accurate heldout predictions, but it will always raise the factorization error (training data predictions). In general, the maximum number of epochs needs to be set empirically. Initially, try fewer epochs and gradually raise the number of epochs until the desired accuracy is achieved.
Initial values of the singular vectors are not particularly sensitive, because we are using multiplicative updates. A good rule of thumb for starting values is the inverse square root of the maximum order:
initVal ~ 1 / sqrt(maxOrder)
A good starting point for the learning rate is 0.01. The annealing parameter should be set so that the learning rate is cut by at least a factor of 10 in the final rounds. This calls for a value that's roughly one tenth of the maximum number of epochs. If the initial learning rate is set higher, then the annealing schedule should be more aggressive so that it spends a fair amount of time in the 0.001 to 0.0001 learning rate range.
There are thorough Wikipedia articles on singular value decomposition and gradient descent, although the SVD entry focuses entirely on complete (nonpartial) matrices:
Both of the standard machine learning textbooks have good theoretical explanations of least squares, regularization, gradient descent, and singular value decomposition, but not all together. The following is a particularly clear explanation of many of the issues involved in the context of neural networks:
Our partial SVD solver is based on C code from Timely Development (see license below). Timely based their code on Simon Funk's blog entry. Simon Funk's approach was itself based on his and Genevieve Gorrell's 2005 EACL paper.
The singular value decomposition code is rather loosely based on a C program developed by Timely Development. Here is the copyright notice for the original code:
// SVD Sample Code
//
// Copyright (C) 2007 Timely Development (www.timelydevelopment.com)
//
// Special thanks to Simon Funk and others from the Netflix Prize contest
// for providing pseudocode and tuning hints.
//
// Feel free to use this code as you wish as long as you include
// these notices and attribution.
//
// Also, if you have alternative types of algorithms for accomplishing
// the same goal and would like to contribute, please share them as well :)
//
// STANDARD DISCLAIMER:
//
// - THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY
// - OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT
// - LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR
// - FITNESS FOR A PARTICULAR PURPOSE.
Constructor and Description 

SvdMatrix(double[][] rowSingularVectors,
double[][] columnSingularVectors,
double[] singularValues)
Construct an SVD matrix using the specified singular vectors
and singular values.

SvdMatrix(double[][] rowVectors,
double[][] columnVectors,
int order)
Construct a factored matrix using the specified row and column
vectors at the specified order.

Modifier and Type  Method and Description 

double[][] 
leftSingularVectors()
Returns a matrix in which the left singular vectors make up the
columns.

int 
numColumns()
Returns the number of columns in this matrix.

int 
numRows()
Returns the number of rows in this matrix.

int 
order()
Returns the order of this factorization.

static SvdMatrix 
partialSvd(int[][] columnIds,
double[][] values,
int maxOrder,
double featureInit,
double initialLearningRate,
double annealingRate,
double regularization,
Reporter reporter,
double minImprovement,
int minEpochs,
int maxEpochs)
Return the singular value decomposition of the specified
partial matrix, using the specified search parameters.

double[][] 
rightSingularVectors()
Returns a matrix in which the right singular vectors make up
the columns.

double 
singularValue(int order)
Returns the singular value for the specified order.

double[] 
singularValues()
Returns the array of singular values for this decomposition.

static SvdMatrix 
svd(double[][] values,
int maxOrder,
double featureInit,
double initialLearningRate,
double annealingRate,
double regularization,
Reporter reporter,
double minImprovement,
int minEpochs,
int maxEpochs)
Returns the singular value decomposition of the specified
complete matrix of values.

double 
value(int row,
int column)
Returns the value of this matrix at the specified row and column.

Methods inherited from class AbstractMatrix: columnVector, equals, hashCode, rowVector, setValue
public SvdMatrix(double[][] rowVectors, double[][] columnVectors, int order)
See the class documentation above for more information on singular value decomposition.
Parameters:
rowVectors - Row vectors, indexed by row.
columnVectors - Column vectors, indexed by column.
order - Dimensionality of the row and column vectors.

public SvdMatrix(double[][] rowSingularVectors, double[][] columnSingularVectors, double[] singularValues)
See the class documentation above for more information on singular value decomposition.
Parameters:
rowSingularVectors - Row singular vectors, indexed by row.
columnSingularVectors - Column singular vectors, indexed by column.
singularValues - Array of singular values, in decreasing order.

public int numRows()
numRows
in interface Matrix
numRows
in class AbstractMatrix
public int numColumns()
numColumns
in interface Matrix
numColumns
in class AbstractMatrix
public int order()
public double value(int row, int column)
value
in interface Matrix
value
in class AbstractMatrix
Parameters:
row - Row index.
column - Column index.

public double[] singularValues()
public double singularValue(int order)
Parameters:
order - Dimension of the singular value.

public double[][] leftSingularVectors()
The returned matrix is of dimension m×k, if the original input was an m×n matrix and SVD was performed at order k.

public double[][] rightSingularVectors()
The returned matrix is of dimension n×k, if the original input was an m×n matrix and SVD was performed at order k.

public static SvdMatrix svd(double[][] values, int maxOrder, double featureInit, double initialLearningRate, double annealingRate, double regularization, Reporter reporter, double minImprovement, int minEpochs, int maxEpochs)
For a full description of the arguments and values, see
the method documentation for
partialSvd(int[][],double[][],int,double,double,double,double,Reporter,double,int,int)
and the class documentation above.
The two-dimensional array of values must be an
m × n
matrix. In particular, each row
must be of the same length. If this is not the case, an illegal
argument exception will be raised.
This is now a utility method that calls svd(double[][],int,double,double,double,double,Reporter,double,int,int)
with a reporter wrapping the specified print writer at the
debug level, or a silent print writer.
Parameters:
values - Array of values.
maxOrder - Maximum order of the decomposition.
featureInit - Initial value for singular vectors.
initialLearningRate - Incremental multiplier of error determining how fast learning occurs.
annealingRate - Rate at which annealing occurs; higher values provide more gradual annealing.
regularization - A regularization constant to damp learning.
reporter - Reporter to which to send progress and error reports during training, or null if no output is desired.
minImprovement - Minimum relative improvement in mean square error required to finish an epoch.
minEpochs - Minimum number of epochs for training.
maxEpochs - Maximum number of epochs for training.

Throws:
IllegalArgumentException - Under conditions listed in the method documentation above.

public static SvdMatrix partialSvd(int[][] columnIds, double[][] values, int maxOrder, double featureInit, double initialLearningRate, double annealingRate, double regularization, Reporter reporter, double minImprovement, int minEpochs, int maxEpochs)
The reporter parameter may be set to allow incremental progress reports during training. These report the RMSE per epoch.
See the class documentation above for a description of the algorithm.
There are a number of constraints on the input, any violation of which will raise an illegal argument exception. The conditions are:
Parameters:
columnIds - Identifiers of column index for given row and entry.
values - Values at row and column index for given entry.
maxOrder - Maximum order of the decomposition.
featureInit - Initial value for singular vectors.
initialLearningRate - Incremental multiplier of error determining how fast learning occurs.
annealingRate - Rate at which annealing occurs; higher values provide more gradual annealing.
regularization - A regularization constant to damp learning.
reporter - Reporter to which progress reports are written, or null if no reporting is required.
minImprovement - Minimum relative improvement in mean square error required to finish an epoch.
minEpochs - Minimum number of epochs for training.
maxEpochs - Maximum number of epochs for training.

Throws:
IllegalArgumentException - Under conditions listed in the method documentation above.