## The Problem

Computing the Singular Value Decomposition (SVD) is a fundamental linear algebra primitive, ubiquitous in machine learning, statistics, signal processing, and other fields. Formally, the SVD of a real m × n matrix A is a factorization of the form A = U Σ Vᵀ, where U is an m × m orthogonal matrix of left singular vectors, Σ is an m × n diagonal matrix of singular values, and Vᵀ is an n × n orthogonal matrix of right singular vectors.

The SVD provides a roadmap to computing a low-rank approximation for A– that is, finding a matrix Aᵣ with rank r < m, n such that ||Aᵣ – A|| is minimized. Given A = U Σ Vᵀ, we can approximate A by the following procedure:

- Sort the nonzero elements of Σ in decreasing order;
- Take the first r elements of Σ (say Σᵣ, an r × r diagonal matrix), the corresponding columns of U (Uᵣ, an m × r orthogonal matrix) and the corresponding columns of V (Vᵣ, an n × r orthogonal matrix);
- Take Aᵣ = Uᵣ Σᵣ Vᵣᵀ.

Visually, computing a low-rank matrix factorization corresponds to the following diagram:

We can show that Aᵣ satisfies various optimality properties (for example, in the spectral norm, the approximation error ||Aᵣ – A|| is the (r+1)-th diagonal element of Σ).

Thus, the SVD is a useful framework for dimensionality reduction tasks. Consider Principal Component Analysis (PCA), which, given a centered matrix of m observations and p covariates X as an m × p matrix, computes W as a p × l matrix and Λ a diagonal l × l matrix such that Xᵀ X ≈ W Λ Wᵀ. This maps the original observations from a p-dimensional space into an l-dimensional space, for l < p – thus performing *dimensionality reduction*. One algorithm for computing a reduced PCA is via a low-rank matrix approximation. Take X ≈ Uᵣ Σᵣ Vᵣᵀ, and then take Xᵀ X ≈ (Uᵣ Σᵣ Vᵣᵀ)ᵀ (Uᵣ Σᵣ Vᵣᵀ) = Vᵣ Σᵣ² Vᵣᵀ by the orthogonality of Uᵣ. Thus computing the SVD of A immediately yields the PCA of A.

These techniques have a huge number of applications in machine learning, data mining, bioinformatics, and even political science – from predicting hashtags from the content of a status by factorizing their co-occurrence matrix, to understanding political voting patterns by factorizing the matrix of legislators and bills.

Unfortunately, computing the SVD can be extremely time-consuming for the large-scale problems common at Facebook and elsewhere. Thus, we turn to randomized methods which offer significant speedups over classical methods.

## Randomized Matrix Approximation

There are various algorithms for computing the SVD – here, we’ll focus on some modern randomized matrix approximation techniques, developed in (amongst others) in Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, a 2009 paper by Nathan Halko, Per-Gunnar Martinsson and Joel A. Tropp.

The authors propose a general class of algorithms for various matrix approximation tasks. The algorithm proceeds in two major steps. The first is to use randomized techniques to compute an approximation to the range of A. That is, we seek to find Q with r orthonormal columns and A ≈ Q Qᵀ A. Assuming we have found such a Q, we can then compute an SVD of A as follows.

First, construct B = Qᵀ A. Now, as B is relatively small (recall Q has only a small number of columns), we can efficiently compute the SVD of B by standard methods, and thus B = S Σ Vᵀ for S, V orthogonal and Σ diagonal. Then as A ≈ Q Qᵀ A = Q (S Σ Vᵀ), we see that taking U = Q S, we have computed a low rank approximation A ≈ U Σ Vᵀ.

The trick in this algorithm comes from computing the approximate matrix Q efficiently. Intuitively, to estimate the range of a matrix A, we can just take a collection random vectors Ω₁, Ω₂, … and examine the subspace formed by the action of A on each of these random vectors. If we form an n × l Gaussian random matrix Ω, compute Y = A Ω, and take the QR decomposition of Y, Q R = Y, then Q is an m × l matrix whose columns are an orthonormal basis for the range of Y.

The paper proves strong bounds on the quality of our approximation by appealing to concentration of measure results for random matrices – see Terry Tao’s lecture notes on the topic for some useful background.

A few issues arise in theory and in practice when performing these algorithms with finite-precision arithmetic.

One method to improve the approximation is to perform a few power iterations on (A Aᵀ) to drive the spectrum of Y down. The key here is that if the singular values of A are Σ, the singular values of (A Aᵀ)ᵖ A are Σ^{2p+1}, so the spectrum (and our approximation error) decays exponentially with the number of iterations. In practice, one or two iterations is enough (whereas p=0 is generally not sufficient for noisy data).

To maintain numerical precision throughout these operations, we take intermediate QR and LU decompositions during the power iterations.

## Implementation and Considerations

We are releasing implementations of these algorithms in the public domain, based on the linear algebra primitives provided by NumPy and SciPy. The implementation is compatible with the framework provided by scikit-learn.

The implementation uses the system’s BLAS for the underlying linear algebra. Using Intel’s Math Kernel Library (MKL) for some key primitives (notably, computing the product of a sparse CSR matrix with a dense matrix) provided a substantial speedup on some problem instances.

## Benchmarks

We consider a few cases. We consider

- Large dense matrices, up to 100,000 × 200,000 in size,
- Large sparse matrices, up to 10⁷ × 10⁷ in size with O(10⁹) non-zero elements,
- A comparison to Apache Spark’s distributed SVD implementation.

Throughout, we compute a rank 10 approximation, and we run two iterations of the power method. For the dense case, we compare against `numpy.linalg.svd`

, and the sparse case, we compare against `scipy.sparse.linalg.svds`

.

### Timings

10⁶ × 10⁵ matrix with 10⁷ non-zeros, 1 second vs 100 seconds with ARPACK,

10⁶ × 10⁵ matrix with 10⁸ non-zeros, 5 seconds vs 63 minutes with ARPACK,

10⁵ × 10⁵ dense matrix in 120 seconds,

9.4 * 10⁷ × 4 * 10³ matrix with 1.6 * 10⁹ non-zeros, 60 seconds on single machine server vs 50 seconds on a 68 machine cluster with Spark.

## Open Source and Future Directions

We will soon release the implementations for these algorithms described. We are additionally exploring distributed variants of these algorithms, which should allow efficient factorizations of truly huge matrices – for example, the adjacency matrix of Facebook users to Facebook pages induced by likes, with size O(10⁹) × O(10⁸).