5 The singular value decomposition
In this section we will introduce a new matrix decomposition called the Singular Value Decomposition (SVD). The SVD is generally applicable to all matrices of any dimension or rank, and has many applications in machine learning and beyond. We will begin by defining the decomposition and then discuss some of it uses.
The intended learning outcomes of these notes are that students should be able to:
Define the full, reduced, and truncated SVDs with correct matrix definitions; know the properties of each of the matrices in the decomposition; understand their storage and time complexities
Understand the Eckart-Young Theorem and its application in scientific machine learning for data compression and principal component analysis.
Define the Moore-Penrose pseudoinverse and describe how it is used to solve least-squares problems.
5.1 SVD: the basics
The SVD decomposes the matrix \(X\in\mathbb{R}^{N\times n}\) into the product of three matrices, \(U\in\mathbb{R}^{N\times N}\), \(\Sigma\in\mathbb{R}^{N \times n}\), and \(V^\top \in\mathbb{R}^{n\times n}\), so that
\[ X = U\Sigma V^\top \qquad(5.1)\]
The matrices \(U\) and \(V\) are orthogonal (unitary), so that \(U^\top U = I_N\) and \(V^\top V = I_n\), and the matrix \(\Sigma\) is diagonal. The diagonal entries of \(\Sigma\) are denoted \(\sigma_1,\sigma_2,\ldots,\) and are called the singular values of \(X\), and are always non-negative and in non-increasing order, that is: \(\sigma_1\geq \sigma_2\geq \cdots \geq 0\). \(\Sigma\) is called the singular value matrix, and the number of nonzero singular values is equal to the rank of the matrix \(X\). The matrix \(U\) is called the left singular vector matrix, and its columns are called the left singular vectors. The matrix \(V\) is called the right singular vector matrix, and its columns are called the right singular vectors.
To compute the SVD in python, you would use numpy.linalg.svd or scipy.linalg.svd. Like other matrix decompositions, the time complexity to compute the SVD is third-order, specifically \(\mathcal{O}(\min(Nn^2, N^2n))\). (Note that when \(N>n\) this is the same time complexity as the QR decomposition for tall matrices.) The storage complexity is \(\mathcal{O}(\max(N,n)^2)\).
Above, we have described the full SVD. Another version of the SVD is the reduced SVD or the thin SVD1. Let \(k = \min(N,n)\). The reduced SVD decomposes \(X\) into the product \(X=U\Sigma V^\top\) where \(U\in\mathbb{R}^{N\times k}\), \(\Sigma\in\mathbb{R}^{k\times k}\), and \(V\in\mathbb{R}^{n\times k}\). The matrices \(U\) and \(V\) are still unitary, so that \(U^\top U = I_k\) and \(V^\top V = I_k\) and \(\Sigma\) is square and diagonal. When one of the dimensions of \(X\) is much larger than the other, this can lead to signficant cost savings in both time and memory. Note that the reduced SVD is subtly different from the truncated SVD which we describe next.
5.2 The truncated SVD
Let \(k=\min(N,n)\) as before. Then, we can rewrite Equation 5.1 as follows:
\[ X = \sum_{i=1}^k u_i \sigma_i v_i^\top = \sum_{i=1}^k \sigma_i (u_i v_i^\top) \qquad(5.2)\]
where \(u_1,\ldots,u_N\in\mathbb{R}^N\) are the left singular vectors (columns of \(U\)) and \(v_1,\ldots,v_n\in\mathbb{R}^n\) are the right singular vectors (columns of \(V\)), and \(\sigma_i\) are again the diagonal entries in \(\Sigma\). The matrices \(u_1v_1^\top,\ldots,u_kv_k^\top\in\mathbb{R}^{N\times n}\) are rank-1 matrices formed by taking the outer product of \(u_i\) with \(v_i\).2 Recall that the number of nonzero singular values is equal to the matrix rank and let \(r = \textsf{rank}(X)\). Thus, we can also write
\[ X = \sum_{i=1}^{r} \sigma_i (u_i v_i^\top) \qquad(5.3)\]
We say a matrix is low rank when its rank is less than its maximum possible rank \(k\). A low-rank matrix can be stored more efficiently than a full rank matrix: note that \(X\in\mathbb{R}^{N\times n}\) generally has \(Nn\) elements that must be stored, but if the the matrix has rank \(r\), then from Equation 5.3 you can see that it would be more efficient to store \(r\) sets of \((\sigma_i, u_i, v_i)\), for a total of \(r(N+n+1)\) values. (In fact if storage is your only goal, you can just store the product \(\sigma_iu_i\) and ditch that last 1 in the parentheses.) Low-rank matrices are also more efficient to compute with: while a standard matrix vector multiplication \(X\beta\) where \(\beta\in\mathbb{R}^n\) costs \(\mathcal{O}(Nn)\) FLOPs, we can do a matrix vector multiplication using the expression in Equation 5.3 in just \(\mathcal{O}(r(N+n))\) FLOPs (see Exercises).
Because low-rank matrices are more efficient to work with and to store, it is often useful to approximate a full-rank matrix with a low-rank matrix (or approximate a low-rank matrix with an even lower-rank matrix). By far the most common way to do this is using the truncated SVD, which just truncates the sum Equation 5.2 at some \(\ell < k\), as follows:
\[ X_\ell = \sum_{i=1}^\ell \sigma_i(u_iv_i^\top) = U_\ell \Sigma_\ell V_\ell^\top \]
where \(U_\ell\in\mathbb{R}^{N\times \ell}\) contains the left-most \(\ell\) columns of \(U\), and \(V_\ell\in\mathbb{R}^{n\times \ell}\) contains the left-most \(\ell\) columns of \(V\), and \(\Sigma_\ell\) contains the upper-left \(\ell\)-by-\(\ell\) block of \(\Sigma\) and thus has \(\ell\) diagonal entries. The truncated SVD \(X_\ell\) is the best rank-\(\ell\) approximation of the matrix \(X\) — this is a result called the Eckart-Young Theorem:
Theorem (Eckart-Young). Let \(X\in\mathbb{R}^{N\times n}\) and let \(\mathcal{M}_\ell\) denote the class of matrices in \(\mathbb{R}^{N\times n}\) with rank equal to or less than \(\ell\). Then,
\[ X_\ell = \arg\min_{M_\ell \in\mathcal{M}_\ell} ||X - M_\ell||_F^2 \qquad(5.4)\]
That is, the truncated SVD is the rank-\(\ell\) matrix that is closest to the original matrix \(X\). Above, we have stated the theorem in the Frobenius norm3, but it also holds for the matrix \(2\)-norm. This optimality result is the reason the SVD is one of the most powerful and broadly used matrix decompositions there is. The minimum of Equation 5.4 gives the error incurred by the approximation, and is given by what’s called the “tail sum” of the square of the singular values, that is,
\[ \min_{M_\ell \in\mathcal{M}_\ell} ||X - M_\ell||_F^2 = \|X_\ell - X||_F^2 = \sum_{i=\ell+1}^k \sigma_i^2 \]
5.3 SVD applications in scientific machine learning
5.3.1 Data compression
One application of the SVD is for data compression. If you have a very large data set \(Z\in\mathbb{R}^{N\times d}\) with either/both \(N,d\), very large, you can quickly have too much data to store easily on whatever machine is available to you. This is a common challenge in scientific simulation of partial differential equations (e.g., in computational fluid dynamics or finite element analysis of structures) where \(d\) generally scales with the size of the computational mesh, which can easily have hundreds of millions of cells. Instead of storing the full data matrix, we can store the singular vectors and values needed to represent the truncated SVD. A similar idea is used to compress images for efficient storage.
5.3.2 PCA features
Another way that the SVD gets used is to define PCA features. PCA stands for “principal component analysis”; this language comes from a statistics perspective on the SVD which we’ll elaborate on later. PCA features are defined as follows: let \(Z\in\mathbb{R}^{d\times N}\) denote the input matrix4, and let \(Z = \Phi\Xi\Psi^\top\) be the SVD of the input matrix, so that \(\phi_1,\phi_2,\ldots\) denote the left singular vectors of \(\Phi\). We can define \(n<d\) PCA features as follows: for \(i\leq n\), let \(x_i(z) = \phi_i^\top z\). We can write the entire feature map for all \(n\) features as follows: \(x(z) = \Phi_n^\top z\), where \(\Phi_n\in\mathbb{R}^{d\times n}\) consists of the leading left singular vectors of \(\Phi\).
How can we interpret PCA features? Note that the columns of \(\Phi\) form an orthonormal basis for \(\mathbb{R}^d\), and the ordering of these columns has the special property that they are in order of decreasing “importance” for representing the data in \(Z\) (this is one way of interpreting the Eckart-Young theorem). Recall that \(\Phi\Phi^\top=I\), so \(\Phi\Phi^\top z = z\); let \(\hat z = \Phi^\top z\). Then you see that \(z = \Phi\hat z\), so \(z\) can be written as a linear combination of the columns of \(\Phi\), with weights given by \(\hat z\). In essence, if we transform the input \(z\) to the coordinate system defined by the basis given in \(\Phi\), the coefficients of that linear combination are given in \(\hat z\), so the \(i\)th element of \(\hat z\), which is the \(i\)th PCA feature, \(x_i(z)\), tells us the size of the component of \(z\) in the direction given by \(\phi_i\). By truncating to the leading \(n\) PCA components, we keep the components in the \(n\) largest (on average over all the data) directions, and drop the trailing smaller components.
5.3.3 The Moore-Penrose Pseudoinverse
Last but certainly not least I finally can tell you more about how we solve underdetermined least squares problems. The Moore-Penrose pseudoinverse of a matrix \(X\) is generally denoted \(X^\dagger\) (although some sources use \(X^+\)), and is given by
\[ X^\dagger = V \Sigma^{\dagger} U^\top, \]
where \(X = U\Sigma V^\top\) is the SVD of \(X\), and the pseudoinverse of \(\Sigma\in\mathbb{R}^{N\times n}\) is denoted \(\Sigma^{\dagger}\in\mathbb{R}^{n\times N}\) and is given by inverting all the elements on the diagonal while leaving all other non-diagonal elements as zeroes.
Thus, in solving the normal equations, \(X^\top X \beta = X^\top Y\), if \(X^\top X\) is not invertible, we can take the pseudoinverse to arrive at the solution
\[ \beta = (X^\top X)^\dagger X^\top Y \qquad(5.5)\]
where \(\beta\) given this way is the norm-minimizing solution.
In fact, Equation 5.5 is equivalent to just writing \(\beta=X^\dagger Y\) (see Exercises).
5.4 Exercises
Show that Equation 5.1 and Equation 5.2 are equivalent (nothing fancy, just write it out using what you know about the structure of \(U\), \(\Sigma\), and \(V\)).
Let \(X\in\mathbb{R}^{N\times n}\) have rank \(r\) and let \(\beta\in\mathbb{R}^n\). Show that the matrix multiplication \(X\beta\) can be computed using the expression Equation 5.3 in \(\mathcal{O}(r(N+n))\) FLOPs.
Show that \(\beta = X^\dagger Y\) is equivalent to Equation 5.5 using properties of the SVD.
As with many of the terms we have encountered and will encounter in this course, the term “SVD” is used in the wild to mean either the full SVD or the reduced SVD – often the distinction is not super important, but when it is you need to be able to discern from context what is meant.↩︎
Recall that the rank of a matrix describes the number of linearly independent rows/columns of the matrix; to see that \(ab^\top\) has rank 1 when \(a,b\) are vectors, note that the \(j\)th row of \(ab^\top\) is given by \(a_jb^\top\), where \(a_j\in\mathbb{R}\) is the \(j\)th element of \(a\), so that \(ab^\top\) is just the same row multiplied by different constants.↩︎
the Frobenius norm of a matrix \(A\), denoted \(\|A\|_F\), is the natural generalization of the Euclidean norm from vectors to matrices – it is given by the square root of the sum of squares of all matrix elements.↩︎
Carefully note the dimensions here: in this definition of the input matrix, each column of \(Z\) corresponds to one datum, whereas in our feature matrix definition each row of \(X\) corresponds to one datum. What are the dimensions of \(\Phi\), \(\Xi\), and \(\Psi\)? ↩︎