Lecture 11: Singular Value Decompositions

21 Sep 2018

Today we talked about Singular Value Decompositions. The idea is simple: given any $n\times D$ matrix $A$, we can write it as $A = U \Sigma V^\intercal$, where $U$ and $V$ have orthonormal columns, and $\Sigma$ is a positive diagonal matrix of size $r$, where $r$ is the rank of the matrix, and $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r > 0$ are the singular values on the diagonal. Equivalently, we can write it as $A = \sum_{i = 1}^r \sigma_i u_i v_i^\intercal$.
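To see this concretely, here is a small numpy sketch (my own illustration, with arbitrary matrix sizes, not something from the lecture) that computes the thin SVD of a random matrix and checks both the rank-one expansion and the orthonormality of the columns of $U$:

```python
import numpy as np

# A random tall matrix with n rows and D columns; the sizes are arbitrary.
n, D = 8, 5
A = np.random.randn(n, D)

# Thin SVD: U is n x r, s holds the singular values, Vt is r x D.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reconstruct A as the sum of rank-one pieces sigma_i * u_i * v_i^T.
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
print(np.allclose(A, A_rebuilt))                    # True (up to floating point)

# The columns of U (and likewise of V) are orthonormal.
print(np.allclose(U.T @ U, np.eye(U.shape[1])))     # True
```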

The Eckart-Young theorem then says that for any $k$, if we take the first $k$ terms in this sum, the resulting matrix $A_k := \sum_{i = 1}^k \sigma_i u_i v_i^\intercal$ is the best rank-$k$ matrix approximation for $A$, where the measure of goodness is the error $\| A - A_k \|$, and the matrix norm is the Frobenius norm, the spectral norm, or indeed any of the broad class of matrix norms called unitarily invariant norms.
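As a quick numerical sanity check (again my own example, not part of the lecture): truncating after $k$ terms gives Frobenius error $\sqrt{\sum_{i > k} \sigma_i^2}$ and spectral error $\sigma_{k+1}$, which is exactly what the Eckart-Young theorem says is the best any rank-$k$ matrix can achieve.

```python
import numpy as np

np.random.seed(0)
A = np.random.randn(8, 5)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Best rank-k approximation: keep the k largest singular values/vectors.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius error equals sqrt(sigma_{k+1}^2 + ... + sigma_r^2).
print(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:] ** 2)))
# Spectral error equals sigma_{k+1}.
print(np.linalg.norm(A - A_k, 2), s[k])
```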

Some Matrix Norms

BTW, if you have not seen matrix norms before, here’s a short list of the commonly used ones. Given a matrix $M$, let $\mathbf{\sigma}$ denote the vector $(\sigma_1, \sigma_2, \ldots, \sigma_r)$ consisting of all the non-zero singular values.

  1. the Frobenius norm is $\|M\|_F := \sqrt{\sum_{ij} M_{ij}^2}. $ This is also $\|\mathbf{\sigma}\|_2$, the Euclidean length of the vector composed of the singular values. (Please try to prove this yourself!)

  2. the spectral norm is denoted $\|M\|$ (without any subscripts), and is the maximum singular value of $M$. Hence this is equal to $\|\mathbf{\sigma}\|_\infty$.

  3. The trace norm is given by $\|M\|_{tr} := \sum_\ell | \sigma_\ell |$, and hence equals $\| \mathbf{\sigma} \|_1$.

  4. In general, you can define for any $p \geq 1$ the $p$-Schatten norm of $M$, which is given by $\|\mathbf{\sigma}\|_p$.

  5. Given any vector norm $\|\cdot \|$, we can define the induced or operator norm $\| A \| := \sup_{x \neq 0} \frac{\|Ax\|}{\|x\|}$. Often one tends to use the term operator norm just for the case when we use the Euclidean vector norm, in which case it just equals the spectral norm.

Much more on Wikipedia.
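If you want to see these identities numerically, here is a tiny numpy check of mine (with an arbitrary random matrix) that the Frobenius, spectral, and trace norms of $M$ equal the $\ell_2$, $\ell_\infty$, and $\ell_1$ norms of its singular-value vector:

```python
import numpy as np

np.random.seed(1)
M = np.random.randn(6, 4)
sigma = np.linalg.svd(M, compute_uv=False)     # vector of singular values

print(np.linalg.norm(M, 'fro'), np.linalg.norm(sigma, 2))       # Frobenius = l2 of sigma
print(np.linalg.norm(M, 2),     np.linalg.norm(sigma, np.inf))  # spectral  = l_inf of sigma
print(np.linalg.norm(M, 'nuc'), np.linalg.norm(sigma, 1))       # trace     = l1 of sigma
```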

The Final Part of the Least-Squares Regression Proof

Recall that we were talking about solving the regression problem $\min_x \| Ax - b\|$. When I stopped at the end of the lecture, we'd just pointed out that the projection of $b$ onto the column span of $A$ was given by $UU^\intercal b$. However, I'd stopped too soon: this gives us the value of $Ax$, but how do we find the right $x$? If $A$ were invertible, we could just take the inverse. (In fact, then $x = A^{-1} b$ would have been a solution to the regression problem in the first place.) But we'd said that $A$ had more rows than columns – the system was over-constrained – so we cannot just take inverses.

Instead, we can use the SVD a little more. Given the SVD $A = U \Sigma V^\intercal$, define $A^{+} := V \Sigma^{-1} U^\intercal$. We're using the fact that $\Sigma$ does not contain any of the zero singular values, and hence is invertible, with
$$ \Sigma^{-1} = \mathrm{diag}\big(1/\sigma_1, 1/\sigma_2, \ldots, 1/\sigma_r\big). $$

Moreover, we get the invaluable property that for any $c$ in the image of $A$ (i.e., such that there exists $y$ with $Ay = c$), we get
$$ A A^{+} c = U \Sigma V^\intercal \, V \Sigma^{-1} U^\intercal c = U U^\intercal c = c. $$

Hence $A^+$ acts like an inverse for vectors in the image of $A$. It is called the (Moore-Penrose) pseudoinverse of the non-invertible matrix $A$. Now recall that we want to find $x$ such that $Ax = UU^\intercal b$. It therefore suffices to set
$$ x := A^{+} (U U^\intercal b) = V \Sigma^{-1} U^\intercal U U^\intercal b = V \Sigma^{-1} U^\intercal b = A^{+} b. $$

Hence, the solution to the regression problem is given by $A^+ b$, a nice and compact solution!
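Here is a short numpy sketch (my own illustration, with made-up dimensions) of this recipe: build $A^+$ from the thin SVD, solve the regression problem as $A^+ b$, and compare against numpy's built-in least-squares and pseudoinverse routines.

```python
import numpy as np

np.random.seed(2)
n, D = 20, 5                      # over-constrained: more rows than columns
A = np.random.randn(n, D)
b = np.random.randn(n)

# Build the pseudoinverse A^+ = V Sigma^{-1} U^T from the thin SVD.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

x = A_pinv @ b                    # our least-squares solution A^+ b

# Compare with numpy's least-squares routine and its pseudoinverse.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x, x_ref))                    # True
print(np.allclose(A_pinv, np.linalg.pinv(A)))   # True
```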

Computing Eigenvalues and the Power Method

I realize I went over the power method a little fast, so here’s a little more detail. The problem we want to solve is to compute the maximum eigenvalue of a symmetric $n \times n$ matrix $M$.

In theory you can use fast matrix multiplication to solve this in $n^{\omega}$ time; see this paper of Demmel, Dumitriu, and Holtz. It gives you all the eigenvectors and eigenvalues. However, in practice you would probably use some other method. Here's the power method, also called power iteration. Pick a starting unit vector $r_0$, typically at random. At the $i^{th}$ iteration, given some $r_i$, compute the matrix-vector product $Mr_i$, rescale to get a unit vector $r_{i+1} := Mr_i/\|Mr_i\|_2$, and repeat.
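In code, the iteration is only a few lines. Here is a minimal sketch (my own; the function name, iteration count, and the random positive semidefinite test matrix are arbitrary choices, not anything from the lecture):

```python
import numpy as np

def power_iteration(M, num_iters=200, seed=0):
    """Approximate the top eigenvector/eigenvalue of a symmetric matrix M."""
    rng = np.random.default_rng(seed)
    r = rng.standard_normal(M.shape[0])
    r /= np.linalg.norm(r)                 # random starting unit vector r_0
    for _ in range(num_iters):
        r = M @ r                          # hit the vector with the matrix
        r /= np.linalg.norm(r)             # rescale back to unit length
    return r, r @ M @ r                    # Rayleigh quotient estimates lambda_1

# A random positive semidefinite test matrix, so the top eigenvalue
# is also the one of largest magnitude.
B = np.random.default_rng(3).standard_normal((50, 50))
M = B @ B.T

v, lam = power_iteration(M)
print(lam, np.max(np.linalg.eigvalsh(M)))  # the two numbers should roughly agree
```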

For the moment, let’s ignore the rescaling, which is mainly there to keep the length under control. Then we just take the vector $r_0$ and keep hitting it with the matrix (thereby powering it):
$$ r_0 \;\mapsto\; M r_0 \;\mapsto\; M^2 r_0 \;\mapsto\; \cdots \;\mapsto\; M^k r_0. $$

Say the (orthonormal) eigenvectors of $M$ are $v_1, v_2, \ldots, v_n$, with corresponding eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$. Let’s write $r_0$ in terms of this eigenbasis: suppose $r_0 = \sum_i \alpha_i v_i$. Then recalling that $M^k v_i = \lambda_i^k v_i$, we get
$$ M^k r_0 = \sum_{i=1}^n \alpha_i \lambda_i^k v_i. $$

Now suppose, for simplicity, $\lambda_1 > \lambda_2$. Moreover, since we chose $r_0$ at random, it should have about the same projection on all the basis vectors, so imagine that $\alpha_i \approx 1/\sqrt{n}$ for all $i$. Then the coefficient of the first term is approximately $\lambda_1^k/\sqrt{n}$, whereas the sum of the rest of the coefficients is at most $(n-1) \lambda_2^k/\sqrt{n} \leq \sqrt{n} \lambda_2^k$. So when $k = t \log_{(\lambda_1/\lambda_2)} n$, the coefficient of the first term is $n^{t-1}$ times the sum of the rest of the coefficients. Hence, after that many iterations, the vector $M^k r_0$ is pretty close to the top eigenvector! Observe that we used the fact that there was a gap between the top two eigenvalues; otherwise we could not prove such a statement.
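To see the role of the gap numerically, here is a little experiment of mine (with a made-up spectrum, not from the lecture): build a symmetric matrix whose top two eigenvalues are $2$ and $1.5$, run the iteration, and watch the component of $r_k$ orthogonal to the top eigenvector shrink roughly like $(\lambda_2/\lambda_1)^k$.

```python
import numpy as np

np.random.seed(4)
n = 100
# A symmetric matrix with known eigenvalues 2 > 1.5 >= 1 = ... = 1.
Q, _ = np.linalg.qr(np.random.randn(n, n))          # random orthonormal eigenbasis
eigvals = np.array([2.0, 1.5] + [1.0] * (n - 2))
M = Q @ np.diag(eigvals) @ Q.T
v1 = Q[:, 0]                                        # the top eigenvector

r = np.random.randn(n)
r /= np.linalg.norm(r)
for k in range(1, 61):
    r = M @ r
    r /= np.linalg.norm(r)
    if k % 10 == 0:
        # Component of r_k orthogonal to v1, next to the predicted rate.
        perp = np.linalg.norm(r - (v1 @ r) * v1)
        print(k, perp, (eigvals[1] / eigvals[0]) ** k)
```

The two printed numbers will not match exactly (the constants depend on the random start), but they decay at the same geometric rate, governed by the ratio $\lambda_2/\lambda_1$.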

Here’s a blog post by Luca Trevisan about the power method, with some more details about how to go on to the second-largest eigenvector, or the smallest ones. (Although those extensions may suffer from some numerical stability issues.) One can get out of these situations by using other methods which are better from the numerical stability perspective: these include Arnoldi iteration and the Lanczos algorithm, which you can probably learn about in spectral graph theory courses (by, e.g., Gary, Dan Spielman, or Luca). Or check out a book on numerical analysis; e.g., I like this one by Trefethen and Bau. Or talk to a specialist in the area: Jason Howell is a professor from the Math department (who often comes to our lectures) and a bona fide specialist in numerical methods!