Lecture 11 (contd.): Singular Value Decompositions

21 Sep 2018

Here’s the example of dimension reduction using SVD that I was referring to in lecture. The test set of the MNIST database of hand-written digits (e.g., from here) has about 1000 images for each of the 10 digits. Each image is a $28 \times 28$ bitmap, represented as a $28^2 = 784$-dimensional vector, each coordinate being a single bit indicating whether that pixel is on or off. Here is a plot of the original images:

MNIST Handwritten digits

Now put each image as a row in the matrix $A$, so that we have about $n = 10000$ rows and $D = 784$ columns. Compute $A_k$, the best rank-$k$ approximation to the data set $A$. Observe that $A_k$ is a $10000 \times 784$ matrix of reals, some of which may even be negative. To make sense of these, we round the reals back to bits to get another bitmap, and then plot it.
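In Matlab this whole pipeline is only a few lines; here is a minimal sketch (my actual code is linked at the end, and the variable names here are just for illustration):

```matlab
% A is the n-by-D matrix whose rows are the bitmap images.
[U, S, V] = svd(A, 'econ');               % thin SVD: A = U*S*V'
k  = 10;
Ak = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';   % best rank-k approximation
Bk = min(max(round(Ak), 0), 1);           % round back to bits before plotting
```

Here’s what we get for $k=10$: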

MNIST Handwritten digits

And $k=20$:

MNIST Handwritten digits

And $k=40$:

MNIST Handwritten digits

Things get pretty good by the time we’re up to rank 40: almost all the digits are readable. And why is that a win? The matrix $A$ is represented using $n \times D$ bits, whereas $A_k = \sum_{i = 1}^k \sigma_i u_i v_i^\intercal$ can be represented using just the $k$ triples $(\sigma_i, u_i, v_i)$, which may require much less space to store. So we could use this as a data compression technique. Is it surprising that the best rank-$20$ or rank-$40$ approximation actually gives you good images? After all, we saw the Eckart-Young theorem saying we had the smallest error possible in a whole slew of norms. But the fact that this small error in those norms translates to small error in a visual sense may seem a bit surprising. Or maybe not. I am not sure. :)
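To put rough numbers on the storage claim, here is a continuation of the sketch above (reusing the same assumed variables) that rebuilds $A_k$ from the $k$ stored triples alone:

```matlab
% Rebuild A_k from the k stored triples (sigma_i, u_i, v_i) alone.
sig = diag(S);                             % singular values from the SVD above
Ak  = zeros(size(A));
for i = 1:k
    Ak = Ak + sig(i) * U(:,i) * V(:,i)';   % sigma_i * u_i * v_i'
end
% Stored: k*(1 + n + D) numbers; with n = 10000, D = 784, and k = 40,
% that is 40 * 10785 = 431,400 numbers, versus all n*D = 7,840,000 bits of A.
```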

You can do the same thing with faces. Here are the original images:

Olivetti Faces

Here is $k=1$:

Olivetti Faces

And $k=4$:

Olivetti Faces

And $k=20$:

Olivetti Faces

And $k=40$:

Olivetti Faces

As I mentioned above, this stuff is widely used. For a quick introduction, see the chapter on dimension reduction in the Mining of Massive Datasets book by Leskovec, Rajaraman, and Ullman. Or this lecture by Jure Leskovec, which also talks about how SVDs were used for the Netflix challenge (to predict which people may like which movies). The hope is that the $r$ dimensions there correspond to movie types or topics (like action or comedy or drama). So the $U$ matrix gives the weight that each person assigns to each topic, and the $V$ matrix gives the extent to which each movie corresponds to each topic. Jure’s lecture talks more about this, and I am sure your machine learning courses do too.
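Here is a toy sketch of that latent-factor picture, with a made-up ratings matrix (not the real Netflix data, and plain SVD rather than whatever the winning entries actually used):

```matlab
% Rows are people, columns are movies, entries are (made-up) ratings.
R = [5 5 1 0;
     4 5 0 1;
     1 0 5 4;
     0 1 4 5];
r = 2;                         % hoped-for number of "topics"
[U, S, V] = svd(R);
P = U(:,1:r) * S(1:r,1:r);     % row i: person i's weight on each topic
M = V(:,1:r);                  % row j: movie j's weight on each topic
Rhat = P * M';                 % rank-r predictions for every (person, movie) pair
```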

Finally, if you’re interested in playing with it, here’s my Matlab code. The data can be obtained from Sam Roweis’ page linked above.