So, to review, in order to run a search engine with the vector space model, we first have to convert the entire world wide web into a term-by-document matrix. This matrix would be huge! It would have to be at least 300,000 x 3,000,000,000 according to the most conservative estimates. Processing a query for this kind of vector space would be ridiculous. We would have to find the angle measures between the query vector and billions of other 300,000 dimensional vectors for every query! That would take way too long for practical purposes. Fortunately, there are some tricks in linear algebra that we can use to cut a few corners.
You may remember from your linear algebra class that Eigen Decomposition can make many calculations much quicker and easier. For example, suppose we have a matrix A, and we want to find A³. Instead of computing A*A*A, which could require a lot of calculation (especially if A is large), we can use the eigen decomposition
and λ1...λn are the eigenvalues of A. Now
which is much easier to calculate. Since this method of decomposition can only be used on square matrices, we have no use for it. BUT we can use Singular Value Decomposition, which works in much the same way.
Singular Value Decomposition has two wonderful properties that make it very helpful and important for our work. First, it exists for any and all matrices: large, small, square, rectangular, singular, non-singular, sparse and dense. So, no matter what kind of term by document matrix the internet yields, we know it has a singular value decomposition.
The next helpful property is the optimality property. This property basically tells us that, if we would like to find a rank-k approximation of our matrix, the best one can be found using singular value decomposition. I will explain this property in more detail a little bit later on in your training.
Definition: Given that an mxn matrix A has rank r, A can be factored where U and V are orthogonal matrices containing the singular vectors, and S is a matrix of the form where D is a diagonal matrix containing the singular values of A.
So, basically, SVD breaks A into three components:
It would probably help your understanding of SVD if we actually look at some examples. With the help of MATLAB, we can easily calculate and compare the SVDs of a few matrices. If you are not familiar with MATLAB, you may want to read through this tutorial.
In the following and later activities we will be using some of the m-files from the ATLAST project. To download these, go to the ATLAST page and download the m-files for the second edition (pay attention to where the computer puts them).
Now open MATLAB. We need to have the ATLAST files open in the current directory.
The location of the folder will be different depending on what kind of computer you are working with. Find it by navigating through the directory and selecting the atlast65 folder as shown above.
Create a random 5 by 6 matrix of rank 4 and entries between 2 and -2 with the command:
A = randintr(5,6,2,4)
(Learn more about this command; enter
Find the singular value decomposition of your matrix A by typing
[U,S,V] = svd(A)
This command calculates the svd and allows you to see each component of the decomposition.
How many singular values does A have?
Try the same thing with
A = randintr(5,6,2,5). Now how many singular values does A have? Can you see a relationship between the rank of a matrix and how many singular values it has?
It just so happens that, if r is the rank of A, then r is also the number of non-zero entries of S (in other words, r is the number of singular values of A). Berry & Brown
Now that we know what SVD, we can look more closely at how it can help our search engine by making the calculations a bit less arduous.
Or Jump to a page!letter 1 2 3 4 5 6
This material is based upon work supported by the National Science Foundation under Grant No. 0546622. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.