Latent Semantic Analysis (LSA) is a mathematical method that tries to bring out latent relationships within a collection of documents. Rather than looking at each document in isolation, it looks at all the documents as a whole, and the terms within them, to identify relationships.
An example of LSA: using a search engine, search for “sand”. Documents are returned which do not contain the search term “sand” but do contain terms like “beach”. LSA has identified a latent relationship: “sand” is semantically close to “beach”.
There are some very good papers describing LSA in detail:
An introduction to LSA: http://lsa.colorado.edu/papers/dp1.LSAintro.pdf
Creating your own LSA space: http://www.andrew.cmu.edu/user/jquesada/pdf/bookSpacesRev1.pdf
Latent Semantic Analysis: http://en.wikipedia.org/wiki/Latent_semantic_indexing
This is an implementation of LSA in Python (2.4+). Thanks to scipy it’s rather simple!
1 Create the term-document matrix
We use the previous work in Vector Space Search to build this matrix.
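As a minimal sketch of this step, assuming a plain whitespace tokeniser in place of the stemming and stop-word handling done by the full Vector Space Search code:

```python
import numpy as np

def term_document_matrix(documents):
    """Build a document-by-term count matrix.

    Rows are documents, columns are terms (sorted alphabetically),
    and each entry is the raw count of a term in a document.
    """
    vocabulary = sorted({term for doc in documents for term in doc.lower().split()})
    index = {term: j for j, term in enumerate(vocabulary)}
    matrix = np.zeros((len(documents), len(vocabulary)))
    for i, doc in enumerate(documents):
        for term in doc.lower().split():
            matrix[i, index[term]] += 1
    return matrix, vocabulary

matrix, vocab = term_document_matrix(["the cat in the hat", "a cat is a fine pet"])
```

The real implementation stems terms (hence ‘poni’, ‘disabl’ in the matrices below), but the counting logic is the same.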
2 tf-idf Transform
Apply the tf-idf transform to the term-document matrix. This generally helps to improve results with LSA.
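A sketch of the transform, using the common weighting tf = term count / document length and idf = log(total documents / documents containing the term), which is consistent with the worked example matrices shown later:

```python
import numpy as np

def tfidf(matrix):
    """Apply a tf-idf weighting to a document-by-term count matrix.

    tf  = term count / number of terms in the document
    idf = log(total documents / documents containing the term)
    """
    docs_per_term = np.count_nonzero(matrix, axis=0)
    idf = np.log(matrix.shape[0] / docs_per_term)
    tf = matrix / matrix.sum(axis=1, keepdims=True)
    return tf * idf

counts = np.array([[1.0, 1.0],
                   [0.0, 1.0]])
weighted = tfidf(counts)
```

A term that appears in every document gets idf = log(1) = 0, so it is weighted down to nothing.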
3 Singular Value Decomposition
SVD: http://en.wikipedia.org/wiki/Singular_value_decomposition
Determine U, Sigma and VT from our matrix from the previous steps:
U . SIGMA . VT = MATRIX
We use the scipy svd implementation here. Note that numpy (version 1.0.4) also has an implementation of svd, but I had lots of problems with it: I found it did not work with anything larger than a 2-dimensional matrix.
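A sketch of the decomposition with scipy (using a small random stand-in for the tf-idf matrix; `full_matrices=False` gives the compact form, and scipy returns Sigma as a 1-d array of singular values that must be rebuilt into a diagonal matrix for the product):

```python
import numpy as np
from scipy.linalg import svd

rng = np.random.default_rng(0)
matrix = rng.random((4, 9))  # stand-in for the 4-document tf-idf matrix

# Compact SVD: U is (4, 4), sigma is (4,), Vt is (4, 9)
U, sigma, Vt = svd(matrix, full_matrices=False)

# Rebuilding the product recovers the original matrix
reconstructed = U @ np.diag(sigma) @ Vt
```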
4 Reduce the dimensions of Sigma
We generally delete the smallest coefficients in the diagonal matrix Sigma to produce Sigma’. Reducing the dimensions of Sigma combines some dimensions such that each depends on more than one term. The number of coefficients deleted depends on the corpus used: enough dimensions should remain to fit the real structure in the data, but few enough that noise and unimportant details are not modelled.
The real difficulty and weakness of LSA is knowing how many dimensions to remove; there is no exact method of finding the right number. Generally the L2-norm or Frobenius norm is used.
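Since scipy returns the singular values sorted largest-first, deleting the smallest coefficients amounts to zeroing the tail of the array. A sketch:

```python
import numpy as np

def reduce_sigma(sigma, dimensions_to_remove):
    """Zero out the smallest singular values to produce Sigma'.

    scipy's svd returns singular values in descending order, so the
    smallest coefficients are at the end of the array.
    """
    reduced = sigma.copy()
    if dimensions_to_remove > 0:
        reduced[-dimensions_to_remove:] = 0.0
    return reduced

sigma = np.array([2.1, 1.4, 0.8, 0.3])
reduced = reduce_sigma(sigma, 2)  # the two smallest coefficients become 0
```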
5 Calculate the Product with New Sigma’
Finally we calculate:
U . SIGMA' . VT = MATRIX'
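Putting steps 3 to 5 together, a sketch of the whole rank-reduction (again with a random stand-in for the tf-idf matrix):

```python
import numpy as np
from scipy.linalg import svd

def lsa_transform(matrix, dimensions_to_remove):
    """Rank-reduce a matrix: U . Sigma' . VT = MATRIX'."""
    U, sigma, Vt = svd(matrix, full_matrices=False)
    if dimensions_to_remove > 0:
        sigma[-dimensions_to_remove:] = 0.0  # Sigma -> Sigma'
    return U @ np.diag(sigma) @ Vt

rng = np.random.default_rng(1)
tfidf_matrix = rng.random((4, 9))  # stand-in for the 4-document tf-idf matrix
lsa_matrix = lsa_transform(tfidf_matrix, dimensions_to_remove=2)
```

The result keeps the same shape as the input but has lower rank, which is what smears term weight across related documents.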
This gives us our final LSA matrix. We can now apply the same functionality used in Vector Space Search: searching and finding related scores for documents.
LSA In Action – Matrices
We start with our model/term frequency matrix, which is generated from creating a Vector Space Search with four documents (D1–D4):

D1: “The cat in the hat disabled”
D2: “A cat is a fine pet ponies.”
D3: “Dogs and cats make good pets”
D4: “I haven’t got a hat.”
good pet hat make dog cat poni fine disabl
D1 [+0.00 +0.00 +1.00 +0.00 +0.00 +1.00 +0.00 +0.00 +1.00 ]
D2 [+0.00 +1.00 +0.00 +0.00 +0.00 +1.00 +1.00 +1.00 +0.00 ]
D3 [+1.00 +1.00 +0.00 +1.00 +1.00 +1.00 +0.00 +0.00 +0.00 ]
D4 [+0.00 +0.00 +1.00 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 ]
Apply tfidf transform:
D1 [+0.00 +0.00 +0.23 +0.00 +0.00 +0.10 +0.00 +0.00 +0.46 ]
D2 [+0.00 +0.17 +0.00 +0.00 +0.00 +0.07 +0.35 +0.35 +0.00 ]
D3 [+0.28 +0.14 +0.00 +0.28 +0.28 +0.06 +0.00 +0.00 +0.00 ]
D4 [+0.00 +0.00 +0.69 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 ]
Perform SVD – reduce Sigma’s dimensions (removing the 2 smallest coefficients):
D1 [+0.01 +0.01 +0.34 +0.01 +0.01 +0.03 +0.02 +0.02 +0.11 ]
D2 [-0.00 +0.17 -0.01 -0.00 -0.00 +0.08 +0.35 +0.35 +0.02 ]
D3 [+0.28 +0.14 -0.01 +0.28 +0.28 +0.06 -0.00 -0.00 +0.02 ]
D4 [-0.01 -0.01 +0.63 -0.01 -0.01 +0.04 -0.01 -0.01 +0.19 ]
Note that the word ‘disabl’, despite not appearing in D4, now has a weighting in that document.
Dependencies
scipy (which in turn requires numpy): http://www.scipy.org/
Problems
LSA assumes that term frequencies follow a normal distribution, whereas a Poisson distribution has actually been observed.
Source Code
Available at github:
git clone git://github.com/josephwilk/semanticpy.git