Latent Semantic Analysis in Python
December 19th, 2007
Latent Semantic Analysis (LSA) is a mathematical method that tries to bring out latent relationships within a collection of documents. Rather than looking at each document isolated from the others it looks at all the documents as a whole and the terms within them to identify relationships.
An example of LSA:
Using a search engine search for “sand“.
Documents are returned which do not contain the search term “sand” but contains terms like “beach”.
LSA has identified a latent relationship, “sand” is semantically close to “beach”.
There are some very good papers which describing LSA in detail:
- An introduction to LSA: http://lsa.colorado.edu/papers/dp1.LSAintro.pdf
- Creating your own LSA space: http://www.andrew.cmu.edu/user/jquesada/pdf/bookSpacesRev1.pdf
- Latent Semantic analysis: http://en.wikipedia.org/wiki/Latent_semantic_indexing
This is an implementation of LSA in Python (2.4+). Thanks to scipy its rather simple!
1 Create the term-document matrix
We use the previous work in Vector Space Search to build this matrix.
2 tf-idf Transform
Apply the tf-idf transform to the term-document matrix. This generally tends to help improve results with LSA.
def tfidfTransform(self,):""" Apply TermFrequency(tf)*inverseDocumentFrequency(idf) for each matrix element.This evaluates how important a word is to a document in a corpusWith a document-term matrix: matrix[x][y]tf[x][y] = frequency of term y in document x / frequency of all terms in document xidf[x][y] = log( abs(total number of documents in corpus) / abs(number of documents with term y) )Note: This is not the only way to calculate tf*idf"""documentTotal = len(self.matrix)rows,cols = self.matrix.shapefor row in xrange(0, rows): #For each documentwordTotal= reduce(lambda x, y: x+y, self.matrix[row] )for col in xrange(0,cols): #For each term#For consistency ensure all self.matrix values are floatsself.matrix[row][col] = float(self.matrix[row][col])if self.matrix[row][col]!=0:termDocumentOccurences = self.__getTermDocumentOccurences(col)termFrequency = self.matrix[row][col] / float(wordTotal)inverseDocumentFrequency = log(abs(documentTotal / float(termDocumentOccurences)))self.matrix[row][col]=termFrequency*inverseDocumentFrequency
3 Singular Value Decomposition
SVD: http://en.wikipedia.org/wiki/Singular_value_decomposition
Determine U, Sigma, VT from our MATRIX from previous steps.
U . SIGMA . VT = MATRIX
We use the scipy svd implementation here. Note that numpy (version: 1.0.4 ) also has an implementation of svd but I had lots of problems with it. I found it did not work with anything larger than a 2 dimensional matrix.
#Sigma comes out as a list rather than a matrixu,sigma,vt = linalg.svd(self.matrix)
4 Reduce the dimensions of Sigma
We generally delete the smallest coefficients in the diagonal matrix Sigma to produce Sigma’. The reduction of the dimensions of Sigma combines some dimensions such that they are on more than one term. The number of coefficients deleted can depend of the corpus used. It should be large enough to fit the real structure in the data, but small enough such that noise or unimportant details are not modelled.
The real difficulty and weakness of LSA is knowing how many dimensions to remove. There is no exact method of finding the right dimensions. Generally L2-norm or Frobenius norm are used.
5 Calculate the Product with New Sigma’
Finally we calculate:
U . SIGMA' . VT = MATRIX'
#Reconstruct MATRIX'reconstructedMatrix= dot(dot(u,linalg.diagsvd(sigma,len(self.matrix),len(vt))),vt)
Giving use our final LSA matrix. We can now apply the same functionality used in Vector space search: searching and finding related scores for documents.
LSA In Action – Matrices
We start with out Model-Term frequency matrix with is generated from creating a Vector Space Search with four documents (D1-D4):
D1:”The cat in the hat disabled”
D2:”A cat is a fine pet ponies.”
D3:”Dogs and cats make good pets”
D4:”I haven’t got a hat.”
good pet hat make dog cat poni fine disabl D1 [+0.00 +0.00 +1.00 +0.00 +0.00 +1.00 +0.00 +0.00 +1.00 ] D2 [+0.00 +1.00 +0.00 +0.00 +0.00 +1.00 +1.00 +1.00 +0.00 ] D3 [+1.00 +1.00 +0.00 +1.00 +1.00 +1.00 +0.00 +0.00 +0.00 ] D4 [+0.00 +0.00 +1.00 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 ]
Apply tf-idf transform:
D1 [+0.00 +0.00 +0.23 +0.00 +0.00 +0.10 +0.00 +0.00 +0.46 ]
D2 [+0.00 +0.17 +0.00 +0.00 +0.00 +0.07 +0.35 +0.35 +0.00 ]
D3 [+0.28 +0.14 +0.00 +0.28 +0.28 +0.06 +0.00 +0.00 +0.00 ]
D4 [+0.00 +0.00 +0.69 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 ]
Perform SVD – Reduce Sigmas dimensions(removing the 2 smallest coefficients)
D1 [+0.01 +0.01 +0.34 +0.01 +0.01 +0.03 +0.02 +0.02 +0.11 ]
D2 [-0.00 +0.17 -0.01 -0.00 -0.00 +0.08 +0.35 +0.35 +0.02 ]
D3 [+0.28 +0.14 -0.01 +0.28 +0.28 +0.06 -0.00 -0.00 +0.02 ]
D4 [-0.01 -0.01 +0.63 -0.01 -0.01 +0.04 -0.01 -0.01 +0.19 ]
Note the Word ‘disabl’ despite not being in D4 now has a weighting in that document.
Dependencies
Problems
LSA assumes the Normal distribution where the Poisson distribution has actually been observed.
Source Code
Alternatively available at github:
git clone git://github.com/josephwilk/semanticpy.git
from scipy import linalg,array,dot,matfrom math import *from pprint import pprintclass LSA:""" Latent Semantic Analysis(LSA).Apply transforms to a document-term matrix to bring out latent relationships.These are found by analysing relationships between the documents and the terms theycontain."""def __init__(self, matrix):self.matrix = array(matrix)def __repr__(self,):""" Make the matrix look pretty """stringRepresentation=""rows,cols = self.matrix.shapefor row in xrange(0,rows):stringRepresentation += "["for col in xrange(0,cols):stringRepresentation+= "%+0.2f "%self.matrix[row][col]stringRepresentation += "]\n"return stringRepresentationdef __getTermDocumentOccurences(self,col):""" Find how many documents a term occurs in"""termDocumentOccurences=0rows,cols = self.matrix.shapefor n in xrange(0,rows):if self.matrix[n][col]>0: #Term appears in documenttermDocumentOccurences+=1return termDocumentOccurencesdef tfidfTransform(self,):""" Apply TermFrequency(tf)*inverseDocumentFrequency(idf) for each matrix element.This evaluates how important a word is to a document in a corpusWith a document-term matrix: matrix[x][y]tf[x][y] = frequency of term y in document x / frequency of all terms in document xidf[x][y] = log( abs(total number of documents in corpus) / abs(number of documents with term y) )Note: This is not the only way to calculate tf*idf"""documentTotal = len(self.matrix)rows,cols = self.matrix.shapefor row in xrange(0, rows): #For each documentwordTotal= reduce(lambda x, y: x+y, self.matrix[row] )for col in xrange(0,cols): #For each term#For consistency ensure all self.matrix values are floatsself.matrix[row][col] = float(self.matrix[row][col])if self.matrix[row][col]!=0:termDocumentOccurences = self.__getTermDocumentOccurences(col)termFrequency = self.matrix[row][col] / float(wordTotal)inverseDocumentFrequency = log(abs(documentTotal / float(termDocumentOccurences)))self.matrix[row][col]=termFrequency*inverseDocumentFrequencydef lsaTransform(self,dimensions=1):""" Calculate SVD of objects matrix: U . SIGMA . VT = MATRIXReduce the dimension of sigma by specified factor producing sigma'.Then dot product the matrices: U . SIGMA' . VT = MATRIX'"""rows,cols= self.matrix.shapeif dimensions <= rows: #Its a valid reduction#Sigma comes out as a list rather than a matrixu,sigma,vt = linalg.svd(self.matrix)#Dimension reduction, build SIGMA'for index in xrange(rows-dimensions, rows):sigma[index]=0#print linalg.diagsvd(sigma,len(self.matrix), len(vt))#Reconstruct MATRIX'reconstructedMatrix= dot(dot(u,linalg.diagsvd(sigma,len(self.matrix),len(vt))),vt)#Save transformself.matrix=reconstructedMatrixelse:print "dimension reduction cannot be greater than %s" % rowsif __name__ == '__main__':#Example document-term matrix# Vector dimensions: good, pet, hat, make, dog, cat, poni, fine, disablmatrix=[[0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0],[0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0],[1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0],[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]#Createlsa = LSA(matrix)print lsa#Preparelsa.tfidfTransform()print lsa#Performlsa.lsaTransform()print lsa




April 5th, 2008 at 6:50 pm
This is very interesting. I’m wondering what applications this has. It is just a sophisticated way to search documents, or are there other things you can do with it as well?
April 5th, 2008 at 8:06 pm
For example, you can build a thesauraus, find synonyms, or … cluster software by topics! Detecting high-level, ie beyond mere structure, aspects.
http://smallwiki.unibe.ch/adriankuhn/hapax/
April 6th, 2008 at 10:50 am
LSA’s applications are varied. While it is well known in search with projects like Lucene and Ferret,
in the wider scope its a feature extraction technique.
So it has been used in applications outside of just searching documents such as
pLSA (Probabilistic latent semantic analysis) and LDA (Latent Dirichlet allocation) are two other techniques that come from LSA. I’m in the middle of writing LDA in ruby at the moment.
April 7th, 2008 at 10:27 pm
[...] 7th, 2008 in Links An implementation of LSA in Python. Latent Semantic Analysis does closeness comparisons in textual sources to find [...]
April 17th, 2008 at 10:35 pm
Minor typo. Line 67 should read:
self.matrix[row][col] = float(self.matrix[row][col])
>Thanks, corrected typo.
>Joseph Wilk
April 22nd, 2008 at 6:42 pm
can you please put up source code of PLSA also….it sounds more challenging. LSA has a lot of scope for improvement, PLSA being one method. So please work and put up source code if you can.
April 23rd, 2008 at 9:11 pm
I agree the lack of statistical foundations of LSA does leave some room for improvement.
My next release will be Latent Dirichlet allocation and this will be implemented in ruby. I’ll post here as soon as I’ve got the code up (and I’ll release the code before I write up the blog if you want to see it early).
May 25th, 2008 at 10:32 am
Hello to all,
great post
I am a student and my thesis is on recommendation systems
i am working on a movie data set ( using svd )
i was wondering if we could use LSA on a subtitle data set
( if there is one ) in order to get recommendations
its just an idea, i see you dont have recommendations in
LSA’s applications so i just had to ask
thank you
May 30th, 2008 at 7:11 pm
Hello Andreas.
Ultimately with LSA we are using a vector space model. No matter what way we derive the weights (pLSA, LSA, LDA) we map to that model.
So within the model we can use cosine to measure the relatedness of documents in the vector space.
With your example I can see you could find subtitles that where related. As for recommendations it really depends on what your movie data set is like. Whether it would make sense mapping a movie data item to a document in the vector space (where each vector is a subtitle document) and using cosine to find recommendations for subtitles.
June 24th, 2008 at 10:43 pm
Andreas: we’re applying LSA (actually, PLSI) to textual data for our recommendationa algorithm, which we launched a demo of a few days ago.
As Joseph points out, the real trick is dimensionality reduction. How you tune your dimensionality reduction algorithm and/or heuristic will have a significant impact on precision and recall (and performance).
Overall, however, LSA works very well for computing recommendations from textual data. Currently we’re applying PLSI to only the English Wikipedia corpus; however, it would probably work well on the IMDB dataset, too.
-ESer.org
December 6th, 2008 at 5:37 pm
One major problem with LSA is that is doesn’t scale well to large sizes, because of the LDU matrix decomposition…. it’s O(docs^2 + unique_terms^2), even when calculated using sparse methods. It can also be very hard interpret what the “factors” really mean.