Latent Semantic Analysis in Python
19 Dec
Latent Semantic Analysis (LSA) is a mathematical method that tries to bring out latent relationships within a collection of documents. Rather than looking at each document isolated from the others it looks at all the documents as a whole and the terms within them to identify relationships.
An example of LSA:
Using a search engine search for “sand“.
Documents are returned which do not contain the search term “sand” but contains terms like “beach”.
LSA has identified a latent relationship, “sand” is semantically close to “beach”.
There are some very good papers which describing LSA in detail:
- An introduction to LSA: http://lsa.colorado.edu/papers/dp1.LSAintro.pdf
- Creating your own LSA space: http://www.andrew.cmu.edu/user/jquesada/pdf/bookSpacesRev1.pdf
- Latent Semantic analysis: http://en.wikipedia.org/wiki/Latent_semantic_indexing
This is an implementation of LSA in Python (2.4+). Thanks to scipy its rather simple!
1 Create the term-document matrix
We use the previous work in Vector Space Search to build this matrix.
2 tf-idf Transform
Apply the tf-idf transform to the term-document matrix. This generally tends to help improve results with LSA.
def tfidfTransform(self,):""" Apply TermFrequency(tf)*inverseDocumentFrequency(idf) for each matrix element.This evaluates how important a word is to a document in a corpusWith a document-term matrix: matrix[x][y]tf[x][y] = frequency of term y in document x / frequency of all terms in document xidf[x][y] = log( abs(total number of documents in corpus) / abs(number of documents with term y) )Note: This is not the only way to calculate tf*idf"""documentTotal = len(self.matrix)rows,cols = self.matrix.shapefor row in xrange(0, rows): #For each documentwordTotal= reduce(lambda x, y: x+y, self.matrix[row] )for col in xrange(0,cols): #For each term#For consistency ensure all self.matrix values are floatsself.matrix[row][col] = float(self.matrix[row][col])if self.matrix[row][col]!=0:termDocumentOccurences = self.__getTermDocumentOccurences(col)termFrequency = self.matrix[row][col] / float(wordTotal)inverseDocumentFrequency = log(abs(documentTotal / float(termDocumentOccurences)))self.matrix[row][col]=termFrequency*inverseDocumentFrequency
3 Singular Value Decomposition
SVD: http://en.wikipedia.org/wiki/Singular_value_decomposition
Determine U, Sigma, VT from our MATRIX from previous steps.
U . SIGMA . VT = MATRIX
We use the scipy svd implementation here. Note that numpy (version: 1.0.4 ) also has an implementation of svd but I had lots of problems with it. I found it did not work with anything larger than a 2 dimensional matrix.
#Sigma comes out as a list rather than a matrixu,sigma,vt = linalg.svd(self.matrix)
4 Reduce the dimensions of Sigma
We generally delete the smallest coefficients in the diagonal matrix Sigma to produce Sigma’. The reduction of the dimensions of Sigma combines some dimensions such that they are on more than one term. The number of coefficients deleted can depend of the corpus used. It should be large enough to fit the real structure in the data, but small enough such that noise or unimportant details are not modelled.
The real difficulty and weakness of LSA is knowing how many dimensions to remove. There is no exact method of finding the right dimensions. Generally L2-norm or Frobenius norm are used.
5 Calculate the Product with New Sigma’
Finally we calculate:
U . SIGMA' . VT = MATRIX'
#Reconstruct MATRIX'reconstructedMatrix= dot(dot(u,linalg.diagsvd(sigma,len(self.matrix),len(vt))),vt)
Giving use our final LSA matrix. We can now apply the same functionality used in Vector space search: searching and finding related scores for documents.
LSA In Action – Matrices
We start with out Model-Term frequency matrix with is generated from creating a Vector Space Search with four documents (D1-D4):
D1:”The cat in the hat disabled”
D2:”A cat is a fine pet ponies.”
D3:”Dogs and cats make good pets”
D4:”I haven’t got a hat.”
good pet hat make dog cat poni fine disabl D1 [+0.00 +0.00 +1.00 +0.00 +0.00 +1.00 +0.00 +0.00 +1.00 ] D2 [+0.00 +1.00 +0.00 +0.00 +0.00 +1.00 +1.00 +1.00 +0.00 ] D3 [+1.00 +1.00 +0.00 +1.00 +1.00 +1.00 +0.00 +0.00 +0.00 ] D4 [+0.00 +0.00 +1.00 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 ]
Apply tf-idf transform:
D1 [+0.00 +0.00 +0.23 +0.00 +0.00 +0.10 +0.00 +0.00 +0.46 ]
D2 [+0.00 +0.17 +0.00 +0.00 +0.00 +0.07 +0.35 +0.35 +0.00 ]
D3 [+0.28 +0.14 +0.00 +0.28 +0.28 +0.06 +0.00 +0.00 +0.00 ]
D4 [+0.00 +0.00 +0.69 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 ]
Perform SVD – Reduce Sigmas dimensions(removing the 2 smallest coefficients)
D1 [+0.01 +0.01 +0.34 +0.01 +0.01 +0.03 +0.02 +0.02 +0.11 ]
D2 [-0.00 +0.17 -0.01 -0.00 -0.00 +0.08 +0.35 +0.35 +0.02 ]
D3 [+0.28 +0.14 -0.01 +0.28 +0.28 +0.06 -0.00 -0.00 +0.02 ]
D4 [-0.01 -0.01 +0.63 -0.01 -0.01 +0.04 -0.01 -0.01 +0.19 ]
Note the Word ‘disabl’ despite not being in D4 now has a weighting in that document.
Dependencies
Problems
LSA assumes the Normal distribution where the Poisson distribution has actually been observed.
Source Code
Alternatively available at github:
git clone git://github.com/josephwilk/semanticpy.git
from scipy import linalg,array,dot,matfrom math import *from pprint import pprintclass LSA:""" Latent Semantic Analysis(LSA).Apply transforms to a document-term matrix to bring out latent relationships.These are found by analysing relationships between the documents and the terms theycontain."""def __init__(self, matrix):self.matrix = array(matrix)def __repr__(self,):""" Make the matrix look pretty """stringRepresentation=""rows,cols = self.matrix.shapefor row in xrange(0,rows):stringRepresentation += "["for col in xrange(0,cols):stringRepresentation+= "%+0.2f "%self.matrix[row][col]stringRepresentation += "]\n"return stringRepresentationdef __getTermDocumentOccurences(self,col):""" Find how many documents a term occurs in"""termDocumentOccurences=0rows,cols = self.matrix.shapefor n in xrange(0,rows):if self.matrix[n][col]>0: #Term appears in documenttermDocumentOccurences+=1return termDocumentOccurencesdef tfidfTransform(self,):""" Apply TermFrequency(tf)*inverseDocumentFrequency(idf) for each matrix element.This evaluates how important a word is to a document in a corpusWith a document-term matrix: matrix[x][y]tf[x][y] = frequency of term y in document x / frequency of all terms in document xidf[x][y] = log( abs(total number of documents in corpus) / abs(number of documents with term y) )Note: This is not the only way to calculate tf*idf"""documentTotal = len(self.matrix)rows,cols = self.matrix.shapefor row in xrange(0, rows): #For each documentwordTotal= reduce(lambda x, y: x+y, self.matrix[row] )for col in xrange(0,cols): #For each term#For consistency ensure all self.matrix values are floatsself.matrix[row][col] = float(self.matrix[row][col])if self.matrix[row][col]!=0:termDocumentOccurences = self.__getTermDocumentOccurences(col)termFrequency = self.matrix[row][col] / float(wordTotal)inverseDocumentFrequency = log(abs(documentTotal / float(termDocumentOccurences)))self.matrix[row][col]=termFrequency*inverseDocumentFrequencydef lsaTransform(self,dimensions=1):""" Calculate SVD of objects matrix: U . SIGMA . VT = MATRIXReduce the dimension of sigma by specified factor producing sigma'.Then dot product the matrices: U . SIGMA' . VT = MATRIX'"""rows,cols= self.matrix.shapeif dimensions <= rows: #Its a valid reduction#Sigma comes out as a list rather than a matrixu,sigma,vt = linalg.svd(self.matrix)#Dimension reduction, build SIGMA'for index in xrange(rows-dimensions, rows):sigma[index]=0#print linalg.diagsvd(sigma,len(self.matrix), len(vt))#Reconstruct MATRIX'reconstructedMatrix= dot(dot(u,linalg.diagsvd(sigma,len(self.matrix),len(vt))),vt)#Save transformself.matrix=reconstructedMatrixelse:print "dimension reduction cannot be greater than %s" % rowsif __name__ == '__main__':#Example document-term matrix# Vector dimensions: good, pet, hat, make, dog, cat, poni, fine, disablmatrix=[[0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0],[0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0],[1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0],[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]#Createlsa = LSA(matrix)print lsa#Preparelsa.tfidfTransform()print lsa#Performlsa.lsaTransform()print lsa
-
Gregg Lind
-
ESER
-
Joseph Wilk
-
Andreas
-
Joseph Wilk
-
Shankar
-
Michael Shilman
-
Joseph Wilk
-
Adrian Kuhn
-
Greg