Latent Semantic Analysis (LSA) is a mathematical method that tries to bring out latent relationships within a collection of documents. Rather than looking at each document isolated from the others it looks at all the documents as a whole and the terms within them to identify relationships.

An example of LSA:
Using a search engine search for “sand“.

Documents are returned which do not contain the search term “sand” but contains terms like “beach”.

LSA has identified a latent relationship, “sand” is semantically close to “beach”.

There are some very good papers which describing LSA in detail:

This is an implementation of LSA in Python (2.4+). Thanks to scipy its rather simple!

1 Create the term-document matrix

We use the previous work in Vector Space Search to build this matrix.

2 tf-idf Transform

Apply the tf-idf transform to the term-document matrix. This generally tends to help improve results with LSA.

  1.  
  2.  
  3. def tfidfTransform(self,):
  4. """ Apply TermFrequency(tf)*inverseDocumentFrequency(idf) for each matrix element.
  5. This evaluates how important a word is to a document in a corpus
  6.  
  7. With a document-term matrix: matrix[x][y]
  8. tf[x][y] = frequency of term y in document x / frequency of all terms in document x
  9. idf[x][y] = log( abs(total number of documents in corpus) / abs(number of documents with term y)  )
  10. Note: This is not the only way to calculate tf*idf
  11. """
  12.  
  13. documentTotal = len(self.matrix)
  14. rows,cols = self.matrix.shape
  15.  
  16. for row in xrange(0, rows): #For each document
  17.  
  18. wordTotal= reduce(lambda x, y: x+y, self.matrix[row] )
  19.  
  20. for col in xrange(0,cols): #For each term
  21.  
  22. #For consistency ensure all self.matrix values are floats
  23. self.matrix[row][col] = float(self.matrix[row][col])
  24.  
  25. if self.matrix[row][col]!=0:
  26.  
  27. termDocumentOccurences = self.__getTermDocumentOccurences(col)
  28.  
  29. termFrequency = self.matrix[row][col] / float(wordTotal)
  30. inverseDocumentFrequency = log(abs(documentTotal / float(termDocumentOccurences)))
  31. self.matrix[row][col]=termFrequency*inverseDocumentFrequency
  32.  
  33.  

3 Singular Value Decomposition

SVD: http://en.wikipedia.org/wiki/Singular_value_decomposition

Determine U, Sigma, VT from our MATRIX from previous steps.

 U . SIGMA . VT = MATRIX

We use the scipy svd implementation here. Note that numpy (version: 1.0.4 ) also has an implementation of svd but I had lots of problems with it. I found it did not work with anything larger than a 2 dimensional matrix.

  1.  
  2. #Sigma comes out as a list rather than a matrix
  3. u,sigma,vt = linalg.svd(self.matrix)
  4.  

4 Reduce the dimensions of Sigma

We generally delete the smallest coefficients in the diagonal matrix Sigma to produce Sigma’. The reduction of the dimensions of Sigma combines some dimensions such that they are on more than one term. The number of coefficients deleted can depend of the corpus used. It should be large enough to fit the real structure in the data, but small enough such that noise or unimportant details are not modelled.

The real difficulty and weakness of LSA is knowing how many dimensions to remove. There is no exact method of finding the right dimensions. Generally L2-norm or Frobenius norm are used.

5 Calculate the Product with New Sigma’

Finally we calculate:

U . SIGMA' . VT = MATRIX'
  1.  
  2. #Reconstruct MATRIX'
  3. reconstructedMatrix= dot(dot(u,linalg.diagsvd(sigma,len(self.matrix),len(vt))),vt)
  4.  

Giving use our final LSA matrix. We can now apply the same functionality used in Vector space search: searching and finding related scores for documents.

LSA In Action – Matrices

We start with out Model-Term frequency matrix with is generated from creating a Vector Space Search with four documents (D1-D4):
D1:”The cat in the hat disabled
D2:”A cat is a fine pet ponies.
D3:”Dogs and cats make good pets
D4:”I haven’t got a hat.

    good  pet   hat   make  dog   cat   poni  fine  disabl
D1 [+0.00 +0.00 +1.00 +0.00 +0.00 +1.00 +0.00 +0.00 +1.00 ]
D2 [+0.00 +1.00 +0.00 +0.00 +0.00 +1.00 +1.00 +1.00 +0.00 ]
D3 [+1.00 +1.00 +0.00 +1.00 +1.00 +1.00 +0.00 +0.00 +0.00 ]
D4 [+0.00 +0.00 +1.00 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 ]

Apply tf-idf transform:

D1 [+0.00 +0.00 +0.23 +0.00 +0.00 +0.10 +0.00 +0.00 +0.46 ]
D2 [+0.00 +0.17 +0.00 +0.00 +0.00 +0.07 +0.35 +0.35 +0.00 ]
D3 [+0.28 +0.14 +0.00 +0.28 +0.28 +0.06 +0.00 +0.00 +0.00 ]
D4 [+0.00 +0.00 +0.69 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 ]

Perform SVD – Reduce Sigmas dimensions(removing the 2 smallest coefficients)

D1 [+0.01 +0.01 +0.34 +0.01 +0.01 +0.03 +0.02 +0.02 +0.11 ]
D2 [-0.00 +0.17 -0.01 -0.00 -0.00 +0.08 +0.35 +0.35 +0.02 ]
D3 [+0.28 +0.14 -0.01 +0.28 +0.28 +0.06 -0.00 -0.00 +0.02 ]
D4 [-0.01 -0.01 +0.63 -0.01 -0.01 +0.04 -0.01 -0.01 +0.19 ]

Note the Word ‘disabl’ despite not being in D4 now has a weighting in that document.

Dependencies

http://www.scipy.org/

Problems

LSA assumes the Normal distribution where the Poisson distribution has actually been observed.

Source Code

lsa.py

Alternatively available at github:

git clone git://github.com/josephwilk/semanticpy.git
  1. from scipy import linalg,array,dot,mat
  2. from math import *
  3. from pprint import pprint
  4.  
  5.  
  6. class LSA:
  7. """ Latent Semantic Analysis(LSA).
  8. Apply transforms to a document-term matrix to bring out latent relationships.
  9. These are found by analysing relationships between the documents and the terms they
  10. contain.
  11. """
  12.  
  13.  
  14. def __init__(self, matrix):
  15. self.matrix = array(matrix)
  16.  
  17.  
  18. def __repr__(self,):
  19. """ Make the matrix look pretty """
  20. stringRepresentation=""
  21.  
  22. rows,cols = self.matrix.shape
  23.  
  24. for row in xrange(0,rows):
  25. stringRepresentation += "["
  26.  
  27. for col in xrange(0,cols):
  28. stringRepresentation+= "%+0.2f "%self.matrix[row][col]
  29. stringRepresentation += "]\n"
  30.  
  31. return stringRepresentation
  32.  
  33.  
  34. def __getTermDocumentOccurences(self,col):
  35. """ Find how many documents a term occurs in"""
  36.  
  37. termDocumentOccurences=0
  38.  
  39. rows,cols = self.matrix.shape
  40.  
  41. for n in xrange(0,rows):
  42. if self.matrix[n][col]>0: #Term appears in document
  43. termDocumentOccurences+=1
  44. return termDocumentOccurences
  45.  
  46.  
  47. def tfidfTransform(self,):
  48. """ Apply TermFrequency(tf)*inverseDocumentFrequency(idf) for each matrix element.
  49. This evaluates how important a word is to a document in a corpus
  50.  
  51. With a document-term matrix: matrix[x][y]
  52. tf[x][y] = frequency of term y in document x / frequency of all terms in document x
  53. idf[x][y] = log( abs(total number of documents in corpus) / abs(number of documents with term y)  )
  54. Note: This is not the only way to calculate tf*idf
  55. """
  56.  
  57. documentTotal = len(self.matrix)
  58. rows,cols = self.matrix.shape
  59.  
  60. for row in xrange(0, rows): #For each document
  61.  
  62. wordTotal= reduce(lambda x, y: x+y, self.matrix[row] )
  63.  
  64. for col in xrange(0,cols): #For each term
  65.  
  66. #For consistency ensure all self.matrix values are floats
  67. self.matrix[row][col] = float(self.matrix[row][col])
  68.  
  69. if self.matrix[row][col]!=0:
  70.  
  71. termDocumentOccurences = self.__getTermDocumentOccurences(col)
  72.  
  73. termFrequency = self.matrix[row][col] / float(wordTotal)
  74. inverseDocumentFrequency = log(abs(documentTotal / float(termDocumentOccurences)))
  75. self.matrix[row][col]=termFrequency*inverseDocumentFrequency
  76.  
  77.  
  78. def lsaTransform(self,dimensions=1):
  79. """ Calculate SVD of objects matrix: U . SIGMA . VT = MATRIX
  80. Reduce the dimension of sigma by specified factor producing sigma'.
  81. Then dot product the matrices:  U . SIGMA' . VT = MATRIX'
  82. """
  83. rows,cols= self.matrix.shape
  84.  
  85. if dimensions <= rows: #Its a valid reduction
  86.  
  87. #Sigma comes out as a list rather than a matrix
  88. u,sigma,vt = linalg.svd(self.matrix)
  89.  
  90. #Dimension reduction, build SIGMA'
  91. for index in xrange(rows-dimensions, rows):
  92. sigma[index]=0
  93.  
  94. #print linalg.diagsvd(sigma,len(self.matrix), len(vt))
  95.  
  96. #Reconstruct MATRIX'
  97. reconstructedMatrix= dot(dot(u,linalg.diagsvd(sigma,len(self.matrix),len(vt))),vt)
  98.  
  99. #Save transform
  100. self.matrix=reconstructedMatrix
  101.  
  102. else:
  103. print "dimension reduction cannot be greater than %s" % rows
  104.  
  105.  
  106. if __name__ == '__main__':
  107.  
  108. #Example document-term matrix
  109. # Vector dimensions: good, pet, hat, make, dog, cat, poni, fine, disabl
  110. matrix=[[0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0],
  111. [0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0],
  112. [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
  113. [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
  114.  
  115. #Create
  116. lsa = LSA(matrix)
  117. print lsa
  118.  
  119. #Prepare
  120. lsa.tfidfTransform()
  121. print lsa
  122.  
  123. #Perform
  124. lsa.lsaTransform()
  125. print lsa
  126.  
  • One major problem with LSA is that is doesn't scale well to large sizes, because of the LDU matrix decomposition.... it's O(docs^2 + unique_terms^2), even when calculated using sparse methods. It can also be very hard interpret what the "factors" really mean.
  • Andreas: we're applying LSA (actually, PLSI) to textual data for our recommendationa algorithm, which we launched a demo of a few days ago.

    As Joseph points out, the real trick is dimensionality reduction. How you tune your dimensionality reduction algorithm and/or heuristic will have a significant impact on precision and recall (and performance).

    Overall, however, LSA works very well for computing recommendations from textual data. Currently we're applying PLSI to only the English Wikipedia corpus; however, it would probably work well on the IMDB dataset, too.

    -ESer.org
  • Hello Andreas.

    Ultimately with LSA we are using a vector space model. No matter what way we derive the weights (pLSA, LSA, LDA) we map to that model.

    So within the model we can use cosine to measure the relatedness of documents in the vector space.

    With your example I can see you could find subtitles that where related. As for recommendations it really depends on what your movie data set is like. Whether it would make sense mapping a movie data item to a document in the vector space (where each vector is a subtitle document) and using cosine to find recommendations for subtitles.
  • Andreas
    Hello to all,
    great post


    I am a student and my thesis is on recommendation systems
    i am working on a movie data set ( using svd )

    i was wondering if we could use LSA on a subtitle data set
    ( if there is one ) in order to get recommendations

    its just an idea, i see you dont have recommendations in
    LSA’s applications so i just had to ask

    thank you
  • Joseph Wilk
    I agree the lack of statistical foundations of LSA does leave some room for improvement.

    My next release will be Latent Dirichlet allocation and this will be implemented in ruby. I'll post here as soon as I've got the code up (and I'll release the code before I write up the blog if you want to see it early).
  • Shankar
    can you please put up source code of PLSA also....it sounds more challenging. LSA has a lot of scope for improvement, PLSA being one method. So please work and put up source code if you can.
  • Minor typo. Line 67 should read:
    self.matrix[row][col] = float(self.matrix[row][col])

    >Thanks, corrected typo.
    >Joseph Wilk
  • Joseph Wilk
    LSA's applications are varied. While it is well known in search with projects like Lucene and Ferret,
    in the wider scope its a feature extraction technique.

    So it has been used in applications outside of just searching documents such as
    <ul>
    <li> Categorization of documents</li>
    <li> Improving OCR of Chinese</li>
    <li> Identifying similarities in source code</li>
    <li> Detecting certain Proteins</li>
    </ul>


    pLSA (Probabilistic latent semantic analysis) and LDA (Latent Dirichlet allocation) are two other techniques that come from LSA. I'm in the middle of writing LDA in ruby at the moment.
  • For example, you can build a thesauraus, find synonyms, or ... cluster software by topics! Detecting high-level, ie beyond mere structure, aspects.

    http://smallwiki.unibe.ch/adriankuhn/hapax/
  • This is very interesting. I'm wondering what applications this has. It is just a sophisticated way to search documents, or are there other things you can do with it as well?
blog comments powered by Disqus