Joseph Wilk


Latent Semantic Analysis in Python

Latent Semantic Analysis (LSA) is a mathematical method that tries to bring out latent relationships within a collection of documents. Rather than looking at each document in isolation, it looks at all the documents as a whole, and at the terms within them, to identify relationships.

An example of LSA: using a search engine, search for “sand”.

Documents are returned which do not contain the search term “sand” but do contain terms like “beach”.

LSA has identified a latent relationship: “sand” is semantically close to “beach”.

There are some very good papers describing LSA in detail.

This is an implementation of LSA in Python (2.4+). Thanks to scipy it's rather simple!

1 Create the term-document matrix

We use the previous work in Vector Space Search to build this matrix.
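
As a reminder of what that step produces, here is a minimal sketch of building a term-document matrix of raw counts (the real code lives in the Vector Space Search post; buildTermDocumentMatrix and its naive tokenisation are illustrative only, not the actual semanticpy API):

from numpy import zeros

def buildTermDocumentMatrix(documents):
    """ Return (matrix, vocabulary) where matrix[document][term] is a raw term count """
    #Collect the vocabulary across the whole document collection
    vocabulary = sorted(set(word for doc in documents for word in doc.lower().split()))
    index = dict((word, i) for i, word in enumerate(vocabulary))

    matrix = zeros((len(documents), len(vocabulary)))
    for row, doc in enumerate(documents):
        for word in doc.lower().split():
            matrix[row][index[word]] += 1
    return matrix, vocabulary

A real version would also strip stop words and stem terms, which is why the example matrices later in the post contain stems like ‘poni’ and ‘disabl’.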

2 tf-idf Transform

Apply the tf-idf transform to the term-document matrix. This generally tends to help improve results with LSA.

def tfidfTransform(self):
        """ Apply TermFrequency(tf)*inverseDocumentFrequency(idf) for each matrix element.
            This evaluates how important a word is to a document in a corpus

            With a document-term matrix: matrix[x][y]
                tf[x][y] = frequency of term y in document x / frequency of all terms in document x
                idf[x][y] = log( abs(total number of documents in corpus) / abs(number of documents with term y)  )
            Note: This is not the only way to calculate tf*idf
        """

        documentTotal = len(self.matrix)
        rows,cols = self.matrix.shape

        for row in xrange(0, rows): #For each document

                wordTotal = reduce(lambda x, y: x+y, self.matrix[row] )

                for col in xrange(0,cols): #For each term

                        #For consistency ensure all self.matrix values are floats
                        self.matrix[row][col] = float(self.matrix[row][col])

                        if self.matrix[row][col]!=0:

                                termDocumentOccurences = self.__getTermDocumentOccurences(col)

                                termFrequency = self.matrix[row][col] / float(wordTotal)
                                inverseDocumentFrequency = log(abs(documentTotal / float(termDocumentOccurences)))
                                self.matrix[row][col]=termFrequency*inverseDocumentFrequency

3 Singular Value Decomposition

SVD: http://en.wikipedia.org/wiki/Singular_value_decomposition

Determine U, SIGMA and VT from the MATRIX built in the previous steps, such that:

 U . SIGMA . VT = MATRIX

We use the scipy svd implementation here. Note that numpy (version 1.0.4) also has an implementation of svd, but I had lots of problems with it: I found it did not work with anything larger than a two-dimensional matrix.

#Sigma comes back as a flat array of singular values rather than a diagonal matrix
u, sigma, vt = linalg.svd(self.matrix)

4 Reduce the dimensions of Sigma

We generally delete the smallest coefficients in the diagonal matrix Sigma to produce Sigma'. Reducing the dimensions of Sigma combines some dimensions so that they depend on more than one term. The number of coefficients to delete depends on the corpus used: it should be large enough to fit the real structure in the data, but small enough that noise or unimportant details are not modelled.

The real difficulty and weakness of LSA is knowing how many dimensions to remove. There is no exact method of finding the right number of dimensions; generally the L2-norm or Frobenius norm is used to measure how closely the reduced matrix approximates the original.
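
As a minimal sketch of this step (not the semanticpy implementation itself): assuming sigma is the flat array of singular values returned by linalg.svd, sorted largest first, and that dimensionsToRemove has been chosen for the corpus at hand, the smallest coefficients can simply be zeroed out:

def reduceSigma(sigma, dimensionsToRemove):
    """ Zero out the smallest singular values to produce Sigma' """
    sigmaPrime = sigma.copy()
    if dimensionsToRemove > 0:
        #linalg.svd returns singular values sorted largest to smallest,
        #so the coefficients to drop sit at the end of the array
        sigmaPrime[-dimensionsToRemove:] = 0.0
    return sigmaPrime

Zeroing the coefficients (rather than physically truncating the arrays) keeps the shapes compatible with the diagsvd reconstruction used in the next step.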

5 Calculate the Product with New Sigma’

Finally we calculate:

U . SIGMA' . VT = MATRIX'
#Reconstruct MATRIX' from U, the (reduced) Sigma and VT
reconstructedMatrix = dot(dot(u, linalg.diagsvd(sigma, len(self.matrix), len(vt))), vt)

This gives us our final LSA matrix. We can now apply the same functionality used in Vector Space Search: searching and finding relatedness scores for documents.
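
As a rough sketch of that reuse (not the semanticpy code itself; relatedScores, lsaMatrix and queryVector are illustrative names, and the query vector is assumed to be built with the same vocabulary and tf-idf weighting as the documents), each document row of the LSA matrix can be scored against a query with cosine similarity:

from numpy import dot
from numpy.linalg import norm

def relatedScores(lsaMatrix, queryVector):
    """ Cosine similarity between the query and each document row of the LSA matrix """
    scores = []
    for documentVector in lsaMatrix:
        denominator = norm(documentVector) * norm(queryVector)
        if denominator == 0:
            scores.append(0.0)
        else:
            scores.append(dot(documentVector, queryVector) / denominator)
    return scores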

LSA In Action – Matrices

We start with our document-term frequency matrix, which is generated by creating a Vector Space Search with four documents (D1–D4):

D1: “The cat in the hat disabled”
D2: “A cat is a fine pet ponies.”
D3: “Dogs and cats make good pets”
D4: “I haven’t got a hat.”

    good  pet   hat   make  dog   cat   poni  fine  disabl
D1 [+0.00 +0.00 +1.00 +0.00 +0.00 +1.00 +0.00 +0.00 +1.00 ]
D2 [+0.00 +1.00 +0.00 +0.00 +0.00 +1.00 +1.00 +1.00 +0.00 ]
D3 [+1.00 +1.00 +0.00 +1.00 +1.00 +1.00 +0.00 +0.00 +0.00 ]
D4 [+0.00 +0.00 +1.00 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 ]

Apply tf-idf transform:

D1 [+0.00 +0.00 +0.23 +0.00 +0.00 +0.10 +0.00 +0.00 +0.46 ]
D2 [+0.00 +0.17 +0.00 +0.00 +0.00 +0.07 +0.35 +0.35 +0.00 ]
D3 [+0.28 +0.14 +0.00 +0.28 +0.28 +0.06 +0.00 +0.00 +0.00 ]
D4 [+0.00 +0.00 +0.69 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 ]
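
As a sanity check of the formula above, take ‘hat’ in D1: D1 contributes three counted terms (hat, cat, disabl), so tf = 1/3, and ‘hat’ appears in two of the four documents, so idf = log(4/2) ≈ 0.69, giving 1/3 × 0.69 ≈ 0.23, the value shown above.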

Perform SVD, then reduce Sigma's dimensions (removing the 2 smallest coefficients):

D1 [+0.01 +0.01 +0.34 +0.01 +0.01 +0.03 +0.02 +0.02 +0.11 ]
D2 [-0.00 +0.17 -0.01 -0.00 -0.00 +0.08 +0.35 +0.35 +0.02 ]
D3 [+0.28 +0.14 -0.01 +0.28 +0.28 +0.06 -0.00 -0.00 +0.02 ]
D4 [-0.01 -0.01 +0.63 -0.01 -0.01 +0.04 -0.01 -0.01 +0.19 ]

Note that the word ‘disabl’, despite not appearing in D4, now has a weighting in that document.

Dependencies

http://www.scipy.org/

Problems

LSA assumes term frequencies follow a Normal distribution, whereas in practice a Poisson distribution has been observed.

Source Code

Available at github:

git clone git://github.com/josephwilk/semanticpy.git
