Latent Semantic Analysis in Python
December 19th, 2007
Latent Semantic Analysis (LSA) is a mathematical method that tries to bring out latent relationships within a collection of documents. Rather than looking at each document in isolation, it looks at all the documents as a whole, and at the terms within them, to identify relationships.
An example of LSA:
Using a search engine, search for “sand”.
Documents are returned which do not contain the search term “sand” but do contain terms like “beach”.
LSA has identified a latent relationship: “sand” is semantically close to “beach”.
There are some very good papers which describe LSA in detail:
- An introduction to LSA: http://lsa.colorado.edu/papers/dp1.LSAintro.pdf
- Creating your own LSA space: http://www.andrew.cmu.edu/user/jquesada/pdf/bookSpacesRev1.pdf
- Latent Semantic Analysis: http://en.wikipedia.org/wiki/Latent_semantic_indexing
This is an implementation of LSA in Python (2.4+). Thanks to scipy it's rather simple!
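To make the “sand”/“beach” example concrete, here is a minimal sketch of the idea, not the implementation from the post itself. It assumes Python 3 and uses NumPy's `numpy.linalg.svd` in place of scipy; the four-document corpus, the choice of k = 2 and the helper names (`to_vector`, `cosine`) are made up for illustration. Projecting queries and documents onto the top-k left singular vectors is one common way to compare them in the reduced space.

```python
import numpy as np

# Toy corpus (hypothetical): document 1 never mentions "sand".
docs = [
    "sand beach sea",
    "beach sea holiday",
    "python code scipy",
    "python code matrix",
]

vocab = sorted({word for doc in docs for word in doc.split()})
index = {term: i for i, term in enumerate(vocab)}

def to_vector(text):
    """Count how often each vocabulary term occurs in the text."""
    vec = np.zeros(len(vocab))
    for word in text.split():
        if word in index:
            vec[index[word]] += 1.0
    return vec

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Term-document matrix: one row per term, one column per document.
A = np.column_stack([to_vector(doc) for doc in docs])

# Singular value decomposition; keep only the k strongest dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k = U[:, :k]

query = to_vector("sand")

# Similarity in the original term space and in the reduced LSA space.
raw_sims = [cosine(query, A[:, j]) for j in range(len(docs))]
lsa_sims = [cosine(U_k.T @ query, U_k.T @ A[:, j]) for j in range(len(docs))]

for j, doc in enumerate(docs):
    print(f"{doc!r}: raw={raw_sims[j]:.2f} lsa={lsa_sims[j]:.2f}")
```

In the raw term space the query “sand” has zero similarity to “beach sea holiday”, because they share no terms. After the reduction the two land close together, since “sand” co-occurs with “beach” and “sea” elsewhere in the corpus, while the Python documents stay unrelated.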
Building a Vector Space Search Engine in Python
November 27th, 2007
A vector space search involves converting documents into vectors. Each dimension within the vectors represents a term. If a document contains that term then the value within the vector is greater than zero.
Here is an implementation of vector space searching using Python (2.4+).
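As a rough illustration of the description above, the following sketch (assuming Python 3 and only the standard library) turns each document into a vector of term counts and ranks documents by cosine similarity to the query. The sample documents and the names `tokenize`, `to_vector` and `search` are hypothetical and are not taken from the full post.

```python
import math
from collections import Counter

# Hypothetical sample documents.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply",
]

def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()

# One dimension per distinct term in the collection.
vocabulary = sorted({term for doc in documents for term in tokenize(doc)})

def to_vector(text):
    """Build a term-count vector; a value above zero means the term occurs."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

doc_vectors = [to_vector(doc) for doc in documents]

def search(query):
    """Return (document index, score) pairs, best match first."""
    q = to_vector(query)
    scored = [(cosine(q, vec), i) for i, vec in enumerate(doc_vectors)]
    return [(i, score) for score, i in sorted(scored, reverse=True) if score > 0]

for i, score in search("cat"):
    print(f"{score:.3f}  {documents[i]}")
```

Raw counts are used here for simplicity. A weighting scheme such as tf-idf would reduce the influence of very common terms like “the”, which in this sketch affect the document lengths and therefore the ranking.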
Automatic Tag Generation
October 22nd, 2007
This project looked at dynamically generating suggestion tags for content. To simplify the task, some constraints were introduced:
- The content which will be tagged is news articles with HTML markup.
- Only English content.
I used the following HTML page to experiment with tag suggestions: http://news.bbc.co.uk/1/hi/entertainment/6624223.stm
To help evaluate the tagging methods I asked a sample of people to suggest what they thought the best tags would be. They came up with:
paris, hilton, paris hilton, jail, jail sentence, drink-driving
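One simple way to use such a reference list is to measure how much a method's output overlaps with it. The sketch below (Python 3) is only an assumption about how that comparison could be done, not the evaluation used in the project. The `candidate` list is a made-up example of a tagging method's output, and `score_tags` is a hypothetical helper.

```python
# Tags suggested by the sample of people (from the list above).
human_tags = {"paris", "hilton", "paris hilton", "jail", "jail sentence", "drink-driving"}

def score_tags(suggested, reference):
    """Return (precision, recall) of suggested tags against reference tags."""
    suggested = {tag.lower().strip() for tag in suggested}
    hits = suggested & reference
    precision = len(hits) / len(suggested) if suggested else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical output of some tagging method, for illustration only.
candidate = ["Paris Hilton", "jail", "court", "celebrity"]

precision, recall = score_tags(candidate, human_tags)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision shows how many of the generated tags people agreed with, and recall shows how many of the human tags the method found. Exact string matching is strict, so near misses such as “drink driving” versus “drink-driving” would need extra normalisation.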



