Tag Archives: Datamining

Latent Semantic Analysis in Python

19 Dec

Latent Semantic Analysis (LSA) is a mathematical method that tries to bring out latent relationships within a collection of documents. Rather than looking at each document in isolation, it looks at all the documents as a whole, and the terms within them, to identify relationships.

An example of LSA:
Using a search engine, search for “sand”.

Documents are returned which do not contain the search term “sand” but contain terms like “beach”.

LSA has identified a latent relationship: “sand” is semantically close to “beach”.
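
To make the “sand”/“beach” effect concrete, here is a minimal sketch of the core idea. It is not the implementation from the full post. It assumes Python 3 with NumPy, a made-up four-document corpus, and a hypothetical helper called reduce_rank. It builds a term-document count matrix, keeps only the two largest singular values, and compares the row for “sand” before and after.

```python
import numpy as np


def build_term_document_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i, j] counts term i in document j."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    index = {term: i for i, term in enumerate(vocabulary)}
    matrix = np.zeros((len(vocabulary), len(documents)))
    for col, doc in enumerate(documents):
        for word in doc.split():
            matrix[index[word], col] += 1
    return vocabulary, matrix


def reduce_rank(matrix, k):
    """Rebuild the matrix from its k largest singular values (truncated SVD)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Scale the first k columns of u by their singular values, then project back.
    return (u[:, :k] * s[:k]) @ vt[:k, :]


# Illustrative corpus (an assumption, not data from the post).
documents = [
    "sand beach sun",
    "beach sun sea",
    "stock market shares",
    "market shares price",
]

vocabulary, counts = build_term_document_matrix(documents)
smoothed = reduce_rank(counts, k=2)

sand = vocabulary.index("sand")
print("raw counts for 'sand':    ", counts[sand])
print("rank-2 weights for 'sand':", np.round(smoothed[sand], 2))
```

In the raw counts, “sand” has a zero for the second document, which only mentions “beach”, “sun” and “sea”. After the rank reduction that entry becomes positive, while the two finance documents stay at roughly zero. That is the latent relationship showing up in the numbers.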

There are some very good papers which describe LSA in detail:

This is an implementation of LSA in Python (2.4+). Thanks to SciPy, it's rather simple!

(more…)

Building a Vector Space Search Engine in Python

27 Nov

A vector space search involves converting documents into vectors. Each dimension of a vector represents a term. If a document contains that term, the value in that dimension is greater than zero.
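
As a rough illustration of that idea (a sketch only, not the implementation from the full post), the snippet below turns documents into term-count vectors and ranks them against a query by cosine similarity. It assumes Python 3 and whitespace tokenisation. The function names and the sample documents are made up for the example, and cosine similarity is one common choice of ranking measure rather than something the excerpt specifies.

```python
import math


def build_vocabulary(documents):
    """One dimension per distinct term across all documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def to_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]


def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, documents):
    """Return (document index, score) pairs with score > 0, best match first."""
    vocabulary = build_vocabulary(documents)
    query_vector = to_vector(query, vocabulary)
    scored = [
        (cosine(query_vector, to_vector(doc, vocabulary)), i)
        for i, doc in enumerate(documents)
    ]
    return [(i, score) for score, i in sorted(scored, reverse=True) if score > 0]


# Illustrative documents (an assumption, not data from the post).
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply",
]

for index, score in search("cat", documents):
    print(index, round(score, 3), documents[index])
```

Only documents containing “cat” have a non-zero value in that dimension, so only they are returned. The shorter of the two ranks first because the single match makes up a larger share of its vector.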

Here is an implementation of vector space searching in Python (2.4+). (more…)

Automatic Tag Generation

22 Oct

This project looked at dynamically generating suggested tags for content. To simplify the task, some constraints were introduced:

  • The content which will be tagged is news articles with HTML markup.
  • Only English content.

I used the following HTML page to experiment with tag suggestions: http://news.bbc.co.uk/1/hi/entertainment/6624223.stm

To help evaluate the tagging methods I asked a sample of people to suggest what they thought the best tags would be. They came up with:

paris, hilton, paris hilton, jail, jail sentence, drink-driving

(more…)