Latent Semantic Analysis in Ruby

November 21st, 2008

I’ve had lots of requests for a Ruby version to follow up my Latent Semantic Analysis in Python article, so I’ve rewritten both the code and the article for Ruby. This time I wrote LSA from scratch, test-driven, so it has some subtle differences from the Python version.

What is LSA?

Latent Semantic Analysis (LSA) is a mathematical method that tries to bring out latent relationships within a collection of documents. Rather than looking at each document in isolation, it looks at all the documents as a whole, and at the terms within them, to identify relationships.

An example of LSA:
Using a search engine, search for “ruby”.

Documents are returned which do not contain the search term “ruby” but contain terms like “rails”.

LSA has identified a latent relationship: “ruby” is semantically close to “rails”.

How does it work?

Given a set of documents, each word in those documents represents a point in the semantic space. LSA uses a mathematical technique called singular value decomposition (SVD) to take the documents and words, represented as a matrix, and produce a reduced approximation of that matrix. Doing so reduces the overall noise in the semantic space and brings related words closer together. Hence, after applying LSA, some words share similar points in the semantic space: they are semantically similar.

These groups of semantically similar words form concepts and those concepts in turn relate to documents.

Term a <----------->
Term b <-----------> Concept d <---------> Document e
Term c <----------->
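
To make the reduction step a little more concrete, here is a minimal sketch using only Ruby’s standard `matrix` library. It is not the code from the full article. The term counts are made up for illustration, and `reduce_rank` and `cosine` are hypothetical helper names. Because the stdlib `Matrix` class has no SVD method, the sketch assumes the right singular vectors can be taken from the eigenvectors of the matrix multiplied by its own transpose, which gives the same rank-k approximation.

```ruby
require 'matrix'

# Rank-k approximation of a term-document matrix (rows: terms, columns: documents).
# Assumption of this sketch: the stdlib Matrix class has no SVD method, so the
# right singular vectors are taken as the eigenvectors of A^T * A.
def reduce_rank(a, k)
  a = a.map(&:to_f)
  eig = (a.transpose * a).eigen
  # Pair each eigenvalue with its eigenvector and keep the k largest.
  top = eig.eigenvalues.zip(eig.eigenvectors).sort_by { |value, _| -value }.first(k)
  v_k = Matrix.columns(top.map { |_, vector| vector.to_a })
  # A_k = A * V_k * V_k^T keeps only the k strongest "concepts".
  a * v_k * v_k.transpose
end

# Cosine similarity between two term rows (Vector objects).
def cosine(u, v)
  u.inner_product(v) / (u.norm * v.norm)
end

# Hypothetical counts for six tiny documents.
counts = Matrix[
  [1, 1, 0, 0, 0, 0], # ruby
  [1, 0, 1, 0, 0, 0], # rails
  [0, 0, 0, 1, 1, 0], # sand
  [0, 0, 0, 1, 0, 1]  # beach
]

reduced = reduce_rank(counts, 2)

puts cosine(counts.row(0), counts.row(1))   # ruby vs rails before: about 0.5
puts cosine(reduced.row(0), reduced.row(1)) # ruby vs rails after: close to 1.0
puts cosine(reduced.row(0), reduced.row(2)) # ruby vs sand after: close to 0.0
```

With only two concepts kept, the “ruby” and “rails” rows collapse onto the same direction, while “ruby” and “sand” stay unrelated. A real corpus would be far larger and would usually weight the counts before reducing them.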


With Search BOSS (Build your Own Search Service), Yahoo has lifted many of the restrictions of their previous search service, such as removing the cap on the number of searches and allowing results to be re-purposed. I’ve been doing some work on using the service in Ruby, and wrote a little RubyGem called Rboss which wraps the BOSS web service. It makes using BOSS from Ruby nice and easy.

require 'rubygems'
require 'boss'

api = Boss::Api.new('boss-api-key-got-from-yahoo')

# Find news articles that are not older than 7 days
results = api.search_news('monkeys', :age => '7d')
results.each do |news|
  puts news.title
  puts news.abstract
  puts news.date
  puts news.url
end

Install Gem from GitHub:

  1. Add GitHub to your gem sources:
     gem sources -a http://gems.github.com
  2. Install the gem:
     sudo gem install eshopworks-rboss
  3. If you don’t already have a BOSS API key, sign up for one: http://developer.yahoo.com/wsregap

Check out the Rboss documentation and example usage at: http://github.com/eshopworks/rboss-gem

Thanks to eShopworks for sponsoring this project.

Latent Semantic Analysis (LSA) is a mathematical method that tries to bring out latent relationships within a collection of documents. Rather than looking at each document in isolation, it looks at all the documents as a whole, and at the terms within them, to identify relationships.

An example of LSA:
Using a search engine, search for “sand”.

Documents are returned which do not contain the search term “sand” but contain terms like “beach”.

LSA has identified a latent relationship: “sand” is semantically close to “beach”.

There are some very good papers which describe LSA in detail:

This is an implementation of LSA in Python (2.4+). Thanks to SciPy it’s rather simple!


A vector space search involves converting documents into vectors. Each dimension within the vectors represents a term. If a document contains that term then the value within the vector is greater than zero.

Here is an implementation of vector space searching using Python (2.4+).
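
Separately from that Python implementation, the basic idea can be sketched in a few lines of Ruby. This is only an illustration under simple assumptions: raw term counts as vector values, cosine similarity for ranking, and made-up example documents. The helper names (`tokenize`, `to_vector`, `cosine`, `search`) are hypothetical.

```ruby
# Split text into lowercase word tokens.
def tokenize(text)
  text.downcase.scan(/[a-z]+/)
end

# One dimension per vocabulary term; the value is how often the term occurs.
def to_vector(text, vocabulary)
  words = tokenize(text)
  vocabulary.map { |term| words.count(term) }
end

# Cosine of the angle between two vectors; 0.0 if either vector is all zeros.
def cosine(u, v)
  dot = u.zip(v).sum { |a, b| a * b }
  norm = Math.sqrt(u.sum { |a| a * a }) * Math.sqrt(v.sum { |b| b * b })
  norm.zero? ? 0.0 : dot / norm
end

# Rank documents by similarity to the query, dropping those with no match.
def search(query, documents)
  vocabulary = documents.flat_map { |doc| tokenize(doc) }.uniq
  query_vector = to_vector(query, vocabulary)
  documents
    .map { |doc| [doc, cosine(query_vector, to_vector(doc, vocabulary))] }
    .select { |_, score| score > 0 }
    .sort_by { |_, score| -score }
end

documents = [
  'The cat sat on the mat',
  'Dogs chase the cat',
  'Ruby on Rails'
]

search('cat', documents).each do |doc, score|
  puts format('%.3f  %s', score, doc)
end
```

The shorter document ranks first for “cat” because the term makes up a larger share of its vector, and the document that does not contain the term at all is left out.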