Joseph Wilk

Programming bits and bobs

Building a Vector Space Search Engine in Python

A vector space search involves converting documents into vectors, where each dimension represents a term. If a document contains a term, the value of the corresponding dimension is greater than zero.
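For example (a made-up vocabulary, purely for illustration):

    vocabulary: ["cat", "dog", "fish"]   #dimensions 0, 1, 2
    "cat cat dog" -> [2, 1, 0]
    "dog fish"    -> [0, 1, 1]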

Here is an implementation of vector space searching using Python (2.4+).

1 Stemming & Stop words

Fetch all the terms within the documents and clean them, using a stemmer to reduce each word. A stemmer takes words and tries to reduce them to their base or root. Words which have a common stem often have similar meanings. Example: CONNECTED CONNECTING CONNECTION CONNECTIONS

all map to CONNECT

We also remove any stop words from the documents. [a, am, an, also, any, and] are all examples of stop words in English. Stop words have little value in search so we strip them. The stop list used was taken from: ftp://ftp.cs.cornell.edu/pub/smart/english.stop
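The stop list can be loaded once when the parser is constructed; a minimal sketch, assuming the list was saved locally as english.stop (the filename is an assumption):

    #Assumes the stop list above was saved locally as english.stop
    self.stopwords = open('english.stop', 'r').read().split()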

    #Within the Parser class (stemmer from the Porter stemmer module linked below)
    self.stemmer = PorterStemmer()

    def removeStopWords(self, list):
            """ Remove common words which have no search value """
            return [word for word in list if word not in self.stopwords]

    def tokenise(self, string):
            """ break string up into tokens and stem words """
            string = self.clean(string)  #clean() strips punctuation; defined elsewhere on the Parser class
            words = string.split(" ")

            return [self.stemmer.stem(word, 0, len(word)-1) for word in words]
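Assuming the methods above sit on the Parser class (with clean() stripping punctuation), tokenising the example words from earlier behaves like this:

    >>> parser = Parser()
    >>> parser.tokenise("connected connecting connection connections")
    ['connect', 'connect', 'connect', 'connect']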

2 Map Keywords to Vector Dimensions

Map the vector dimensions to keywords using a dictionary: keyword=>position

def getVectorKeywordIndex(self, documentList):
        """ create the keyword associated to the position of the elements within the document vectors """

        #Map the documents into a single word string
        vocabularyString = " ".join(documentList)

        vocabularyList = self.parser.tokenise(vocabularyString)
        #Remove common words which have no search value
        vocabularyList = self.parser.removeStopWords(vocabularyList)
        uniqueVocabularyList = util.removeDuplicates(vocabularyList)

        vectorIndex = {}
        offset = 0
        #Associate a position with each keyword; the position is the dimension of the vector used to represent this word
        for word in uniqueVocabularyList:
                vectorIndex[word] = offset
                offset += 1
        return vectorIndex  #(keyword:position)
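Purely as an illustration of the mapping (hypothetical documents; stop words removed, keywords stemmed, and the exact positions depend on the order in which duplicates are removed):

    >>> engine.getVectorKeywordIndex(["the cat sat on the mat", "the dog sat"])
    {'cat': 0, 'sat': 1, 'mat': 2, 'dog': 3}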

3 Map Document Strings to Vectors

We use the simple Term Count Model. A more accurate model would be to use tf-idf (term frequency-inverse document frequency).

def makeVector(self, wordString):
        """ @pre: unique(vectorIndex) """

        #Initialise vector with 0's
        vector = [0] * len(self.vectorKeywordIndex)
        wordList = self.parser.tokenise(wordString)
        wordList = self.parser.removeStopWords(wordList)
        for word in wordList:
                vector[self.vectorKeywordIndex[word]] += 1  #Use simple Term Count Model
        return vector
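Continuing the hypothetical keyword index above, a document vector is just the term counts laid out in keyword order:

    >>> engine.makeVector("the cat sat and sat")
    [1, 2, 0, 0]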

4 Find Related Documents

We now have the ability to find related documents. We can test whether two documents lie close together in the concept space by looking at the cosine of the angle between their document vectors. We use the cosine of the angle as a metric for comparison. If the cosine is 1 then the angle is 0° and the vectors are parallel (the document terms are related). If the cosine is 0 then the angle is 90° and the vectors are perpendicular (the document terms are not related).

We calculate the cosine between the document vectors in Python using NumPy.

from numpy import dot
from numpy.linalg import norm

def cosine(vector1, vector2):
        """ related documents j and q are in the concept space by comparing the vectors :
                cosine = ( V1 * V2 ) / ||V1|| x ||V2|| """
        return float(dot(vector1, vector2) / (norm(vector1) * norm(vector2)))
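For example, for the vectors [1, 1, 0] and [1, 0, 0] the cosine is 1/sqrt(2):

    >>> cosine([1, 1, 0], [1, 0, 0])
    0.7071067811865475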

5 Search the Vector Space!

In order to perform a search across keywords we need to map the keywords into the vector space. We create a temporary document which represents the search terms, then compare it against the document vectors using the same cosine measurement used for relatedness.
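The search method below leans on a buildQueryVector helper which simply treats the query as a small document; a minimal sketch (mirroring the semanticpy source):

    def buildQueryVector(self, termList):
            """ convert a list of search terms into a query vector """
            return self.makeVector(" ".join(termList))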

def search(self, searchList):
        """ search for documents that match based on a list of terms """
        queryVector = self.buildQueryVector(searchList)

        ratings = [util.cosine(queryVector, documentVector) for documentVector in self.documentVectors]
        ratings.sort(reverse=True)
        return ratings
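Note that sorting the ratings discards which document each score belongs to; a variant (not in the original) that keeps the pairing:

    def searchWithDocumentIds(self, searchList):
            """ like search(), but return (rating, documentId) pairs, best match first """
            queryVector = self.buildQueryVector(searchList)
            ratings = [(util.cosine(queryVector, documentVector), documentId)
                       for documentId, documentVector in enumerate(self.documentVectors)]
            ratings.sort(reverse=True)
            return ratings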

Further Extensions

  1. Use tf-idf rather than the Term Count Model for term weightings (a minimal sketch follows this list).

  2. Instead of linear processing of all document vectors when searching for related content, use Lanczos methods or a neural network-like approach.

  3. Move towards Latent Semantic Analysis, Probabilistic Latent Semantic Analysis or Latent Dirichlet Allocation.
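A minimal sketch of the tf-idf weighting from extension 1, re-weighting an existing term count vector (the function and variable names are illustrative, not part of the code above):

    from math import log

    def tfidf(termCountVector, documentVectors):
            """ weight = termCount * log(totalDocuments / documentsContainingTerm) """
            totalDocuments = float(len(documentVectors))
            weighted = []
            for position, count in enumerate(termCountVector):
                    documentsContainingTerm = len([doc for doc in documentVectors if doc[position] > 0])
                    if documentsContainingTerm > 0:
                            weighted.append(count * log(totalDocuments / documentsContainingTerm))
                    else:
                            weighted.append(0.0)
            return weighted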

Third Party tools

The stemmer used comes from: http://tartarus.org/~martin/PorterStemmer/python.txt

The library used for the cosine calculation is NumPy: http://www.scipy.org/

Source

https://github.com/josephwilk/semanticpy.git
