This project looked at dynamically generating suggestion tags for content. To simplify the task, some constraints were introduced:
The content which will be tagged is news articles with HTML markup.
Only English content.
I used the following HTML page to experiment with suggestion tags: http://news.bbc.co.uk/1/hi/entertainment/6624223.stm
To help evaluate the tagging methods I asked a sample of people to suggest what they thought the best tags would be. They came up with:
paris, hilton, paris hilton, jail, jail sentence, drink-driving
What's a tag?
Definition: “A tag is a (relevant) keyword or term associated with or assigned to a piece of information (e.g. a picture, a geographic map, a blog entry, or video clip), thus describing the item and enabling keyword-based classification and search of information.”
Why tag?
Tag relevance draws on the wisdom of crowds
Tags create social awareness
Show Relevance and Popularity
Community contribution
Add searchable text for resources that cannot otherwise be accurately searched (videos, music, images, etc.)
Examples of sites using tagging
Tagging vocabulary
The types of tagging vocabulary:
Factual – Activism, Ajax, algorithms, animation, anime, art, audio, books, code, comics
Subjective – Great, good, great book, ok, good game, good read, great series, pretty good, wonderful.
Personal – Seen at cinema, rented dvd, dave owns it
Proposed Solutions
Determining Semantic Information from the Grammatical Structure
I decided to use a part-of-speech (PoS) tagger to mark up the grammar within the content. I discovered PoS tagging through Satoshi Sekine's Google tech talk on On-Demand Information Extraction, in which he suggests a new paradigm of information extraction that operates 'on demand' in response to a user's query.
The PoS tagger selected was written in Perl, and while we are only looking at English it does support multiple languages; a minimal invocation sketch follows the links below.
TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
The PoS markup is based on:
Penn Treebank Tags: http://www.cis.upenn.edu/~treebank/home.html
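To make the tagging step concrete, here is a minimal sketch (not the original script) of driving TreeTagger from Python. It assumes TreeTagger is installed with its English wrapper script (tree-tagger-english) on the PATH, and that the article's HTML has already been stripped to plain text.

import subprocess

def pos_tag(text, tagger_cmd="tree-tagger-english"):
    """Tag plain text with TreeTagger; returns a list of (word, tag, lemma) tuples.

    Assumes TreeTagger is installed and its English wrapper script is on the
    PATH -- adjust tagger_cmd to point at the script otherwise.
    """
    result = subprocess.run(
        [tagger_cmd],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    tokens = []
    for line in result.stdout.splitlines():
        parts = line.split("\t")      # wrapper prints word<TAB>tag<TAB>lemma per token
        if len(parts) == 3:
            tokens.append(tuple(parts))
    return tokens

# Example (article text with HTML already removed):
# print(pos_tag("Paris Hilton was sentenced to 45 days in jail."))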
Semantics from Grammar Experiments
Experimentation rules:
Extract all Proper Nouns
Where Proper Nouns are adjacent to each other, combine them to produce one Proper Noun
Order by frequency
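A minimal sketch of the three rules above, assuming the PoS-tagged content is available as a list of (word, tag) pairs using Penn Treebank tags (proper nouns are NNP/NNPS):

from collections import Counter

def proper_noun_tags(tagged_tokens):
    """Apply the rules above to a list of (word, tag) pairs.

    Proper nouns carry the Penn Treebank tags NNP/NNPS; adjacent ones
    are joined with '-' before counting, as in the results below.
    """
    candidates = []
    run = []
    for word, tag in tagged_tokens:
        if tag in ("NNP", "NNPS"):
            run.append(word)              # extend the current run of proper nouns
        elif run:
            candidates.append("-".join(run))
            run = []
    if run:                               # flush a run that ends the text
        candidates.append("-".join(run))
    return Counter(candidates).most_common()

# e.g. proper_noun_tags([("Paris", "NNP"), ("Hilton", "NNP"), ("was", "VBD"),
#                        ("sentenced", "VBN")])
# -> [("Paris-Hilton", 1)]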
Results – (word, frequency):
[ ('Hilton', 11),
('Paris-Hilton', 5),
('January', 3),
('June', 2),
('February', 2),
('Los-Angeles', 2),
('Howard-Weitzman', 1),
('Sunset-Boulevard', 1),
('paparazzi', 1),
('Paris-Hilton-Hilton', 1),
('Department', 1),
('Kathy', 1),
('Friday', 1),
('Los-Angeles-County-Sheriff', 1),
('Superior-Court', 1),
('Hilton-Hotel', 1),
('BBC', 1),
('David-Willis', 1),
('Metropolitan-Courthouse', 1),
('September', 1),
('California-Highway-Patrol', 1),
('Judge-Michael-Sauer', 1)]
We can see that by assuming a relationship between adjacent proper nouns we have produced some good tags (Paris-Hilton, Los-Angeles-County-Sheriff), albeit long ones. Further investigation of joining nouns is needed to see whether the cases where incorrect joins are made (such as Paris-Hilton-Hilton) are outweighed by the higher-quality tags. We also note that while the highest-frequency results are good, 'January' and 'February' are weaker tags than some of the lower-frequency ones.
Named-entity Recognition
These taggers use the concept of Hidden Markov Models (HMM): http://en.wikipedia.org/wiki/Hidden_Markov_model
Reframe the problem as follows: we have a document with names correctly marked up; this document is passed through a noisy channel which strips the markup, and the task is to work out what the initial document was from the noisy data.
The HMM introduces a reduction in complexity as we only ever need to consider the previous state when evaluating whether a word is a name.
The Viterbi algorithm (http://en.wikipedia.org/wiki/Viterbi_algorithm) computes the most likely sequence of hidden states; the Wikipedia article has very good examples and Python code.
The emission and transition probabilities need to be learned from a relevant corpus.
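As a concrete illustration, here is a compact Viterbi sketch over a toy two-state name/other model. The states, transition and emission probabilities are made-up placeholders, not values learned from a corpus:

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for the observations."""
    # Each cell holds (probability, best path so far); unseen words get a tiny
    # emission probability so the product never collapses to zero.
    V = [{s: (start_p[s] * emit_p[s].get(observations[0], 1e-9), [s]) for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s].get(obs, 1e-9),
                 V[-1][prev][1] + [s])
                for prev in states
            )
            row[s] = (prob, path)
        V.append(row)
    return max(V[-1].values())[1]

# Toy model: is each word part of a NAME or ordinary text?
states  = ("NAME", "OTHER")
start_p = {"NAME": 0.2, "OTHER": 0.8}
trans_p = {"NAME": {"NAME": 0.5, "OTHER": 0.5},
           "OTHER": {"NAME": 0.1, "OTHER": 0.9}}
emit_p  = {"NAME": {"paris": 0.4, "hilton": 0.4},
           "OTHER": {"jailed": 0.3, "was": 0.3, "in": 0.3}}

print(viterbi(["paris", "hilton", "was", "jailed"], states, start_p, trans_p, emit_p))
# -> ['NAME', 'NAME', 'OTHER', 'OTHER']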
Application used (Java)
Named Entity Recognition (NER) and Information Extraction (IE) http://nlp.stanford.edu/ner/index.shtml
Useful links on named entity taggers:
http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
Impact of automatic comma prediction on pos/name tagging of speech http://www.speech.sri.com/people/wwang/papers/slt06-comma-v11-newref.pdf
Nymble model http://portal.acm.org/citation.cfm?doid=974557.974586
Experimentation rules
Parse using the named entity tagger, identifying Person, Organisation and Location
Reduce any adjoining Person/Organization/Location entities into one
Order by frequency
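A sketch of the reduction and counting steps, assuming the NER output has already been converted into (word, entity_type) pairs with 'O' marking tokens outside any entity (the entity labels used here are the ones that appear in the results below):

from collections import Counter

KEEP = {"PERSON", "ORGANIZATION", "LOCATION"}

def entity_tags(labelled_tokens):
    """Join adjacent tokens sharing an entity type, then count the results."""
    entities = []
    run, run_type = [], None
    for word, etype in labelled_tokens + [("", "O")]:   # sentinel flushes the last run
        if etype == run_type and etype in KEEP:
            run.append(word)
        else:
            if run:
                entities.append(("-".join(run), run_type))
            run, run_type = ([word], etype) if etype in KEEP else ([], None)
    return Counter(entities).most_common()

# e.g. entity_tags([("Paris", "PERSON"), ("Hilton", "PERSON"), ("was", "O"),
#                   ("jailed", "O"), ("in", "O"),
#                   ("Los", "LOCATION"), ("Angeles", "LOCATION")])
# -> [(("Paris-Hilton", "PERSON"), 1), (("Los-Angeles", "LOCATION"), 1)]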
Results – ((entity, type), frequency):
[ (('Ms-Hilton', 'PERSON'), 10),
('Paris', 'LOCATION'), 3),
(('Los-Angeles', 'LOCATION'), 3),
(('Hilton', 'PERSON'), 3),
(('Paris-Hilton', 'LOCATION'), 2),
(('Hilton-Hilton', 'PERSON'), 1),
(('Department', 'ORGANIZATION'), 1),
(('Sunset-Boulevard', 'ORGANIZATION'), 1),
(('Judge-Michael-Sauer', 'PERSON'), 1),
(('California-Highway-Patrol', 'ORGANIZATION'), 1),
(('Kathy', 'PERSON'), 1),
(('BBC', 'ORGANIZATION'), 1),
(('Howard-Weitzman', 'PERSON'), 1),
(('Superior-Court', 'ORGANIZATION'), 1),
(('Metropolitan-Courthouse', 'ORGANIZATION'), 1),
(('Paris-Hilton', 'PERSON'), 1),
(('David-Willis', 'PERSON'), 1)]
Execution time: 5 seconds. Limiting the selection to PERSON/ORGANIZATION/LOCATION helped remove some of the weaker tags suggested in the grammar experiment (like February).
Named-entity Recognition – Web
A similar solution to the named entity tagger, but rather than using the slower Java tool I used a free web-based service. I found the web service produced similar results but considerably faster (around 1 second). Service: http://tagthe.net/api/?url=http://news.bbc.co.uk/1/hi/entertainment/6624223.stm
[viewcode] src=../projects/xml/paris-tag-results.xml geshi=xml[/viewcode]
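For reference, a minimal sketch of calling the service from Python. The endpoint and article URL are the ones used above; since the exact response schema isn't documented here, the sketch simply fetches and prints the raw XML rather than parsing it:

import urllib.parse
import urllib.request

ARTICLE = "http://news.bbc.co.uk/1/hi/entertainment/6624223.stm"
ENDPOINT = "http://tagthe.net/api/"

def fetch_tags(article_url):
    """Ask the tagthe.net service to tag a URL and return the raw XML response."""
    query = urllib.parse.urlencode({"url": article_url})
    with urllib.request.urlopen(ENDPOINT + "?" + query, timeout=10) as response:
        return response.read().decode("utf-8")

if __name__ == "__main__":
    print(fetch_tags(ARTICLE))   # raw XML, as embedded in the results file above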