This project looked at dynamically generating suggestion tags for content. To simplify the task, some constraints were introduced:
The content which will be tagged is news articles with HTML markup.
Only English content.
I used the following HTML page to experiment with suggestion tags: http://news.bbc.co.uk/1/hi/entertainment/6624223.stm
To help evaluate the tagging methods I asked a sample of people to suggest what they thought the best tags would be. They came up with:
paris, hilton, paris hilton, jail, jail sentence, drink-driving
What's a tag?
Definition: “A tag is a (relevant) keyword or term associated with or assigned to a piece of information (e.g. a picture, a geographic map, a blog entry, or video clip), thus describing the item and enabling keyword-based classification and search of information.”
Why tag?
Tag relevance draws on the wisdom of crowds
Tags create social awareness
Show Relevance and Popularity
Community contribution
Add searchable text for resources that cannot otherwise be accurately searched (videos, music, images, etc.)
Examples of sites using tagging
Tagging vocabulary
The types of tagging vocabulary:
Factual – Activism, Ajax, algorithms, animation, anime, art, audio, books, code, comics
Subjective – Great, good, great book, ok, good game, good read, great series, pretty good, wonderful.
Personal – Seen at cinema, rented dvd, dave owns it
Proposed Solutions
Determining Semantic Information from the Grammatical Structure
I decided to use a part-of-speech (PoS) tagger to mark up the grammar within the content. I discovered PoS tagging through Satoshi Sekine's Google tech talk on On-Demand Information Extraction, in which he suggests a new paradigm of information extraction that operates 'on demand' in response to a user's query.
The PoS tagger selected was written in Perl, and while we are only looking at English it does support multiple languages; a minimal invocation sketch follows the links below.
TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
The PoS markup is based on:
Penn Treebank Tags: http://www.cis.upenn.edu/~treebank/home.html
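To make the tagging step concrete, here is a minimal sketch (not the original script) of driving TreeTagger from Python. It assumes TreeTagger is installed with its English wrapper script (tree-tagger-english) on the PATH, and that the article's HTML has already been stripped to plain text.

import subprocess

def pos_tag(text, tagger_cmd="tree-tagger-english"):
    """Tag plain text with TreeTagger; returns a list of (word, tag, lemma) tuples.

    Assumes TreeTagger is installed and its English wrapper script is on the
    PATH -- adjust tagger_cmd to point at the script otherwise.
    """
    result = subprocess.run(
        [tagger_cmd],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    tokens = []
    for line in result.stdout.splitlines():
        parts = line.split("\t")      # wrapper prints word<TAB>tag<TAB>lemma per token
        if len(parts) == 3:
            tokens.append(tuple(parts))
    return tokens

# Example (article text with HTML already removed):
# print(pos_tag("Paris Hilton was sentenced to 45 days in jail."))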
Semantics from Grammar Experiments
Experimentation rules:
Extract all Proper Nouns
Where Proper Nouns are adjacent to each other, combine them to produce one Proper Noun
Order by frequency
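A minimal sketch of the three rules above, assuming the PoS-tagged content is available as a list of (word, tag) pairs using Penn Treebank tags (proper nouns are NNP/NNPS):

from collections import Counter

def proper_noun_tags(tagged_tokens):
    """Apply the rules above to a list of (word, tag) pairs.

    Proper nouns carry the Penn Treebank tags NNP/NNPS; adjacent ones
    are joined with '-' before counting, as in the results below.
    """
    candidates = []
    run = []
    for word, tag in tagged_tokens:
        if tag in ("NNP", "NNPS"):
            run.append(word)              # extend the current run of proper nouns
        elif run:
            candidates.append("-".join(run))
            run = []
    if run:                               # flush a run that ends the text
        candidates.append("-".join(run))
    return Counter(candidates).most_common()

# e.g. proper_noun_tags([("Paris", "NNP"), ("Hilton", "NNP"), ("was", "VBD"),
#                        ("sentenced", "VBN")])
# -> [("Paris-Hilton", 1)]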
Results – (word, frequency):
[ ('Hilton', 11),
('Paris-Hilton', 5),
('January', 3),
('June', 2),
('February', 2),
('Los-Angeles', 2),
('Howard-Weitzman', 1),
('Sunset-Boulevard', 1),
('paparazzi', 1),
('Paris-Hilton-Hilton', 1),
('Department', 1),
('Kathy', 1),
('Friday', 1),
('Los-Angeles-County-Sheriff', 1),
('Superior-Court', 1),
('Hilton-Hotel', 1),
('BBC', 1),
('David-Willis', 1),
('Metropolitan-Courthouse', 1),
('September', 1),
('California-Highway-Patrol', 1),
('Judge-Michael-Sauer', 1)]
We can see that by assuming a relationship between adjacent proper nouns we have produced some good tags (Paris-Hilton, Los-Angeles-County-Sheriff), albeit long ones. Further investigation of joining nouns is needed to see whether the cases where incorrect joins are made (such as Paris-Hilton-Hilton) are outweighed by the higher-quality tags. We also note that while the highest-frequency results are good, 'January' and 'February' are weaker tags than some of the lower-frequency ones.
Named-entity Recognition
These taggers use the concept of Hidden Markov Models (HMM): http://en.wikipedia.org/wiki/Hidden_Markov_model
Reframe the problem as follows: we have a document with names correctly marked up; this document is passed through a noisy channel which strips the markup, and the task is to work out what the initial document was from the noisy data.
The HMM introduces a reduction in complexity as we only ever need to consider the previous state when evaluating whether a word is a name.
The Viterbi algorithm (http://en.wikipedia.org/wiki/Viterbi_algorithm) computes the most likely sequence of hidden states; the Wikipedia article has very good examples and Python code.
The emission and transition probabilities need to be learned from a relevant corpus.
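As a concrete illustration, here is a compact Viterbi sketch over a toy two-state name/other model. The states, transition and emission probabilities are made-up placeholders, not values learned from a corpus:

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for the observations."""
    # Each cell holds (probability, best path so far); unseen words get a tiny
    # emission probability so the product never collapses to zero.
    V = [{s: (start_p[s] * emit_p[s].get(observations[0], 1e-9), [s]) for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s].get(obs, 1e-9),
                 V[-1][prev][1] + [s])
                for prev in states
            )
            row[s] = (prob, path)
        V.append(row)
    return max(V[-1].values())[1]

# Toy model: is each word part of a NAME or ordinary text?
states  = ("NAME", "OTHER")
start_p = {"NAME": 0.2, "OTHER": 0.8}
trans_p = {"NAME": {"NAME": 0.5, "OTHER": 0.5},
           "OTHER": {"NAME": 0.1, "OTHER": 0.9}}
emit_p  = {"NAME": {"paris": 0.4, "hilton": 0.4},
           "OTHER": {"jailed": 0.3, "was": 0.3, "in": 0.3}}

print(viterbi(["paris", "hilton", "was", "jailed"], states, start_p, trans_p, emit_p))
# -> ['NAME', 'NAME', 'OTHER', 'OTHER']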
Application used (Java)
Named Entity Recognition (NER) and Information Extraction (IE) http://nlp.stanford.edu/ner/index.shtml
Useful links on named entity taggers:
http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
Impact of automatic comma prediction on pos/name tagging of speech http://www.speech.sri.com/people/wwang/papers/slt06-comma-v11-newref.pdf
Nymble model http://portal.acm.org/citation.cfm?doid=974557.974586
Experimentation rules
Parse using the named entity tagger, identifying Person, Organisation and Location
Reduce any adjoining Person/Organization/Location entities into one
Order by frequency
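A sketch of the reduction and counting steps, assuming the NER output has already been converted into (word, entity_type) pairs with 'O' marking tokens outside any entity (the entity labels used here are the ones that appear in the results below):

from collections import Counter

KEEP = {"PERSON", "ORGANIZATION", "LOCATION"}

def entity_tags(labelled_tokens):
    """Join adjacent tokens sharing an entity type, then count the results."""
    entities = []
    run, run_type = [], None
    for word, etype in labelled_tokens + [("", "O")]:   # sentinel flushes the last run
        if etype == run_type and etype in KEEP:
            run.append(word)
        else:
            if run:
                entities.append(("-".join(run), run_type))
            run, run_type = ([word], etype) if etype in KEEP else ([], None)
    return Counter(entities).most_common()

# e.g. entity_tags([("Paris", "PERSON"), ("Hilton", "PERSON"), ("was", "O"),
#                   ("jailed", "O"), ("in", "O"),
#                   ("Los", "LOCATION"), ("Angeles", "LOCATION")])
# -> [(("Paris-Hilton", "PERSON"), 1), (("Los-Angeles", "LOCATION"), 1)]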
Results – ((entity, type), frequency):
[ (('Ms-Hilton', 'PERSON'), 10),
('Paris', 'LOCATION'), 3),
(('Los-Angeles', 'LOCATION'), 3),
(('Hilton', 'PERSON'), 3),
(('Paris-Hilton', 'LOCATION'), 2),
(('Hilton-Hilton', 'PERSON'), 1),
(('Department', 'ORGANIZATION'), 1),
(('Sunset-Boulevard', 'ORGANIZATION'), 1),
(('Judge-Michael-Sauer', 'PERSON'), 1),
(('California-Highway-Patrol', 'ORGANIZATION'), 1),
(('Kathy', 'PERSON'), 1),
(('BBC', 'ORGANIZATION'), 1),
(('Howard-Weitzman', 'PERSON'), 1),
(('Superior-Court', 'ORGANIZATION'), 1),
(('Metropolitan-Courthouse', 'ORGANIZATION'), 1),
(('Paris-Hilton', 'PERSON'), 1),
(('David-Willis', 'PERSON'), 1)]
Execution time: 5 seconds. Limiting the selection to PERSON/ORGANIZATION/LOCATION helped remove some of the weaker tags suggested in the grammar experiment (like February).
Named-entity Recognition – Web
A similar solution to the named entity tagger, but rather than using the slower Java tool I used a free web-based service. I found the web service produced similar results but considerably faster (around 1 second). Service: http://tagthe.net/api/?url=http://news.bbc.co.uk/1/hi/entertainment/6624223.stm
[viewcode] src=../projects/xml/paris-tag-results.xml geshi=xml[/viewcode]
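For reference, a minimal sketch of calling the service from Python. The endpoint and article URL are the ones used above; since the exact response schema isn't documented here, the sketch simply fetches and prints the raw XML rather than parsing it:

import urllib.parse
import urllib.request

ARTICLE = "http://news.bbc.co.uk/1/hi/entertainment/6624223.stm"
ENDPOINT = "http://tagthe.net/api/"

def fetch_tags(article_url):
    """Ask the tagthe.net service to tag a URL and return the raw XML response."""
    query = urllib.parse.urlencode({"url": article_url})
    with urllib.request.urlopen(ENDPOINT + "?" + query, timeout=10) as response:
        return response.read().decode("utf-8")

if __name__ == "__main__":
    print(fetch_tags(ARTICLE))   # raw XML, as embedded in the results file above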