
Build Domain Knowledge by Extracting Keywords from DMOZ


This is an experiment I am currently running: extracting keywords from DMOZ categories to see whether they are accurate enough to use for web page categorization.

In my previous post, I loaded the DMOZ categories into a database. The Perl script also generates a pipe-delimited file, as shown below for the Sports category.

|4|http://www.sleepmonsters.com|…
|4|http://www.oarevents.com/|…
|4|http://authentique.aventure.free.fr/….
|4|http://isportsdigest.tripod.com/a…
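Reading such a file back could look roughly like the sketch below. It is only an illustration: it assumes each line starts with an empty field, followed by a numeric category id (4 in the sample) and then the URL, and it ignores whatever follows the URL because those fields are truncated in the listing above. The names `DmozEntry` and `parse_line` are made up for this example and are not part of the original Perl script.

```python
from typing import NamedTuple, Optional


class DmozEntry(NamedTuple):
    """One record from the pipe-delimited file (hypothetical structure)."""
    category_id: int
    url: str


def parse_line(line: str) -> Optional[DmozEntry]:
    """Parse one pipe-delimited line.

    Assumed layout: empty leading field, category id, URL, then further
    fields that are not interpreted here. Returns None for lines that do
    not fit this assumption.
    """
    fields = line.rstrip("\n").split("|")
    if len(fields) < 3 or not fields[1].strip().isdigit():
        return None
    return DmozEntry(int(fields[1]), fields[2].strip())
```

Lines that do not match the assumed layout are skipped rather than raising an error, which seems a reasonable default for a data dump of this size.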

Using this information, I wrote a simple extractor built on the code for extracting RDF from a web page. The result is a list of keywords sorted by frequency.

badminton:99
adventure racing:69
airsoft:51
Baseball:44
adventure:39
sport:38
mountain biking:35
club:34
sports:34
Archery:33
adventure race:32
team:29
racing:27
orienteering:26
league:25
softair:24
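The counting step behind a list like this can be sketched as follows. This is a minimal illustration, not the extractor itself: it assumes the keywords for each page have already been pulled out as a list of strings, and the function names `keyword_frequencies` and `format_report` are hypothetical. Keywords are counted exactly as written, which matches the list above, where "sport" and "sports" appear as separate entries.

```python
from collections import Counter
from typing import Iterable, List


def keyword_frequencies(pages: Iterable[Iterable[str]]) -> Counter:
    """Count how often each keyword occurs across all pages.

    `pages` is assumed to be one iterable of keyword strings per page.
    No case folding or stemming is applied.
    """
    counts: Counter = Counter()
    for keywords in pages:
        for keyword in keywords:
            keyword = keyword.strip()
            if keyword:  # skip empty or whitespace-only entries
                counts[keyword] += 1
    return counts


def format_report(counts: Counter) -> List[str]:
    """Return 'keyword:count' lines, most frequent keyword first."""
    return [f"{keyword}:{n}" for keyword, n in counts.most_common()]
```

For example, `format_report(keyword_frequencies([["badminton", "club"], ["badminton"]]))` gives `["badminton:2", "club:1"]`. Lowercasing or merging singular and plural forms before counting would collapse entries such as "sport" and "sports", at the cost of losing some distinctions.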

KEA, an algorithm for extracting keyphrases from text documents, will be another interesting thing to look at.




1 Comment(s)

  1. Tom | May 2, 2008

    I’m currently gearing up for a categorization project driven off dmoz data - would very much enjoy the chance to speak with you. Drop me a line in email if you have a few minutes.

    As a side note, you have great content - awesome site :)
