Build Domain Knowledge by Extracting Keywords from DMOZ
By admin on Nov 15, 2007 in Java, Programming, semantic web
This is an experiment I am currently running: extracting keywords from the categories in DMOZ to see whether they are accurate enough to use for web page categorization.
In my previous post, I loaded the DMOZ categories into a database. The Perl script also generates a pipe-delimited file, as shown below for the Sports category.
|4|http://www.sleepmonsters.com|…
|4|http://www.oarevents.com/|…
|4|http://authentique.aventure.free.fr/….
|4|http://isportsdigest.tripod.com/a…
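Reading such a file back in Java could look roughly like the sketch below. It assumes that the second field is a numeric category id and the third is the site URL; the post does not spell out the column meanings, and the fields after the URL are elided above, so they are ignored here. The names `DmozLineParser` and `Entry` are made up for illustration.

```java
import java.util.Optional;

public class DmozLineParser {

    /** One parsed line: the numeric field and the URL. */
    public static final class Entry {
        public final int categoryId; // assumed meaning of the second field
        public final String url;

        Entry(int categoryId, String url) {
            this.categoryId = categoryId;
            this.url = url;
        }
    }

    /** Parses one pipe-delimited line; returns empty if it does not fit the assumed layout. */
    public static Optional<Entry> parse(String line) {
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        String[] fields = line.split("\\|", -1);
        // The leading pipe yields an empty first field, so the id is at index 1
        // and the URL at index 2.
        if (fields.length < 3 || !fields[0].isEmpty()) {
            return Optional.empty();
        }
        try {
            int id = Integer.parseInt(fields[1].trim());
            String url = fields[2].trim();
            if (url.isEmpty()) {
                return Optional.empty();
            }
            return Optional.of(new Entry(id, url));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // The fields after the URL are elided in the post, so they are left empty here.
        parse("|4|http://www.oarevents.com/|")
                .ifPresent(e -> System.out.println(e.categoryId + " -> " + e.url));
    }
}
```

Returning an `Optional` lets the caller skip malformed lines without aborting the whole run, which seems useful for a file of this size.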
Using this information, I wrote a simple extractor based on the code for extracting RDF from a web page. The result is a list of keywords sorted by frequency.
badminton:99
adventure racing:69
airsoft:51
Baseball:44
adventure:39
sport:38
mountain biking:35
club:34
sports:34
Archery:33
adventure race:32
team:29
racing:27
orienteering:26
league:25
softair:24
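The counting and sorting step behind a list like this can be sketched as follows. This is not the extractor itself: it assumes the keywords have already been pulled from each page as plain strings, and the class name `KeywordFrequency` is hypothetical. Keywords are compared case-sensitively and without stemming, which matches the mixed-case entries and the separate "sport" and "sports" rows above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeywordFrequency {

    /** Counts each keyword and returns the entries sorted by descending count. */
    public static List<Map.Entry<String, Integer>> rank(List<String> keywords) {
        Map<String, Integer> counts = new HashMap<>();
        for (String raw : keywords) {
            String keyword = raw.trim();
            if (keyword.isEmpty()) {
                continue; // skip blank entries
            }
            counts.merge(keyword, 1, Integer::sum);
        }

        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        // Highest count first; ties are broken alphabetically so the order is stable.
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return ranked;
    }

    public static void main(String[] args) {
        // Made-up sample input, only to show the output format.
        List<String> sample = Arrays.asList(
                "badminton", "club", "badminton", "sports", "club", "badminton");
        for (Map.Entry<String, Integer> e : rank(sample)) {
            System.out.println(e.getKey() + ":" + e.getValue());
        }
    }
}
```

Lowercasing the keywords before counting would merge variants that differ only in case, at the cost of losing the original capitalization.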
KEA, an algorithm for extracting keyphrases from text documents, will be another interesting thing to look at.
Tom | May 2, 2008 | Reply
I’m currently gearing up for a categorization project driven off dmoz data - would very much enjoy the chance to speak with you. Drop me a line in email if you have a few minutes.
As a side note, you have great content - awesome site