Build Domain Knowledge by Extracting Keywords from DMOZ

By admin on Nov 15, 2007 in Java, Programming, semantic web

This is an experiment I am currently running: extracting keywords from DMOZ categories to see whether they are accurate enough to use for web page categorization.

In my previous post, I loaded the DMOZ categories into a database. The Perl script also generates a pipe-delimited file, as shown below for the Sports category.

|4|http://www.sleepmonsters.com|…
|4|http://www.oarevents.com/|…
|4|http://authentique.aventure.free.fr/….
|4|http://isportsdigest.tripod.com/a…
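For readers who want to consume this file from Java, here is a minimal sketch of reading one row. It assumes the number after the first pipe is a category id and the next field is the site URL; the remaining fields are elided in the sample above, so the sketch leaves them unparsed. The class name `DmozRow` and its methods are hypothetical, not part of the original script.

```java
import java.util.ArrayList;
import java.util.List;

/** One row of the pipe-delimited export (hypothetical helper, not from the original script). */
class DmozRow {
    final int categoryId; // assumption: the number after the first pipe is a category id
    final String url;     // assumption: the next field is the site URL

    DmozRow(int categoryId, String url) {
        this.categoryId = categoryId;
        this.url = url;
    }

    /** Parses one line of the export; returns null if the line is malformed. */
    static DmozRow parse(String line) {
        // Limit 4: leading empty field, id, URL, and everything else left unsplit.
        String[] parts = line.split("\\|", 4);
        if (parts.length < 3 || parts[2].trim().isEmpty()) {
            return null;
        }
        try {
            return new DmozRow(Integer.parseInt(parts[1].trim()), parts[2].trim());
        } catch (NumberFormatException e) {
            return null; // the id field was not a number
        }
    }

    /** Parses many lines, silently skipping the malformed ones. */
    static List<DmozRow> parseAll(List<String> lines) {
        List<DmozRow> rows = new ArrayList<DmozRow>();
        for (String line : lines) {
            DmozRow row = parse(line);
            if (row != null) {
                rows.add(row);
            }
        }
        return rows;
    }
}
```

Skipping malformed lines instead of throwing is a design choice for a quick experiment; a stricter loader might log or count the rejected rows.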

Using this information, I wrote a simple extractor, reusing the code to extract RDF from a web page. The result is a list of keywords sorted by frequency.

badminton:99
adventure racing:69
airsoft:51
Baseball:44
adventure:39
sport:38
mountain biking:35
club:34
sports:34
Archery:33
adventure race:32
team:29
racing:27
orienteering:26
league:25
softair:24
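The counting and sorting step behind a list like this can be sketched as follows. This is not the original extractor: it assumes the keywords have already been pulled from each page, that keywords are counted exactly as written (no case folding or stemming, which would match "sport" and "sports" appearing separately above), and that Java 8 or later is available. The class name `KeywordCounter` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Counts keywords across pages (hypothetical helper, not the original extractor). */
class KeywordCounter {

    /** Returns keyword/count pairs sorted by descending count. */
    static List<Map.Entry<String, Integer>> countByFrequency(List<List<String>> keywordsPerPage) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> pageKeywords : keywordsPerPage) {
            for (String keyword : pageKeywords) {
                String k = keyword.trim();
                if (!k.isEmpty()) {
                    counts.merge(k, 1, Integer::sum); // add 1, or start at 1
                }
            }
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        // Highest count first; ties broken alphabetically so the order is deterministic.
        sorted.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up sample input, one inner list per page.
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("badminton", "club"),
                Arrays.asList("badminton", "league"),
                Arrays.asList("adventure racing", "club", "badminton"));
        for (Map.Entry<String, Integer> e : countByFrequency(pages)) {
            System.out.println(e.getKey() + ":" + e.getValue());
        }
    }
}
```

The `main` method prints each pair in the same `keyword:count` form as the list above. Lower-casing the keywords before counting would merge entries such as "Baseball" and "baseball", at the cost of losing the original spelling.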

KEA, an algorithm for extracting keyphrases from text documents, will be another interesting thing to look at.

1 Comment(s)

Tom | May 2, 2008

I’m currently gearing up for a categorization project driven off dmoz data - would very much enjoy the chance to speak with you. Drop me a line in email if you have a few minutes.

As a side note, you have great content - awesome site

twit88.com
