
Generate RDF Meta Data from Web Page

Download Source Code

This is the code I used to extract metadata from web pages (keywords, descriptions, sentences, hyperlinks, images, etc.) into RDF files.

I used the WebCAT libraries, but I fixed some bugs in the code, as it sometimes failed to detect the language correctly or threw exceptions.
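
To give a rough idea of what this kind of extraction involves, here is a minimal sketch in Python using only the standard library. It is not the WebCAT code and does not use its API; the class name `PageMetadata`, the function `extract_metadata`, and the returned dictionary keys are all made up for illustration. It only collects the title, meta keywords, meta description, hyperlinks, and image sources, and does not attempt language detection or sentence splitting.

```python
from html.parser import HTMLParser


class PageMetadata(HTMLParser):
    """Collects a few metadata fields while the HTML is parsed (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self.description = ""
        self.links = []
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (a.get("name") or "").lower()
            content = a.get("content") or ""
            if name == "keywords":
                self.keywords += [k.strip() for k in content.split(",") if k.strip()]
            elif name == "description":
                self.description = content.strip()
        elif tag == "a" and a.get("href"):
            self.links.append(a["href"])
        elif tag == "img" and a.get("src"):
            self.images.append(a["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is kept; body text is ignored in this sketch.
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Parse an HTML string and return the collected fields as a dict."""
    parser = PageMetadata()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "keywords": parser.keywords,
        "description": parser.description,
        "links": parser.links,
        "images": parser.images,
    }
```

Calling `extract_metadata(page_source)` on a decoded page returns a plain dictionary, which could then be written out in whatever RDF layout is needed.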

For example, for this blog the RDF output is:


<?xml version='1.0' ?>
<r:RDF xmlns:r='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
           xmlns:d='http://purl.org/dc/elements/1.1/'
           xmlns:s='http://www.w3.org/2000/01/rdf-schema#'
           xmlns:h='http://www.w3.org/1999/xx/http#'
           xmlns:t='http://purl.org/dc/terms/'>
<r:Description r:about="">
<d:Title>twit88.com</d:Title>
<t:abstract>
<r:Alt>
<r:li r:ID='DocumentAbstract'></r:li>
<r:li r:ID='DocumentKeywords'>AI,Java,LingPipe, Java, 
Language Analysis,open source,semantic web,DMOZ, ODP, 
open source,Java, 
Machine Learning, AI, Open Source,free software,
windows,Windows, freeware,Programming,
Mobile programming,Java, open source,
Maven proxy,network,open source,
network monitoring,C/C++,C++,
ACE, Object Oriented,.NET,Java,
open source, programming, JSON,java,
programming,Java, open source,Windows,
telco,JBoss, open source, 
SLEE, SIP,Java, open source, 
semantic web, programming,personal,
development,Java, open source, 
Apache JCI,Java, open source, knowledge system,
Java, open source,programming, Twitter</r:li>
</r:Alt>
</t:abstract>
<d:Language>chinese-big5</d:Language>
<d:Type>text/html</d:Type>
<d:Relation>
<r:Alt>
</r:Alt>
</d:Relation>
</r:Description>
</r:RDF>

Of course, I could generate more metadata if my requirements call for it. For the time being, all I need are the keywords, language, and sentences.
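
If only the keywords and language are needed, they can be read back out of an RDF file shaped like the one above. The following is a sketch under assumptions, not part of the original code: the function name `read_keywords_and_language` and the trimmed `SAMPLE_RDF` string are hypothetical, and it assumes the keywords always sit in an `r:li` element whose `r:ID` is `DocumentKeywords`, as in the sample. Since the keyword list repeats entries such as "Java" and "open source", the sketch also removes duplicates case-insensitively.

```python
import xml.etree.ElementTree as ET

# Namespace URIs taken from the sample output above.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# A shortened, hypothetical document in the same layout as the sample.
SAMPLE_RDF = """<r:RDF xmlns:r='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
       xmlns:d='http://purl.org/dc/elements/1.1/'
       xmlns:t='http://purl.org/dc/terms/'>
<r:Description r:about="">
<d:Title>twit88.com</d:Title>
<t:abstract>
<r:Alt>
<r:li r:ID='DocumentAbstract'></r:li>
<r:li r:ID='DocumentKeywords'>AI,Java,LingPipe, Java,
open source,Open Source</r:li>
</r:Alt>
</t:abstract>
<d:Language>chinese-big5</d:Language>
</r:Description>
</r:RDF>"""


def read_keywords_and_language(rdf_xml):
    """Return (deduplicated keywords, language or None) from an RDF string."""
    root = ET.fromstring(rdf_xml)
    keywords, seen = [], set()
    for li in root.iter(f"{{{RDF_NS}}}li"):
        if li.get(f"{{{RDF_NS}}}ID") != "DocumentKeywords":
            continue
        for raw in (li.text or "").split(","):
            kw = " ".join(raw.split())  # collapse line breaks and extra spaces
            if kw and kw.lower() not in seen:
                seen.add(kw.lower())
                keywords.append(kw)
    lang = root.find(f".//{{{DC_NS}}}Language")
    language = lang.text.strip() if lang is not None and lang.text else None
    return keywords, language
```

The first spelling of each keyword is the one that is kept, so "open source" wins over a later "Open Source".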

P.S.
cpDetector is good at detecting the character encoding of HTML files, based on what I read at http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html
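
For illustration, the simplest form of character encoding detection can be sketched as follows. This is not how cpDetector works and it does not use its API; it is a Python standard-library stand-in that checks for a byte order mark, then for a charset declared in a `<meta>` tag, and finally tests whether the bytes are valid UTF-8. The function name `detect_encoding` and the 4096-byte scan limit are assumptions made for this example.

```python
import codecs
import re

# UTF-32 BOMs are listed before UTF-16 because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_\-:.]+)""",
    re.IGNORECASE,
)


def detect_encoding(raw):
    """Guess the encoding of raw HTML bytes; return a codec name or None."""
    # 1. A byte order mark is the strongest signal.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # 2. Otherwise trust a charset declared near the top of the page,
    #    provided Python knows a codec by that name.
    match = _META_CHARSET.search(raw[:4096])
    if match:
        try:
            return codecs.lookup(match.group(1).decode("ascii")).name
        except LookupError:
            pass
    # 3. Fall back to checking whether the bytes decode as UTF-8.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None
```

A declared charset can be wrong, and pages without a declaration need statistical guessing, which is where a dedicated detector is useful.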



