Generate RDF Meta Data from Web Page
By admin on Nov 9, 2007 in Java, Programming
This is the code I used to extract metadata from web pages (keywords, descriptions, sentences, hyperlinks, images, etc.) into RDF files.
I used the WebCAT libraries. However, I fixed some bugs in the code, as it sometimes fails to detect the language correctly or throws exceptions.
For example, for this blog the RDF output is:
<?xml version='1.0' ?>
<r:RDF xmlns:r='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
       xmlns:d='http://purl.org/dc/elements/1.1/'
       xmlns:s='http://www.w3.org/2000/01/rdf-schema#'
       xmlns:h='http://www.w3.org/1999/xx/http#'
       xmlns:t='http://purl.org/dc/terms/'>
  <r:Description r:about="">
    <d:Title>twit88.com</d:Title>
    <t:abstract>
      <r:Alt>
        <r:li r:ID='DocumentAbstract'></r:li>
        <r:li r:ID='DocumentKeywords'>AI,Java,LingPipe, Java, Language Analysis,open source,semantic web,DMOZ, ODP, open source,Java, Machine Learning, AI, Open Source,free software, windows,Windows, freeware,Programming, Mobile programming,Java, open source, Maven proxy,network,open source, network monitoring,C/C++,C++, ACE, Object Oriented,.NET,Java, open source, programming, JSON,java, programming,Java, open source,Windows, telco,JBoss, open source, SLEE, SIP,Java, open source, semantic web, programming,personal, development,Java, open source, Apache JCI,Java, open source, knowledge system, Java, open source,programming, Twitter</r:li>
      </r:Alt>
    </t:abstract>
    <d:Language>chinese-big5</d:Language>
    <d:Type>text/html</d:Type>
    <d:Relation>
      <r:Alt>
      </r:Alt>
    </d:Relation>
  </r:Description>
</r:RDF>
Of course, I could generate more metadata if my requirements called for it. For the time being, all I need are the keywords, language, and sentences.
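To make the shape of that output concrete, here is a minimal sketch that writes a similar RDF/XML document using only the JDK's StAX writer. It is not the WebCAT code: the class name `RdfMetadataSketch` and the method `toRdf` are hypothetical, and it assumes the title, keywords, and language have already been extracted from the page by some other step.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, not part of WebCAT: serializes already-extracted
// metadata into RDF/XML shaped like the sample output in this post.
class RdfMetadataSketch {
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String DCT = "http://purl.org/dc/terms/";

    static String toRdf(String title, List<String> keywords, String language) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartDocument("1.0");

            // Root element with the namespace prefixes used in the sample.
            w.writeStartElement("r", "RDF", RDF);
            w.writeNamespace("r", RDF);
            w.writeNamespace("d", DC);
            w.writeNamespace("t", DCT);

            w.writeStartElement("r", "Description", RDF);
            w.writeAttribute("r", RDF, "about", "");

            w.writeStartElement("d", "Title", DC);
            w.writeCharacters(title);
            w.writeEndElement(); // d:Title

            // Keywords go into one comma-separated r:li entry.
            w.writeStartElement("t", "abstract", DCT);
            w.writeStartElement("r", "Alt", RDF);
            w.writeStartElement("r", "li", RDF);
            w.writeAttribute("r", RDF, "ID", "DocumentKeywords");
            w.writeCharacters(String.join(",", keywords));
            w.writeEndElement(); // r:li
            w.writeEndElement(); // r:Alt
            w.writeEndElement(); // t:abstract

            w.writeStartElement("d", "Language", DC);
            w.writeCharacters(language);
            w.writeEndElement(); // d:Language

            w.writeEndElement(); // r:Description
            w.writeEndElement(); // r:RDF
            w.writeEndDocument();
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write RDF", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toRdf("twit88.com", List.of("AI", "Java"), "chinese-big5"));
    }
}
```

Writing through `XMLStreamWriter` rather than concatenating strings means special characters in titles or keywords are escaped automatically.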
P/S:
cpDetector looks good for detecting the character encoding of HTML files, based on what I read at http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html
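For a rough idea of what encoding detection involves, here is a minimal sketch that does not use cpDetector at all. It swaps in a much simpler technique: try to decode the bytes strictly with each candidate charset and keep the first one that decodes without errors. The class name `CharsetGuessSketch`, the method `guess`, and the candidate list are all assumptions for illustration; a real detector uses far better heuristics.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not the cpDetector API: guesses a charset by
// attempting a strict decode with each candidate in order.
class CharsetGuessSketch {
    static Charset guess(byte[] data, List<Charset> candidates, Charset fallback) {
        for (Charset cs : candidates) {
            try {
                // REPORT makes the decoder throw on invalid byte sequences
                // instead of silently substituting replacement characters.
                cs.newDecoder()
                  .onMalformedInput(CodingErrorAction.REPORT)
                  .onUnmappableCharacter(CodingErrorAction.REPORT)
                  .decode(ByteBuffer.wrap(data));
                return cs;
            } catch (CharacterCodingException e) {
                // Bytes are not valid in this charset; try the next one.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        byte[] page = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guess(page, List.of(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1));
    }
}
```

Candidate order matters here: single-byte charsets such as ISO-8859-1 accept any byte sequence, so they only make sense as the fallback.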
Whatever-ishere | Nov 21, 2007 | Reply
thanks for the GREAT post! Very useful…