Java: Generate RDF from Web Pages

WebCAT is an extensible tool to extract meta-data and generate RDF descriptions from existing Web documents. Implemented in Java, it provides a set of APIs (Application Programming Interfaces) that allow one to analyse text documents from the Web without having to write complicated parsers.
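
To make "generating an RDF description" concrete, here is a minimal sketch of what such output could look like for one page. It does not use WebCAT's API, which is not shown in this post. The class name `RdfSketch`, its methods, the example URL, and the choice of the Dublin Core vocabulary with N-Triples output are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: this is NOT WebCAT's API.
class RdfSketch {
    // Dublin Core element namespace; using it here is an assumption.
    static final String DC = "http://purl.org/dc/elements/1.1/";

    /** Escapes backslashes, quotes and line breaks for an N-Triples literal. */
    static String escapeLiteral(String s) {
        return s.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    /** Emits one N-Triples statement per metadata entry, with the page URL as subject. */
    static String toNTriples(String pageUrl, Map<String, String> metadata) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            out.append('<').append(pageUrl).append("> ")
               .append('<').append(DC).append(e.getKey()).append("> ")
               .append('"').append(escapeLiteral(e.getValue())).append("\" .\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Metadata values a tool might have extracted from a page (made up here).
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("title", "Example page");
        meta.put("language", "en");
        System.out.print(toNTriples("http://example.org/page.html", meta));
    }
}
```

Each output line is a subject, predicate and object triple, so the result can be loaded into any RDF store that reads N-Triples. The sketch assumes the page URL is already a valid IRI and does not escape it.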

Among other things, WebCAT provides:
  – Language and encoding detection.
  – Hyperlink extraction.
  – Text tokenization (words, n-grams, sentences).
  – Document fingerprinting.
  – Format conversion.
  – Metadata extraction and normalization.
  – Named Entity Extraction.
  – Document classification.
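
As an illustration of one item in the list above, text tokenization into words and n-grams, the following sketch uses only the Java standard library. It is not WebCAT code: the class `NGramSketch` and its methods are hypothetical, and the splitting rule (break on anything that is not a letter or digit) is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: this is NOT WebCAT's API.
class NGramSketch {
    /** Splits text into lower-cased words on runs of non-letter, non-digit characters. */
    static List<String> words(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {   // split() can yield a leading empty string
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns every run of n consecutive tokens, joined by a single space. */
    static List<String> nGrams(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            grams.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> tokens = words("Generate RDF from Web pages.");
        System.out.println(tokens);
        System.out.println(nGrams(tokens, 2));
    }
}
```

A real tokenizer would also need to handle languages without whitespace word boundaries, which is presumably one reason the language detection step comes first in the list.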

The metadata elements it extracts are particularly suited to automated search, which makes WebCAT a good fit for other information retrieval and extraction projects.



1 Comment(s)

  1. Elliot | Sep 15, 2009

    AlchemyAPI is another cloud-based service that is capable of generating RDF from text and/or web pages.

    http://www.alchemyapi.com/api/entity/ldata.html

    Also provides Linked Data support: linkages to DBpedia, OpenCyc and other online datasets.
