Web Scraping using Web-Harvest

Web-Harvest is the tool that I used to extract information from structured web pages. I needed to correlate information from several web sites, and I needed a flexible, configurable tool that I could easily integrate into my Java application.

As quoted from the website, Web-Harvest is an open source Web data extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web-Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content. On the other hand, it could be easily supplemented by custom Java libraries in order to augment its extraction capabilities.

The good thing is that Web-Harvest now comes with a graphical user environment, which eases development and testing of XML configurations.

One of the simple example configurations provided is shown below:

<xpath expression="//a[@shape='rect']/@href">
    <html-to-xml>
        <http url="http://www.somesite.com/"/>
    </html-to-xml>
</xpath>

When Web-Harvest executes this part of the configuration, the following steps occur:

  1. The http processor downloads content from the specified URL.
  2. The html-to-xml processor cleans up that HTML, producing XHTML content.
  3. The xpath processor searches for specific links in the XHTML from the previous step, giving a sequence of URLs as the result.
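
To make the third step more concrete, here is a minimal sketch in plain Java that applies the same XPath expression to a small, hand-written XHTML string. It uses only the JDK's built-in XML and XPath classes, not Web-Harvest itself, so it only illustrates what the xpath step computes. The class name `LinkExtractionSketch`, the method `extractHrefs`, and the sample markup are made up for this example, and the input is assumed to be already well-formed (which is what the html-to-xml step is there to guarantee).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: plain JDK code, not part of the Web-Harvest API.
public class LinkExtractionSketch {

    // Same expression as in the configuration above.
    static final String LINK_XPATH = "//a[@shape='rect']/@href";

    /** Returns the href values selected by LINK_XPATH, in document order. */
    static List<String> extractHrefs(String xhtml) {
        try {
            // Parse the (already cleaned-up) XHTML into a DOM tree.
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // Evaluate the XPath expression; the result is a set of attribute nodes.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                    (NodeList) xpath.evaluate(LINK_XPATH, doc, XPathConstants.NODESET);

            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                hrefs.add(nodes.item(i).getNodeValue());
            }
            return hrefs;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the input", e);
        }
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch; only two links carry shape="rect".
        String xhtml = "<html><body>"
                + "<a shape=\"rect\" href=\"/news\">News</a>"
                + "<a href=\"/ignored\">No shape attribute</a>"
                + "<a shape=\"rect\" href=\"/about\">About</a>"
                + "</body></html>";
        System.out.println(extractHrefs(xhtml)); // prints [/news, /about]
    }
}
```

In a real run, Web-Harvest does the download and the HTML clean-up before this point; the sketch skips both and starts from markup that already parses as XML.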

Other good examples are provided for extracting information from the New York Times, Yahoo Mail, Flickr, etc.

3 Comment(s)

  1. Thinzar | Jul 18, 2011

    I have a problem with an offline page: how should I write the config XML file to extract data from a web page that changes dynamically based on an input box? Please kindly answer.

  2. augenlasern | Feb 7, 2012

    I often come across blogs, and I really admire your content. The article really piqued my interest. I will bookmark your web site and keep checking for new information.

  3. onsugi | Feb 7, 2012

    This website is really a walk-through of all the information you wanted about this and didn't know whom to ask. Take a look right here, and you'll undoubtedly discover it.

4 Trackback(s)

  1. From Solvent - Firefox Extension for Screen Scraping and XQuery Generator | twit88.com | Nov 19, 2007
  2. From Java - Writing a Web Page Scraper or Web Data Extraction Tool | twit88.com | Jan 6, 2008
  3. From Website Scraping for Dummies | The BookmarkMoney Blog | Apr 19, 2008
  4. From nils-kaiser.de » Time to crawl back! Download Google Groups using a crawler | Sep 8, 2008
