
Solvent – Firefox Extension for Screen Scraping and XQuery Generator

One thing I like about Firefox is the abundance of extensions that can do almost anything you can think of.

Solvent is a Firefox extension that helps you write screen scrapers for Piggy Bank.

Piggy Bank is a Firefox extension that turns your browser into a mashup platform by letting you extract data from different web sites and mix it together. Piggy Bank also lets you store the extracted information locally so you can search it later, and share the collected information with others when needed.

In this case, Solvent helps generate semantic data in RDF format from web pages for use in Piggy Bank.

Consider the following query, an XPath expression that is also valid as XQuery:

/html/body/table/tbody/tr/td[1]/table/tbody/tr/td/table/
tbody/tr[2]/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td/
table[@class="body2"]/tbody/tr[1]/td[@class="title_blk"]/a

Without a tool like Solvent, it is nearly impossible to come up with this query by hand. In short, Solvent greatly simplifies web page scraping with XQuery. Using it together with WebHarvest, I can easily write my own custom scraper in Java.
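A query like the one above can also be evaluated from plain Java with the JDK's built-in javax.xml.xpath API. The following is only a minimal sketch and not how WebHarvest itself works. The class name XPathScrapeSketch, the method extractLinkText and the sample markup are all made up for illustration. The sample is a small, well-formed stand-in for a real page (real HTML usually has to be cleaned into well-formed XML first), and the query is shortened to the tail of the expression above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathScrapeSketch {

    // Hypothetical, well-formed stand-in for a scraped page (not the real site).
    public static final String SAMPLE_PAGE =
        "<html><body>"
        + "<table class=\"body2\"><tbody>"
        + "<tr><td class=\"title_blk\"><a href=\"/a\">First headline</a></td></tr>"
        + "<tr><td class=\"title_blk\"><a href=\"/b\">Second headline</a></td></tr>"
        + "</tbody></table>"
        + "</body></html>";

    // Evaluates an XPath expression against well-formed XML text and
    // returns the trimmed text content of every matching node.
    public static List<String> extractLinkText(String xml, String expression)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            expression,
            new InputSource(new StringReader(xml)),
            XPathConstants.NODESET);

        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Shortened tail of the generated query; tr[1] selects only the first row.
        String query =
            "//table[@class='body2']/tbody/tr[1]/td[@class='title_blk']/a";
        System.out.println(extractLinkText(SAMPLE_PAGE, query)); // prints [First headline]
    }
}
```

Dropping the [1] after tr would match the link in every row instead of only the first one, which is the usual change when a scraper needs the whole list of titles.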




2 Comment(s)

  1. Ruben Zevallos Jr. | Mar 22, 2008

    I’ve tried a lot of screen scraping tools, but none are good enough for me to use at my web site, which reads news from lots of web sites… so I still have to program for each source…

    If you know of something like a web page comparison tool, one that deletes everything that is identical and lets us work with the rest…

    I’m trying to do it by myself, but… it is not working well either…

    Thanks for your info…

  2. Fuller | Dec 1, 2009

    I took the same approach when writing my web data extractor, named MetaSeeker. The tool has now been adopted by a lot of organizations and individuals. Unfortunately, I have run into a technical problem. I am afraid the Firefox platform cannot open too many browser windows at the same time and cannot download too many Web pages today. I occasionally get a white screen, and I don’t know why yet. My tool is free: http://www.gooseeker.com/en/node/download/front

2 Trackback(s)

  1. From Java - Writing a Web Page Scraper or Web Data Extraction Tool | twit88.com | Jan 6, 2008
  2. From PHP: Write a Web Page Scraper | twit88.com | Feb 4, 2008
