Solvent – Firefox Extension for Screen Scraping and XQuery Generator
By admin on Nov 19, 2007 in firefox plugin, open source
One thing I like about Firefox is the abundance of extensions out there which can do almost anything that you can think of.
Solvent is a Firefox extension that helps you write screen scrapers for Piggy Bank.
Piggy Bank is a Firefox extension that turns your browser into a mashup platform by letting you extract data from different web sites and mix it together. Piggy Bank also lets you store the extracted information locally so you can search it later, and exchange it with others when needed.
Solvent helps here by generating semantic data in RDF format from web pages, ready to be used in Piggy Bank.
Consider the following XQuery (strictly speaking, an XPath expression):
/html/body/table/tbody/tr/td[1]/table/tbody/tr/td/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td/table[@class="body2"]/tbody/tr[1]/td[@class="title_blk"]/a
Without a tool like Solvent, it is nearly impossible to come up with a query like this by hand. In short, Solvent greatly simplifies web page scraping using XQuery. Using it together with WebHarvest, I am able to write my own custom scraper in Java easily.
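To give a rough idea of what the Java side can look like, here is a minimal sketch that runs a shortened version of the expression above against a small sample page. It uses only the JDK's built-in javax.xml.xpath classes rather than WebHarvest itself, and the class name, the sample markup, and the shortened expression are all my own assumptions for illustration. A real page would usually have to be cleaned into well-formed XML before it can be parsed this way.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Solvent, Piggy Bank or WebHarvest.
final class XPathScrapeSketch {

    // Made-up, well-formed sample page that mimics the class names
    // used in the expression from the post.
    static final String SAMPLE_PAGE =
          "<html><body>"
        + "<table class=\"body2\"><tbody>"
        + "<tr><td class=\"title_blk\"><a href=\"/story/1\">First story</a></td></tr>"
        + "<tr><td class=\"other\"><a href=\"/ad\">Advert</a></td></tr>"
        + "<tr><td class=\"title_blk\"><a href=\"/story/2\">Second story</a></td></tr>"
        + "</tbody></table>"
        + "</body></html>";

    // Shortened, assumed variant of the long path: it keeps only the two
    // class predicates and lets "//" skip the intermediate table nesting.
    static final String TITLE_LINKS =
        "//table[@class='body2']//td[@class='title_blk']/a";

    // Parses the page as XML and returns the text of every matching link.
    static List<String> extractTitles(String xhtml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList links =
                (NodeList) xpath.evaluate(TITLE_LINKS, doc, XPathConstants.NODESET);

            List<String> titles = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++) {
                titles.add(links.item(i).getTextContent().trim());
            }
            return titles;
        } catch (Exception e) {
            // Wrap parser and XPath failures so callers need no checked exceptions.
            throw new IllegalStateException("Could not scrape page", e);
        }
    }

    public static void main(String[] args) {
        for (String title : extractTitles(SAMPLE_PAGE)) {
            System.out.println(title);
        }
    }
}
```

The long absolute path that a tool generates is precise but brittle, since any change in the table nesting breaks it. Anchoring on the class attributes, as in this sketch, is one possible trade-off, at the risk of matching more than intended.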
Ruben Zevallos Jr. | Mar 22, 2008
I’ve tried a lot of screen scraping tools, but none are good enough for me to use on my web site, which reads news from lots of web sites… so I still have to program for each source…
If you know of something like a web page comparer, one that deletes everything that is identical and lets us work with the rest…
I’m trying to do it myself, but… it is not working well either…
Thanks for your info…
Fuller | Dec 1, 2009
I took the same approach when writing a web data extractor named MetaSeeker. The tool has now been adopted by many organizations and individuals. Unfortunately, I have run into a technical problem: it seems the Firefox platform cannot open too many browser windows at the same time or download too many web pages in a single day. I occasionally get a white screen, and so far I don’t know why. My tool is free: http://www.gooseeker.com/en/node/download/front