Comments on: Java - Writing a Web Page Scraper or Web Data Extraction Tool

By: janapati siva prasad rao

janapati siva prasad rao — Tue, 16 Feb 2010 11:53:16 +0000

Hi,
I am trying to use the Web Harvest tool to extract data from websites. I have gone through the examples that come with Web Harvest, and extracting data from a single page by giving its URL works fine. Now I am trying to use the same approach to extract an entire site (just like the Canon example). Can you please give me any pointers on writing the configuration file for this?

One more thing: I do not understand how to write the configuration file. Please explain how to write one.

Regards,
Siva

By: Fuller

Fuller — Wed, 02 Dec 2009 01:10:29 +0000

I have authored a tool, MetaSeeker, that automatically generates web data extraction instructions after the operator semantically annotates a sample page. Free download: http://www.gooseeker.com/en/node/download/front

By: Heriberto Janosch González

Heriberto Janosch González — Tue, 16 Jun 2009 15:35:46 +0000

Hello,

Can you help me with Web Harvest?

I need to load this page:

http://contrataciondelestado.es/wps/wcm/connect/PLACE_es/Site/area/docAccCmpnt?srv=cmpnt&cmpntname=GetDocumentsById&source=library&DocumentIdParam=1b082c004e7f2b08b5f4ffbe46495314

When you open it in a browser, you will see that it is an XML document.

But when you put it in a

That is because (I believe) the meta refresh loads another page 0 seconds after loading the first one …

How can I solve this problem with Web Harvest?

Thanks in advance for your kind attention!

Please, if you have an answer write to: heribertojanosch@yahoo.com
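
A possible way to approach the meta refresh described in this comment, sketched in plain Java rather than as Web Harvest configuration: read the first response, look for a meta refresh tag, and resolve its target against the URL that was requested. The class name `MetaRefresh`, the method `findTarget`, and the `example.com` URLs are hypothetical, and the regular expression assumes the `http-equiv` attribute comes before `content`. It is not a full HTML parser.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaRefresh {

    // Assumption: http-equiv appears before content, e.g.
    // <meta http-equiv="refresh" content="0; URL=/next/page.xml">
    private static final Pattern META_REFRESH = Pattern.compile(
            "<meta\\s+http-equiv\\s*=\\s*[\"']?refresh[\"']?\\s+content\\s*=\\s*[\"']"
                    + "\\s*\\d+\\s*;\\s*url\\s*=\\s*([^\"'>\\s]+)",
            Pattern.CASE_INSENSITIVE);

    /**
     * Returns the absolute URL a meta refresh tag points to, if the page has one.
     * baseUrl is the URL the HTML was fetched from; relative targets are resolved
     * against it. URI.create throws IllegalArgumentException on malformed input.
     */
    public static Optional<String> findTarget(String html, String baseUrl) {
        Matcher m = META_REFRESH.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(URI.create(baseUrl).resolve(m.group(1)).toString());
    }

    public static void main(String[] args) {
        // Hypothetical page content, for illustration only.
        String html = "<html><head>"
                + "<meta http-equiv=\"refresh\" content=\"0; URL=/next/page.xml\">"
                + "</head></html>";
        System.out.println(findTarget(html, "http://example.com/a/b.html").orElse("no refresh"));
    }
}
```

If a target is found, that URL can be requested in place of the original one. Whether the scraping tool in use can be told to do this on its own depends on the tool and is not covered here.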

By: sam

sam — Wed, 10 Jun 2009 12:49:51 +0000

Can you tell me if there is any function in PL/SQL for page scraping?

By: nils-kaiser.de » Time to crawl back! Download Google Groups using a crawler

nils-kaiser.de » Time to crawl back! Download Google Groups using a crawler — Sat, 14 Feb 2009 00:48:49 +0000

[…] Hope this helps! Feel free to change the script and to notify me of any useful addition. To start changing the script, I recommend to have a look at the user manual and the examples. Also have a look at some other uses here and here. […]

By: Expertaya » HtmlUnit as Java Screen Scraping Library

Expertaya » HtmlUnit as Java Screen Scraping Library — Fri, 23 Jan 2009 10:38:18 +0000

[…] neither of them is as good as this library. For example, writing a screen scraper with Web Harvest is an easy task, but badly formatted pages cause xml parser to break and this happened to me a lot of times. […]

By: Radu

Radu — Sat, 17 May 2008 17:15:07 +0000

Hi, if there were more than 1 article, how would you show them all?
Could you show me how to loop the query?
Thanks!
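
As a rough illustration of looping over several matches, here is a minimal sketch using the standard `javax.xml.xpath` API. It assumes the page has already been converted to well-formed XML. The element names `article` and `title` and the class `ArticleLoop` are hypothetical and are not taken from the original tutorial.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ArticleLoop {

    /** Collects the text of every //article/title node (hypothetical element names). */
    public static List<String> titles(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // Asking for NODESET returns every match instead of only the first one.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//article/title",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        String xml = "<feed>"
                + "<article><title>First</title></article>"
                + "<article><title>Second</title></article>"
                + "</feed>";
        for (String title : titles(xml)) {
            System.out.println(title);
        }
    }
}
```

The point of the sketch is the `NODESET` return type: the query yields a list of nodes, and an ordinary `for` loop walks through all of them.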

By: santosh

santosh — Sun, 27 Apr 2008 22:06:29 +0000

I tried to use the Java code, but it says it cannot find the imported files. Can you please tell me what I am missing here?

By: Website Scraping for Dummies | The BookmarkMoney Blog

Website Scraping for Dummies | The BookmarkMoney Blog — Sat, 19 Apr 2008 21:51:44 +0000

[…] The Twit88 blog has two excellent tutorials on using Java/Web Harvest to extract data from websites. Web Scraping using Web Harvest, and Java - Writing a Web Page Scraper or Web Data Extraction Tool. […]

By: Ruben Zevallos Jr.

Ruben Zevallos Jr. — Fri, 15 Feb 2008 13:38:18 +0000

Thank you for your article… it opened my mind to some other things that I am doing.

Best