
Java – Writing a Web Page Scraper or Web Data Extraction Tool

Download Source Code

In my previous article I wrote about Web-Harvest, an open-source tool for web data scraping. Here I am going to show you a real-life example of using it to scrape data from this web site.

To write a web data scraping tool, the target web pages normally must be structured, or at least semi-structured. For example, all the articles on this web site use a standard layout, which is what makes extraction possible using XPath and XQuery.
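To illustrate the idea, here is a minimal sketch of pulling fields out of a structured layout with XPath. It is independent of Web-Harvest and uses only the JDK's javax.xml.xpath API; the page fragment and the class name are made up for illustration, but the XPath expressions mirror the ones used in the configuration below:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathSketch {
    public static void main(String[] args) throws Exception {
        // A simplified, well-formed fragment mimicking this site's article layout
        String page =
            "<div class=\"post\">" +
            "<h1><a href=\"http://example.com/post\">Sample Title</a></h1>" +
            "<p class=\"postinfo\">By <a>admin</a></p>" +
            "</div>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(page.getBytes("UTF-8")));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Same path patterns as in the Web-Harvest configuration below
        String title = xpath.evaluate("//div[@class='post']/h1", doc).trim();
        String link  = xpath.evaluate("//div[@class='post']/h1/a/@href", doc);

        System.out.println(title + " | " + link);
    }
}
```

Real pages first need an HTML-to-XML cleanup pass (which Web-Harvest does with its html-to-xml processor) before a standard XML parser will accept them.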

Here is the configuration file that I used to scrape the article information from all articles in this web site.

<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">

    <var-def name="article">
        <xquery>
            <xq-param name="doc">
                <html-to-xml outputtype="browser-compact" prunetags="yes">
                    <http url="${url}"/>
                </html-to-xml>
            </xq-param>
            <xq-expression><![CDATA[
                declare variable $doc as node() external;
                let $title  := data($doc//div[@class="post"]/h1)
                let $link   := data($doc//div[@class="post"]/h1/a/@href)
                let $author := data($doc//div[@class="post"]/p[@class="postinfo"]/a)
                return
                    <article>
                        <title>{$title}</title>
                        <author>{$author[1]}</author>
                        <link>{$link}</link>
                    </article>
            ]]></xq-expression>
        </xquery>
    </var-def>

</config>

An XQuery expression is used to extract the required information. You can easily work out the extraction pattern using Solvent.

Here is the Java code that is used to do the real work.

import java.io.FileNotFoundException;
import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.variables.Variable;

try {
    // Configuration file and working folder
    ScraperConfiguration config =
            new ScraperConfiguration("c:/twit88.xml");
    Scraper scraper = new Scraper(config, "c:/temp/");

    // Optional proxy server and port
    //scraper.getHttpClientManager().setHttpProxy("proxy-server", 8001);

    // URL of the article to extract information from
    scraper.addVariableToContext("url",
            "http://twit88.com/blog/2007/12/31/design-pattern-in-java-101-builder-pattern-creational-pattern/");
    scraper.setDebug(true);
    scraper.execute();

    // Take the variable created during execution
    Variable article =
            (Variable) scraper.getContext().getVar("article");

    // Do something with the article...
    System.out.println(article.toString());
} catch (FileNotFoundException e) {
    System.out.println(e.getMessage());
}

In the code, I set the configuration file and the working folder, and passed in the URL of the article from which I wanted to extract information. The extracted information is saved and returned via the article variable.

The output from the program:

<article>
   <title>Design Pattern in Java 101 - Builder Pattern (Creational Pattern)</title>
   <author>admin</author>
   <link>http://twit88.com/blog/2007/12/31/design-pattern-in-java-101-builder-pattern-creational-pattern/</link>
</article>
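Since the article variable holds a small XML document, it can be post-processed with the standard JDK DOM API. A minimal sketch, using the sample output above as input (the ArticleReader class name is made up for illustration):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ArticleReader {
    public static void main(String[] args) throws Exception {
        // The XML produced by the scraper (the sample output above)
        String xml =
            "<article>" +
            "<title>Design Pattern in Java 101 - Builder Pattern (Creational Pattern)</title>" +
            "<author>admin</author>" +
            "<link>http://twit88.com/blog/2007/12/31/design-pattern-in-java-101-builder-pattern-creational-pattern/</link>" +
            "</article>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));

        // Read each child element by tag name
        String title  = doc.getElementsByTagName("title").item(0).getTextContent();
        String author = doc.getElementsByTagName("author").item(0).getTextContent();

        System.out.println(title + " by " + author);
    }
}
```

From here the fields could just as easily go into a database or a CSV file instead of standard output.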




44 Comment(s)

  1. Ruben Zevallos Jr. | Feb 15, 2008 | Reply

    Thank you for your article… it opened my mind to some other things that I am doing.

    Best

  2. santosh | Apr 27, 2008 | Reply

    I tried to use the Java code but it says it cannot find the imported files. Can you please guide me on what I am missing here?

  3. Radu | May 17, 2008 | Reply

    Hi, if there were more than 1 article, how would you show them all?
    Could you show me how to loop the query?
    Thanks!

  4. sam | Jun 10, 2009 | Reply

    can you tell me if there’s any function in PL/SQL for page scraping??

  5. Heriberto Janosch González | Jun 16, 2009 | Reply

    Hello,

    Can you help me with Web Harvest?

    I need to load this page:

    http://contrataciondelestado.es/wps/wcm/connect/PLACE_es/Site/area/docAccCmpnt?srv=cmpnt&cmpntname=GetDocumentsById&source=library&DocumentIdParam=1b082c004e7f2b08b5f4ffbe46495314

    When you place it in a browser you will see that it is a Xml document.

    But when you put it in a <http url=” … instruction form Web Harvest, it loads something like

    <meta http-equiv=”refresh” content=”0;url='/wps/wcm/connect/? …

    That is because (I believe) the meta refresh loads another page 0 seconds after loading the first one …

    How can I solve this problem with Web Harvest?

    Thanks in advance for your kind attention!

    Please, if you have an answer write to: heribertojanosch@yahoo.com

  6. Fuller | Dec 1, 2009 | Reply

    I have authored a tool, i.e. MetaSeeker, to calculate the web data extraction instructions automatically after semantic annotating over the sample page by the operator. Free download: http://www.gooseeker.com/en/node/download/front

  7. janapati siva prasad rao | Feb 16, 2010 | Reply

    Hi,
    I am trying to use the Web-Harvest tool to extract data from web sites. I have gone through the examples given with the tool and tried to extract data from a site by giving the URL; it works fine. Now I am trying to apply the same approach to extract an entire site (just like the Canon example). Can you please give any inputs on writing the configuration file for this?

    One more thing: I do not understand how to write the configuration file. Please explain how to write it.

    Regards,
    Siva

  8. vietspider | May 19, 2010 | Reply

    You can use VietSpider XML from
    http://sourceforge.net/projects/binhgiang/files/

    download VietSpider3_16_XML_Windows.zip or VietSpider3_16_XML_Linux.zip

    VietSpider Web Data Extractor: the software crawls data from websites (Data Scraper), formats it to standard XML (Text, CDATA), then stores it in a relational database. The product supports various RDBMSs such as Oracle, MySQL, SQL Server, H2, HSQL, Apache Derby, Postgres… VietSpider Crawler supports sessions (login, query by form input), parallel downloading, JavaScript handling, and proxies (including multi-proxy by auto-scanning proxies from websites).

  10. Thulani Mamba | Jun 22, 2010 | Reply

    Would someone assist me with a dummy step-by-step of how to implement the example above? I am a newbie in programming.

    I have copied the Java part into NetBeans and the first error I get is for the ScraperConfiguration type… where does it get defined?

    I will really appreciate any form of assistance.

  12. Daniel | Jul 13, 2010 | Reply

    Hi,

    are there any possibilities to catch and process errors in Web-Harvest when the site to be parsed is not available?

  13. Quan | Oct 6, 2010 | Reply

    Hi,
    your code is very nice.
    Could you please tell me how I can search for one word in an "input" website? Something like the search function in most browsers.
    Many thanks and best regards
    Quan

  14. web Data Extraction | Oct 23, 2010 | Reply

    Thank you for providing such good information. We are also experts in web data extraction; let us know. We do web data extraction in .NET.

  15. Thinzar | Mar 10, 2011 | Reply

    I got the following error. How can I solve it?
    log4j:WARN No appenders could be found for logger (org.webharvest.definition.XmlParser).
    log4j:WARN Please initialize the log4j system properly.
    Exception in thread “main” org.webharvest.exception.HttpException: IO error during HTTP execution for URL: http://twit88.com/blog/2007/12/31/design-pattern-in-java-101-builder-pattern-creational-pattern/
    at org.webharvest.runtime.web.HttpClientManager.execute(Unknown Source)
    at org.webharvest.runtime.processors.HttpProcessor.execute(Unknown Source)
    at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
    at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
    at org.webharvest.runtime.processors.BaseProcessor.getBodyTextContent(Unknown Source)
    at org.webharvest.runtime.processors.BaseProcessor.getBodyTextContent(Unknown Source)
    at org.webharvest.runtime.processors.BaseProcessor.getBodyTextContent(Unknown Source)
    at org.webharvest.runtime.processors.HtmlToXmlProcessor.execute(Unknown Source)
    at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
    at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
    at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
    at org.webharvest.runtime.processors.BaseProcessor.getBodyTextContent(Unknown Source)
    at org.webharvest.runtime.processors.XQueryProcessor.execute(Unknown Source)
    at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
    at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
    at org.webharvest.runtime.processors.VarDefProcessor.execute(Unknown Source)
    at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
    at org.webharvest.runtime.Scraper.execute(Unknown Source)
    at org.webharvest.runtime.Scraper.execute(Unknown Source)
    at App.main(App.java:18)
    Caused by: java.net.ConnectException: Connection timed out: connect
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(Unknown Source)
    at java.net.PlainSocketImpl.connectToAddress(Unknown Source)
    at java.net.PlainSocketImpl.connect(Unknown Source)
    at java.net.SocksSocketImpl.connect(Unknown Source)
    at java.net.Socket.connect(Unknown Source)
    at java.net.Socket.connect(Unknown Source)
    at java.net.Socket.<init>(Unknown Source)
    at java.net.Socket.<init>(Unknown Source)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
    at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
    … 20 more

  16. Thinzar | Mar 16, 2011 | Reply

    I'm trying to extract data records from a list page. How can I apply the Web-Harvest tool to correctly extract the data items? Please give me example code for the XML configuration file.

  17. Amar | May 31, 2011 | Reply

    I keep getting the error – IO error during HTTP execution for URL: http://www.google.com.

    Any idea why this is so?

  18. kailash | Jun 14, 2012 | Reply

    Hey, I am using this URL "http://freesearch.naukri.com/preview/preview?uname=95a0c4f6fa0d3e0e66d376259a40ce0f180c560f17514f4a401f5&sid=75044817&LT=1339651393"
    I want to scrape the name, current location, preferred location, etc. For this I am using this in the XML file:

    declare variable $doc as node() external;
    let $title := data($doc//div[@id="bdrGry"])
    let $link := data($doc//div[@class="bdrGry"]/div[@class="boxHD1"]/a/@href)
    let $author := data($doc//div[@class="bdrGry"]/div[@class="boxHD1"]/a/@href)
    return

    {$title}

    but I am not getting a result?

  24. sumithra | Jul 10, 2013 | Reply

    @Thinzar, @Amar: I am also getting the same error. Did you find the solution? If so, please let me know; please mail it to my ID.
    The error is: "ERROR – IO error during HTTP execution for URL: http://www.google.com/"

  25. Sindu | Jul 22, 2013 | Reply

    Hi, I want to scrape a website only if the information I need is present, using Web-Harvest. So I need to configure some keywords in Web-Harvest. Can anybody help with this?

  27. Java training in chennai | Sep 23, 2013 | Reply

    Is this extraction possible on a website developed in an open CMS like WordPress, Drupal, etc.?

  35. Santo | May 24, 2014 | Reply

    I tried to execute the above code in Eclipse but it is giving the following errors. I have imported all the required .jar files. Please help.

    Line – Scraper scraper = new Scraper(config, "c:/temp/");
    Error – The constructor Scraper(ScraperConfiguration, String) refers to the missing type ScraperConfiguration

    Line – Variable article = (Variable)scraper.getContext().getVar("article");
    Error – The method getContext() from the type Scraper refers to the missing type ScraperContext

  36. Chandra | Jun 20, 2014 | Reply

    We have used Java and PHP to extract data in real time from multiple websites. Data scraping is not the problem; storing it is. We have used cloud computing along with Hadoop to extract and manage millions of records. For more info visit http://www.ptsius.com

4 Trackback(s)

  1. From Website Scraping for Dummies | The BookmarkMoney Blog | Apr 19, 2008
  2. From Expertaya » HtmlUnit as Java Screen Scraping Library | Jan 23, 2009
  3. From nils-kaiser.de » Time to crawl back! Download Google Groups using a crawler | Feb 13, 2009
  4. From mic (mic100) | Pearltrees | Mar 16, 2012
