
Java – Writing a Web Page Scraper or Web Data Extraction Tool

Download Source Code

In my previous article I wrote about Web-Harvest, an open-source tool that can be used for web data scraping. Here I am going to show you a real-life example of using it to scrape data from this web site.

To write a web data scraping tool, the target web pages normally need to be structured or semi-structured. For example, all the articles on this web site use a standard layout, which is what makes extraction with XPath and XQuery possible.
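To illustrate why a consistent layout matters, here is a minimal standalone sketch using only the JDK's built-in XPath support (no Web-Harvest needed); the markup fragment and its values are invented for the example, loosely mimicking this site's post layout:

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathLayoutDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical fragment following a fixed "post" layout
        String page = "<div class=\"post\">"
                + "<h1><a href=\"/a1\">Hello World</a></h1>"
                + "</div>";

        XPath xpath = XPathFactory.newInstance().newXPath();

        // Because every article uses the same structure, one
        // expression extracts the title from any of them
        String title = xpath.evaluate("//div[@class='post']/h1",
                new InputSource(new StringReader(page)));

        System.out.println(title); // prints Hello World
    }
}
```

If the pages were hand-written with no common structure, no single expression like this would work across all of them.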

Here is the configuration file that I used to scrape the article information from all articles on this web site.

<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">

    <var-def name="article">
        <xquery>
            <xq-param name="doc">
                <html-to-xml outputtype="browser-compact" prunetags="yes">
                    <http url="${url}"/>
                </html-to-xml>
            </xq-param>

            <xq-expression><![CDATA[
                declare variable $doc as node() external;
                let $title  := data($doc//div[@class="post"]/h1)
                let $link   := data($doc//div[@class="post"]/h1/a/@href)
                let $author := data($doc//div[@class="post"]/p[@class="postinfo"]/a)
                return
                    <article>
                        <title>{$title}</title>
                        <author>{$author[1]}</author>
                        <link>{$link}</link>
                    </article>
            ]]></xq-expression>

        </xquery>
    </var-def>

</config>
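The configuration above extracts a single article whose URL is passed in from Java. Web-Harvest also has a `loop` processor, so in principle you can collect all article links from an index page first and run the extraction once per link. The following fragment is an untested sketch of that idea; the XPath expression for collecting links and the index URL are assumptions, not taken from the original configuration:

```xml
<!-- Hypothetical sketch: gather article links, then loop over them -->
<var-def name="articles">
    <loop item="articleUrl">
        <list>
            <xpath expression="//div[@class='post']/h1/a/@href">
                <html-to-xml>
                    <http url="https://twit88.com/"/>
                </html-to-xml>
            </xpath>
        </list>
        <body>
            <!-- same xquery block as above, with ${articleUrl}
                 substituted for ${url} -->
        </body>
    </loop>
</var-def>
```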

An XQuery expression is used to extract the required information. You can easily work out the extraction pattern using Solvent.

Here is the Java code that is used to do the real work.

import java.io.FileNotFoundException;
import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.variables.Variable;

// ...

try {
    ScraperConfiguration config =
            new ScraperConfiguration("c:/twit88.xml");
    Scraper scraper = new Scraper(config, "c:/temp/");

    // Optional proxy server and port
    // scraper.getHttpClientManager().setHttpProxy("proxy-server", 8001);

    scraper.addVariableToContext("url",
            "https://twit88.com/blog/2007/12/31/design-pattern-in-java-101-builder-pattern-creational-pattern/");
    scraper.setDebug(true);
    scraper.execute();

    // Take the variable created during execution
    Variable article =
            (Variable) scraper.getContext().getVar("article");

    // Do something with the article...
    System.out.println(article.toString());
} catch (FileNotFoundException e) {
    System.out.println(e.getMessage());
}

In the code, I set the configuration file and working folder, and passed in the URL of the article from which I wanted to extract information. The extracted information is saved and returned via the article variable.

The output from the program:

<article>
    <title>Design Pattern in Java 101 - Builder Pattern (Creational Pattern)</title>
    <author>admin</author>
    <link>https://twit88.com/blog/2007/12/31/design-pattern-in-java-101-builder-pattern-creational-pattern/</link>
</article>
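Once the article variable holds this XML, you can process it with the JDK's standard DOM API. The sketch below hard-codes a sample document for illustration; in real code you would feed `article.toString()` into the parser instead:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ArticleOutputDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for article.toString()
        String xml = "<article>"
                + "<title>Design Pattern in Java 101 - Builder Pattern (Creational Pattern)</title>"
                + "<author>admin</author>"
                + "<link>https://twit88.com/blog/2007/12/31/design-pattern-in-java-101-builder-pattern-creational-pattern/</link>"
                + "</article>";

        // Parse the scraper output into a DOM tree
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Pull out the individual fields by element name
        String title  = doc.getElementsByTagName("title").item(0).getTextContent();
        String author = doc.getElementsByTagName("author").item(0).getTextContent();
        String link   = doc.getElementsByTagName("link").item(0).getTextContent();

        System.out.println(title + " by " + author);
        System.out.println(link);
    }
}
```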



14 Comment(s)

  1. Ruben Zevallos Jr. | Feb 15, 2008 | Reply

Thank you for your article… it opened my mind to some other things that I am doing.

    Best

  2. santosh | Apr 27, 2008 | Reply

I tried to use the Java code but it says it cannot find the imported files. Can you please guide me on what I am missing here?

  3. Radu | May 17, 2008 | Reply

    Hi, if there were more than 1 article, how would you show them all?
    Could you show me how to loop the query?
    Thanks!

  4. sam | Jun 10, 2009 | Reply

    can you tell me if there’s any function in PL/SQL for page scraping??

  5. Heriberto Janosch González | Jun 16, 2009 | Reply

    Hello,

    Can you help me with Web Harvest?

    I need to load this page:

    http://contrataciondelestado.es/wps/wcm/connect/PLACE_es/Site/area/docAccCmpnt?srv=cmpnt&cmpntname=GetDocumentsById&source=library&DocumentIdParam=1b082c004e7f2b08b5f4ffbe46495314

    When you place it in a browser you will see that it is a Xml document.

    But when you put it in a <http url="…"> instruction from Web Harvest, it loads something like

    <meta http-equiv="refresh" content="0;url='/wps/wcm/connect/? …

    That is because (I believe) the meta refresh loads another page 0 seconds after loading the first one …

    How can I solve this problem with Web Harvest?

    Thanks in advance for your kind attention!

    Please, if you have an answer write to: [email protected]

  6. Fuller | Dec 1, 2009 | Reply

    I have authored a tool, i.e. MetaSeeker, to calculate the web data extraction instructions automatically after semantic annotating over the sample page by the operator. Free download: http://www.gooseeker.com/en/node/download/front

  7. janapati siva prasad rao | Feb 16, 2010 | Reply

    Hi,
    I am trying to use the Web-Harvest tool to extract data from web sites. I have gone through the examples given in the Web-Harvest tool. I have tried to extract the data from the site by giving the URL, and it is working fine. Now I am trying to apply the same approach to extract the entire site (just like the Canon example). Can you please give any inputs on how to write the configuration file for this?

    One more thing: I am not understanding how to write the configuration file. Please explain how to write the configuration file.

    Regards,
    Siva

  8. vietspider | May 19, 2010 | Reply

    You can use VietSpider XML from
    http://sourceforge.net/projects/binhgiang/files/

    download VietSpider3_16_XML_Windows.zip or VietSpider3_16_XML_Linux.zip

    VietSpider Web Data Extractor: Software crawls the data from the websites ((Data Scraper)), format to XML standard (Text, CDATA) then store in the relation database. Product supports the various of RDBMs such as Oracle, MySQL, SQL Server, H2, HSQL, Apache Derby, Postgres …VietSpider Crawler supports Session (login, query by form input), multi downloading, JavaScript handling, Proxy (and multi proxy by auto scan the proxies from website),…

  9. Md Amirul Islam | May 20, 2010 | Reply

    Oasis Designs is the leading web development companies with office in North America , Europe and Asia. They expertise in Web Design, Web Marketing, SEO, Graphics Designs, Animation, 3D Modeling
    http://www.exploreoasis.com

  10. Thulani Mamba | Jun 22, 2010 | Reply

    Would someone assist me with a dummy step by step of how to implement the example above, i am a newbie in programming

    i have copied the java part onto NetBeans and the first error i get is the ScraperConfiguration type… where does it get defined?

    I will really appreciate any form of assistance

  11. Thulani Mamba | Jun 22, 2010 | Reply

    Would someone assist me with a dummy step by step of how to implement the example above, i am a newbie in programming

    i have copied the java part onto NetBeans and the first error i get is the ScraperConfiguration type… where does it get defined?

    I will really appreciate any form of assistance

  12. Daniel | Jul 13, 2010 | Reply

    Hi,

    are there any possibilities to catch and process errors in Web-Harvest when the site to be parsed is not available?

  13. Quan | Oct 6, 2010 | Reply

    Hi,
    your code is very nice.
    Could you please tell me how I can search for one word in one input website? Something like the search function in most browsers.
    Many thanks and best regards
    Quan

  14. web Data Extraction | Oct 23, 2010 | Reply

    Thanks you for providing such good information , we are also expert in Web data extraction Let us know, but we do web data extraction in .net

3 Trackback(s)

  1. From Website Scraping for Dummies | The BookmarkMoney Blog | Apr 19, 2008
  2. From Expertaya » HtmlUnit as Java Screen Scraping Library | Jan 23, 2009
  3. From nils-kaiser.de » Time to crawl back! Download Google Groups using a crawler | Feb 13, 2009
