
Java – Writing a Web Page Scraper or Web Data Extraction Tool

Download Source Code

In my previous article I wrote about Web-Harvest, an open-source tool for web data scraping. Here I am going to show you a real-life example of using it to scrape data from this web site.

To write a web data scraping tool, the target web pages normally need to be structured or semi-structured. For example, all the articles on this web site use a standard layout, which is what makes extraction with XPath and XQuery possible.
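The same idea can be demonstrated with nothing but the JDK's built-in XPath support, independently of Web-Harvest. The sketch below runs the three XPath patterns used in the configuration further down against a small hand-made fragment that mimics this site's layout; the fragment and its sample values are my own assumptions, not content fetched from the site.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathDemo {
    // A tiny well-formed fragment mimicking the article layout described above
    public static final String PAGE =
        "<div class='post'>"
      + "<h1><a href='https://example.com/post-1'>Sample Title</a></h1>"
      + "<p class='postinfo'>Posted by <a href='#'>admin</a></p>"
      + "</div>";

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(PAGE.getBytes("UTF-8")));
        XPath xp = XPathFactory.newInstance().newXPath();

        // The same patterns the XQuery expression below relies on
        String title  = xp.evaluate("//div[@class='post']/h1", doc);
        String link   = xp.evaluate("//div[@class='post']/h1/a/@href", doc);
        String author = xp.evaluate("//div[@class='post']/p[@class='postinfo']/a", doc);

        System.out.println(title + " | " + author + " | " + link);
        // → Sample Title | admin | https://example.com/post-1
    }
}
```

In a real run, Web-Harvest's html-to-xml step is what turns the messy live HTML into XML that these patterns can be applied to.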

Here is the configuration file that I used to scrape the article information from all articles on this web site.

<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">

    <var-def name="article">
        <xquery>
            <xq-param name="doc">
                <html-to-xml outputtype="browser-compact" prunetags="yes">
                    <http url="${url}"/>
                </html-to-xml>
            </xq-param>

            <xq-expression><![CDATA[
                declare variable $doc as node() external;
                let $title  := data($doc//div[@class="post"]/h1)
                let $link   := data($doc//div[@class="post"]/h1/a/@href)
                let $author := data($doc//div[@class="post"]/p[@class="postinfo"]/a)
                return
                    <article>
                        <title>{$title}</title>
                        <author>{$author[1]}</author>
                        <link>{$link}</link>
                    </article>
            ]]></xq-expression>
        </xquery>
    </var-def>

</config>

An XQuery expression is used to extract the required information. You can easily work out the element patterns using Solvent.

Here is the Java code that is used to do the real work.

try {
    ScraperConfiguration config =
            new ScraperConfiguration("c:/twit88.xml");
    Scraper scraper = new Scraper(config, "c:/temp/");

    // Proxy server and port, if needed
    //scraper.getHttpClientManager().setHttpProxy("proxy-server", 8001);

    scraper.addVariableToContext("url",
            "https://twit88.com/blog/2007/12/31/design-pattern-in-java-101-builder-pattern-creational-pattern/");
    scraper.setDebug(true);
    scraper.execute();

    // Take the variable created during execution
    Variable article =
            (Variable) scraper.getContext().getVar("article");

    // Do something with the article...
    System.out.println(article.toString());
} catch (FileNotFoundException e) {
    System.out.println(e.getMessage());
}

In the code, I set the configuration file and working folder, and passed in the URL of the article from which I wanted to extract information. The extracted information is returned via the article variable.

The output from the program:

<article>
    <title>Design Pattern in Java 101 - Builder Pattern (Creational Pattern)</title>
    <author>admin</author>
    <link>https://twit88.com/blog/2007/12/31/design-pattern-in-java-101-builder-pattern-creational-pattern/</link>
</article>
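The article variable holds plain XML text, so consuming it in Java requires nothing beyond the JDK's DOM parser. Below is a minimal sketch; the class and helper names are mine, and the sample values are copied from the output above.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ArticleReader {
    // Sample value taken from the program output shown above
    public static final String ARTICLE =
        "<article>"
      + "<title>Design Pattern in Java 101 - Builder Pattern (Creational Pattern)</title>"
      + "<author>admin</author>"
      + "<link>https://twit88.com/blog/2007/12/31/design-pattern-in-java-101-builder-pattern-creational-pattern/</link>"
      + "</article>";

    // Return the text content of the first element with the given tag name
    public static String tag(Document doc, String name) {
        return doc.getElementsByTagName(name).item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(ARTICLE.getBytes("UTF-8")));
        System.out.println(tag(doc, "title"));
        System.out.println(tag(doc, "author"));
        System.out.println(tag(doc, "link"));
    }
}
```

In practice you would feed article.toString() into the parser instead of the hard-coded sample.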




7 Comment(s)

  1. Ruben Zevallos Jr. | Feb 15, 2008 | Reply

    Thank you for your article… it opened my mind to some other things that I am doing.

    Best

  2. santosh | Apr 27, 2008 | Reply

    I tried to use the Java code, but it says it cannot find the imported files. Can you please guide me on what I am missing here?

  3. Radu | May 17, 2008 | Reply

    Hi, if there were more than 1 article, how would you show them all?
    Could you show me how to loop the query?
    Thanks!

  4. sam | Jun 10, 2009 | Reply

    Can you tell me if there's any function in PL/SQL for page scraping?

  5. Heriberto Janosch González | Jun 16, 2009 | Reply

    Hello,

    Can you help me with Web Harvest?

    I need to load this page:

    http://contrataciondelestado.es/wps/wcm/connect/PLACE_es/Site/area/docAccCmpnt?srv=cmpnt&cmpntname=GetDocumentsById&source=library&DocumentIdParam=1b082c004e7f2b08b5f4ffbe46495314

    When you place it in a browser you will see that it is a Xml document.

    But when you put it in a <http url=" … instruction from Web-Harvest, it loads something like

    <meta http-equiv="refresh" content="0;url='/wps/wcm/connect/? …

    That is because (I believe) the meta refresh loads another page 0 seconds after loading the first one …

    How can I solve this problem with Web Harvest?

    Thanks in advance for your kind attention!

    Please, if you have an answer write to: [email protected]

  6. Fuller | Dec 1, 2009 | Reply

    I have authored a tool, MetaSeeker, which calculates the web data extraction instructions automatically after the operator semantically annotates a sample page. Free download: http://www.gooseeker.com/en/node/download/front

  7. janapati siva prasad rao | Feb 16, 2010 | Reply

    Hi,
    I am trying to use the Web-Harvest tool to extract data from web sites. I have gone through the examples given with the tool, and I tried to extract data from a site by giving the URL; it is working fine. Now I am trying to use the same approach to extract an entire site (just like the Canon example). Can you please give any inputs on writing the configuration file for this?

    One more thing: I do not understand how to write the configuration file. Please explain how to write it.

    Regards,
    Siva

3 Trackback(s)

  1. From Website Scraping for Dummies | The BookmarkMoney Blog | Apr 19, 2008
  2. From Expertaya » HtmlUnit as Java Screen Scraping Library | Jan 23, 2009
  3. From nils-kaiser.de » Time to crawl back! Download Google Groups using a crawler | Feb 13, 2009

Sorry, comments for this entry are closed at this time.