
Java - Writing a Web Page Scraper or Web Data Extraction Tool

Download Source Code

In my previous article I wrote about Web-Harvest, an open-source tool that can be used for web data scraping. Here I am going to show you a real-life example of using it to scrape data from this web site.

To write a web data scraping tool, the target pages normally must be structured, or at least semi-structured. E.g., all the articles on this web site use a standard layout, which is what makes extraction with XPath and XQuery possible.
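For illustration, here is a rough sketch of the kind of standard layout the extraction rules below rely on; the actual markup of this site may differ in its details:

<div class="post">
    <h1><a href="https://twit88.com/blog/...">Article title</a></h1>
    <p class="postinfo">Posted by <a href="...">admin</a> on ...</p>
</div>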

Here is the configuration file that I used to scrape the article information from all the articles on this web site.

<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">

    <var-def name="article">
        <xquery>
            <!-- Fetch the page and convert the HTML to XML so that XQuery can process it -->
            <xq-param name="doc">
                <html-to-xml outputtype="browser-compact" prunetags="yes">
                    <http url="${url}"/>
                </html-to-xml>
            </xq-param>

            <!-- Pull out the title, link and author with XPath expressions -->
            <xq-expression><![CDATA[
                declare variable $doc as node() external;
                let $title  := data($doc//div[@class="post"]/h1)
                let $link   := data($doc//div[@class="post"]/h1/a/@href)
                let $author := data($doc//div[@class="post"]/p[@class="postinfo"]/a)
                return
                    <article>
                        <title>{$title}</title>
                        <author>{$author[1]}</author>
                        <link>{$link}</link>
                    </article>
            ]]></xq-expression>
        </xquery>
    </var-def>

</config>

An XQuery expression is used to extract the required information. You can easily work out the patterns using a tool such as Solvent.

Here is the Java code that does the real work.

import java.io.FileNotFoundException;

import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.variables.Variable;

try {
    // Load the configuration file and set the working folder
    ScraperConfiguration config =
            new ScraperConfiguration("c:/twit88.xml");
    Scraper scraper = new Scraper(config, "c:/temp/");

    // Proxy server and port, if one is required
    // scraper.getHttpClientManager().setHttpProxy("proxy-server", 8001);

    // URL of the article to extract information from
    scraper.addVariableToContext("url",
            "https://twit88.com/blog/2007/12/31/design-pattern-in-java-101-builder-pattern-creational-pattern/");
    scraper.setDebug(true);
    scraper.execute();

    // Take the variable created during execution
    Variable article =
            (Variable) scraper.getContext().getVar("article");

    // Do something with the article...
    System.out.println(article.toString());
} catch (FileNotFoundException e) {
    System.out.println(e.getMessage());
}

In the code, I set the configuration file and the working folder, and passed in the URL of the article from which I wanted to extract information. The extracted information is saved and returned via the article variable. (See the sketch after the sample output below for scraping more than one article.)

The output from the program:

<article>
   <title>Design Pattern in Java 101 - Builder Pattern (Creational Pattern)</title>
   <author>admin</author>
   <link>https://twit88.com/blog/2007/12/31/design-pattern-in-java-101-builder-pattern-creational-pattern/</link>
</article>
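If you want to extract more than one article, the same configuration can be reused. Here is a minimal sketch, assuming the list of article URLs is known in advance and that config is the ScraperConfiguration loaded above:

String[] urls = {
    "https://twit88.com/blog/2007/12/31/design-pattern-in-java-101-builder-pattern-creational-pattern/",
    // ... add further article URLs here
};

for (String articleUrl : urls) {
    // A fresh Scraper per URL keeps each run's context independent
    Scraper scraper = new Scraper(config, "c:/temp/");
    scraper.addVariableToContext("url", articleUrl);
    scraper.execute();

    Variable article = (Variable) scraper.getContext().getVar("article");
    System.out.println(article.toString());
}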


3 Comment(s)

  1. Ruben Zevallos Jr. | Feb 15, 2008

    Thank you for your article… it opened my mind to some other things that I am doing.

    Best

  2. santosh | Apr 27, 2008

    I tried to use the Java code, but it says it cannot find the imported files. Can you please guide me on what I am missing here?

  3. Radu | May 17, 2008

    Hi, if there were more than one article, how would you show them all?
    Could you show me how to loop the query?
    Thanks!

