Java - Writing a Web Page Scraper or Web Data Extraction Tool
By admin on Jan 6, 2008 in Java, Programming
In my previous article I wrote about Web-Harvest, an open-source tool that can be used for web data scraping. Here I am going to show you a real-life example of using it to scrape data from this web site.
To write a web data scraping tool, the target web pages normally must have a regular structure. This is what we usually call structured or semi-structured web pages. E.g., all the articles on this web site use a standard layout, which is what makes extraction with XPath and XQuery possible.
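To see why a regular layout matters, here is a minimal sketch (independent of Web-Harvest) that pulls an article title out of a simplified, well-formed page fragment using the XPath engine built into the JDK. The HTML string and class name are illustrative stand-ins for a real page:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathDemo {
    public static void main(String[] args) throws Exception {
        // A simplified, well-formed stand-in for a structured article page
        String html = "<html><body><div class=\"post\">"
                + "<h1>Sample Title</h1></div></body></html>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Because every article sits in a div with a known class,
        // one expression works for all of them
        String title = xpath.evaluate("//div[@class='post']/h1", doc);
        System.out.println(title); // prints "Sample Title"
    }
}
```

Real HTML is rarely well-formed XML, which is exactly why Web-Harvest runs pages through its html-to-xml step first.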
Here is the configuration file that I used to scrape the article information from all articles on this web site.
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
    <var-def name="article">
        <xquery>
            <xq-param name="doc">
                <html-to-xml outputtype="browser-compact" prunetags="yes">
                    <http url="${url}"/>
                </html-to-xml>
            </xq-param>
            <xq-expression><![CDATA[
                declare variable $doc as node() external;
                let $title  := data($doc//div[@class="post"]/h1)
                let $link   := data($doc//div[@class="post"]/h1/a/@href)
                let $author := data($doc//div[@class="post"]/p[@class="postinfo"]/a)
                return
                    <article>
                        <title>{$title}</title>
                        <author>{$author[1]}</author>
                        <link>{$link}</link>
                    </article>
            ]]></xq-expression>
        </xquery>
    </var-def>
</config>
An XQuery expression is used to extract the required information. You can easily work out the pattern using Solvent.
Here is the Java code that is used to do the real work.
import java.io.FileNotFoundException;

import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.variables.Variable;

...

try {
    ScraperConfiguration config = new ScraperConfiguration("c:/twit88.xml");
    Scraper scraper = new Scraper(config, "c:/temp/");

    // Proxy server and port
    //scraper.getHttpClientManager().setHttpProxy("proxy-server", 8001);

    scraper.addVariableToContext("url",
            "https://twit88.com/blog/2007/12/31/design-pattern-in-java-101-builder-pattern-creational-pattern/");
    scraper.setDebug(true);
    scraper.execute();

    // takes variable created during execution
    Variable article = (Variable) scraper.getContext().getVar("article");

    // do something with the article...
    System.out.println(article.toString());
} catch (FileNotFoundException e) {
    System.out.println(e.getMessage());
}
In the code, I set the configuration file and working folder, and passed in the URL of the article from which I wanted to extract information. The extracted information is saved and returned via the article variable.
The output from the program
<article>
    <title>Design Pattern in Java 101 - Builder Pattern (Creational Pattern)</title>
    <author>admin</author>
    <link>https://twit88.com/blog/2007/12/31/design-pattern-in-java-101-builder-pattern-creational-pattern/</link>
</article>
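Since the scraper returns the result as plain XML text, downstream code can parse the fields out with the JDK's standard DOM APIs. A minimal sketch, where the hard-coded string stands in for the scraper's output:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ArticleParser {
    public static void main(String[] args) throws Exception {
        // Stand-in for article.toString() from the scraper run above
        String xml = "<article>"
                + "<title>Design Pattern in Java 101 - Builder Pattern (Creational Pattern)</title>"
                + "<author>admin</author>"
                + "<link>https://twit88.com/blog/2007/12/31/design-pattern-in-java-101-builder-pattern-creational-pattern/</link>"
                + "</article>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));

        // Each field is a single element, so item(0) is safe here
        String title  = doc.getElementsByTagName("title").item(0).getTextContent();
        String author = doc.getElementsByTagName("author").item(0).getTextContent();
        System.out.println(title + " by " + author);
    }
}
```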
Ruben Zevallos Jr. | Feb 15, 2008 | Reply
Thank you for your article… it opened my mind to some other things that I'm doing.
Best
santosh | Apr 27, 2008 | Reply
I tried to use the Java code but it says it cannot find the imported files. Can you please guide me on what I am missing here?
Radu | May 17, 2008 | Reply
Hi, if there were more than 1 article, how would you show them all?
Could you show me how to loop the query?
Thanks!