
Use HtmlUnit for Web Scraping

HtmlUnit is a unit testing framework for web applications, but its capabilities also make it a good fit for web page scraping.

HtmlUnit is a “browser for Java programs”. It models HTML documents and provides an API that lets you invoke pages, fill out forms, click links, and so on, just as you do in your “normal” browser.
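To make the "invoke pages, click links" idea concrete, here is a minimal sketch that loads a page and collects the href of every link on it. It assumes HtmlUnit 2.x on the classpath, where the classes live under com.gargoylesoftware.htmlunit (the 3.x releases moved them to org.htmlunit). The class name LinkScraper, the helper collectLinks, and the example.com URL are placeholders made up for this illustration.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class, not part of HtmlUnit itself.
public class LinkScraper {

    /** Loads the page at the given URL and returns the raw href of every anchor on it. */
    static List<String> collectLinks(WebClient webClient, String url) throws IOException {
        // Assumes the response is HTML; other content types would not be an HtmlPage.
        HtmlPage page = webClient.getPage(url);
        List<String> hrefs = new ArrayList<>();
        for (HtmlAnchor anchor : page.getAnchors()) {
            hrefs.add(anchor.getHrefAttribute());
        }
        return hrefs;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; pass a real one as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        // WebClient plays the role of the browser and is closed when we are done.
        try (WebClient webClient = new WebClient()) {
            // Plain link extraction does not need script execution.
            webClient.getOptions().setJavaScriptEnabled(false);
            for (String href : collectLinks(webClient, url)) {
                System.out.println(href);
            }
        }
    }
}
```

Passing the WebClient into the helper keeps the scraping logic separate from how the client is set up, which also makes it easy to try against canned HTML instead of a live site.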

Its features include:

  • Support for the HTTP and HTTPS protocols
  • Support for cookies
  • Ability to specify whether failing responses from the server should throw exceptions or should be returned as pages of the appropriate type (based on content type)
  • Support for submit methods POST and GET (as well as HEAD, DELETE, …)
  • Ability to customize the request headers being sent to the server
  • Support for HTML responses
    • Wrapper for HTML pages that provides easy access to all information contained inside them
    • Support for submitting forms
    • Support for clicking links
    • Support for walking the DOM model of the HTML document
  • Proxy server support
  • Support for basic and NTLM authentication
  • Good support for JavaScript
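A few of the features above map directly onto client settings. The sketch below shows one way to configure them for scraping: returning failing responses as pages instead of throwing, sending a custom request header, and keeping cookies enabled. As before, it assumes HtmlUnit 2.x (package com.gargoylesoftware.htmlunit). The names ConfiguredFetch, newScrapingClient and fetchStatus, the header value, and the URL are illustrative assumptions only.

```java
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;

// Hypothetical example class, not part of HtmlUnit itself.
public class ConfiguredFetch {

    /** Builds a WebClient with settings that tend to suit scraping rather than testing. */
    static WebClient newScrapingClient() {
        WebClient webClient = new WebClient();
        // Return 4xx/5xx responses as pages instead of throwing an exception.
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        // Do not abort the page load because a script on the page fails.
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        // Custom header sent with every request (the value is just an example).
        webClient.addRequestHeader("Accept-Language", "en");
        // Cookies are on by default; shown here to make the setting explicit.
        webClient.getCookieManager().setCookiesEnabled(true);
        return webClient;
    }

    /** Fetches the URL and returns the HTTP status code of the response. */
    static int fetchStatus(WebClient webClient, String url) throws IOException {
        Page page = webClient.getPage(url);
        if (page instanceof HtmlPage) {
            System.out.println("Title: " + ((HtmlPage) page).getTitleText());
        }
        return page.getWebResponse().getStatusCode();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; pass a real one as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        try (WebClient webClient = newScrapingClient()) {
            System.out.println("Status: " + fetchStatus(webClient, url));
        }
    }
}
```

With exceptions on failing status codes switched off, a missing page comes back as an ordinary page object, so the scraper can inspect the status code and decide for itself whether to skip or retry.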




1 Comment(s)

  1. anom | May 4, 2008

    Strongly agreed. The same goes for other web testing tools like:

    Selenium
    httpunit
    screenscraper
