Use HtmlUnit for Web Scraping
By admin on Apr 21, 2008 in Java, open source
HtmlUnit is a Java library for testing web applications, but its capabilities also make it a good fit for web page scraping.
HtmlUnit is a “browser for Java programs”. It models HTML documents and provides an API that lets you invoke pages, fill out forms, click links, and so on, just as you do in your “normal” browser. Its features include:
- Support for the HTTP and HTTPS protocols
- Support for cookies
- Ability to specify whether failing responses from the server should throw exceptions or should be returned as pages of the appropriate type (based on content type)
- Support for submit methods POST and GET (as well as HEAD, DELETE, …)
- Ability to customize the request headers being sent to the server
- Support for HTML responses
- Wrapper for HTML pages that provides easy access to all information contained inside them
- Support for submitting forms
- Support for clicking links
- Support for walking the DOM of the HTML document
- Proxy server support
- Support for basic and NTLM authentication
- Good support for JavaScript
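
To make a few of these features concrete, here is a minimal sketch of a link scraper. It assumes an HtmlUnit 2.x release on the classpath (the `com.gargoylesoftware.htmlunit` packages; 3.x moved to `org.htmlunit`, and older releases exposed the options differently). The class name `LinkScraper`, the method `scrapeLinks`, and the example URL are made up for illustration.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.util.ArrayList;
import java.util.List;

// Hypothetical example class, not part of HtmlUnit.
public class LinkScraper {

    /** Loads the page at the given URL and returns the href attribute of every anchor on it. */
    public static List<String> scrapeLinks(String url) throws Exception {
        WebClient client = new WebClient();
        try {
            // Return error responses (404, 500, ...) as pages instead of throwing.
            client.getOptions().setThrowExceptionOnFailingStatusCode(false);
            // Plain link extraction does not need scripts, so turn them off.
            client.getOptions().setJavaScriptEnabled(false);

            HtmlPage page = client.getPage(url);

            List<String> hrefs = new ArrayList<>();
            for (HtmlAnchor anchor : page.getAnchors()) {
                // Raw attribute value as written in the HTML, not a resolved URL.
                hrefs.add(anchor.getHrefAttribute());
            }
            return hrefs;
        } finally {
            // Release windows and connections held by the client.
            client.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL; pass a real one as the first argument.
        String url = args.length > 0 ? args[0] : "http://example.com/";
        for (String href : scrapeLinks(url)) {
            System.out.println(href);
        }
    }
}
```

Because the hrefs come back exactly as written in the markup, relative links would still need to be resolved against the page URL before you follow them.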
anom | May 4, 2008
Strongly agreed, as with other web testing tools like Selenium, httpunit, and screenscraper.