Java Library for Parsing HTML

By admin on Jun 10, 2012 in Java, open source

jsoup is a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data, combining the best of DOM, CSS, and jQuery-like methods.

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

With jsoup you can:

scrape and parse HTML from a URL, file, or string
find and extract data, using DOM traversal or CSS selectors
manipulate the HTML elements, attributes, and text
clean user-submitted content against a safe whitelist, to prevent XSS attacks
output tidy HTML
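
A few of the capabilities above can be sketched in a handful of lines. This is a minimal illustration, not code from the jsoup project: the class and method names are hypothetical, and it assumes a recent jsoup release on the classpath, where the cleaning rules live in org.jsoup.safety.Safelist (older releases call the same class Whitelist).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

// Hypothetical helper class, for illustration only.
class JsoupFeatureSketch {
    // Parse an HTML string and return the text of the first <p>, or "" if there is none.
    static String firstParagraphText(String html) {
        Document doc = Jsoup.parse(html);
        Element p = doc.select("p").first();
        return p == null ? "" : p.text();
    }

    // Set rel="nofollow" on every link, then return the body's inner HTML.
    static String addNofollow(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a[href]").attr("rel", "nofollow");
        return doc.body().html();
    }

    // Strip everything not permitted by jsoup's built-in "basic" safelist.
    // Assumes jsoup 1.14.2 or later; on older versions use Whitelist.basic().
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(firstParagraphText("<div><p>Hello <b>world</b></p></div>"));
        System.out.println(addNofollow("<a href=\"http://example.com/\">link</a>"));
        System.out.println(sanitize("<p>Hi<script>alert(1)</script></p>"));
    }
}
```

Each method parses its input independently, so the three tasks (extract, manipulate, clean) can be tried in isolation.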

jsoup is designed to deal with all varieties of HTML found in the wild: from pristine and validating to invalid tag soup, jsoup will create a sensible parse tree.
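
To see what a "sensible parse tree" looks like in practice, the sketch below feeds jsoup a fragment with unclosed tags and no html, head, or body elements. The input string and class name are made up for illustration; the jsoup calls used (Jsoup.parse, Document.title, select) are part of its public API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class, for illustration only.
class TagSoupSketch {
    // Parse markup that has unclosed <p> and <b> tags and no document skeleton.
    static Document parseMessy() {
        return Jsoup.parse("<title>Demo</title><p>One<p>Two <b>bold");
    }

    public static void main(String[] args) {
        Document doc = parseMessy();
        System.out.println(doc.title());            // title recovered from the implied <head>
        System.out.println(doc.select("p").size()); // number of paragraphs jsoup reconstructed
        System.out.println(doc.html());             // the normalized document
    }
}
```

As in a browser, the second p start tag implicitly closes the first paragraph, and the missing html, head, and body elements are created automatically.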

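    // Fetch the Wikipedia main page and parse it into a Document.
    Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
    // Select the links inside bold text within the element with id "mp-itn".
    Elements newsHeadlines = doc.select("#mp-itn b a");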
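
Once the links are selected, a typical next step is to read each one's text and target. The sketch below is one way to do that, under stated assumptions: the class name and the stand-in markup are hypothetical, and it parses a local string so that it runs without network access. With a live page, the Document would come from Jsoup.connect(...).get() as in the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
class HeadlineSketch {
    // Return "text -> absolute URL" for each link matched by the selector from the post.
    static List<String> headlines(Document doc) {
        List<String> out = new ArrayList<>();
        for (Element link : doc.select("#mp-itn b a")) {
            // absUrl resolves a relative href against the document's base URI.
            out.add(link.text() + " -> " + link.absUrl("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up markup that mimics the structure the selector expects.
        String html = "<div id=\"mp-itn\"><ul><li><b>"
                + "<a href=\"/wiki/Example\">Example</a></b></li></ul></div>";
        // The second argument is the base URI used to resolve relative links.
        Document doc = Jsoup.parse(html, "http://en.wikipedia.org/");
        headlines(doc).forEach(System.out::println);
    }
}
```

Passing a base URI to Jsoup.parse matters here: without it, absUrl has nothing to resolve relative links against and returns an empty string.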

jsoup is an open source project distributed under the liberal MIT license. The source code is available on GitHub.
