
Java: Smart and Simple Web Crawler

Smart and Simple Web Crawler

  • Smart and easy-to-use framework that crawls a web site
  • Integrated Lucene support
  • The framework is simple to integrate into your own applications
  • The crawler can start from a single link or from a list of links
  • Two crawling models available:
    • Max Iterations: Crawls a web site through a limited number of links. A fast model with a small memory footprint and low CPU usage.
    • Max Depth: A simple graph model parser that does not record incoming and outgoing links. As fast as the Max Iterations model.
  • Accept filter interface to limit the links to be crawled
  • Core accept filters available: ServerFilter, BeginningPathFilter and RegularExpressionFilter
  • Accept filters can be combined with AND, OR and NOT
  • Pluggable HTTP connection libraries: HttpClient (default) and HTMLParser (optional)
  • Custom listeners can be added to the parsing process
  • The framework is not a GUI-based tool for mirroring a website and browsing it offline!
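
To illustrate how accept filters might be combined with AND, OR and NOT, here is a minimal sketch in plain Java. It does not use the framework's real API: the interface `LinkAcceptFilter`, the class `AcceptFilterSketch` and the factory methods `server`, `beginningPath` and `regex` are hypothetical names, loosely modelled on the ServerFilter, BeginningPathFilter and RegularExpressionFilter mentioned in the list above. The actual signatures in the framework may differ.

```java
import java.net.URI;
import java.util.regex.Pattern;

// Hypothetical stand-in for the framework's accept filter interface.
// The real interface name and method signature may differ.
interface LinkAcceptFilter {
    boolean accept(String link);

    // Accept only if both filters accept the link.
    default LinkAcceptFilter and(LinkAcceptFilter other) {
        return link -> accept(link) && other.accept(link);
    }

    // Accept if at least one of the filters accepts the link.
    default LinkAcceptFilter or(LinkAcceptFilter other) {
        return link -> accept(link) || other.accept(link);
    }

    // Invert the decision of this filter.
    default LinkAcceptFilter not() {
        return link -> !accept(link);
    }
}

// Illustrative analogues of the three core filters (assumed behavior).
final class AcceptFilterSketch {

    // Accepts links whose host matches the given server name.
    static LinkAcceptFilter server(String host) {
        return link -> {
            try {
                return host.equalsIgnoreCase(URI.create(link).getHost());
            } catch (IllegalArgumentException e) {
                return false; // malformed links are rejected
            }
        };
    }

    // Accepts links that start with the given prefix.
    static LinkAcceptFilter beginningPath(String prefix) {
        return link -> link.startsWith(prefix);
    }

    // Accepts links that match the regular expression as a whole.
    static LinkAcceptFilter regex(String pattern) {
        Pattern compiled = Pattern.compile(pattern);
        return link -> compiled.matcher(link).matches();
    }

    public static void main(String[] args) {
        // Stay on example.org, below /docs/, and skip PDF files.
        LinkAcceptFilter filter = server("example.org")
                .and(beginningPath("http://example.org/docs/"))
                .and(regex(".*\\.pdf").not());

        System.out.println(filter.accept("http://example.org/docs/index.html"));
        System.out.println(filter.accept("http://example.org/docs/manual.pdf"));
    }
}
```

Expressing the combinators as default methods keeps each filter a single small predicate, so new filters only need to implement `accept`.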

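The difference between the two crawling models can be sketched as follows. This is one possible interpretation, not the framework's implementation: the class `CrawlModelSketch` and both method names are hypothetical, and an in-memory map of links stands in for fetching and parsing real pages. Both variants keep only a set of already seen links, which is in line with the small memory footprint described in the list above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the two crawling models over an in-memory link map.
final class CrawlModelSketch {

    // "Max Iterations" (assumed meaning): visit at most maxIterations links,
    // breadth-first, starting from the given links.
    static List<String> crawlMaxIterations(Map<String, List<String>> links,
                                           List<String> starts, int maxIterations) {
        Set<String> seen = new LinkedHashSet<>(starts);
        Deque<String> queue = new ArrayDeque<>(seen);
        List<String> visited = new ArrayList<>();
        while (!queue.isEmpty() && visited.size() < maxIterations) {
            String url = queue.poll();
            visited.add(url);
            for (String next : links.getOrDefault(url, Collections.emptyList())) {
                if (seen.add(next)) {
                    queue.add(next); // only enqueue links not seen before
                }
            }
        }
        return visited;
    }

    // "Max Depth" (assumed meaning): visit every link reachable within
    // maxDepth hops of a start link; start links are at depth 0.
    static List<String> crawlMaxDepth(Map<String, List<String>> links,
                                      List<String> starts, int maxDepth) {
        Set<String> seen = new LinkedHashSet<>(starts);
        List<String> level = new ArrayList<>(seen);
        List<String> visited = new ArrayList<>();
        for (int depth = 0; depth <= maxDepth && !level.isEmpty(); depth++) {
            List<String> nextLevel = new ArrayList<>();
            for (String url : level) {
                visited.add(url);
                for (String next : links.getOrDefault(url, Collections.emptyList())) {
                    if (seen.add(next)) {
                        nextLevel.add(next);
                    }
                }
            }
            level = nextLevel;
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = new HashMap<>();
        site.put("a", Arrays.asList("b", "c"));
        site.put("b", Arrays.asList("d"));
        site.put("c", Arrays.asList("d", "a"));
        site.put("d", Arrays.asList("e"));

        System.out.println(crawlMaxIterations(site, Arrays.asList("a"), 3));
        System.out.println(crawlMaxDepth(site, Arrays.asList("a"), 2));
    }
}
```

In a real crawler the map lookup would be replaced by downloading the page and extracting its links, with an accept filter applied before a link is queued.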
