Java: Smart and Simple Web Crawler
By admin on Dec 23, 2008 in Java, open source
- Smart and easy framework that crawls a web site
- Integrated Lucene support
- The framework is simple to integrate into your own applications
- The crawler can start from a single link or from a list of links
- Two crawling models available:
  - Max Iterations: crawls a web site up to a limited number of links. A fast model with a small memory footprint and low CPU usage.
  - Max Depth: a simple graph model parser that does not record incoming and outgoing links. As fast as the Max Iterations model.
- Accept filter interface to limit the links to be crawled
- Core accept filters available: ServerFilter, BeginningPathFilter and RegularExpressionFilter
- Accept filters can be combined with AND, OR and NOT
- Pluggable HTTP connection libraries: HttpClient (default) and HTMLParser (optional)
- Custom listeners can be added to the parsing process
- The framework is not a GUI based tool to mirror a website and browse the site offline!
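
To illustrate how accept filters combined with AND, OR and NOT can narrow down the links to be crawled, here is a minimal sketch. It does not use the framework's own classes: the three factory methods only borrow the names ServerFilter, BeginningPathFilter and RegularExpressionFilter from the list above, their behaviour is an assumption based on those names, and java.util.function.Predicate stands in for the framework's accept filter interface. The host example.org and the sample links are made up.

```java
import java.net.URI;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical stand-ins for the core accept filters; NOT the framework's API.
final class AcceptFilterSketch {

    // Accepts links whose host equals the given server name (assumed behaviour).
    static Predicate<String> serverFilter(String host) {
        return link -> {
            try {
                return host.equalsIgnoreCase(URI.create(link).getHost());
            } catch (IllegalArgumentException e) {
                return false; // malformed links are rejected
            }
        };
    }

    // Accepts links whose path starts with the given prefix (assumed behaviour).
    static Predicate<String> beginningPathFilter(String prefix) {
        return link -> {
            try {
                String path = URI.create(link).getPath();
                return path != null && path.startsWith(prefix);
            } catch (IllegalArgumentException e) {
                return false;
            }
        };
    }

    // Accepts links that match the regular expression as a whole.
    static Predicate<String> regularExpressionFilter(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return link -> pattern.matcher(link).matches();
    }

    public static void main(String[] args) {
        // server AND path prefix AND NOT (link ends in .pdf)
        Predicate<String> accept = serverFilter("example.org")
                .and(beginningPathFilter("/docs/"))
                .and(regularExpressionFilter(".*\\.pdf").negate());

        System.out.println(accept.test("http://example.org/docs/index.html")); // true
        System.out.println(accept.test("http://example.org/docs/manual.pdf")); // false
        System.out.println(accept.test("http://other.org/docs/index.html"));   // false
    }
}
```

An OR combination works the same way through Predicate.or, for example to accept two servers at once.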

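
The difference between the two crawling models can be sketched in a few lines. This is only an illustration under stated assumptions, not the framework's implementation: the site is a hypothetical in-memory map from a link to its outgoing links instead of real HTTP fetches, both methods walk the links breadth-first (the source does not say which order the framework uses), and all names are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the two crawling models; NOT the framework's classes.
final class CrawlModelSketch {

    // Max Iterations: stop as soon as maxLinks links have been visited.
    static List<String> crawlMaxIterations(Map<String, List<String>> site,
                                           List<String> starts, int maxLinks) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new LinkedHashSet<>(starts); // also removes duplicate start links
        Deque<String> queue = new ArrayDeque<>(seen);
        while (!queue.isEmpty() && visited.size() < maxLinks) {
            String link = queue.poll();
            visited.add(link);
            for (String out : site.getOrDefault(link, List.of())) {
                if (seen.add(out)) {
                    queue.add(out); // only queue links not seen before
                }
            }
        }
        return visited;
    }

    // Max Depth: visit every link within maxDepth hops of a start link.
    static List<String> crawlMaxDepth(Map<String, List<String>> site,
                                      List<String> starts, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new LinkedHashSet<>(starts);
        List<String> level = new ArrayList<>(seen);
        for (int depth = 0; depth <= maxDepth && !level.isEmpty(); depth++) {
            List<String> next = new ArrayList<>();
            for (String link : level) {
                visited.add(link);
                for (String out : site.getOrDefault(link, List.of())) {
                    if (seen.add(out)) {
                        next.add(out);
                    }
                }
            }
            level = next; // descend one level
        }
        return visited;
    }

    public static void main(String[] args) {
        // Made-up link graph standing in for a real web site.
        Map<String, List<String>> site = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/c"),
                "/b", List.of("/c", "/"),
                "/c", List.of("/d"));

        System.out.println(crawlMaxIterations(site, List.of("/"), 2)); // [/, /a]
        System.out.println(crawlMaxDepth(site, List.of("/"), 2));      // [/, /a, /b, /c]
    }
}
```

Neither method keeps the link graph itself; each only tracks which links have already been seen, which is what keeps the memory footprint small in this sketch.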