Introduction
WebMagic is a Java web crawling framework modeled after Scrapy. It separates crawling into four clean components — Downloader, PageProcessor, Scheduler, and Pipeline — so developers can customize extraction logic without dealing with HTTP connection management or threading.
What WebMagic Does
- Downloads pages through a pluggable Downloader (Apache HttpClient by default, or Selenium for JavaScript rendering)
- Extracts data using CSS selectors, XPath, or regex through a fluent Selectable API
- Manages URL scheduling with deduplication via HashSet, Redis, or Bloom filter
- Processes extracted data through pipelines for console output, JSON, or database storage
- Supports multi-threaded crawling with configurable parallelism
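The Bloom filter option mentioned in the scheduling bullet above trades a small false-positive rate for much lower memory use than a HashSet. The following is a minimal plain-Java sketch of that idea, not the implementation WebMagic ships. The class name BloomDedupSketch, the bit-array size, the hash count, and the double-hashing scheme are all assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Illustrative Bloom filter for URL deduplication (hypothetical class, not part of WebMagic).
class BloomDedupSketch {
    private final BitSet bits;
    private final int size;       // number of bits in the filter (assumed value, tune for real crawls)
    private final int hashCount;  // number of bit positions set per URL

    BloomDedupSketch(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // 32-bit FNV-1a hash, used as the second hash function.
    private static int fnv1a(String s) {
        int h = 0x811C9DC5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    // Double hashing: position i is derived from two base hashes.
    private int index(String url, int i) {
        return Math.floorMod(url.hashCode() + i * fnv1a(url), size);
    }

    // Returns true if the URL was definitely unseen (and records it),
    // false if it was probably seen before (false positives are possible).
    boolean addIfAbsent(String url) {
        boolean isNew = false;
        for (int i = 0; i < hashCount; i++) {
            int idx = index(url, i);
            if (!bits.get(idx)) {
                isNew = true;
                bits.set(idx);
            }
        }
        return isNew;
    }

    public static void main(String[] args) {
        BloomDedupSketch seen = new BloomDedupSketch(1 << 16, 3);
        System.out.println(seen.addIfAbsent("https://example.com/a")); // true: first sighting
        System.out.println(seen.addIfAbsent("https://example.com/a")); // false: already recorded
    }
}
```

A false positive here means a URL that was never crawled gets skipped, so this approach suits large crawls where missing a small fraction of pages is acceptable.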
Architecture Overview
WebMagic follows a four-component architecture inspired by Scrapy. The Downloader fetches each URL and returns the result as a Page object. The PageProcessor extracts structured data from that page and discovers new URLs. The Scheduler queues and deduplicates URLs. The Pipeline persists or displays results. The Spider class coordinates these components and drives the crawl loop on a thread pool.
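The loop described above can be modeled in a few lines of plain Java. This is a simplified, single-threaded sketch for illustration only: the interfaces below (Downloader, Processor, Pipeline) and the Scheduler class are hypothetical stand-ins, not WebMagic's own types, and an in-memory map replaces real HTTP fetching.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of a four-component crawl loop. None of these types are WebMagic classes.
class CrawlLoopSketch {

    interface Downloader { String download(String url); }
    interface Processor {
        // Fills 'fields' with extracted data and returns newly discovered URLs.
        List<String> process(String url, String body, Map<String, String> fields);
    }
    interface Pipeline { void persist(String url, Map<String, String> fields); }

    // Scheduler: FIFO queue plus a seen-set so each URL is queued at most once.
    static class Scheduler {
        private final Queue<String> queue = new ArrayDeque<>();
        private final Set<String> seen = new HashSet<>();
        void push(String url) { if (seen.add(url)) queue.add(url); }
        String poll() { return queue.poll(); }
    }

    // Drives the loop: schedule -> download -> process -> persist.
    static List<String> crawl(String seed, Downloader d, Processor p, Pipeline out) {
        Scheduler scheduler = new Scheduler();
        List<String> visited = new ArrayList<>();
        scheduler.push(seed);
        String url;
        while ((url = scheduler.poll()) != null) {
            String body = d.download(url);
            if (body == null) continue; // download failed or page missing
            visited.add(url);
            Map<String, String> fields = new LinkedHashMap<>();
            for (String next : p.process(url, body, fields)) scheduler.push(next);
            out.persist(url, fields);
        }
        return visited;
    }

    // Runs the loop against a tiny in-memory "web" so no network access is needed.
    static List<String> demo() {
        Map<String, String> web = Map.of(
                "/a", "title=A link=/b link=/c",
                "/b", "title=B link=/a",
                "/c", "title=C");
        Pattern title = Pattern.compile("title=(\\S+)");
        Pattern link = Pattern.compile("link=(\\S+)");
        Processor processor = (url, body, fields) -> {
            Matcher t = title.matcher(body);
            if (t.find()) fields.put("title", t.group(1));
            List<String> links = new ArrayList<>();
            Matcher l = link.matcher(body);
            while (l.find()) links.add(l.group(1));
            return links;
        };
        Pipeline console = (url, fields) -> System.out.println(url + " -> " + fields);
        return crawl("/a", web::get, processor, console);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [/a, /b, /c]
    }
}
```

Page /b links back to /a, but the scheduler's seen-set drops the repeat, so each page is visited exactly once. A real crawler would run the loop body on several worker threads, which requires a thread-safe queue and seen-set.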
Self-Hosting & Configuration
- Add webmagic-core and webmagic-extension as Maven dependencies
- Implement PageProcessor to define extraction logic for your target site
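- Configure the thread count on the Spider (for example, thread(5)) and the sleep interval and retry count on the Site object returned by your PageProcessor (setSleepTime, setRetryTimes)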
- Use webmagic-selenium for JavaScript-rendered pages
- Choose a Scheduler: QueueScheduler (in-memory, with HashSet-based deduplication) for small crawls, RedisScheduler for distributed crawls
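To make the three tuning knobs in the list above concrete, here is a plain-Java sketch of how a thread count, a sleep interval, and a retry count typically interact in a crawler. It is a generic illustration under assumed semantics (sleep only between retries, a fixed-size pool), and WebMagic may apply its own settings differently. The names CrawlSettingsSketch, fetchWithRetry, and fetchAll are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Generic model of thread count, sleep interval, and retry count (not WebMagic code).
class CrawlSettingsSketch {

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }

    // Makes one initial attempt plus up to 'retryTimes' retries, pausing between attempts.
    static String fetchWithRetry(String url, Function<String, String> fetch,
                                 int retryTimes, long sleepMillis) {
        if (retryTimes < 0) throw new IllegalArgumentException("retryTimes must be >= 0");
        RuntimeException last = null;
        for (int attempt = 0; attempt <= retryTimes; attempt++) {
            try {
                return fetch.apply(url);
            } catch (RuntimeException e) {
                last = e;
                if (attempt < retryTimes) pause(sleepMillis);
            }
        }
        throw last; // every attempt failed
    }

    // Fetches all URLs on a fixed-size pool; results keep the input order.
    static List<String> fetchAll(List<String> urls, Function<String, String> fetch,
                                 int threads, int retryTimes, long sleepMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(pool.submit(() -> fetchWithRetry(url, fetch, retryTimes, sleepMillis)));
            }
            List<String> bodies = new ArrayList<>();
            for (Future<String> f : futures) bodies.add(f.get());
            return bodies;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated fetcher: the first attempt for every URL fails, the second succeeds.
        Map<String, AtomicInteger> attempts = new ConcurrentHashMap<>();
        Function<String, String> flaky = url -> {
            int n = attempts.computeIfAbsent(url, k -> new AtomicInteger()).incrementAndGet();
            if (n == 1) throw new RuntimeException("simulated timeout for " + url);
            return "body of " + url;
        };
        System.out.println(fetchAll(List.of("/a", "/b", "/c"), flaky, 2, 3, 10L));
        // [body of /a, body of /b, body of /c]
    }
}
```

Raising the thread count increases throughput but also load on the target site, so it is usually tuned together with the sleep interval.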
Key Features
- Clean four-component architecture makes customization straightforward
- Fluent Selectable API chains CSS, XPath, and regex extractors
- Built-in annotation-based model extraction via @TargetUrl and @ExtractBy
- Distributed crawling support through Redis-based URL scheduling
- Proxy pool integration for rotating IPs during large-scale crawls
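The fluent chaining mentioned in the features above means each extraction step returns a new selectable value that the next step can refine. The sketch below models that pattern in plain Java with regex steps only. FluentSelectionSketch is a hypothetical stand-in rather than WebMagic's Selectable type, and CSS or XPath steps are omitted because they would need an HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a fluent, chainable selection (not WebMagic's Selectable).
class FluentSelectionSketch {
    private final List<String> values;

    FluentSelectionSketch(List<String> values) {
        this.values = List.copyOf(values);
    }

    static FluentSelectionSketch of(String text) {
        return new FluentSelectionSketch(List.of(text));
    }

    // Applies the regex to every current value and keeps capture group 1 of each match.
    // The pattern must contain at least one capturing group.
    FluentSelectionSketch regex(String pattern) {
        Pattern p = Pattern.compile(pattern);
        List<String> out = new ArrayList<>();
        for (String v : values) {
            Matcher m = p.matcher(v);
            while (m.find()) out.add(m.group(1));
        }
        return new FluentSelectionSketch(out);
    }

    List<String> all() { return values; }

    String first() { return values.isEmpty() ? null : values.get(0); }

    public static void main(String[] args) {
        String html = "<a href=\"/repo/alpha\">Alpha</a>"
                + "<a href=\"/user/bob\">Bob</a>"
                + "<a href=\"/repo/beta\">Beta</a>";
        List<String> repos = of(html)
                .regex("href=\"([^\"]+)\"")  // step 1: every href value
                .regex("^/repo/(\\w+)$")     // step 2: keep only repository names
                .all();
        System.out.println(repos); // [alpha, beta]
    }
}
```

Because every step returns an immutable selection, intermediate results can be reused for several follow-up extractions without affecting each other.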
Comparison with Similar Tools
- Scrapy — Python's leading crawler; WebMagic brings a similar architecture to Java
- Jsoup — HTML parser only; WebMagic adds scheduling, threading, and pipeline processing
- Crawlee — Node.js crawling framework; WebMagic serves the Java ecosystem
- Apache Nutch — Hadoop-scale web crawling; WebMagic is lighter and easier to embed in applications
FAQ
Q: Can WebMagic handle JavaScript-rendered pages? A: Yes. Use the Selenium downloader module to render pages in a headless browser before extraction.
Q: Does it support distributed crawling? A: Yes. Replace the default scheduler with RedisScheduler to share the URL queue across multiple JVMs.
Q: How does deduplication work? A: The scheduler tracks visited URLs in a HashSet (default) or Redis set, preventing re-crawls.
Q: Is it suitable for production scraping workloads? A: Yes, with appropriate rate limiting and proxy rotation. Many teams use it for data collection pipelines.