Scripts · Jul 1, 2026 · 3 min read

WebMagic — Scalable Web Crawler Framework for Java

A simple, flexible web crawling framework for Java that provides page extraction, multi-threaded downloading, and pipeline-based data processing out of the box.

Agent ready

Ready-to-run agent install

This asset can be installed after the agent chooses its runtime, checks the plan, and runs the matching command.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Kind: Skill
Install: Single
Trust: Established
Entrypoint: WebMagic Overview
Direct install command
npx -y tokrepo@latest install 54c37814-7527-11f1-9bc6-00163e2b0d79 --target codex

Run after dry-run confirms the install plan.

Introduction

WebMagic is a Java web crawling framework modeled after Scrapy. It separates crawling into four clean components — Downloader, PageProcessor, Scheduler, and Pipeline — so developers can customize extraction logic without dealing with HTTP connection management or threading.

What WebMagic Does

  • Downloads pages through a pluggable Downloader (Apache HttpClient by default, or Selenium for JavaScript rendering)
  • Extracts data using CSS selectors, XPath, or regex through a fluent Selectable API
  • Manages URL scheduling with deduplication via HashSet, Redis, or Bloom filter
  • Processes extracted data through pipelines for console output, JSON, or database storage
  • Supports multi-threaded crawling with configurable parallelism
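Of the deduplication options listed above, the Bloom filter is the least obvious, so here is a minimal sketch of the idea in plain Java. It is an illustration under stated assumptions, not WebMagic's implementation: the class name UrlBloomFilterSketch, its methods, and the choice of hash functions are all hypothetical. A Bloom filter trades a small chance of false positives (a new URL wrongly treated as seen) for a fixed memory footprint.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical illustration of Bloom-filter URL deduplication (not WebMagic code).
final class UrlBloomFilterSketch {
    private final BitSet bits;
    private final int size;       // number of bits in the filter
    private final int hashCount;  // number of bit positions set per URL

    UrlBloomFilterSketch(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.size = size;
        this.hashCount = hashCount;
        this.bits = new BitSet(size);
    }

    // Second base hash (32-bit FNV-1a), independent of String.hashCode().
    private static int fnv1a(String s) {
        int hash = 0x811C9DC5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= 16777619;
        }
        return hash;
    }

    // Double hashing: derive the i-th bit position from two base hashes.
    private int position(String url, int i) {
        return Math.floorMod(url.hashCode() + i * fnv1a(url), size);
    }

    /** Records the URL; returns true only if at least one of its bits was not yet set. */
    boolean addIfAbsent(String url) {
        boolean isNew = false;
        for (int i = 0; i < hashCount; i++) {
            int p = position(url, i);
            if (!bits.get(p)) {
                isNew = true;
                bits.set(p);
            }
        }
        return isNew;
    }

    /** False means "definitely never added"; true means "probably added". */
    boolean mightContain(String url) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(url, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        UrlBloomFilterSketch filter = new UrlBloomFilterSketch(1 << 16, 3);
        System.out.println(filter.addIfAbsent("https://example.com/a")); // first visit
        System.out.println(filter.addIfAbsent("https://example.com/a")); // repeat visit
    }
}
```

A scheduler built on this would enqueue a URL only when addIfAbsent returns true. The filter never forgets a URL it has recorded, but it may occasionally reject a URL it has not seen, so the bit count should be sized generously relative to the expected crawl.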

Architecture Overview

WebMagic follows a four-component architecture inspired by Scrapy. The Downloader fetches each URL and wraps the response in a Page object. The PageProcessor extracts structured data from that Page and discovers new URLs. The Scheduler queues and deduplicates URLs. The Pipeline persists or displays results. A Spider coordinates these components and drives the crawl loop on a configurable thread pool.
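The crawl loop described above can be modelled in a few lines of plain Java. The sketch below is a simplified, single-threaded illustration and not WebMagic's source: the class CrawlLoopSketch, its nested Downloader and Pipeline interfaces, and the in-memory fakeWeb map are hypothetical names, and link extraction uses a naive regex instead of WebMagic's selectors.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model of the four roles; none of these types are WebMagic's own.
final class CrawlLoopSketch {
    /** Downloader role: returns the page body, or null if the URL cannot be fetched. */
    interface Downloader { String download(String url); }

    /** Pipeline role: receives one extracted result per page. */
    interface Pipeline { void process(String url, String title); }

    private static final Pattern LINK = Pattern.compile("href=\"([^\"]+)\"");
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");

    // Scheduler role: a FIFO queue plus a set of URLs already accepted.
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    private final Downloader downloader;
    private final Pipeline pipeline;

    CrawlLoopSketch(Downloader downloader, Pipeline pipeline) {
        this.downloader = downloader;
        this.pipeline = pipeline;
    }

    private void schedule(String url) {
        if (seen.add(url)) {   // add() returns false for a duplicate
            queue.add(url);
        }
    }

    // PageProcessor role: schedule every linked URL and extract the title.
    private String process(String html) {
        Matcher links = LINK.matcher(html);
        while (links.find()) {
            schedule(links.group(1));
        }
        Matcher title = TITLE.matcher(html);
        return title.find() ? title.group(1) : "";
    }

    /** Spider role: repeat download, process, persist until the queue is empty. */
    void run(String startUrl) {
        schedule(startUrl);
        String url;
        while ((url = queue.poll()) != null) {
            String html = downloader.download(url);
            if (html == null) {
                continue;      // skip pages that failed to download
            }
            pipeline.process(url, process(html));
        }
    }

    public static void main(String[] args) {
        // A two-page "site" held in memory so the sketch needs no network access.
        Map<String, String> fakeWeb = Map.of(
            "/a", "<title>A</title><a href=\"/b\">b</a><a href=\"/a\">self</a>",
            "/b", "<title>B</title><a href=\"/a\">back</a>");
        List<String> results = new ArrayList<>();
        new CrawlLoopSketch(fakeWeb::get, (u, t) -> results.add(u + " -> " + t)).run("/a");
        results.forEach(System.out::println);
    }
}
```

Each page is visited once even though the two pages link to each other, because the scheduling step rejects URLs it has already accepted. In a real crawl the loop body would run on several worker threads, which is why the scheduler needs to be thread-safe.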

Self-Hosting & Configuration

  • Add webmagic-core and webmagic-extension as Maven dependencies
  • Implement PageProcessor to define extraction logic for your target site
  • Configure the thread count on the Spider, and the sleep interval and retry count on the Site returned by your PageProcessor
  • Use webmagic-selenium for JavaScript-rendered pages
  • Choose a Scheduler: QueueScheduler (in-memory queue with HashSet-based deduplication) for small crawls, RedisScheduler for distributed crawls

Key Features

  • Clean four-component architecture makes customization straightforward
  • Fluent Selectable API chains CSS, XPath, and regex extractors
  • Built-in annotation-based model extraction via @TargetUrl and @ExtractBy
  • Distributed crawling support through Redis-based URL scheduling
  • Proxy pool integration for rotating IPs during large-scale crawls
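The proxy rotation mentioned in the list above amounts to handing each request the next proxy from a shared pool. The following is a minimal round-robin sketch using only java.net types. It does not show WebMagic's proxy API: the class RoundRobinProxyPoolSketch, its methods, and the example host names are hypothetical.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical round-robin proxy pool (not WebMagic's proxy provider).
final class RoundRobinProxyPoolSketch {
    private final List<Proxy> proxies;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinProxyPoolSketch(List<Proxy> proxies) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("at least one proxy is required");
        }
        this.proxies = List.copyOf(proxies);
    }

    /** Convenience factory; the address is left unresolved, so no DNS lookup happens here. */
    static Proxy http(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    /** Thread-safe: each call returns the next proxy, wrapping around at the end. */
    Proxy nextProxy() {
        // floorMod keeps the index valid even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }

    public static void main(String[] args) {
        RoundRobinProxyPoolSketch pool = new RoundRobinProxyPoolSketch(List.of(
            http("proxy-a.example", 8080),
            http("proxy-b.example", 8080)));
        for (int i = 0; i < 3; i++) {
            System.out.println(pool.nextProxy());
        }
    }
}
```

Round-robin is the simplest policy. A production pool would typically also drop proxies that fail repeatedly, which this sketch leaves out.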

Comparison with Similar Tools

  • Scrapy — Python's leading crawler; WebMagic brings a similar architecture to Java
  • Jsoup — HTML parser only; WebMagic adds scheduling, threading, and pipeline processing
  • Crawlee — Node.js crawling framework; WebMagic serves the Java ecosystem
  • Apache Nutch — Hadoop-scale web crawling; WebMagic is lighter and easier to embed in applications

FAQ

Q: Can WebMagic handle JavaScript-rendered pages? A: Yes. Use the Selenium downloader from the webmagic-selenium module to render pages in a browser before extraction.

Q: Does it support distributed crawling? A: Yes. Replace the default scheduler with RedisScheduler to share the URL queue across multiple JVMs.

Q: How does deduplication work? A: The scheduler records every URL it has accepted, in an in-memory HashSet by default or in a Redis set or Bloom filter if configured, and drops any URL it has already seen.
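The deduplication answer above can be sketched as a small thread-safe scheduler. This is a plain-Java illustration of the push/poll contract, not WebMagic's scheduler: DedupSchedulerSketch and its method names are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical in-memory scheduler with set-based deduplication (not WebMagic code).
final class DedupSchedulerSketch {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    /** Enqueues the URL the first time it is offered; returns whether it was accepted. */
    boolean push(String url) {
        // add() on a concurrent key set is atomic, so only one thread wins per URL.
        if (seen.add(url)) {
            queue.add(url);
            return true;
        }
        return false;
    }

    /** Returns the next URL to crawl, or null if the queue is currently empty. */
    String poll() {
        return queue.poll();
    }

    public static void main(String[] args) throws InterruptedException {
        DedupSchedulerSketch scheduler = new DedupSchedulerSketch();
        // Four threads offer the same 100 URLs; each URL should be queued once.
        Runnable producer = () -> {
            for (int i = 0; i < 100; i++) {
                scheduler.push("https://example.com/page/" + i);
            }
        };
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(producer);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        int polled = 0;
        while (scheduler.poll() != null) {
            polled++;
        }
        System.out.println(polled + " unique URLs queued");
    }
}
```

Swapping the in-memory set and queue for shared Redis structures is what lets several JVMs cooperate on one crawl, since every worker then consults the same record of seen URLs.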

Q: Is it suitable for production scraping workloads? A: Yes, with appropriate rate limiting and proxy rotation. Many teams use it for data collection pipelines.
