Scripts · July 1, 2026 · 1 min read

WebMagic — Scalable Web Crawler Framework for Java

A simple, flexible web crawling framework for Java that provides page extraction, multi-threaded downloading, and pipeline-based data processing out of the box.

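Agent ready

Installable directly by an agent

This asset is installable: the agent first selects the current runtime, checks the install plan, then runs the matching command.

Native · 98/100 · Policy: allowed
Agent entry: any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry: WebMagic Overview

Direct install command:
npx -y tokrepo@latest install 54c37814-7527-11f1-9bc6-00163e2b0d79 --target codex

Do a dry run first to confirm the install plan, then run this command.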

Introduction

WebMagic is a Java web crawling framework modeled after Scrapy. It separates crawling into four clean components — Downloader, PageProcessor, Scheduler, and Pipeline — so developers can customize extraction logic without dealing with HTTP connection management or threading.

What WebMagic Does

  • Downloads pages through pluggable downloaders (Apache HttpClient by default, Selenium for JavaScript rendering)
  • Extracts data using CSS selectors, XPath, or regex through a fluent Selectable API
  • Manages URL scheduling with deduplication via HashSet, Redis, or Bloom filter
  • Processes extracted data through pipelines for console output, JSON, or database storage
  • Supports multi-threaded crawling with configurable parallelism

Architecture Overview

WebMagic follows a four-component architecture inspired by Scrapy. The Downloader fetches a page and wraps the response in a Page object. The PageProcessor extracts structured data from that Page and discovers new URLs. The Scheduler queues and deduplicates URLs. The Pipeline persists or displays results. The Spider class coordinates these components and runs the crawl loop on a configurable thread pool.
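
The loop described above can be illustrated with a small plain-Java sketch. It does not use WebMagic's real classes: the Downloader, Processor, and Pipeline interfaces, the DedupScheduler class, and the in-memory "site" are hypothetical stand-ins, and the loop runs on a single thread to keep the output deterministic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class CrawlLoopSketch {

    /** Extracted data plus newly discovered links for one page. */
    static final class Result {
        final String title;
        final List<String> links;
        Result(String title, List<String> links) { this.title = title; this.links = links; }
    }

    // Hypothetical stand-ins for three of the four roles (not WebMagic's interfaces).
    interface Downloader { String download(String url); }
    interface Processor { Result process(String body); }
    interface Pipeline { void save(String url, String title); }

    /** Scheduler role: a FIFO queue that drops URLs it has already seen. */
    static final class DedupScheduler {
        private final Queue<String> queue = new ArrayDeque<>();
        private final Set<String> seen = new HashSet<>();
        void push(String url) { if (seen.add(url)) queue.add(url); }
        String poll() { return queue.poll(); }
    }

    /** Drives the loop: poll, download, process, persist, enqueue new links. */
    static void crawl(String seed, Downloader downloader, Processor processor, Pipeline pipeline) {
        DedupScheduler scheduler = new DedupScheduler();
        scheduler.push(seed);
        String url;
        while ((url = scheduler.poll()) != null) {
            String body = downloader.download(url);
            if (body == null) continue; // treat a missing page as a failed download
            Result result = processor.process(body);
            pipeline.save(url, result.title);
            result.links.forEach(scheduler::push);
        }
    }

    /** Crawls a tiny in-memory "site" and returns what the pipeline received. */
    public static Map<String, String> demo() {
        // Fake pages in the form "title|link,link"; the pages link back to each other.
        Map<String, String> site = Map.of(
                "/", "Home|/a,/b",
                "/a", "Page A|/b",
                "/b", "Page B|/a,/");
        Map<String, String> saved = new LinkedHashMap<>();
        crawl("/", site::get, body -> {
            String[] parts = body.split("\\|", 2);
            List<String> links = new ArrayList<>();
            for (String link : parts[1].split(",")) {
                if (!link.isEmpty()) links.add(link);
            }
            return new Result(parts[0], links);
        }, saved::put);
        return saved;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {/=Home, /a=Page A, /b=Page B}
    }
}
```

Each page is saved once even though the fake pages link to one another, because the scheduler refuses URLs it has already queued.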

Self-Hosting & Configuration

  • Add webmagic-core and webmagic-extension as Maven dependencies
  • Implement PageProcessor to define extraction logic for your target site
  • Set the thread count on Spider (thread(n)), and the sleep interval and retry policy on Site (setSleepTime, setRetryTimes)
  • Use webmagic-selenium for JavaScript-rendered pages
  • Choose a Scheduler: the default QueueScheduler (in-memory, deduplicating through HashSetDuplicateRemover) for small crawls, RedisScheduler for distributed crawls
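
Put together, the configuration steps above might look like the following sketch. It assumes the webmagic-core dependency (0.7.x-style API) is on the classpath. The class name ExamplePageProcessor, the https://example.com/ target, and the "title" field are placeholders, so check each call against the version you actually depend on.

```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

// Hypothetical processor for a placeholder site.
public class ExamplePageProcessor implements PageProcessor {

    // Politeness and retry settings are configured on Site.
    private final Site site = Site.me()
            .setSleepTime(1000)  // wait 1000 ms between requests
            .setRetryTimes(3);   // retry a failed download up to 3 times

    @Override
    public void process(Page page) {
        // Store the page title; the pipeline receives it as a result field.
        page.putField("title", page.getHtml().xpath("//title/text()").toString());
        // Queue every link that stays on the placeholder host.
        page.addTargetRequests(
                page.getHtml().links().regex("(https://example\\.com/.*)").all());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new ExamplePageProcessor())
                .addUrl("https://example.com/")
                .addPipeline(new ConsolePipeline()) // print results to stdout
                .thread(5)                          // worker threads are set on Spider
                .run();
    }
}
```

No scheduler is set here, so the in-memory default applies. A distributed crawl would pass a Redis-backed scheduler to the Spider instead.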

Key Features

  • Clean four-component architecture makes customization straightforward
  • Fluent Selectable API chains CSS, XPath, and regex extractors
  • Built-in annotation-based model extraction via @TargetUrl and @ExtractBy
  • Distributed crawling support through Redis-based URL scheduling
  • Proxy pool integration for rotating IPs during large-scale crawls
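
The proxy pool bullet refers to spreading outgoing requests across several proxies. A minimal round-robin sketch in plain Java shows the idea. ProxyRotationSketch is a hypothetical class, not WebMagic's own proxy provider, and the addresses come from the 192.0.2.0/24 documentation range.

```java
import java.net.InetSocketAddress;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ProxyRotationSketch {
    private final List<InetSocketAddress> proxies;
    private final AtomicInteger cursor = new AtomicInteger();

    public ProxyRotationSketch(List<InetSocketAddress> proxies) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("need at least one proxy");
        }
        this.proxies = List.copyOf(proxies);
    }

    /** Returns the next proxy in round-robin order; safe to call from many threads. */
    public InetSocketAddress next() {
        // floorMod keeps the index valid even after the counter overflows to negative.
        int index = Math.floorMod(cursor.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }

    public static void main(String[] args) {
        ProxyRotationSketch pool = new ProxyRotationSketch(List.of(
                InetSocketAddress.createUnresolved("192.0.2.10", 8080),
                InetSocketAddress.createUnresolved("192.0.2.11", 8080)));
        for (int i = 0; i < 3; i++) {
            // prints 192.0.2.10, then 192.0.2.11, then 192.0.2.10 again
            System.out.println(pool.next().getHostString());
        }
    }
}
```

A production pool would usually also track failures and drop dead proxies; this sketch only covers the rotation itself.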

Comparison with Similar Tools

  • Scrapy — Python's leading crawler; WebMagic brings a similar architecture to Java
  • Jsoup — HTML parser only; WebMagic adds scheduling, threading, and pipeline processing
  • Crawlee — Node.js crawling framework; WebMagic serves the Java ecosystem
  • Apache Nutch — Hadoop-scale web crawling; WebMagic is lighter and easier to embed in applications

FAQ

Q: Can WebMagic handle JavaScript-rendered pages?
A: Yes. Use the Selenium downloader module to render pages in a headless browser before extraction.

Q: Does it support distributed crawling?
A: Yes. Replace the default scheduler with RedisScheduler to share the URL queue across multiple JVMs.

Q: How does deduplication work?
A: The scheduler tracks visited URLs in a HashSet (default) or Redis set, preventing re-crawls.

Q: Is it suitable for production scraping workloads?
A: Yes, with appropriate rate limiting and proxy rotation. Many teams use it for data collection pipelines.
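
To make the deduplication answer above more concrete, here is a minimal plain-Java sketch of an in-memory seen-set shared by several worker threads. DedupSketch, isDuplicate, and acceptedCount are hypothetical names for illustration, not WebMagic's classes, and the URLs are placeholders.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DedupSketch {
    // Concurrent set so that many crawler threads can check URLs at once.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns true if the URL was already recorded; records it otherwise. */
    public boolean isDuplicate(String url) {
        // add() is atomic: exactly one caller wins for any given URL.
        return !seen.add(url);
    }

    /** Lets several threads offer the same URLs and counts how many were accepted. */
    public static int acceptedCount(int threads, int urls) {
        DedupSketch dedup = new DedupSketch();
        AtomicInteger accepted = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < urls; i++) {
                    if (!dedup.isDuplicate("https://example.com/page/" + i)) {
                        accepted.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return accepted.get();
    }

    public static void main(String[] args) {
        // Four threads each offer the same 100 URLs; each URL is accepted once.
        System.out.println(acceptedCount(4, 100)); // prints 100
    }
}
```

An in-memory set like this only deduplicates within one JVM, which is why a shared store such as Redis is needed once the URL queue spans several processes.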
