Scripts · Jul 1, 2026 · 3 min read

WebMagic — Scalable Web Crawler Framework for Java

A simple, flexible web crawling framework for Java that provides page extraction, multi-threaded downloading, and pipeline-based data processing out of the box.

Direct install command
npx -y tokrepo@latest install 54c37814-7527-11f1-9bc6-00163e2b0d79 --target codex

Run this after confirming the plan with a dry run.

Introduction

WebMagic is a Java web crawling framework modeled after Scrapy. It separates crawling into four clean components — Downloader, PageProcessor, Scheduler, and Pipeline — so developers can customize extraction logic without dealing with HTTP connection management or threading.

What WebMagic Does

  • Downloads pages with configurable HTTP clients (HttpClient or Selenium for JS rendering)
  • Extracts data using CSS selectors, XPath, or regex through a fluent Selectable API
  • Manages URL scheduling with deduplication via HashSet, Redis, or Bloom filter
  • Processes extracted data through pipelines for console output, JSON, or database storage
  • Supports multi-threaded crawling with configurable parallelism
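
Of the deduplication options listed above, the Bloom filter is the least self-explanatory, so here is a minimal sketch of the idea in plain Java. It is not WebMagic's implementation: the class name `UrlBloomFilter`, the double-hashing scheme, and the sizes are all assumptions chosen for illustration. The point it shows is the trade-off: memory stays fixed, and in exchange the filter may occasionally report an unseen URL as seen (a false positive), but never the reverse.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Toy Bloom filter for URL deduplication (illustrative, not WebMagic code). */
class UrlBloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits in the filter
    private final int hashCount;  // number of bit positions set per URL

    UrlBloomFilter(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Records the URL; returns true only if at least one of its bits was unset. */
    boolean addIfAbsent(String url) {
        boolean isNew = false;
        for (int i = 0; i < hashCount; i++) {
            int index = indexFor(url, i);
            if (!bits.get(index)) {
                isNew = true;
                bits.set(index);
            }
        }
        return isNew;
    }

    /** May return a false positive, never a false negative. */
    boolean mightContain(String url) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(indexFor(url, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: derive the i-th position from two base hashes.
    private int indexFor(String url, int i) {
        int h1 = fnv1a(url.getBytes(StandardCharsets.UTF_8));
        int h2 = url.hashCode() | 1; // force odd so the step is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    // 32-bit FNV-1a hash.
    private static int fnv1a(byte[] data) {
        int hash = 0x811C9DC5;
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;
        }
        return hash;
    }
}
```

A scheduler built on this would call `addIfAbsent(url)` before queueing and drop the URL when it returns false. For small crawls an exact `HashSet` is simpler and has no false positives.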

Architecture Overview

WebMagic follows a four-component architecture inspired by Scrapy. The Downloader fetches a page and wraps the response in a Page object. The PageProcessor extracts structured data from that page and discovers new URLs. The Scheduler queues and deduplicates URLs. The Pipeline persists or displays results. A Spider coordinates these components and drives the crawl loop from a thread pool.
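
To make the division of labour concrete, the following is a single-threaded toy model of that loop in plain Java. The interface and class names mirror the roles described above but are hypothetical; they are not WebMagic's actual types or signatures, and the "site" is an in-memory map rather than real HTTP.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy model of the four roles; names follow the text, not WebMagic's classes. */
class CrawlLoopSketch {
    interface Downloader { String download(String url); }            // fetch raw HTML
    interface Processor { Result process(String url, String html); } // extract data and links
    interface Pipeline { void persist(String url, String title); }   // store or print results

    /** Extracted title plus the links discovered on the page. */
    static final class Result {
        final String title;
        final List<String> links;
        Result(String title, List<String> links) { this.title = title; this.links = links; }
    }

    /** Scheduler role: FIFO queue that drops URLs it has already seen. */
    static final class DedupScheduler {
        private final Queue<String> queue = new ArrayDeque<>();
        private final Set<String> seen = new HashSet<>();
        void push(String url) { if (seen.add(url)) queue.add(url); }
        String poll() { return queue.poll(); }
    }

    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");
    private static final Pattern LINK = Pattern.compile("href=\"(.*?)\"");

    /** Regex-based processor, adequate only for the toy pages below. */
    static Result extract(String url, String html) {
        Matcher t = TITLE.matcher(html);
        String title = t.find() ? t.group(1) : "";
        List<String> links = new ArrayList<>();
        Matcher l = LINK.matcher(html);
        while (l.find()) links.add(l.group(1));
        return new Result(title, links);
    }

    /** The crawl loop: poll, download, process, persist, enqueue new links. */
    static List<String> crawl(String seed, Downloader d, Processor p, Pipeline out) {
        DedupScheduler scheduler = new DedupScheduler();
        List<String> visited = new ArrayList<>();
        scheduler.push(seed);
        for (String url = scheduler.poll(); url != null; url = scheduler.poll()) {
            String html = d.download(url);
            if (html == null) continue; // failed download: skip this URL
            Result r = p.process(url, html);
            out.persist(url, r.title);
            r.links.forEach(scheduler::push);
            visited.add(url);
        }
        return visited;
    }

    /** Crawls an in-memory "site" that has a cycle between /a and /b. */
    static List<String> demo() {
        Map<String, String> site = Map.of(
            "/a", "<title>A</title><a href=\"/b\">b</a><a href=\"/c\">c</a>",
            "/b", "<title>B</title><a href=\"/a\">a</a>",
            "/c", "<title>C</title>");
        return crawl("/a", site::get, CrawlLoopSketch::extract,
                     (url, title) -> System.out.println(url + " -> " + title));
    }
}
```

Calling `CrawlLoopSketch.demo()` visits `/a`, `/b`, and `/c` once each, even though `/b` links back to `/a`, because the scheduler refuses URLs it has already seen. Swapping any one role (say, a pipeline that writes JSON) leaves the other three untouched, which is the customization benefit the architecture is aiming for.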

Self-Hosting & Configuration

  • Add webmagic-core and webmagic-extension as Maven dependencies
  • Implement PageProcessor to define extraction logic for your target site
  • Set the thread count on the Spider, and the sleep interval and retry count on the Site object returned by your PageProcessor
  • Use webmagic-selenium for JavaScript-rendered pages
  • Choose a Scheduler: the in-memory QueueScheduler (the default) for small crawls, RedisScheduler for distributed ones
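
The sleep interval and retry settings mentioned in the list above boil down to a simple loop around each download. The sketch below shows those semantics in plain Java; `RetryingFetcher` and its parameter names are hypothetical and are not part of WebMagic, which handles this internally once configured.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper showing retry-with-pause semantics (not WebMagic code). */
class RetryingFetcher {
    private final int retryTimes;   // extra attempts allowed after the first failure
    private final long sleepMillis; // pause between attempts

    RetryingFetcher(int retryTimes, long sleepMillis) {
        if (retryTimes < 0 || sleepMillis < 0) {
            throw new IllegalArgumentException("retryTimes and sleepMillis must be >= 0");
        }
        this.retryTimes = retryTimes;
        this.sleepMillis = sleepMillis;
    }

    /** Runs the task up to retryTimes + 1 times, rethrowing the last failure. */
    <T> T fetch(Callable<T> task) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= retryTimes; attempt++) {
            try {
                return task.call();
            } catch (InterruptedException e) {
                // Do not retry on interruption; restore the flag and propagate.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < retryTimes) {
                    Thread.sleep(sleepMillis); // back off before the next attempt
                }
            }
        }
        throw last;
    }
}
```

With `new RetryingFetcher(3, 1000)`, a download that fails twice and then succeeds returns normally after roughly two seconds of pauses; one that fails four times surfaces the fourth exception to the caller.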

Key Features

  • Clean four-component architecture makes customization straightforward
  • Fluent Selectable API chains CSS, XPath, and regex extractors
  • Built-in annotation-based model extraction via @TargetUrl and @ExtractBy
  • Distributed crawling support through Redis-based URL scheduling
  • Proxy pool integration for rotating IPs during large-scale crawls
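
Proxy rotation, the last item in the list above, is at its simplest a round-robin over a fixed list. The following sketch uses only `java.net` types to show that idea; `RoundRobinProxyPool` is a hypothetical name and does not correspond to WebMagic's own proxy classes, and the proxy hostnames in the usage note are placeholders.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Round-robin proxy rotation; a stand-in for the idea, not WebMagic code. */
class RoundRobinProxyPool {
    private final List<Proxy> proxies;
    private final AtomicInteger cursor = new AtomicInteger();

    RoundRobinProxyPool(List<InetSocketAddress> addresses) {
        if (addresses.isEmpty()) {
            throw new IllegalArgumentException("need at least one proxy address");
        }
        List<Proxy> built = new ArrayList<>();
        for (InetSocketAddress address : addresses) {
            built.add(new Proxy(Proxy.Type.HTTP, address));
        }
        this.proxies = Collections.unmodifiableList(built);
    }

    /** Thread-safe: each call hands out the next proxy, wrapping around. */
    Proxy next() {
        // floorMod keeps the index valid even after the counter overflows.
        int index = Math.floorMod(cursor.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }
}
```

A downloader thread could then open each connection with `url.openConnection(pool.next())`, building the pool from addresses such as `InetSocketAddress.createUnresolved("proxy-a.example", 8080)`. A production pool would usually also track failures and temporarily retire proxies that stop responding, which this sketch leaves out.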

Comparison with Similar Tools

  • Scrapy — Python's leading crawler; WebMagic brings a similar architecture to Java
  • Jsoup — HTML parser only; WebMagic adds scheduling, threading, and pipeline processing
  • Crawlee — Node.js crawling framework; WebMagic serves the Java ecosystem
  • Apache Nutch — Hadoop-scale web crawling; WebMagic is lighter and easier to embed in applications

FAQ

Q: Can WebMagic handle JavaScript-rendered pages? A: Yes. Use the Selenium downloader module to render pages in a headless browser before extraction.

Q: Does it support distributed crawling? A: Yes. Replace the default scheduler with RedisScheduler to share the URL queue across multiple JVMs.

Q: How does deduplication work? A: The scheduler records every URL it has seen, in an in-memory HashSet by default or in a Redis set or Bloom filter, and skips any URL that is already recorded.

Q: Is it suitable for production scraping workloads? A: Yes, with appropriate rate limiting and proxy rotation. Many teams use it for data collection pipelines.
