What is WebMagic — Scalable Web Crawler Framework for Java?

WebMagic is a simple, flexible web crawling framework for Java that provides page extraction, multi-threaded downloading, and pipeline-based data processing out of the box.

WebMagic — Scalable Web Crawler Framework for Java

Introduction

WebMagic is a Java web crawling framework modeled after Scrapy. It separates crawling into four components (Downloader, PageProcessor, Scheduler, and Pipeline) so developers can customize extraction logic without managing HTTP connections or threads themselves.

What WebMagic Does

Downloads pages with configurable HTTP clients (HttpClient or Selenium for JS rendering)
Extracts data using CSS selectors, XPath, or regex through a fluent Selectable API
Manages URL scheduling with deduplication via HashSet, Redis, or Bloom filter
Processes extracted data through pipelines for console output, JSON, or database storage
Supports multi-threaded crawling with configurable parallelism
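
To make the extraction step more concrete, here is a minimal sketch of a chainable selector in which each call narrows the previous result. It uses only java.util.regex and a hypothetical TextSelection class invented for this illustration; it is not WebMagic's Selectable API, which also covers CSS selectors and XPath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a chainable selector; not WebMagic's Selectable.
final class TextSelection {
    private final List<String> items;

    private TextSelection(List<String> items) {
        this.items = items;
    }

    static TextSelection of(String text) {
        return new TextSelection(Arrays.asList(text));
    }

    // Apply the pattern to every current item and keep capture group 1 of each match.
    TextSelection regex(String pattern) {
        Pattern compiled = Pattern.compile(pattern);
        List<String> next = new ArrayList<>();
        for (String item : items) {
            Matcher matcher = compiled.matcher(item);
            while (matcher.find()) {
                next.add(matcher.group(1));
            }
        }
        return new TextSelection(next);
    }

    // All current items, as a copy.
    List<String> all() {
        return new ArrayList<>(items);
    }

    // First item, or null when nothing matched.
    String first() {
        return items.isEmpty() ? null : items.get(0);
    }

    public static void main(String[] args) {
        String html = "<ul><li><a href=\"/post/1\">One</a></li>"
                    + "<li><a href=\"/post/2\">Two</a></li></ul>";
        List<String> ids = TextSelection.of(html)
                .regex("href=\"([^\"]+)\"")   // narrow to link targets
                .regex("/post/(\\d+)")        // narrow again to numeric ids
                .all();
        System.out.println(ids); // prints [1, 2]
    }
}
```

Returning a new selection from every call is what lets the steps be chained; a real selector would offer CSS and XPath steps alongside the regex one.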

Architecture Overview

WebMagic follows a four-component architecture inspired by Scrapy. The Downloader fetches a page and wraps the response in a Page object. The PageProcessor extracts structured data from that page and discovers new URLs to follow. The Scheduler queues pending URLs and removes duplicates. The Pipeline persists or displays the extracted results. A Spider ties the four together, running the crawl loop on a configurable thread pool.
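
The division of labor described above can be sketched as a small crawl loop. This is a simplified, single-threaded model written with the Java standard library only: the interfaces, the Result class, and the in-memory "site" are hypothetical names chosen for illustration, and WebMagic's real interfaces carry more detail (requests, sites, tasks, threading).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

final class CrawlLoopSketch {
    // Hypothetical, simplified roles; not WebMagic's actual interfaces.
    interface Downloader { String download(String url); }
    interface Processor { Result process(String url, String body); }
    interface Pipeline { void accept(String item); }

    // What a processor hands back: one extracted item plus the links it found.
    static final class Result {
        final String item;
        final List<String> links;
        Result(String item, List<String> links) {
            this.item = item;
            this.links = links;
        }
    }

    // Scheduler role: FIFO queue plus a seen-set for deduplication.
    static final class Scheduler {
        private final Queue<String> queue = new ArrayDeque<>();
        private final Set<String> seen = new HashSet<>();
        void push(String url) {
            if (seen.add(url)) {
                queue.add(url);
            }
        }
        String poll() {
            return queue.poll();
        }
    }

    // The loop a spider would drive: poll, download, process, store, enqueue.
    static void crawl(String seed, Downloader downloader, Processor processor, Pipeline pipeline) {
        Scheduler scheduler = new Scheduler();
        scheduler.push(seed);
        String url;
        while ((url = scheduler.poll()) != null) {
            String body = downloader.download(url);
            if (body == null) {
                continue; // page missing: skip it
            }
            Result result = processor.process(url, body);
            pipeline.accept(result.item);
            for (String link : result.links) {
                scheduler.push(link);
            }
        }
    }

    static List<String> demo() {
        // In-memory "site": each page body is a comma-separated list of links.
        Map<String, String> site = new HashMap<>();
        site.put("/a", "/b,/c");
        site.put("/b", "/a,/c");
        site.put("/c", "");
        List<String> stored = new ArrayList<>();
        crawl("/a",
              site::get,
              (url, body) -> new Result("visited " + url,
                      body.isEmpty() ? Collections.<String>emptyList()
                                     : Arrays.asList(body.split(","))),
              stored::add);
        return stored;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [visited /a, visited /b, visited /c]
    }
}
```

Although /a and /c are each linked more than once, every page is processed exactly once because the scheduler drops URLs it has already seen.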

Self-Hosting & Configuration

Add webmagic-core and webmagic-extension as Maven dependencies
Implement PageProcessor to define extraction logic for your target site
Set the thread count on the Spider, and the sleep interval and retry count on the Site object returned by your PageProcessor
Use webmagic-selenium for JavaScript-rendered pages
Choose a Scheduler: the in-memory QueueScheduler (the default, with HashSet-based deduplication) for small crawls, RedisScheduler for distributed crawls
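
How the thread count, sleep interval, and retry settings from the list above interact can be sketched with the Java standard library alone. The class name CrawlSettingsSketch, its fields, and the fake fetcher are assumptions made for this illustration; they are not WebMagic classes, and the framework's own retry behavior may differ in detail.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical settings holder; names are illustrative only.
final class CrawlSettingsSketch {
    private final int threads;
    private final long sleepMillis;
    private final int retries;

    CrawlSettingsSketch(int threads, long sleepMillis, int retries) {
        this.threads = threads;
        this.sleepMillis = sleepMillis;
        this.retries = retries;
    }

    // One initial attempt plus up to `retries` more, sleeping between attempts.
    // Returns null when every attempt fails.
    String fetchWithRetry(String url, Function<String, String> fetcher) {
        for (int attempt = 0; attempt <= retries; attempt++) {
            String body = fetcher.apply(url);
            if (body != null) {
                return body;
            }
            if (attempt < retries) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return null;
                }
            }
        }
        return null;
    }

    // Fetch all URLs on a fixed-size pool; the result is sorted by URL.
    Map<String, String> fetchAll(List<String> urls, Function<String, String> fetcher) {
        Map<String, String> results = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : urls) {
            pool.execute(() -> {
                String body = fetchWithRetry(url, fetcher);
                if (body != null) {
                    results.put(url, body);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new TreeMap<>(results);
    }

    static Map<String, String> demo() {
        // Fake fetcher: "/flaky" fails twice before succeeding, "/down" always fails.
        AtomicInteger flakyCalls = new AtomicInteger();
        Function<String, String> fetcher = url -> {
            if (url.equals("/flaky")) {
                return flakyCalls.incrementAndGet() < 3 ? null : "ok after retries";
            }
            if (url.equals("/down")) {
                return null;
            }
            return "ok";
        };
        return new CrawlSettingsSketch(2, 10L, 3)
                .fetchAll(Arrays.asList("/home", "/flaky", "/down"), fetcher);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {/flaky=ok after retries, /home=ok}
    }
}
```

The sleep interval doubles as a politeness delay toward the target site, so raising the thread count without also considering the delay can multiply the request rate.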

Key Features

Clean four-component architecture makes customization straightforward
Fluent Selectable API chains CSS, XPath, and regex extractors
Built-in annotation-based model extraction via @TargetUrl and @ExtractBy
Distributed crawling support through Redis-based URL scheduling
Proxy pool integration for rotating IPs during large-scale crawls

Comparison with Similar Tools

Scrapy — Python's leading crawler; WebMagic brings a similar architecture to Java
Jsoup — HTML parser only; WebMagic adds scheduling, threading, and pipeline processing
Crawlee — Node.js crawling framework; WebMagic serves the Java ecosystem
Apache Nutch — Hadoop-scale web crawling; WebMagic is lighter and easier to embed in applications

FAQ

Q: Can WebMagic handle JavaScript-rendered pages? A: Yes. Use the Selenium downloader module to render pages in a headless browser before extraction.

Q: Does it support distributed crawling? A: Yes. Replace the default scheduler with RedisScheduler to share the URL queue across multiple JVMs.

Q: How does deduplication work? A: The scheduler tracks visited URLs in an in-memory HashSet (the default), a Redis set, or a Bloom filter, and skips any URL it has already seen.
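
As a rough illustration of the trade-off between the HashSet approach and the Bloom filter option listed under What WebMagic Does, the sketch below implements both behind one hypothetical DuplicateCheck interface. It is a standard-library model, not WebMagic's own duplicate-remover code; the bit-array size and hash count are arbitrary choices for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

final class DedupSketch {
    // Hypothetical interface: returns true when the URL was already seen,
    // and records the URL as seen otherwise.
    interface DuplicateCheck {
        boolean isDuplicate(String url);
    }

    // Exact: remembers every URL, so memory grows with the crawl.
    static final class ExactCheck implements DuplicateCheck {
        private final Set<String> seen = new HashSet<>();

        @Override
        public boolean isDuplicate(String url) {
            return !seen.add(url);
        }
    }

    // Approximate: fixed memory. It may occasionally report a new URL as seen
    // (a false positive), but never reports a seen URL as new.
    static final class BloomCheck implements DuplicateCheck {
        private final BitSet bits;
        private final int size;
        private final int hashes;

        BloomCheck(int size, int hashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashes = hashes;
        }

        @Override
        public boolean isDuplicate(String url) {
            int h1 = url.hashCode();
            int h2 = fnv1a(url);
            boolean allSet = true;
            // Double hashing: derive several bit positions from two base hashes.
            for (int i = 0; i < hashes; i++) {
                int index = Math.floorMod(h1 + i * h2, size);
                if (!bits.get(index)) {
                    allSet = false;
                    bits.set(index);
                }
            }
            return allSet;
        }

        // 32-bit FNV-1a, used here as a second, independent-enough hash.
        private static int fnv1a(String s) {
            int hash = 0x811C9DC5;
            for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
                hash ^= (b & 0xFF);
                hash *= 16777619;
            }
            return hash;
        }
    }

    public static void main(String[] args) {
        DuplicateCheck[] checks = { new ExactCheck(), new BloomCheck(1 << 16, 3) };
        for (DuplicateCheck check : checks) {
            System.out.println(check.isDuplicate("https://example.com/a")); // false: first visit
            System.out.println(check.isDuplicate("https://example.com/a")); // true: repeat
        }
    }
}
```

The exact check suits small crawls where correctness matters more than memory; the Bloom variant keeps memory constant at the cost of occasionally skipping a page that was never actually fetched.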

Q: Is it suitable for production scraping workloads? A: Yes, with appropriate rate limiting and proxy rotation. Many teams use it for data collection pipelines.
