
Scrapy — Fast High-Level Web Crawling Framework for Python

Scrapy is the most battle-tested web scraping framework for Python. It handles concurrency, retries, throttling, cookies, and export pipelines — letting you write spiders that scale from one page to millions with the same code.

Introduction

Scrapy is the de facto web scraping framework for Python. Since its first release in 2008 it has powered everything from side projects to production pipelines handling millions of pages per day. Built on Twisted, it provides an asynchronous, battle-tested foundation for extracting structured data from websites.

With over 51,000 GitHub stars, Scrapy is used by price-monitoring companies, search engines, academic researchers, and data teams everywhere.

What Scrapy Does

Scrapy gives you Spiders (classes that define how to follow links and parse pages), Items (structured data containers), Pipelines (hooks that process and store scraped items), and Middlewares (hooks into the request/response cycle). Out of the box it handles concurrency, retries, cookies, HTTP caching, user-agent rotation, and robots.txt compliance.
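
For illustration, a minimal sketch of an Item plus a deduplicating Pipeline (QuoteItem and DedupPipeline are hypothetical names; a pipeline is activated through the ITEM_PIPELINES setting):

import scrapy
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class QuoteItem(scrapy.Item):
    # structured container for one scraped quote
    text = scrapy.Field()
    author = scrapy.Field()

class DedupPipeline:
    # drops items whose text has already been seen during this crawl
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        text = ItemAdapter(item)["text"]
        if text in self.seen:
            raise DropItem("duplicate quote")
        self.seen.add(text)
        return item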

Architecture Overview

[Spider]
  start_urls + parse()
      |
  [Scheduler] --> [Downloader] --> [Response]
      ^                              |
      |                              v
  [Request]  <-- [Parse Callback] -- [Extracted Items]
                                     |
                                [Item Pipelines]
                                     |
                          JSON / CSV / DB / S3

Self-Hosting & Configuration
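The example below is a complete, runnable spider with per-spider setting overrides; quotes.toscrape.com is a public sandbox site built for scraping practice.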

# myproject/spiders/quotes.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # run with: scrapy crawl quotes
    start_urls = ["https://quotes.toscrape.com/"]
    custom_settings = {
        # per-spider overrides of the project-wide settings.py
        "DOWNLOAD_DELAY": 0.5,  # seconds between requests to the same domain
        "CONCURRENT_REQUESTS": 16,
        "USER_AGENT": "mybot/1.0",
    }

    def parse(self, response):
        # one item per quote block on the page
        for q in response.css("div.quote"):
            yield {
                "text": q.css("span.text::text").get(),
                "author": q.css("small.author::text").get(),
                "tags": q.css("div.tags a.tag::text").getall(),
            }
        # follow pagination until no "next" link remains
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
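
Run it with scrapy crawl quotes -O quotes.json (the -O flag overwrites the output file on each run; -o appends). Nothing beyond scrapy startproject is needed to get here.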

Key Features

  • Async by default — Twisted reactor handles thousands of concurrent requests
  • Selectors — CSS and XPath for clean extraction
  • Middlewares — request/response hooks (proxies, UA rotation, headers)
  • Item Pipelines — cleaning, validation, deduplication, storage
  • Built-in exports — JSON, JSON Lines, CSV, XML via one flag or the FEEDS setting (see the sketch after this list)
  • AutoThrottle — automatically adjusts request rate based on server load
  • Scrapyd — deploy and schedule spiders on servers
  • Contracts — spider unit tests via docstring assertions
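
As referenced above, both exports and throttling are configured through ordinary settings. A minimal settings.py sketch, with illustrative paths and values:

# settings.py
# Feed exports: write items as JSON Lines, overwriting on each run
FEEDS = {
    "output/quotes.jsonl": {"format": "jsonlines", "overwrite": True},
}

# AutoThrottle: adapt the request rate to observed server latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average parallel requests per remote site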

Comparison with Similar Tools

Feature          Scrapy                  Playwright     BeautifulSoup   Selenium        HTTPX + lxml
JavaScript       Via splash/playwright   Yes (native)   No              Yes             No
Async            Yes                     Yes            No              No              Yes
Scale            Excellent               Moderate       Small           Small           Moderate
Learning Curve   Moderate                Low            Very Low        Low             Low
Best For         Large crawls            SPA scraping   Parsing         Browser tests   Simple fetches

FAQ

Q: Does Scrapy handle JavaScript-rendered pages? A: Not natively. Use scrapy-playwright or scrapy-splash to render JS, or reverse-engineer the underlying API calls (often faster).
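
As a hedged sketch of the scrapy-playwright wiring (check the plugin's README for the current handler paths):

# settings.py — route downloads through Playwright
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# in a spider — opt a single request into browser rendering
yield scrapy.Request(url, meta={"playwright": True})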

Q: How do I avoid getting blocked? A: Set DOWNLOAD_DELAY, use AutoThrottle, rotate user agents, use residential proxies, and respect robots.txt. Scrapy has middlewares for all of these.
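
Concretely, these knobs are all ordinary settings; a minimal politeness sketch with illustrative values:

# settings.py
ROBOTSTXT_OBEY = True        # skip URLs disallowed by robots.txt
DOWNLOAD_DELAY = 1.0         # fixed delay between requests to a domain
AUTOTHROTTLE_ENABLED = True  # back off further when the server slows down
USER_AGENT = "mybot/1.0 (+https://example.com/bot)"  # identify your crawler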

Q: Scrapy vs requests+BeautifulSoup? A: Use requests+bs4 for one-off scripts. Use Scrapy when you need concurrency, link following, retries, pipelines, or crawling thousands of pages.

Q: How do I deploy spiders? A: Use Scrapyd (self-hosted) or Scrapy Cloud (Zyte). Both let you schedule spiders and collect logs/items via API.
