Introduction
Scrapy is the de facto web scraping framework for Python. Since its first release in 2008 it has powered crawlers ranging from side projects to production pipelines that fetch millions of pages per day. Built on Twisted, it provides an asynchronous, battle-tested foundation for extracting structured data from websites.
With over 51,000 GitHub stars, Scrapy is used by price-monitoring companies, search engines, academic researchers, and data teams of all sizes.
What Scrapy Does
Scrapy gives you Spiders (classes that define how to follow links and parse pages), Items (structured data containers), Pipelines (hooks that process and store scraped items), and Middlewares (hooks into the request/response cycle). Out of the box it handles concurrency, retries, redirects, cookies, and HTTP caching, and it respects robots.txt; user-agent rotation and proxy support are added via middleware.
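A pipeline is just a class with a `process_item(item, spider)` method, and Scrapy accepts plain dicts as items. A minimal sketch (the class name is illustrative; a real pipeline would raise `scrapy.exceptions.DropItem` to discard bad items):

```python
# A minimal item pipeline sketch. Scrapy calls process_item() for every
# item a spider yields; whatever it returns is passed to the next pipeline.
class StripWhitespacePipeline:
    def process_item(self, item, spider):
        # Trim stray whitespace from string fields before export/storage.
        return {k: v.strip() if isinstance(v, str) else v
                for k, v in item.items()}
```

Pipelines are enabled in settings.py via the `ITEM_PIPELINES` dict, keyed by import path with an integer priority.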
Architecture Overview
```
 [Spider]
 start_urls + parse()
        |
        v
 [Scheduler] --> [Downloader] --> [Response]
      ^                               |
      |                               v
 [Request] <-- [Parse Callback] --> [Extracted Items]
                                         |
                                         v
                                 [Item Pipelines]
                                         |
                                         v
                                JSON / CSV / DB / S3
```
Self-Hosting & Configuration
```python
# myproject/spiders/quotes.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    custom_settings = {
        "DOWNLOAD_DELAY": 0.5,
        "CONCURRENT_REQUESTS": 16,
        "USER_AGENT": "mybot/1.0",
    }

    def parse(self, response):
        for q in response.css("div.quote"):
            yield {
                "text": q.css("span.text::text").get(),
                "author": q.css("small.author::text").get(),
                "tags": q.css("div.tags a.tag::text").getall(),
            }
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
Key Features
- Async by default — Twisted reactor handles thousands of concurrent requests
- Selectors — CSS and XPath for clean extraction
- Middlewares — request/response hooks (proxies, UA rotation, headers)
- Item Pipelines — cleaning, validation, deduplication, storage
- Built-in exports — JSON, JSON Lines, CSV, XML with one flag
- AutoThrottle — automatically adjust concurrency based on server load
- Scrapyd — deploy and schedule spiders on servers
- Contracts — spider unit tests via docstring assertions
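The built-in exports above need no code: the CLI picks the format from the output file extension, and since Scrapy 2.1 `-O` overwrites the output file while `-o` appends.

```shell
# Run the quotes spider and export items as JSON Lines, overwriting any
# previous output file.
scrapy crawl quotes -O quotes.jl

# Same crawl with CSV output, appending to an existing file.
scrapy crawl quotes -o quotes.csv
```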
Comparison with Similar Tools
| Feature | Scrapy | Playwright | BeautifulSoup | Selenium | HTTPX + lxml |
|---|---|---|---|---|---|
| JavaScript | Via splash/playwright | Yes (native) | No | Yes | No |
| Async | Yes | Yes | No | No | Yes |
| Scale | Excellent | Moderate | Small | Small | Moderate |
| Learning Curve | Moderate | Low | Very Low | Low | Low |
| Best For | Large crawls | SPA scraping | Parsing | Browser tests | Simple fetches |
FAQ
Q: Does Scrapy handle JavaScript-rendered pages? A: Not natively. Use scrapy-playwright or scrapy-splash to render JS, or reverse-engineer the underlying API calls (often faster).
Q: How do I avoid getting blocked? A: Set DOWNLOAD_DELAY, use AutoThrottle, rotate user agents, use residential proxies, and respect robots.txt. Scrapy has middlewares for all of these.
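Those knobs live in settings.py. The setting names below are Scrapy's own; the values are illustrative:

```python
# settings.py (fragment): polite-crawling configuration.
ROBOTSTXT_OBEY = True                  # honor robots.txt
DOWNLOAD_DELAY = 0.5                   # base delay between requests to a domain
AUTOTHROTTLE_ENABLED = True            # adapt delay to observed server latency
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average concurrent requests per server
```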
Q: Scrapy vs requests+BeautifulSoup? A: Use requests+bs4 for one-off scripts. Use Scrapy when you need concurrency, link following, retries, pipelines, or crawling thousands of pages.
Q: How do I deploy spiders? A: Use Scrapyd (self-hosted) or Scrapy Cloud (Zyte). Both let you schedule spiders and collect logs/items via API.
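For the self-hosted route, the deploy cycle looks roughly like this (the `scrapyd-deploy` command and the `schedule.json` endpoint are from the Scrapyd docs; the target, project, and spider names are placeholders from this article):

```shell
# Install the deploy client and push the project to a running Scrapyd.
pip install scrapyd-client
scrapyd-deploy default -p myproject

# Schedule a run via Scrapyd's HTTP API (default port 6800).
curl http://localhost:6800/schedule.json -d project=myproject -d spider=quotes
```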
Sources
- GitHub: https://github.com/scrapy/scrapy
- Docs: https://docs.scrapy.org
- License: BSD-3-Clause