Scrapy — Fast High-Level Web Crawling Framework for Python
Scrapy is the most battle-tested web scraping framework for Python. It handles concurrency, retries, throttling, cookies, and export pipelines — letting you write spiders that scale from one page to millions with the same code.
What it is
Scrapy is an application framework for crawling websites and extracting structured data from their pages. You define how to follow links and what to extract, and Scrapy handles the rest: scheduling, deduplication, middleware, and output formatting.
Scrapy targets data engineers, researchers, and developers who need structured data from websites. It is an asynchronous framework built on Twisted, capable of handling thousands of concurrent requests while respecting rate limits and site policies.
Why it saves time or tokens
Building a web scraper from scratch requires handling HTTP connections, retries, rate limiting, cookie management, and data storage. Scrapy provides all of this as configuration. You focus exclusively on the extraction logic. When using AI assistants to build scrapers, Scrapy's well-defined Spider class and Item/Pipeline pattern produce consistent, working code because the framework constraints reduce ambiguity.
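For illustration, a minimal sketch of that Item/Pipeline pattern; `ProductItem`, `ValidationPipeline`, and the emptiness check are hypothetical names for this example, not part of Scrapy itself:

```python
import scrapy
from scrapy.exceptions import DropItem


class ProductItem(scrapy.Item):
    # Declared fields give spiders and pipelines a shared schema.
    name = scrapy.Field()
    price = scrapy.Field()


class ValidationPipeline:
    def process_item(self, item, spider):
        # Reject items whose selectors came back empty instead of storing bad rows.
        if not item.get('name') or not item.get('price'):
            raise DropItem(f'missing field in {item!r}')
        return item
```

Pipelines are enabled per project via the `ITEM_PIPELINES` setting, which maps each pipeline class to an order number.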
How to use
- Install Scrapy: `pip install scrapy`
- Create a project: `scrapy startproject myproject`
- Create a spider: `scrapy genspider example example.com` and define its parse method (the full command sequence is sketched below)
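Putting the steps together, a typical first session looks like this; `myproject` and `example` are placeholder names, and the `-O` overwrite flag requires Scrapy 2.1 or newer:

```bash
pip install scrapy
scrapy startproject myproject
cd myproject
scrapy genspider example example.com
scrapy crawl example -O output.json   # -O overwrites the file; -o appends
```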
Example
```python
import scrapy


class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # Extract one item per product card on the page.
        for product in response.css('div.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('span.price::text').get(),
                'url': product.css('a::attr(href)').get(),
            }
        # Follow pagination until no "next" link remains.
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
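To try a spider like this without a full project, Scrapy can also run a single file directly, e.g. `scrapy runspider products_spider.py -O products.json` (the filename is a placeholder; `-O` needs Scrapy 2.1+).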
| Component | Purpose |
|---|---|
| Spider | Define crawl logic and extraction |
| Item | Structured data container |
| Pipeline | Process, validate, store items |
| Middleware | Modify requests/responses |
| Settings | Configure concurrency, delays |
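As a concrete instance of the Middleware row, a sketch of a downloader middleware that stamps a User-Agent header on every outgoing request; the class name and bot string are made up for this example:

```python
# middlewares.py
class CustomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Downloader middlewares see each request before it is sent.
        request.headers.setdefault('User-Agent', 'mybot/1.0 (+https://example.com/bot)')
        return None  # None means: continue normal processing


# settings.py -- register the middleware with an order number
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomUserAgentMiddleware': 500,
}
```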
Related on TokRepo
- AI tools for web scraping — web scraping tools and frameworks on TokRepo
- AI tools for automation — data collection automation
Common pitfalls
- Scrapy runs asynchronously on Twisted; calling blocking libraries (requests, time.sleep) inside spiders stalls the single-threaded reactor and freezes every in-flight request
- Websites change their HTML structure; selectors break silently and return empty data rather than errors, so add validation in pipelines
- Aggressive crawling gets your IP blocked; always configure DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_DOMAIN in settings (a sketch follows below)
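A conservative settings sketch for that last point; the values are illustrative starting points, not recommendations for any particular site:

```python
# settings.py
ROBOTSTXT_OBEY = True                # honor robots.txt before crawling
DOWNLOAD_DELAY = 1.0                 # seconds between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # cap parallelism per site
RETRY_TIMES = 2                      # retries before a request is given up
```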
Frequently Asked Questions
Does Scrapy handle JavaScript-rendered pages?
Scrapy alone does not execute JavaScript. For JS-rendered pages, integrate Scrapy with Splash (a headless browser) or Playwright via scrapy-playwright. These middleware solutions render JavaScript before Scrapy extracts data, though they add overhead compared to plain HTTP scraping.
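For example, with scrapy-playwright installed and its download handler enabled in the project settings, individual requests opt into browser rendering through their meta dict; a minimal sketch assuming a default scrapy-playwright setup:

```python
import scrapy


class JsPageSpider(scrapy.Spider):
    name = 'js_pages'

    def start_requests(self):
        # meta={'playwright': True} routes this request through a headless browser.
        yield scrapy.Request(
            'https://example.com/app',
            meta={'playwright': True},
            callback=self.parse,
        )

    def parse(self, response):
        # The response body now contains the JavaScript-rendered HTML.
        yield {'title': response.css('title::text').get()}
```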
How does Scrapy handle rate limiting and throttling?
Scrapy has built-in settings for DOWNLOAD_DELAY (seconds between requests), CONCURRENT_REQUESTS (total parallel requests), and CONCURRENT_REQUESTS_PER_DOMAIN. The AutoThrottle extension dynamically adjusts delays based on server response times, automatically slowing down when the target site is overloaded.
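The AutoThrottle knobs live in settings as well; a sketch using the documented defaults as starting values:

```python
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial delay while latency is still unknown
AUTOTHROTTLE_MAX_DELAY = 60.0          # ceiling on the computed delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site
```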
What export formats does Scrapy support?
Scrapy exports data to JSON, JSON Lines, CSV, XML, and custom formats through Feed Exports. You configure the output format and destination in settings or on the command line. For databases, write a custom Pipeline that inserts items into PostgreSQL, MongoDB, or any other store.
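A feed configuration sketch (the FEEDS dict syntax requires Scrapy 2.1+; the filenames are placeholders):

```python
# settings.py -- writes every scraped item to both feeds below
FEEDS = {
    'products.jsonl': {'format': 'jsonlines'},
    'products.csv': {'format': 'csv'},
}
```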
How does Scrapy differ from BeautifulSoup?
BeautifulSoup is a parsing library that extracts data from HTML. Scrapy is a complete framework that handles crawling, scheduling, concurrency, and data pipelines. BeautifulSoup is simpler for one-off page parsing; Scrapy is better for large-scale crawling with many pages, retries, and structured output.
Does Scrapy respect robots.txt?
Yes. Scrapy respects robots.txt by default through the ROBOTSTXT_OBEY setting (set to True in projects generated by scrapy startproject). It downloads and parses the robots.txt file before crawling and skips disallowed URLs. You can disable this for legitimate use cases, but always check the site's terms of service.
Citations (3)
- Scrapy GitHub — Scrapy is a web scraping framework for Python
- Scrapy Docs — Scrapy architecture with spiders, items, and pipelines
- robotstxt.org — robots.txt standard for web crawlers
Related Assets
AlphaFold — AI-Powered 3D Protein Structure Prediction
AlphaFold by Google DeepMind predicts three-dimensional protein structures from amino acid sequences with atomic-level accuracy, enabling breakthroughs in drug discovery, enzyme engineering, and structural biology research.
Flash Attention — Fast Memory-Efficient Exact Attention for Transformers
Flash Attention is a CUDA kernel library that computes exact scaled dot-product attention 2-4x faster and with up to 20x less memory than standard implementations by using IO-aware tiling to minimize GPU memory reads and writes.
ChatGLM — Open Bilingual Chat Model by Tsinghua KEG
ChatGLM is a family of open bilingual language models from Tsinghua University that support English and Chinese conversation, code generation, and tool use, with variants optimized for consumer GPU deployment.