# Scrapy — Fast High-Level Web Crawling Framework for Python

> Scrapy is the most battle-tested web scraping framework for Python. It handles concurrency, retries, throttling, cookies, and export pipelines — letting you write spiders that scale from one page to millions with the same code.

## Install & Quick Use

Install Scrapy, generate a project and a spider, then crawl and export to JSON in one flag:

```bash
pip install scrapy
scrapy startproject myproject
cd myproject
scrapy genspider quotes quotes.toscrape.com
scrapy crawl quotes -O quotes.json
```

## Introduction

Scrapy is the de facto web scraping framework for Python. Since 2008 it has powered crawlers ranging from side projects to production pipelines handling millions of pages per day. Built on Twisted, it provides an asynchronous, battle-tested foundation for extracting structured data from websites. With over 51,000 GitHub stars, Scrapy is used by price-monitoring companies, search engines, academic researchers, and data teams everywhere.

## What Scrapy Does

Scrapy gives you Spiders (classes that define how to follow links and parse pages), Items (structured data containers), Pipelines (components that process and store scraped data), and Middlewares (hooks into the request/response cycle). It handles concurrency, retries, cookies, HTTP caching, and user-agent rotation, and respects robots.txt out of the box.
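As a taste of the pipeline hook, here is a minimal sketch of a deduplicating pipeline. The class and field names are illustrative, and a real Scrapy pipeline would raise `scrapy.exceptions.DropItem` rather than return `None`; plain Python is used here so the sketch runs even without Scrapy installed:

```python
class DedupPipeline:
    """Sketch of an item pipeline: normalize whitespace, drop duplicates."""

    def __init__(self):
        self.seen = set()  # normalized texts already emitted

    def process_item(self, item, spider):
        # Collapse runs of whitespace so near-identical quotes compare equal.
        text = " ".join(item["text"].split())
        if text in self.seen:
            # Stand-in for `raise scrapy.exceptions.DropItem(...)`.
            return None
        self.seen.add(text)
        item["text"] = text
        return item
```

Pipelines are enabled per project via the `ITEM_PIPELINES` setting, e.g. `ITEM_PIPELINES = {"myproject.pipelines.DedupPipeline": 300}`, where the number controls execution order.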
## Architecture Overview

```
   [Spider] start_urls + parse()
       |
  [Scheduler] --> [Downloader] --> [Response]
       ^                               |
       |                               v
   [Request] <-- [Parse Callback] -- [Extracted Items]
                                         |
                                  [Item Pipelines]
                                         |
                                JSON / CSV / DB / S3
```

## Example Spider & Configuration

```python
# myproject/spiders/quotes.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    custom_settings = {
        "DOWNLOAD_DELAY": 0.5,
        "CONCURRENT_REQUESTS": 16,
        "USER_AGENT": "mybot/1.0",
    }

    def parse(self, response):
        for q in response.css("div.quote"):
            yield {
                "text": q.css("span.text::text").get(),
                "author": q.css("small.author::text").get(),
                "tags": q.css("div.tags a.tag::text").getall(),
            }
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
```

## Key Features

- **Async by default** — Twisted reactor handles thousands of concurrent requests
- **Selectors** — CSS and XPath for clean extraction
- **Middlewares** — request/response hooks (proxies, UA rotation, headers)
- **Item Pipelines** — cleaning, validation, deduplication, storage
- **Built-in exports** — JSON, JSON Lines, CSV, XML with one flag
- **AutoThrottle** — automatically adjusts concurrency based on server load
- **Scrapyd** — deploy and schedule spiders on servers
- **Contracts** — spider unit tests via docstring assertions

## Comparison with Similar Tools

| Feature | Scrapy | Playwright | BeautifulSoup | Selenium | HTTPX + lxml |
|---|---|---|---|---|---|
| JavaScript | Via splash/playwright | Yes (native) | No | Yes | No |
| Async | Yes | Yes | No | No | Yes |
| Scale | Excellent | Moderate | Small | Small | Moderate |
| Learning Curve | Moderate | Low | Very Low | Low | Low |
| Best For | Large crawls | SPA scraping | Parsing | Browser tests | Simple fetches |

## FAQ

**Q: Does Scrapy handle JavaScript-rendered pages?**
A: Not natively.
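Often the JavaScript on a page only fetches JSON from a backing API that you can call directly. A minimal standard-library sketch of that route follows; the endpoint URL and payload shape (modeled on the one used by quotes.toscrape.com's infinite-scroll page) are assumptions, not part of Scrapy:

```python
import json
import urllib.request

# Assumed JSON endpoint, as discovered in the browser's network tab.
API_URL = "https://quotes.toscrape.com/api/quotes?page=1"


def extract_quotes(payload):
    """Flatten the assumed payload shape into simple records."""
    return [
        {"text": q["text"], "author": q["author"]["name"]}
        for q in payload.get("quotes", [])
    ]


def fetch_page(url=API_URL):
    """Fetch one page of quotes straight from the JSON API."""
    req = urllib.request.Request(url, headers={"User-Agent": "mybot/1.0"})
    with urllib.request.urlopen(req) as resp:
        return extract_quotes(json.load(resp))
```

When an API like this exists, calling it is usually faster and more robust than driving a headless browser.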
Use scrapy-playwright or scrapy-splash to render JS in-process, or reverse-engineer the underlying API calls (often faster).

**Q: How do I avoid getting blocked?**
A: Set `DOWNLOAD_DELAY`, enable AutoThrottle, rotate user agents, use residential proxies, and respect robots.txt. Scrapy has middlewares for all of these.

**Q: Scrapy vs requests+BeautifulSoup?**
A: Use requests+bs4 for one-off scripts. Use Scrapy when you need concurrency, link following, retries, pipelines, or crawls of thousands of pages.

**Q: How do I deploy spiders?**
A: Use Scrapyd (self-hosted) or Scrapy Cloud (Zyte). Both let you schedule spiders and collect logs/items via API.

## Sources

- GitHub: https://github.com/scrapy/scrapy
- Docs: https://docs.scrapy.org
- License: BSD-3-Clause

---
Source: https://tokrepo.com/en/workflows/cd40eff3-37b4-11f1-9bc6-00163e2b0d79
Author: Script Depot