Skills · Apr 14, 2026 · 3 min read

Scrapy — Fast High-Level Web Crawling Framework for Python

Scrapy is a mature, widely used web scraping framework for Python. It handles concurrency, retries, throttling, cookies, and export pipelines, so the same spider code scales from one page to millions.

Script Depot · Community

Agent-ready install

This asset can be installed after you choose the runtime, review the plan, and run the corresponding command.

Native · 98/100 · Policy: allow

Agent surface: Any MCP/CLI agent
Type: Skill
Install: Single
Trust: Established
Entry: step-1.md

Direct install command

npx -y tokrepo@latest install cd40eff3-37b4-11f1-9bc6-00163e2b0d79 --target codex

Run after confirming the plan with a dry run.

TL;DR

Scrapy handles concurrent scraping, retries, and data pipelines so you write just the spider.

§01

What it is

Scrapy is a mature open-source web scraping framework for Python. You define how to follow links and extract data, and Scrapy handles the rest: concurrency, retries, throttling, cookies, scheduling, request deduplication, middleware, and export pipelines. The same spider code scales from one page to millions.

Scrapy targets data engineers, researchers, and developers who need structured data from websites. It is an asynchronous framework built on Twisted, so a single process can keep many requests in flight at once while respecting the rate limits and site policies you configure.

§02

Why it saves time or tokens

Building a web scraper from scratch requires handling HTTP connections, retries, rate limiting, cookie management, and data storage. Scrapy provides all of this as configuration. You focus exclusively on the extraction logic. When using AI assistants to build scrapers, Scrapy's well-defined Spider class and Item/Pipeline pattern produce consistent, working code because the framework constraints reduce ambiguity.

§03

How to use

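Install Scrapy: pip install scrapy
Create a project: scrapy startproject myproject
Create a spider from inside the project directory: cd myproject, then scrapy genspider example example.com, and define its parse method
Run the spider and export the scraped items: scrapy crawl example -o items.json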

§04

Example

import scrapy

class ProductSpider(scrapy.Spider):
    # Name used on the command line: scrapy crawl products
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # Emit one item per product card; .get() returns None when a selector matches nothing
        for product in response.css('div.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('span.price::text').get(),
                'url': product.css('a::attr(href)').get(),
            }

        # Follow the pagination link, if any, and parse the next page the same way
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Component	Purpose
Spider	Define crawl logic and extraction
Item	Structured data container
Pipeline	Process, validate, store items
Middleware	Modify requests/responses
Settings	Configure concurrency, delays

§05

Related on TokRepo

AI tools for web scraping — web scraping tools and frameworks on TokRepo
AI tools for automation — data collection automation

§06

Common pitfalls

Scrapy runs asynchronously; calling blocking code (requests, time.sleep) inside a spider blocks the Twisted reactor and stalls every other in-flight request until the call returns
Websites change their HTML structure; selectors break silently and return empty data rather than errors, so add validation in pipelines
Aggressive crawling gets your IP blocked; always configure DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_DOMAIN in settings

Frequently asked questions

Can Scrapy handle JavaScript-rendered pages?

Scrapy alone does not execute JavaScript. For JS-rendered pages, integrate Scrapy with Splash (a JavaScript rendering service) or with Playwright via scrapy-playwright. These integrations render the page in a browser before Scrapy extracts data, though they add overhead compared to plain HTTP scraping.

How does Scrapy handle rate limiting?

Scrapy has built-in settings for DOWNLOAD_DELAY (seconds between requests), CONCURRENT_REQUESTS (total parallel requests), and CONCURRENT_REQUESTS_PER_DOMAIN. The AutoThrottle extension dynamically adjusts delays based on server response times, automatically slowing down when the target site is overloaded.
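
A settings.py fragment using the options named above might look like this. The setting names are Scrapy's own; the numeric values are illustrative assumptions, not recommendations for any particular site.

```python
# settings.py fragment; the numbers are illustrative only.

# Minimum seconds between consecutive requests to the same domain.
DOWNLOAD_DELAY = 1.0
# Global cap on requests in flight across all domains.
CONCURRENT_REQUESTS = 16
# Cap on requests in flight to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let AutoThrottle adapt the delay to observed latency.
AUTOTHROTTLE_ENABLED = True
# Initial delay in seconds before latency data is available.
AUTOTHROTTLE_START_DELAY = 1.0
# Upper bound on the delay when the server responds slowly.
AUTOTHROTTLE_MAX_DELAY = 30.0
# Average number of parallel requests to aim for per remote site.
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
```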

What output formats does Scrapy support?

Scrapy exports data to JSON, JSON Lines, CSV, XML, and custom formats through Feed Exports. You configure the output format and destination in settings or on the command line. For databases, write a custom Pipeline that inserts items into PostgreSQL, MongoDB, or any other store.
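
The sketch below shows both routes under stated assumptions: a FEEDS setting that writes JSON Lines, and a small storage pipeline. SQLite from the Python standard library stands in for PostgreSQL or MongoDB so the example stays self-contained; the SQLitePipeline class, the products table, and the file names are hypothetical.

```python
import sqlite3

# settings.py fragment: write items as JSON Lines, replacing any previous file.
FEEDS = {
    "items.jsonl": {"format": "jsonlines", "encoding": "utf8", "overwrite": True},
}


class SQLitePipeline:
    """Store each item as a row in a SQLite table (hypothetical schema)."""

    def __init__(self, path="items.db"):
        self.path = path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the database and create the table.
        self.conn = sqlite3.connect(self.path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS products (name TEXT, price TEXT, url TEXT)"
        )

    def process_item(self, item, spider):
        # Parameterized query, so scraped text is never interpolated into SQL.
        self.conn.execute(
            "INSERT INTO products (name, price, url) VALUES (?, ?, ?)",
            (item.get("name"), item.get("price"), item.get("url")),
        )
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: persist the rows and release the file.
        self.conn.commit()
        self.conn.close()
```

The same open_spider, process_item, close_spider shape carries over to other stores; only the connection and the insert call change.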

How does Scrapy compare to BeautifulSoup?

BeautifulSoup is a parsing library that extracts data from HTML. Scrapy is a complete framework that handles crawling, scheduling, concurrency, and data pipelines. BeautifulSoup is simpler for one-off page parsing. Scrapy is better for large-scale crawling with many pages, retries, and structured output.

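Can Scrapy respect robots.txt?

Yes, through the ROBOTSTXT_OBEY setting. Projects generated with scrapy startproject set it to True in settings.py. Scrapy's built-in fallback for this setting has historically been False, so a spider run outside a generated project should set it explicitly. When enabled, Scrapy downloads and parses the robots.txt file before crawling and skips disallowed URLs. You can disable this for legitimate use cases, but always check the site's terms of service.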
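
To illustrate what skipping disallowed URLs means in practice, the sketch below evaluates a robots.txt file with Python's standard-library parser. This is not Scrapy's own implementation, only an approximation of the same rule matching; the ROBOTS_TXT content and the mybot user agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network request is made here.
parser.parse(ROBOTS_TXT.splitlines())

# A URL outside the disallowed prefix may be fetched; one inside it may not.
allowed = parser.can_fetch("mybot", "https://example.com/products")
blocked = parser.can_fetch("mybot", "https://example.com/private/report")

# settings.py fragment: state the behaviour explicitly instead of relying on a default.
ROBOTSTXT_OBEY = True
```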
