# Crawlee — Web Scraping and Browser Automation Library

> Build reliable web scrapers in Node.js or Python. Crawlee handles proxy rotation, browser fingerprints, auto-scaling, and anti-bot bypassing out of the box.

## Quick Use

```bash
npx crawlee create my-scraper
cd my-scraper
npm start
```

Or in Python:

```bash
pip install crawlee[playwright]
```

## What is Crawlee?

Crawlee is a web scraping and browser automation library that handles the hard parts — proxy rotation, browser fingerprints, retries, auto-scaling, and storage — so you can focus on the extraction logic. Available for Node.js and Python.

**Answer-Ready**: Crawlee is a web scraping library for Node.js and Python that handles proxy rotation, browser fingerprints, auto-scaling, and anti-bot bypassing for reliable data extraction.

## Core Features

### 1. Multiple Crawler Types

```typescript
// HTTP crawler (fastest, for simple pages)
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ request, $ }) {
    const title = $('title').text();
    await Dataset.pushData({ url: request.url, title });
  },
});

await crawler.run(['https://example.com']);
```

```typescript
// Browser crawler (for JS-rendered pages)
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
  async requestHandler({ page }) {
    await page.waitForSelector('.product');
    const items = await page.$$eval('.product', els =>
      els.map(el => ({ name: el.textContent }))
    );
    await Dataset.pushData(items);
  },
});
```

### 2. Anti-Bot Features

Built-in fingerprint randomization and session management:

```typescript
const crawler = new PlaywrightCrawler({
  useSessionPool: true,
  sessionPoolOptions: { maxPoolSize: 100 },
  browserPoolOptions: {
    fingerprintOptions: {
      fingerprintGeneratorOptions: {
        browsers: ['chrome', 'firefox'],
      },
    },
  },
});
```

### 3. Proxy Rotation
```typescript
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    'http://proxy1:8080',
    'http://proxy2:8080',
  ],
});

const crawler = new CheerioCrawler({
  proxyConfiguration, // Automatically rotates per request
});
```

### 4. Auto-Scaling

Adjusts concurrency based on available system resources and the target site's responses:

```typescript
const crawler = new CheerioCrawler({
  minConcurrency: 1,
  maxConcurrency: 100, // Auto-scales between these limits
});
```

### 5. Built-in Storage

```typescript
import { Dataset, KeyValueStore, RequestQueue } from 'crawlee';

// Dataset for structured data
await Dataset.pushData({ title, price, url });
await Dataset.exportToCSV('results');

// Key-value store for files
await KeyValueStore.setValue('screenshot', buffer, { contentType: 'image/png' });

// Request queue for URLs
const requestQueue = await RequestQueue.open();
await requestQueue.addRequest({ url: 'https://...' });
```

## Python Version

```python
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()

@crawler.router.default_handler
async def handler(context: PlaywrightCrawlingContext) -> None:
    title = await context.page.title()
    await context.push_data({'title': title})

await crawler.run(['https://example.com'])
```

## FAQ

**Q: How does it compare to Scrapy?**
A: Crawlee has first-class browser support, built-in anti-bot features, and is available for both JavaScript and Python. Scrapy is Python-only and HTTP-focused.

**Q: Is it from the Apify team?**
A: Yes, Crawlee is open source and maintained by Apify. It can run standalone or deploy to the Apify cloud.

**Q: Can it handle SPAs?**
A: Yes, `PlaywrightCrawler` renders JavaScript and waits for dynamic content.

## Source & Thanks

- GitHub: [apify/crawlee](https://github.com/apify/crawlee) (16k+ stars)
- Docs: [crawlee.dev](https://crawlee.dev)
---

Source: https://tokrepo.com/en/workflows/8f2c0ae9-1327-481f-a519-d473751bdd76
Author: MCP Hub