# Crawlee — Production Web Scraping for Node.js

> Build reliable crawlers with automatic proxy rotation, request queuing, and browser automation. By Apify. 22K+ stars.

## Quick Use

```bash
npx crawlee create my-crawler
cd my-crawler
npm start
```

Or add to an existing project:

```bash
npm install crawlee playwright
```

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        const title = await page.title();
        console.log(`${title} — ${request.url}`);

        // Extract data
        const data = await page.$$eval('article', (articles) =>
            articles.map((a) => ({ title: a.querySelector('h2')?.textContent })),
        );

        // Follow links
        await enqueueLinks({ globs: ['https://example.com/blog/**'] });
    },
});

await crawler.run(['https://example.com/blog']);
```

---

## Intro

Crawlee is a web scraping and browser automation library for Node.js built by Apify, with 22,600+ GitHub stars. It provides a unified interface for building production-grade crawlers using raw HTTP requests (Cheerio), headless browsers (Playwright/Puppeteer), or adaptive crawling that automatically switches between them. With built-in proxy rotation, request queuing, automatic retries, and persistent storage, Crawlee handles the hard parts of web scraping so you can focus on data extraction logic. Ideal for feeding data to AI/LLM pipelines, RAG systems, and training datasets.

Works with: Node.js, TypeScript, Playwright, Puppeteer, Cheerio. Best for developers building data pipelines for AI applications. Setup time: under 3 minutes.
---

## Crawlee Crawler Types & Features

### Crawler Types

| Crawler | Engine | Best For | Speed |
|---------|--------|----------|-------|
| **CheerioCrawler** | HTTP + Cheerio | Static HTML pages | Fastest |
| **PlaywrightCrawler** | Playwright browser | JavaScript-heavy SPAs | Medium |
| **PuppeteerCrawler** | Puppeteer browser | Chrome-specific features | Medium |
| **AdaptivePlaywrightCrawler** | Auto-switching | Mixed content sites | Smart |

### CheerioCrawler (Fast HTTP)

For static pages — no browser overhead:

```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        const title = $('h1').text();
        const prices = $('span.price').map((_, el) => $(el).text()).get();
        await Dataset.pushData({ url: request.url, title, prices });
    },
});
```

### PlaywrightCrawler (Browser)

For JavaScript-rendered content:

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    headless: true,
    async requestHandler({ page, request }) {
        // Wait for dynamic content
        await page.waitForSelector('.product-list');

        // Scroll to load more
        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
        await page.waitForTimeout(2000);

        const products = await page.$$eval('.product', (items) =>
            items.map((item) => ({
                name: item.querySelector('.name')?.textContent,
                price: item.querySelector('.price')?.textContent,
            })),
        );
    },
});
```

### Proxy Rotation

Built-in proxy management with session persistence:

```javascript
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1:8080',
        'http://proxy2:8080',
        'http://proxy3:8080',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    sessionPoolOptions: { maxPoolSize: 100 },
});
```

### Request Queue & Auto-Retry

Persistent queue survives crashes, with configurable retry logic:

```javascript
const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3,
    requestHandlerTimeoutSecs: 60,
    maxConcurrency: 10,
    async requestHandler({ request }) { /* ... */ },
    async failedRequestHandler({ request }) {
        console.log(`Failed after retries: ${request.url}`);
    },
});
```

### Dataset Storage

Structured data export without external dependencies:

```javascript
import { Dataset } from 'crawlee';

// Save data
await Dataset.pushData({ title: 'Product A', price: '$29.99' });

// Export to JSON/CSV (the key names the output record; Crawlee adds the extension)
const dataset = await Dataset.open();
await dataset.exportToJSON('output');
await dataset.exportToCSV('output');
```

### AI/LLM Integration

Feed crawled data directly to AI pipelines:

```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        // Extract clean text for LLM consumption
        $('nav, footer, script, style').remove();
        const cleanText = $('body').text().replace(/\s+/g, ' ').trim();

        await Dataset.pushData({
            url: request.url,
            content: cleanText, // Ready for RAG ingestion
        });
    },
});
```

---

## FAQ

**Q: What is Crawlee?**
A: Crawlee is a Node.js/TypeScript web scraping and browser automation library by Apify with 22,600+ GitHub stars. It provides HTTP and browser-based crawlers with built-in proxy rotation, request queuing, and auto-retries for production use.

**Q: How is Crawlee different from Puppeteer or Playwright alone?**
A: Crawlee adds production features on top of Puppeteer/Playwright: request queuing, automatic retries, proxy rotation, session management, and structured storage. Raw Puppeteer/Playwright are browser automation tools; Crawlee is a complete crawling framework.

**Q: Is Crawlee free?**
A: Yes, fully free and open-source under Apache-2.0. Apify offers optional cloud hosting for running crawlers at scale, but the library itself is completely free.

---

## Source & Thanks

> Created by [Apify](https://github.com/apify). Licensed under Apache-2.0.
>
> [crawlee](https://github.com/apify/crawlee) — ⭐ 22,600+

Thanks to the Apify team for building the most robust open-source web scraping framework for Node.js.

---

Source: https://tokrepo.com/en/workflows/3e8c6e91-e10e-45ba-9206-d6e3a9958c6a
Author: Script Depot