Scripts · Apr 2, 2026 · 2 min read

Crawlee — Production Web Scraping for Node.js

Build reliable crawlers with automatic proxy rotation, request queuing, and browser automation. By Apify. 22K+ stars.

Introduction

Crawlee is a web scraping and browser automation library for Node.js built by Apify, with 22,600+ GitHub stars. It provides a unified interface for building production-grade crawlers using raw HTTP requests (Cheerio), headless browsers (Playwright/Puppeteer), or adaptive crawling that automatically switches between them. With built-in proxy rotation, request queuing, automatic retries, and persistent storage, Crawlee handles the hard parts of web scraping so you can focus on data extraction logic. Ideal for feeding data to AI/LLM pipelines, RAG systems, and training datasets.

Works with: Node.js, TypeScript, Playwright, Puppeteer, Cheerio. Best for developers building data pipelines for AI applications. Setup time: under 3 minutes.
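Setup is a single install (assuming Node.js is already available; the playwright package is only needed for the browser-based crawlers):

```shell
# Core library — enough for HTTP-only scraping with CheerioCrawler
npm install crawlee

# Add Playwright and its browser binaries for browser-based crawling
npm install playwright
npx playwright install
```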


Crawlee Crawler Types & Features

Four Crawler Types

| Crawler | Engine | Best For | Speed |
| --- | --- | --- | --- |
| CheerioCrawler | HTTP + Cheerio | Static HTML pages | Fastest |
| PlaywrightCrawler | Playwright browser | JavaScript-heavy SPAs | Medium |
| PuppeteerCrawler | Puppeteer browser | Chrome-specific features | Medium |
| AdaptivePlaywrightCrawler | Auto-switching | Mixed-content sites | Smart |
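The adaptive crawler deserves a quick sketch: it samples requests to detect whether plain HTTP fetching is enough for a page, and only spins up Playwright when the content needs JavaScript rendering. A minimal example, assuming Crawlee's adaptive-context helpers (`querySelector`, `pushData`); the selectors and URL are placeholders:

```javascript
import { AdaptivePlaywrightCrawler } from 'crawlee';

const crawler = new AdaptivePlaywrightCrawler({
    // Fraction of requests re-checked to decide whether plain
    // HTTP fetching still suffices for this site.
    renderingTypeDetectionRatio: 0.1,
    async requestHandler({ request, querySelector, pushData }) {
        // querySelector works the same whether the page was fetched
        // over HTTP or rendered in a real browser.
        const heading = await querySelector('h1');
        await pushData({ url: request.url, title: heading.text() });
    },
});

await crawler.run(['https://example.com']);
```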

CheerioCrawler (Fast HTTP)

For static pages — no browser overhead:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        const title = $('h1').text();
        const prices = $('span.price').map((_, el) => $(el).text()).get();
        await Dataset.pushData({ url: request.url, title, prices });
    },
});

await crawler.run(['https://example.com']);
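Real crawls rarely stop at one page. The handler context also exposes `enqueueLinks`, which discovers links on the current page and adds them to the request queue. A sketch — the glob pattern is an assumption about the target site's URL scheme:

```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        await Dataset.pushData({ url: request.url, title: $('h1').text() });

        // Follow only product pages; all other links are ignored.
        await enqueueLinks({
            globs: ['https://example.com/products/**'],
        });
    },
});

await crawler.run(['https://example.com']);
```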

PlaywrightCrawler (Browser)

For JavaScript-rendered content:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    headless: true,
    async requestHandler({ page, request }) {
        // Wait for dynamic content
        await page.waitForSelector('.product-list');

        // Scroll to trigger lazy-loaded items
        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
        await page.waitForTimeout(2000);

        const products = await page.$$eval('.product', items =>
            items.map(item => ({
                name: item.querySelector('.name')?.textContent,
                price: item.querySelector('.price')?.textContent,
            }))
        );

        await Dataset.pushData({ url: request.url, products });
    },
});

Proxy Rotation

Built-in proxy management with session persistence:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1:8080',
        'http://proxy2:8080',
        'http://proxy3:8080',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    sessionPoolOptions: { maxPoolSize: 100 },
});
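When a proxy gets blocked, the session tied to it can be retired so the pool rotates to a fresh identity on the next request. A sketch — the block-detection check is a placeholder for whatever signal the target site actually gives:

```javascript
const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    sessionPoolOptions: { maxPoolSize: 100 },
    async requestHandler({ page, session }) {
        const title = await page.title();
        // Placeholder block check: adapt to the target site's signals.
        if (title.includes('Access Denied')) {
            session.retire(); // drop this session/proxy pairing
            throw new Error('Blocked, retrying with a fresh session');
        }
    },
});
```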

Request Queue & Auto-Retry

Persistent queue survives crashes, with configurable retry logic:

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3,
    requestHandlerTimeoutSecs: 60,
    maxConcurrency: 10,
    async requestHandler({ request }) { /* ... */ },
    async failedRequestHandler({ request }) {
        console.log(`Failed after retries: ${request.url}`);
    },
});

Dataset Storage

Structured data export without external dependencies:

import { Dataset } from 'crawlee';

// Save data
await Dataset.pushData({ title: 'Product A', price: '$29.99' });

// Export the whole dataset as JSON/CSV. The argument is a record key
// in the default key-value store, not a file path; the files end up
// under ./storage/key_value_stores/default/
const dataset = await Dataset.open();
await dataset.exportToJSON('OUTPUT');
await dataset.exportToCSV('OUTPUT');
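The stored data can also be read back programmatically, for example to post-process it before export:

```javascript
import { Dataset } from 'crawlee';

const dataset = await Dataset.open();
const { items, total } = await dataset.getData();

console.log(`Crawled ${total} records`);
for (const item of items) {
    // Each item is one object previously passed to pushData()
    console.log(item.title, item.price);
}
```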

AI/LLM Integration

Feed crawled data directly to AI pipelines:

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        // Extract clean text for LLM consumption
        $('nav, footer, script, style').remove();
        const cleanText = $('body').text().replace(/\s+/g, ' ').trim();

        await Dataset.pushData({
            url: request.url,
            content: cleanText,
            // Ready for RAG ingestion
        });
    },
});
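For RAG ingestion, the cleaned text is usually split into overlapping chunks before embedding. Crawlee does not ship a chunker; the helper below is a hypothetical sketch of the post-processing step that sits between Dataset output and a vector store:

```javascript
// Split text into overlapping chunks for embedding (hypothetical helper).
function chunkText(text, chunkSize = 1000, overlap = 200) {
    const chunks = [];
    let start = 0;
    while (start < text.length) {
        chunks.push(text.slice(start, start + chunkSize));
        if (start + chunkSize >= text.length) break;
        start += chunkSize - overlap; // step forward, keeping the overlap
    }
    return chunks;
}

const chunks = chunkText('a'.repeat(2500), 1000, 200);
console.log(chunks.length);    // 3
console.log(chunks[0].length); // 1000
```

The overlap keeps sentence fragments from being stranded at chunk boundaries, at the cost of some duplicated tokens per chunk.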

FAQ

Q: What is Crawlee? A: Crawlee is a Node.js/TypeScript web scraping and browser automation library by Apify with 22,600+ GitHub stars. It provides HTTP and browser-based crawlers with built-in proxy rotation, request queuing, and auto-retries for production use.

Q: How is Crawlee different from Puppeteer or Playwright alone? A: Crawlee adds production features on top of Puppeteer/Playwright: request queuing, automatic retries, proxy rotation, session management, and structured storage. Raw Puppeteer/Playwright are browser automation tools; Crawlee is a complete crawling framework.

Q: Is Crawlee free? A: Yes, fully free and open-source under Apache-2.0. Apify offers optional cloud hosting for running crawlers at scale, but the library itself is completely free.



Source and acknowledgments

Created by Apify. Licensed under Apache-2.0.

crawlee — ⭐ 22,600+

Thanks to the Apify team for building the most robust open-source web scraping framework for Node.js.

