Scripts · Apr 2, 2026 · 2 min read

Crawlee — Production Web Scraping for Node.js

Build reliable crawlers with automatic proxy rotation, request queuing, and browser automation. By Apify. 22K+ stars.

TL;DR
Crawlee is a Node.js web scraping library with proxy rotation, queuing, and adaptive crawling.
§01

What it is

Crawlee is a web scraping and browser automation library for Node.js built by Apify. It provides a unified interface for building production-grade crawlers using raw HTTP requests (Cheerio), headless browsers (Playwright or Puppeteer), or adaptive crawling that automatically switches between them.

Crawlee is designed for developers building data pipelines for AI and LLM systems, RAG applications, and training datasets. It handles proxy rotation, request queuing, automatic retries, and persistent storage so you can focus on data extraction logic.

§02

How it saves time or tokens

Crawlee eliminates boilerplate code for proxy management, retry logic, and request queuing that every production crawler needs. Its adaptive crawling mode automatically picks the cheapest method (raw HTTP with Cheerio) when JavaScript rendering is not needed, falling back to Playwright only when required. This reduces compute costs and speeds up crawls. The built-in request queue with deduplication prevents wasted requests on already-visited pages, and automatic fingerprint rotation reduces blocking rates.

§03

How to use

  1. Create a new crawler project with the CLI scaffolding tool:
npx crawlee create my-crawler
cd my-crawler
npm start
  2. Or add Crawlee to an existing project and write a crawler:
npm install crawlee playwright
  3. Define your crawler with a request handler:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        const title = await page.title();
        console.log(`${title} - ${request.url}`);
        await enqueueLinks({ globs: ['https://example.com/blog/**'] });
    },
});

await crawler.run(['https://example.com/blog']);
§04

Example

Extracting structured data from product pages with Cheerio (no browser needed):

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        const products = [];
        $('div.product-card').each((_, el) => {
            products.push({
                name: $(el).find('h2').text().trim(),
                price: $(el).find('.price').text().trim(),
                url: request.url,
            });
        });
        await Dataset.pushData(products);
    },
    maxRequestsPerCrawl: 100,
});

await crawler.run(['https://shop.example.com/products']);
§05

Common pitfalls

  • Using PlaywrightCrawler for every page wastes resources. Start with CheerioCrawler and only switch to browser-based crawling for JavaScript-heavy sites.
  • Not setting maxRequestsPerCrawl can cause runaway crawls that scrape far more pages than intended. Always set a limit during development.
  • Ignoring the built-in session pool leads to higher blocking rates. Enable session rotation when scraping sites with rate limits.

Frequently Asked Questions

What is the difference between CheerioCrawler and PlaywrightCrawler?

CheerioCrawler makes raw HTTP requests and parses HTML with Cheerio (jQuery-like). It is faster and uses less memory but cannot handle JavaScript-rendered content. PlaywrightCrawler runs a full headless browser, handling SPAs, dynamic content, and infinite scroll pages.

Does Crawlee handle proxy rotation automatically?

Yes. Crawlee has built-in proxy management that rotates proxies per request, handles proxy failures with automatic retries, and supports session-based proxy assignment. You provide a list of proxy URLs and Crawlee manages the rotation.

Can Crawlee be used to feed data into AI and LLM pipelines?

Yes. Crawlee is commonly used to build data ingestion pipelines for RAG systems, training datasets, and LLM context windows. The extracted data can be stored as JSON, pushed to a database, or piped directly into embedding workflows.

How does adaptive crawling work in Crawlee?

Adaptive crawling starts with CheerioCrawler (raw HTTP) and automatically detects when a page requires JavaScript rendering. It then switches to PlaywrightCrawler for those specific pages, keeping costs low while ensuring full coverage.

Is Crawlee related to Apify?

Crawlee is built and maintained by Apify. It works standalone as an open-source library but can also deploy to the Apify cloud platform for managed infrastructure, scheduling, and proxy pools.


Source & Thanks

Created by Apify. Licensed under Apache-2.0.

crawlee — ⭐ 22,600+

Thanks to the Apify team for building the most robust open-source web scraping framework for Node.js.
