Crawlee — Production Web Scraping for Node.js
Build reliable crawlers with automatic proxy rotation, request queuing, and browser automation. By Apify. 22K+ stars.
Safe staging for this asset
This asset is staged first. The copied prompt tells the agent to inspect the staged files and ask before activating scripts, MCP config, or global config.
npx -y tokrepo@latest install 3e8c6e91-e10e-45ba-9206-d6e3a9958c6a --target codex
Stages files first; activation requires review of the staged README and plan.
What it is
Crawlee is a web scraping and browser automation library for Node.js built by Apify. It provides a unified interface for building production-grade crawlers using raw HTTP requests (Cheerio), headless browsers (Playwright or Puppeteer), or adaptive crawling that automatically switches between them.
Crawlee is designed for developers building data pipelines for AI and LLM systems, RAG applications, and training datasets. It handles proxy rotation, request queuing, automatic retries, and persistent storage so you can focus on data extraction logic.
How it saves time or tokens
Crawlee eliminates boilerplate code for proxy management, retry logic, and request queuing that every production crawler needs. Its adaptive crawling mode automatically picks the cheapest method (raw HTTP with Cheerio) when JavaScript rendering is not needed, falling back to Playwright only when required. This reduces compute costs and speeds up crawls. The built-in request queue with deduplication prevents wasted requests on already-visited pages, and automatic fingerprint rotation reduces blocking rates.
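To make the deduplication idea concrete, here is a minimal conceptual sketch in plain JavaScript. It is not Crawlee's actual queue implementation; the names `normalizeUrl` and `DedupQueue` are hypothetical, and the normalization rules (drop the fragment, sort query parameters) are assumptions chosen for illustration.

```javascript
// Conceptual sketch only: NOT Crawlee's real request queue.
// `normalizeUrl` and `DedupQueue` are hypothetical names for illustration.

// Reduce a URL to a canonical key so trivially different URLs compare equal.
function normalizeUrl(rawUrl) {
  const url = new URL(rawUrl);
  url.hash = '';            // the fragment does not change the fetched document
  url.searchParams.sort();  // make query-parameter order irrelevant
  return url.toString();
}

class DedupQueue {
  constructor() {
    this.seen = new Set(); // keys of every URL ever enqueued
    this.pending = [];     // normalized URLs still waiting to be fetched
  }

  // Returns true if the URL was enqueued, false if it was already seen.
  add(rawUrl) {
    const key = normalizeUrl(rawUrl);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.pending.push(key);
    return true;
  }

  // Returns the next URL to fetch, or undefined when the queue is empty.
  next() {
    return this.pending.shift();
  }
}

const queue = new DedupQueue();
queue.add('https://example.com/blog?page=2&sort=new');
// Same page with reordered parameters and a fragment: skipped as a duplicate.
queue.add('https://example.com/blog?sort=new&page=2#comments');
```

Because the `seen` set is kept separately from the `pending` list, a URL stays deduplicated even after it has been fetched and removed from the queue.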
How to use
- Create a new crawler project with the CLI scaffolding tool:
    npx crawlee create my-crawler
    cd my-crawler
    npm start
- Or add Crawlee to an existing project and write a crawler:
    npm install crawlee playwright
- Define your crawler with a request handler:
    import { PlaywrightCrawler } from 'crawlee';

    const crawler = new PlaywrightCrawler({
        // Called once for every page the crawler opens.
        async requestHandler({ request, page, enqueueLinks }) {
            const title = await page.title();
            console.log(`${title} - ${request.url}`);
            // Queue links on this page that fall under the blog section.
            await enqueueLinks({ globs: ['https://example.com/blog/**'] });
        },
    });

    // Top-level await requires an ES module (for example, a .mjs file).
    await crawler.run(['https://example.com/blog']);
Example
Extracting structured data from product pages with Cheerio (no browser needed):
    import { CheerioCrawler, Dataset } from 'crawlee';

    const crawler = new CheerioCrawler({
        // `$` is the Cheerio handle for the parsed HTML of the current page.
        async requestHandler({ request, $ }) {
            const products = [];
            $('div.product-card').each((_, el) => {
                products.push({
                    name: $(el).find('h2').text().trim(),
                    price: $(el).find('.price').text().trim(),
                    url: request.url,
                });
            });
            // Store every product found on this page in the default dataset.
            await Dataset.pushData(products);
        },
        // Safety limit: stop after 100 requests.
        maxRequestsPerCrawl: 100,
    });

    await crawler.run(['https://shop.example.com/products']);
Related on TokRepo
- Web scraping tools — More web scraping and data extraction tools curated on TokRepo.
- Automation tools — Browse automation frameworks for data pipelines and workflows.
Common pitfalls
- Using PlaywrightCrawler for every page wastes resources. Start with CheerioCrawler and only switch to browser-based crawling for JavaScript-heavy sites.
- Not setting maxRequestsPerCrawl can cause runaway crawls that scrape far more pages than intended. Always set a limit during development.
- Ignoring the built-in session pool leads to higher blocking rates. Enable session rotation when scraping sites with rate limits.
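The last two pitfalls can be handled in one options object. The sketch below is a hedged illustration: it assumes the crawler option names `maxRequestsPerCrawl`, `maxRequestRetries`, `useSessionPool`, and `sessionPoolOptions.maxPoolSize`, while the helper `buildCrawlerOptions` and every numeric limit are hypothetical values picked for the example, not recommendations from the Crawlee project.

```javascript
// Hypothetical helper: builds an options object that guards against the
// pitfalls above. The numeric limits are illustrative assumptions.
function buildCrawlerOptions({ isDevelopment }) {
  return {
    // Hard cap on requests; kept small in development to avoid runaway crawls.
    maxRequestsPerCrawl: isDevelopment ? 50 : 5000,
    // How many times a failed request is retried before it is given up.
    maxRequestRetries: 3,
    // Rotate sessions so rate-limited sites see traffic spread across them.
    useSessionPool: true,
    sessionPoolOptions: { maxPoolSize: isDevelopment ? 5 : 50 },
  };
}

const devOptions = buildCrawlerOptions({ isDevelopment: true });

// In a project with Crawlee installed, the object would be spread into the
// crawler constructor next to the request handler, for example:
//   new CheerioCrawler({ ...devOptions, requestHandler });
```

Keeping the limits in one function makes it harder to ship a development crawl without a request cap.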
Frequently Asked Questions
What is the difference between CheerioCrawler and PlaywrightCrawler?
CheerioCrawler makes raw HTTP requests and parses the HTML with Cheerio, which offers a jQuery-like API. It is faster and uses less memory but cannot handle JavaScript-rendered content. PlaywrightCrawler runs a full headless browser, so it can handle SPAs, dynamic content, and infinite-scroll pages.
Does Crawlee support proxy rotation?
Yes. Crawlee has built-in proxy management that rotates proxies per request, handles proxy failures with automatic retries, and supports session-based proxy assignment. You provide a list of proxy URLs and Crawlee manages the rotation.
Can Crawlee be used to collect data for AI and LLM applications?
Yes. Crawlee is commonly used to build data ingestion pipelines for RAG systems, training datasets, and LLM context windows. The extracted data can be stored as JSON, pushed to a database, or piped directly into embedding workflows.
What is adaptive crawling?
Adaptive crawling starts with raw HTTP requests, as CheerioCrawler does, and automatically detects when a page requires JavaScript rendering. It then switches to a Playwright browser for those specific pages, keeping costs low while ensuring full coverage.
Who maintains Crawlee, and does it require the Apify platform?
Crawlee is built and maintained by Apify. It works standalone as an open-source library, and crawlers can also be deployed to the Apify cloud platform for managed infrastructure, scheduling, and proxy pools.
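The per-request and session-based proxy assignment described in the proxy answer above can be pictured with a small round-robin sketch. This is a conceptual illustration in plain JavaScript, not Crawlee's internals: the class `ProxyRotator`, its method `newUrl`, and the proxy URLs are all hypothetical.

```javascript
// Conceptual sketch of round-robin proxy rotation; NOT Crawlee's internals.
// `ProxyRotator` and the proxy URLs below are hypothetical.
class ProxyRotator {
  constructor(proxyUrls) {
    if (proxyUrls.length === 0) {
      throw new Error('At least one proxy URL is required');
    }
    this.proxyUrls = [...proxyUrls];
    this.index = 0;             // position of the next proxy to hand out
    this.bySession = new Map(); // session id -> proxy pinned to that session
  }

  // Without a session id, every call advances to the next proxy.
  // With a session id, the same proxy is returned for that session each time.
  newUrl(sessionId) {
    if (sessionId !== undefined && this.bySession.has(sessionId)) {
      return this.bySession.get(sessionId);
    }
    const proxyUrl = this.proxyUrls[this.index];
    this.index = (this.index + 1) % this.proxyUrls.length;
    if (sessionId !== undefined) this.bySession.set(sessionId, proxyUrl);
    return proxyUrl;
  }
}

const rotator = new ProxyRotator([
  'http://proxy-1.example.com:8000',
  'http://proxy-2.example.com:8000',
]);
const firstProxy = rotator.newUrl();            // proxy-1
const secondProxy = rotator.newUrl();           // proxy-2
const pinnedProxy = rotator.newUrl('session-a'); // wraps around to proxy-1
```

Pinning a proxy to a session keeps cookies and IP address consistent for a site that tracks both, which is the motivation for session-based assignment. In Crawlee itself, the list of proxy URLs is supplied through a `ProxyConfiguration` object passed to the crawler rather than through a hand-written rotator like this one.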