Crawlee — Production Web Scraping for Node.js
Build reliable crawlers with automatic proxy rotation, request queuing, and browser automation. By Apify. 22K+ stars.
What it is
Crawlee is a web scraping and browser automation library for Node.js built by Apify. It provides a unified interface for building production-grade crawlers using raw HTTP requests (Cheerio), headless browsers (Playwright or Puppeteer), or adaptive crawling that automatically switches between them.
Crawlee is designed for developers building data pipelines for AI and LLM systems, RAG applications, and training datasets. It handles proxy rotation, request queuing, automatic retries, and persistent storage so you can focus on data extraction logic.
How it saves time or tokens
Crawlee eliminates boilerplate code for proxy management, retry logic, and request queuing that every production crawler needs. Its adaptive crawling mode automatically picks the cheapest method (raw HTTP with Cheerio) when JavaScript rendering is not needed, falling back to Playwright only when required. This reduces compute costs and speeds up crawls. The built-in request queue with deduplication prevents wasted requests on already-visited pages, and automatic fingerprint rotation reduces blocking rates.
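The deduplication behavior can be pictured with a minimal model. This is a toy sketch, not Crawlee's actual RequestQueue implementation (which also persists state and handles retries): each URL is normalized into a unique key, and a request is enqueued only if that key has not been seen before.

```javascript
// Toy model of request-queue deduplication (illustrative only; Crawlee's
// real RequestQueue also persists state to disk and manages retries).
class ToyRequestQueue {
  constructor() {
    this.seen = new Set();   // unique keys of every URL ever enqueued
    this.pending = [];       // URLs waiting to be crawled
  }

  // Normalize the URL so trivial variants map to the same unique key.
  uniqueKey(url) {
    const u = new URL(url);
    u.hash = '';                       // ignore fragments
    return u.href.replace(/\/$/, '');  // ignore a trailing slash
  }

  // Returns true if the request was enqueued, false if it was a duplicate.
  addRequest(url) {
    const key = this.uniqueKey(url);
    if (this.seen.has(key)) return false; // duplicate: no wasted request
    this.seen.add(key);
    this.pending.push(url);
    return true;
  }
}

const queue = new ToyRequestQueue();
console.log(queue.addRequest('https://example.com/blog'));     // true  (new)
console.log(queue.addRequest('https://example.com/blog/'));    // false (trailing-slash duplicate)
console.log(queue.addRequest('https://example.com/blog#top')); // false (fragment duplicate)
```

On a real crawl, this is why re-enqueueing links discovered on every page does not multiply the number of requests: duplicates are filtered before they ever hit the network.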
How to use
- Create a new crawler project with the CLI scaffolding tool:
npx crawlee create my-crawler
cd my-crawler
npm start
- Or add Crawlee to an existing project and write a crawler:
npm install crawlee playwright
- Define your crawler with a request handler:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        const title = await page.title();
        console.log(`${title} - ${request.url}`);
        await enqueueLinks({ globs: ['https://example.com/blog/**'] });
    },
});

await crawler.run(['https://example.com/blog']);
Example
Extracting structured data from product pages with Cheerio (no browser needed):
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        const products = [];
        $('div.product-card').each((_, el) => {
            products.push({
                name: $(el).find('h2').text().trim(),
                price: $(el).find('.price').text().trim(),
                url: request.url,
            });
        });
        await Dataset.pushData(products);
    },
    maxRequestsPerCrawl: 100,
});

await crawler.run(['https://shop.example.com/products']);
Related on TokRepo
- Web scraping tools — More web scraping and data extraction tools curated on TokRepo.
- Automation tools — Browse automation frameworks for data pipelines and workflows.
Common pitfalls
- Using PlaywrightCrawler for every page wastes resources. Start with CheerioCrawler and only switch to browser-based crawling for JavaScript-heavy sites.
- Not setting maxRequestsPerCrawl can cause runaway crawls that scrape far more pages than intended. Always set a limit during development.
- Ignoring the built-in session pool leads to higher blocking rates. Enable session rotation when scraping sites with rate limits.
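For the session-pool pitfall above, a hedged sketch of the options involved. The option names below (useSessionPool, persistCookiesPerSession, sessionPoolOptions, maxRequestsPerCrawl) are taken from Crawlee's documented crawler options, but verify them against the current docs before relying on them; the object is shown standalone here rather than passed to a crawler.

```javascript
// Options object as it would be passed to e.g. new CheerioCrawler(options).
// Option names assumed from Crawlee's documented API; shown standalone here.
const options = {
  maxRequestsPerCrawl: 100,        // hard cap: prevents runaway crawls
  useSessionPool: true,            // rotate sessions to spread requests across identities
  persistCookiesPerSession: true,  // keep cookies consistent within each session
  sessionPoolOptions: {
    maxPoolSize: 20,               // number of sessions to rotate through
  },
};

console.log(Object.keys(options).length); // 4 top-level options
```

Pairing session rotation with a request cap covers both the runaway-crawl and the rate-limit pitfalls in one configuration.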
Frequently Asked Questions
What is the difference between CheerioCrawler and PlaywrightCrawler?
CheerioCrawler makes raw HTTP requests and parses HTML with Cheerio (jQuery-like). It is faster and uses less memory but cannot handle JavaScript-rendered content. PlaywrightCrawler runs a full headless browser, handling SPAs, dynamic content, and infinite scroll pages.
Does Crawlee handle proxy rotation?
Yes. Crawlee has built-in proxy management that rotates proxies per request, handles proxy failures with automatic retries, and supports session-based proxy assignment. You provide a list of proxy URLs and Crawlee manages the rotation.
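The rotation itself can be modeled as a round-robin over the proxy list with sticky per-session assignment. This is a toy model, not Crawlee's actual ProxyConfiguration (which also supports tiered proxies and failure tracking); the class and method names here are illustrative only.

```javascript
// Toy round-robin proxy rotation with sticky per-session assignment.
// Illustrative model only; not Crawlee's ProxyConfiguration implementation.
class ToyProxyRotator {
  constructor(proxyUrls) {
    this.proxyUrls = proxyUrls;
    this.next = 0;
    this.bySession = new Map(); // sessionId -> proxy (sticky assignment)
  }

  newUrl(sessionId) {
    // A session keeps its proxy so its cookies and IP stay consistent.
    if (sessionId && this.bySession.has(sessionId)) {
      return this.bySession.get(sessionId);
    }
    const proxy = this.proxyUrls[this.next % this.proxyUrls.length];
    this.next += 1;
    if (sessionId) this.bySession.set(sessionId, proxy);
    return proxy;
  }
}

const rotator = new ToyProxyRotator(['http://p1:8000', 'http://p2:8000']);
console.log(rotator.newUrl());     // http://p1:8000
console.log(rotator.newUrl());     // http://p2:8000
console.log(rotator.newUrl('s1')); // http://p1:8000 (assigned to session s1)
console.log(rotator.newUrl('s1')); // http://p1:8000 (sticky: same session, same proxy)
```

Sticky assignment is what makes session-based proxying useful against rate limits: a site sees one consistent identity per session rather than a different IP on every request.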
Can Crawlee be used to build datasets for AI and RAG pipelines?
Yes. Crawlee is commonly used to build data ingestion pipelines for RAG systems, training datasets, and LLM context windows. The extracted data can be stored as JSON, pushed to a database, or piped directly into embedding workflows.
How does adaptive crawling work?
Adaptive crawling starts with CheerioCrawler (raw HTTP) and automatically detects when a page requires JavaScript rendering. It then switches to PlaywrightCrawler for those specific pages, keeping costs low while ensuring full coverage.
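One way to picture the switch, as a simplified heuristic rather than Crawlee's actual detection logic: try the cheap HTTP fetch first, and escalate to a browser only when the static HTML lacks the content the handler needs. The function and parameter names below (crawlAdaptively, fetchStatic, renderWithBrowser, hasContent) are stand-ins invented for this sketch.

```javascript
// Toy adaptive dispatch: prefer cheap HTTP parsing, escalate to a browser
// only when a page-level check says the static HTML is insufficient.
// fetchStatic / renderWithBrowser are stand-ins for real fetch/Playwright calls.
function crawlAdaptively(page, { fetchStatic, renderWithBrowser, hasContent }) {
  const staticHtml = fetchStatic(page); // cheap: raw HTTP + Cheerio-style parse
  if (hasContent(staticHtml)) {
    return { method: 'http', html: staticHtml };
  }
  // Static HTML was an empty shell -> the page needs JavaScript rendering.
  return { method: 'browser', html: renderWithBrowser(page) };
}

// Example: an SPA shell whose content only appears after rendering.
const result = crawlAdaptively('/products', {
  fetchStatic: () => '<div id="app"></div>',               // shell, no content
  renderWithBrowser: () => '<div id="app">42 items</div>', // rendered content
  hasContent: (html) => html.includes('items'),
});
console.log(result.method); // "browser"
```

A server-rendered page would pass the hasContent check on the first try and never pay the browser cost, which is the whole economy of adaptive crawling.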
Who maintains Crawlee, and does it require the Apify platform?
Crawlee is built and maintained by Apify. It works standalone as an open-source library but can also deploy to the Apify cloud platform for managed infrastructure, scheduling, and proxy pools.
Citations (3)
- Crawlee GitHub — Crawlee is built by Apify for production web scraping
- Playwright Documentation — Playwright browser automation framework
- Cheerio GitHub — Cheerio HTML parsing library for Node.js