Crawlee — Web Scraping and Browser Automation Library
Build reliable web scrapers in Node.js or Python. Crawlee handles proxy rotation, browser fingerprints, auto-scaling, and anti-bot bypassing out of the box.
What it is
Crawlee is a web scraping and browser automation library for Node.js and Python. It handles the hard parts of web scraping: proxy rotation, browser fingerprints, automatic retries, request queuing, auto-scaling, and anti-bot bypassing. Crawlee supports HTTP crawling (Cheerio/BeautifulSoup), headless browsers (Playwright/Puppeteer), and adaptive switching between modes.
Crawlee targets developers building production web scrapers who need reliability and scale. Instead of writing retry logic, proxy management, and fingerprint rotation from scratch, Crawlee provides these as built-in features.
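The crawler classes share the same requestHandler-based API, so switching between browser and HTTP crawling is mostly a one-line change. A minimal HTTP-mode sketch (the URL is a placeholder):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // $ is a Cheerio handle over the fetched HTML; no browser is launched
    async requestHandler({ $, request }) {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

await crawler.run(['https://example.com']);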
How it saves time or tokens
Building a reliable web scraper means handling rate limiting, CAPTCHAs, IP blocks, JavaScript rendering, and data extraction. Crawlee bundles all of these concerns into a single library. The auto-scaling feature adjusts concurrency based on system resources and target server response times. Proxy rotation and browser fingerprint management reduce blocks without custom code.
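Concurrency can also be bounded explicitly when the defaults are too aggressive for a target site. A sketch using Crawlee's documented crawler options; the numbers are illustrative, not recommendations:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // The autoscaled pool stays within these bounds while adjusting
    // concurrency to CPU, memory, and target response times.
    minConcurrency: 5,
    maxConcurrency: 50,
    // Hard cap on request rate, independent of concurrency.
    maxRequestsPerMinute: 120,
    async requestHandler({ request }) {
        console.log(`Fetched ${request.url}`);
    },
});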
How to use
- Create a new scraper:
npx crawlee create my-scraper
cd my-scraper
npm start
- Or install manually:
npm install crawlee playwright
- Python version:
pip install 'crawlee[playwright]'
Example
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 100,
    async requestHandler({ page, request, enqueueLinks }) {
        const title = await page.title();
        const price = await page.$eval('.price', (el) => el.textContent);
        console.log(`${request.url}: ${title} - ${price}`);

        // Follow pagination links
        await enqueueLinks({
            selector: '.pagination a',
            strategy: 'same-domain',
        });
    },
});

await crawler.run(['https://example.com/products']);
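To persist results rather than log them, recent Crawlee versions expose a pushData helper on the handler context; destructure it alongside page and request:

// Inside requestHandler: store one record per page in the default dataset
await pushData({ url: request.url, title, price });
// Records land under ./storage/datasets/default as JSON files.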
# Python version
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler

crawler = PlaywrightCrawler(max_requests_per_crawl=100)

@crawler.router.default_handler
async def handler(context):
    title = await context.page.title()
    context.log.info(f'{context.request.url}: {title}')
    await context.enqueue_links(strategy='same-domain')

asyncio.run(crawler.run(['https://example.com']))
Related on TokRepo
- Web Scraping Tools — Scraping and data extraction tools
- Browser Automation — Automate browser interactions
Crawlee is open-source (Apache 2.0), documented at crawlee.dev, and supported by the community through the official repository; releases follow semantic versioning.
For teams evaluating the library, the key advantage is that the repetitive infrastructure of scraping, including retries, request queuing, proxy rotation, and auto-scaling, comes built in rather than hand-rolled. That means less custom code to maintain, fewer integration points to manage, and faster iteration cycles.
Common pitfalls
- PlaywrightCrawler launches real browsers which consume significant memory; use CheerioCrawler for pages that do not require JavaScript rendering.
- Proxy rotation requires proxy URLs configured in the crawler options; Crawlee provides the rotation logic, not the proxies themselves (see the sketch after this list).
- Respect robots.txt and website terms of service; Crawlee provides the technical capability but compliance is your responsibility.
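A minimal proxy setup, assuming you already have proxy URLs from a provider (the URLs and credentials below are placeholders):

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Crawlee rotates through this list and ties proxies to sessions;
// the URLs themselves must come from your own provider.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://user:pass@proxy-1.example.com:8000',
        'http://user:pass@proxy-2.example.com:8000',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ page, request }) {
        console.log(`${request.url}: ${await page.title()}`);
    },
});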
Frequently Asked Questions
What is the difference between CheerioCrawler and PlaywrightCrawler?
CheerioCrawler makes plain HTTP requests and parses HTML with Cheerio (no browser). PlaywrightCrawler launches a headless browser for pages that require JavaScript rendering. Prefer CheerioCrawler when possible for better performance and lower resource usage.
Can Crawlee bypass anti-bot protection?
Yes. Crawlee includes browser fingerprint rotation, request header randomization, and session management to reduce detection. For advanced anti-bot systems, combine these with proxy rotation and human-like browsing patterns.
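These knobs are exposed as crawler options. A hedged sketch; defaults may vary between Crawlee versions:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // The session pool rotates identities and retires blocked sessions.
    useSessionPool: true,
    persistCookiesPerSession: true,
    browserPoolOptions: {
        // Generate realistic browser fingerprints (enabled by default).
        useFingerprints: true,
    },
    async requestHandler({ page, request }) {
        console.log(`${request.url}: ${await page.title()}`);
    },
});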
Does Crawlee support Python?
Yes. Crawlee has official Python support with the same features as the Node.js version. Install with pip install 'crawlee[playwright]' for browser-based scraping or pip install 'crawlee[beautifulsoup]' for HTTP scraping.
Can Crawlee handle large-scale crawls?
Yes. Crawlee includes auto-scaling that adjusts concurrency based on system resources and server response times. The request queue handles millions of URLs with automatic deduplication and retry logic.
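If the seed list is large, requests can be enqueued in batches before the run starts; a minimal sketch (URLs are placeholders):

// Batched enqueueing; duplicate URLs are skipped automatically.
await crawler.addRequests([
    'https://example.com/page/1',
    'https://example.com/page/2',
]);
await crawler.run();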
Is Crawlee free to use?
Yes. Crawlee is open-source under the Apache 2.0 license. It is developed by Apify but can be used independently, without an Apify account. Apify offers a managed platform for running Crawlee scrapers in the cloud.
Citations (3)
- Crawlee GitHub — Crawlee handles proxy rotation, fingerprints, and anti-bot bypassing
- Crawlee Documentation — Crawlee supports Node.js and Python with Playwright and Cheerio/BeautifulSoup
- Crawlee Official Site — Crawlee is open-source under Apache 2.0 by Apify
Source & Thanks
- GitHub: apify/crawlee (16k+ stars)
- Docs: crawlee.dev