# Crawlee Crawler Types & Features

## Four Crawler Types
| Crawler | Engine | Best For | Speed |
|---|---|---|---|
| CheerioCrawler | HTTP + Cheerio | Static HTML pages | Fastest |
| PlaywrightCrawler | Playwright browser | JavaScript-heavy SPAs | Medium |
| PuppeteerCrawler | Puppeteer browser | Chrome-specific features | Medium |
| AdaptivePlaywrightCrawler | Auto-switching (HTTP/browser) | Mixed content sites | Adaptive |
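The adaptive crawler's trick is deciding, per page, whether a plain HTTP request is enough or a real browser is needed. The gist can be sketched in a few lines of stand-alone TypeScript (a hypothetical heuristic, not Crawlee's actual detection logic):

```typescript
// Try cheap HTTP-only parsing first; fall back to a browser when the static
// HTML clearly lacks the content we need. Purely illustrative.
type FetchMode = 'http' | 'browser';

function chooseMode(staticHtml: string, requiredMarker: string): FetchMode {
  // If the marker is already in the raw HTML, the page is server-rendered
  // and an HTTP crawler suffices; otherwise assume client-side rendering.
  return staticHtml.includes(requiredMarker) ? 'http' : 'browser';
}

// Server-rendered product page: the price is already in the HTML.
console.log(chooseMode('<div class="price">$10</div>', 'class="price"')); // http

// SPA shell: the content only appears after client-side rendering.
console.log(chooseMode('<div id="root"></div>', 'class="price"')); // browser
```

Crawlee's real detection is more sophisticated than a marker check, but the payoff is the same: you only pay for browser sessions on pages that need them.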
## CheerioCrawler (Fast HTTP)

For static pages, with no browser overhead:

```typescript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ request, $ }) {
    const title = $('h1').text();
    const prices = $('span.price')
      .map((_, el) => $(el).text())
      .get();
    await Dataset.pushData({ url: request.url, title, prices });
  },
});

await crawler.run(['https://example.com']);
```

## PlaywrightCrawler (Browser)
For JavaScript-rendered content:

```typescript
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
  headless: true,
  async requestHandler({ page, request }) {
    // Wait for dynamic content
    await page.waitForSelector('.product-list');

    // Scroll to trigger lazy loading
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitForTimeout(2000);

    const products = await page.$$eval('.product', (items) =>
      items.map((item) => ({
        name: item.querySelector('.name')?.textContent,
        price: item.querySelector('.price')?.textContent,
      }))
    );
    await Dataset.pushData({ url: request.url, products });
  },
});
```

## Proxy Rotation
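At its simplest, rotation just cycles outgoing requests through a pool of proxy URLs. A stand-alone sketch of that idea (illustrative only, not Crawlee's internals):

```typescript
// Toy round-robin rotation over a proxy pool. Crawlee's ProxyConfiguration
// handles this for you, along with session-to-proxy affinity.
class ProxyPool {
  private index = 0;
  constructor(private readonly urls: string[]) {}

  next(): string {
    const url = this.urls[this.index % this.urls.length];
    this.index += 1;
    return url;
  }
}

const pool = new ProxyPool(['http://proxy1:8080', 'http://proxy2:8080']);
console.log(pool.next()); // http://proxy1:8080
console.log(pool.next()); // http://proxy2:8080
console.log(pool.next()); // http://proxy1:8080 (wraps around)
```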
Built-in proxy management with session persistence:

```typescript
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    'http://proxy1:8080',
    'http://proxy2:8080',
    'http://proxy3:8080',
  ],
});

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  sessionPoolOptions: { maxPoolSize: 100 },
});
```

## Request Queue & Auto-Retry
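The retry contract is simple: a request is attempted once, then retried up to `maxRequestRetries` more times before being handed to `failedRequestHandler`. A minimal stand-alone sketch of those semantics (not Crawlee's implementation):

```typescript
// Attempt fn once, then retry up to maxRetries more times.
// Mirrors the maxRequestRetries contract; illustrative only.
function withRetries<T>(fn: () => T, maxRetries: number): T {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return fn();
    } catch (err) {
      lastError = err; // this attempt failed; retry if budget remains
    }
  }
  // Budget exhausted: in Crawlee, this is where failedRequestHandler runs.
  throw lastError;
}

// Succeeds on the third attempt (the first two throw).
let calls = 0;
const result = withRetries(() => {
  calls += 1;
  if (calls < 3) throw new Error('flaky');
  return 'ok';
}, 3);
console.log(result, calls); // ok 3
```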
The request queue is persisted to disk, so a crawl survives crashes and restarts; retry behavior is configurable:

```typescript
const crawler = new PlaywrightCrawler({
  maxRequestRetries: 3,
  requestHandlerTimeoutSecs: 60,
  maxConcurrency: 10,
  async requestHandler({ request }) { /* ... */ },
  async failedRequestHandler({ request }) {
    console.log(`Failed after retries: ${request.url}`);
  },
});
```

## Dataset Storage
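Conceptually, a dataset is just an append-only list of JSON records. A toy in-memory stand-in (hypothetical class, not the real API) shows the shape of the abstraction:

```typescript
// Toy stand-in for Crawlee's Dataset: records are only ever appended, and
// the whole collection can be serialized for export. Illustrative only.
class ToyDataset {
  private records: Record<string, unknown>[] = [];

  pushData(item: Record<string, unknown>): void {
    this.records.push(item);
  }

  toJSON(): string {
    return JSON.stringify(this.records, null, 2);
  }
}

const ds = new ToyDataset();
ds.pushData({ title: 'Product A', price: '$29.99' });
ds.pushData({ title: 'Product B', price: '$12.50' });
console.log(JSON.parse(ds.toJSON()).length); // 2
```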
Structured data export without external dependencies:

```typescript
import { Dataset } from 'crawlee';

// Save data
await Dataset.pushData({ title: 'Product A', price: '$29.99' });

// Export to JSON/CSV
const dataset = await Dataset.open();
await dataset.exportToJSON('output.json');
await dataset.exportToCSV('output.csv');
```

## AI/LLM Integration
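The core preprocessing step is stripping boilerplate elements and then collapsing whitespace. That second step as a stand-alone helper (hypothetical name, extracted for clarity):

```typescript
// Collapse runs of whitespace (newlines, tabs, multiple spaces) into single
// spaces and trim the ends, producing compact text for LLM ingestion.
function normalizeText(raw: string): string {
  return raw.replace(/\s+/g, ' ').trim();
}

console.log(normalizeText('  Product\n\n  Details\t here  '));
// "Product Details here"
```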
Feed crawled data directly to AI pipelines:

```typescript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ $, request }) {
    // Strip boilerplate, then extract clean text for LLM consumption
    $('nav, footer, script, style').remove();
    const cleanText = $('body').text().replace(/\s+/g, ' ').trim();

    await Dataset.pushData({
      url: request.url,
      content: cleanText, // ready for RAG ingestion
    });
  },
});
```

## FAQ
**Q: What is Crawlee?**
A: Crawlee is a Node.js/TypeScript web scraping and browser automation library by Apify with 22,600+ GitHub stars. It provides HTTP- and browser-based crawlers with built-in proxy rotation, request queuing, and auto-retries for production use.

**Q: How is Crawlee different from Puppeteer or Playwright alone?**
A: Crawlee adds production features on top of Puppeteer/Playwright: request queuing, automatic retries, proxy rotation, session management, and structured storage. Raw Puppeteer/Playwright are browser automation tools; Crawlee is a complete crawling framework.

**Q: Is Crawlee free?**
A: Yes, it is fully free and open source under the Apache-2.0 license. Apify offers optional cloud hosting for running crawlers at scale, but the library itself is completely free.