# Crawlee — Web Scraping and Browser Automation Library

> Build reliable web scrapers in Node.js or Python. Crawlee handles proxy rotation, browser fingerprints, auto-scaling, and anti-bot bypassing out of the box.

## Quick Use

```bash
npx crawlee create my-scraper
cd my-scraper
npm start
```

Or in Python:

```bash
pip install crawlee[playwright]
```

## What is Crawlee?

Crawlee is a web scraping and browser automation library that handles the hard parts — proxy rotation, browser fingerprints, retries, auto-scaling, and storage — so you can focus on the extraction logic. Available for Node.js and Python.

**Answer-Ready**: Crawlee is a web scraping library for Node.js and Python that handles proxy rotation, browser fingerprints, auto-scaling, and anti-bot bypassing for reliable data extraction.

## Core Features

### 1. Multiple Crawler Types

```typescript
// HTTP crawler (fastest, for simple pages)
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ request, $ }) {
    const title = $('title').text();
    await Dataset.pushData({ url: request.url, title });
  },
});

await crawler.run(['https://example.com']);
```

```typescript
// Browser crawler (for JS-rendered pages)
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
  async requestHandler({ page }) {
    await page.waitForSelector('.product');
    const items = await page.$$eval('.product', els =>
      els.map(el => ({ name: el.textContent }))
    );
    await Dataset.pushData(items);
  },
});
```

### 2. Anti-Bot Features

Built-in fingerprint randomization and session management:

```typescript
const crawler = new PlaywrightCrawler({
  useSessionPool: true,
  sessionPoolOptions: { maxPoolSize: 100 },
  browserPoolOptions: {
    fingerprintOptions: {
      fingerprintGeneratorOptions: {
        browsers: ['chrome', 'firefox'],
      },
    },
  },
});
```

### 3. Proxy Rotation
```typescript
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    'http://proxy1:8080',
    'http://proxy2:8080',
  ],
});

const crawler = new CheerioCrawler({
  proxyConfiguration, // Automatically rotates per request
});
```

### 4. Auto-Scaling

Adjusts concurrency based on available system resources and the target site's responses:

```typescript
const crawler = new CheerioCrawler({
  minConcurrency: 1,
  maxConcurrency: 100, // Auto-scales between these limits
});
```

### 5. Built-in Storage

```typescript
import { Dataset, KeyValueStore, RequestQueue } from 'crawlee';

// Dataset for structured data
await Dataset.pushData({ title, price, url });
await Dataset.exportToCSV('results');

// Key-value store for files
await KeyValueStore.setValue('screenshot', buffer, { contentType: 'image/png' });

// Request queue for URLs
const requestQueue = await RequestQueue.open();
await requestQueue.addRequest({ url: 'https://...' });
```

## Python Version

```python
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()

@crawler.router.default_handler
async def handler(context: PlaywrightCrawlingContext) -> None:
    title = await context.page.title()
    await context.push_data({'title': title})

await crawler.run(['https://example.com'])
```

## FAQ

**Q: How does it compare to Scrapy?**
A: Crawlee has first-class browser support, built-in anti-bot features, and is available for both JavaScript and Python. Scrapy is Python-only and HTTP-focused.

**Q: Is it from the Apify team?**
A: Yes, Crawlee is open source and maintained by Apify. It can run standalone or deploy to the Apify cloud.

**Q: Can it handle SPAs?**
A: Yes, `PlaywrightCrawler` renders JavaScript and waits for dynamic content.

## Source & Thanks

- GitHub: [apify/crawlee](https://github.com/apify/crawlee) (16k+ stars)
- Docs: [crawlee.dev](https://crawlee.dev)
---

Source: https://tokrepo.com/en/workflows/8f2c0ae9-1327-481f-a519-d473751bdd76
Author: MCP Hub