Open-source web crawler that outputs clean Markdown for AI. Structured extraction, browser automation, anti-bot handling. 63K+ stars.
TokRepo Picks · Community
## Quick Start
Try it first, then decide whether to dig deeper.
```bash
pip install -U crawl4ai
crawl4ai-setup # installs Playwright browsers
```
```python
import asyncio
from crawl4ai import AsyncWebCrawler


async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)  # Clean markdown output


asyncio.run(main())
```
Or use the REST API: `crawl4ai-server` → `POST http://localhost:11235/crawl`
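For callers that prefer plain HTTP, the request can be sketched with the Python standard library alone. This is a minimal sketch, not the server's documented client: the endpoint comes from the line above, but the JSON body shape (`{"urls": [...]}`) and the helper names `build_crawl_request` and `send` are assumptions made for illustration, so check the API docs of the server version you run.

```python
import json
import urllib.request

# Assumption: the server from the line above is listening on localhost:11235.
CRAWL_ENDPOINT = "http://localhost:11235/crawl"


def build_crawl_request(page_url: str, endpoint: str = CRAWL_ENDPOINT) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the crawl endpoint."""
    # Assumption: the body is a JSON object with a "urls" list; the exact
    # schema may differ between server versions.
    payload = json.dumps({"urls": [page_url]}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `send(build_crawl_request("https://example.com"))` would return the decoded JSON response. Keeping the builder separate from the network call makes the payload easy to inspect or adjust before anything is sent.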
## Introduction
Crawl4AI is an **open-source web crawler purpose-built for AI and LLM applications**. Unlike traditional scrapers that output raw HTML, Crawl4AI converts web pages into clean, structured Markdown optimized for feeding into language models.
Core capabilities:
- **LLM-Optimized Output** — Converts any web page into clean Markdown with proper headings, lists, code blocks, and links preserved. Strips ads, navigation, and boilerplate automatically
- **Structured Data Extraction** — Define JSON schemas and extract structured data from pages using LLMs or CSS/XPath selectors
- **Browser Automation** — Built on Playwright for JavaScript-rendered pages. Handle infinite scroll, click-to-expand, and dynamic content
- **Anti-Bot Protection** — Automatic proxy rotation, stealth mode, CAPTCHA handling, and human-like browsing patterns
- **Batch Crawling** — Crawl thousands of pages concurrently with configurable rate limiting and session management
- **Media Extraction** — Download images, videos, and files alongside text content
- **Docker Deployment** — Production-ready Docker image with REST API for team and pipeline use
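The batch-crawling idea in the list above (many pages in flight, but with a cap) can be illustrated with plain `asyncio`. This is a sketch of the general bounded-concurrency pattern, not Crawl4AI's internal implementation; `fake_fetch`, `crawl_many`, and `run_demo` are hypothetical names, and `fake_fetch` only simulates a crawl with a short sleep.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a real crawl call; does no network I/O."""
    await asyncio.sleep(0.01)  # simulate waiting on the target site
    return f"# Markdown for {url}"


async def crawl_many(urls, max_concurrency: int = 5):
    """Fetch all URLs concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> str:
        async with semaphore:  # waits here while the limit is reached
            return await fake_fetch(url)

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(fetch_one(u) for u in urls))


def run_demo():
    urls = [f"https://example.com/page/{i}" for i in range(20)]
    return asyncio.run(crawl_many(urls, max_concurrency=5))
```

In a real pipeline, `fake_fetch` would be replaced by a call into the crawler, and `max_concurrency` would be tuned to what the target sites tolerate.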
Crawl4AI has 63,000+ GitHub stars, making it one of the most popular open-source web crawlers for AI applications.
## FAQ
**Q: How is Crawl4AI different from BeautifulSoup or Scrapy?**
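A: BeautifulSoup is an HTML parsing library and Scrapy is a crawling framework. Both leave you with HTML or hand-selected fields that need further cleaning before they are useful to an LLM, and neither renders JavaScript on its own. Crawl4AI outputs clean Markdown directly, handles JavaScript-rendered pages, and includes built-in LLM extraction capabilities. It's designed specifically for the AI/LLM use case.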
**Q: Can it handle JavaScript-heavy single-page apps?**
A: Yes. Crawl4AI uses Playwright under the hood, so it fully renders JavaScript before extracting content. You can also wait for specific elements, scroll pages, and interact with dynamic content.
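Waiting for an element and scrolling might look like the sketch below. It assumes the `CrawlerRunConfig` options `wait_for`, `js_code`, and `page_timeout` behave as described in the Crawl4AI docs for your installed version; the CSS selector and the helper names `dynamic_page_options` and `crawl_dynamic_page` are hypothetical and chosen only for illustration.

```python
def dynamic_page_options() -> dict:
    """Keyword arguments for a run config aimed at a JavaScript-heavy page."""
    return {
        # Wait until an element matching this (hypothetical) selector exists.
        "wait_for": "css:.article-list .item",
        # Scroll to the bottom once to trigger lazy loading.
        "js_code": "window.scrollTo(0, document.body.scrollHeight);",
        # Give slow pages up to 60 seconds (value in milliseconds).
        "page_timeout": 60000,
    }


async def crawl_dynamic_page(url: str) -> str:
    """Render the page with the options above and return its Markdown."""
    # Imported lazily so the options helper works without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

    run_config = CrawlerRunConfig(**dynamic_page_options())
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=run_config)
        return str(result.markdown)
```

To try it, run `asyncio.run(crawl_dynamic_page("https://example.com"))` after replacing the selector with one that exists on the target page.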
**Q: Is it free for commercial use?**
A: Yes, it's Apache 2.0 licensed. Fully free for personal and commercial use.
**Q: How fast is it?**
A: With async crawling, Crawl4AI can process 100+ pages per minute depending on target site response times. The concurrent architecture makes it significantly faster than sequential scrapers.
## Works With
- Python 3.9+ with async/await
- Playwright for browser rendering
- OpenAI / Anthropic / local LLMs for structured extraction
- Docker for production deployment
- REST API for integration with any language
## Source & Thanks
- GitHub: [unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- License: Apache 2.0
- Stars: 63,000+
- Maintainer: Unclecode & Crawl4AI community
Thanks to Unclecode for building the go-to web crawler for the AI era, solving the critical problem of converting messy web content into clean, LLM-ready data.