# Crawl4AI 0.5 — Async LLM-Friendly Web Crawler

> Crawl4AI 0.5 is an async Python crawler built for RAG. It outputs clean markdown with no manual HTML cleanup, and offers adaptive crawling, JS rendering, and the AsyncWebCrawler API. 30K+ GitHub stars.

## Quick Use

1. `pip install crawl4ai`
2. Run setup once: `crawl4ai-setup` (installs Playwright browsers)
3. Use the AsyncWebCrawler snippet below in your Python script

---

## Intro

Crawl4AI is an LLM-first async web crawler: give it a URL and get back clean markdown ready to drop into a RAG pipeline. Version 0.5 adds adaptive crawling (the crawler knows when to stop), session-based crawling for SPAs, and a Memory-Adaptive Dispatcher that scales to thousands of URLs without exhausting RAM.

- Best for: RAG pipelines, knowledge-base ingestion, and agents that need fresh web content
- Works with: Python 3.10+, Playwright
- Setup time: about 2 minutes (`pip install crawl4ai && crawl4ai-setup`)

---

### Hello world

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://news.ycombinator.com")
        print(result.markdown)  # clean markdown, no HTML

asyncio.run(main())
```

### Adaptive crawling

The 0.5 release added adaptive strategies: the crawler decides when it has gathered enough to answer the query and stops, instead of always crawling a fixed N pages.

```python
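import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig

config = AdaptiveConfig(
    confidence_threshold=0.85,  # stop once the confidence score reaches 0.85
    max_pages=50,               # hard cap on pages crawled
)

async def main():
    # `async with` must run inside a coroutine; AdaptiveCrawler wraps an open AsyncWebCrawler
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler, config)
        result = await adaptive.digest(
            start_url="https://docs.python.org",
            query="How does the asyncio event loop dispatch coroutines?",
        )
        # Pages ranked by relevance to the query
        relevant = adaptive.get_relevant_content(top_k=5)

asyncio.run(main())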
```
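To make the "stop when confident" idea concrete, here is a minimal standard-library sketch of a threshold-based stopping rule. It is not Crawl4AI's scoring logic: the function name `crawl_until_confident` and the toy confidence measure (the fraction of query terms seen so far) are assumptions made purely for illustration.

```python
from typing import Iterable


def crawl_until_confident(
    pages: Iterable[str],
    query_terms: set[str],
    confidence_threshold: float = 0.85,
    max_pages: int = 50,
) -> tuple[list[str], float]:
    """Visit pages until a toy confidence score reaches the threshold.

    Hypothetical helper: confidence here is simply the share of query
    terms that have appeared in the pages visited so far.
    """
    if not query_terms:
        raise ValueError("query_terms must not be empty")

    covered: set[str] = set()
    visited: list[str] = []
    confidence = 0.0

    for text in pages:
        # Stop early once we are confident enough or have hit the page cap
        if confidence >= confidence_threshold or len(visited) >= max_pages:
            break
        visited.append(text)
        covered |= query_terms & set(text.lower().split())
        confidence = len(covered) / len(query_terms)

    return visited, confidence


pages = ["asyncio event loop", "coroutines dispatch", "unrelated page", "another page"]
visited, confidence = crawl_until_confident(
    pages, {"asyncio", "event", "loop", "dispatch"}
)
print(len(visited), confidence)
```

The two stop conditions mirror the `confidence_threshold` and `max_pages` settings in the config above: whichever is reached first ends the crawl.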

### Memory-Adaptive Dispatcher (1000s of URLs)

```python
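import asyncio
from crawl4ai import AsyncWebCrawler, MemoryAdaptiveDispatcher, CrawlerMonitor

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,  # pause new launches above 70% RAM usage
    monitor=CrawlerMonitor(),       # live progress display
)

async def main(urls):
    # `async with` must run inside a coroutine
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,  # list of 5000+ URLs
            dispatcher=dispatcher,
        )
    return results

urls = ["https://example.com"]  # placeholder; replace with your own list
results = asyncio.run(main(urls))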
```

When RAM usage reaches 70%, the dispatcher pauses new launches until memory frees up, which helps avoid out-of-memory crashes on long crawls.
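The pause-and-resume behaviour described above can be sketched with plain `asyncio`. This is an illustration of the idea only, not the library's dispatcher: `dispatch_with_memory_gate`, the injected `memory_percent` callable, and the `fetch` coroutine are all hypothetical names. In a real script, `memory_percent` could be something like `lambda: psutil.virtual_memory().percent`.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def dispatch_with_memory_gate(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[str]],
    memory_percent: Callable[[], float],
    threshold: float = 70.0,
    poll_interval: float = 0.01,
    max_concurrency: int = 10,
) -> list[str]:
    """Launch one task per URL, holding back launches while memory is high."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(url: str) -> str:
        async with semaphore:  # cap the number of fetches in flight
            return await fetch(url)

    tasks = []
    for url in urls:
        # Wait here, without launching anything new, until memory drops
        while memory_percent() >= threshold:
            await asyncio.sleep(poll_interval)
        tasks.append(asyncio.create_task(run_one(url)))

    # Results come back in the same order as the input URLs
    return await asyncio.gather(*tasks)


# Stand-ins for a real memory probe and a real fetcher
_readings = iter([90.0, 90.0, 50.0])


def fake_memory() -> float:
    return next(_readings, 50.0)


async def fake_fetch(url: str) -> str:
    await asyncio.sleep(0)
    return url.upper()


results = asyncio.run(dispatch_with_memory_gate(["a", "b", "c"], fake_fetch, fake_memory))
print(results)
```

Tasks already running are left alone; only new launches wait, so memory can drain as in-flight work finishes.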

### Output formats

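- `result.markdown`: clean markdown (in 0.5 a string-like object that also exposes `raw_markdown`)
- `result.markdown.markdown_with_citations`: links converted to numbered citations (replaces the deprecated `result.markdown_v2`)
- `result.markdown.fit_markdown`: markdown after a content filter has pruned low-value blocks (populated only when a filter is configured)
- `result.media`: extracted images and videos
- `result.links`: links classified as internal or external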
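As a rough illustration of how these fields might be consumed, the sketch below summarises a result object by duck typing. It assumes `result.links` is a dict keyed by `"internal"` and `"external"` and `result.media` is a dict keyed by `"images"`. The `summarize_result` helper and the stand-in object are hypothetical, so check the shapes against your installed version.

```python
from types import SimpleNamespace


def summarize_result(result) -> dict:
    """Return a few counts from a crawl result (hypothetical helper)."""
    links = result.links or {}
    media = result.media or {}
    return {
        # str() also works if markdown is a string-like object
        "markdown_chars": len(str(result.markdown)),
        "internal_links": len(links.get("internal", [])),
        "external_links": len(links.get("external", [])),
        "images": len(media.get("images", [])),
    }


# Stand-in for the value returned by crawler.arun(...), for illustration only
fake_result = SimpleNamespace(
    markdown="# Title\n\nBody",
    links={"internal": [{"href": "/a"}], "external": []},
    media={"images": []},
)
print(summarize_result(fake_result))
```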

---

### FAQ

**Q: Is Crawl4AI free?**
A: Yes. Crawl4AI is open source under Apache-2.0. Playwright, which it uses for JS rendering, is also free and is installed by `crawl4ai-setup`.

**Q: How does this differ from Firecrawl?**
A: Firecrawl is primarily offered as a hosted API billed per scrape. Crawl4AI is a Python library you self-host. Both produce clean markdown; the main difference is the deployment model. Crawl4AI also exposes more knobs for adaptive crawling and dispatcher control.

**Q: Does it handle JavaScript-rendered pages?**
A: Yes. Crawl4AI uses Playwright under the hood to execute JavaScript. On the run config (`CrawlerRunConfig`), set `js_code="..."` to run custom JavaScript, `wait_for="css:<selector>"` to wait for a specific element, or `screenshot=True` to capture a screenshot.
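A minimal sketch of how those three options might be combined, assuming they are passed through `CrawlerRunConfig` and handed to `arun` via `config=`. The scroll script and the `#main-content` selector are placeholders, and `crawl_rendered` is a hypothetical helper name.

```python
import asyncio

# Illustrative values only; adjust the script and selector to the target page
RUN_OPTIONS = {
    "js_code": "window.scrollTo(0, document.body.scrollHeight);",
    "wait_for": "css:#main-content",  # placeholder selector
    "screenshot": True,
}


async def crawl_rendered(url: str):
    """Crawl one URL with JS execution, an element wait, and a screenshot."""
    # Imported lazily so RUN_OPTIONS can be inspected without crawl4ai installed
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

    config = CrawlerRunConfig(**RUN_OPTIONS)
    async with AsyncWebCrawler() as crawler:
        return await crawler.arun(url=url, config=config)


# Example call (requires crawl4ai and a completed crawl4ai-setup):
# result = asyncio.run(crawl_rendered("https://news.ycombinator.com"))
```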

---

## Source & Thanks

> Built by [unclecode](https://github.com/unclecode). Licensed under Apache-2.0.
>
> [unclecode/crawl4ai](https://github.com/unclecode/crawl4ai) — ⭐ 30,000+


---
Source: https://tokrepo.com/en/workflows/crawl4ai-0-5-async-llm-friendly-web-crawler
Author: Crawl4AI