What is Crawl4AI 0.5 — Async LLM-Friendly Web Crawler?

Crawl4AI 0.5 is an async Python crawler built for RAG pipelines. It outputs clean markdown, so no HTML cleanup is needed, and it supports adaptive crawling, JavaScript rendering, and the AsyncWebCrawler API. The project has 30K stars.

Is Crawl4AI 0.5 — Async LLM-Friendly Web Crawler free to use?

Yes. Crawl4AI 0.5 — Async LLM-Friendly Web Crawler is freely available on TokRepo. Check the Source & Thanks section on the asset page for the specific open-source license.

How do I install Crawl4AI 0.5 — Async LLM-Friendly Web Crawler?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command.

Crawl4AI 0.5 — Async LLM-Friendly Web Crawler

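Introduction

Crawl4AI is an LLM-first asynchronous web crawler: give it a URL and it returns clean markdown that can be fed straight into RAG. Version 0.5 adds adaptive crawling (it knows when to stop), session-based crawling for SPAs, and a memory-adaptive dispatcher (thousands of URLs without exhausting memory). It suits RAG pipelines, knowledge-base ingestion, and agents that need fresh web content. It requires Python 3.10+ and Playwright. Setup takes about 2 minutes (pip install crawl4ai && crawl4ai-setup).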

Hello world

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://news.ycombinator.com")
        print(result.markdown)  # clean markdown, no HTML

asyncio.run(main())

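Adaptive crawling

Version 0.5 adds an adaptive strategy: the crawler decides for itself when it has gathered enough and stops, instead of crawling a fixed N pages.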

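import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig

config = AdaptiveConfig(
    confidence_threshold=0.85,  # stop once confidence reaches 85%
    max_pages=50,               # hard cap on pages crawled
)

async def main():
    async with AsyncWebCrawler() as crawler:
        # AdaptiveCrawler wraps an existing AsyncWebCrawler instance
        adaptive = AdaptiveCrawler(crawler, config)
        result = await adaptive.digest(
            start_url="https://docs.python.org",
            query="How does the asyncio event loop dispatch coroutines?",
        )
        # result.pages contains only the relevant subset

asyncio.run(main())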
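
The stopping rule described above can be illustrated without the library. The sketch below is a toy model and not Crawl4AI's actual scoring: it assumes confidence is simply the fraction of query terms seen so far, and the names adaptive_crawl, coverage_confidence, and the in-memory pages dict are hypothetical stand-ins for real fetching.

```python
from collections import deque


def coverage_confidence(query_terms, seen_text):
    """Fraction of query terms that appear in the text gathered so far."""
    if not query_terms:
        return 1.0
    words = set(seen_text.lower().split())
    return sum(term in words for term in query_terms) / len(query_terms)


def adaptive_crawl(pages, start, query, confidence_threshold=0.85, max_pages=50):
    """Breadth-first walk over `pages` that stops once confidence is high enough.

    `pages` maps a URL to (text, [linked URLs]) and stands in for real fetching.
    """
    query_terms = set(query.lower().split())
    queue, visited, gathered = deque([start]), [], ""
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited or url not in pages:
            continue
        text, links = pages[url]
        visited.append(url)
        gathered += " " + text
        if coverage_confidence(query_terms, gathered) >= confidence_threshold:
            break  # enough information: stop instead of exhausting max_pages
        queue.extend(links)
    return visited, coverage_confidence(query_terms, gathered)


# Hypothetical three-page site used only for illustration.
pages = {
    "/": ("welcome to the docs", ["/loop", "/misc"]),
    "/loop": ("the event loop schedules coroutines", ["/misc"]),
    "/misc": ("unrelated release notes", []),
}
visited, confidence = adaptive_crawl(pages, "/", "event loop coroutines")
```

Here the walk stops after the second page because every query term has been seen, so "/misc" is never fetched even though max_pages would allow it.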

Memory-adaptive dispatcher (thousands of URLs)

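import asyncio
from crawl4ai import AsyncWebCrawler, MemoryAdaptiveDispatcher, CrawlerMonitor

async def crawl_all(urls):
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,  # pause new launches at 70% memory use
        monitor=CrawlerMonitor(),
    )
    async with AsyncWebCrawler() as crawler:
        return await crawler.arun_many(
            urls=urls,  # list of 5000+ URLs
            dispatcher=dispatcher,
        )

urls = ["https://example.com"]  # placeholder; replace with your own URL list
results = asyncio.run(crawl_all(urls))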

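When memory usage reaches 70%, the dispatcher pauses new launches and resumes once memory has been freed, so long crawls do not run out of memory (OOM).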
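
The pause-and-resume behaviour can be pictured with plain asyncio. This is a minimal sketch under stated assumptions, not the dispatcher's real implementation: memory_gated_run and the injected read_memory_percent callable are hypothetical, and the memory readings are simulated so the example runs anywhere.

```python
import asyncio


async def memory_gated_run(urls, fetch, read_memory_percent,
                           threshold=70.0, poll_interval=0.01):
    """Start one task per URL, but hold back new launches while memory is high."""
    tasks = []
    for url in urls:
        # Wait here until the reading drops back under the threshold.
        while read_memory_percent() >= threshold:
            await asyncio.sleep(poll_interval)
        tasks.append(asyncio.create_task(fetch(url)))
    return await asyncio.gather(*tasks)


async def demo():
    readings = iter([50.0, 80.0, 80.0, 60.0, 40.0])  # simulated memory usage
    waits = 0

    def read_memory_percent():
        nonlocal waits
        value = next(readings, 40.0)
        if value >= 70.0:
            waits += 1  # count how often a launch was held back
        return value

    async def fetch(url):
        await asyncio.sleep(0)  # stand-in for a real page fetch
        return f"fetched {url}"

    results = await memory_gated_run(["a", "b", "c"], fetch, read_memory_percent)
    return results, waits


results, waits = asyncio.run(demo())
```

Tasks that are already running are left alone; only new launches are delayed, which is the property that keeps a long crawl from growing memory without bound.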

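Output formats

result.markdown: clean markdown
result.markdown_v2: markdown with citations preserved
result.fit_markdown: trimmed to fit the LLM context window
result.media: extracted images and videos
result.links: links classified as internal or external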
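
The internal/external split of links can be pictured as a hostname comparison. The following is a minimal standard-library sketch; classify_links is a hypothetical helper, and Crawl4AI's own rules (for example, how it treats subdomains) may differ.

```python
from urllib.parse import urljoin, urlparse


def classify_links(base_url, hrefs):
    """Split hrefs into internal and external links relative to base_url."""
    base_host = urlparse(base_url).netloc.lower()
    links = {"internal": [], "external": []}
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolve relative paths
        host = urlparse(absolute).netloc.lower()
        kind = "internal" if host == base_host else "external"
        links[kind].append(absolute)
    return links


links = classify_links(
    "https://example.com/docs/",
    ["intro.html", "/about", "https://other.org/page"],
)
```

Relative hrefs are resolved against the page URL first, so "intro.html" and "/about" both count as internal.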

FAQ

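Q: Is Crawl4AI free? A: Yes. It is open source under Apache-2.0. The library itself is free, and so is Playwright, which it uses for JS rendering; crawl4ai-setup installs it automatically.

Q: How does it differ from Firecrawl? A: Firecrawl is a hosted SaaS API billed per request. Crawl4AI is a self-hosted Python library. The output is the same (clean markdown); the deployment model differs. Crawl4AI also exposes more control knobs for adaptive crawling and dispatching.

Q: Can it handle JS-rendered pages? A: Yes. Crawl4AI runs JavaScript through Playwright under the hood. Set js_code="..." to run custom JS, wait_for="selector" to wait for an element to appear, and screenshot=True to capture a screenshot.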
