# Crawl4AI 0.5: Async LLM-Friendly Web Crawler

> Crawl4AI 0.5 is the async Python crawler for RAG. Outputs clean markdown, no HTML cleanup. Adaptive crawling, JS rendering, AsyncWebCrawler API. 30K stars.

## Quick Use

1. `pip install crawl4ai`
2. Run setup once: `crawl4ai-setup` (installs Playwright browsers)
3. Use the AsyncWebCrawler snippet below in your Python script

---

## Intro

Crawl4AI is the LLM-first async web crawler: input a URL, output clean markdown ready to drop into RAG. Version 0.5 adds adaptive crawling (knows when to stop), session-based crawling for SPAs, and a Memory-Adaptive Dispatcher that scales to thousands of URLs without exhausting RAM.

Best for: RAG pipelines, knowledge-base ingestion, agents that need fresh web content.

Works with: Python 3.10+, Playwright.

Setup time: 2 minutes (`pip install crawl4ai && crawl4ai-setup`).

---

### Hello world

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://news.ycombinator.com")
        print(result.markdown)  # clean markdown, no HTML

asyncio.run(main())
```

### Adaptive crawling

The 0.5 release added adaptive strategies: the crawler decides when it has "enough" to answer the user's intent and stops, instead of always crawling N pages.
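
The stopping rule described above can be pictured with a small, library-free sketch. This is not Crawl4AI's implementation: `coverage`, `adaptive_crawl`, and the term-overlap score are hypothetical stand-ins, assuming confidence is simply the fraction of query terms seen so far.

```python
def coverage(query_terms, seen_text):
    """Hypothetical confidence score: fraction of query terms found so far."""
    if not query_terms:
        return 1.0
    words = set(seen_text.lower().split())
    hits = sum(1 for term in query_terms if term in words)
    return hits / len(query_terms)


def adaptive_crawl(pages, query, confidence_threshold=0.85, max_pages=50):
    """Read pages in order; stop once confidence or the page budget is reached."""
    query_terms = set(query.lower().split())
    collected = []
    confidence = 0.0
    for text in pages:
        if len(collected) >= max_pages or confidence >= confidence_threshold:
            break
        collected.append(text)
        confidence = coverage(query_terms, " ".join(collected))
    return collected, confidence


# Toy corpus standing in for fetched pages (no network access involved).
pages = [
    "the event loop runs tasks",
    "coroutines are scheduled by the loop",
    "dispatch happens through callbacks",
    "unrelated page about packaging",
]
collected, confidence = adaptive_crawl(
    pages, "event loop dispatch coroutines", confidence_threshold=0.75
)
print(len(collected), confidence)
```

With a threshold of 0.75, the loop stops after two of the four toy pages, because three of the four query terms have been seen by then. The library's own relevance scoring is likely more involved; its API call is shown next.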
```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig

config = AdaptiveConfig(
    confidence_threshold=0.85,  # stop when 85% confident
    max_pages=50,
)

async def main():
    # AdaptiveCrawler wraps an existing AsyncWebCrawler instance
    # (per the Crawl4AI adaptive-crawling docs; check your installed version)
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler, config)
        await adaptive.digest(
            start_url="https://docs.python.org",
            query="How does the asyncio event loop dispatch coroutines?",
        )
        # keep only the most relevant pages from what was crawled
        relevant = adaptive.get_relevant_content(top_k=5)

asyncio.run(main())
```

### Memory-Adaptive Dispatcher (1000s of URLs)

```python
import asyncio
from crawl4ai import AsyncWebCrawler, MemoryAdaptiveDispatcher, CrawlerMonitor

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,  # pause new launches above 70% RAM usage
    monitor=CrawlerMonitor(),
)

async def main(urls):
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,  # list of 5000+ URLs
            dispatcher=dispatcher,
        )
    return results

urls = ["https://example.com"]  # placeholder; substitute your own list
asyncio.run(main(urls))
```

When RAM usage reaches 70%, the dispatcher pauses new launches until memory frees up, which helps avoid OOM crashes on long crawls.

### Output formats

- `result.markdown`: clean markdown
- `result.markdown.markdown_with_citations`: markdown with links converted to citations (older releases exposed this as `result.markdown_v2`, which is deprecated)
- `result.markdown.fit_markdown`: markdown with low-relevance content filtered out, populated when a content filter is configured
- `result.media`: images and videos extracted
- `result.links`: internal/external links classified

---

### FAQ

**Q: Is Crawl4AI free?**
A: Yes. It is open source under Apache-2.0. The library itself is free; Playwright (used for JS rendering) is also free and installs via `crawl4ai-setup`.

**Q: How does this differ from Firecrawl?**
A: Firecrawl is a hosted SaaS API (billed per scrape). Crawl4AI is a Python library you self-host. Same output (clean markdown), different deployment model. Crawl4AI also has more knobs for adaptive crawling and dispatcher control.

**Q: Does it handle JavaScript-rendered pages?**
A: Yes. Crawl4AI uses Playwright under the hood for JS execution. In `CrawlerRunConfig`, set `js_code="..."` to run custom JavaScript, `wait_for="selector"` to wait for specific elements, or `screenshot=True` for visual capture.

---

## Source & Thanks

> Built by [unclecode](https://github.com/unclecode). Licensed under Apache-2.0.
>
> [unclecode/crawl4ai](https://github.com/unclecode/crawl4ai) (⭐ 30,000+)

---

Source: https://tokrepo.com/en/workflows/crawl4ai-0-5-async-llm-friendly-web-crawler
Author: Crawl4AI