Cette page est affichée en anglais. Une traduction française est en cours.
ScriptsMay 7, 2026·4 min de lecture

Crawl4AI 0.5 — Async LLM-Friendly Web Crawler

Crawl4AI 0.5 is the async Python crawler for RAG. Outputs clean markdown, no HTML cleanup. Adaptive crawling, JS rendering, AsyncWebCrawler API. 30K stars.

Crawl4AI
Crawl4AI · Community
Prêt pour agents

Installation agent prête

Cet actif peut être installé après choix du runtime, vérification du plan et exécution de la commande adaptée.

Native · 98/100Policy : autoriser
Surface agent
Tout agent MCP/CLI
Type
Skill
Installation
Single
Confiance
Confiance : Community
Point d'entrée
Asset
Commande d'installation directe
npx -y tokrepo@latest install 0793dfd1-25c8-4a72-b272-01604d25ceb3 --target codex

À exécuter après confirmation du plan en dry-run.

Introduction

Crawl4AI is the LLM-first async web crawler — input a URL, output clean markdown ready to drop into RAG. Version 0.5 adds adaptive crawling (knows when to stop), session-based crawling for SPAs, and Memory-Adaptive Dispatcher to scale to thousands of URLs without exhausting RAM. Best for: RAG pipelines, knowledge-base ingestion, agents that need fresh web content. Works with: Python 3.10+, Playwright. Setup time: 2 minutes (pip install crawl4ai && crawl4ai-setup).


Hello world

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://news.ycombinator.com")
        print(result.markdown)  # clean markdown, no HTML

asyncio.run(main())

Adaptive crawling

The 0.5 release added adaptive strategies — the crawler decides when it has "enough" to answer the user's intent and stops, instead of always crawling N pages.

from crawl4ai import AdaptiveCrawler, AdaptiveConfig

config = AdaptiveConfig(
    confidence_threshold=0.85,  # stop when 85% confident
    max_pages=50,
)

async with AdaptiveCrawler(config=config) as crawler:
    result = await crawler.digest(
        start_url="https://docs.python.org",
        query="How does the asyncio event loop dispatch coroutines?",
    )
    # result.pages contains only the relevant subset

Memory-Adaptive Dispatcher (1000s of URLs)

from crawl4ai import AsyncWebCrawler, MemoryAdaptiveDispatcher, CrawlerMonitor

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,
    monitor=CrawlerMonitor(),
)

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(
        urls=urls,  # list of 5000+ URLs
        dispatcher=dispatcher,
    )

When RAM hits 70%, the dispatcher pauses new launches until memory frees up. No OOM crashes on long crawls.

Output formats

  • result.markdown — clean markdown
  • result.markdown_v2 — with citations preserved
  • result.fit_markdown — content trimmed to LLM context window
  • result.media — images and videos extracted
  • result.links — internal/external links classified

FAQ

Q: Is Crawl4AI free? A: Yes — Apache-2.0 open-source. The library itself is free; Playwright (used for JS rendering) is also free and installs via crawl4ai-setup.

Q: How does this differ from Firecrawl? A: Firecrawl is a hosted SaaS API ($/scrape). Crawl4AI is a Python library you self-host. Same output (clean markdown), different deployment model. Crawl4AI also has more knobs for adaptive crawling and dispatcher control.

Q: Does it handle JavaScript-rendered pages? A: Yes. Crawl4AI uses Playwright under the hood for JS execution. Set js_code="..." to run custom JavaScript, wait_for="selector" to wait for specific elements, or screenshot=True for visual capture.


Quick Use

  1. pip install crawl4ai
  2. Run setup once: crawl4ai-setup (installs Playwright browsers)
  3. Use the AsyncWebCrawler snippet below in your Python script

Intro

Crawl4AI is the LLM-first async web crawler — input a URL, output clean markdown ready to drop into RAG. Version 0.5 adds adaptive crawling (knows when to stop), session-based crawling for SPAs, and Memory-Adaptive Dispatcher to scale to thousands of URLs without exhausting RAM. Best for: RAG pipelines, knowledge-base ingestion, agents that need fresh web content. Works with: Python 3.10+, Playwright. Setup time: 2 minutes (pip install crawl4ai && crawl4ai-setup).


Hello world

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://news.ycombinator.com")
        print(result.markdown)  # clean markdown, no HTML

asyncio.run(main())

Adaptive crawling

The 0.5 release added adaptive strategies — the crawler decides when it has "enough" to answer the user's intent and stops, instead of always crawling N pages.

from crawl4ai import AdaptiveCrawler, AdaptiveConfig

config = AdaptiveConfig(
    confidence_threshold=0.85,  # stop when 85% confident
    max_pages=50,
)

async with AdaptiveCrawler(config=config) as crawler:
    result = await crawler.digest(
        start_url="https://docs.python.org",
        query="How does the asyncio event loop dispatch coroutines?",
    )
    # result.pages contains only the relevant subset

Memory-Adaptive Dispatcher (1000s of URLs)

from crawl4ai import AsyncWebCrawler, MemoryAdaptiveDispatcher, CrawlerMonitor

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,
    monitor=CrawlerMonitor(),
)

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(
        urls=urls,  # list of 5000+ URLs
        dispatcher=dispatcher,
    )

When RAM hits 70%, the dispatcher pauses new launches until memory frees up. No OOM crashes on long crawls.

Output formats

  • result.markdown — clean markdown
  • result.markdown_v2 — with citations preserved
  • result.fit_markdown — content trimmed to LLM context window
  • result.media — images and videos extracted
  • result.links — internal/external links classified

FAQ

Q: Is Crawl4AI free? A: Yes — Apache-2.0 open-source. The library itself is free; Playwright (used for JS rendering) is also free and installs via crawl4ai-setup.

Q: How does this differ from Firecrawl? A: Firecrawl is a hosted SaaS API ($/scrape). Crawl4AI is a Python library you self-host. Same output (clean markdown), different deployment model. Crawl4AI also has more knobs for adaptive crawling and dispatcher control.

Q: Does it handle JavaScript-rendered pages? A: Yes. Crawl4AI uses Playwright under the hood for JS execution. Set js_code="..." to run custom JavaScript, wait_for="selector" to wait for specific elements, or screenshot=True for visual capture.


Source & Thanks

Built by unclecode. Licensed under Apache-2.0.

unclecode/crawl4ai — ⭐ 30,000+

🙏

Source et remerciements

Built by unclecode. Licensed under Apache-2.0.

unclecode/crawl4ai — ⭐ 30,000+

Fil de discussion

Connectez-vous pour rejoindre la discussion.
Aucun commentaire pour l'instant. Soyez le premier à partager votre avis.

Actifs similaires