Scripts · May 7, 2026 · 4 min read

Crawl4AI 0.5 — Async LLM-Friendly Web Crawler

Crawl4AI 0.5 is an async Python crawler for RAG. It outputs clean markdown, so no manual HTML cleanup is needed, and offers adaptive crawling, JS rendering, and the AsyncWebCrawler API. 30K+ GitHub stars.

Agent-ready

Agent-ready installation

This asset can be installed after choosing the runtime, reviewing the plan, and running the corresponding command.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Community
Entry: Asset
Direct install command:
npx -y tokrepo@latest install 0793dfd1-25c8-4a72-b272-01604d25ceb3 --target codex

Run this after confirming the plan with a dry run.

Introduction

Crawl4AI is an LLM-first async web crawler: give it a URL and it returns clean markdown ready to feed into a RAG pipeline. Version 0.5 adds adaptive crawling (the crawler knows when to stop), session-based crawling for SPAs, and a Memory-Adaptive Dispatcher that scales to thousands of URLs without exhausting RAM.

  • Best for: RAG pipelines, knowledge-base ingestion, agents that need fresh web content
  • Works with: Python 3.10+, Playwright
  • Setup time: about 2 minutes (pip install crawl4ai && crawl4ai-setup)


Hello world

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://news.ycombinator.com")
        print(result.markdown)  # clean markdown, no HTML

asyncio.run(main())
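
Before markdown like this goes into a RAG index it is usually split into chunks. The following is a minimal sketch of one way to do that, using only the standard library. The function name split_markdown_by_heading and the max_chars default are hypothetical choices for illustration and are not part of Crawl4AI.

```python
import re


def split_markdown_by_heading(markdown: str, max_chars: int = 1500) -> list[str]:
    """Split markdown into chunks that start at headings and stay under max_chars."""
    # Split in front of every ATX heading line ("# ", "## ", ...), keeping the heading
    # at the start of its own section.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Fallback: cut oversized sections into fixed-size pieces.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks


if __name__ == "__main__":
    sample = "# Title\nintro text\n## Details\nmore text"
    for chunk in split_markdown_by_heading(sample):
        print(repr(chunk))
```

In practice you would pass str(result.markdown) from the snippet above; splitting on headings keeps each chunk topically coherent, at the cost of uneven chunk sizes.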

Adaptive crawling

The 0.5 release added adaptive strategies: instead of always crawling a fixed N pages, the crawler estimates when it has gathered enough content to answer the user's query and stops there.

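import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig

config = AdaptiveConfig(
    confidence_threshold=0.85,  # stop when 85% confident
    max_pages=50,               # hard cap on pages crawled
)

async def main():
    # As documented upstream, AdaptiveCrawler wraps an existing
    # AsyncWebCrawler rather than acting as its own context manager.
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler, config)
        result = await adaptive.digest(
            start_url="https://docs.python.org",
            query="How does the asyncio event loop dispatch coroutines?",
        )
        # Keep only the most relevant subset of what was crawled
        relevant = adaptive.get_relevant_content(top_k=5)

asyncio.run(main())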
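
To make the stopping idea concrete, here is a toy sketch of a crawl loop that ends when either a confidence threshold or a page cap is reached. It is not Crawl4AI's actual scoring algorithm: coverage_confidence and crawl_until_confident are hypothetical names, and the "confidence" here is simply the fraction of query terms seen so far.

```python
from typing import Iterable


def coverage_confidence(query: str, pages: list[str]) -> float:
    """Toy confidence score: fraction of query terms found in the pages so far."""
    terms = {term.lower() for term in query.split()}
    if not terms:
        return 1.0
    seen = " ".join(pages).lower()
    return sum(1 for term in terms if term in seen) / len(terms)


def crawl_until_confident(
    query: str,
    page_stream: Iterable[str],
    confidence_threshold: float = 0.85,
    max_pages: int = 50,
) -> list[str]:
    """Collect pages until the toy confidence or the page cap is reached."""
    collected: list[str] = []
    for page in page_stream:
        if len(collected) >= max_pages:
            break  # hard cap, regardless of confidence
        collected.append(page)
        if coverage_confidence(query, collected) >= confidence_threshold:
            break  # enough of the query is covered; stop early
    return collected


if __name__ == "__main__":
    pages = ["asyncio basics", "the event model", "loop internals", "unrelated"]
    print(len(crawl_until_confident("asyncio event loop", pages)))
```

The two exit conditions mirror the confidence_threshold and max_pages settings above: one stops on sufficiency, the other bounds cost when sufficiency is never reached.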

Memory-Adaptive Dispatcher (1000s of URLs)

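import asyncio
from crawl4ai import AsyncWebCrawler, MemoryAdaptiveDispatcher, CrawlerMonitor

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,  # pause new launches above 70% RAM
    monitor=CrawlerMonitor(),       # live progress display
)

async def main(urls):
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,  # list of 5000+ URLs
            dispatcher=dispatcher,
        )
    return results

urls = ["https://example.com"]  # placeholder; replace with your own list
asyncio.run(main(urls))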

When system RAM usage reaches 70%, the dispatcher pauses new launches until memory frees up, which helps avoid out-of-memory crashes on long crawls.
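
The pause-on-threshold behavior can be illustrated with a small standard-library sketch. This is not the dispatcher's real implementation: run_with_memory_gate is a hypothetical helper, and the memory reading is injected as a plain callable so the example runs without extra dependencies.

```python
import asyncio
from typing import Any, Callable, Coroutine, Iterable


async def run_with_memory_gate(
    jobs: Iterable[Callable[[], Coroutine[Any, Any, Any]]],
    memory_percent: Callable[[], float],
    threshold: float = 70.0,
    poll_seconds: float = 0.01,
) -> list[Any]:
    """Launch jobs one by one, waiting while memory use is at or above threshold."""
    tasks: list[asyncio.Task] = []
    for job in jobs:
        # Only new launches are held back; tasks already running keep going.
        while memory_percent() >= threshold:
            await asyncio.sleep(poll_seconds)
        tasks.append(asyncio.create_task(job()))
    # Results come back in launch order.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    fake_readings = iter([90.0, 90.0, 50.0])

    def probe() -> float:
        # Simulated memory readings: two high values, then normal.
        return next(fake_readings, 50.0)

    async def fetch_a() -> str:
        return "a"

    async def fetch_b() -> str:
        return "b"

    print(asyncio.run(run_with_memory_gate([fetch_a, fetch_b], probe)))
```

In a real setup the probe could be something like lambda: psutil.virtual_memory().percent, assuming psutil is installed. Gating launches rather than cancelling running work means memory pressure slows the crawl down instead of losing progress.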

Output formats

  • result.markdown — clean markdown
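  • result.markdown_v2: markdown with citations preserved (deprecated in recent releases, where the same content is exposed as result.markdown.markdown_with_citations)
  • result.fit_markdown: markdown reduced to the most relevant content by a content filter, which helps it fit an LLM context window (in recent releases, result.markdown.fit_markdown)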
  • result.media — images and videos extracted
  • result.links — internal/external links classified
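
Downstream code often wants "the most compact markdown available". The sketch below shows one hedged way to choose between the fields listed above. It is duck-typed against those attribute names only, best_markdown is a hypothetical helper, and a SimpleNamespace stands in for a real crawl result.

```python
from types import SimpleNamespace


def best_markdown(result) -> str:
    """Return the filtered markdown when present and non-empty, else the plain markdown."""
    fit = getattr(result, "fit_markdown", None)
    if fit:
        return str(fit)
    # Fall back to the full markdown; treat a missing or None value as empty.
    return str(getattr(result, "markdown", "") or "")


if __name__ == "__main__":
    # Stand-in objects; a real crawl result may expose these fields differently.
    filtered = SimpleNamespace(markdown="# Full page", fit_markdown="# Core content")
    unfiltered = SimpleNamespace(markdown="# Full page", fit_markdown="")
    print(best_markdown(filtered))
    print(best_markdown(unfiltered))
```

Falling back to the full markdown keeps the pipeline working when no content filter was configured and the filtered field is empty.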

FAQ

Q: Is Crawl4AI free? A: Yes — Apache-2.0 open-source. The library itself is free; Playwright (used for JS rendering) is also free and installs via crawl4ai-setup.

Q: How does this differ from Firecrawl? A: Firecrawl is most commonly used as a hosted, pay-per-scrape API, while Crawl4AI is a Python library you run yourself. Both produce clean markdown; the main difference is the deployment model. Crawl4AI also exposes more controls for adaptive crawling and dispatching.

Q: Does it handle JavaScript-rendered pages? A: Yes. Crawl4AI uses Playwright under the hood for JS execution. In the crawl options, set js_code="..." to run custom JavaScript, wait_for="selector" to wait for a specific element, or screenshot=True to capture a screenshot of the page.


Quick Use

  1. pip install crawl4ai
  2. Run setup once: crawl4ai-setup (installs Playwright browsers)
  3. Use the AsyncWebCrawler snippet from the Hello world section in your Python script

Source & Thanks

Built by unclecode. Licensed under Apache-2.0.

unclecode/crawl4ai — ⭐ 30,000+
