Crawl4AI — LLM-Friendly Web Crawling
Open-source web crawler optimized for AI and LLM use cases. Extracts clean markdown, handles JavaScript-rendered pages, and supports structured data extraction.
Ready-to-run agent install
This asset can be installed after the agent chooses its runtime, checks the plan, and runs the matching command.
```bash
npx -y tokrepo@latest install cb19c9d4-6c2a-4443-80eb-043a440d79eb --target codex
```
Run after dry-run confirms the install plan.
What it is
Crawl4AI is a Python library for crawling web pages and extracting content in formats optimized for large language models. Unlike traditional web scrapers that return raw HTML or DOM trees, Crawl4AI converts pages into clean markdown with structural metadata. This output is ready to feed directly into LLM prompts, RAG pipelines, or knowledge base indexing without manual cleanup.
The library uses an async architecture built on Playwright for browser automation, handling JavaScript-rendered pages, SPAs, and dynamic content that simpler HTTP-based scrapers miss. It is designed for developers building AI applications that need web data as input: research agents, content analysis pipelines, competitive intelligence tools, and automated documentation generators.
How it saves time or tokens
Raw HTML is token-heavy. A typical web page contains navigation menus, footers, ads, tracking scripts, and other boilerplate that inflates token count without adding useful content. Crawl4AI strips all of this, extracting only the main content and converting it to compact markdown. This can reduce token usage by 60-80% compared to feeding raw HTML into an LLM.
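To see why boilerplate dominates the token count, the sketch below strips a small HTML sample with the standard library and compares rough token estimates. It is an illustration only, not how Crawl4AI extracts content: the `TextExtractor` class, the `SAMPLE_HTML` page, and the 4-characters-per-token heuristic are all assumptions made for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping script, style, nav, and footer elements."""

    SKIP = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def estimate_tokens(text, chars_per_token=4):
    """Very rough heuristic (assumed): about 4 characters per token."""
    return max(1, len(text) // chars_per_token)


def token_savings(html):
    """Return (raw_tokens, clean_tokens, fraction_saved) for one page."""
    parser = TextExtractor()
    parser.feed(html)
    clean = "\n".join(parser.parts)
    raw_tokens = estimate_tokens(html)
    clean_tokens = estimate_tokens(clean)
    return raw_tokens, clean_tokens, 1 - clean_tokens / raw_tokens


# Hypothetical page used only to exercise the functions above.
SAMPLE_HTML = """<html><head><style>body{margin:0}</style>
<script>track('pageview');</script></head><body>
<nav><a href="/">Home</a><a href="/docs">Docs</a></nav>
<main><h1>Classes</h1><p>Classes bundle data and behaviour.</p></main>
<footer>Copyright Example Corp</footer></body></html>"""
```

Calling `token_savings(SAMPLE_HTML)` returns the two estimates and the fraction saved. Real savings depend on the page and on the tokenizer your model uses, so measure on your own targets rather than relying on the heuristic.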
The library also handles common web scraping pain points automatically: JavaScript rendering, cookie consent banners, lazy-loaded content, and anti-bot detection. Without Crawl4AI, developers spend hours configuring headless browsers, writing extraction rules, and debugging edge cases.
How to use
- Install Crawl4AI:
```bash
pip install crawl4ai
```
- Run a basic crawl:
```python
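import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # The crawler is async, so it must run inside a coroutine.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url='https://example.com')
        print(result.markdown)

asyncio.run(main())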
```
- For structured data extraction, pass a schema to extract specific fields from the page.
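A schema-based extraction might look like the sketch below. It assumes a recent Crawl4AI release that exposes `CrawlerRunConfig` and `JsonCssExtractionStrategy` and returns the extracted rows as a JSON string in `result.extracted_content`; check these names against your installed version. The `PRODUCT_SCHEMA` selectors, `parse_extracted`, and `extract_products` are hypothetical names introduced for illustration.

```python
import json

# Hypothetical schema for a product listing page. The selectors are
# placeholders and must be adapted to the real page markup.
PRODUCT_SCHEMA = {
    "name": "products",
    "baseSelector": "div.product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
    ],
}


def parse_extracted(extracted_content):
    """Decode the JSON string assumed to hold the extracted rows into a list of dicts."""
    if not extracted_content:
        return []
    rows = json.loads(extracted_content)
    return rows if isinstance(rows, list) else [rows]


async def extract_products(url):
    # Imported lazily so the helpers above work without crawl4ai installed.
    # Assumption: these names match your installed crawl4ai version.
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(PRODUCT_SCHEMA)
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
    return parse_extracted(result.extracted_content)
```

To run it, call `asyncio.run(extract_products(...))` with a URL whose markup matches the schema. Keeping the schema as plain data makes it easy to version and test separately from the crawl itself.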
Example
Crawling a documentation page and using the output with an LLM:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url='https://docs.python.org/3/tutorial/classes.html',
            word_count_threshold=10,  # skip short blocks
            bypass_cache=True,
        )
        # result.markdown contains clean content
        print(f'Content length: {len(result.markdown)} chars')
        # result.links and result.media are dicts of lists, so count the entries
        print(f'Links found: {sum(len(v) for v in result.links.values())}')
        print(f'Images found: {len(result.media.get("images", []))}')
        # Feed to LLM
        prompt = f'Summarize this documentation:\n\n{result.markdown[:4000]}'
        # ... send to your LLM of choice

asyncio.run(main())
```
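The example above truncates the markdown at 4000 characters, which can cut a section in half. One alternative is to split at headings and pack sections into chunks that fit a size budget. The helper below is a minimal sketch; `chunk_markdown` and its `max_chars` parameter are hypothetical names, not part of Crawl4AI.

```python
def chunk_markdown(markdown, max_chars=4000):
    """Split markdown into chunks of at most max_chars, breaking at headings where possible."""
    sections, current = [], []
    for line in markdown.splitlines():
        # Start a new section at every line that looks like a heading.
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    chunks, buffer = [], ""
    for section in sections:
        candidate = f"{buffer}\n\n{section}" if buffer else section
        if len(candidate) <= max_chars:
            buffer = candidate
            continue
        if buffer:
            chunks.append(buffer)
        # A single oversized section is hard-split on character boundaries.
        while len(section) > max_chars:
            chunks.append(section[:max_chars])
            section = section[max_chars:]
        buffer = section
    if buffer:
        chunks.append(buffer)
    return chunks
```

Each chunk can then be summarized or embedded on its own. One known limitation of this sketch: a `#` comment at the start of a line inside a fenced code block is treated as a heading, so pages with many code samples may be split more often than expected.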
Output comparison for a typical documentation page:
| Format | Size | Useful Content Ratio |
|---|---|---|
| Raw HTML | ~120 KB | ~15% |
| Crawl4AI Markdown | ~8 KB | ~95% |
| Manual copy-paste | ~6 KB | ~90% |
Related on TokRepo
- AI Tools for Web Scraping — Compare Crawl4AI with other web scraping tools built for AI workflows.
- AI Tools for RAG — Explore tools for building retrieval-augmented generation pipelines that consume crawled data.
Common pitfalls
- Not awaiting the async crawler properly. Crawl4AI uses async/await throughout. If you try to use it synchronously without asyncio.run or an async context, you will get coroutine errors. Always use `async with AsyncWebCrawler()` inside an async function.
- Crawling too aggressively without rate limiting. Rapid concurrent requests to the same domain can trigger anti-bot protections or IP bans. Use the built-in delay options and respect robots.txt.
- Expecting perfect extraction on every page. Complex layouts with iframes, shadow DOM, or heavily obfuscated content may produce incomplete markdown. Test extraction on your target pages and add custom extraction rules for edge cases.
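The first two pitfalls can be handled together with a small wrapper that bounds concurrency and spaces out requests. This is a generic asyncio sketch, not Crawl4AI's built-in delay mechanism; `crawl_politely`, `max_concurrency`, and `delay_seconds` are hypothetical names, and `fetch` stands for any coroutine you supply.

```python
import asyncio


async def crawl_politely(urls, fetch, max_concurrency=2, delay_seconds=1.0):
    """Run fetch(url) for every URL with bounded concurrency and a pause after each request.

    `fetch` is any coroutine function that takes a URL, for example a thin
    wrapper around `crawler.arun`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url):
        async with semaphore:
            try:
                return url, await fetch(url)
            finally:
                # Hold the slot briefly so requests stay spaced out.
                await asyncio.sleep(delay_seconds)

    pairs = await asyncio.gather(*(worker(u) for u in urls))
    return dict(pairs)
```

Inside `async with AsyncWebCrawler() as crawler:` you could define `fetch` as a coroutine that awaits `crawler.arun(url=url)` and returns `result.markdown`, then await `crawl_politely(urls, fetch)`. An exception raised by `fetch` propagates out of `asyncio.gather`, so wrap `fetch` in your own error handling if one failing page should not abort the batch.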
Frequently Asked Questions
How does Crawl4AI handle JavaScript-rendered pages?
Crawl4AI uses Playwright under the hood to run a full browser engine. It loads the page, waits for JavaScript to execute and render content, then extracts the resulting DOM. This handles SPAs built with React, Vue, or Angular, as well as lazy-loaded content and dynamically injected elements that HTTP-only scrapers would miss.
Does Crawl4AI preserve tables, lists, and other page structure?
Yes. Crawl4AI preserves structural elements when converting to markdown. Tables become markdown tables, lists become markdown lists, and headings maintain their hierarchy. For custom structured extraction (pulling specific fields from product pages, for example), you can pass a JSON schema that defines the fields you want to extract.
Is Crawl4AI suitable for large-scale crawling?
Crawl4AI is designed primarily for focused crawling of specific pages or small sets of URLs. For large-scale crawling of thousands or millions of pages, you would need to add your own queue management, deduplication, and distributed processing. The library handles individual page extraction well but does not include a built-in crawl scheduler.
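The queue management and deduplication mentioned above can start as small as the sketch below: a FIFO frontier that normalizes URLs and ignores repeats. `normalize_url` and `Frontier` are hypothetical names for this example, and the normalization rules are deliberately minimal. A production crawler would typically add persistence and per-domain scheduling.

```python
from collections import deque
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize_url(url):
    """Canonicalize a URL so trivially different spellings compare equal."""
    url, _fragment = urldefrag(url)  # drop "#section" anchors
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class Frontier:
    """FIFO queue of URLs that ignores anything already seen."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def add(self, url):
        """Queue the URL and return True, or return False if it was seen before."""
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._queue.append(key)
        return True

    def pop(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)
```

A crawl loop would `pop()` a URL, fetch it, and `add()` every link found in the result. The seen-set is what keeps the loop from revisiting pages.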
How much smaller is the markdown output than raw HTML?
Markdown output is typically 85-95% smaller than raw HTML because it strips navigation, scripts, styles, ads, and boilerplate. The remaining content is the main article or documentation text in a format that LLMs parse efficiently. This reduction directly translates to lower token costs and more room for useful context in the prompt window.
Does Crawl4AI respect robots.txt?
Crawl4AI does not enforce robots.txt by default, but it provides configuration options for polite crawling. You can set delays between requests, limit concurrency, and add custom headers. It is the developer's responsibility to respect target websites' terms of service and crawling policies.
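If you want to filter URLs against robots.txt yourself before crawling, Python's standard `urllib.robotparser` is one option. The sketch below parses rules from text that has already been downloaded; `build_robot_rules`, `allowed_urls`, the `my-crawler` user agent, and `SAMPLE_ROBOTS` are hypothetical names and data for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed_urls(urls, rules, user_agent="my-crawler"):
    """Keep only the URLs that the parsed rules permit for user_agent."""
    return [u for u in urls if rules.can_fetch(user_agent, u)]


# Hypothetical robots.txt content used to exercise the helpers.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""
```

To read a live file instead, call `set_url()` with the site's robots.txt address followed by `read()` on a `RobotFileParser`. The parser's `crawl_delay()` method returns the site's requested delay, which can be used as the pause between requests.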
Source & Thanks
Created by unclecode. Licensed under Apache 2.0. unclecode/crawl4ai — 30K+ GitHub stars