Crawl4AI — LLM-Friendly Web Crawling
Open-source web crawler optimized for AI and LLM use cases. Extracts clean markdown, handles JavaScript-rendered pages, and supports structured data extraction.
What it is
Crawl4AI is a Python library for crawling web pages and extracting content in formats optimized for large language models. Unlike traditional web scrapers that return raw HTML or DOM trees, Crawl4AI converts pages into clean markdown with structural metadata. This output is ready to feed directly into LLM prompts, RAG pipelines, or knowledge base indexing without manual cleanup.
The library uses an async architecture built on Playwright for browser automation, handling JavaScript-rendered pages, SPAs, and dynamic content that simpler HTTP-based scrapers miss. It is designed for developers building AI applications that need web data as input: research agents, content analysis pipelines, competitive intelligence tools, and automated documentation generators.
How it saves time or tokens
Raw HTML is token-heavy. A typical web page contains navigation menus, footers, ads, tracking scripts, and other boilerplate that inflates token count without adding useful content. Crawl4AI strips all of this, extracting only the main content and converting it to compact markdown. This can reduce token usage by 60-80% compared to feeding raw HTML into an LLM.
The library also handles common web scraping pain points automatically: JavaScript rendering, cookie consent banners, lazy-loaded content, and anti-bot detection. Without Crawl4AI, developers spend hours configuring headless browsers, writing extraction rules, and debugging edge cases.
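To check the savings on your own pages, you can compare the raw HTML and the markdown for the same crawl. Below is a minimal sketch, assuming the crawl result exposes the raw page as result.html alongside result.markdown (true in the versions we have used) and using tiktoken, if installed, for a rough GPT-style token count.
```python
# Sketch: quantify the markdown-vs-HTML reduction for one page.
# Attribute names (result.html, result.markdown) may differ across versions.
import asyncio
from crawl4ai import AsyncWebCrawler

def count_tokens(text: str) -> int:
    try:
        import tiktoken  # optional; rough GPT-style token count
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    except ImportError:
        return len(text) // 4  # crude fallback: ~4 chars per token

async def compare(url: str) -> None:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        html_tokens = count_tokens(result.html or "")
        md_tokens = count_tokens(str(result.markdown or ""))
        saved = 1 - md_tokens / max(html_tokens, 1)
        print(f"raw HTML:  ~{html_tokens} tokens")
        print(f"markdown:  ~{md_tokens} tokens")
        print(f"reduction: ~{saved:.0%}")

asyncio.run(compare("https://example.com"))
```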
How to use
- Install Crawl4AI:
```bash
pip install crawl4ai
```
- Run a basic crawl:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url='https://example.com')
        print(result.markdown)

asyncio.run(main())
```
- For structured data extraction, pass a schema to extract specific fields from the page.
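A minimal sketch of the schema-based approach using the JSON-CSS extraction strategy; the selectors and field names below are placeholders for a hypothetical product listing, and the exact call signature may vary between Crawl4AI versions.
```python
# Sketch: schema-based structured extraction with JsonCssExtractionStrategy.
# Selectors are placeholders; check your installed version's API for details.
import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "products",
    "baseSelector": "div.product-card",  # one extracted row per matched element
    "fields": [
        {"name": "title", "selector": "h2.title", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",  # placeholder URL
            extraction_strategy=JsonCssExtractionStrategy(schema),
        )
        items = json.loads(result.extracted_content)  # JSON string of extracted rows
        print(items[:3])

asyncio.run(main())
```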
Example
Crawling a documentation page and using the output with an LLM:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url='https://docs.python.org/3/tutorial/classes.html',
            word_count_threshold=10,  # skip short blocks
            bypass_cache=True,
        )
        # result.markdown contains clean content;
        # links and media are grouped by type (internal/external, images/videos/audios)
        internal = result.links.get('internal', [])
        external = result.links.get('external', [])
        images = result.media.get('images', [])
        print(f'Content length: {len(result.markdown)} chars')
        print(f'Links found: {len(internal) + len(external)}')
        print(f'Images found: {len(images)}')
        # Feed to LLM
        prompt = f'Summarize this documentation:\n\n{result.markdown[:4000]}'
        # ... send to your LLM of choice

asyncio.run(main())
```
Output comparison for a typical documentation page:
| Format | Size | Useful Content Ratio |
|---|---|---|
| Raw HTML | ~120 KB | ~15% |
| Crawl4AI Markdown | ~8 KB | ~95% |
| Manual copy-paste | ~6 KB | ~90% |
Related on TokRepo
- AI Tools for Web Scraping — Compare Crawl4AI with other web scraping tools built for AI workflows.
- AI Tools for RAG — Explore tools for building retrieval-augmented generation pipelines that consume crawled data.
Common pitfalls
- Not awaiting the async crawler properly. Crawl4AI uses async/await throughout. If you try to use it synchronously without asyncio.run or an async context, you will get coroutine errors. Always use `async with AsyncWebCrawler()` inside an async function.
- Crawling too aggressively without rate limiting. Rapid concurrent requests to the same domain can trigger anti-bot protections or IP bans. Use the built-in delay options and respect robots.txt (a minimal sketch of a polite crawl loop follows this list).
- Expecting perfect extraction on every page. Complex layouts with iframes, shadow DOM, or heavily obfuscated content may produce incomplete markdown. Test extraction on your target pages and add custom extraction rules for edge cases.
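A minimal sketch of the polite-crawl pattern from the second pitfall, using plain asyncio primitives rather than any Crawl4AI-specific rate-limit option; the URL list, concurrency cap, and delay value are placeholders to tune for your target sites.
```python
# Sketch: polite concurrent crawling with a semaphore and a fixed delay.
# Not a Crawl4AI feature; just standard asyncio around crawler.arun.
import asyncio
from crawl4ai import AsyncWebCrawler

URLS = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
MAX_CONCURRENT = 2   # cap simultaneous page loads
DELAY_SECONDS = 1.5  # pause after each request to avoid hammering the server

async def crawl_all(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    async with AsyncWebCrawler() as crawler:
        async def fetch(url):
            async with sem:
                result = await crawler.arun(url=url)
                await asyncio.sleep(DELAY_SECONDS)
                return url, len(str(result.markdown or ""))
        return await asyncio.gather(*(fetch(u) for u in urls))

for url, size in asyncio.run(crawl_all(URLS)):
    print(f"{url}: {size} chars of markdown")
```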
Frequently Asked Questions
How does Crawl4AI handle JavaScript-rendered pages?
Crawl4AI uses Playwright under the hood to run a full browser engine. It loads the page, waits for JavaScript to execute and render content, then extracts the resulting DOM. This handles SPAs built with React, Vue, or Angular, as well as lazy-loaded content and dynamically injected elements that HTTP-only scrapers would miss.
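A minimal sketch of nudging a dynamic page before extraction; the js_code and wait_for parameters shown here follow recent Crawl4AI releases and may differ in older versions, and the URL and selector are placeholders.
```python
# Sketch: run JavaScript and wait for a selector before extracting markdown.
# Parameter names may vary by Crawl4AI version; selector/URL are placeholders.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/spa-page",
            js_code="window.scrollTo(0, document.body.scrollHeight);",  # trigger lazy loading
            wait_for="css:.article-body",  # block until this element is rendered
        )
        print(str(result.markdown)[:500])

asyncio.run(main())
```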
Can Crawl4AI extract structured data such as tables and lists?
Yes. Crawl4AI preserves structural elements when converting to markdown. Tables become markdown tables, lists become markdown lists, and headings maintain their hierarchy. For custom structured extraction (pulling specific fields from product pages, for example), you can pass a JSON schema that defines the fields you want to extract.
Is Crawl4AI suitable for large-scale crawling?
Crawl4AI is designed primarily for focused crawling of specific pages or small sets of URLs. For large-scale crawling of thousands or millions of pages, you would need to add your own queue management, deduplication, and distributed processing. The library handles individual page extraction well but does not include a built-in crawl scheduler.
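For crawls beyond a handful of pages, a thin loop around arun with a seen-set covers basic deduplication. Below is a minimal sketch, assuming result.links exposes internal link records as dicts with an href key (true in the versions we have used); retries, persistence, and distributed scheduling are intentionally left out.
```python
# Sketch: a small breadth-first crawl with deduplication on top of Crawl4AI.
# This is not a built-in scheduler; it only illustrates the bookkeeping needed.
import asyncio
from urllib.parse import urljoin, urlparse
from crawl4ai import AsyncWebCrawler

async def crawl_site(start_url: str, max_pages: int = 20):
    seen, queue, pages = {start_url}, [start_url], {}
    root = urlparse(start_url).netloc
    async with AsyncWebCrawler() as crawler:
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            result = await crawler.arun(url=url)
            pages[url] = str(result.markdown or "")
            # result.links groups records by 'internal'/'external' in the
            # versions we have used; adjust if your version differs.
            for link in result.links.get("internal", []):
                href = urljoin(url, link.get("href", ""))
                if urlparse(href).netloc == root and href not in seen:
                    seen.add(href)
                    queue.append(href)
    return pages

pages = asyncio.run(crawl_site("https://example.com"))
print(f"Crawled {len(pages)} pages")
```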
How much smaller is the markdown output than the raw HTML?
Markdown output is typically 85-95% smaller than raw HTML because it strips navigation, scripts, styles, ads, and boilerplate. The remaining content is the main article or documentation text in a format that LLMs parse efficiently. This reduction directly translates to lower token costs and more room for useful context in the prompt window.
Does Crawl4AI respect robots.txt?
Crawl4AI does not enforce robots.txt by default, but it provides configuration options for polite crawling. You can set delays between requests, limit concurrency, and add custom headers. It is the developer's responsibility to respect target websites' terms of service and crawling policies.
Citations (3)
- Crawl4AI GitHub Repository — Crawl4AI is an open-source async web crawler for LLM-ready output
- Crawl4AI Documentation — Crawl4AI uses Playwright for JavaScript rendering
- Playwright Documentation — Playwright browser automation library
Source & Thanks
Created by unclecode. Licensed under Apache 2.0. unclecode/crawl4ai — 30K+ GitHub stars