Scripts · Mar 29, 2026 · 1 min read

Crawl4AI — LLM-Friendly Web Crawling

Open-source web crawler optimized for AI and LLM use cases. Extracts clean markdown, handles JavaScript-rendered pages, and supports structured data extraction.

TL;DR
An open-source Python crawler that returns clean markdown output optimized for LLM ingestion and RAG pipelines.
§01

What it is

Crawl4AI is a Python library for crawling web pages and extracting content in formats optimized for large language models. Unlike traditional web scrapers that return raw HTML or DOM trees, Crawl4AI converts pages into clean markdown with structural metadata. This output is ready to feed directly into LLM prompts, RAG pipelines, or knowledge base indexing without manual cleanup.

The library uses an async architecture built on Playwright for browser automation, handling JavaScript-rendered pages, SPAs, and dynamic content that simpler HTTP-based scrapers miss. It is designed for developers building AI applications that need web data as input: research agents, content analysis pipelines, competitive intelligence tools, and automated documentation generators.

§02

How it saves time or tokens

Raw HTML is token-heavy. A typical web page contains navigation menus, footers, ads, tracking scripts, and other boilerplate that inflates token count without adding useful content. Crawl4AI strips all of this, extracting only the main content and converting it to compact markdown. This can reduce token usage by 60-80% compared to feeding raw HTML into an LLM.
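To see the scale of the savings, here is a rough back-of-envelope sketch (not Crawl4AI itself) that strips tags, scripts, and styles from an HTML snippet with the standard library and estimates tokens at roughly four characters each:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def estimate_tokens(text):
    # Crude heuristic: ~4 characters per token for English text.
    return len(text) // 4

html = (
    "<html><head><style>nav{color:red}</style>"
    "<script>track();</script></head>"
    "<body><nav>Home | About | Contact</nav>"
    "<main><h1>Classes</h1><p>Python classes combine data and behavior.</p></main>"
    "</body></html>"
)

extractor = TextExtractor()
extractor.feed(html)
text = ' '.join(extractor.parts)

print(estimate_tokens(html), estimate_tokens(text))
```

Even on this tiny page, the extracted text is a fraction of the raw markup; on real pages with full navigation, ads, and tracking scripts, the gap is far larger.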

The library also handles common web scraping pain points automatically: JavaScript rendering, cookie consent banners, lazy-loaded content, and anti-bot detection. Without Crawl4AI, developers spend hours configuring headless browsers, writing extraction rules, and debugging edge cases.

§03

How to use

  1. Install Crawl4AI:

```bash
pip install crawl4ai
```

  2. Run a basic crawl. The crawler is async, so the `async with` block must live inside a coroutine driven by `asyncio.run`:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url='https://example.com')
        print(result.markdown)

asyncio.run(main())
```

  3. For structured data extraction, pass a schema to extract specific fields from the page.
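The schema is a plain dictionary mapping CSS selectors to named output fields. A hypothetical shape is sketched below; the selectors and the commented `JsonCssExtractionStrategy` wiring are illustrative, so check your installed version's extraction-strategy docs for the exact contract:

```python
# Hypothetical schema for a product listing page; selectors are illustrative.
schema = {
    "name": "products",
    "baseSelector": "div.product-card",  # one match per extracted record
    "fields": [
        {"name": "title", "selector": "h2.title", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

# Sketch of how the schema would be wired into a crawl (not verbatim API):
# from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
# strategy = JsonCssExtractionStrategy(schema)
# result = await crawler.arun(url='https://example.com', extraction_strategy=strategy)
# parsed = result.extracted_content  # JSON of the extracted records

print(len(schema["fields"]))
```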
§04

Example

Crawling a documentation page and using the output with an LLM:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url='https://docs.python.org/3/tutorial/classes.html',
            word_count_threshold=10,   # skip short blocks
            bypass_cache=True,
        )

        # result.markdown contains clean content
        print(f'Content length: {len(result.markdown)} chars')
        print(f'Links found: {len(result.links)}')
        print(f'Images found: {len(result.media)}')

        # Feed to LLM
        prompt = f'Summarize this documentation:\n\n{result.markdown[:4000]}'
        # ... send to your LLM of choice

asyncio.run(main())
```

Output comparison for a typical documentation page:

| Format | Size | Useful Content Ratio |
| --- | --- | --- |
| Raw HTML | ~120 KB | ~15% |
| Crawl4AI Markdown | ~8 KB | ~95% |
| Manual copy-paste | ~6 KB | ~90% |
§05

Related on TokRepo

  • AI Tools for Web Scraping — Compare Crawl4AI with other web scraping tools built for AI workflows.
  • AI Tools for RAG — Explore tools for building retrieval-augmented generation pipelines that consume crawled data.
§06

Common pitfalls

  • Not awaiting the async crawler properly. Crawl4AI uses async/await throughout. If you try to use it synchronously without asyncio.run or an async context, you will get coroutine errors. Always use async with AsyncWebCrawler() inside an async function.
  • Crawling too aggressively without rate limiting. Rapid concurrent requests to the same domain can trigger anti-bot protections or IP bans. Use the built-in delay options and respect robots.txt.
  • Expecting perfect extraction on every page. Complex layouts with iframes, shadow DOM, or heavily obfuscated content may produce incomplete markdown. Test extraction on your target pages and add custom extraction rules for edge cases.
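The rate-limiting pitfall can be avoided with a generic throttling pattern, sketched here with a stubbed `fetch` function standing in for `crawler.arun` (Crawl4AI's own delay options vary by version): cap concurrency with a semaphore and pause before releasing each slot.

```python
import asyncio

async def fetch(url):
    # Stand-in for the real crawler call, e.g. await crawler.arun(url).
    await asyncio.sleep(0)
    return f"markdown for {url}"

async def polite_crawl(urls, max_concurrent=2, delay_s=1.0):
    """Crawl urls with at most max_concurrent in flight, pausing between requests."""
    sem = asyncio.Semaphore(max_concurrent)
    results = {}

    async def worker(url):
        async with sem:
            results[url] = await fetch(url)
            await asyncio.sleep(delay_s)  # hold the slot briefly to throttle

    await asyncio.gather(*(worker(u) for u in urls))
    return results

urls = [f"https://example.com/page/{i}" for i in range(5)]
results = asyncio.run(polite_crawl(urls, max_concurrent=2, delay_s=0.01))
print(len(results))
```

Tuning `max_concurrent` and `delay_s` per target domain keeps request rates well under typical anti-bot thresholds.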

Frequently Asked Questions

How does Crawl4AI handle JavaScript-rendered pages?

Crawl4AI uses Playwright under the hood to run a full browser engine. It loads the page, waits for JavaScript to execute and render content, then extracts the resulting DOM. This handles SPAs built with React, Vue, or Angular, as well as lazy-loaded content and dynamically injected elements that HTTP-only scrapers would miss.

Can I extract structured data like tables or lists?

Yes. Crawl4AI preserves structural elements when converting to markdown. Tables become markdown tables, lists become markdown lists, and headings maintain their hierarchy. For custom structured extraction (pulling specific fields from product pages, for example), you can pass a JSON schema that defines the fields you want to extract.

Is Crawl4AI suitable for large-scale crawling?

Crawl4AI is designed primarily for focused crawling of specific pages or small sets of URLs. For large-scale crawling of thousands or millions of pages, you would need to add your own queue management, deduplication, and distributed processing. The library handles individual page extraction well but does not include a built-in crawl scheduler.

How does the markdown output compare to raw HTML for LLM input?

Markdown output is typically 85-95% smaller than raw HTML because it strips navigation, scripts, styles, ads, and boilerplate. The remaining content is the main article or documentation text in a format that LLMs parse efficiently. This reduction directly translates to lower token costs and more room for useful context in the prompt window.

Does Crawl4AI respect robots.txt and rate limits?

Crawl4AI does not enforce robots.txt by default, but it provides configuration options for polite crawling. You can set delays between requests, limit concurrency, and add custom headers. It is the developer's responsibility to respect target websites' terms of service and crawling policies.


Source & Thanks

Created by unclecode. Licensed under Apache 2.0. unclecode/crawl4ai — 30K+ GitHub stars
