Core Features
LLM-Optimized Output
Crawl4AI outputs clean Markdown by default, so no HTML parsing is needed. Every crawl result includes result.markdown, ready to feed into any LLM context window.
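Before handing result.markdown to a model, you often need to trim it to a context budget. A minimal sketch using a rough 4-characters-per-token heuristic; the truncate_to_tokens helper and the ratio are illustrative assumptions, not part of Crawl4AI's API:

```python
# Rough heuristic: ~4 characters per token for English text.
# Illustrative helper only; not part of Crawl4AI.
CHARS_PER_TOKEN = 4

def truncate_to_tokens(markdown: str, max_tokens: int) -> str:
    """Trim markdown to roughly max_tokens, preferring a paragraph boundary."""
    budget = max_tokens * CHARS_PER_TOKEN
    if len(markdown) <= budget:
        return markdown
    cut = markdown.rfind("\n\n", 0, budget)  # cut at the last clean break
    return markdown[: cut if cut > 0 else budget]

# Stand-in for result.markdown: 200 paragraphs of filler text.
page = "\n\n".join(f"Paragraph {i}: " + "word " * 50 for i in range(200))
snippet = truncate_to_tokens(page, max_tokens=1000)
print(len(snippet) <= 1000 * CHARS_PER_TOKEN)  # True
```

For production use, a real tokenizer for your target model will give tighter budgets than a character heuristic.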
Structured Extraction
Extract specific data using CSS selectors, XPath, or LLM-based extraction strategies:
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

strategy = LLMExtractionStrategy(
    provider="openai/gpt-4",
    instruction="Extract all product names and prices"
)

url = "https://example.com/products"  # target page
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url=url, extraction_strategy=strategy)
```

Anti-Bot Bypass
Built-in stealth mode with browser fingerprint rotation, proxy support, and human-like behavior simulation. Handles Cloudflare, DataDome, and other protection systems.
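Fingerprint rotation at its simplest means presenting a different browser identity on each request. A toy illustration of the idea; this is not Crawl4AI's implementation, and the pool and helper below are invented for the sketch:

```python
import random

# Small pool of realistic user-agent strings (illustrative only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_5) AppleWebKit/605.1.15 Version/16.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def rotated_headers(rng: random.Random) -> dict:
    """Build per-request headers with a freshly picked user agent."""
    return {"User-Agent": rng.choice(USER_AGENTS)}

rng = random.Random(42)
print(rotated_headers(rng)["User-Agent"] in USER_AGENTS)  # True
```

Real stealth modes rotate far more than the user agent (canvas, WebGL, timezone, fonts), but the per-request rotation pattern is the same.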
Batch Crawling
Crawl hundreds of pages concurrently with rate limiting:
```python
urls = ["https://site.com/page1", "https://site.com/page2"]
results = await crawler.arun_many(urls, max_concurrent=10)
```

Key Stats
- 25,000+ GitHub stars
- 300+ contributors
- Supports 50+ website protection bypasses
- Output formats: Markdown, JSON, HTML, screenshots
- Python 3.8+ compatible
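The arun_many(urls, max_concurrent=10) call shown under Batch Crawling is an instance of bounded concurrency: at most N tasks in flight at once. A self-contained sketch of that pattern with asyncio, where fetch is a stand-in for a real crawl rather than Crawl4AI's internals:

```python
import asyncio

async def fetch(url: str) -> str:
    """Stand-in for a real crawl; pauses briefly and echoes the URL."""
    await asyncio.sleep(0.01)
    return f"crawled {url}"

async def crawl_many(urls: list[str], max_concurrent: int) -> list[str]:
    """Run fetch over all urls with at most max_concurrent in flight."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> str:
        async with sem:  # blocks while max_concurrent tasks are running
            return await fetch(url)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(u) for u in urls))

urls = [f"https://site.com/page{i}" for i in range(25)]
results = asyncio.run(crawl_many(urls, max_concurrent=10))
print(len(results))  # 25
```

A semaphore caps concurrency but does not smooth request timing; production rate limiting usually adds per-host delays on top of this.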
FAQ
Q: What is Crawl4AI? A: Crawl4AI is an open-source Python web crawler that extracts clean markdown from websites, purpose-built for feeding data into LLMs and AI applications.
Q: Is Crawl4AI free? A: Yes, fully open-source under Apache 2.0 license. No API keys or paid plans required.
Q: How does Crawl4AI compare to Scrapy? A: Crawl4AI focuses on AI/LLM use cases with built-in markdown extraction and JavaScript rendering. Scrapy is a general-purpose framework requiring more setup for AI pipelines.