Scripts · March 29, 2026 · 1 min read

Crawl4AI — LLM-Friendly Web Crawling

Open-source web crawler optimized for AI and LLM use cases. Extracts clean markdown, handles JavaScript-rendered pages, and supports structured data extraction.

TokRepo Picks · Community
Quick Start

Try it first, then decide whether to dig deeper

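pip install crawl4ai

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # The context manager starts the crawler and cleans it up on exit
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)  # Clean markdown content

asyncio.run(main())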

Introduction

Crawl4AI is purpose-built for feeding web content into LLMs. It crawls pages, renders JavaScript, and outputs clean markdown — perfect for RAG pipelines, research agents, and AI-powered content analysis.

Best for: RAG data ingestion, AI research agents, content extraction, web scraping for LLMs

Works with: any LLM pipeline, including LangChain, LlamaIndex, and custom agents
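For RAG data ingestion, the markdown string returned by a crawl is usually split into smaller pieces before embedding. The following is a minimal standard-library sketch of fixed-size chunking with overlap. It is not Crawl4AI's own chunker; the function name `chunk_markdown` and its parameters `max_chars` and `overlap` are hypothetical and chosen only for illustration.

```python
def chunk_markdown(text: str, max_chars: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlapping windows.

    Hypothetical helper for illustration; not part of the crawl4ai package.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")

    step = max_chars - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once the current window has reached the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "# Title\n\n" + "Some crawled paragraph text. " * 40
    for i, chunk in enumerate(chunk_markdown(sample, max_chars=200, overlap=20)):
        print(i, len(chunk))
```

In a pipeline, the input would be the `result.markdown` value from the Quick Start snippet. The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of some duplicated text.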


Key Features

  • Markdown output — Clean, LLM-ready text extraction
  • JavaScript rendering — Handles SPAs and dynamic content
  • Structured extraction — CSS selectors, schema-based extraction
  • Chunking strategies — Topic-based, fixed-size, or semantic chunking
  • Media extraction — Images, links, metadata
  • Rate limiting — Built-in politeness and throttling
  • Async — Fast concurrent crawling
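To illustrate how the async and rate-limiting features listed above fit together conceptually, here is a small sketch that caps concurrency with an `asyncio.Semaphore` and adds an optional pause after each request. It does not use Crawl4AI's built-in throttling; `crawl_politely`, `max_concurrency`, and `delay_seconds` are hypothetical names, and the fetch function is passed in as a parameter so the sketch runs without a browser or network access.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def crawl_politely(
    urls: list[str],
    fetch: Callable[[str], Awaitable[T]],
    max_concurrency: int = 3,
    delay_seconds: float = 0.0,
) -> list[T]:
    """Call fetch(url) for every URL, at most max_concurrency at a time.

    Hypothetical helper for illustration; not part of the crawl4ai package.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url: str) -> T:
        async with semaphore:
            result = await fetch(url)
            # Keep the slot occupied briefly so requests are spaced out.
            await asyncio.sleep(delay_seconds)
            return result

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(worker(url) for url in urls))


async def fake_fetch(url: str) -> str:
    """Stand-in for a real crawl call; returns a placeholder string."""
    await asyncio.sleep(0.01)
    return f"markdown for {url}"


if __name__ == "__main__":
    pages = asyncio.run(
        crawl_politely(["https://example.com/a", "https://example.com/b"], fake_fetch)
    )
    print(pages)
```

With a crawler opened as in the Quick Start snippet, the `fetch` argument could be something like `lambda u: crawler.arun(url=u)`. Whether a single crawler instance handles that level of concurrency well is an assumption to verify against the library's documentation.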

Source and Credits

Created by unclecode. Licensed under Apache 2.0. unclecode/crawl4ai — 30K+ GitHub stars
