[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"pack-detail-ai-web-scraping-en":3,"seo:pack:ai-web-scraping:en":63},{"code":4,"message":5,"data":6},200,"操作成功",{"pack":7},{"slug":8,"icon":9,"tone":10,"status":11,"status_label":12,"title":13,"description":14,"items":15,"install_cmd":62},"ai-web-scraping","🕷","#0369A1","stable","Stable","AI Web Scraping","Firecrawl, Crawlee, Crawl4AI, GPT Crawler, ScrapeGraphAI — scraping engines that output LLM-ready markdown, not raw HTML.",[16,28,38,46,54],{"id":17,"uuid":18,"slug":19,"title":20,"description":21,"author_name":22,"view_count":23,"vote_count":24,"lang_type":25,"type":26,"type_label":27},744,"6a62a986-9f1a-4a59-88c8-b99151986854","firecrawl-web-scraping-api-ai-applications-6a62a986","Firecrawl — Web Scraping API for AI Applications","Turn any website into clean markdown or structured data for LLMs. Firecrawl handles JavaScript rendering, anti-bot bypassing, sitemaps, and batch crawling via simple API.","Firecrawl",279,0,"en","skill","Skill",{"id":29,"uuid":30,"slug":31,"title":32,"description":33,"author_name":34,"view_count":35,"vote_count":24,"lang_type":25,"type":36,"type_label":37},412,"3e8c6e91-e10e-45ba-9206-d6e3a9958c6a","crawlee-production-web-scraping-node-js-3e8c6e91","Crawlee — Production Web Scraping for Node.js","Build reliable crawlers with automatic proxy rotation, request queuing, and browser automation. By Apify. 22K+ stars.","Apify",267,"script","Script",{"id":39,"uuid":40,"slug":41,"title":42,"description":43,"author_name":44,"view_count":45,"vote_count":24,"lang_type":25,"type":26,"type_label":27},172,"cb19c9d4-6c2a-4443-80eb-043a440d79eb","crawl4ai-llm-friendly-web-crawling-cb19c9d4","Crawl4AI — LLM-Friendly Web Crawling","Open-source web crawler optimized for AI and LLM use cases. 
Extracts clean markdown, handles JavaScript-rendered pages, and supports structured data extraction.","Crawl4AI",297,{"id":47,"uuid":48,"slug":49,"title":50,"description":51,"author_name":52,"view_count":53,"vote_count":24,"lang_type":25,"type":26,"type_label":27},238,"bbd3962b-db9b-4ce9-9efe-31f44d08fdff","gpt-crawler-build-custom-gpts-any-website-bbd3962b","GPT Crawler — Build Custom GPTs from Any Website","Crawl any website to generate knowledge files for custom GPTs and RAG. Output as JSON for OpenAI GPTs or any LLM knowledge base. Zero config. 22K+ stars.","AI Open Source",223,{"id":55,"uuid":56,"slug":57,"title":58,"description":59,"author_name":60,"view_count":61,"vote_count":24,"lang_type":25,"type":26,"type_label":27},243,"d34e3181-e3f5-4853-871e-83acafe0c60e","scrapegraphai-ai-powered-web-scraping-d34e3181","ScrapeGraphAI — AI-Powered Web Scraping","Python scraping library powered by LLMs. Describe what you want to extract in natural language, get structured data back. Handles dynamic pages. 23K+ stars.","Script Depot",351,"tokrepo install pack\u002Fai-web-scraping",{"pageType":64,"pageKey":8,"locale":25,"title":65,"metaDescription":66,"h1":13,"tldr":67,"bodyMarkdown":68,"faq":69,"schema":85,"internalLinks":94,"citations":106,"wordCount":119,"generatedAt":120},"pack","AI Web Scraping: 5 Engines That Output LLM-Ready Markdown","Firecrawl, Crawlee, Crawl4AI, GPT Crawler, ScrapeGraphAI — scrapers that emit clean markdown for RAG instead of raw HTML. Install with TokRepo.","Five open-source scraping engines that skip BeautifulSoup hell and emit LLM-ready markdown directly. 
Install the whole pack via TokRepo, or pick the engine that fits your stack.","## What's in this pack\n\n| # | Engine | Strength | Language |\n|---|---|---|---|\n| 1 | Firecrawl | hosted API + self-host, JS-render, sitemap crawl | TypeScript |\n| 2 | Crawlee | full crawler framework with proxy rotation | TypeScript \u002F Python |\n| 3 | Crawl4AI | RAG-optimized markdown, fastest async crawl | Python |\n| 4 | GPT Crawler | one-config-file knowledge-base crawl for chatbots | TypeScript |\n| 5 | ScrapeGraphAI | LLM-driven extraction via prompt + schema | Python |\n\nThese five tools converge on the same insight: feeding an LLM raw HTML is a token tax. By the time you've stripped nav bars, ads, scripts, and inline styles, you've burned thousands of tokens for nothing. AI-native scrapers do this conversion at the crawler edge so your retrieval layer sees clean markdown.\n\n## Why scraping looks different in 2026\n\nThree changes pushed the old scraping playbook into retirement.\n\nFirst, JavaScript rendering became table stakes. Single-page apps and edge-rendered sites now hide content behind hydration. The 2018 stack (`requests` + BeautifulSoup) returns shells. Modern engines wrap headless Chromium and wait for the right network-idle event before extracting.\n\nSecond, retrieval is the destination, not display. The output isn't going into a search index — it's going into a vector database for RAG. That changes the optimization target from \"render in a browser\" to \"fits in 8k tokens cleanly.\"\n\nThird, anti-bot escalated. Cloudflare, DataDome, and PerimeterX block naive scrapers within seconds. 
Firecrawl and Crawlee address this with proxy rotation, browser fingerprint randomization, and smart retry logic — features you'd otherwise duct-tape together over weeks.\n\n## Install in one command\n\n```bash\n# Install the whole pack\ntokrepo install pack\u002Fai-web-scraping\n\n# Or pick the engine that matches your stack\ntokrepo install firecrawl\ntokrepo install crawl4ai\ntokrepo install scrapegraphai\n```\n\nEach asset's TokRepo page bundles install commands, recommended config, and the most common output adapters (markdown, JSONL, vector-db direct insert).\n\n## Common pitfalls\n\n- **Robots.txt and rate limits**: respect them. Where an engine exposes a robots.txt option, keep it enabled (option names and defaults vary by engine, so check before your first crawl); ignoring robots.txt invites IP bans and legal trouble. Set polite crawl delays.\n- **JavaScript pages without JS render**: if Firecrawl\u002FCrawl4AI returns empty content, you're likely hitting a hydration site without rendering enabled, or extracting before hydration finished. Enable JS rendering or add a wait condition.\n- **Markdown drift**: different engines emit slightly different markdown flavors (tables, code blocks, footnotes). Normalize post-crawl if you mix engines for the same RAG corpus.\n- **PDF\u002FOffice files masquerading as web pages**: most web scrapers won't extract these well. Hand off to the Document AI Pipeline pack instead.\n- **Auth-walled content**: scraping behind login is fragile and often violates ToS. Use the official API where one exists.\n\n## When this pack alone isn't enough\n\nThis pack is the *extraction* layer. To complete a RAG pipeline you also need:\n\n- A vector database — see the **Vector DB Showdown** pack for Chroma, Weaviate, Qdrant, and friends.\n- A chunking + embedding step — usually done with LangChain or LlamaIndex glue.\n- An eval loop — see **LLM Eval & Guardrails** to score retrieval relevance.\n\nFor PDF and Office inputs, switch to the **Document AI Pipeline** pack. 
For interactive scraping (filling forms, clicking through wizards), the **Browser Automation** pack is the right tool — those sites need Playwright-style interaction, not crawling.\n\n## Picking the right engine\n\n- **Want a hosted API and don't mind paying for managed infra**: Firecrawl. Best dev-ex of the five, JS render and proxy rotation built in.\n- **Need to scrape millions of pages on owned hardware**: Crawlee. The most mature crawler framework, with queue persistence and resumable runs.\n- **Building a RAG ingest with Python**: Crawl4AI. Async-first design can hit 3-5x throughput vs synchronous crawlers on the same box.\n- **One-time knowledge-base export for a chatbot**: GPT Crawler. A single `config.ts` file points at a domain and out comes a JSON file ready to feed OpenAI's file uploader.\n- **Pages where the schema is irregular and you want extraction by intent**: ScrapeGraphAI. You hand it a Pydantic model and a prompt; it figures out the selectors per page.",[70,73,76,79,82],{"q":71,"a":72},"Are these tools free to use?","All five are open-source. Firecrawl offers a hosted SaaS tier with free quota, but you can self-host it for free. Crawlee, Crawl4AI, GPT Crawler, and ScrapeGraphAI can be run fully self-hosted under permissive licenses such as Apache-2.0 and MIT; check each repository for the exact terms. The hidden cost is proxy services if you're crawling sites with aggressive anti-bot — expect $50-200\u002Fmonth for residential proxies on real workloads.",{"q":74,"a":75},"Firecrawl vs Crawl4AI — which should I pick?","Firecrawl if you want a hosted endpoint and don't mind paying for managed infra; its API surface is simpler and the JS-render is rock solid. Crawl4AI if you're Python-native and want maximum throughput on self-host; its async architecture beats Firecrawl on raw speed but requires more ops glue. 
For a Cursor\u002FCodex CLI agent calling tools, both work — Firecrawl just has fewer setup steps.",{"q":77,"a":78},"Will this work with Cursor or Codex CLI as a tool?","Yes — most of these have MCP servers or HTTP APIs that any AI tool with tool-calling can invoke. Firecrawl ships an official MCP server. Crawl4AI exposes a Python function you can wrap. Drop the MCP config into Cursor's settings or your Codex CLI agent definition and the LLM can scrape on demand.",{"q":80,"a":81},"How is this different from the Browser Automation pack?","Scraping is data-extraction-first: you want LLM-ready markdown out of a page whose URL you can predict. Browser automation is interaction-first: you click, fill, navigate, screenshot. There's overlap (both use headless Chromium), but the API surface and the typical workflow differ. If you're building a RAG corpus, use this pack. If you're filling forms, use Browser Automation.",{"q":83,"a":84},"What's the operational gotcha?","Token blowup from over-eager crawls. A single sitemap with 10k pages at 5k tokens each is 50M tokens. Embedding that volume is cheap at typical per-million-token rates, but the bill climbs quickly once every page also passes through an LLM for extraction or summarization. Always set page and depth limits first (`max_pages` and `max_depth`, or your engine's equivalents), sample 50 pages, count tokens, project the bill, then unleash. 
Cheap to forget, expensive to fix.",{"@context":86,"@type":87,"name":13,"description":88,"numberOfItems":89,"publisher":90},"https:\u002F\u002Fschema.org","CollectionPage","Five scraping engines that output LLM-ready markdown — Firecrawl, Crawlee, Crawl4AI, GPT Crawler, ScrapeGraphAI.",5,{"@type":91,"name":92,"url":93},"Organization","TokRepo","https:\u002F\u002Ftokrepo.com",[95,99,103],{"url":96,"anchor":97,"reason":98},"\u002Fen\u002Fpacks\u002Fdocument-ai-pipeline","Document AI Pipeline","PDF\u002FOffice ingestion partner",{"url":100,"anchor":101,"reason":102},"\u002Fen\u002Fpacks\u002Fbrowser-automation","Browser Automation","interaction-first scraping alternative",{"url":104,"anchor":22,"reason":105},"\u002Fen\u002Ftools\u002Ffirecrawl","the most popular runner in this pack",[107,111,115],{"claim":108,"source_name":109,"source_url":110},"Firecrawl turns websites into LLM-ready markdown via a hosted or self-hosted API","mendableai\u002Ffirecrawl","https:\u002F\u002Fgithub.com\u002Fmendableai\u002Ffirecrawl",{"claim":112,"source_name":113,"source_url":114},"Crawlee is the open-source web crawling and browser automation library by Apify","apify\u002Fcrawlee","https:\u002F\u002Fgithub.com\u002Fapify\u002Fcrawlee",{"claim":116,"source_name":117,"source_url":118},"Crawl4AI is open-source and optimized for retrieval-augmented LLM input","unclecode\u002Fcrawl4ai","https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai",672,"2026-05-02T15:00:00Z"]