AI Web Scraping
Firecrawl, Crawlee, Crawl4AI, GPT Crawler, ScrapeGraphAI — scraping engines that output LLM-ready markdown, not raw HTML.
What's in this pack
| # | Engine | Strength | Language |
|---|---|---|---|
| 1 | Firecrawl | hosted API + self-host, JS-render, sitemap crawl | TypeScript |
| 2 | Crawlee | full crawler framework with proxy rotation | TypeScript / Python |
| 3 | Crawl4AI | RAG-optimized markdown, fastest async crawl | Python |
| 4 | GPT Crawler | one-config-file knowledge-base crawl for chatbots | TypeScript |
| 5 | ScrapeGraphAI | LLM-driven extraction via prompt + schema | Python |
These five tools converge on the same insight: feeding an LLM raw HTML is a token tax. By the time you've stripped nav bars, ads, scripts, and inline styles, you've burned thousands of tokens for nothing. AI-native scrapers do this conversion at the crawler edge so your retrieval layer sees clean markdown.
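The token tax is easy to see with a stdlib-only sketch. The page below is made up, and a crude whitespace split stands in for a real tokenizer, but the ratio is the point: most of the raw HTML is boilerplate an LLM never needs.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keep visible text; drop script/style/nav/chrome entirely."""
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.depth = 0        # >0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

raw = """<html><head><style>body{color:#333}</style></head>
<body><nav><a href="/">Home</a><a href="/docs">Docs</a></nav>
<script>trackPageView();</script>
<main><h1>Pricing</h1><p>Plans start at $9/month.</p></main>
</body></html>"""

p = TextExtractor()
p.feed(raw)
clean = "\n".join(p.chunks)

# Rough token proxy: whitespace-split word counts, before vs after.
print(len(raw.split()), "->", len(clean.split()))
```

On real pages the ratio is far more lopsided: nav bars, trackers, and inline styles routinely outweigh the article text by an order of magnitude.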
Why scraping looks different in 2026
Three changes pushed the old scraping playbook into retirement.
First, JavaScript rendering became table stakes. Single-page apps and edge-rendered sites now hide content behind hydration. The 2018 stack (requests + BeautifulSoup) returns shells. Modern engines wrap headless Chromium and wait for the right network-idle event before extracting.
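The render-then-extract step those engines wrap looks roughly like this, assuming Playwright is installed (`pip install playwright && playwright install chromium`); the import is deferred into the function so the sketch stays importable without a browser present:

```python
def fetch_rendered(url: str, timeout_ms: int = 15_000) -> str:
    """Load a page in headless Chromium, wait for network idle,
    and return the hydrated HTML rather than the empty SPA shell."""
    from playwright.sync_api import sync_playwright  # deferred: needs playwright installed

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # "networkidle" waits until no requests are in flight,
        # i.e. after hydration has finished fetching its data.
        page.goto(url, wait_until="networkidle", timeout=timeout_ms)
        html = page.content()
        browser.close()
        return html
```

The 2018 stack stops at the first HTTP response; this version waits for the same content a human's browser would show.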
Second, retrieval is the destination, not display. The output isn't going into a search index — it's going into a vector database for RAG. That changes the optimization target from "render in a browser" to "fits in 8k tokens cleanly."
Third, anti-bot escalated. Cloudflare, DataDome, and PerimeterX block naive scrapers within seconds. Firecrawl and Crawlee solve this with rotating residential proxies, browser fingerprint randomization, and smart retry logic — features you'd otherwise duct-tape together over weeks.
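The rotation-plus-retry logic those engines bundle is conceptually small; this stdlib-only sketch shows the shape, with a placeholder proxy pool and a `fetch(url, proxy)` stand-in for your HTTP client:

```python
import itertools
import random
import time

PROXIES = itertools.cycle([   # placeholder pool; real workloads use residential proxies
    "http://proxy-a:8080",
    "http://proxy-b:8080",
    "http://proxy-c:8080",
])

def backoff(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with jitter, capped."""
    return min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)

def fetch_with_retries(url, fetch, max_attempts=4, base=1.0):
    """Route each attempt through the next proxy in the pool,
    sleeping with backoff between failures."""
    last_err = None
    for attempt in range(max_attempts):
        proxy = next(PROXIES)
        try:
            return fetch(url, proxy)
        except Exception as err:   # real code: catch 403/429/timeouts specifically
            last_err = err
            time.sleep(backoff(attempt, base=base))
    raise RuntimeError(f"gave up on {url}") from last_err
```

What the managed engines add on top — and what takes weeks to duct-tape yourself — is fingerprint randomization and a health-scored proxy pool instead of a naive round-robin.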
Install in one command
# Install the whole pack
tokrepo install pack/ai-web-scraping
# Or pick the engine that matches your stack
tokrepo install firecrawl
tokrepo install crawl4ai
tokrepo install scrapegraphai
Each asset's TokRepo page bundles install commands, recommended config, and the most common output adapters (markdown, JSONL, vector-db direct insert).
Common pitfalls
- Robots.txt and rate limits: respect them. Most engines have a `respect_robots_txt` flag that defaults to on; turning it off invites IP bans and legal trouble. Set polite crawl delays.
- JavaScript pages without JS render: if Firecrawl/Crawl4AI returns empty content, you're hitting a hydration site without rendering enabled. Toggle the JS option.
- Markdown drift: different engines emit slightly different markdown flavors (tables, code blocks, footnotes). Normalize post-crawl if you mix engines for the same RAG corpus.
- PDF/Office files masquerading as web pages: web scrapers won't extract these. Hand off to the Document AI Pipeline pack instead.
- Auth-walled content: scraping behind login is fragile and often violates ToS. Use the official API where one exists.
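For the markdown-drift pitfall, a small post-crawl normalizer goes a long way. This sketch handles two of the common drifts — setext headings rewritten as ATX, and runs of blank lines collapsed — and real mixed-engine corpora will need more rules:

```python
import re

def normalize_markdown(md: str) -> str:
    """Normalize markdown flavor differences before a RAG corpus mixes engines."""
    lines = md.splitlines()
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        # Setext headings ("Title" over "=====") -> ATX ("# Title")
        if line.strip() and re.fullmatch(r"=+", nxt.strip()):
            out.append(f"# {line.strip()}")
            i += 2
            continue
        if line.strip() and len(nxt.strip()) >= 2 and re.fullmatch(r"-+", nxt.strip()):
            out.append(f"## {line.strip()}")
            i += 2
            continue
        out.append(line.rstrip())
        i += 1
    # Collapse 3+ consecutive newlines down to a single blank line.
    text = "\n".join(out)
    return re.sub(r"\n{3,}", "\n\n", text).strip() + "\n"
```

Run every engine's output through the same normalizer before chunking, so the embedder never sees two spellings of the same structure.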
When this pack alone isn't enough
This pack is the extraction layer. To complete a RAG pipeline you also need:
- A vector database — see the Vector DB Showdown pack for Chroma, Weaviate, Qdrant, and friends.
- A chunking + embedding step — usually done with LangChain or LlamaIndex glue.
- An eval loop — see LLM Eval & Guardrails to score retrieval relevance.
For PDF and Office inputs, switch to the Document AI Pipeline pack. For interactive scraping (filling forms, clicking through wizards), the Browser Automation pack is the right tool — those sites need Playwright-style interaction, not a crawl.
Picking the right engine
- Want a hosted API and don't mind paying for managed infra: Firecrawl. Best dev-ex of the five, JS render and proxy rotation built in.
- Need to scrape millions of pages on owned hardware: Crawlee. The most mature crawler framework, with queue persistence and resumable runs.
- Building a RAG ingest with Python: Crawl4AI. Async-first design hits 3-5x throughput vs synchronous crawlers on the same box.
- One-time knowledge-base export for a chatbot: GPT Crawler. A single `config.ts` file points at a domain and out comes a JSONL file ready to feed OpenAI's file uploader.
- Pages where the schema is irregular and you want extraction by intent: ScrapeGraphAI. You hand it a Pydantic model and a prompt; it figures out the selectors per page.
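For the Crawl4AI route, the ingest loop is async end to end. This sketch assumes Crawl4AI's `AsyncWebCrawler`/`arun` interface and the `markdown` field on its results; the import is deferred so the snippet stays importable without the package (`pip install crawl4ai`):

```python
import asyncio

async def ingest(urls: list[str]) -> dict[str, str]:
    """Crawl pages concurrently and collect LLM-ready markdown per URL."""
    from crawl4ai import AsyncWebCrawler  # deferred: needs `pip install crawl4ai`

    pages: dict[str, str] = {}
    async with AsyncWebCrawler() as crawler:
        # Fan out the crawl; one shared browser, many concurrent pages.
        results = await asyncio.gather(*(crawler.arun(url=u) for u in urls))
        for url, result in zip(urls, results):
            pages[url] = result.markdown   # ready for chunking + embedding
    return pages
```

The `asyncio.gather` fan-out is where the 3-5x throughput claim comes from: page fetches overlap instead of queueing behind one another.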
Frequently asked questions
Are these tools free to use?
All five are open-source. Firecrawl offers a hosted SaaS tier with free quota, but you can self-host it for free. Crawlee, Crawl4AI, GPT Crawler, and ScrapeGraphAI are 100% self-hosted and BSD/MIT licensed. The hidden cost is proxy services if you're crawling sites with aggressive anti-bot — expect $50-200/month for residential proxies on real workloads.
Firecrawl vs Crawl4AI — which should I pick?
Firecrawl if you want a hosted endpoint and don't mind paying for managed infra; its API surface is simpler and the JS-render is rock solid. Crawl4AI if you're Python-native and want maximum throughput on self-host; its async architecture beats Firecrawl on raw speed but requires more ops glue. For a Cursor/Codex CLI agent calling tools, both work — Firecrawl just has fewer setup steps.
Will this work with Cursor or Codex CLI as a tool?
Yes — most of these have MCP servers or HTTP APIs that any AI tool with tool-calling can invoke. Firecrawl ships an official MCP server. Crawl4AI exposes a Python function you can wrap. Drop the MCP config into Cursor's settings or your Codex CLI agent definition and the LLM can scrape on demand.
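Wiring the Firecrawl MCP server into Cursor is a small JSON block in the MCP settings. The package name and env var below follow Firecrawl's published server, but check their docs — both may drift across versions:

```json
{
  "mcpServers": {
    "firecrawl": {
      "command": "npx",
      "args": ["-y", "firecrawl-mcp"],
      "env": { "FIRECRAWL_API_KEY": "fc-YOUR_KEY" }
    }
  }
}
```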
How is this different from the Browser Automation pack?
Scraping is data-extraction-first: you want LLM-ready markdown out of a page you can predict the URL of. Browser automation is interaction-first: you click, fill, navigate, screenshot. There's overlap (both use headless Chromium), but the API surface and the typical workflow differ. If you're building a RAG corpus, this pack. If you're filling forms, Browser Automation.
What's the operational gotcha?
Token blowup from over-eager crawls. A single sitemap with 10k pages at 5k tokens each is 50M tokens of embedding cost — easily $500+ at OpenAI prices. Always set a max_pages and max_depth first, sample 50 pages, count tokens, project the bill, then unleash. Cheap to forget, expensive to fix.
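The sample-then-project step is one function. A whitespace word count stands in for a real tokenizer here, and the $/1M-token price is a placeholder you should replace with your embedding model's current rate:

```python
def project_embedding_cost(sample_pages, total_pages, usd_per_million_tokens):
    """Estimate corpus-wide embedding cost from a small sample of crawled pages."""
    sample_tokens = [len(text.split()) for text in sample_pages]  # crude token proxy
    avg = sum(sample_tokens) / len(sample_tokens)
    total_tokens = avg * total_pages
    return total_tokens * usd_per_million_tokens / 1_000_000

# e.g. 50 sampled pages averaging 5,000 "tokens" projected over a
# 10,000-page sitemap is 50M tokens; at a hypothetical $10 per 1M
# tokens that's a $500 bill -- known before the crawl, not after.
```

Run it on the 50-page sample, and only lift `max_pages` once the projected number survives contact with your budget.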