TOKREPO · ARSENAL
Stable

AI Web Scraping

Firecrawl, Crawlee, Crawl4AI, GPT Crawler, ScrapeGraphAI — scraping engines that output LLM-ready markdown, not raw HTML.

5 assets

What's in this pack

#  Engine         Strength                                            Language
1  Firecrawl      hosted API + self-host, JS render, sitemap crawl    TypeScript
2  Crawlee        full crawler framework with proxy rotation          TypeScript / Python
3  Crawl4AI       RAG-optimized markdown, fastest async crawl         Python
4  GPT Crawler    one-config-file knowledge-base crawl for chatbots   TypeScript
5  ScrapeGraphAI  LLM-driven extraction via prompt + schema           Python

These five tools converge on the same insight: feeding an LLM raw HTML is a token tax. By the time you've stripped nav bars, ads, scripts, and inline styles, you've burned thousands of tokens for nothing. AI-native scrapers do this conversion at the crawler edge so your retrieval layer sees clean markdown.

Why scraping looks different in 2026

Three changes pushed the old scraping playbook into retirement.

First, JavaScript rendering became table stakes. Single-page apps and edge-rendered sites now hide content behind hydration. The 2018 stack (requests + BeautifulSoup) returns shells. Modern engines wrap headless Chromium and wait for the right network-idle event before extracting.

Second, retrieval is the destination, not display. The output isn't going into a search index — it's going into a vector database for RAG. That changes the optimization target from "render in a browser" to "fits in 8k tokens cleanly."

Third, anti-bot escalated. Cloudflare, DataDome, and PerimeterX block naive scrapers within seconds. Firecrawl and Crawlee solve this with rotating residential proxies, browser fingerprint randomization, and smart retry logic — features you'd otherwise duct-tape together over weeks.

Install in one command

# Install the whole pack
tokrepo install pack/ai-web-scraping

# Or pick the engine that matches your stack
tokrepo install firecrawl
tokrepo install crawl4ai
tokrepo install scrapegraphai

Each asset's TokRepo page bundles install commands, recommended config, and the most common output adapters (markdown, JSONL, vector-db direct insert).
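An output adapter is often just a few lines. A minimal JSONL-adapter sketch, assuming each scraped page arrives as a dict with `url` and `markdown` keys — an illustrative shape, not any engine's official output schema:

```python
import json

def to_jsonl(pages):
    """Serialize scraped pages to JSONL, one record per line.

    `pages` is an iterable of dicts like {"url": ..., "markdown": ...}.
    This record shape is an assumption for illustration; map your
    engine's actual result object onto it before calling.
    """
    return "\n".join(json.dumps(page, ensure_ascii=False) for page in pages)

jsonl = to_jsonl([
    {"url": "https://example.com/docs", "markdown": "# Docs"},
    {"url": "https://example.com/faq", "markdown": "# FAQ"},
])
# Each line is independently parseable, which is what batch embedding
# jobs and vector-DB bulk loaders expect.
```

The same shape feeds a vector-DB direct insert: parse each line, embed the `markdown` field, and store `url` as metadata.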

Common pitfalls

  • Robots.txt and rate limits: respect them. Most engines expose a respect_robots_txt flag that defaults to on; turning it off invites IP bans and legal trouble. Set polite crawl delays.
  • JavaScript pages without JS render: if Firecrawl/Crawl4AI returns empty content, you're hitting a hydration site without rendering enabled. Toggle the JS option.
  • Markdown drift: different engines emit slightly different markdown flavors (tables, code blocks, footnotes). Normalize post-crawl if you mix engines for the same RAG corpus.
  • PDF/Office files masquerading as web pages: web scrapers won't extract these. Hand off to the Document AI Pipeline pack instead.
  • Auth-walled content: scraping behind login is fragile and often violates ToS. Use the official API where one exists.
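The first pitfall is cheap to avoid with the standard library alone. A minimal sketch using Python's `urllib.robotparser` to check whether a URL is fetchable and read the site's Crawl-delay before hitting it — the policy string and agent name here are illustrative:

```python
import urllib.robotparser

def crawl_policy(robots_txt: str, agent: str, url: str):
    """Return (allowed, delay_seconds) for a URL under a robots.txt body.

    robots_txt is the raw file contents; agent is your crawler's
    user-agent string. delay is None when the site sets no Crawl-delay.
    """
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url), rp.crawl_delay(agent)

policy = "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n"
allowed, delay = crawl_policy(policy, "MyRagBot", "https://example.com/private/report")
# allowed is False here: /private/ is disallowed for every agent,
# and delay is 2, so a polite crawler sleeps 2s between requests.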

When this pack alone isn't enough

This pack is the extraction layer. To complete a RAG pipeline you also need:

  • A vector database — see the Vector DB Showdown pack for Chroma, Weaviate, Qdrant, and friends.
  • A chunking + embedding step — usually done with LangChain or LlamaIndex glue.
  • An eval loop — see LLM Eval & Guardrails to score retrieval relevance.

For PDF and Office inputs, switch to the Document AI Pipeline pack. For interactive scraping (filling forms, clicking through wizards), the Browser Automation pack is the right tool — those sites need Playwright-style interaction, not crawling.

Picking the right engine

  • Want a hosted API and don't mind paying for managed infra: Firecrawl. Best dev-ex of the five, JS render and proxy rotation built in.
  • Need to scrape millions of pages on owned hardware: Crawlee. The most mature crawler framework, with queue persistence and resumable runs.
  • Building a RAG ingest with Python: Crawl4AI. Async-first design hits 3-5x throughput vs synchronous crawlers on the same box.
  • One-time knowledge-base export for a chatbot: GPT Crawler. A single config.ts file points at a domain and out comes a JSONL ready to feed OpenAI's file uploader.
  • Pages where the schema is irregular and you want extraction by intent: ScrapeGraphAI. You hand it a Pydantic model and a prompt; it figures out the selectors per page.
What's inside

5 assets in this pack

Script#01
Firecrawl — Web Scraping API for AI Applications

Turn any website into clean markdown or structured data for LLMs. Firecrawl handles JavaScript rendering, anti-bot bypassing, sitemaps, and batch crawling via simple API.

by Prompt Lab
$ tokrepo install firecrawl-web-scraping-api-ai-applications-6a62a986
Script#02
Crawlee — Production Web Scraping for Node.js

Build reliable crawlers with automatic proxy rotation, request queuing, and browser automation. By Apify. 22K+ stars.

by Script Depot
$ tokrepo install crawlee-production-web-scraping-node-js-3e8c6e91
Script#03
Crawl4AI — LLM-Friendly Web Crawling

Open-source web crawler optimized for AI and LLM use cases. Extracts clean markdown, handles JavaScript-rendered pages, and supports structured data extraction.

by Script Depot
$ tokrepo install crawl4ai-llm-friendly-web-crawling-cb19c9d4
Config#04
GPT Crawler — Build Custom GPTs from Any Website

Crawl any website to generate knowledge files for custom GPTs and RAG. Output as JSON for OpenAI GPTs or any LLM knowledge base. One config file. 22K+ stars.

by AI Open Source
$ tokrepo install gpt-crawler-build-custom-gpts-any-website-bbd3962b
Script#05
ScrapeGraphAI — AI-Powered Web Scraping

Python scraping library powered by LLMs. Describe what you want to extract in natural language, get structured data back. Handles dynamic pages. 23K+ stars.

by Script Depot
$ tokrepo install scrapegraphai-ai-powered-web-scraping-d34e3181
FAQ

Frequently asked questions

Are these tools free to use?

All five are open-source. Firecrawl offers a hosted SaaS tier with a free quota, but you can self-host it for free. Crawlee, Crawl4AI, GPT Crawler, and ScrapeGraphAI are fully self-hostable and permissively licensed. The hidden cost is proxy services if you're crawling sites with aggressive anti-bot defenses — expect $50-200/month for residential proxies on real workloads.

Firecrawl vs Crawl4AI — which should I pick?

Firecrawl if you want a hosted endpoint and don't mind paying for managed infra; its API surface is simpler and the JS-render is rock solid. Crawl4AI if you're Python-native and want maximum throughput on self-host; its async architecture beats Firecrawl on raw speed but requires more ops glue. For a Cursor/Codex CLI agent calling tools, both work — Firecrawl just has fewer setup steps.

Will this work with Cursor or Codex CLI as a tool?

Yes — most of these have MCP servers or HTTP APIs that any AI tool with tool-calling can invoke. Firecrawl ships an official MCP server. Crawl4AI exposes a Python function you can wrap. Drop the MCP config into Cursor's settings or your Codex CLI agent definition and the LLM can scrape on demand.
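A tool-calling integration usually reduces to two pieces: a JSON-schema tool definition and a dispatcher that routes the model's call to your scraper. A minimal sketch — the `scrape_page` name and its wiring are illustrative; in practice the handler would call Firecrawl's API or Crawl4AI's crawler rather than a lambda:

```python
import json

def scrape_tool_spec():
    """Tool definition in the common function-calling JSON-schema shape.

    The name and parameters are assumptions for illustration, not any
    vendor's official MCP manifest.
    """
    return {
        "name": "scrape_page",
        "description": "Fetch a URL and return LLM-ready markdown.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    }

def dispatch(tool_call, handlers):
    """Route a model's tool call (name + JSON-encoded arguments)
    to the matching handler function."""
    handler = handlers[tool_call["name"]]
    return handler(**json.loads(tool_call["arguments"]))
```

Register the spec with your agent framework, put the real scrape function in `handlers`, and the LLM decides when to fetch.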

How is this different from the Browser Automation pack?

Scraping is data-extraction-first: you want LLM-ready markdown out of a page you can predict the URL of. Browser automation is interaction-first: you click, fill, navigate, screenshot. There's overlap (both use headless Chromium), but the API surface and the typical workflow differ. If you're building a RAG corpus, this pack. If you're filling forms, Browser Automation.

What's the operational gotcha?

Token blowup from over-eager crawls. A single sitemap with 10k pages at 5k tokens each is 50M tokens to embed, a real bill at API prices, and a far bigger one if each page also flows through an LLM for cleanup or enrichment. Always set a max_pages and max_depth first, sample 50 pages, count tokens, project the bill, then unleash. Cheap to forget, expensive to fix.
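The sample-then-project step can be sketched in a few lines. The chars/4 token estimate is a rough heuristic (swap in a real tokenizer such as tiktoken for your model), and the price parameter is a placeholder, not a quoted rate:

```python
def project_embedding_cost(sample_texts, total_pages, usd_per_million_tokens=0.02):
    """Project embedding cost for a full crawl from a small page sample.

    Token count uses the common ~4-characters-per-token heuristic
    (an assumption); the default price is illustrative only.
    """
    sample_tokens = sum(len(text) // 4 for text in sample_texts)
    tokens_per_page = sample_tokens / max(len(sample_texts), 1)
    projected_tokens = tokens_per_page * total_pages
    projected_cost = projected_tokens / 1_000_000 * usd_per_million_tokens
    return projected_tokens, projected_cost

# 50 sampled pages of ~4000 chars each, projected over a 10k-page sitemap
tokens, cost = project_embedding_cost(["a" * 4000] * 50, 10_000)
```

Run this against your 50-page sample before removing max_pages; if the projection surprises you, tighten max_depth or filter URL patterns.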

MORE FROM THE ARSENAL

12 packs · 80+ hand-picked assets
