Mar 31, 2026 · 2 min read

GPT Crawler — Build Custom GPTs from Any Website

Crawl any website to generate knowledge files for custom GPTs and RAG. Output as JSON for OpenAI GPTs or any LLM knowledge base. Zero config. 22K+ stars.

TokRepo Featured · Community
Quick Use

Try it first, then decide how deep to go. The commands below cover what to copy, what to install, and how to run it.

npx gpt-crawler --url https://docs.example.com --match "https://docs.example.com/**"

Or configure:

// config.ts
export const config = {
  url: "https://docs.example.com",
  match: "https://docs.example.com/**",
  maxPagesToCrawl: 100,
  outputFileName: "output.json",
};
Or run from source:

git clone https://github.com/BuilderIO/gpt-crawler.git
cd gpt-crawler && npm install && npm start

Upload output.json to OpenAI GPT Builder or your RAG pipeline.


Intro

GPT Crawler turns any website into a knowledge file for custom GPTs and RAG pipelines. Point it at documentation, help centers, or any website — it crawls pages, extracts clean text, and outputs structured JSON ready for OpenAI's GPT Builder or any LLM knowledge base. Zero AI cost — it's a pure crawler, not an LLM app. 22,000+ GitHub stars, ISC licensed.

Best for: creating custom GPTs from documentation sites, building RAG knowledge bases from web content.
Works with: OpenAI GPTs, Claude Projects, any RAG pipeline (LangChain, LlamaIndex).


Key Features

One-Command Crawl

Point at any URL with a glob pattern — get structured JSON output.

Smart Extraction

Extracts main content, strips navigation/ads/boilerplate. Clean text optimized for LLMs.

Configurable

  • maxPagesToCrawl — cap the total number of pages crawled
  • match — URL glob pattern(s) selecting which pages to crawl
  • selector — CSS selector for content extraction
  • maxTokens — limit output size for GPT upload
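Putting the options above together, a fuller config.ts might look like the sketch below. The field names follow the list above; the selector value and the numeric limits are illustrative assumptions, not the project's defaults.

```typescript
// config.ts — illustrative sketch; values are assumptions, not project defaults
export const config = {
  url: "https://docs.example.com",            // starting page
  match: "https://docs.example.com/docs/**",  // glob of pages to include
  selector: "article",                        // CSS selector for main content (assumed)
  maxPagesToCrawl: 200,                       // caps page count, not link depth
  maxTokens: 2_000_000,                       // keep output under GPT upload limits
  outputFileName: "docs-knowledge.json",
};
```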

Output Formats

JSON array of {title, url, text} objects — ready for:

  • OpenAI GPT Builder (upload as knowledge)
  • Claude Projects (upload as context)
  • Any RAG vector store ingestion
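For vector-store ingestion, the {title, url, text} records usually need to be split into smaller chunks first. A minimal sketch, assuming the output shape above; the chunk size and overlap are arbitrary choices, not tool defaults:

```typescript
// Shape of each record in output.json (per the list above)
type Page = { title: string; url: string; text: string };

// Split each page's text into fixed-size, overlapping chunks,
// keeping the source URL for citation in a RAG pipeline.
function chunkPages(
  pages: Page[],
  size = 1000,
  overlap = 200,
): { source: string; chunk: string }[] {
  const chunks: { source: string; chunk: string }[] = [];
  for (const page of pages) {
    for (let start = 0; start < page.text.length; start += size - overlap) {
      chunks.push({ source: page.url, chunk: page.text.slice(start, start + size) });
    }
  }
  return chunks;
}

// Example: a 2500-character page with size 1000 / overlap 200 yields 4 chunks
const pages: Page[] = [
  { title: "Intro", url: "https://docs.example.com/intro", text: "a".repeat(2500) },
];
const chunks = chunkPages(pages);
console.log(chunks.length); // → 4
```

Each chunk carries its source URL, so answers generated from the store can link back to the original page.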

FAQ

Q: What is GPT Crawler? A: A tool that crawls any website and outputs structured JSON for creating custom GPTs and RAG knowledge bases. No AI cost — pure web crawling. 22K+ stars.

Q: How is it different from Crawl4AI or Firecrawl? A: GPT Crawler is simpler — focused specifically on generating GPT knowledge files. Crawl4AI and Firecrawl offer more features (JS rendering, structured extraction, APIs).


🙏

Source & Thanks

Created by Builder.io. Licensed under ISC. BuilderIO/gpt-crawler — 22,000+ GitHub stars
