Configs · Mar 31, 2026 · 2 min read

GPT Crawler — Build Custom GPTs from Any Website

Crawl any website to generate knowledge files for custom GPTs and RAG. Output as JSON for OpenAI GPTs or any LLM knowledge base. Zero config. 22K+ stars.

TL;DR
GPT Crawler crawls websites and outputs structured JSON ready for OpenAI GPT Builder or any RAG pipeline.
§01

What it is

GPT Crawler is an open-source tool that crawls any website and converts its content into structured JSON files suitable for custom GPTs and RAG knowledge bases. Point it at documentation sites, help centers, or any public website, and it extracts clean text optimized for LLM consumption.

GPT Crawler is designed for developers and AI practitioners who want to create custom knowledge bases from existing web content without writing scrapers from scratch. It is a pure crawler with zero AI cost.

§02

How it saves time or tokens

Building a knowledge base from a documentation site normally requires writing a custom scraper, handling pagination, cleaning HTML, and formatting output. GPT Crawler reduces this to a single command or config file. The maxTokens option lets you control output size to fit within GPT upload limits, and the CSS selector option targets specific content areas to avoid navigation and boilerplate noise.

§03

How to use

  1. Run directly with npx for quick crawls:
npx gpt-crawler --url https://docs.example.com --match 'https://docs.example.com/**'
  2. For more control, clone and configure:
git clone https://github.com/BuilderIO/gpt-crawler.git
cd gpt-crawler && npm install
  3. Edit the config and run:
// config.ts
export const config = {
  url: 'https://docs.example.com',
  match: 'https://docs.example.com/**',
  maxPagesToCrawl: 100,
  outputFileName: 'output.json',
  selector: '.main-content',
  maxTokens: 500000,
};
npm start
  4. Upload output.json to OpenAI GPT Builder or feed it into your RAG pipeline.
§04

Example

The output JSON structure looks like this:

[
  {
    "title": "Getting Started",
    "url": "https://docs.example.com/getting-started",
    "content": "This guide walks you through setting up..."
  },
  {
    "title": "API Reference",
    "url": "https://docs.example.com/api",
    "content": "The API accepts JSON requests on..."
  }
]
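Downstream, this array can be split into retrieval-sized chunks in a few lines. The helper below is an illustrative sketch, not part of GPT Crawler; it assumes the title/url/content shape shown above and uses a simple fixed-size character split.

```typescript
// Illustrative helper (not part of GPT Crawler): split crawled pages
// into fixed-size character chunks, keeping each chunk's source URL
// so a RAG pipeline can cite where an answer came from.
type CrawledPage = { title: string; url: string; content: string };
type Chunk = { text: string; url: string };

function chunkPages(pages: CrawledPage[], chunkSize = 1000): Chunk[] {
  const chunks: Chunk[] = [];
  for (const page of pages) {
    for (let i = 0; i < page.content.length; i += chunkSize) {
      chunks.push({ text: page.content.slice(i, i + chunkSize), url: page.url });
    }
  }
  return chunks;
}
```

Load the crawler's file with `JSON.parse(fs.readFileSync('output.json', 'utf8'))` and pass the parsed array straight in.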
§05

Common pitfalls

  • Setting maxPagesToCrawl too high on large sites can result in multi-gigabyte output files; start with 50-100 pages and increase as needed
  • Sites with client-side rendering (SPAs) may return empty content; GPT Crawler uses static fetching and does not execute JavaScript by default
  • The match glob pattern must be precise; overly broad patterns will crawl unrelated sections of the site
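On the match point above, scoping the glob to the intended subtree keeps the crawl focused. The config sketch below uses placeholder URLs to show the difference between a broad and a precise pattern:

```typescript
// config.ts — placeholder URLs; scope match to the docs subtree only.
export const config = {
  url: 'https://docs.example.com/guides',
  // A broad pattern like 'https://docs.example.com/**' would also pull in
  // changelogs, release notes, and anything else under the docs host.
  match: 'https://docs.example.com/guides/**',
  maxPagesToCrawl: 50,
  outputFileName: 'output.json',
};
```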

Frequently Asked Questions

What output formats does GPT Crawler support?

GPT Crawler outputs JSON by default, structured as an array of objects with title, url, and content fields. This format is directly compatible with OpenAI GPT Builder uploads and can be fed into LangChain, LlamaIndex, or any RAG pipeline with minimal transformation.
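As an example of that minimal transformation, here is a sketch mapping crawler output onto the { pageContent, metadata } document shape used by LangChain-style pipelines. The field names follow that convention; adjust them for your framework of choice.

```typescript
// Map GPT Crawler output onto a LangChain-style document shape.
// The Doc type here is illustrative, not an import from any library.
type CrawledPage = { title: string; url: string; content: string };
type Doc = { pageContent: string; metadata: { title: string; source: string } };

function toDocuments(pages: CrawledPage[]): Doc[] {
  return pages.map((p) => ({
    pageContent: p.content,
    metadata: { title: p.title, source: p.url },
  }));
}
```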

Can GPT Crawler handle JavaScript-rendered pages?

The default mode uses static HTML fetching and does not execute JavaScript. For single-page applications or sites that render content client-side, you would need to pair GPT Crawler with a headless browser or use an alternative tool that supports JavaScript rendering.

How do I limit the output size for GPT uploads?

Use the maxTokens configuration option to cap the total token count of the output file. OpenAI GPT Builder has upload size limits, so setting maxTokens to 500000 or lower ensures compatibility. You can also use maxPagesToCrawl to limit the number of pages.
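The crawler applies maxTokens itself; the sketch below only illustrates the idea of a token budget after the fact, trimming an existing output array with the rough 4-characters-per-token approximation (a heuristic, not a real tokenizer).

```typescript
// Post-hoc trim of a crawled-output array to a token budget, using the
// rough ~4 characters-per-token heuristic. Pages past the budget are
// dropped whole rather than truncated mid-page.
type CrawledPage = { title: string; url: string; content: string };

function capToBudget(pages: CrawledPage[], maxTokens: number): CrawledPage[] {
  const kept: CrawledPage[] = [];
  let used = 0;
  for (const page of pages) {
    const tokens = Math.ceil(page.content.length / 4);
    if (used + tokens > maxTokens) break;
    kept.push(page);
    used += tokens;
  }
  return kept;
}
```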

Is GPT Crawler free to use?

Yes. GPT Crawler is open source under the ISC license. It is a pure web crawler with no AI API calls, so there are no per-use costs. The only cost is your own compute for running the crawler.

Can I use GPT Crawler output with Claude Projects?

Yes. The JSON output works with any system that accepts text knowledge files. For Claude Projects, you can upload the JSON directly as a project knowledge file, and Claude will use it as context for answering questions about the crawled content.
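If you prefer a plain-text knowledge file over raw JSON, a small flattening step works too. This is an optional convenience sketch, not something GPT Crawler does itself:

```typescript
// Flatten crawler JSON into one plain-text knowledge file: a titled
// section per page, separated by horizontal rules, with the source URL
// preserved so answers can point back to the original page.
type CrawledPage = { title: string; url: string; content: string };

function toKnowledgeText(pages: CrawledPage[]): string {
  return pages
    .map((p) => `# ${p.title}\nSource: ${p.url}\n\n${p.content}`)
    .join("\n\n---\n\n");
}
```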


Source & Thanks

Created by Builder.io. Licensed under ISC. BuilderIO/gpt-crawler — 22,000+ GitHub stars
