Configs · Mar 31, 2026 · 2 min read

GPT Crawler — Build Custom GPTs from Any Website

Crawl any website to generate knowledge files for custom GPTs and RAG. Output as JSON for OpenAI GPTs or any LLM knowledge base. Zero config. 22K+ stars.

TL;DR
GPT Crawler crawls websites and outputs structured JSON ready for OpenAI GPT Builder or any RAG pipeline.
§01

What it is

GPT Crawler is an open-source tool that crawls any website and converts its content into structured JSON files suitable for custom GPTs and RAG knowledge bases. Point it at documentation sites, help centers, or any public website, and it extracts clean text optimized for LLM consumption.

GPT Crawler is designed for developers and AI practitioners who want to create custom knowledge bases from existing web content without writing scrapers from scratch. It is a pure crawler with zero AI cost.

§02

How it saves time or tokens

Building a knowledge base from a documentation site normally requires writing a custom scraper, handling pagination, cleaning HTML, and formatting output. GPT Crawler reduces this to a single command or config file. The maxTokens option lets you control output size to fit within GPT upload limits, and the CSS selector option targets specific content areas to avoid navigation and boilerplate noise.

§03

How to use

  1. Run directly with npx for quick crawls:
npx gpt-crawler --url https://docs.example.com --match 'https://docs.example.com/**'
  2. For more control, clone and configure:
git clone https://github.com/BuilderIO/gpt-crawler.git
cd gpt-crawler && npm install
  3. Edit the config and run:
// config.ts
export const config = {
  url: 'https://docs.example.com',
  match: 'https://docs.example.com/**',
  maxPagesToCrawl: 100,
  outputFileName: 'output.json',
  selector: '.main-content',
  maxTokens: 500000,
};
npm start
  4. Upload output.json to OpenAI GPT Builder or feed it into your RAG pipeline.
§04

Example

The output JSON structure looks like this:

[
  {
    "title": "Getting Started",
    "url": "https://docs.example.com/getting-started",
    "content": "This guide walks you through setting up..."
  },
  {
    "title": "API Reference",
    "url": "https://docs.example.com/api",
    "content": "The API accepts JSON requests on..."
  }
]
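Downstream, this array can be split into retrieval-sized chunks in a few lines. The helper below is an illustrative sketch, not part of GPT Crawler; it assumes the title/url/content shape shown above and uses a simple fixed-size character split.

```typescript
// Illustrative helper (not part of GPT Crawler): split crawled pages
// into fixed-size character chunks, keeping each chunk's source URL
// so a RAG pipeline can cite where an answer came from.
type CrawledPage = { title: string; url: string; content: string };
type Chunk = { text: string; url: string };

function chunkPages(pages: CrawledPage[], chunkSize = 1000): Chunk[] {
  const chunks: Chunk[] = [];
  for (const page of pages) {
    for (let i = 0; i < page.content.length; i += chunkSize) {
      chunks.push({ text: page.content.slice(i, i + chunkSize), url: page.url });
    }
  }
  return chunks;
}
```

Load the crawler's file with `JSON.parse(fs.readFileSync('output.json', 'utf8'))` and pass the parsed array straight in.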
§05

Common pitfalls

  • Setting maxPagesToCrawl too high on large sites can result in multi-gigabyte output files; start with 50-100 pages and increase as needed
  • Sites with client-side rendering (SPAs) may return empty content; GPT Crawler uses static fetching and does not execute JavaScript by default
  • The match glob pattern must be precise; overly broad patterns will crawl unrelated sections of the site
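On the match point above, scoping the glob to the intended subtree keeps the crawl focused. The config sketch below uses placeholder URLs to show the difference between a broad and a precise pattern:

```typescript
// config.ts — placeholder URLs; scope match to the docs subtree only.
export const config = {
  url: 'https://docs.example.com/guides',
  // A broad pattern like 'https://docs.example.com/**' would also pull in
  // changelogs, release notes, and anything else under the docs host.
  match: 'https://docs.example.com/guides/**',
  maxPagesToCrawl: 50,
  outputFileName: 'output.json',
};
```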

Frequently Asked Questions

What output formats does GPT Crawler support?

GPT Crawler outputs JSON by default, structured as an array of objects with title, url, and content fields. This format is directly compatible with OpenAI GPT Builder uploads and can be fed into LangChain, LlamaIndex, or any RAG pipeline with minimal transformation.
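As an example of that minimal transformation, here is a sketch mapping crawler output onto the { pageContent, metadata } document shape used by LangChain-style pipelines. The field names follow that convention; adjust them for your framework of choice.

```typescript
// Map GPT Crawler output onto a LangChain-style document shape.
// The Doc type here is illustrative, not an import from any library.
type CrawledPage = { title: string; url: string; content: string };
type Doc = { pageContent: string; metadata: { title: string; source: string } };

function toDocuments(pages: CrawledPage[]): Doc[] {
  return pages.map((p) => ({
    pageContent: p.content,
    metadata: { title: p.title, source: p.url },
  }));
}
```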

Can GPT Crawler handle JavaScript-rendered pages?

The default mode uses static HTML fetching and does not execute JavaScript. For single-page applications or sites that render content client-side, you would need to pair GPT Crawler with a headless browser or use an alternative tool that supports JavaScript rendering.

How do I limit the output size for GPT uploads?

Use the maxTokens configuration option to cap the total token count of the output file. OpenAI GPT Builder has upload size limits, so setting maxTokens to 500000 or lower ensures compatibility. You can also use maxPagesToCrawl to limit the number of pages.
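The crawler applies maxTokens itself; the sketch below only illustrates the idea of a token budget after the fact, trimming an existing output array with the rough 4-characters-per-token approximation (a heuristic, not a real tokenizer).

```typescript
// Post-hoc trim of a crawled-output array to a token budget, using the
// rough ~4 characters-per-token heuristic. Pages past the budget are
// dropped whole rather than truncated mid-page.
type CrawledPage = { title: string; url: string; content: string };

function capToBudget(pages: CrawledPage[], maxTokens: number): CrawledPage[] {
  const kept: CrawledPage[] = [];
  let used = 0;
  for (const page of pages) {
    const tokens = Math.ceil(page.content.length / 4);
    if (used + tokens > maxTokens) break;
    kept.push(page);
    used += tokens;
  }
  return kept;
}
```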

Is GPT Crawler free to use?

Yes. GPT Crawler is open source under the ISC license. It is a pure web crawler with no AI API calls, so there are no per-use costs. The only cost is your own compute for running the crawler.

Can I use GPT Crawler output with Claude Projects?

Yes. The JSON output works with any system that accepts text knowledge files. For Claude Projects, you can upload the JSON directly as a project knowledge file, and Claude will use it as context for answering questions about the crawled content.
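If you prefer a plain-text knowledge file over raw JSON, a small flattening step works too. This is an optional convenience sketch, not something GPT Crawler does itself:

```typescript
// Flatten crawler JSON into one plain-text knowledge file: a titled
// section per page, separated by horizontal rules, with the source URL
// preserved so answers can point back to the original page.
type CrawledPage = { title: string; url: string; content: string };

function toKnowledgeText(pages: CrawledPage[]): string {
  return pages
    .map((p) => `# ${p.title}\nSource: ${p.url}\n\n${p.content}`)
    .join("\n\n---\n\n");
}
```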


Source & Thanks

Created by Builder.io. Licensed under ISC. BuilderIO/gpt-crawler — 22,000+ GitHub stars
