Firecrawl — Web Scraping API for AI Applications
Turn any website into clean markdown or structured data for LLMs. Firecrawl handles JavaScript rendering, anti-bot bypassing, sitemaps, and batch crawling via a simple API.
What it is
Firecrawl is a web scraping API that converts any website into clean markdown or structured data optimized for LLM ingestion. It handles JavaScript rendering, anti-bot bypassing, sitemaps, and batch crawling out of the box, so developers can focus on building AI features instead of scraping infrastructure.
The tool targets AI engineers building RAG pipelines, knowledge bases, or data collection systems that need reliable web content extraction.
How it saves time or tokens
Raw HTML is noisy -- ads, navigation, scripts, and boilerplate inflate token counts when fed to LLMs. Firecrawl strips all of that and returns only the meaningful content as markdown. This can reduce prompt tokens by roughly 60-80% compared to feeding raw HTML, depending on the page, and eliminates the need to build and maintain your own rendering and extraction pipeline.
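To get a feel for how much of a page is markup rather than content, the savings can be estimated locally before involving any API. The sketch below is not Firecrawl's extraction logic: the list of noise tags, the helper names, and the 4-characters-per-token heuristic are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags whose contents count as noise in this sketch (an assumption,
# not Firecrawl's actual rule set).
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class VisibleTextExtractor(HTMLParser):
    """Collect text that is not nested inside a noise tag."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside noise tags.
        if self.noise_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text outside noise tags, one chunk per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def estimate_tokens(text):
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def token_savings(html):
    """Return the estimated fraction of tokens saved by extraction."""
    raw = estimate_tokens(html)
    clean = estimate_tokens(extract_text(html))
    return 1 - clean / raw
```

Calling token_savings() on a saved copy of a page gives a rough fraction to compare against the figure quoted above. A real tokenizer would give more accurate counts than the character heuristic.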
How to use
- Sign up at firecrawl.dev and get an API key.
- Install the SDK: pip install firecrawl-py or npm install @mendable/firecrawl-js.
- Call scrape_url() with your target URL to get clean markdown back.
Example
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='fc-YOUR_KEY')

# Scrape a single page
result = app.scrape_url('https://docs.python.org/3/tutorial/classes.html')
print(result['markdown'][:500])

# Crawl an entire site
crawl = app.crawl_url(
    'https://docs.python.org/3/',
    params={'limit': 50, 'scrapeOptions': {'formats': ['markdown']}}
)
for page in crawl['data']:
    print(page['metadata']['title'])
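The example indexes results as dictionaries. SDK releases may differ in whether they return plain dicts or response objects with attributes; that is an assumption to check against the version you have installed. A small accessor that tolerates both shapes might look like the sketch below, where get_field and page_title are hypothetical helper names and not part of the SDK.

```python
def get_field(result, name, default=None):
    """Read a field from a dict-style or attribute-style result."""
    if isinstance(result, dict):
        return result.get(name, default)
    return getattr(result, name, default)


def page_title(page):
    """Return the page title, or a placeholder when none is present."""
    # Metadata may be missing or None, so fall back to an empty dict.
    metadata = get_field(page, "metadata", {}) or {}
    return get_field(metadata, "title", "(untitled)")
```

With these helpers, the loop body can call page_title(page) instead of indexing directly, so a missing title or a different result type does not raise an exception.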
Related on TokRepo
- AI Tools for Web Scraping -- compare web scraping solutions for AI workflows
- AI Tools for RAG -- retrieval-augmented generation tools and pipelines
Common pitfalls
- Rate limits apply on the free tier. For batch crawling, use the async crawl endpoint and poll for results instead of synchronous calls.
- Some sites block headless browsers regardless of anti-bot measures. Always check the response status and have a fallback strategy.
- Firecrawl's markdown output quality depends on the site's HTML structure. Heavily JavaScript-rendered SPAs may need extra wait time configuration.
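The polling advice in the first pitfall can be sketched as a generic helper. Nothing here calls Firecrawl directly: check_status is a caller-supplied function, and the status names ("completed", "failed", "cancelled") are assumptions that should be matched to what your SDK version actually returns.

```python
import time


def poll_until_done(check_status, job_id, interval=2.0, timeout=300.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call check_status(job_id) until the job completes, fails, or times out.

    check_status must return a dict with a 'status' key. The status names
    used below are assumptions for this sketch.
    """
    deadline = clock() + timeout
    while True:
        status = check_status(job_id)
        state = status.get("status")
        if state == "completed":
            return status
        if state in ("failed", "cancelled"):
            raise RuntimeError(f"crawl {job_id} ended with status {state!r}")
        if clock() >= deadline:
            raise TimeoutError(f"crawl {job_id} not finished after {timeout}s")
        # Wait before the next check to stay within rate limits.
        sleep(interval)
```

With the Python SDK, check_status would wrap whichever status method your installed version provides. In 1.x releases of firecrawl-py this was likely app.check_crawl_status, with the job id taken from the response of app.async_crawl_url, but treat both names as assumptions to verify. The sleep and clock parameters exist only so the helper can be tested without real delays.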
Frequently Asked Questions
Does Firecrawl handle JavaScript-rendered pages?
Yes. Firecrawl uses headless browsers to render pages before extraction. This means single-page applications built with React, Vue, or Angular are fully rendered before content is extracted.
Can Firecrawl crawl an entire site?
Yes. The crawl_url method accepts a starting URL and follows internal links up to a configurable limit. Results are returned as a list of pages, each with markdown content and metadata.
What output formats does Firecrawl support?
Firecrawl returns content as markdown (default), plain text, or structured JSON via LLM extraction. Markdown is the most common format for feeding content into RAG pipelines.
Can Firecrawl be self-hosted?
Yes. Firecrawl is open source and can be self-hosted using Docker. The self-hosted version removes API rate limits and keeps all data within your infrastructure.
How does Firecrawl differ from BeautifulSoup or Scrapy?
BeautifulSoup and Scrapy are general-purpose scraping libraries that require you to handle rendering, parsing, and content extraction yourself. Firecrawl is purpose-built for LLM use cases with built-in rendering, anti-bot measures, and markdown conversion.
Source & Thanks
Created by Mendable. Licensed under AGPL-3.0.
mendableai/firecrawl — 30k+ stars