Introduction
Kreuzberg is a document intelligence framework built around a high-performance Rust core with bindings for Python, Ruby, Go, Java, TypeScript, and more. It extracts text, metadata, tables, and images from virtually any document format, making it a foundational building block for RAG pipelines, search indexing, and document processing workflows.
What Kreuzberg Does
- Extracts text content from 97+ formats, including PDF, DOCX, PPTX, images, and HTML
- Detects and extracts tables with structure preservation
- Pulls metadata (author, dates, page count) from documents
- Performs OCR on scanned documents and images via Tesseract
- Returns structured output suitable for LLM ingestion and RAG
Architecture Overview
The core extraction engine is written in Rust, using pdfium for PDF rendering and Tesseract bindings for OCR. Format-specific parsers handle Office XML, HTML, email, and other document types. The Rust core compiles to native libraries and WebAssembly, enabling bindings for 11 languages through FFI. Each binding provides an idiomatic API while sharing the same underlying extraction logic.
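The idea of format-specific parsers behind one entry point can be sketched as a small dispatch table. This is an illustration only, not Kreuzberg's actual code: `parse_pdf`, `parse_html`, `parse_email`, `PARSERS`, and `pick_parser` are hypothetical names, and the parsers are stubs.

```python
from pathlib import Path
from typing import Callable, Dict

# A parser takes raw bytes and returns extracted text.
Parser = Callable[[bytes], str]


# Hypothetical stub parsers; real ones would do format-specific work.
def parse_pdf(data: bytes) -> str:
    return f"pdf: {len(data)} bytes"


def parse_html(data: bytes) -> str:
    return f"html: {len(data)} bytes"


def parse_email(data: bytes) -> str:
    return f"email: {len(data)} bytes"


# Assumed mapping from file extension to parser (illustrative, not exhaustive).
PARSERS: Dict[str, Parser] = {
    ".pdf": parse_pdf,
    ".html": parse_html,
    ".htm": parse_html,
    ".eml": parse_email,
}


def pick_parser(filename: str) -> Parser:
    """Return the parser registered for the file's extension."""
    suffix = Path(filename).suffix.lower()
    try:
        return PARSERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported format: {filename!r}") from None
```

Callers see a single function regardless of format, which is roughly the shape that lets every language binding expose the same extraction call.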
Self-Hosting & Configuration
- Install via the package manager for your language (pip, gem, go get, npm, etc.)
- Optionally install Tesseract for OCR support on scanned documents
- Configure OCR language packs for non-English documents
- Available as a REST API server and MCP server for agent integration
- Also available as a standalone CLI tool
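Before enabling OCR, it can help to confirm that Tesseract and the required language packs are actually installed. The following is a minimal pre-flight sketch that uses only the Python standard library and the `tesseract --list-langs` command. It is not part of Kreuzberg's API, and the function names are made up for illustration.

```python
import shutil
import subprocess
from typing import List


def tesseract_available() -> bool:
    """True if a `tesseract` executable is on PATH."""
    return shutil.which("tesseract") is not None


def installed_ocr_languages() -> List[str]:
    """List the language packs reported by `tesseract --list-langs`.

    Returns an empty list if Tesseract is missing or the call fails.
    """
    if not tesseract_available():
        return []
    try:
        proc = subprocess.run(
            ["tesseract", "--list-langs"],
            capture_output=True,
            text=True,
            check=False,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    # Some Tesseract versions print the list to stderr instead of stdout.
    output = proc.stdout or proc.stderr
    return [
        line.strip()
        for line in output.splitlines()
        if line.strip() and not line.startswith("List of")
    ]


if __name__ == "__main__":
    print("tesseract found:", tesseract_available())
    print("languages:", installed_ocr_languages())
```

If a language you need (for example `deu` for German) is missing from the list, install the corresponding Tesseract language pack through your system package manager.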
Key Features
- Single extraction API across 97+ document formats
- Rust core ensures consistent behavior across all language bindings
- Table extraction preserves row/column structure
- OCR integration for scanned and image-based documents
- WebAssembly build for browser and edge deployment
Comparison with Similar Tools
- Apache Tika: Java-based with a heavy runtime; Kreuzberg has a lightweight Rust core
- Unstructured: Python-only; Kreuzberg supports 11 languages natively
- Docling: focused on PDF; Kreuzberg handles 97+ formats
- MarkItDown: converts to Markdown; Kreuzberg provides structured extraction
- MinerU: PDF-focused deep extraction; Kreuzberg is broader but less specialized on PDFs
FAQ
Q: Does it handle scanned PDFs? A: Yes. When text extraction yields empty results, Kreuzberg automatically falls back to OCR via Tesseract.
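The fallback described in this answer can be sketched as follows. This is a simplified illustration of the control flow, not Kreuzberg's implementation: `extract_with_ocr_fallback` is a hypothetical name, and the two callables stand in for the native text extractor and the Tesseract-backed OCR step.

```python
from typing import Callable, Tuple


def extract_with_ocr_fallback(
    data: bytes,
    extract_text: Callable[[bytes], str],
    run_ocr: Callable[[bytes], str],
) -> Tuple[str, bool]:
    """Try native text extraction first; fall back to OCR if it yields nothing.

    Returns (text, used_ocr).
    """
    text = extract_text(data)
    # Whitespace-only output is treated as empty, as is common for scanned PDFs.
    if text.strip():
        return text, False
    return run_ocr(data), True
```

Returning the `used_ocr` flag alongside the text lets callers log or budget for the slower OCR path.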
Q: Can I use it in a browser? A: Yes. The WebAssembly build works in browsers and Deno/Bun without native dependencies.
Q: How does it compare performance-wise to Python alternatives? A: The Rust core is significantly faster than pure Python parsers, especially for large documents and batch processing.
Q: Does it support structured table output? A: Yes. Tables are returned as arrays of rows with cell text and optional column headers.
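To make the shape described in the last answer concrete, here is a small sketch of a table as rows of cell text with optional headers, plus a helper that renders it as Markdown for an LLM prompt. The `Table` class and `table_to_markdown` function are assumptions for illustration; the actual types and field names in each binding may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Table:
    """Hypothetical table shape: rows of cell text, optional column headers."""
    rows: List[List[str]]
    headers: Optional[List[str]] = None


def table_to_markdown(table: Table) -> str:
    """Render a table as a Markdown pipe table, padding short rows."""
    width = max((len(row) for row in table.rows), default=0)
    if table.headers:
        width = max(width, len(table.headers))
    if width == 0:
        return ""

    # Fall back to generic column names when no headers were detected.
    headers = table.headers or [f"col{i + 1}" for i in range(width)]

    def fmt(cells: List[str]) -> str:
        padded = list(cells) + [""] * (width - len(cells))
        escaped = [cell.replace("|", "\\|") for cell in padded]
        return "| " + " | ".join(escaped) + " |"

    lines = [fmt(headers), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(row) for row in table.rows)
    return "\n".join(lines)
```

Keeping rows and headers as plain lists makes the table easy to serialize to JSON for indexing or to flatten into text for a prompt.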