Kreuzberg — Polyglot Document Intelligence Framework with a Rust Core

Introduction

Kreuzberg is a document intelligence framework built around a high-performance Rust core with bindings for Python, Ruby, Go, Java, TypeScript, and more. It extracts text, metadata, tables, and images from virtually any document format, making it a foundational building block for RAG pipelines, search indexing, and document processing workflows.

What Kreuzberg Does

Extracts text content from PDFs, DOCX, PPTX, images, HTML, and other formats (97+ in total)
Detects and extracts tables with structure preservation
Pulls metadata (author, dates, page count) from documents
Performs OCR on scanned documents and images via Tesseract
Returns structured output suitable for LLM ingestion and RAG

Architecture Overview

The core extraction engine is written in Rust, using pdfium for PDF rendering and Tesseract bindings for OCR. Format-specific parsers handle Office XML, HTML, email, and other document types. The Rust core compiles to native libraries and to WebAssembly, enabling bindings for 11 languages through FFI. Each binding provides an idiomatic API while sharing the same underlying extraction logic.
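The idea of format-specific parsers behind one entry point can be pictured with a small sketch. This is not Kreuzberg's code: Python stands in for the Rust core, and the names `PARSERS`, `parse_html`, `parse_plain`, and `extract` are hypothetical, chosen only to illustrate dispatching on the detected format.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import Callable, Dict, List


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def parse_html(data: bytes) -> str:
    # Hypothetical HTML parser: strip tags, keep text.
    collector = _TextCollector()
    collector.feed(data.decode("utf-8", errors="replace"))
    collector.close()
    return "".join(collector.chunks).strip()


def parse_plain(data: bytes) -> str:
    # Hypothetical plain-text parser: decode and return as-is.
    return data.decode("utf-8", errors="replace")


# Illustrative registry: file extension -> parser function.
PARSERS: Dict[str, Callable[[bytes], str]] = {
    ".html": parse_html,
    ".txt": parse_plain,
}


def extract(filename: str, data: bytes) -> str:
    """Single entry point that routes to a format-specific parser."""
    suffix = Path(filename).suffix.lower()
    parser = PARSERS.get(suffix)
    if parser is None:
        raise ValueError(f"unsupported format: {suffix!r}")
    return parser(data)
```

With this sketch, `extract("page.html", b"<p>Hello <b>world</b></p>")` returns `"Hello world"`. A real engine would more likely detect formats from content or MIME type rather than from the file extension alone.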

Self-Hosting & Configuration

Install via the package manager for your language (pip, gem, go get, npm, etc.)
Optionally install Tesseract for OCR support on scanned documents
Configure OCR language packs for non-English documents
Available as a REST API server and MCP server for agent integration
Also available as a standalone CLI tool

Key Features

Single extraction API across 97+ document formats
Rust core ensures consistent behavior across all language bindings
Table extraction preserves row/column structure
OCR integration for scanned and image-based documents
WebAssembly build for browser and edge deployment

Comparison with Similar Tools

Apache Tika — Java-based with heavy runtime; Kreuzberg is lightweight Rust
Unstructured — Python-only; Kreuzberg supports 11 languages natively
Docling — focused on PDF; Kreuzberg handles 97+ formats
MarkItDown — converts to Markdown; Kreuzberg provides structured extraction
MinerU — PDF-focused deep extraction; Kreuzberg is broader but less specialized on PDFs

FAQ

Q: Does it handle scanned PDFs? A: Yes. When text extraction yields empty results, Kreuzberg automatically falls back to OCR via Tesseract.
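The fallback rule in this answer can be sketched in a few lines. The function and parameter names (`extract_with_ocr_fallback`, `extract_text`, `run_ocr`) are hypothetical and do not come from Kreuzberg's API; the two callables stand in for the native text extractor and the Tesseract-backed OCR step.

```python
from typing import Callable, Tuple


def extract_with_ocr_fallback(
    data: bytes,
    extract_text: Callable[[bytes], str],
    run_ocr: Callable[[bytes], str],
) -> Tuple[str, bool]:
    """Return (text, used_ocr).

    Tries the embedded-text extractor first. If it yields nothing but
    whitespace, the document is assumed to be scanned and OCR runs instead.
    """
    text = extract_text(data)
    if text.strip():
        return text, False
    return run_ocr(data), True
```

Returning the `used_ocr` flag alongside the text is a design choice in this sketch, so that callers can treat OCR output as lower-confidence if they wish.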

Q: Can I use it in a browser? A: Yes. The WebAssembly build works in browsers and Deno/Bun without native dependencies.

Q: How does it compare performance-wise to Python alternatives? A: The Rust core is generally faster than pure-Python parsers, with the largest gains on large documents and batch processing.

Q: Does it support structured table output? A: Yes. Tables are returned as arrays of rows with cell text and optional column headers.
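To make the "rows plus optional headers" shape concrete, here is a minimal sketch of how such a table might be modeled and consumed. The `Table` class, its field names, and `table_to_records` are assumptions made for illustration; the actual result types in each Kreuzberg binding may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Table:
    # Hypothetical shape: a list of rows, each row a list of cell strings.
    rows: List[List[str]]
    # Column headers are optional, as described in the answer above.
    headers: Optional[List[str]] = None


def table_to_records(table: Table) -> List[Dict[str, str]]:
    """Turn a table into one dict per row, keyed by column name.

    When no headers are present, placeholder names col_0, col_1, ...
    are generated from the widest row.
    """
    width = max((len(row) for row in table.rows), default=0)
    headers = table.headers or [f"col_{i}" for i in range(width)]
    return [dict(zip(headers, row)) for row in table.rows]
```

Records keyed by column name are often easier to serialize to JSON or pass to an LLM prompt than raw nested lists.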

Sources

https://github.com/kreuzberg-dev/kreuzberg
