Skills · May 4, 2026 · 1 min read

Kreuzberg — Polyglot Document Intelligence Framework with a Rust Core

An open-source document extraction framework that pulls text, metadata, images, and structured data from PDFs, Office files, images, and other documents across 97+ formats, with bindings for 11 programming languages.

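Agent Ready

This asset can be read and installed directly by agents.

TokRepo provides a universal CLI command, an install contract, metadata JSON, adapter-specific install plans, and raw content links, so an agent can assess fit, risk, and next steps.

Native · 98/100 · Policy: Allowed
Agent entry: Any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry: Kreuzberg
Universal CLI install command:
npx tokrepo install 25ac133b-47b5-11f1-9bc6-00163e2b0d79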

Introduction

Kreuzberg is a document intelligence framework built around a high-performance Rust core with bindings for Python, Ruby, Go, Java, TypeScript, and more. It extracts text, metadata, tables, and images from 97+ document formats, which makes it a building block for RAG pipelines, search indexing, and document processing workflows.

What Kreuzberg Does

  • Extracts text content from 97+ formats, including PDF, DOCX, PPTX, HTML, and images
  • Detects and extracts tables with structure preservation
  • Pulls metadata (author, dates, page count) from documents
  • Performs OCR on scanned documents and images via Tesseract
  • Returns structured output suitable for LLM ingestion and RAG
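
As a rough illustration of what "structured output suitable for RAG" can mean downstream, the sketch below splits extracted text into paragraph-aligned chunks that carry the document's metadata. The `ExtractedDocument` type and `to_rag_chunks` function are hypothetical stand-ins written for this example; they are not Kreuzberg's actual result type or API.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedDocument:
    """Hypothetical stand-in for an extraction result (not Kreuzberg's real type)."""
    content: str
    metadata: dict = field(default_factory=dict)


def to_rag_chunks(doc: ExtractedDocument, max_chars: int = 200) -> list[dict]:
    """Group paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in (p.strip() for p in doc.content.split("\n\n")):
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow the chunk, so start a new one.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Each record keeps a copy of the metadata so it can be filtered at query time.
    return [
        {"chunk_id": i, "text": text, "metadata": dict(doc.metadata)}
        for i, text in enumerate(chunks)
    ]
```

Keeping the metadata on every chunk is a design choice for this sketch: it lets a retriever filter by author or date without looking the parent document up again.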

Architecture Overview

The core extraction engine is written in Rust and uses pdfium for PDF rendering and Tesseract bindings for OCR. Format-specific parsers handle Office XML, HTML, email, and other document types. The Rust core compiles to native libraries and to WebAssembly, and the 11 language bindings call into it through FFI. Each binding exposes an idiomatic API while sharing the same underlying extraction logic.
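
The "format-specific parsers behind one entry point" idea can be sketched as a small registry keyed by MIME type. This is a toy model in Python using only the standard library; the real engine is Rust, and every name here (`PARSERS`, `register`, `extract`) is an assumption made for illustration.

```python
import mimetypes
from html.parser import HTMLParser
from typing import Callable

# Hypothetical registry: MIME type -> parser function.
PARSERS: dict[str, Callable[[bytes], str]] = {}


def register(mime_type: str):
    """Decorator that records a parser for one MIME type."""
    def decorator(func: Callable[[bytes], str]) -> Callable[[bytes], str]:
        PARSERS[mime_type] = func
        return func
    return decorator


class _TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""
    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.parts.append(data.strip())


@register("text/plain")
def parse_plain(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")


@register("text/html")
def parse_html(data: bytes) -> str:
    collector = _TextCollector()
    collector.feed(data.decode("utf-8", errors="replace"))
    collector.close()
    return " ".join(collector.parts)


def extract(filename: str, data: bytes) -> str:
    """Single entry point: choose a parser from the guessed MIME type."""
    mime_type, _ = mimetypes.guess_type(filename)
    parser = PARSERS.get(mime_type or "")
    if parser is None:
        raise ValueError(f"unsupported format: {filename}")
    return parser(data)
```

Callers only ever see `extract`; supporting a new format means registering one more parser, which is the property a shared core gives every language binding.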

Self-Hosting & Configuration

  • Install via the package manager for your language (pip, gem, go get, npm, etc.)
  • Optionally install Tesseract for OCR support on scanned documents
  • Configure OCR language packs for non-English documents
  • Available as a REST API server and MCP server for agent integration
  • Also available as a standalone CLI tool
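
Before relying on OCR, it can help to confirm that Tesseract and the needed language packs are present. The sketch below shells out to `tesseract --list-langs`, a standard Tesseract CLI flag. The helper names are made up for this example and are not part of Kreuzberg. Filtering on lines without spaces is a simplifying assumption to drop the header and any warnings.

```python
import shutil
import subprocess


def parse_lang_output(output: str) -> list[str]:
    """Pick language codes out of `tesseract --list-langs` output.

    Codes such as 'eng' or 'chi_sim' contain no spaces; the header line does.
    """
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        if line and " " not in line:
            langs.add(line)
    return sorted(langs)


def installed_tesseract_languages() -> list[str]:
    """Return installed language codes, or [] if tesseract is not on PATH."""
    binary = shutil.which("tesseract")
    if binary is None:
        return []
    proc = subprocess.run(
        [binary, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some Tesseract versions print the list on stderr, so read both streams.
    return parse_lang_output(proc.stdout + "\n" + proc.stderr)


def missing_languages(required: list[str], installed: list[str]) -> list[str]:
    """Language packs that still need to be installed."""
    return [lang for lang in required if lang not in installed]


if __name__ == "__main__":
    print(missing_languages(["eng", "deu"], installed_tesseract_languages()))
```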

Key Features

  • Single extraction API across 97+ document formats
  • Rust core ensures consistent behavior across all language bindings
  • Table extraction preserves row/column structure
  • OCR integration for scanned and image-based documents
  • WebAssembly build for browser and edge deployment

Comparison with Similar Tools

  • Apache Tika: Java-based with a heavy runtime; Kreuzberg uses a lightweight Rust core
  • Unstructured: Python-only; Kreuzberg supports 11 languages natively
  • Docling: focused on PDF; Kreuzberg handles 97+ formats
  • MarkItDown: converts to Markdown; Kreuzberg provides structured extraction
  • MinerU: PDF-focused deep extraction; Kreuzberg is broader but less specialized on PDFs

FAQ

Q: Does it handle scanned PDFs? A: Yes. When text extraction yields empty results, Kreuzberg automatically falls back to OCR via Tesseract.
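
The fallback described in this answer can be modelled in a few lines. This is a sketch of the control flow only: `extract_text` and `run_ocr` are hypothetical callables passed in by the caller, and Kreuzberg's internal implementation may differ.

```python
from typing import Callable


def extract_with_ocr_fallback(
    data: bytes,
    extract_text: Callable[[bytes], str],
    run_ocr: Callable[[bytes], str],
) -> tuple[str, str]:
    """Return (text, source), where source is 'text-layer' or 'ocr'."""
    text = extract_text(data)
    if text.strip():
        # The document has a usable text layer, so OCR is skipped entirely.
        return text, "text-layer"
    # Empty or whitespace-only result: treat the document as scanned.
    return run_ocr(data), "ocr"
```

Returning the source alongside the text lets a pipeline log how often OCR was needed, which is useful because OCR is usually the slower path.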

Q: Can I use it in a browser? A: Yes. The WebAssembly build works in browsers and Deno/Bun without native dependencies.

Q: How does it compare performance-wise to Python alternatives? A: The Rust core is significantly faster than pure Python parsers, especially for large documents and batch processing.
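
The page gives no benchmark figures, so the most reliable comparison is one run on your own documents. Below is a minimal timing harness, assuming each extractor under test can be wrapped as a function that takes raw bytes; `best_of` is a name invented for this sketch.

```python
import time
from typing import Callable


def best_of(
    extract: Callable[[bytes], object],
    documents: list[bytes],
    repeats: int = 3,
) -> float:
    """Return the fastest wall-clock time in seconds to process every document once."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in documents:
            extract(doc)
        # Keep the minimum across repeats to reduce noise from other processes.
        best = min(best, time.perf_counter() - start)
    return best
```

Running `best_of` once per extractor on the same document list gives directly comparable numbers for that corpus.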

Q: Does it support structured table output? A: Yes. Tables are returned as arrays of rows with cell text and optional column headers.
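
Given tables shaped as described (rows of cell text plus optional headers), converting one to Markdown for an LLM prompt might look like the sketch below. The input shape is inferred from the answer above, and `table_to_markdown` is a hypothetical helper rather than a Kreuzberg function.

```python
from typing import Optional


def table_to_markdown(rows: list[list[str]], headers: Optional[list[str]] = None) -> str:
    """Render rows of cell text as a Markdown table.

    When no headers are given, placeholder names (col1, col2, ...) are used,
    because Markdown tables require a header row.
    """
    width = max([len(row) for row in rows] + [len(headers or [])])
    if width == 0:
        return ""
    head = list(headers) if headers else [f"col{i + 1}" for i in range(width)]

    def fmt(cells: list[str]) -> str:
        # Pad short rows and escape pipes so cell text cannot break the layout.
        padded = list(cells) + [""] * (width - len(cells))
        return "| " + " | ".join(c.replace("|", "\\|") for c in padded) + " |"

    lines = [fmt(head), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)
```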
