Xberg — Polyglot Document Intelligence Framework in Rust

Introduction

Xberg is a document intelligence framework built around a high-performance Rust core. It extracts text, metadata, images, and structured information from more than 97 file formats, with native bindings for languages including Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript.

What Xberg Does

Extracts text, tables, images, and metadata from PDFs, DOCX, XLSX, PPTX, and more
Provides native bindings for 12 programming languages via FFI
Runs as a CLI tool, REST API server, or MCP server for AI agents
Uses pdfium and Tesseract for accurate PDF and OCR processing
Outputs structured data suitable for RAG pipelines and search indexing

Architecture Overview

The core extraction engine is written in Rust for speed and safety. It uses pdfium for PDF rendering, Tesseract for OCR, and format-specific parsers for Office documents and archives. Language bindings are generated via C FFI, ensuring consistent behavior across all supported platforms. A thin HTTP layer exposes the same functionality as a REST API or MCP server.
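
The routing described above, where each format goes to its own backend, can be pictured with a small dispatch table. This is only an illustrative sketch: the `BACKENDS` mapping, the backend labels, and `pick_backend` are hypothetical names chosen here, not part of Xberg's API, and a real engine would likely sniff file contents rather than trust the extension.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the kind of backend the
# architecture section describes. The labels are illustrative only.
BACKENDS = {
    ".pdf": "pdfium",
    ".png": "tesseract",
    ".jpg": "tesseract",
    ".docx": "office-parser",
    ".xlsx": "office-parser",
    ".pptx": "office-parser",
}


def pick_backend(filename: str) -> str:
    """Return the backend label for a file, based only on its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in BACKENDS:
        raise ValueError(f"unsupported format: {filename!r}")
    return BACKENDS[suffix]
```

Keeping the routing in one table means every binding that calls into the core sees the same format decisions, which matches the consistency goal stated above.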

Self-Hosting & Configuration

Install via pip, gem, cargo, npm, or your language's package manager
The CLI binary is self-contained with no runtime dependencies beyond the system libc
Configure OCR language packs for non-Latin scripts via environment variables
REST API mode starts with a single command for integration with web services
MCP server mode enables direct use by AI coding agents
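
As a rough illustration of configuring OCR language packs through the environment, the sketch below turns a comma-separated variable into the `lang1+lang2` string that Tesseract accepts for its language option. The variable name `XBERG_OCR_LANGS` and the function `ocr_language_arg` are assumptions made for this example; check the project documentation for the actual setting.

```python
import os


def ocr_language_arg(env=os.environ, default="eng"):
    """Build a Tesseract-style language string such as 'eng+jpn'.

    Reads a comma-separated list from XBERG_OCR_LANGS, which is a
    hypothetical variable name used only for this sketch.
    """
    raw = env.get("XBERG_OCR_LANGS", "")
    codes = [code.strip() for code in raw.split(",") if code.strip()]
    return "+".join(codes) if codes else default
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.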

Key Features

Supports 97+ file formats including PDF, DOCX, XLSX, PPTX, HTML, EML, and images
Rust core delivers extraction speeds 5-10x faster than pure-Python alternatives
Table extraction preserves row and column structure for spreadsheet-like output
Image extraction pulls embedded graphics with position metadata
Available as CLI, library, REST API, MCP server, and WebAssembly module
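
To show why preserved row and column structure matters, here is a minimal sketch that writes an extracted table out as CSV. It assumes the table arrives as a list of rows, each a list of cell strings; that shape and the name `table_to_csv` are assumptions for illustration, not Xberg's documented output format.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize a table (list of rows of cell strings) to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()
```

Because cells stay separate values, a cell containing a comma is quoted rather than split, which is lost once a table has been flattened to plain text.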

Comparison with Similar Tools

Docling — Python-based document parsing for AI; Xberg offers broader language support and a faster Rust core
MinerU — focuses on scientific papers; Xberg handles a wider range of document types
Marker — PDF-to-Markdown converter; Xberg provides structured data output beyond Markdown
Apache Tika — Java-based extraction; Xberg is lighter weight with native bindings for more languages
Unstructured — Python ETL for documents; Xberg focuses on speed with its Rust engine

FAQ

Q: Does Xberg require Tesseract for all file types? A: No. Tesseract is only used for OCR on scanned documents and images. Text-based PDFs and Office files are extracted without OCR.

Q: Can I use Xberg in a browser? A: Yes. A WebAssembly build is available for client-side document processing.

Q: What is the maximum file size Xberg can handle? A: There is no hard limit. Files are processed in streaming fashion, so memory usage scales with page count rather than total file size.
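
One way to picture processing in streaming fashion is a generator that yields one page of text at a time, so a consumer never needs the whole document in memory. The sketch below uses plain byte strings as stand-in pages; `extract_pages` and `count_words` are hypothetical helpers, not Xberg functions.

```python
from typing import Iterable, Iterator


def extract_pages(pages: Iterable[bytes]) -> Iterator[str]:
    """Yield decoded text one page at a time instead of building a list."""
    for raw in pages:
        yield raw.decode("utf-8", errors="replace")


def count_words(pages: Iterable[bytes]) -> int:
    """Consume the page stream while holding only one page of text."""
    return sum(len(text.split()) for text in extract_pages(pages))
```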

Q: Is it suitable for production RAG pipelines? A: Yes. Xberg is designed for high-throughput extraction and outputs structured data ready for embedding and indexing.
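
For readers wiring extracted text into a RAG pipeline, a common next step is splitting it into overlapping chunks before embedding. The following is a generic character-based sketch of that step, not something Xberg is stated to provide; `chunk_text` and its default sizes are illustrative choices.

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into chunks of up to `size` characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a boundary still appears whole in at least one chunk.
    """
    if size <= 0 or not (0 <= overlap < size):
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks
```

Production pipelines often chunk by tokens or by document structure such as headings and tables instead; fixed-size characters are simply the easiest variant to reason about.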

Similar assets

Kreuzberg — Polyglot Document Intelligence Framework with a Rust Core

Rapier — Fast 2D and 3D Physics Engine in Rust

RustPython — Python Interpreter Written in Rust

Zerobrew — 5-20x Faster Experimental Homebrew Alternative in Rust