Configs · Jul 1, 2026 · 3 min read

Xberg — Polyglot Document Intelligence Framework in Rust

A cross-language document extraction framework with a Rust core that parses 97+ formats, including PDFs, Office files, and images, into structured text and metadata.

Agent ready

This asset can be installed after choosing the runtime, verifying the plan, and running the matching command.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry point: Xberg
Direct install command:
npx -y tokrepo@latest install f9ad56b7-758a-11f1-9bc6-00163e2b0d79 --target codex

Run this after confirming the plan with a dry run.

Introduction

Xberg is a document intelligence framework built around a high-performance Rust core. It extracts text, metadata, images, and structured information from more than 97 file formats. Native bindings are available for languages including Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript.

What Xberg Does

  • Extracts text, tables, images, and metadata from PDFs, DOCX, XLSX, PPTX, and more
  • Provides native bindings for 12 programming languages via FFI
  • Runs as a CLI tool, REST API server, or MCP server for AI agents
  • Uses pdfium and Tesseract for accurate PDF and OCR processing
  • Outputs structured data suitable for RAG pipelines and search indexing
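
To make the RAG point concrete, here is a minimal sketch of what a downstream pipeline might do with extracted text: split it into overlapping chunks before embedding. This is not Xberg's API. The `Chunk` type, the `chunk_text` function, and the default sizes are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A slice of extracted text plus the metadata a search index usually needs."""
    source: str  # hypothetical document identifier
    start: int   # character offset of the chunk within the document text
    text: str


def chunk_text(source: str, text: str, size: int = 200, overlap: int = 40) -> list[Chunk]:
    """Split text into fixed-size windows that overlap by `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(Chunk(source, start, text[start:start + size]))
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, which tends to help retrieval at the cost of some duplicated text.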

Architecture Overview

The core extraction engine is written in Rust for speed and safety. It uses pdfium for PDF rendering, Tesseract for OCR, and format-specific parsers for Office documents and archives. Language bindings are generated via C FFI, ensuring consistent behavior across all supported platforms. A thin HTTP layer exposes the same functionality as a REST API or MCP server.
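
The idea of one engine with format-specific parsers can be sketched as a small dispatch registry. This is an illustration in Python, not Xberg's Rust implementation: the registry, the decorator, and the placeholder parsers are hypothetical names, and the PDF parser only stands in for a call into a real PDF library.

```python
from pathlib import Path
from typing import Callable

# Hypothetical registry mapping a file extension to a parser function.
Parser = Callable[[bytes], str]
PARSERS: dict[str, Parser] = {}


def register(*extensions: str) -> Callable[[Parser], Parser]:
    """Decorator that registers one parser for one or more extensions."""
    def wrap(fn: Parser) -> Parser:
        for ext in extensions:
            PARSERS[ext.lower()] = fn
        return fn
    return wrap


@register(".txt", ".md")
def parse_plain(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")


@register(".pdf")
def parse_pdf(data: bytes) -> str:
    # Placeholder: a real engine would hand the bytes to a PDF library here.
    return f"<pdf: {len(data)} bytes>"


def extract(filename: str, data: bytes) -> str:
    """Pick a parser from the file extension and run it."""
    ext = Path(filename).suffix.lower()
    try:
        parser = PARSERS[ext]
    except KeyError:
        raise ValueError(f"unsupported format: {ext or '(none)'}") from None
    return parser(data)
```

Because every caller goes through the single `extract` entry point, a CLI, a library binding, and a server wrapper would all see the same behaviour for a given file.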

Self-Hosting & Configuration

  • Install via pip, gem, cargo, npm, or your language's package manager
  • The CLI binary is self-contained with no runtime dependencies beyond the system libc
  • Configure OCR language packs for non-Latin scripts via environment variables
  • REST API mode starts with a single command for integration with web services
  • MCP server mode enables direct use by AI coding agents
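
As a rough sketch of environment-driven OCR configuration, the snippet below turns a variable into the `+`-joined language string that Tesseract accepts (for example `eng+jpn`). The variable name `XBERG_OCR_LANGS` is invented for this example; check the project's documentation for the real setting.

```python
import os

# XBERG_OCR_LANGS is a made-up variable name used only for this sketch.
ENV_VAR = "XBERG_OCR_LANGS"


def ocr_languages(env=None, default="eng"):
    """Return a Tesseract-style language string such as 'eng+jpn'."""
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR, "")
    # Accept comma- or plus-separated codes and ignore stray whitespace.
    codes = [c.strip() for c in raw.replace("+", ",").split(",")]
    seen = []
    for code in codes:
        if code and code not in seen:
            seen.append(code)  # keep first occurrence, preserve order
    return "+".join(seen) if seen else default
```

Passing a plain dictionary as `env` makes the function easy to exercise without touching the real process environment.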

Key Features

  • Supports 97+ file formats including PDF, DOCX, XLSX, PPTX, HTML, EML, and images
  • Rust core delivers extraction speeds 5-10x faster than pure-Python alternatives
  • Table extraction preserves row and column structure for spreadsheet-like output
  • Image extraction pulls embedded graphics with position metadata
  • Available as CLI, library, REST API, MCP server, and WebAssembly module
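
To show why preserved row and column structure matters, here is a small sketch that serializes a table to CSV. It assumes the extracted table arrives as a list of rows, each a list of cell strings. That shape is an assumption for illustration, not Xberg's documented output format.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize a list of rows to CSV, padding short rows to a full grid."""
    width = max((len(r) for r in rows), default=0)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        # Pad ragged rows so every record has the same number of columns.
        writer.writerow(list(row) + [""] * (width - len(row)))
    return buf.getvalue()
```

With the grid intact, a cell can be addressed by row and column after export, which is not possible once a table has been flattened into running text.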

Comparison with Similar Tools

  • Docling — Python-based document parsing for AI; Xberg offers broader language support and a faster Rust core
  • MinerU — focuses on scientific papers; Xberg handles a wider range of document types
  • Marker — PDF-to-Markdown converter; Xberg provides structured data output beyond Markdown
  • Apache Tika — Java-based extraction; Xberg is lighter weight with native bindings for more languages
  • Unstructured — Python ETL for documents; Xberg focuses on speed with its Rust engine

FAQ

Q: Does Xberg require Tesseract for all file types?
A: No. Tesseract is only used for OCR on scanned documents and images. Text-based PDFs and Office files are extracted without OCR.
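
One common way to decide when OCR is needed is to check whether a page already carries a usable text layer. The sketch below is a generic heuristic, not a description of how Xberg makes this decision. The function names and the 20-character threshold are assumptions.

```python
def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Treat a page as scanned when its text layer is (nearly) empty."""
    visible = "".join(page_text.split())  # drop all whitespace
    return len(visible) < min_chars


def plan_pages(pages: list[str]) -> list[str]:
    """Label each page 'text' or 'ocr' based on its embedded text layer."""
    return ["ocr" if needs_ocr(p) else "text" for p in pages]
```

A per-page decision like this lets a mixed document, such as a typed report with a scanned appendix, skip OCR on the pages that do not need it.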

Q: Can I use Xberg in a browser?
A: Yes. A WebAssembly build is available for client-side document processing.

Q: What is the maximum file size Xberg can handle?
A: There is no hard limit. Files are processed in a streaming fashion, so memory usage scales with page count rather than total file size.
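
The streaming idea can be illustrated with a generator that yields one page at a time and keeps only a small record per page. This is a toy model under stated assumptions: it treats a form feed character as the page break in a plain-text stream, and `read_pages` and `page_lengths` are hypothetical names rather than Xberg functions.

```python
from typing import Iterable, Iterator


def read_pages(lines: Iterable[str], page_break: str = "\f") -> Iterator[str]:
    """Yield one page at a time from a line stream, splitting on form feeds."""
    current = []
    for line in lines:
        while page_break in line:
            head, line = line.split(page_break, 1)
            current.append(head)
            yield "".join(current)
            current = []
        current.append(line)
    if any(current):
        yield "".join(current)  # final page without a trailing break


def page_lengths(lines: Iterable[str]) -> list[int]:
    """Keep one small number per page instead of the page text itself."""
    return [len(page) for page in read_pages(lines)]
```

At any moment only the current page's text is held in memory, plus one integer per page already seen, so the footprint grows with the number of pages rather than with the bytes in the file.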

Q: Is it suitable for production RAG pipelines?
A: Yes. Xberg is designed for high-throughput extraction and outputs structured data ready for embedding and indexing.
