# Xberg — Polyglot Document Intelligence Framework in Rust

> A cross-language document extraction framework with a Rust core that parses PDFs, Office files, images, and other files in 97+ formats into structured text and metadata.

## Quick Use
```bash
# Install the Python package
pip install xberg

# Extract text from a PDF
python -c "import xberg; print(xberg.extract('document.pdf').text)"

# Or install the CLI with Cargo and extract from the command line
cargo install xberg-cli && xberg extract document.pdf
```
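
For more than one file, the one-liner above can be wrapped in a small batch helper. The sketch below is an illustration, not part of Xberg: `extract_many`, `FakeResult`, and `fake_extract` are hypothetical names, and the only thing assumed about the real library is what the snippet above shows, namely a callable that takes a path and returns an object with a `.text` attribute.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable


def extract_many(paths: Iterable[str], extract_fn: Callable[[str], object]) -> Dict[str, str]:
    """Run extract_fn on each path and collect the `.text` of every result.

    Failures are stored as empty strings so one bad file does not stop the batch.
    """
    results: Dict[str, str] = {}
    for path in paths:
        try:
            results[str(path)] = extract_fn(str(path)).text
        except Exception as exc:  # broad on purpose: error types are not documented here
            print(f"skipped {path}: {exc}")
            results[str(path)] = ""
    return results


class FakeResult:
    """Stand-in for an extraction result, so the sketch runs without xberg installed."""

    def __init__(self, text: str) -> None:
        self.text = text


def fake_extract(path: str) -> FakeResult:
    """Hypothetical extractor that rejects files ending in .bad."""
    if path.endswith(".bad"):
        raise ValueError("unsupported file")
    return FakeResult(f"text of {Path(path).name}")


texts = extract_many(["a.pdf", "docs/b.docx", "c.bad"], fake_extract)
```

With Xberg installed, passing `xberg.extract` in place of `fake_extract` should work the same way, as long as the result exposes `.text` as in the one-liner.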

## Introduction
Xberg is a document intelligence framework built around a high-performance Rust core. It extracts text, metadata, images, and structured information from more than 97 file formats, with native bindings for languages including Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript.

## What Xberg Does
- Extracts text, tables, images, and metadata from PDFs, DOCX, XLSX, PPTX, and more
- Provides native bindings for 12 programming languages via FFI
- Runs as a CLI tool, REST API server, or MCP server for AI agents
- Uses pdfium and Tesseract for accurate PDF and OCR processing
- Outputs structured data suitable for RAG pipelines and search indexing
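
As an illustration of the RAG use case, extracted text usually has to be split into overlapping chunks before embedding. The sketch below is a generic character-based chunker, not an Xberg feature: `chunk_text` and its parameters are hypothetical, and the sample string stands in for text returned by an extraction call.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 500, overlap: int = 50) -> List[str]:
    """Split text into chunks of at most max_chars characters.

    Each chunk starts `overlap` characters before the previous chunk ended.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in [0, max_chars)")
    chunks: List[str] = []
    step = max_chars - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz0123"  # 30 characters standing in for extracted text
chunks = chunk_text(sample, max_chars=12, overlap=2)
```

Character counts keep the example simple; a production pipeline would more likely split on sentence or token boundaries.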

## Architecture Overview
The core extraction engine is written in Rust for speed and safety. It uses pdfium for PDF rendering, Tesseract for OCR, and format-specific parsers for Office documents and archives. Language bindings are generated via C FFI, ensuring consistent behavior across all supported platforms. A thin HTTP layer exposes the same functionality as a REST API or MCP server.
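
The idea of routing each file to a format-specific parser can be sketched as a lookup table keyed by file extension. This is a conceptual Python illustration only: the actual engine is written in Rust, and every name here (`PARSERS`, `dispatch`, the `parse_*` functions) is hypothetical rather than taken from the Xberg codebase.

```python
from pathlib import Path
from typing import Callable, Dict


# Hypothetical parsers: real ones would read and decode the file contents.
def parse_pdf(path: str) -> str:
    return f"[pdf parser] {path}"


def parse_office(path: str) -> str:
    return f"[office parser] {path}"


def parse_image_ocr(path: str) -> str:
    return f"[ocr] {path}"


# Registry mapping lowercase file extensions to the parser that handles them.
PARSERS: Dict[str, Callable[[str], str]] = {
    ".pdf": parse_pdf,
    ".docx": parse_office,
    ".xlsx": parse_office,
    ".pptx": parse_office,
    ".png": parse_image_ocr,
    ".jpg": parse_image_ocr,
}


def dispatch(path: str) -> str:
    """Pick a parser by extension (case-insensitive) and run it."""
    suffix = Path(path).suffix.lower()
    try:
        parser = PARSERS[suffix]
    except KeyError:
        raise ValueError(f"no parser registered for {suffix!r}") from None
    return parser(path)


result = dispatch("Report.PDF")
```

A registry like this keeps format support additive: a new format means one more entry, with no change to the dispatch logic.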

## Self-Hosting & Configuration
- Install via pip, gem, cargo, npm, or your language's package manager
- The CLI binary is self-contained with no runtime dependencies beyond the system libc
- Configure OCR language packs for non-Latin scripts via environment variables
- REST API mode starts with a single command for integration with web services
- MCP server mode enables direct use by AI coding agents
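
Environment-based OCR configuration might look roughly like the following. The variable name `EXAMPLE_OCR_LANGS` is a placeholder, not a documented Xberg setting, and both helper functions are hypothetical. The one grounded detail is that Tesseract's `-l` option accepts several language codes joined with `+`.

```python
import os
from typing import List, Mapping, Optional

ENV_VAR = "EXAMPLE_OCR_LANGS"  # placeholder name, not a documented Xberg variable


def ocr_languages(env: Optional[Mapping[str, str]] = None, default: str = "eng") -> List[str]:
    """Read a comma-separated list of language codes, falling back to a default."""
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR, "")
    langs = [code.strip() for code in raw.split(",") if code.strip()]
    return langs or [default]


def tesseract_lang_arg(langs: List[str]) -> str:
    """Join codes with '+', the separator Tesseract's -l option uses for multiple languages."""
    return "+".join(langs)


# Passing a dict instead of os.environ keeps the example deterministic.
langs = ocr_languages({ENV_VAR: "eng, jpn ,ara"})
lang_arg = tesseract_lang_arg(langs)
```

The real variable names and accepted values should be taken from the Xberg documentation.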

## Key Features
- Supports 97+ file formats including PDF, DOCX, XLSX, PPTX, HTML, EML, and images
- Rust core reportedly delivers extraction speeds 5-10x faster than pure-Python alternatives
- Table extraction preserves row and column structure for spreadsheet-like output
- Image extraction pulls embedded graphics with position metadata
- Available as CLI, library, REST API, MCP server, and WebAssembly module
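
To show why preserved row and column structure matters, the sketch below serializes a table to CSV with Python's standard library. The input shape (a list of rows, each a list of cell values) is an assumption made for illustration and not the documented shape of Xberg's table output, and `table_to_csv` is a hypothetical helper.

```python
import csv
import io
from typing import Sequence


def table_to_csv(rows: Sequence[Sequence[object]]) -> str:
    """Serialize rows to CSV, padding short rows so every line has the same column count."""
    width = max((len(row) for row in rows), default=0)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        padded = list(row) + [""] * (width - len(row))
        writer.writerow(padded)
    return buffer.getvalue()


# Assumed table shape: header row first, one short row, one cell containing a comma.
table = [["Item", "Qty", "Price"], ["Widget", 2, "9.50"], ["Gadget, large", 1]]
csv_text = table_to_csv(table)
```

The `csv` module quotes cells that contain commas, so column boundaries survive the round trip.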

## Comparison with Similar Tools
- **Docling** — Python-based document parsing for AI; Xberg offers broader language support and a faster Rust core
- **MinerU** — focuses on scientific papers; Xberg handles a wider range of document types
- **Marker** — PDF-to-Markdown converter; Xberg provides structured data output beyond Markdown
- **Apache Tika** — Java-based extraction; Xberg is lighter weight with native bindings for more languages
- **Unstructured** — Python ETL for documents; Xberg focuses on speed with its Rust engine

## FAQ
**Q: Does Xberg require Tesseract for all file types?**
A: No. Tesseract is only used for OCR on scanned documents and images. Text-based PDFs and Office files are extracted without OCR.

**Q: Can I use Xberg in a browser?**
A: Yes. A WebAssembly build is available for client-side document processing.

**Q: What is the maximum file size Xberg can handle?**
A: There is no hard limit. Files are processed in streaming fashion, so memory usage scales with page count rather than total file size.
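
Streaming in the general sense can be sketched with a generator that reads fixed-size blocks, so only one block is held in memory at a time. This shows the technique in plain Python and makes no claim about how Xberg's Rust core implements it; `iter_chunks` and `count_bytes` are hypothetical names.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield successive blocks of at most `size` bytes until the stream is exhausted."""
    while True:
        block = stream.read(size)
        if not block:
            return
        yield block


def count_bytes(stream: BinaryIO, size: int = 64 * 1024) -> int:
    """Consume the stream block by block, keeping only a running total."""
    return sum(len(block) for block in iter_chunks(stream, size))


# An in-memory stream stands in for a large file on disk.
total = count_bytes(io.BytesIO(b"x" * 150_000), size=4096)
```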

**Q: Is it suitable for production RAG pipelines?**
A: Yes. Xberg is designed for high-throughput extraction and outputs structured data ready for embedding and indexing.

## Sources
- https://github.com/xberg-io/xberg
- https://xberg.io

---
Source: https://tokrepo.com/en/workflows/xberg-polyglot-document-intelligence-framework-rust-f9ad56b7
Author: AI Open Source