# Xberg — Polyglot Document Intelligence Framework in Rust

> A cross-language document extraction framework with a Rust core that parses PDFs, Office files, images, and 97+ formats into structured text and metadata.

## Quick Use

```bash
pip install xberg

# Extract text from a PDF
python -c "import xberg; print(xberg.extract('document.pdf').text)"

# Or via CLI
cargo install xberg-cli && xberg extract document.pdf
```

## Introduction

Xberg is a document intelligence framework built around a high-performance Rust core. It extracts text, metadata, images, and structured information from over 97 file formats, with native bindings available for Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript.

## What Xberg Does

- Extracts text, tables, images, and metadata from PDFs, DOCX, XLSX, PPTX, and more
- Provides native bindings for 12 programming languages via FFI
- Runs as a CLI tool, REST API server, or MCP server for AI agents
- Uses pdfium and Tesseract for accurate PDF and OCR processing
- Outputs structured data suitable for RAG pipelines and search indexing

## Architecture Overview

The core extraction engine is written in Rust for speed and safety. It uses pdfium for PDF rendering, Tesseract for OCR, and format-specific parsers for Office documents and archives. Language bindings are generated via C FFI, ensuring consistent behavior across all supported platforms. A thin HTTP layer exposes the same functionality as a REST API or MCP server.

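The Quick Use call can be wrapped for batch work. The sketch below is a minimal illustration and not part of Xberg: `Record` and `extract_records` are hypothetical names, and the only thing assumed about the library is what Quick Use shows, namely that `xberg.extract(path)` returns an object with a `.text` attribute. The extractor is passed in as a parameter so the helper can be tried without Xberg installed.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Optional


@dataclass
class Record:
    """One indexable record per input file (hypothetical shape, not an Xberg type)."""
    path: str
    text: str
    error: Optional[str] = None


def extract_records(paths: Iterable[str], extract: Callable[[str], Any]) -> List[Record]:
    """Run `extract` on each path and collect the `.text` of every result.

    `extract` is assumed to behave like the `xberg.extract` call in Quick Use:
    it takes a file path and returns an object with a `.text` attribute.
    """
    records: List[Record] = []
    for path in paths:
        try:
            result = extract(str(path))
            records.append(Record(path=str(path), text=result.text))
        except Exception as exc:
            # Record the failure and continue so one bad file does not stop the batch.
            records.append(Record(path=str(path), text="", error=str(exc)))
    return records
```

With the Python binding installed, the call would presumably look like `extract_records(["a.pdf", "b.docx"], xberg.extract)`. Records with a non-empty `error` can be filtered out before embedding or indexing.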
## Self-Hosting & Configuration

- Install via pip, gem, cargo, npm, or your language's package manager
- The CLI binary is self-contained with no runtime dependencies beyond the system libc
- Configure OCR language packs for non-Latin scripts via environment variables
- REST API mode starts with a single command for integration with web services
- MCP server mode enables direct use by AI coding agents

## Key Features

- Supports 97+ file formats including PDF, DOCX, XLSX, PPTX, HTML, EML, and images
- Rust core delivers extraction speeds 5-10x faster than pure-Python alternatives
- Table extraction preserves row and column structure for spreadsheet-like output
- Image extraction pulls embedded graphics with position metadata
- Available as CLI, library, REST API, MCP server, and WebAssembly module

## Comparison with Similar Tools

- **Docling** — Python-based document parsing for AI; Xberg offers broader language support and a faster Rust core
- **MinerU** — focuses on scientific papers; Xberg handles a wider range of document types
- **Marker** — PDF-to-Markdown converter; Xberg provides structured data output beyond Markdown
- **Apache Tika** — Java-based extraction; Xberg is lighter weight with native bindings for more languages
- **Unstructured** — Python ETL for documents; Xberg focuses on speed with its Rust engine

## FAQ

**Q: Does Xberg require Tesseract for all file types?**
A: No. Tesseract is only used for OCR on scanned documents and images. Text-based PDFs and Office files are extracted without OCR.

**Q: Can I use Xberg in a browser?**
A: Yes. A WebAssembly build is available for client-side document processing.

**Q: What is the maximum file size Xberg can handle?**
A: There is no hard limit. Files are processed in streaming fashion, so memory usage scales with page count rather than total file size.

**Q: Is it suitable for production RAG pipelines?**
A: Yes.
Xberg is designed for high-throughput extraction and outputs structured data ready for embedding and indexing.

## Sources

- https://github.com/xberg-io/xberg
- https://xberg.io

---

Source: https://tokrepo.com/en/workflows/xberg-polyglot-document-intelligence-framework-rust-f9ad56b7

Author: AI Open Source