Configs · Jul 5, 2026 · 3 min read

Unlimited-OCR — One-Shot Long-Horizon Document Parsing by Baidu

Open-source OCR system from Baidu that parses complex documents in a single pass with high accuracy across diverse layouts.

Direct install command
npx -y tokrepo@latest install 8a06950a-7808-11f1-9bc6-00163e2b0d79 --target codex

Run this only after confirming the plan with a dry run.

Introduction

Unlimited-OCR is an open-source document parsing system released by Baidu Research. It introduces a one-shot long-horizon parsing approach that processes entire documents in a single forward pass, handling complex layouts including tables, formulas, figures, and mixed content.

What Unlimited-OCR Does

  • Parses full documents in a single pass without sliding-window fragmentation
  • Recognizes text across complex layouts with tables, columns, and figures
  • Extracts mathematical formulas and renders them as LaTeX
  • Handles multi-page PDFs with consistent structural understanding
  • Outputs structured JSON with bounding boxes and reading order
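
As a minimal sketch of consuming output like that described in the last bullet, the snippet below sorts page blocks by reading order and joins their text. The schema is hypothetical: the field names `blocks`, `type`, `bbox`, `reading_order`, and `text` are assumptions made for illustration, so check the project's actual output format before relying on any of them.

```python
import json
from typing import Any, Dict, List

# Hypothetical output for one page. Field names are assumptions,
# not the documented Unlimited-OCR schema.
SAMPLE = """
{
  "page": 1,
  "blocks": [
    {"type": "text",    "bbox": [40, 120, 560, 180], "reading_order": 1, "text": "Body paragraph."},
    {"type": "title",   "bbox": [40, 40, 560, 90],   "reading_order": 0, "text": "Quarterly Report"},
    {"type": "formula", "bbox": [40, 200, 300, 240], "reading_order": 2, "text": "E = mc^2"}
  ]
}
"""


def blocks_in_reading_order(raw: str) -> List[Dict[str, Any]]:
    """Parse one page of JSON and return its blocks sorted by reading order."""
    page = json.loads(raw)
    return sorted(page["blocks"], key=lambda block: block["reading_order"])


def to_plain_text(blocks: List[Dict[str, Any]]) -> str:
    """Join block text in the order given, one block per line."""
    return "\n".join(block["text"] for block in blocks)


ordered = blocks_in_reading_order(SAMPLE)
text = to_plain_text(ordered)
print(text)
```

Sorting on an explicit reading-order field, rather than on bounding-box coordinates, is what keeps multi-column pages from being read straight across.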

Architecture Overview

Unlimited-OCR uses a vision-language model backbone that processes each document image at high resolution in one shot. Rather than breaking a page into patches and stitching the results together, it keeps the entire page in context, which lets it determine reading order and relate elements to one another.
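
The geometry behind that contrast can be sketched in a few lines. This is an illustration only, not Unlimited-OCR's internals: the page size, tile size, and table-row box are made-up numbers, and `make_tiles` and `tiles_touched` are hypothetical helpers. It counts how many tiles a single table row lands in under a patch-based split versus a single full-page view.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def make_tiles(width: int, height: int, tile: int) -> List[Box]:
    """Split a page into non-overlapping tiles of at most tile x tile pixels."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in range(0, height, tile)
        for x in range(0, width, tile)
    ]


def tiles_touched(box: Box, tiles: List[Box]) -> List[Box]:
    """Return every tile whose area overlaps the given element box."""
    x0, y0, x1, y1 = box
    return [t for t in tiles if x0 < t[2] and x1 > t[0] and y0 < t[3] and y1 > t[1]]


page_w, page_h = 2000, 3000          # illustrative page size
table_row = (100, 980, 1900, 1030)   # a wide row that straddles y = 1000

patch_tiles = make_tiles(page_w, page_h, tile=1000)
one_shot = make_tiles(page_w, page_h, tile=max(page_w, page_h))

print(len(tiles_touched(table_row, patch_tiles)))  # 4: the row is cut into four pieces
print(len(tiles_touched(table_row, one_shot)))     # 1: the row is seen whole
```

With tiling, the four partial views of the row have to be recognized separately and merged afterwards, which is where stitching errors come from.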

Self-Hosting & Configuration

  • Requires Python 3.8+ and PyTorch; a CUDA-enabled build is recommended, since CPU inference is significantly slower
  • Download pretrained model weights from the official release page
  • GPU with 16 GB VRAM recommended for full-page processing
  • Configure output format (JSON, Markdown, or plain text) via CLI flags
  • Batch processing mode available for large document collections
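
Before running a large collection through batch mode, it can help to gather the input files and group them yourself. The sketch below does only that, using the Python standard library. It does not call Unlimited-OCR, and `collect_documents` and `batches` are hypothetical helper names. The extension list mirrors the formats named in this page's FAQ; alternate spellings such as `.jpeg` or `.tif` would need to be added if your files use them.

```python
import tempfile
from pathlib import Path
from typing import Iterator, List

# Input formats named in this page's FAQ.
SUPPORTED = {".pdf", ".png", ".jpg", ".tiff", ".bmp"}


def collect_documents(root: Path) -> List[Path]:
    """Recursively find supported files, sorted so that runs are reproducible."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def batches(items: List[Path], size: int) -> Iterator[List[Path]]:
    """Yield consecutive groups of at most `size` paths."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("a.pdf", "b.PNG", "notes.txt"):
        (root / name).touch()
    found = collect_documents(root)
    batch_sizes = [len(group) for group in batches(found, size=1)]

print([p.name for p in found], batch_sizes)
```

Each group can then be handed to whatever invocation the official documentation describes. A smaller group size is the simplest lever if GPU memory runs short.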

Key Features

  • One-shot parsing eliminates boundary artifacts from patch-based methods
  • Long-horizon context captures cross-page references and document structure
  • High accuracy on academic papers, invoices, and mixed-language documents
  • Built-in table structure recognition with cell-level extraction
  • Formula recognition outputs publication-ready LaTeX
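
Cell-level table output is straightforward to post-process. As a sketch under assumptions, suppose each extracted cell arrives as a record with `row`, `col`, and `text` fields. That shape is hypothetical and not taken from the project's documentation. The hypothetical helper below rebuilds the grid and renders it as a Markdown table, treating the first row as the header.

```python
from typing import Any, Dict, List


def cells_to_markdown(cells: List[Dict[str, Any]]) -> str:
    """Rebuild a table grid from cell records and render it as Markdown.

    Each cell is assumed to carry zero-based `row` and `col` indices and
    a `text` string. Positions with no cell are left empty.
    """
    n_rows = max(cell["row"] for cell in cells) + 1
    n_cols = max(cell["col"] for cell in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        grid[cell["row"]][cell["col"]] = cell["text"]

    lines = ["| " + " | ".join(grid[0]) + " |", "|" + " --- |" * n_cols]
    lines += ["| " + " | ".join(row) + " |" for row in grid[1:]]
    return "\n".join(lines)


# Hypothetical cells, deliberately out of order.
cells = [
    {"row": 1, "col": 1, "text": "3"},
    {"row": 0, "col": 0, "text": "Item"},
    {"row": 1, "col": 0, "text": "Widget"},
    {"row": 0, "col": 1, "text": "Qty"},
]
markdown = cells_to_markdown(cells)
print(markdown)
```

Merged cells (row or column spans) are not handled here; a real schema would likely need span fields, which this sketch leaves out.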

Comparison with Similar Tools

  • Surya — multi-language OCR with strong line detection but no one-shot page understanding
  • Marker — PDF-to-Markdown converter focused on clean output rather than structural extraction
  • MinerU — document extraction for AI pipelines with page-level processing
  • PaddleOCR — Baidu's established OCR toolkit with a modular pipeline approach
  • Docling — IBM's document parser focused on conversion to standard formats

FAQ

Q: What document formats does Unlimited-OCR support?
A: PDF, PNG, JPG, TIFF, and BMP. Multi-page PDFs are processed with cross-page awareness.

Q: How does it handle handwritten text?
A: The model is primarily trained on printed text. Handwritten recognition depends on legibility and may require fine-tuning.

Q: Can I run it on CPU?
A: Yes, but inference is significantly slower. GPU processing is recommended for production use.

Q: Does it support right-to-left languages like Arabic?
A: The model supports major RTL languages, though accuracy may vary compared to Latin-script content.
