TOKREPO · AI Arsenal

Stable

OCR and Document Parsing Pack

Ten tools for the engineer who extracts structured data from scans, PDFs, and screenshots. Modern doc-AI (Marker, Nougat, Surya, Zerox, MinerU), layout-aware parsers (Docling, Unstructured, OpenDataLoader), and proven OCR (Tesseract, PaddleOCR), in a deliberate order: detection → OCR → tables → structure → JSON.

10 resources

About this pack

What's in this pack

This is the pipeline a working engineer would assemble in one afternoon to convert messy documents — scanned invoices, academic PDFs, screenshots, mixed-language contracts — into clean structured data. The order matters: every stage's output is the next stage's input, and skipping layout detection is the single most common reason a doc-AI pipeline produces garbage.

All ten picks are open-source and actively maintained as of 2026. The combined install is large (model weights run a few GB), but you can usually pick one tool per stage and skip the rest. Treat this pack as a menu, not a checklist.

Install in this order

1. Marker: convert PDF to Markdown end-to-end. Start here. Marker handles layout + OCR + tables + math in one shot and is the right default for most academic, technical, and structured PDFs. If Marker's output is good enough, you can stop reading.
2. Surya: document OCR for 90+ languages with layout analysis, table detection, reading-order, and LaTeX OCR. Powers Marker internally; use it standalone when you need the OCR layer without the full Markdown pipeline.
3. MinerU: extract LLM-ready data from any document. Stronger than Marker on complex layouts (multi-column papers, magazines, government forms). 57K+ GitHub stars. Picks up where Marker gives up.
4. Zerox: zero-shot PDF OCR for AI pipelines. Sends page images to a vision LLM (GPT-4o, Claude, Gemini) and gets Markdown back. Pay-per-call instead of GPU-heavy local inference. Fastest path to working when you don't want to host a model.
5. Nougat: neural optical understanding for academic documents. Meta's transformer model trained on arXiv. Best-in-class for math-heavy PDFs (equations come back as LaTeX, not garbled glyphs). Slower than Marker but more accurate on STEM papers.
6. Docling: IBM's document parsing library. Converts PDFs, DOCX, PPTX, images, and HTML into structured Markdown or JSON. The most general-purpose parser in the pack; use it when your input format isn't always PDF.
7. Unstructured: document ETL for LLM pipelines. Handles 25+ file types with a unified API, partitioning text into typed elements (Title, NarrativeText, Table, ListItem). The boring industrial-strength backbone for RAG ingestion at scale.
8. OpenDataLoader PDF: AI-ready document parser focused on producing clean structured output for downstream agents. Lighter footprint than Marker/MinerU, useful when latency matters more than peak accuracy.
9. PaddleOCR: production-ready OCR for 100+ languages, including the best open-source Chinese OCR available. Use as the OCR layer when Marker/Surya struggle with non-Latin scripts or extreme noise.
10. Tesseract OCR: the 40-year-old workhorse. Slow, sometimes inaccurate on modern fonts, but predictable, scriptable, and runs on a Raspberry Pi. Keep it as your fallback when GPU isn't available and accuracy expectations are modest.

How they fit together

Document in (PDF / scan / image / DOCX)
   │
   ├─ Marker  ─────────────────► clean Markdown (try this first)
   │
   │  if Marker output is bad:
   │
   ├─ MinerU  ─────────────────► Markdown / JSON (complex layouts)
   │
   │  if input is multi-format (DOCX, PPTX, HTML):
   │
   ├─ Docling  ────────────────► structured Markdown
   ├─ Unstructured  ───────────► typed elements (Title, Table, NarrativeText)
   │
   │  if input is math-heavy academic PDF:
   │
   ├─ Nougat  ─────────────────► LaTeX + Markdown
   │
   │  if cloud LLM is cheaper than GPU:
   │
   ├─ Zerox  ──────────────────► Markdown via vision LLM
   │
   │  low-level OCR layer (called by others):
   │
   ├─ Surya / PaddleOCR / Tesseract  ──► raw text + bounding boxes
   │
   └─ OpenDataLoader PDF  ─────► lightweight structured JSON

The install pattern is: Marker first as the default, MinerU as the escalation for layouts Marker struggles with, Nougat for math, and Zerox when you want to skip GPU hosting entirely. The OCR-only tools (Surya, PaddleOCR, Tesseract) are the building blocks underneath — you call them directly when the higher-level parsers fall short on your specific document class.
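The decision tree above can be written down as a small routing function. The following is a minimal sketch under stated assumptions, not part of any of these tools: `DocTraits`, `choose_parser`, and the trait flags are hypothetical names, and the function only returns the name of the tool to try first instead of calling it.

```python
from dataclasses import dataclass

# Formats the diagram routes to Docling / Unstructured.
MULTI_FORMAT_EXTS = {".docx", ".pptx", ".html"}


@dataclass
class DocTraits:
    """Hypothetical description of an input document (illustration only)."""
    extension: str                 # lowercase with leading dot, e.g. ".pdf"
    math_heavy: bool = False
    complex_layout: bool = False
    gpu_available: bool = True
    for_retrieval: bool = False    # True when only a retriever reads the output


def choose_parser(doc: DocTraits) -> str:
    """Return the name of the first tool to try for this document."""
    if doc.extension in MULTI_FORMAT_EXTS:
        # Typed elements for retrieval, Markdown for human readers.
        return "Unstructured" if doc.for_retrieval else "Docling"
    if not doc.gpu_available:
        # No local GPU: offload pages to a vision LLM API.
        return "Zerox"
    if doc.math_heavy:
        return "Nougat"
    if doc.complex_layout:
        return "MinerU"
    return "Marker"


if __name__ == "__main__":
    print(choose_parser(DocTraits(".pdf")))
    print(choose_parser(DocTraits(".pdf", math_heavy=True)))
    print(choose_parser(DocTraits(".docx", for_retrieval=True)))
```

The order of the checks is a design choice: input format and GPU availability are hard constraints, so they are tested before the quality-driven preferences (math, layout).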

Tradeoffs you'll hit

Marker vs MinerU — Marker is faster and produces cleaner Markdown on well-behaved PDFs. MinerU handles weirder layouts (Chinese newspapers, government forms, scanned books) but takes longer and outputs noisier Markdown. Benchmark both on 10 real documents from your domain before committing.
Local model vs vision LLM (Zerox) — A 4090 running Marker costs more upfront but is roughly an order of magnitude cheaper per page once you exceed a few thousand pages/month. Below that volume, Zerox via GPT-4o or Claude is usually the right call.
Surya vs PaddleOCR vs Tesseract — Surya is the modern default. PaddleOCR wins on Chinese, Japanese, Korean, and Arabic. Tesseract wins on "runs anywhere with no GPU" — keep it in your pipeline as the last-resort fallback.
Docling vs Unstructured — Docling produces cleaner Markdown; Unstructured produces typed elements better suited to RAG chunking. Use Docling when a human will read the output. Use Unstructured when only a retriever will.
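To illustrate why typed elements suit RAG chunking, here is a minimal sketch that groups elements under their nearest preceding title. The `Element` and `Chunk` classes and `group_under_titles` are hypothetical stand-ins written for this example; they are not Unstructured's own classes or chunking API, which should be consulted directly for a real pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for a parser's typed element."""
    kind: str   # e.g. "Title", "NarrativeText", "Table", "ListItem"
    text: str


@dataclass
class Chunk:
    title: str
    body: list[str] = field(default_factory=list)


def group_under_titles(elements: list[Element]) -> list[Chunk]:
    """Start a new chunk at every Title; attach other elements to the current chunk."""
    chunks: list[Chunk] = []
    for el in elements:
        if el.kind == "Title":
            chunks.append(Chunk(title=el.text))
            continue
        if not chunks:
            # Content that appears before the first title gets an untitled chunk.
            chunks.append(Chunk(title=""))
        chunks[-1].body.append(el.text)
    return chunks


elements = [
    Element("Title", "Payment terms"),
    Element("NarrativeText", "Invoices are due within 30 days."),
    Element("ListItem", "Late fee: 2% per month"),
    Element("Title", "Termination"),
    Element("NarrativeText", "Either party may terminate with notice."),
]

if __name__ == "__main__":
    for chunk in group_under_titles(elements):
        print(repr(chunk.title), "->", len(chunk.body), "elements")
```

With flat Markdown the same grouping needs heading detection by regex; with typed elements the section boundary is already a field on each element.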

Common pitfalls

Skipping layout detection — Running raw Tesseract on a two-column academic PDF interleaves text from both columns. Always run a layout-aware tool first (Marker, Surya, MinerU) — never feed full pages to OCR blindly.
Trusting the table output without verification — Every tool in this pack still loses cells on borderless tables, merged headers, or rotated text. Pipe table output through a quick sanity check (row count, column count, numeric column dtype) before downstream use.
GPU memory exhaustion — Marker, MinerU, and Nougat all want 8-12 GB VRAM at full quality. On a 16 GB card, run them sequentially, not in parallel.
Mixed-language documents — Most tools auto-detect language per page, not per region. A bilingual contract with English on the left and Chinese on the right often gets one language identified and the other mangled. PaddleOCR handles this best; for everything else, pre-segment by region.
Forgetting to dedupe headers/footers — Marker and friends extract page numbers, running headers, and footnotes as body text. Strip them with a post-processing pass keyed on repeating substrings across pages.
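Two of the pitfalls above, unverified tables and repeated headers/footers, can be handled in a post-processing pass. The sketch below is one possible approach using only the standard library. The function names, the 60% repetition threshold, and the assumption that each page is a list of text lines and each table a list of rows are illustrative choices, not something any tool in this pack prescribes.

```python
import re
from collections import Counter


def _key(line: str) -> str:
    """Normalize a line so that 'Page 3' and 'Page 12' compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_lines(pages: list[list[str]], min_share: float = 0.6) -> list[list[str]]:
    """Drop lines that recur on at least `min_share` of pages (likely headers/footers)."""
    counts: Counter = Counter()
    for page in pages:
        # A set ensures each normalized line is counted once per page.
        counts.update({_key(line) for line in page if line.strip()})
    threshold = max(2, min_share * len(pages))
    repeated = {k for k, n in counts.items() if n >= threshold}
    return [[line for line in page if _key(line) not in repeated] for page in pages]


def table_problems(rows: list[list[str]], numeric_cols: tuple = ()) -> list[str]:
    """Return human-readable problems; an empty list means the table passed.

    Assumes the first row is the header and every index in `numeric_cols`
    is smaller than the header width.
    """
    if len(rows) < 2:
        return ["fewer than two rows (header plus data expected)"]
    problems = []
    width = len(rows[0])
    for i, row in enumerate(rows[1:], start=1):
        if len(row) != width:
            problems.append(f"row {i} has {len(row)} cells, expected {width}")
            continue
        for col in numeric_cols:
            cell = row[col].replace(",", "").strip()
            try:
                float(cell)
            except ValueError:
                problems.append(f"row {i} col {col} is not numeric: {row[col]!r}")
    return problems


pages = [
    ["ACME Corp Confidential", "Intro text", "Page 1"],
    ["ACME Corp Confidential", "More text", "Page 2"],
    ["ACME Corp Confidential", "Final text", "Page 3"],
]
table = [
    ["item", "qty", "price"],
    ["widget", "2", "9.50"],
    ["gadget", "1"],
    ["gizmo", "x", "1,200.00"],
]

if __name__ == "__main__":
    print(strip_repeated_lines(pages))
    print(table_problems(table, numeric_cols=(1, 2)))
```

One limitation to keep in mind: masking digits also makes body lines such as "Section 1" and "Section 2" look identical, so a document with one numbered heading per page may need a stricter threshold or a position check (first or last lines of the page only).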

INSTALL · ONE COMMAND

$ tokrepo install pack/ocr-document-parsing

pass it to your agent, or paste it into your terminal

What's inside

10 ready-to-install resources

Skill#01

Marker — Convert PDF to Markdown with High Accuracy

Fast, accurate PDF to Markdown + JSON converter. Handles tables, images, equations, code blocks, and multi-column layouts. GPU-accelerated. 33K+ GitHub stars.

by Script Depot·285 views

$ tokrepo install marker-convert-pdf-markdown-high-accuracy-42976daf

Skill#02

Surya — Document OCR for 90+ Languages

Surya is a document OCR toolkit with 19.5K+ GitHub stars. Text recognition in 90+ languages, layout analysis, table detection, reading order, and LaTeX OCR. Benchmarks favorably against cloud OCR serv

by Script Depot·604 views

$ tokrepo install surya-document-ocr-90-languages-66bc0630

Script#03

MinerU — Extract LLM-Ready Data from Any Document

Convert PDFs, scans, and complex documents into clean Markdown or JSON for RAG and LLM pipelines. 57K+ GitHub stars.

by Script Depot·442 views

$ tokrepo install mineru-extract-llm-ready-data-any-document-985fe0df

Skill#04

Zerox — Zero-Shot PDF OCR for AI Pipelines

Extract text from any PDF using vision models as OCR. Zerox converts PDF pages to images then uses GPT-4o or Claude to extract clean markdown without training.

by Script Depot·366 views

$ tokrepo install zerox-zero-shot-pdf-ocr-ai-pipelines-3ac555d9

Skill#05

Nougat — Neural Optical Understanding for Academic Documents

Nougat is a visual transformer model from Meta that converts academic PDF pages into structured Markdown, accurately preserving mathematical equations, tables, and text formatting.

by AI Open Source·181 views

$ tokrepo install nougat-neural-optical-understanding-academic-documents-ed1264b8

Script#06

Docling — Document Parsing for AI

IBM document parsing library. Converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON. Built for RAG pipelines and LLM ingestion.

by Script Depot·317 views

$ tokrepo install docling-document-parsing-ai-443e86c2

MCP#07

Unstructured — Document ETL for LLM Pipelines

Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.

by MCP Hub·404 views

$ tokrepo install unstructured-document-etl-llm-pipelines-c2ba9909

Skill#08

PaddleOCR — AI-Powered OCR Toolkit for 100+ Languages

A lightweight, production-ready OCR system supporting 100+ languages. Bridges documents and images to structured data for LLM pipelines.

by Script Depot·247 views

$ tokrepo install paddleocr-ai-powered-ocr-toolkit-100-languages-175147cb

Skill#09

Tesseract OCR — Open Source Text Recognition Engine for 100+ Languages

Tesseract is an open-source OCR engine, originally developed at HP and later sponsored by Google, supporting over 100 languages. It converts images and scanned documents into machine-readable text and can write several output formats, including plain text, hOCR, TSV, and searchable PDF.

by Script Depot·329 views

$ tokrepo install tesseract-ocr-open-source-text-recognition-engine-100-9bb6bba9

Skill#10

OpenDataLoader PDF — AI-Ready Document Parser

An open-source PDF parser that automates document accessibility and extracts structured, AI-ready data including tables, text, bounding boxes, and tagged content.

by AI Open Source·297 views

$ tokrepo install opendataloader-pdf-ai-ready-document-parser-841f15d1

Frequently asked questions

Which tool should I try first if I have no idea where to start?

Marker. It's the highest-quality default for the broadest range of PDFs and gives you clean Markdown end-to-end without forcing you to assemble a layout-OCR-table pipeline yourself. Run Marker on five real documents from your domain. If the output is good, you're done. If it isn't, escalate to MinerU for layout-heavy docs or Nougat for math-heavy ones.
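The "try Marker, then escalate" advice can be expressed as a fallback chain. This is a minimal sketch: the parsers are passed in as plain callables so the example runs without any of the tools installed, and `looks_usable` is a deliberately crude, hypothetical quality check (minimum length, few Unicode replacement characters) that you would replace with checks suited to your documents.

```python
from typing import Callable, Sequence

Parser = Callable[[str], str]   # takes a file path, returns Markdown


def looks_usable(markdown: str, min_chars: int = 200, max_junk_ratio: float = 0.05) -> bool:
    """Crude heuristic: enough text and few U+FFFD replacement characters."""
    if len(markdown) < min_chars:
        return False
    junk = markdown.count("\ufffd")
    return junk / len(markdown) <= max_junk_ratio


def parse_with_escalation(
    path: str,
    parsers: Sequence[tuple[str, Parser]],
    check: Callable[[str], bool] = looks_usable,
) -> tuple[str, str]:
    """Try each (name, parser) in order; return the first acceptable (name, output)."""
    errors = []
    for name, parser in parsers:
        try:
            output = parser(path)
        except Exception as exc:  # a failing parser should not stop the chain
            errors.append(f"{name}: {exc}")
            continue
        if check(output):
            return name, output
        errors.append(f"{name}: output failed quality check")
    raise RuntimeError("all parsers failed: " + "; ".join(errors))


# Stand-in parsers for the demo; real ones would wrap Marker, MinerU, or Nougat.
def fake_marker(path: str) -> str:
    return "short"


def fake_mineru(path: str) -> str:
    return "# Title\n" + "body text " * 50


if __name__ == "__main__":
    name, _ = parse_with_escalation(
        "doc.pdf", [("marker", fake_marker), ("mineru", fake_mineru)]
    )
    print("accepted output from:", name)
```

Collecting the per-tool errors before raising makes it easier to see, for a failed document, whether each parser crashed or merely produced weak output.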

Do I need a GPU to run this pack?

For Marker, Surya, MinerU, and Nougat — effectively yes, or at least a strong Apple Silicon chip. They'll technically run on CPU but at 30-100x slower throughput, which is impractical for anything beyond hobby use. The escape hatches are Zerox (offloads to a vision LLM API), Tesseract (CPU-only by design), and PaddleOCR (has a lightweight CPU mode). For production pipelines, plan on a single GPU instance handling thousands of pages per hour.

How do I handle documents in Chinese, Japanese, Korean, or Arabic?

PaddleOCR is the strongest open-source choice for CJK and Arabic — it was built primarily for Chinese text and the model weights are heavily optimized. Surya covers 90+ languages and handles mixed-script documents reasonably well. Marker and MinerU both delegate OCR internally, and MinerU in particular was developed with strong Chinese-language coverage in mind. Avoid Tesseract for CJK unless you're constrained to CPU.
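The language guidance above reduces to a short selection rule. The sketch below is an illustration under stated assumptions: `pick_ocr_engine` and `tesseract_command` are hypothetical helpers, languages are given as ISO 639-1 codes, and the mapping covers only a handful of Tesseract language packs. The command it builds follows Tesseract's documented CLI form (`tesseract IMAGE stdout -l LANGS`) but is only constructed here, not executed.

```python
# Scripts where the text above recommends PaddleOCR.
PADDLE_PREFERRED = {"zh", "ja", "ko", "ar"}

# Partial mapping from ISO 639-1 codes to Tesseract language pack names.
TESSERACT_LANGS = {
    "en": "eng",
    "fr": "fra",
    "de": "deu",
    "zh": "chi_sim",
    "ja": "jpn",
    "ko": "kor",
    "ar": "ara",
}


def pick_ocr_engine(languages: set[str], gpu_available: bool) -> str:
    """Pick an OCR layer following the pack's rule of thumb."""
    if languages & PADDLE_PREFERRED:
        return "PaddleOCR"
    return "Surya" if gpu_available else "Tesseract"


def tesseract_command(image_path: str, languages: list[str]) -> list[str]:
    """Build (but do not run) a Tesseract CLI call that prints text to stdout."""
    try:
        packs = [TESSERACT_LANGS[code] for code in languages]
    except KeyError as exc:
        raise ValueError(f"no Tesseract mapping for language {exc.args[0]!r}") from None
    return ["tesseract", image_path, "stdout", "-l", "+".join(packs)]


if __name__ == "__main__":
    print(pick_ocr_engine({"en", "zh"}, gpu_available=True))
    print(pick_ocr_engine({"en"}, gpu_available=False))
    print(" ".join(tesseract_command("scan.png", ["en", "fr"])))
```

The resulting list can be passed to `subprocess.run` on a machine where Tesseract and the listed language packs are installed.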

What's the difference between OCR and document parsing?

OCR is the narrow problem of converting image pixels into text strings. Document parsing is the broader problem of understanding the document's structure: sections, paragraphs, tables, figures, reading order, references. Tesseract and PaddleOCR serve as the OCR layer in this pack, as does Surya when used standalone. Marker, MinerU, Docling, and Unstructured do parsing on top of OCR. The reason this pack covers both layers is that high-level parsers still occasionally fail on a specific page, and you need a working OCR layer to recover.
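Reading order is a concrete case where the two layers differ. The toy sketch below takes OCR-style text boxes from a two-column page and compares a naive top-to-bottom sort with a column-aware one. The `Box` class and both functions are hypothetical, and the equal-width column split is a simplifying assumption; real layout detectors infer regions from the page itself.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical OCR result: a text fragment with its top-left corner."""
    text: str
    x: float
    y: float


def naive_order(boxes: list[Box]) -> list[str]:
    """Sort top-to-bottom, then left-to-right: interleaves columns."""
    return [b.text for b in sorted(boxes, key=lambda b: (b.y, b.x))]


def column_order(boxes: list[Box], page_width: float, columns: int = 2) -> list[str]:
    """Assign each box to an equal-width column, then read column by column."""
    col_width = page_width / columns

    def key(b: Box):
        col = min(int(b.x // col_width), columns - 1)
        return (col, b.y, b.x)

    return [b.text for b in sorted(boxes, key=key)]


boxes = [
    Box("L1", 50, 100),
    Box("R1", 350, 100),
    Box("L2", 50, 120),
    Box("R2", 350, 120),
]

if __name__ == "__main__":
    print(naive_order(boxes))                    # left and right columns alternate
    print(column_order(boxes, page_width=600))   # left column first, then right
```

The naive ordering is what running plain OCR on a full two-column page amounts to, which is why the parsers in this pack run layout analysis before recognition.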

Can I run any of these as a hosted API instead of self-hosting?

Several of these tools have hosted versions or commercial wrappers — Marker has a hosted API, MinerU runs as a managed service, Unstructured offers an API plan, and Zerox by design just calls a vision LLM API. For low volume and quick prototyping, hosted is the right call. For high volume, regulated data, or anything where document content can't leave your network, self-hosting is the path. The benchmark you actually want is cost-per-thousand-pages on your real workload, not the headline accuracy number.
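Cost per thousand pages is simple arithmetic once you have your own numbers. The sketch below compares a pay-per-page hosted option with a self-hosted one that has a fixed monthly cost plus a small marginal cost per page. Every figure in the example is a placeholder chosen for round arithmetic, not a real price for any provider or GPU; substitute measurements from your own workload.

```python
import math


def cost_per_thousand_pages(pages: int, fixed_monthly: float, per_page: float) -> float:
    """Total monthly cost, normalized to 1,000 pages."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return 1000 * (fixed_monthly + pages * per_page) / pages


def break_even_pages(hosted_per_page: float, local_fixed_monthly: float,
                     local_per_page: float) -> float:
    """Monthly volume above which self-hosting is cheaper than the hosted API."""
    saving = hosted_per_page - local_per_page
    if saving <= 0:
        return math.inf  # hosted is never more expensive per page
    return local_fixed_monthly / saving


# Placeholder inputs (assumptions, not quotes).
HOSTED_PER_PAGE = 0.02       # currency units per page via an API
LOCAL_FIXED_MONTHLY = 75.0   # amortized hardware or instance cost per month
LOCAL_PER_PAGE = 0.005       # power and operations per page

if __name__ == "__main__":
    volume = break_even_pages(HOSTED_PER_PAGE, LOCAL_FIXED_MONTHLY, LOCAL_PER_PAGE)
    print(f"break-even at roughly {volume:.0f} pages/month")
    for pages in (1_000, 50_000):
        hosted = cost_per_thousand_pages(pages, 0.0, HOSTED_PER_PAGE)
        local = cost_per_thousand_pages(pages, LOCAL_FIXED_MONTHLY, LOCAL_PER_PAGE)
        print(f"{pages} pages: hosted {hosted:.2f}, local {local:.2f} per 1,000 pages")
```

With these placeholder inputs the hosted option is cheaper at low volume and the self-hosted one at high volume, which is the shape of the tradeoff described above; the crossover point depends entirely on the numbers you plug in.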
