Pack OCR et Analyse de Documents
Dix outils pour l'ingénieur qui extrait des données structurées de scans, PDF et captures. Doc-AI moderne (Marker, Nougat, Surya, Zerox, MinerU), parsers sensibles au layout (Docling, Unstructured, OpenDataLoader) et OCR éprouvé (Tesseract, PaddleOCR) — ordre délibéré : détection → OCR → tables → structure → JSON.
What's in this pack
This is the pipeline a working engineer would assemble in one afternoon to convert messy documents — scanned invoices, academic PDFs, screenshots, mixed-language contracts — into clean structured data. The order matters: every stage's output is the next stage's input, and skipping layout detection is the single most common reason a doc-AI pipeline produces garbage.
All ten picks are open-source and actively maintained as of 2026. The combined install is large (model weights run a few GB), but you can usually pick one tool per stage and skip the rest. Treat this pack as a menu, not a checklist.
Install in this order
- Marker — convert PDF to Markdown end-to-end. Start here. Marker handles layout + OCR + tables + math in one shot and is the right default for most academic, technical, and structured PDFs. If Marker's output is good enough, you can stop reading.
- Surya — document OCR for 90+ languages with layout analysis, table detection, reading-order, and LaTeX OCR. Powers Marker internally; use it standalone when you need the OCR layer without the full Markdown pipeline.
- MinerU — extract LLM-ready data from any document. Stronger than Marker on complex layouts (multi-column papers, magazines, government forms). 57K+ GitHub stars. Picks up where Marker gives up.
- Zerox — zero-shot PDF OCR for AI pipelines. Sends page images to a vision LLM (GPT-4o, Claude, Gemini) and gets Markdown back. Pay-per-call instead of GPU-heavy local inference. Fastest path to working when you don't want to host a model.
- Nougat — neural optical understanding for academic documents. Meta's transformer model trained on arXiv. Best-in-class for math-heavy PDFs (equations come back as LaTeX, not garbled glyphs). Slower than Marker but more accurate on STEM papers.
- Docling — IBM's document parsing library. Converts PDFs, DOCX, PPTX, images, and HTML into structured Markdown or JSON. The most general-purpose parser in the pack — use when your input format isn't always PDF.
- Unstructured — document ETL for LLM pipelines. Handles 25+ file types with a unified API, partitioning text into typed elements (Title, NarrativeText, Table, ListItem). The boring industrial-strength backbone for RAG ingestion at scale.
- OpenDataLoader PDF — AI-ready document parser focused on producing clean structured output for downstream agents. Lighter footprint than Marker/MinerU, useful when latency matters more than peak accuracy.
- PaddleOCR — production-ready OCR for 100+ languages, including the best open-source Chinese OCR available. Use as the OCR layer when Marker/Surya struggle with non-Latin scripts or extreme noise.
- Tesseract OCR — the 40-year-old workhorse. Slow, sometimes inaccurate on modern fonts, but predictable, scriptable, and runs on a Raspberry Pi. Keep it as your fallback when GPU isn't available and accuracy expectations are modest.
How they fit together
Document in (PDF / scan / image / DOCX)
│
├─ Marker ─────────────────► clean Markdown (try this first)
│
│ if Marker output is bad:
│
├─ MinerU ─────────────────► Markdown / JSON (complex layouts)
│
│ if input is multi-format (DOCX, PPTX, HTML):
│
├─ Docling ────────────────► structured Markdown
├─ Unstructured ───────────► typed elements (Title, Table, NarrativeText)
│
│ if input is math-heavy academic PDF:
│
├─ Nougat ─────────────────► LaTeX + Markdown
│
│ if cloud LLM is cheaper than GPU:
│
├─ Zerox ──────────────────► Markdown via vision LLM
│
│ low-level OCR layer (called by others):
│
├─ Surya / PaddleOCR / Tesseract ──► raw text + bounding boxes
│
└─ OpenDataLoader PDF ─────► lightweight structured JSON
The install pattern is: Marker first as the default, MinerU as the escalation for layouts Marker struggles with, Nougat for math, and Zerox when you want to skip GPU hosting entirely. The OCR-only tools (Surya, PaddleOCR, Tesseract) are the building blocks underneath — you call them directly when the higher-level parsers fall short on your specific document class.
Tradeoffs you'll hit
- Marker vs MinerU — Marker is faster and produces cleaner Markdown on well-behaved PDFs. MinerU handles weirder layouts (Chinese newspapers, government forms, scanned books) but takes longer and outputs noisier Markdown. Benchmark both on 10 real documents from your domain before committing.
- Local model vs vision LLM (Zerox) — A 4090 running Marker costs more upfront but is roughly an order of magnitude cheaper per page once you exceed a few thousand pages/month. Below that volume, Zerox via GPT-4o or Claude is usually the right call.
- Surya vs PaddleOCR vs Tesseract — Surya is the modern default. PaddleOCR wins on Chinese, Japanese, Korean, and Arabic. Tesseract wins on "runs anywhere with no GPU" — keep it in your pipeline as the last-resort fallback.
- Docling vs Unstructured — Docling produces cleaner Markdown; Unstructured produces typed elements better suited to RAG chunking. Use Docling when a human will read the output. Use Unstructured when only a retriever will.
Common pitfalls
- Skipping layout detection — Running raw Tesseract on a two-column academic PDF interleaves text from both columns. Always run a layout-aware tool first (Marker, Surya, MinerU) — never feed full pages to OCR blindly.
- Trusting the table output without verification — Every tool in this pack still loses cells on borderless tables, merged headers, or rotated text. Pipe table output through a quick sanity check (row count, column count, numeric column dtype) before downstream use.
- GPU memory exhaustion — Marker, MinerU, and Nougat all want 8-12 GB VRAM at full quality. On a 16 GB card, run them sequentially, not in parallel.
- Mixed-language documents — Most tools auto-detect language per page, not per region. A bilingual contract with English on the left and Chinese on the right often gets one language identified and the other mangled. PaddleOCR handles this best; for everything else, pre-segment by region.
- Forgetting to dedupe headers/footers — Marker and friends extract page numbers, running headers, and footnotes as body text. Strip them with a post-processing pass keyed on repeating substrings across pages.
10 ressources prêtes à installer
Questions fréquentes
Which tool should I try first if I have no idea where to start?
Marker. It's the highest-quality default for the broadest range of PDFs and gives you clean Markdown end-to-end without forcing you to assemble a layout-OCR-table pipeline yourself. Run Marker on five real documents from your domain. If the output is good, you're done. If it isn't, escalate to MinerU for layout-heavy docs or Nougat for math-heavy ones.
Do I need a GPU to run this pack?
For Marker, Surya, MinerU, and Nougat — effectively yes, or at least a strong Apple Silicon chip. They'll technically run on CPU but at 30-100x slower throughput, which is impractical for anything beyond hobby use. The escape hatches are Zerox (offloads to a vision LLM API), Tesseract (CPU-only by design), and PaddleOCR (has a lightweight CPU mode). For production pipelines, plan on a single GPU instance handling thousands of pages per hour.
How do I handle documents in Chinese, Japanese, Korean, or Arabic?
PaddleOCR is the strongest open-source choice for CJK and Arabic — it was built primarily for Chinese text and the model weights are heavily optimized. Surya covers 90+ languages and handles mixed-script documents reasonably well. Marker and MinerU both delegate OCR internally, and MinerU in particular was developed with strong Chinese-language coverage in mind. Avoid Tesseract for CJK unless you're constrained to CPU.
What's the difference between OCR and document parsing?
OCR is the narrow problem of converting image pixels into text strings. Document parsing is the broader problem of understanding the document's structure — sections, paragraphs, tables, figures, reading order, references. Tesseract and PaddleOCR do OCR only. Marker, MinerU, Docling, and Unstructured do parsing on top of OCR. The reason this pack covers both layers is that high-level parsers still occasionally fail on a specific page, and you need a working OCR layer to recover.
Can I run any of these as a hosted API instead of self-hosting?
Several of these tools have hosted versions or commercial wrappers — Marker has a hosted API, MinerU runs as a managed service, Unstructured offers an API plan, and Zerox by design just calls a vision LLM API. For low volume and quick prototyping, hosted is the right call. For high volume, regulated data, or anything where document content can't leave your network, self-hosting is the path. The benchmark you actually want is cost-per-thousand-pages on your real workload, not the headline accuracy number.
12 packs · 80+ ressources sélectionnées
Découvrez tous les packs curatés sur la page d'accueil
Retour à tous les packs