[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"pack-detail-ocr-document-parsing-en":3,"seo:pack:ocr-document-parsing:en":98},{"code":4,"message":5,"data":6},200,"Operation successful",{"pack":7},{"slug":8,"icon":9,"tone":10,"status":11,"status_label":12,"title":13,"description":14,"items":15,"install_cmd":97},"ocr-document-parsing","📄","#0EA5E9","new","New · this week","OCR + Document Parsing Pack","Ten picks for the engineer pulling structured data out of scans, PDFs, and screenshots. Modern doc-AI (Marker, Nougat, Surya, Zerox, MinerU), layout-aware parsers (Docling, Unstructured, OpenDataLoader), plus battle-tested OCR (Tesseract, PaddleOCR) — opinionated order from detect → OCR → tables → structure → JSON.",[16,28,35,44,51,59,66,76,83,90],{"id":17,"uuid":18,"slug":19,"title":20,"description":21,"author_name":22,"view_count":23,"vote_count":24,"lang_type":25,"type":26,"type_label":27},210,"42976daf-a56a-4152-9afb-d5b00d130a08","marker-convert-pdf-markdown-high-accuracy-42976daf","Marker — Convert PDF to Markdown with High Accuracy","Fast, accurate PDF to Markdown + JSON converter. Handles tables, images, equations, code blocks, and multi-column layouts. GPU-accelerated. 33K+ GitHub stars.","Script Depot",136,0,"en","skill","Skill",{"id":29,"uuid":30,"slug":31,"title":32,"description":33,"author_name":22,"view_count":34,"vote_count":24,"lang_type":25,"type":26,"type_label":27},263,"66bc0630-1be7-4da3-b227-f1fdb1faa065","surya-document-ocr-90-languages-66bc0630","Surya — Document OCR for 90+ Languages","Surya is a document OCR toolkit with 19.5K+ GitHub stars. Text recognition in 90+ languages, layout analysis, table detection, reading order, and LaTeX OCR. 
Benchmarks favorably against cloud OCR serv",384,{"id":36,"uuid":37,"slug":38,"title":39,"description":40,"author_name":22,"view_count":41,"vote_count":24,"lang_type":25,"type":42,"type_label":43},413,"985fe0df-6ec5-4fd6-8d3d-3c1627b0e18d","mineru-extract-llm-ready-data-any-document-985fe0df","MinerU — Extract LLM-Ready Data from Any Document","Convert PDFs, scans, and complex documents into clean Markdown or JSON for RAG and LLM pipelines. 57K+ GitHub stars.",230,"script","Script",{"id":45,"uuid":46,"slug":47,"title":48,"description":49,"author_name":22,"view_count":50,"vote_count":24,"lang_type":25,"type":26,"type_label":27},758,"3ac555d9-d75c-4208-ba46-974e4a717234","zerox-zero-shot-pdf-ocr-ai-pipelines-3ac555d9","Zerox — Zero-Shot PDF OCR for AI Pipelines","Extract text from any PDF using vision models as OCR. Zerox converts PDF pages to images then uses GPT-4o or Claude to extract clean markdown without training.",205,{"id":52,"uuid":53,"slug":54,"title":55,"description":56,"author_name":57,"view_count":58,"vote_count":24,"lang_type":25,"type":26,"type_label":27},4670,"ed1264b8-54cb-11f1-9bc6-00163e2b0d79","nougat-neural-optical-understanding-academic-documents-ed1264b8","Nougat — Neural Optical Understanding for Academic Documents","Nougat is a visual transformer model from Meta that converts academic PDF pages into structured Markdown, accurately preserving mathematical equations, tables, and text formatting.","AI Open Source",20,{"id":60,"uuid":61,"slug":62,"title":63,"description":64,"author_name":22,"view_count":65,"vote_count":24,"lang_type":25,"type":42,"type_label":43},173,"443e86c2-3811-496e-8e4d-6eef742ab219","docling-document-parsing-ai-443e86c2","Docling — Document Parsing for AI","IBM document parsing library. Converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON. 
Built for RAG pipelines and LLM ingestion.",180,{"id":67,"uuid":68,"slug":69,"title":70,"description":71,"author_name":72,"view_count":73,"vote_count":24,"lang_type":25,"type":74,"type_label":75},439,"c2ba9909-f624-414f-8aeb-fbd95c50766e","unstructured-document-etl-llm-pipelines-c2ba9909","Unstructured — Document ETL for LLM Pipelines","Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.","MCP Hub",214,"mcp","MCP",{"id":77,"uuid":78,"slug":79,"title":80,"description":81,"author_name":22,"view_count":82,"vote_count":24,"lang_type":25,"type":26,"type_label":27},2454,"175147cb-453a-11f1-9bc6-00163e2b0d79","paddleocr-ai-powered-ocr-toolkit-100-languages-175147cb","PaddleOCR — AI-Powered OCR Toolkit for 100+ Languages","A lightweight, production-ready OCR system supporting 100+ languages. Bridges documents and images to structured data for LLM pipelines.",96,{"id":84,"uuid":85,"slug":86,"title":87,"description":88,"author_name":22,"view_count":89,"vote_count":24,"lang_type":25,"type":26,"type_label":27},2344,"9bb6bba9-43a4-11f1-9bc6-00163e2b0d79","tesseract-ocr-open-source-text-recognition-engine-100-9bb6bba9","Tesseract OCR — Open Source Text Recognition Engine for 100+ Languages","Tesseract is an open-source OCR engine, originally developed at Hewlett-Packard and later sponsored by Google, supporting over 100 languages. 
It converts images and scanned documents into machine-readable text with high accuracy across multiple output formats.",163,{"id":91,"uuid":92,"slug":93,"title":94,"description":95,"author_name":57,"view_count":96,"vote_count":24,"lang_type":25,"type":26,"type_label":27},4036,"841f15d1-5079-11f1-9bc6-00163e2b0d79","opendataloader-pdf-ai-ready-document-parser-841f15d1","OpenDataLoader PDF — AI-Ready Document Parser","An open-source PDF parser that automates document accessibility and extracts structured, AI-ready data including tables, text, bounding boxes, and tagged content.",60,"tokrepo install pack\u002Focr-document-parsing",{"pageType":99,"pageKey":8,"locale":25,"title":100,"metaDescription":101,"h1":102,"tldr":103,"bodyMarkdown":104,"faq":105,"schema":121,"internalLinks":126,"citations":139,"wordCount":152,"generatedAt":153},"pack","OCR + Document Parsing Pack — 10 Tools to Turn PDFs and Scans into Clean JSON","Marker, Surya, Nougat, Zerox, MinerU, Docling, Unstructured, OpenDataLoader, Tesseract, PaddleOCR — opinionated install order for the dev pulling structured data out of scans, PDFs, and screenshots. Detect layout, OCR, parse tables, ship JSON.","OCR + Document Parsing Pack — Detect → OCR → Tables → Structure → JSON","Ten picks in install order: detect page layout first, then OCR the text, then extract tables, then assemble structure, then output JSON your LLM or pipeline can actually consume. Modern doc-AI on top, classic OCR as fallback for what the new stuff still misses.","## What's in this pack\n\nThis is the pipeline a working engineer would assemble in one afternoon to convert messy documents — scanned invoices, academic PDFs, screenshots, mixed-language contracts — into clean structured data. The order matters: every stage's output is the next stage's input, and skipping layout detection is the single most common reason a doc-AI pipeline produces garbage.\n\nAll ten picks are **open-source** and **actively maintained** as of 2026. 
The combined install is large (model weights run a few GB), but you can usually pick one tool per stage and skip the rest. Treat this pack as a menu, not a checklist.\n\n## Install in this order\n\n1. **Marker** — convert PDF to Markdown end-to-end. Start here. Marker handles layout + OCR + tables + math in one shot and is the right default for most academic, technical, and structured PDFs. If Marker's output is good enough, you can stop reading.\n2. **Surya** — document OCR for 90+ languages with layout analysis, table detection, reading-order, and LaTeX OCR. Powers Marker internally; use it standalone when you need the OCR layer without the full Markdown pipeline.\n3. **MinerU** — extract LLM-ready data from any document. Stronger than Marker on complex layouts (multi-column papers, magazines, government forms). 57K+ GitHub stars. Picks up where Marker gives up.\n4. **Zerox** — zero-shot PDF OCR for AI pipelines. Sends page images to a vision LLM (GPT-4o, Claude, Gemini) and gets Markdown back. Pay-per-call instead of GPU-heavy local inference. Fastest path to working when you don't want to host a model.\n5. **Nougat** — neural optical understanding for academic documents. Meta's transformer model trained on arXiv. Best-in-class for math-heavy PDFs (equations come back as LaTeX, not garbled glyphs). Slower than Marker but more accurate on STEM papers.\n6. **Docling** — IBM's document parsing library. Converts PDFs, DOCX, PPTX, images, and HTML into structured Markdown or JSON. The most general-purpose parser in the pack — use when your input format isn't always PDF.\n7. **Unstructured** — document ETL for LLM pipelines. Handles 25+ file types with a unified API, partitioning text into typed elements (Title, NarrativeText, Table, ListItem). The boring industrial-strength backbone for RAG ingestion at scale.\n8. **OpenDataLoader PDF** — AI-ready document parser focused on producing clean structured output for downstream agents. 
Lighter footprint than Marker\u002FMinerU, useful when latency matters more than peak accuracy.\n9. **PaddleOCR** — production-ready OCR for 100+ languages, including the best open-source Chinese OCR available. Use as the OCR layer when Marker\u002FSurya struggle with non-Latin scripts or extreme noise.\n10. **Tesseract OCR** — the 40-year-old workhorse. Slow, sometimes inaccurate on modern fonts, but predictable, scriptable, and runs on a Raspberry Pi. Keep it as your fallback when GPU isn't available and accuracy expectations are modest.\n\n## How they fit together\n\n```\nDocument in (PDF \u002F scan \u002F image \u002F DOCX)\n   │\n   ├─ Marker  ─────────────────► clean Markdown (try this first)\n   │\n   │  if Marker output is bad:\n   │\n   ├─ MinerU  ─────────────────► Markdown \u002F JSON (complex layouts)\n   │\n   │  if input is multi-format (DOCX, PPTX, HTML):\n   │\n   ├─ Docling  ────────────────► structured Markdown\n   ├─ Unstructured  ───────────► typed elements (Title, Table, NarrativeText)\n   │\n   │  if input is math-heavy academic PDF:\n   │\n   ├─ Nougat  ─────────────────► LaTeX + Markdown\n   │\n   │  if cloud LLM is cheaper than GPU:\n   │\n   ├─ Zerox  ──────────────────► Markdown via vision LLM\n   │\n   │  low-level OCR layer (called by others):\n   │\n   ├─ Surya \u002F PaddleOCR \u002F Tesseract  ──► raw text + bounding boxes\n   │\n   └─ OpenDataLoader PDF  ─────► lightweight structured JSON\n```\n\nThe install pattern is: **Marker first as the default**, **MinerU as the escalation** for layouts Marker struggles with, **Nougat for math**, and **Zerox when you want to skip GPU hosting entirely**. The OCR-only tools (Surya, PaddleOCR, Tesseract) are the building blocks underneath — you call them directly when the higher-level parsers fall short on your specific document class.\n\n## Tradeoffs you'll hit\n\n- **Marker vs MinerU** — Marker is faster and produces cleaner Markdown on well-behaved PDFs. 
MinerU handles weirder layouts (Chinese newspapers, government forms, scanned books) but takes longer and outputs noisier Markdown. Benchmark both on 10 real documents from your domain before committing.\n- **Local model vs vision LLM (Zerox)** — A 4090 running Marker costs more upfront but is roughly an order of magnitude cheaper per page once you exceed a few thousand pages\u002Fmonth. Below that volume, Zerox via GPT-4o or Claude is usually the right call.\n- **Surya vs PaddleOCR vs Tesseract** — Surya is the modern default. PaddleOCR wins on Chinese, Japanese, Korean, and Arabic. Tesseract wins on \"runs anywhere with no GPU\" — keep it in your pipeline as the last-resort fallback.\n- **Docling vs Unstructured** — Docling produces cleaner Markdown; Unstructured produces typed elements better suited to RAG chunking. Use Docling when a human will read the output. Use Unstructured when only a retriever will.\n\n## Common pitfalls\n\n- **Skipping layout detection** — Running raw Tesseract on a two-column academic PDF interleaves text from both columns. Always run a layout-aware tool first (Marker, Surya, MinerU) — never feed full pages to OCR blindly.\n- **Trusting the table output without verification** — Every tool in this pack still loses cells on borderless tables, merged headers, or rotated text. Pipe table output through a quick sanity check (row count, column count, numeric column dtype) before downstream use.\n- **GPU memory exhaustion** — Marker, MinerU, and Nougat all want 8-12 GB VRAM at full quality. On a 16 GB card, run them sequentially, not in parallel.\n- **Mixed-language documents** — Most tools auto-detect language per page, not per region. A bilingual contract with English on the left and Chinese on the right often gets one language identified and the other mangled. 
PaddleOCR handles this best; for everything else, pre-segment by region.\n- **Forgetting to dedupe headers\u002Ffooters** — Marker and friends extract page numbers, running headers, and footnotes as body text. Strip them with a post-processing pass keyed on repeating substrings across pages.",[106,109,112,115,118],{"q":107,"a":108},"Which tool should I try first if I have no idea where to start?","Marker. It's the highest-quality default for the broadest range of PDFs and gives you clean Markdown end-to-end without forcing you to assemble a layout-OCR-table pipeline yourself. Run Marker on five real documents from your domain. If the output is good, you're done. If it isn't, escalate to MinerU for layout-heavy docs or Nougat for math-heavy ones.",{"q":110,"a":111},"Do I need a GPU to run this pack?","For Marker, Surya, MinerU, and Nougat — effectively yes, or at least a strong Apple Silicon chip. They'll technically run on CPU but at 30-100x slower throughput, which is impractical for anything beyond hobby use. The escape hatches are Zerox (offloads to a vision LLM API), Tesseract (CPU-only by design), and PaddleOCR (has a lightweight CPU mode). For production pipelines, plan on a single GPU instance handling thousands of pages per hour.",{"q":113,"a":114},"How do I handle documents in Chinese, Japanese, Korean, or Arabic?","PaddleOCR is the strongest open-source choice for CJK and Arabic — it was built primarily for Chinese text and the model weights are heavily optimized. Surya covers 90+ languages and handles mixed-script documents reasonably well. Marker and MinerU both delegate OCR internally, and MinerU in particular was developed with strong Chinese-language coverage in mind. Avoid Tesseract for CJK unless you're constrained to CPU.",{"q":116,"a":117},"What's the difference between OCR and document parsing?","OCR is the narrow problem of converting image pixels into text strings. 
Document parsing is the broader problem of understanding the document's structure — sections, paragraphs, tables, figures, reading order, references. Tesseract and PaddleOCR do OCR only. Marker, MinerU, Docling, and Unstructured do parsing on top of OCR. The reason this pack covers both layers is that high-level parsers still occasionally fail on a specific page, and you need a working OCR layer to recover.",{"q":119,"a":120},"Can I run any of these as a hosted API instead of self-hosting?","Several of these tools have hosted versions or commercial wrappers — Marker has a hosted API, MinerU runs as a managed service, Unstructured offers an API plan, and Zerox by design just calls a vision LLM API. For low volume and quick prototyping, hosted is the right call. For high volume, regulated data, or anything where document content can't leave your network, self-hosting is the path. The benchmark you actually want is cost-per-thousand-pages on your real workload, not the headline accuracy number.",{"@context":122,"@type":123,"name":13,"description":124,"numberOfItems":125,"inLanguage":25},"https:\u002F\u002Fschema.org","ItemList","Ten open-source OCR and document parsing tools curated for engineers extracting structured data from PDFs, scans, and screenshots — install order from layout detection to JSON output.",10,[127,131,135],{"url":128,"anchor":129,"reason":130},"\u002Fen\u002Fai-tools-for\u002Frag","RAG ingestion tools","Document parsing is the first step in most RAG pipelines",{"url":132,"anchor":133,"reason":134},"\u002Fen\u002Ffeatured","Featured assets on TokRepo","These ten tools live alongside the broader curated catalog",{"url":136,"anchor":137,"reason":138},"\u002Fen\u002Ftopics","Browse other topic packs","Pairs well with the data-engineer and ML engineer packs",[140,144,148],{"claim":141,"source_name":142,"source_url":143},"Marker converts PDF to Markdown end-to-end with layout, table, and equation support","Marker 
GitHub","https:\u002F\u002Fgithub.com\u002FVikParuchuri\u002Fmarker",{"claim":145,"source_name":146,"source_url":147},"MinerU converts PDFs and scans into clean Markdown or JSON for RAG","MinerU GitHub","https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FMinerU",{"claim":149,"source_name":150,"source_url":151},"Nougat is a transformer model for academic document understanding from Meta AI","Nougat GitHub","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fnougat",920,"2026-05-22T10:00:00Z"]