[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"pack-detail-document-ai-pipeline-en":3,"seo:pack:document-ai-pipeline:en":78},{"code":4,"message":5,"data":6},200,"操作成功",{"pack":7},{"slug":8,"icon":9,"tone":10,"status":11,"status_label":12,"title":13,"description":14,"items":15,"install_cmd":77},"document-ai-pipeline","📄","#BE123C","stable","Stable","Document AI Pipeline","Surya, Zerox, MinerU, Docling, Unstructured, DocETL, MarkItDown — turn any PDF, scan, or Office file into clean LLM input.",[16,28,35,44,51,61,69],{"id":17,"uuid":18,"slug":19,"title":20,"description":21,"author_name":22,"view_count":23,"vote_count":24,"lang_type":25,"type":26,"type_label":27},263,"66bc0630-1be7-4da3-b227-f1fdb1faa065","surya-document-ocr-90-languages-66bc0630","Surya — Document OCR for 90+ Languages","Surya is a document OCR toolkit with 19.5K+ GitHub stars. Text recognition in 90+ languages, layout analysis, table detection, reading order, and LaTeX OCR. Benchmarks favorably against cloud OCR serv","Script Depot",501,0,"en","skill","Skill",{"id":29,"uuid":30,"slug":31,"title":32,"description":33,"author_name":22,"view_count":34,"vote_count":24,"lang_type":25,"type":26,"type_label":27},758,"3ac555d9-d75c-4208-ba46-974e4a717234","zerox-zero-shot-pdf-ocr-ai-pipelines-3ac555d9","Zerox — Zero-Shot PDF OCR for AI Pipelines","Extract text from any PDF using vision models as OCR. Zerox converts PDF pages to images then uses GPT-4o or Claude to extract clean markdown without training.",303,{"id":36,"uuid":37,"slug":38,"title":39,"description":40,"author_name":22,"view_count":41,"vote_count":24,"lang_type":25,"type":42,"type_label":43},413,"985fe0df-6ec5-4fd6-8d3d-3c1627b0e18d","mineru-extract-llm-ready-data-any-document-985fe0df","MinerU — Extract LLM-Ready Data from Any Document","Convert PDFs, scans, and complex documents into clean Markdown or JSON for RAG and LLM pipelines. 
57K+ GitHub stars.",348,"script","Script",{"id":45,"uuid":46,"slug":47,"title":48,"description":49,"author_name":22,"view_count":50,"vote_count":24,"lang_type":25,"type":42,"type_label":43},173,"443e86c2-3811-496e-8e4d-6eef742ab219","docling-document-parsing-ai-443e86c2","Docling — Document Parsing for AI","IBM document parsing library. Converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON. Built for RAG pipelines and LLM ingestion.",262,{"id":52,"uuid":53,"slug":54,"title":55,"description":56,"author_name":57,"view_count":58,"vote_count":24,"lang_type":25,"type":59,"type_label":60},439,"c2ba9909-f624-414f-8aeb-fbd95c50766e","unstructured-document-etl-llm-pipelines-c2ba9909","Unstructured — Document ETL for LLM Pipelines","Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.","MCP Hub",344,"mcp","MCP",{"id":62,"uuid":63,"slug":64,"title":65,"description":66,"author_name":67,"view_count":68,"vote_count":24,"lang_type":25,"type":26,"type_label":27},417,"ef81583e-45e5-4134-b25b-04e486ae2d06","docetl-llm-powered-document-processing-pipelines-ef81583e","DocETL — LLM-Powered Document Processing Pipelines","Declarative YAML pipelines for LLM document analysis with map, reduce, and resolve operators. By UC Berkeley. 3.7K+ stars.","AI Open Source",291,{"id":70,"uuid":71,"slug":72,"title":73,"description":74,"author_name":75,"view_count":76,"vote_count":24,"lang_type":25,"type":26,"type_label":27},678,"6fdc90c2-bede-4d3a-98d7-faf751dfb41f","markitdown-convert-any-file-markdown-llms-6fdc90c2","MarkItDown — Convert Any File to Markdown for LLMs","Python library by Microsoft that converts PDF, DOCX, PPTX, XLSX, images, audio, and HTML to clean Markdown. Perfect for feeding documents into LLM context windows. 
8,000+ stars.","Microsoft AI",353,"tokrepo install pack\u002Fdocument-ai-pipeline",{"pageType":79,"pageKey":8,"locale":25,"title":80,"metaDescription":81,"h1":13,"tldr":82,"bodyMarkdown":83,"faq":84,"schema":100,"internalLinks":109,"citations":122,"wordCount":135,"generatedAt":136},"pack","Document AI Pipeline: 7 Parsers for PDF, Scan, Office to LLM","Surya, Zerox, MinerU, Docling, Unstructured, DocETL, MarkItDown — turn any PDF, scan, or Office file into clean LLM input. Install the pipeline via TokRepo.","Seven open-source parsers covering OCR, layout extraction, table reconstruction, and Office-to-Markdown conversion. Together they turn any document a human ever made into clean LLM input.","## What's in this pack\n\n| # | Parser | Best at | Output |\n|---|---|---|---|\n| 1 | Surya | multilingual OCR + layout, 90+ languages | text + bounding boxes |\n| 2 | Zerox | vision-LLM driven page-by-page parse | markdown |\n| 3 | MinerU | scientific PDFs with formulas and tables | markdown + LaTeX |\n| 4 | Docling | IBM's all-in-one PDF\u002FDOCX\u002FHTML\u002FPPTX parser | DoclingDocument JSON |\n| 5 | Unstructured | enterprise-grade preprocessing with chunking | element list ready for embedding |\n| 6 | DocETL | LLM-driven document ETL with validation | typed records |\n| 7 | MarkItDown | Microsoft's Office-to-Markdown converter | markdown |\n\nThe seven parsers cover every shape of \"this file used to be for humans, now an LLM has to read it.\" Some specialize (Surya for OCR, MinerU for math papers); others are generalists (Docling, Unstructured, MarkItDown). Pick by file mix and accuracy budget.\n\n## Why this matters\n\nLLMs are surprisingly bad at reading raw PDF text. The bytes that look like prose to your eyes are usually scattered glyphs with no reading order — pdfplumber and PyMuPDF return jumbled output that confuses the model. Tables come out as broken rows. Headers and footers leak into the body. 
Naive extraction often reads multi-column layouts straight across the page, interleaving lines from the left and right columns into text that is meaningless to a transformer.\n\nThis pack solves that. Surya and Zerox use vision models to *see* the page like a human and reconstruct logical reading order. Docling and Unstructured run layout-aware pipelines that label each element (heading, paragraph, table, caption) so downstream chunking respects structure. MinerU is the parser in this pack built for scientific papers, where it extracts equations and matrices that general-purpose tools tend to mangle.\n\nFor Office files (PowerPoint decks, Word docs, Excel sheets), MarkItDown is the answer. Microsoft published it because their own internal Copilot retrieval needed clean Markdown from Office, and existing converters were terrible.\n\n## Install in one command\n\n```bash\n# Install the whole pack\ntokrepo install pack\u002Fdocument-ai-pipeline\n\n# Or pick the parser that matches your file mix\ntokrepo install docling\ntokrepo install surya\ntokrepo install markitdown\n```\n\nEach TokRepo asset page lists the supported file types, compute requirements (Surya wants a GPU; Zerox needs access to a vision LLM, usually through a hosted API; Docling and MarkItDown run on CPU), and the chunking strategy that pairs well downstream.\n\n## Common pitfalls\n\n- **OCR vs PDF text layer**: a PDF *with* a text layer doesn't need OCR. Run Docling first; if the text layer is intact, skip Surya entirely. OCR is typically 10-100x slower than text extraction.\n- **Tables silently broken**: most parsers extract tables but flatten rows incorrectly. Always sample 10 random table outputs and eyeball them before trusting the pipeline.\n- **Reading order on multi-column**: legal documents and academic papers in two columns trip up naive parsers. Docling and Surya handle this; pdfplumber does not.\n- **Image captions get lost**: figures are often the most information-dense part of a paper. Make sure your parser keeps caption text linked to the figure, not floating elsewhere.\n- **Token cost on Zerox**: Zerox calls a vision-LLM per page. 
A 200-page PDF can cost roughly $1-2 in API fees, depending on the model. Cache aggressively and prefer Docling-then-Zerox-fallback over running everything through Zerox.\n\n## Relationship to other packs\n\nThis pack is the **ingestion** layer for retrieval. It produces clean text and structured elements; the **RAG Pipelines** pack chunks, embeds, and serves them. For web pages instead of files, switch to **AI Web Scraping**. Voice and video content goes through speech-to-text first (out of scope for this pack).\n\nA common production stack is: MarkItDown for Office → Docling for PDFs → Unstructured chunking → vector DB → RAG pipeline. The boundaries between packs are clean enough that you can swap any single layer without rewriting the rest.",[85,88,91,94,97],{"q":86,"a":87},"Is this stack free?","All seven parsers are open source, but the licenses are not uniform: most use MIT or Apache 2.0, while Surya and MinerU have been distributed under copyleft terms (GPL-3.0 and AGPL-3.0), so check each project's license before commercial use. Self-hosting is free. The hidden cost is GPU time for the model-based parsers (Surya, MinerU) and LLM API fees if you use Zerox or DocETL with hosted models. CPU-only options (Docling, MarkItDown, Unstructured) add no licensing or API cost; you pay only for compute.",{"q":89,"a":90},"Docling vs Unstructured: which should I pick?","Docling if you want a single parser that handles PDF\u002FDOCX\u002FHTML\u002FPPTX with a unified output format and IBM's quality bar. Unstructured if you need deep enterprise integrations (S3, SharePoint, Azure connectors), pluggable chunking strategies, and don't mind a steeper config surface. Many teams run both: Docling for parse, Unstructured for chunking.",{"q":92,"a":93},"Will these work with Cursor or Codex CLI?","Yes. Docling, Unstructured, and MarkItDown have MCP servers or are exposed as CLI tools that any AI agent can invoke. Drop the MCP definition into your Cursor settings and the LLM can convert a dropped PDF to markdown on the fly. 
Surya is heavier (GPU-resident) and Zerox depends on per-page vision-LLM calls, so both usually run as a separate microservice.",{"q":95,"a":96},"How is this different from the AI Web Scraping pack?","Web scraping starts from a URL. Document AI starts from a file. The output of both is LLM-ready text, but the input shape is fundamentally different. Most production RAG corpora need both: your knowledge base has internal PDFs *and* a public docs site. Install both packs in that case.",{"q":98,"a":99},"What's the operational gotcha?","Throughput planning. Vision-based parsing (Surya, Zerox, MinerU on hard pages) is roughly 1-5 pages per second on a single GPU. If you have 100k pages to ingest, that's roughly 6 to 28 hours of processing on one GPU. Run a small benchmark before committing; many teams discover too late that their backfill takes a weekend, not an afternoon.",{"@context":101,"@type":102,"name":13,"description":103,"numberOfItems":104,"publisher":105},"https:\u002F\u002Fschema.org","CollectionPage","Seven open-source parsers that turn PDFs, scans, and Office files into clean LLM input.",7,{"@type":106,"name":107,"url":108},"Organization","TokRepo","https:\u002F\u002Ftokrepo.com",[110,114,118],{"url":111,"anchor":112,"reason":113},"\u002Fen\u002Fpacks\u002Fai-web-scraping","AI Web Scraping","complementary web ingestion",{"url":115,"anchor":116,"reason":117},"\u002Fen\u002Fpacks\u002Frag-pipelines","RAG Pipelines","downstream retrieve+generate layer",{"url":119,"anchor":120,"reason":121},"\u002Fen\u002Ftools\u002Fdocling","Docling","the IBM-built parser in this pack",[123,127,131],{"claim":124,"source_name":125,"source_url":126},"Docling is IBM's open-source document conversion toolkit for AI workflows","DS4SD\u002Fdocling on GitHub","https:\u002F\u002Fgithub.com\u002FDS4SD\u002Fdocling",{"claim":128,"source_name":129,"source_url":130},"Unstructured.io provides open-source preprocessing for LLM-ready document 
chunks","Unstructured-IO\u002Funstructured","https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured",{"claim":132,"source_name":133,"source_url":134},"MarkItDown converts Office, PDF, and other files to Markdown for LLM ingestion","microsoft\u002Fmarkitdown","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmarkitdown",624,"2026-05-02T15:00:00Z"]