TOKREPO · ARSENAL
Stable

Document AI Pipeline

Surya, Zerox, MinerU, Docling, Unstructured, DocETL, MarkItDown — turn any PDF, scan, or Office file into clean LLM input.

7 assets

What's in this pack

| # | Parser | Best at | Output |
|---|--------|---------|--------|
| 1 | Surya | multilingual OCR + layout, 90+ languages | text + bounding boxes |
| 2 | Zerox | vision-LLM driven page-by-page parse | markdown |
| 3 | MinerU | scientific PDFs with formulas and tables | markdown + LaTeX |
| 4 | Docling | IBM's all-in-one PDF/DOCX/HTML/PPTX parser | DoclingDocument JSON |
| 5 | Unstructured | enterprise-grade preprocessing with chunking | element list ready for embedding |
| 6 | DocETL | LLM-driven document ETL with validation | typed records |
| 7 | MarkItDown | Microsoft's Office-to-Markdown converter | markdown |

The seven parsers cover every shape of "this file used to be for humans, now an LLM has to read it." Some specialize (Surya for OCR, MinerU for math papers); others are generalists (Docling, Unstructured, MarkItDown). Pick by file mix and accuracy budget.

Why this matters

LLMs are surprisingly bad at reading raw PDF text. The bytes that look like prose to your eyes are usually scattered glyphs with no reading order — pdfplumber and PyMuPDF return jumbled output that confuses the model. Tables come out as broken rows. Headers and footers leak into the body. Multi-column layouts get interleaved, with lines from the left and right columns mixed together in geometric top-to-bottom order, which is meaningless to a transformer.

This pack solves that. Surya and Zerox use vision models to see the page like a human and reconstruct logical reading order. Docling and Unstructured run layout-aware pipelines that label each element (heading, paragraph, table, caption) so downstream chunking respects structure. MinerU is the only open-source tool that reliably extracts equations and matrices from scientific papers.
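Element labels are what make structure-respecting chunking possible downstream. A minimal sketch of the idea, assuming a parser that emits (type, text) pairs — the element names here are illustrative, not any one library's exact schema:

```python
def chunk_elements(elements, max_chars=800):
    """Group labeled elements into chunks, treating each element as
    atomic (so tables never split) and starting fresh at headings."""
    chunks, current, heading, size = [], [], None, 0
    for etype, text in elements:
        if etype == "heading":
            # Flush the running chunk and open a new one at the heading.
            if current:
                chunks.append("\n".join(current))
            heading, current, size = text, [text], len(text)
            continue
        # If this element would overflow, flush and re-seed with the heading.
        if size + len(text) > max_chars and current:
            chunks.append("\n".join(current))
            current = [heading] if heading else []
            size = len(heading) if heading else 0
        current.append(text)
        size += len(text)
    if current:
        chunks.append("\n".join(current))
    return chunks

elements = [
    ("heading", "Results"),
    ("paragraph", "We evaluate on three corpora."),
    ("table", "| corpus | F1 |\n| A | 0.91 |"),
]
print(chunk_elements(elements))  # one chunk, heading kept with its table
```

The point is that the chunker never has to guess where a table starts, because the parser already said so.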

For Office files (PowerPoint decks, Word docs, Excel sheets), MarkItDown is the answer. Microsoft published it because their own internal Copilot retrieval needed clean Markdown from Office, and existing converters were terrible.

Install in one command

# Install the whole pack
tokrepo install pack/document-ai-pipeline

# Or pick the parser that matches your file mix
tokrepo install docling
tokrepo install surya
tokrepo install markitdown

Each TokRepo asset page lists the supported file types, GPU requirements (Surya and Zerox want GPU; Docling and MarkItDown run on CPU), and the chunking strategy that pairs well downstream.

Common pitfalls

  • OCR vs PDF text layer: a PDF with a text layer doesn't need OCR. Run Docling first; if the text layer is intact, skip Surya entirely. OCR is 10-100x slower than text extraction.
  • Tables silently broken: most parsers extract tables but flatten rows incorrectly. Always sample 10 random table outputs and eyeball them before trusting the pipeline.
  • Reading order on multi-column: legal documents and academic papers in two columns trip up naive parsers. Docling and Surya handle this; pdfplumber does not.
  • Image captions get lost: figures are often the most information-dense part of a paper. Make sure your parser keeps caption text linked to the figure, not floating elsewhere.
  • Token cost on Zerox: Zerox calls a vision-LLM per page. A 200-page PDF can cost $1-2 in API fees. Cache aggressively and prefer Docling-then-Zerox-fallback over running everything through Zerox.
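The first and last pitfalls combine into one routing rule: try the cheap text-layer parse first, and only pay for vision OCR when the result looks empty. A minimal sketch with hypothetical stubs standing in for the real Docling and Zerox calls:

```python
# Hypothetical stubs — swap in real Docling / Zerox calls in production.
def cheap_parse(path):
    # Pretend scanned files yield an empty text layer.
    return "" if path.endswith(".scan.pdf") else "extracted text layer"

def vision_parse(path):
    # Stands in for the expensive per-page vision-LLM call.
    return f"OCR output for {path}"

def parse_document(path, min_chars=10):
    """Try the fast CPU text-layer parse first; fall back to vision OCR
    only when the result is too sparse to be a real text layer."""
    text = cheap_parse(path)
    if len(text.strip()) >= min_chars:
        return text, "cheap"
    return vision_parse(path), "vision"

print(parse_document("report.pdf"))    # takes the cheap path
print(parse_document("old.scan.pdf"))  # falls back to vision OCR
```

The `min_chars` threshold is a crude sparsity check; real pipelines often use characters-per-page instead, since a 200-page PDF with 40 characters of boilerplate text still needs OCR.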

Relationship to other packs

This pack is the ingestion layer for retrieval. It produces clean text and structured elements; the RAG Pipelines pack chunks, embeds, and serves them. For web pages instead of files, switch to AI Web Scraping. For voice or video content, those flow through speech-to-text first (out of scope for this pack).

A common production stack is: MarkItDown for Office → Docling for PDFs → Unstructured chunking → vector DB → RAG pipeline. The boundaries between packs are clean enough that you can swap any single layer without rewriting the rest.
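The first hop of that stack is just routing by file type. A sketch of the dispatch layer, where the parser names are labels for the tools above, not live imports:

```python
from pathlib import Path

# Illustrative routing table for the stack described above.
OFFICE_EXTS = {".docx", ".pptx", ".xlsx"}
PDF_EXTS = {".pdf"}

def pick_parser(path):
    """Route a file to the parser that matches its type."""
    ext = Path(path).suffix.lower()
    if ext in OFFICE_EXTS:
        return "markitdown"
    if ext in PDF_EXTS:
        return "docling"
    return "unstructured"  # catch-all: HTML, email, images, etc.

print(pick_parser("deck.PPTX"))  # markitdown
print(pick_parser("paper.pdf"))  # docling
```

Because the routing is this thin, swapping one layer (say, MinerU instead of Docling for math-heavy PDFs) is a one-line change.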

INSTALL · ONE COMMAND
$ tokrepo install pack/document-ai-pipeline
hand it to your agent — or paste it in your terminal
What's inside

7 assets in this pack

Script#01
Surya — Document OCR for 90+ Languages

Surya is a document OCR toolkit with 19.5K+ GitHub stars. Text recognition in 90+ languages, layout analysis, table detection, reading order, and LaTeX OCR. Benchmarks favorably against cloud OCR services.

by Script Depot·237 views
$ tokrepo install surya-document-ocr-90-languages-66bc0630
Script#02
Zerox — Zero-Shot PDF OCR for AI Pipelines

Extract text from any PDF using vision models as OCR. Zerox converts PDF pages to images then uses GPT-4o or Claude to extract clean markdown without training.

by Script Depot·97 views
$ tokrepo install zerox-zero-shot-pdf-ocr-ai-pipelines-3ac555d9
Script#03
MinerU — Extract LLM-Ready Data from Any Document

Convert PDFs, scans, and complex documents into clean Markdown or JSON for RAG and LLM pipelines. 57K+ GitHub stars.

by Script Depot·119 views
$ tokrepo install mineru-extract-llm-ready-data-any-document-985fe0df
Script#04
Docling — Document Parsing for AI

IBM document parsing library. Converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON. Built for RAG pipelines and LLM ingestion.

by Script Depot·92 views
$ tokrepo install docling-document-parsing-ai-443e86c2
MCP#05
Unstructured — Document ETL for LLM Pipelines

Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.

by MCP Hub·125 views
$ tokrepo install unstructured-document-etl-llm-pipelines-c2ba9909
Skill#06
DocETL — LLM-Powered Document Processing Pipelines

Declarative YAML pipelines for LLM document analysis with map, reduce, and resolve operators. By UC Berkeley. 3.7K+ stars.

by AI Open Source·133 views
$ tokrepo install docetl-llm-powered-document-processing-pipelines-ef81583e
Config#07
MarkItDown — Convert Any File to Markdown for LLMs

Python library by Microsoft that converts PDF, DOCX, PPTX, XLSX, images, audio, and HTML to clean Markdown. Perfect for feeding documents into LLM context windows. 8K+ stars.

by Microsoft AI·114 views
$ tokrepo install markitdown-convert-any-file-markdown-llms-6fdc90c2
FAQ

Frequently asked questions

Is this stack free?

All seven parsers are open-source under MIT, Apache 2.0, or BSD. Self-hosting is free. The hidden cost is GPU time for the vision-based parsers (Surya, Zerox) and LLM API fees if you use Zerox or DocETL with hosted models. CPU-only options (Docling, MarkItDown, Unstructured) are essentially free at any scale.

Docling vs Unstructured — which should I pick?

Docling if you want a single parser that handles PDF/DOCX/HTML/PPTX with a unified output format and IBM's quality bar. Unstructured if you need deep enterprise integrations (S3, SharePoint, Azure connectors), pluggable chunking strategies, and don't mind a steeper config surface. Many teams run both: Docling for parse, Unstructured for chunking.

Will these work with Cursor or Codex CLI?

Yes — Docling, Unstructured, and MarkItDown have MCP servers or are exposed as CLI tools that any AI agent can invoke. Drop the MCP definition into your Cursor settings and the LLM can convert a dropped PDF to markdown on the fly. Surya and Zerox are heavier (GPU-resident) and usually run as a separate microservice.

How is this different from the AI Web Scraping pack?

Web scraping starts from a URL. Document AI starts from a file. The output of both is LLM-ready text, but the input shape is fundamentally different. Most production RAG corpora need both — your knowledge base has internal PDFs and a public docs site. Install both packs in that case.

What's the operational gotcha?

Throughput planning. Vision-based parsing (Surya, Zerox, MinerU on hard pages) is roughly 1-5 pages per second on a single GPU. If you have 100k pages to ingest, that's hours-to-days. Run a small benchmark before committing — many teams discover too late that their backfill takes a weekend, not an afternoon.
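The arithmetic behind that benchmark is worth doing explicitly before you commit. A back-of-envelope estimator, using the throughput range quoted above:

```python
def backfill_hours(pages, pages_per_sec, gpus=1):
    """Rough wall-clock estimate for an ingest backfill,
    assuming perfectly parallel GPUs (optimistic)."""
    return pages / (pages_per_sec * gpus) / 3600

# 100k pages at the slow end of vision parsing, one GPU:
print(f"{backfill_hours(100_000, 1):.1f} h")  # ~27.8 h, more than a day
# Same corpus at 5 pages/sec across 4 GPUs:
print(f"{backfill_hours(100_000, 5, gpus=4):.1f} h")  # ~1.4 h
```

Run the real benchmark on a few hundred representative pages first; hard pages (dense tables, scans) can land well below the quoted rate.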
