Scripts · Apr 2, 2026 · 3 min read

MinerU — Extract LLM-Ready Data from Any Document

Convert PDFs, scans, and complex documents into clean Markdown or JSON for RAG and LLM pipelines. 57K+ GitHub stars.

Agent ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, install contract, metadata JSON, adapter-aware plan, and raw content links so agents can judge fit, risk, and next actions.

Stage only · 17/100

Agent surface: Any MCP/CLI agent
Kind: Script
Install: Stage only
Trust: Established
Entrypoint: mineru.md

Universal CLI install command

npx tokrepo install 985fe0df-6ec5-4fd6-8d3d-3c1627b0e18d


TL;DR

MinerU converts PDFs and scans into clean Markdown or JSON for RAG and LLM pipelines.

§01

What it is

MinerU is an open-source document extraction tool that converts PDFs, scanned images, and complex documents into clean Markdown or structured JSON. It handles tables, formulas, images, and multi-column layouts that break simpler parsers. The output is designed to feed directly into RAG pipelines and LLM applications.

MinerU targets AI engineers building RAG systems, data scientists who need to process research papers at scale, and teams that ingest large document corpora for LLM training or retrieval.

§02

How it saves time or tokens

Raw PDF text extraction often produces garbled output: broken table structures, missing formula text, and mangled multi-column layouts. MinerU's layout-aware parsing preserves document structure, which means your LLM receives clean, well-formatted context instead of noisy text. This reduces wasted tokens on malformed input and improves retrieval accuracy in RAG pipelines.

For batch processing, MinerU handles entire document directories in one command, eliminating manual per-file conversion workflows.

§03

How to use

Install MinerU via pip: pip install "magic-pdf[full]" (the quotes keep shells such as zsh from expanding the brackets). Download the required model weights as documented in the README.
Run extraction on a PDF: magic-pdf -p input.pdf -o output/ -m auto. The -m auto flag lets MinerU choose between text-based and OCR-based extraction for your document.
Find the results in the output/ directory: Markdown files, with extracted images in a companion folder. Use the structured JSON output for programmatic pipelines.
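
The steps above can also be scripted. The following is a minimal Python sketch, assuming magic-pdf is installed and on PATH. The helper names build_command and extract_all are hypothetical, not part of MinerU, and the only flags used are the ones shown on this page (-p, -o, -m).

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf: Path, out_dir: Path, mode: str = "auto") -> list[str]:
    """Return the argv for one extraction, mirroring the command shown above."""
    return ["magic-pdf", "-p", str(pdf), "-o", str(out_dir), "-m", mode]


def extract_all(src_dir: Path, out_dir: Path, dry_run: bool = False) -> list[list[str]]:
    """Plan (and, unless dry_run is set, run) one extraction per PDF in src_dir."""
    commands = [build_command(pdf, out_dir) for pdf in sorted(src_dir.glob("*.pdf"))]
    if dry_run:
        return commands
    if shutil.which("magic-pdf") is None:
        raise RuntimeError("magic-pdf is not on PATH; install it first")
    out_dir.mkdir(parents=True, exist_ok=True)
    for cmd in commands:
        # check=True raises CalledProcessError if an extraction fails.
        subprocess.run(cmd, check=True)
    return commands
```

Calling extract_all(Path("papers"), Path("output"), dry_run=True) returns the planned commands without running anything, which is a cheap way to review a batch before committing GPU or CPU time to it.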

§04

Example

# Extract a research paper to Markdown
magic-pdf -p research-paper.pdf -o ./output -m auto

# Output structure:
# output/
#   research-paper/
#     auto/
#       research-paper.md      # Clean Markdown
#       images/                 # Extracted figures
#       research-paper.json     # Structured content

The Markdown output preserves headings, tables as Markdown tables, LaTeX formulas, and image references with extracted figure files.
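
To pick the results up from a script, the layout above can be turned into paths. This is a sketch only: expected_outputs and load_markdown are hypothetical helpers, and the file names simply follow the directory tree shown on this page. Actual names, especially for the JSON files, may differ between MinerU versions, so check a real output directory before relying on them.

```python
from pathlib import Path


def expected_outputs(pdf_name: str, out_dir: Path, mode: str = "auto") -> dict[str, Path]:
    """Map each output kind to its path, following the layout sketched above."""
    stem = Path(pdf_name).stem
    base = out_dir / stem / mode
    return {
        "markdown": base / f"{stem}.md",
        "images": base / "images",
        "json": base / f"{stem}.json",  # assumed name, taken from the tree above
    }


def load_markdown(pdf_name: str, out_dir: Path) -> str:
    """Read the extracted Markdown, failing loudly if extraction has not run."""
    path = expected_outputs(pdf_name, out_dir)["markdown"]
    if not path.is_file():
        raise FileNotFoundError(f"no Markdown output at {path}")
    return path.read_text(encoding="utf-8")
```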

§05

Related on TokRepo

AI tools for RAG — Tools for building retrieval-augmented generation pipelines
AI tools for documents — Document processing and extraction solutions

§06

Common pitfalls

Model weights are large (several GB). The initial download takes time and disk space. Ensure you have sufficient storage before installing.
Scanned PDFs with poor image quality produce lower-accuracy extractions. MinerU works best with high-resolution scans or born-digital PDFs.
GPU acceleration significantly speeds up processing. CPU-only mode works but is much slower for large document batches.
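
The storage pitfall can be checked before downloading anything. A small sketch using only the Python standard library follows. The 10 GB figure in the usage note is an illustrative assumption, since this page says only "several GB", and has_free_space is a hypothetical helper.

```python
import shutil
from pathlib import Path


def has_free_space(path: Path, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least required_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3
```

For example, has_free_space(Path.home(), 10) checks the home directory's filesystem against an assumed 10 GB budget. Adjust the number to the model sizes listed in the README.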

Frequently Asked Questions

What document formats does MinerU support?

MinerU primarily targets PDF files, including both born-digital PDFs and scanned documents. It handles tables, mathematical formulas (LaTeX), multi-column layouts, and embedded images. Some formats like DOCX can be converted to PDF first for processing.

How does MinerU compare to other PDF parsers?

MinerU uses deep learning models for layout detection, which gives it an advantage on complex documents with tables, formulas, and multi-column layouts. Simpler parsers like PyPDF or pdfplumber work well for basic text extraction but struggle with complex layouts.

Does MinerU require a GPU?

No. MinerU runs on CPU, but processing is significantly slower. For large batches or production workloads, GPU acceleration is strongly recommended.

Can MinerU handle multi-language documents?

Yes. MinerU's layout detection is language-agnostic, and it handles documents in English, Chinese, and other languages. The OCR component for scanned documents supports multiple languages through its underlying OCR engine.

How do I integrate MinerU output with a RAG pipeline?

MinerU outputs clean Markdown or structured JSON. You can chunk the Markdown output using standard text splitters (by heading, paragraph, or token count), generate embeddings, and store them in a vector database. The structured JSON format provides pre-segmented content blocks.
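
As an illustration of the chunking step, here is a minimal heading-based splitter in plain Python. It is a sketch under stated assumptions, not MinerU's or any framework's API: chunk_by_heading is a hypothetical helper, the 2000-character cap is arbitrary, and it does not special-case fenced code blocks. Embedding and vector storage are left to whichever library you already use.

```python
import re

# An ATX heading: one to six '#' characters, whitespace, then text.
HEADING = re.compile(r"^#{1,6}\s+\S")


def chunk_by_heading(markdown: str, max_chars: int = 2000) -> list[str]:
    """Split Markdown into chunks that start at headings, capped at max_chars."""
    sections: list[list[str]] = [[]]
    for line in markdown.splitlines():
        # Start a new section at each heading, unless the current one is empty.
        if HEADING.match(line) and sections[-1]:
            sections.append([])
        sections[-1].append(line)

    chunks: list[str] = []
    for lines in sections:
        text = "\n".join(lines).strip()
        # Oversized sections are cut at fixed character offsets.
        for start in range(0, len(text), max_chars):
            piece = text[start:start + max_chars].strip()
            if piece:
                chunks.append(piece)
    return chunks
```

Splitting at headings keeps each chunk tied to one topic, which tends to suit retrieval better than fixed-size windows. The character cap is only a fallback for very long sections, and a token-based limit would be the natural refinement.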




Source & Thanks

Created by OpenDataLab. Licensed under AGPL-3.0.

MinerU — ⭐ 57,900+

Thanks to the OpenDataLab team for making high-quality document extraction accessible to the AI community.

