Scripts · Apr 2, 2026 · 3 min read

MinerU — Extract LLM-Ready Data from Any Document

Convert PDFs, scans, and complex documents into clean Markdown or JSON for RAG and LLM pipelines. 57K+ GitHub stars.

Agent ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, install contract, metadata JSON, adapter-aware plan, and raw content links so agents can judge fit, risk, and next actions.

Stage only · 17/100
Agent surface: Any MCP/CLI agent
Kind: Script
Install: Stage only
Trust: Established
Entrypoint: mineru.md
Universal CLI install command: npx tokrepo install 985fe0df-6ec5-4fd6-8d3d-3c1627b0e18d
TL;DR
MinerU converts PDFs and scans into clean Markdown or JSON for RAG and LLM pipelines.
§01

What it is

MinerU is an open-source document extraction tool that converts PDFs, scanned images, and complex documents into clean Markdown or structured JSON. It handles tables, formulas, images, and multi-column layouts that break simpler parsers. The output is designed to feed directly into RAG pipelines and LLM applications.

MinerU targets AI engineers building RAG systems, data scientists who need to process research papers at scale, and teams that ingest large document corpora for LLM training or retrieval.

§02

How it saves time or tokens

Raw PDF text extraction often produces garbled output: broken table structures, missing formula text, and mangled multi-column layouts. MinerU's layout-aware parsing preserves document structure, which means your LLM receives clean, well-formatted context instead of noisy text. This reduces wasted tokens on malformed input and improves retrieval accuracy in RAG pipelines.

For batch processing, MinerU handles entire document directories in one command, eliminating manual per-file conversion workflows.

§03

How to use

  1. Install MinerU via pip: pip install "magic-pdf[full]" (the quotes stop shells such as zsh from expanding the brackets). Download the required model weights as documented in the README. Newer MinerU releases rename the package and CLI to mineru, so check the README for the version you install.
  2. Run extraction on a PDF: magic-pdf -p input.pdf -o output/ -m auto. The -m auto flag lets MinerU choose between text-based and OCR extraction for the document.
  3. Find the results in the output/ directory: Markdown files, with extracted images in a companion folder. Use the structured JSON output for programmatic pipelines.
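
For scripted use, step 2 can be wrapped in a few lines of Python. This is a minimal sketch, assuming the magic-pdf CLI from step 1 is on your PATH; the helper names build_extract_command and extract are hypothetical and are not part of MinerU.

```python
import shutil
import subprocess
from pathlib import Path


def build_extract_command(pdf_path, output_dir, mode="auto"):
    """Return the argument list for the CLI call shown in step 2."""
    return ["magic-pdf", "-p", str(pdf_path), "-o", str(output_dir), "-m", mode]


def extract(pdf_path, output_dir, mode="auto"):
    """Run the extraction and return the output directory as a Path."""
    if shutil.which("magic-pdf") is None:
        raise RuntimeError("magic-pdf is not on PATH; complete step 1 first")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    subprocess.run(build_extract_command(pdf_path, out, mode), check=True)
    return out
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.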
§04

Example

# Extract a research paper to Markdown
magic-pdf -p research-paper.pdf -o ./output -m auto

# Output structure:
# output/
#   research-paper/
#     auto/
#       research-paper.md      # Clean Markdown
#       images/                 # Extracted figures
#       research-paper.json     # Structured content

The Markdown output preserves headings, tables as Markdown tables, LaTeX formulas, and image references with extracted figure files.
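
A downstream script usually needs to find these files before it can use them. The sketch below assumes the directory layout shown in the example above (output/<name>/<mode>/); the function name locate_outputs is hypothetical, and JSON files are matched by extension because the exact file names may differ between MinerU versions.

```python
from pathlib import Path


def locate_outputs(output_dir, pdf_stem, mode="auto"):
    """Collect the Markdown file, image files, and JSON files for one document."""
    base = Path(output_dir) / pdf_stem / mode
    markdown = base / f"{pdf_stem}.md"
    if not markdown.is_file():
        raise FileNotFoundError(f"expected Markdown at {markdown}")
    images_dir = base / "images"
    images = []
    if images_dir.is_dir():
        images = sorted(p for p in images_dir.iterdir() if p.is_file())
    # Match any JSON file rather than assuming one specific file name.
    json_files = sorted(base.glob("*.json"))
    return {"markdown": markdown, "images": images, "json": json_files}
```

For the example above, locate_outputs("./output", "research-paper") would return the paths under output/research-paper/auto/.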

§06

Common pitfalls

  • Model weights are large (several GB). The initial download takes time and disk space. Ensure you have sufficient storage before installing.
  • Scanned PDFs with poor image quality produce lower-accuracy extractions. MinerU works best with high-resolution scans or born-digital PDFs.
  • GPU acceleration significantly speeds up processing. CPU-only mode works but is much slower for large document batches.
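
The storage pitfall can be checked before downloading anything. This is a small sketch using only the Python standard library; the helper name has_free_space is hypothetical, and the threshold you pass is your own estimate, since the text above only says the weights take several GB.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least required_gb GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3
```

For example, has_free_space(".", 10) checks the current directory's filesystem against an assumed 10 GiB budget.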

Frequently Asked Questions

What document formats does MinerU support?

MinerU primarily targets PDF files, including both born-digital PDFs and scanned documents. It handles tables, mathematical formulas (LaTeX), multi-column layouts, and embedded images. Some formats like DOCX can be converted to PDF first for processing.
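
One way to do the DOCX-to-PDF step is LibreOffice in headless mode. The sketch below is an assumption, not something the MinerU project prescribes: it presumes the soffice binary is installed and on your PATH, and the helper names are hypothetical.

```python
from pathlib import Path


def build_docx_to_pdf_command(docx_path, out_dir):
    """Return a LibreOffice headless command that converts one file to PDF."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(docx_path)]


def expected_pdf_path(docx_path, out_dir):
    """LibreOffice keeps the input's base name and swaps the extension."""
    return Path(out_dir) / (Path(docx_path).stem + ".pdf")
```

Run the command with subprocess.run(..., check=True), then pass the resulting PDF to MinerU as usual.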

How does MinerU compare to other PDF parsers?

MinerU uses deep learning models for layout detection, which gives it an advantage on complex documents with tables, formulas, and multi-column layouts. Simpler parsers such as pypdf or pdfplumber work well for basic text extraction but struggle with complex layouts.

Does MinerU require a GPU?

No. MinerU runs on CPU, but processing is significantly slower. A GPU is strongly recommended for production workloads and large batches.

Can MinerU handle multi-language documents?

Yes. MinerU's layout detection is language-agnostic, and it handles documents in English, Chinese, and other languages. The OCR component for scanned documents supports multiple languages through its underlying OCR engine.

How do I integrate MinerU output with a RAG pipeline?

MinerU outputs clean Markdown or structured JSON. You can chunk the Markdown output using standard text splitters (by heading, paragraph, or token count), generate embeddings, and store them in a vector database. The structured JSON format provides pre-segmented content blocks.
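
Heading-based chunking can be sketched as follows. This is an illustrative splitter written for this page, not a MinerU API: the function name chunk_by_heading and the max_chars limit are assumptions, and a production pipeline would more likely count tokens than characters.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown_text, max_chars=2000):
    """Split Markdown into chunks, one per heading section.

    Sections longer than max_chars are split again on blank lines.
    Lines inside fenced code blocks are never treated as headings.
    """
    chunks = []
    title, buffer = "", []
    in_fence = False

    def flush():
        text = "\n".join(buffer).strip()
        if not text:
            return
        piece = ""
        for para in text.split("\n\n"):
            if piece and len(piece) + len(para) + 2 > max_chars:
                chunks.append({"heading": title, "text": piece})
                piece = para
            else:
                piece = f"{piece}\n\n{para}" if piece else para
        chunks.append({"heading": title, "text": piece})

    for line in markdown_text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            title, buffer = match.group(2).strip(), []
        else:
            buffer.append(line)
    flush()
    return chunks
```

Each returned dictionary carries the section heading alongside its text, so the heading can be stored as metadata with the embedding and used to filter or label retrieved chunks.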


Source & Thanks

Created by OpenDataLab. Licensed under AGPL-3.0.

MinerU — ⭐ 57,900+

Thanks to the OpenDataLab team for making high-quality document extraction accessible to the AI community.
