MinerU — Extract LLM-Ready Data from Any Document
Convert PDFs, scans, and complex documents into clean Markdown or JSON for RAG and LLM pipelines. 57K+ GitHub stars.
This asset can be read and installed directly by agents
TokRepo exposes a universal CLI command, install contract, metadata JSON, adapter-aware plan, and raw content links so agents can judge fit, risk, and next actions.
npx tokrepo install 985fe0df-6ec5-4fd6-8d3d-3c1627b0e18d
What it is
MinerU is an open-source document extraction tool that converts PDFs, scanned images, and complex documents into clean Markdown or structured JSON. It handles tables, formulas, images, and multi-column layouts that break simpler parsers. The output is designed to feed directly into RAG pipelines and LLM applications.
MinerU targets AI engineers building RAG systems, data scientists who need to process research papers at scale, and teams that ingest large document corpora for LLM training or retrieval.
How it saves time or tokens
Raw PDF text extraction often produces garbled output: broken table structures, missing formula text, and mangled multi-column layouts. MinerU's layout-aware parsing preserves document structure, which means your LLM receives clean, well-formatted context instead of noisy text. This reduces wasted tokens on malformed input and improves retrieval accuracy in RAG pipelines.
For batch processing, MinerU handles entire document directories in one command, eliminating manual per-file conversion workflows.
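As an illustration of the batch workflow, here is a minimal Python sketch that builds one magic-pdf command per PDF in a folder, using only the -p, -o, and -m flags shown on this page. The helper names build_commands and run_batch are hypothetical and not part of MinerU. Invoking the CLI once per file is a choice made for this sketch so that a failure on one document does not stop the rest.

```python
import subprocess
from pathlib import Path


def build_commands(input_dir, output_dir, mode="auto"):
    """Return one magic-pdf argument list per PDF found directly in input_dir."""
    pdfs = sorted(Path(input_dir).glob("*.pdf"))
    return [
        ["magic-pdf", "-p", str(pdf), "-o", str(output_dir), "-m", mode]
        for pdf in pdfs
    ]


def run_batch(input_dir, output_dir, mode="auto"):
    """Run each command in turn and return the PDF paths whose conversion failed."""
    failed = []
    for cmd in build_commands(input_dir, output_dir, mode):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            failed.append(cmd[2])  # cmd[2] is the PDF path passed to -p
    return failed
```

Calling run_batch("papers", "output") requires magic-pdf to be installed and on the PATH; build_commands can be inspected on its own first to check which files would be processed.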
How to use
- Install MinerU via pip: pip install magic-pdf[full]. Download the required model weights as documented in the README.
- Run extraction on a PDF: magic-pdf -p input.pdf -o output/ -m auto. The -m auto flag automatically selects the best extraction mode for your document.
- Find the output in the output/ directory as Markdown files with extracted images in a companion folder. Use the structured JSON output for programmatic pipelines.
Example
# Extract a research paper to Markdown
magic-pdf -p research-paper.pdf -o ./output -m auto
# Output structure:
# output/
#   research-paper/
#     auto/
#       research-paper.md    # Clean Markdown
#       images/              # Extracted figures
#       research-paper.json  # Structured content
The Markdown output preserves headings, tables as Markdown tables, LaTeX formulas, and image references with extracted figure files.
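For downstream scripts, the layout in the example can be turned into paths with a small helper. This is a sketch that assumes the directory structure exactly as shown in the example (output directory, then document name, then mode); the function name locate_outputs is hypothetical, and file names may differ between MinerU versions.

```python
from pathlib import Path


def locate_outputs(output_dir, stem, mode="auto"):
    """Return paths of the Markdown, JSON, and image files for one converted document.

    Assumes the layout <output_dir>/<stem>/<mode>/ shown in the example.
    """
    doc_dir = Path(output_dir) / stem / mode
    images_dir = doc_dir / "images"
    return {
        "markdown": doc_dir / f"{stem}.md",
        "json": doc_dir / f"{stem}.json",
        # An absent images folder simply yields an empty list.
        "images": sorted(images_dir.iterdir()) if images_dir.is_dir() else [],
    }
```

For the example above, locate_outputs("./output", "research-paper") would point at the Markdown file, the JSON file, and whatever files sit in the images folder.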
Related on TokRepo
- AI tools for RAG — Tools for building retrieval-augmented generation pipelines
- AI tools for documents — Document processing and extraction solutions
Common pitfalls
- Model weights are large (several GB). The initial download takes time and disk space. Ensure you have sufficient storage before installing.
- Scanned PDFs with poor image quality produce lower-accuracy extractions. MinerU works best with high-resolution scans or born-digital PDFs.
- GPU acceleration significantly speeds up processing. CPU-only mode works but is much slower for large document batches.
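The storage pitfall above can be checked before installing. The sketch below uses Python's standard shutil.disk_usage; the helper name has_free_space is hypothetical, and the required size is left as a parameter because this page only says "several GB". Take the actual figure from the MinerU README.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem holding path has at least required_gb GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3
```

A call such as has_free_space(".", 10) uses 10 only as a placeholder threshold for illustration, not as a documented requirement.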
Frequently Asked Questions
What document formats does MinerU support?
MinerU primarily targets PDF files, including both born-digital PDFs and scanned documents. It handles tables, mathematical formulas (LaTeX), multi-column layouts, and embedded images. Some formats, such as DOCX, can be converted to PDF first for processing.
How does MinerU compare to other PDF parsers?
MinerU uses deep learning models for layout detection, which gives it an advantage on complex documents with tables, formulas, and multi-column layouts. Simpler parsers like PyPDF or pdfplumber work well for basic text extraction but struggle with complex layouts.
Does MinerU require a GPU?
No, but a GPU is recommended. MinerU runs on CPU, though processing is significantly slower. For production workloads with many documents, GPU acceleration is strongly recommended.
Does MinerU handle non-English documents?
Yes. MinerU's layout detection is language-agnostic, and it handles documents in English, Chinese, and other languages. The OCR component for scanned documents supports multiple languages through its underlying OCR engine.
How do I use MinerU output in a RAG pipeline?
MinerU outputs clean Markdown or structured JSON. You can chunk the Markdown output using standard text splitters (by heading, paragraph, or token count), generate embeddings, and store them in a vector database. The structured JSON format provides pre-segmented content blocks.
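Chunking by heading, one of the splitting strategies mentioned above, can be sketched in plain Python without any RAG framework. The function split_by_heading is a hypothetical helper, not part of MinerU. It treats every line starting with one to six # characters as a heading, so a # comment inside a fenced code block would also start a new chunk; a production splitter would need to handle that case.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown):
    """Split Markdown into chunks, starting a new chunk at every heading line."""
    chunks = []
    current = {"heading": None, "lines": []}

    def flush(chunk):
        # Keep a chunk if it has a heading or any non-blank text.
        if chunk["heading"] is not None or any(l.strip() for l in chunk["lines"]):
            chunks.append(chunk)

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush(current)
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    flush(current)

    return [
        {"heading": c["heading"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]
```

Each returned chunk carries its heading alongside the body text, so the heading can be stored as metadata next to the embedding in whichever vector database you use.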
Source & Thanks
Created by OpenDataLab. Licensed under AGPL-3.0.
MinerU — ⭐ 57,900+
Thanks to the OpenDataLab team for making high-quality document extraction accessible to the AI community.