MinerU Architecture & Features
Processing Pipeline
Input Document (PDF/Scan/Image)
│
├─ Layout Detection (LayoutLMv3)
│ └─ Identify: text, tables, figures, formulas, headers
│
├─ OCR Engine (PaddleOCR / Tesseract)
│ └─ Extract text from scanned pages
│
├─ Table Recognition
│ └─ Convert tables to Markdown/HTML/LaTeX
│
├─ Formula Recognition (UniMERNet)
│ └─ Convert math formulas to LaTeX
│
└─ Output Assembly
├─ Markdown (with images)
├─ JSON (structured blocks)
└─ Content list (flat text)Key Capabilities
| Feature | Description |
|---|---|
| Layout-aware parsing | Detects headers, paragraphs, tables, figures, formulas using deep learning models |
| Multi-column support | Correctly handles 2-column academic papers and complex layouts |
| Table extraction | Converts tables to Markdown, HTML, or LaTeX with cell merging support |
| Formula recognition | Converts mathematical formulas to LaTeX notation |
| OCR integration | PaddleOCR for 80+ languages, fallback to Tesseract |
| Image extraction | Saves embedded images with automatic naming and referencing |
| Reading order | Preserves logical reading order across complex layouts |
| Batch processing | Process hundreds of PDFs concurrently |
Output Formats
Markdown — Clean, LLM-friendly format:
# Chapter 1: Introduction
The experiment results shown in **Table 1** demonstrate...
| Metric | Model A | Model B |
|--------|---------|---------|
| F1 | 0.92 | 0.87 |
The loss function is defined as:
$$L = -\sum_{i} y_i \log(p_i)$$JSON — Structured blocks with metadata:
{
"blocks": [
{"type": "title", "text": "Chapter 1: Introduction", "level": 1},
{"type": "text", "text": "The experiment results..."},
{"type": "table", "cells": [["Metric", "Model A"], ["F1", "0.92"]]},
{"type": "equation", "latex": "L = -\\sum_{i} y_i \\log(p_i)"}
]
}CLI Commands
# Auto-detect PDF type (text vs scanned)
magic-pdf -p paper.pdf -o output/ -m auto
# Force OCR mode for scanned documents
magic-pdf -p scan.pdf -o output/ -m ocr
# Text-only mode (faster, no OCR)
magic-pdf -p textbook.pdf -o output/ -m txt
# Batch process a directory
magic-pdf -p papers/ -o output/ -m autoPerformance Benchmarks
Tested on academic papers, financial reports, and legal documents:
- Text extraction accuracy: 95%+ on clean PDFs
- Table recognition: 90%+ F1 on complex tables
- Processing speed: ~2-5 pages/second on GPU, ~0.5-1 page/second on CPU
- Language support: 80+ languages via PaddleOCR
FAQ
Q: What is MinerU? A: MinerU is an open-source document extraction tool with 57,900+ GitHub stars that converts PDFs and scanned documents into clean Markdown or JSON for LLM and RAG applications, with high-fidelity layout detection, table extraction, and formula recognition.
Q: How is MinerU different from Docling or Marker? A: MinerU focuses on layout-aware extraction with deep learning models (LayoutLMv3) and excels at complex multi-column academic papers. Docling (IBM) has broader format support. Marker is faster but less accurate on complex layouts. MinerU has the highest star count (57K+) and strongest community.
Q: Is MinerU free? A: Yes, open-source under AGPL-3.0. Free for personal and academic use. Commercial use requires compliance with AGPL terms or a commercial license.