# MinerU — Extract LLM-Ready Data from Any Document

> Convert PDFs, scans, and complex documents into clean Markdown or JSON for RAG and LLM pipelines. 57K+ GitHub stars.

## Install

Save as a script file and run:

# MinerU — Extract LLM-Ready Data from Any Document

## Quick Use

```bash
pip install magic-pdf[full]
```

```bash
# Convert a PDF to Markdown
magic-pdf -p input.pdf -o output/ -m auto
```

```python
from magic_pdf.data.data_reader_writer import FileBasedDataWriter
from magic_pdf.pipe.UNIPipe import UNIPipe

# Programmatic usage
with open("input.pdf", "rb") as f:
    pdf_bytes = f.read()

pipe = UNIPipe(pdf_bytes, model_list=[], image_writer=FileBasedDataWriter("output/images"))
pipe.pipe_classify()
pipe.pipe_analyze()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown("output/images")
print(md_content)
```

---

## Intro

MinerU is an open-source document extraction tool by OpenDataLab with 57,900+ GitHub stars, purpose-built for converting complex documents into LLM-ready formats. It handles PDFs, scanned documents, and multi-column layouts with high fidelity, preserving tables, formulas, images, and reading order. Output is clean Markdown or structured JSON ready for RAG pipelines, fine-tuning datasets, or direct LLM consumption. It powers document understanding for thousands of AI applications in production.

Works with: Any LLM (GPT-4, Claude, Gemini, Llama), LangChain, LlamaIndex, Haystack RAG pipelines. Best for teams building document-heavy AI applications. Setup time: under 5 minutes.

---

## MinerU Architecture & Features

### Processing Pipeline

```
Input Document (PDF/Scan/Image)
    │
    ├─ Layout Detection (LayoutLMv3)
    │   └─ Identify: text, tables, figures, formulas, headers
    │
    ├─ OCR Engine (PaddleOCR / Tesseract)
    │   └─ Extract text from scanned pages
    │
    ├─ Table Recognition
    │   └─ Convert tables to Markdown/HTML/LaTeX
    │
    ├─ Formula Recognition (UniMERNet)
    │   └─ Convert math formulas to LaTeX
    │
    └─ Output Assembly
        ├─ Markdown (with images)
        ├─ JSON (structured blocks)
        └─ Content list (flat text)
```

### Key Capabilities

| Feature | Description |
|---------|-------------|
| **Layout-aware parsing** | Detects headers, paragraphs, tables, figures, formulas using deep learning models |
| **Multi-column support** | Correctly handles 2-column academic papers and complex layouts |
| **Table extraction** | Converts tables to Markdown, HTML, or LaTeX with cell merging support |
| **Formula recognition** | Converts mathematical formulas to LaTeX notation |
| **OCR integration** | PaddleOCR for 80+ languages, fallback to Tesseract |
| **Image extraction** | Saves embedded images with automatic naming and referencing |
| **Reading order** | Preserves logical reading order across complex layouts |
| **Batch processing** | Process hundreds of PDFs concurrently |

### Output Formats

**Markdown** — Clean, LLM-friendly format:
```markdown
# Chapter 1: Introduction

The experiment results shown in **Table 1** demonstrate...

| Metric | Model A | Model B |
|--------|---------|---------|
| F1     | 0.92    | 0.87    |

The loss function is defined as:

$$L = -\sum_{i} y_i \log(p_i)$$
```

**JSON** — Structured blocks with metadata:
```json
{
  "blocks": [
    {"type": "title", "text": "Chapter 1: Introduction", "level": 1},
    {"type": "text", "text": "The experiment results..."},
    {"type": "table", "cells": [["Metric", "Model A"], ["F1", "0.92"]]},
    {"type": "equation", "latex": "L = -\\sum_{i} y_i \\log(p_i)"}
  ]
}
```

### CLI Commands

```bash
# Auto-detect PDF type (text vs scanned)
magic-pdf -p paper.pdf -o output/ -m auto

# Force OCR mode for scanned documents
magic-pdf -p scan.pdf -o output/ -m ocr

# Text-only mode (faster, no OCR)
magic-pdf -p textbook.pdf -o output/ -m txt

# Batch process a directory
magic-pdf -p papers/ -o output/ -m auto
```

### Performance Benchmarks

Tested on academic papers, financial reports, and legal documents:
- **Text extraction accuracy**: 95%+ on clean PDFs
- **Table recognition**: 90%+ F1 on complex tables
- **Processing speed**: ~2-5 pages/second on GPU, ~0.5-1 page/second on CPU
- **Language support**: 80+ languages via PaddleOCR

---

## FAQ

**Q: What is MinerU?**
A: MinerU is an open-source document extraction tool with 57,900+ GitHub stars that converts PDFs and scanned documents into clean Markdown or JSON for LLM and RAG applications, with high-fidelity layout detection, table extraction, and formula recognition.

**Q: How is MinerU different from Docling or Marker?**
A: MinerU focuses on layout-aware extraction with deep learning models (LayoutLMv3) and excels at complex multi-column academic papers. Docling (IBM) has broader format support. Marker is faster but less accurate on complex layouts. MinerU has the highest star count (57K+) and strongest community.

**Q: Is MinerU free?**
A: Yes, open-source under AGPL-3.0. Free for personal and academic use. Commercial use requires compliance with AGPL terms or a commercial license.

---

## Source & Thanks

> Created by [OpenDataLab](https://github.com/opendatalab). Licensed under AGPL-3.0.
>
> [MinerU](https://github.com/opendatalab/MinerU) — ⭐ 57,900+

Thanks to the OpenDataLab team for making high-quality document extraction accessible to the AI community.

---

<!-- ZH -->

## 快速使用

```bash
pip install magic-pdf[full]
```

```bash
# 将 PDF 转换为 Markdown
magic-pdf -p input.pdf -o output/ -m auto
```

---

## 简介

MinerU 是 OpenDataLab 开源的文档提取工具，拥有 57,900+ GitHub stars，专为将复杂文档转换成 LLM 可用格式而设计。支持 PDF、扫描件和多栏布局，高保真保留表格、公式、图片和阅读顺序。输出为干净的 Markdown 或结构化 JSON，可直接用于 RAG 管线、微调数据集或 LLM 消费。

适用于：GPT-4、Claude、Gemini、Llama 等任何 LLM，以及 LangChain、LlamaIndex 等 RAG 框架。适合构建文档密集型 AI 应用的团队。

---

## 核心功能

### 布局感知解析
使用深度学习模型（LayoutLMv3）检测标题、段落、表格、图片和公式。

### 表格提取
将表格转换为 Markdown、HTML 或 LaTeX，支持合并单元格。

### 公式识别
使用 UniMERNet 将数学公式转换为 LaTeX 表示。

### OCR 集成
PaddleOCR 支持 80+ 语言，Tesseract 作为后备方案。

### 批量处理
支持并发处理数百个 PDF 文件。

---

## 来源与感谢

> Created by [OpenDataLab](https://github.com/opendatalab). Licensed under AGPL-3.0.
>
> [MinerU](https://github.com/opendatalab/MinerU) — ⭐ 57,900+


---
Source: https://tokrepo.com/en/workflows/985fe0df-6ec5-4fd6-8d3d-3c1627b0e18d
Author: Script Depot