# MinerU: Extract LLM-Ready Data from Any Document

> Convert PDFs, scans, and complex documents into clean Markdown or JSON for RAG and LLM pipelines. 57K+ GitHub stars.

## Quick Use

```bash
pip install "magic-pdf[full]"
```

```bash
# Convert a PDF to Markdown
magic-pdf -p input.pdf -o output/ -m auto
```

```python
from magic_pdf.data.data_reader_writer import FileBasedDataWriter
from magic_pdf.pipe.UNIPipe import UNIPipe

# Programmatic usage. The pipe API has changed between magic-pdf releases;
# this follows the 0.x UNIPipe form, so check it against your installed version.
with open("input.pdf", "rb") as f:
    pdf_bytes = f.read()

# Extracted images are written under this directory.
image_writer = FileBasedDataWriter("output/images")

# No precomputed model output is supplied, so model_list stays empty.
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)

pipe.pipe_classify()  # decide between text-based and scanned (OCR) handling
pipe.pipe_analyze()   # run the layout, OCR, and formula models
pipe.pipe_parse()     # assemble the parsed document structure

# The argument is the image path prefix used inside the generated Markdown.
md_content = pipe.pipe_mk_markdown("output/images")
print(md_content)
```

---

## Intro

MinerU is an open-source document extraction tool by OpenDataLab with 57,900+ GitHub stars, purpose-built for converting complex documents into LLM-ready formats. It handles PDFs, scanned documents, and multi-column layouts with high fidelity, preserving tables, formulas, images, and reading order. Output is clean Markdown or structured JSON ready for RAG pipelines, fine-tuning datasets, or direct LLM consumption. It powers document understanding for thousands of AI applications in production.

Works with: any LLM (GPT-4, Claude, Gemini, Llama) and RAG frameworks such as LangChain, LlamaIndex, and Haystack. Best for teams building document-heavy AI applications. Setup time: under 5 minutes.

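
Markdown output destined for a RAG pipeline is usually split into chunks before indexing. The following is a minimal sketch of one way to do that, assuming heading-based chunking suits your corpus. `chunk_markdown_by_heading` is a hypothetical helper written for illustration, not part of MinerU, and production pipelines often add size limits and overlap on top of it. Lines inside fenced code blocks are never treated as headings, so shell comments starting with `#` stay with their section.

```python
import re
from typing import Dict, List

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")
# Built at runtime so this example contains no literal fence marker.
FENCE = "`" * 3


def _append_chunk(chunks: List[Dict[str, str]], title: str, body: List[str]) -> None:
    """Store a section unless it has neither a title nor any text."""
    text = "\n".join(body).strip()
    if title or text:
        chunks.append({"title": title, "text": text})


def chunk_markdown_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into one chunk per heading-delimited section."""
    chunks: List[Dict[str, str]] = []
    title = ""
    body: List[str] = []
    in_fence = False

    for line in markdown.splitlines():
        # Toggle fence state so "#" lines inside code blocks are kept as text.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            _append_chunk(chunks, title, body)
            title = match.group(2).strip()
            body = []
        else:
            body.append(line)

    _append_chunk(chunks, title, body)
    return chunks
```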

---

## MinerU Architecture & Features

### Processing Pipeline

```
Input Document (PDF/Scan/Image)
        │
        ├─ Layout Detection (LayoutLMv3)
        │   └─ Identify: text, tables, figures, formulas, headers
        │
        ├─ OCR Engine (PaddleOCR / Tesseract)
        │   └─ Extract text from scanned pages
        │
        ├─ Table Recognition
        │   └─ Convert tables to Markdown/HTML/LaTeX
        │
        ├─ Formula Recognition (UniMERNet)
        │   └─ Convert math formulas to LaTeX
        │
        └─ Output Assembly
            ├─ Markdown (with images)
            ├─ JSON (structured blocks)
            └─ Content list (flat text)
```

### Key Capabilities

| Feature | Description |
|---------|-------------|
| **Layout-aware parsing** | Detects headers, paragraphs, tables, figures, and formulas using deep learning models |
| **Multi-column support** | Correctly handles 2-column academic papers and complex layouts |
| **Table extraction** | Converts tables to Markdown, HTML, or LaTeX with cell merging support |
| **Formula recognition** | Converts mathematical formulas to LaTeX notation |
| **OCR integration** | PaddleOCR for 80+ languages, with Tesseract as a fallback |
| **Image extraction** | Saves embedded images with automatic naming and referencing |
| **Reading order** | Preserves logical reading order across complex layouts |
| **Batch processing** | Processes hundreds of PDFs concurrently |

### Output Formats

**Markdown** (clean, LLM-friendly format):

```markdown
# Chapter 1: Introduction

The experiment results shown in **Table 1** demonstrate...

| Metric | Model A | Model B |
|--------|---------|---------|
| F1     | 0.92    | 0.87    |

The loss function is defined as:

$$L = -\sum_{i} y_i \log(p_i)$$
```

**JSON** (structured blocks with metadata):

```json
{
  "blocks": [
    {"type": "title", "text": "Chapter 1: Introduction", "level": 1},
    {"type": "text", "text": "The experiment results..."},
    {"type": "table", "cells": [["Metric", "Model A"], ["F1", "0.92"]]},
    {"type": "equation", "latex": "L = -\\sum_{i} y_i \\log(p_i)"}
  ]
}
```

### CLI Commands

```bash
# Auto-detect PDF type (text vs scanned)
magic-pdf -p paper.pdf -o output/ -m auto

# Force OCR mode for scanned documents
magic-pdf -p scan.pdf -o output/ -m ocr

# Text-only mode (faster, no OCR)
magic-pdf -p textbook.pdf -o output/ -m txt

# Batch process a directory
magic-pdf -p papers/ -o output/ -m auto
```

### Performance Benchmarks

Tested on academic papers, financial reports, and legal documents:

- **Text extraction accuracy**: 95%+ on clean PDFs
- **Table recognition**: 90%+ F1 on complex tables
- **Processing speed**: ~2-5 pages/second on GPU, ~0.5-1 page/second on CPU
- **Language support**: 80+ languages via PaddleOCR

---

## FAQ

**Q: What is MinerU?**
A: MinerU is an open-source document extraction tool with 57,900+ GitHub stars. It converts PDFs and scanned documents into clean Markdown or JSON for LLM and RAG applications, with high-fidelity layout detection, table extraction, and formula recognition.

**Q: How is MinerU different from Docling or Marker?**
A: MinerU focuses on layout-aware extraction with deep learning models (LayoutLMv3) and excels at complex multi-column academic papers. Docling (IBM) has broader format support. Marker is faster but less accurate on complex layouts. Of the three, MinerU has the highest star count (57K+).

**Q: Is MinerU free?**
A: Yes. MinerU is open source under AGPL-3.0 and free for personal and academic use. Commercial use must either comply with the AGPL terms or be covered by a commercial license.
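
**Q: How can the JSON output be consumed in code?**
A: One possible approach is to walk the list of blocks and render each one by its `type`. The sketch below assumes the block keys shown in the illustrative JSON under Output Formats (`type`, `text`, `level`, `cells`, `latex`); the files MinerU actually writes may use a different schema, so treat the key names as assumptions. `render_block` and `blocks_to_markdown` are hypothetical helper names, not part of MinerU. Unknown block types are skipped rather than raising, which keeps the renderer tolerant of fields it does not know.

```python
from typing import Any, Dict, List


def render_block(block: Dict[str, Any]) -> str:
    """Render one block as Markdown; return "" for unknown block types."""
    kind = block.get("type")
    if kind == "title":
        # "level" maps to the number of leading "#" characters (default 1).
        return "#" * int(block.get("level", 1)) + " " + block["text"]
    if kind == "text":
        return block["text"]
    if kind == "table":
        rows: List[List[Any]] = block["cells"]
        if not rows:
            return ""
        # Assumption: the first row of "cells" is the header row.
        header, *body = rows
        lines = [
            "| " + " | ".join(str(cell) for cell in header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in body]
        return "\n".join(lines)
    if kind == "equation":
        # Display math, matching the style of the Markdown output example.
        return "$$" + block["latex"] + "$$"
    return ""


def blocks_to_markdown(document: Dict[str, Any]) -> str:
    """Join all rendered blocks with blank lines, dropping empty results."""
    parts = [render_block(block) for block in document.get("blocks", [])]
    return "\n\n".join(part for part in parts if part)
```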

---

## Source & Thanks

> Created by [OpenDataLab](https://github.com/opendatalab). Licensed under AGPL-3.0.
>
> [MinerU](https://github.com/opendatalab/MinerU): ⭐ 57,900+

Thanks to the OpenDataLab team for making high-quality document extraction accessible to the AI community.

---

Source: https://tokrepo.com/en/workflows/985fe0df-6ec5-4fd6-8d3d-3c1627b0e18d

Author: Script Depot