Scripts · Apr 2, 2026 · 3 min read

MinerU — Extract LLM-Ready Data from Any Document

Convert PDFs, scans, and complex documents into clean Markdown or JSON for RAG and LLM pipelines. 57K+ GitHub stars.

Introduction

MinerU is an open-source document extraction tool by OpenDataLab with 57,900+ GitHub stars, purpose-built for converting complex documents into LLM-ready formats. It handles PDFs, scanned documents, and multi-column layouts with high fidelity, preserving tables, formulas, images, and reading order. Output is clean Markdown or structured JSON ready for RAG pipelines, fine-tuning datasets, or direct LLM consumption. It powers document understanding for thousands of AI applications in production.

Works with: any LLM (GPT-4, Claude, Gemini, Llama) and RAG frameworks such as LangChain, LlamaIndex, and Haystack. Best for teams building document-heavy AI applications. Setup time: under 5 minutes.


MinerU Architecture & Features

Processing Pipeline

Input Document (PDF/Scan/Image)
    │
    ├─ Layout Detection (LayoutLMv3)
    │   └─ Identify: text, tables, figures, formulas, headers
    │
    ├─ OCR Engine (PaddleOCR / Tesseract)
    │   └─ Extract text from scanned pages
    │
    ├─ Table Recognition
    │   └─ Convert tables to Markdown/HTML/LaTeX
    │
    ├─ Formula Recognition (UniMERNet)
    │   └─ Convert math formulas to LaTeX
    │
    └─ Output Assembly
        ├─ Markdown (with images)
        ├─ JSON (structured blocks)
        └─ Content list (flat text)

Key Capabilities

| Feature | Description |
|---------|-------------|
| Layout-aware parsing | Detects headers, paragraphs, tables, figures, and formulas using deep learning models |
| Multi-column support | Correctly handles 2-column academic papers and complex layouts |
| Table extraction | Converts tables to Markdown, HTML, or LaTeX with cell merging support |
| Formula recognition | Converts mathematical formulas to LaTeX notation |
| OCR integration | PaddleOCR for 80+ languages, with fallback to Tesseract |
| Image extraction | Saves embedded images with automatic naming and referencing |
| Reading order | Preserves logical reading order across complex layouts |
| Batch processing | Processes hundreds of PDFs concurrently |

Output Formats

Markdown — Clean, LLM-friendly format:

# Chapter 1: Introduction

The experiment results shown in **Table 1** demonstrate...

| Metric | Model A | Model B |
|--------|---------|---------|
| F1     | 0.92    | 0.87    |

The loss function is defined as:

$$L = -\sum_{i} y_i \log(p_i)$$
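
Markdown like the sample above is usually split into chunks before it is indexed for retrieval. The sketch below is one possible approach, not part of MinerU: it splits on ATX headings so that each table or formula stays with its surrounding text. The function name `split_markdown_by_heading` and the extra `## Loss` heading in the sample are assumptions added for illustration, and the splitter does not handle `#` lines inside fenced code.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def split_markdown_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into sections, one per ATX heading (#, ##, ...)."""
    sections = []
    heading, lines = None, []

    def flush():
        # Keep a section only if it has a heading or non-blank content.
        if heading is not None or any(l.strip() for l in lines):
            sections.append({"heading": heading, "body": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return sections

# Sample text modelled on the Markdown output shown above.
# The "## Loss" heading is hypothetical, added so there are two sections.
SAMPLE = """# Chapter 1: Introduction

The experiment results shown in **Table 1** demonstrate...

| Metric | Model A | Model B |
|--------|---------|---------|
| F1     | 0.92    | 0.87    |

## Loss

The loss function is defined as:

$$L = -\\sum_{i} y_i \\log(p_i)$$
"""

sections = split_markdown_by_heading(SAMPLE)
for section in sections:
    print(section["heading"], "->", len(section["body"]), "characters")
```

Splitting on headings rather than on a fixed character count keeps each table row and equation in one piece, which tends to matter more for retrieval quality than uniform chunk sizes.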

JSON — Structured blocks with metadata:

{
  "blocks": [
    {"type": "title", "text": "Chapter 1: Introduction", "level": 1},
    {"type": "text", "text": "The experiment results..."},
    {"type": "table", "cells": [["Metric", "Model A"], ["F1", "0.92"]]},
    {"type": "equation", "latex": "L = -\\sum_{i} y_i \\log(p_i)"}
  ]
}
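
To show how structured blocks might be consumed downstream, here is a minimal sketch that renders the JSON above back into Markdown. It assumes exactly the schema shown in this article (`blocks`, `type`, `text`, `level`, `cells`, `latex`); the files MinerU actually writes may use different field names, so treat `blocks_to_markdown` as a hypothetical helper rather than a MinerU API.

```python
import json

def blocks_to_markdown(document: dict) -> str:
    """Render the illustrative block schema from this article as Markdown."""
    parts = []
    for block in document.get("blocks", []):
        kind = block.get("type")
        if kind == "title":
            parts.append("#" * block.get("level", 1) + " " + block["text"])
        elif kind == "text":
            parts.append(block["text"])
        elif kind == "table":
            rows = block.get("cells", [])
            if not rows:
                continue
            header, body = rows[0], rows[1:]
            lines = [
                "| " + " | ".join(header) + " |",
                "| " + " | ".join("---" for _ in header) + " |",
            ]
            lines += ["| " + " | ".join(row) + " |" for row in body]
            parts.append("\n".join(lines))
        elif kind == "equation":
            parts.append("$$" + block["latex"] + "$$")
        # Block types not listed in the article's example are skipped.
    return "\n\n".join(parts)

# The JSON example from the article, kept as a raw string so the
# escaped backslashes reach the JSON parser unchanged.
RAW = r'''
{
  "blocks": [
    {"type": "title", "text": "Chapter 1: Introduction", "level": 1},
    {"type": "text", "text": "The experiment results..."},
    {"type": "table", "cells": [["Metric", "Model A"], ["F1", "0.92"]]},
    {"type": "equation", "latex": "L = -\\sum_{i} y_i \\log(p_i)"}
  ]
}
'''

markdown = blocks_to_markdown(json.loads(RAW))
print(markdown)
```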

CLI Commands

# Auto-detect PDF type (text vs scanned)
magic-pdf -p paper.pdf -o output/ -m auto

# Force OCR mode for scanned documents
magic-pdf -p scan.pdf -o output/ -m ocr

# Text-only mode (faster, no OCR)
magic-pdf -p textbook.pdf -o output/ -m txt

# Batch process a directory
magic-pdf -p papers/ -o output/ -m auto
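
The same commands can be driven from a script. The sketch below is a thin wrapper and assumes only the flags shown above (`-p`, `-o`, `-m` with `auto`, `ocr`, or `txt`); `build_command` and `run_extraction` are hypothetical helper names, and the executable name may differ in other MinerU releases.

```python
import shutil
import subprocess
from pathlib import Path

# Modes listed in the CLI examples above.
VALID_MODES = {"auto", "ocr", "txt"}

def build_command(source: str, output_dir: str, mode: str = "auto") -> list[str]:
    """Build the argument list for one magic-pdf invocation."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(VALID_MODES)}")
    return ["magic-pdf", "-p", source, "-o", output_dir, "-m", mode]

def run_extraction(source: str, output_dir: str, mode: str = "auto") -> None:
    """Run magic-pdf, failing early if the CLI is not installed."""
    if shutil.which("magic-pdf") is None:
        raise FileNotFoundError("magic-pdf was not found on PATH")
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError on a non-zero exit code.
    subprocess.run(build_command(source, output_dir, mode), check=True)

# Printing the command is safe even when magic-pdf is not installed.
print(" ".join(build_command("paper.pdf", "output/", "auto")))
```

Passing the arguments as a list instead of a shell string avoids quoting problems with file names that contain spaces.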

Performance Benchmarks

Tested on academic papers, financial reports, and legal documents:

  • Text extraction accuracy: 95%+ on clean PDFs
  • Table recognition: 90%+ F1 on complex tables
  • Processing speed: ~2-5 pages/second on GPU, ~0.5-1 page/second on CPU
  • Language support: 80+ languages via PaddleOCR
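
As a rough planning aid, the throughput ranges above translate into wall-clock estimates as follows. This is simple arithmetic on the quoted figures, not a measurement; the 10,000-page corpus size is an assumption chosen for illustration.

```python
def estimate_seconds(pages: int, low_pps: float, high_pps: float) -> tuple[float, float]:
    """Return (best_case, worst_case) seconds for a pages-per-second range."""
    return pages / high_pps, pages / low_pps

# Throughput ranges quoted above, in pages per second.
GPU_RANGE = (2.0, 5.0)
CPU_RANGE = (0.5, 1.0)

PAGES = 10_000  # hypothetical corpus size

for label, (low, high) in (("GPU", GPU_RANGE), ("CPU", CPU_RANGE)):
    best, worst = estimate_seconds(PAGES, low, high)
    print(f"{label}: {best / 60:.0f} to {worst / 60:.0f} minutes")
```

Under these figures the hypothetical corpus takes between 2,000 and 5,000 seconds on GPU and between 10,000 and 20,000 seconds on CPU, so a large backlog is likely to need a GPU or several workers.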

FAQ

Q: What is MinerU? A: MinerU is an open-source document extraction tool with 57,900+ GitHub stars that converts PDFs and scanned documents into clean Markdown or JSON for LLM and RAG applications, with high-fidelity layout detection, table extraction, and formula recognition.

Q: How is MinerU different from Docling or Marker? A: MinerU focuses on layout-aware extraction with deep learning models (LayoutLMv3) and excels at complex multi-column academic papers. Docling (IBM) has broader format support. Marker is faster but less accurate on complex layouts. MinerU has the highest star count (57K+) and strongest community.

Q: Is MinerU free? A: Yes, open-source under AGPL-3.0. Free for personal and academic use. Commercial use requires compliance with AGPL terms or a commercial license.



Source and Acknowledgments

Created by OpenDataLab. Licensed under AGPL-3.0.

MinerU — ⭐ 57,900+

Thanks to the OpenDataLab team for making high-quality document extraction accessible to the AI community.
