What is MinerU — Extract LLM-Ready Data from Any Document?

Convert PDFs, scans, and complex documents into clean Markdown or JSON for RAG and LLM pipelines. 57K+ GitHub stars.

MinerU — Extract LLM-Ready Data from Any Document

MinerU Architecture & Features

Processing Pipeline

Input Document (PDF/Scan/Image)
    │
    ├─ Layout Detection (LayoutLMv3)
    │   └─ Identify: text, tables, figures, formulas, headers
    │
    ├─ OCR Engine (PaddleOCR / Tesseract)
    │   └─ Extract text from scanned pages
    │
    ├─ Table Recognition
    │   └─ Convert tables to Markdown/HTML/LaTeX
    │
    ├─ Formula Recognition (UniMERNet)
    │   └─ Convert math formulas to LaTeX
    │
    └─ Output Assembly
        ├─ Markdown (with images)
        ├─ JSON (structured blocks)
        └─ Content list (flat text)
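The final "Output Assembly" stage can be pictured as a dispatch over detected regions, sorted into reading order. The sketch below is only an illustration of that idea and does not reproduce MinerU's internal code; `Region`, `render_region`, and `assemble` are hypothetical names, and the region kinds are simplified to four.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """One detected layout region (hypothetical shape, for illustration only)."""
    kind: str        # "title", "text", "table", or "formula"
    order: int       # position in the logical reading order
    content: object  # str for most kinds, list of rows for tables


def render_region(region: Region) -> str:
    """Render a single region to Markdown according to its kind."""
    if region.kind == "title":
        return f"# {region.content}"
    if region.kind == "table":
        header, *rows = region.content
        lines = [
            "| " + " | ".join(header) + " |",
            "|" + "|".join("---" for _ in header) + "|",
        ]
        lines += ["| " + " | ".join(row) + " |" for row in rows]
        return "\n".join(lines)
    if region.kind == "formula":
        return f"$${region.content}$$"
    # Plain text and any unknown kind fall back to the raw content.
    return str(region.content)


def assemble(regions: list[Region]) -> str:
    """Sort regions by reading order and join them into one Markdown document."""
    ordered = sorted(regions, key=lambda r: r.order)
    return "\n\n".join(render_region(r) for r in ordered)
```

Keeping rendering per kind separate from ordering mirrors the diagram: each recognizer produces its own output, and assembly only decides the sequence.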

Key Capabilities

| Feature | Description |
|---------|-------------|
| Layout-aware parsing | Detects headers, paragraphs, tables, figures, formulas using deep learning models |
| Multi-column support | Correctly handles 2-column academic papers and complex layouts |
| Table extraction | Converts tables to Markdown, HTML, or LaTeX with cell merging support |
| Formula recognition | Converts mathematical formulas to LaTeX notation |
| OCR integration | PaddleOCR for 80+ languages, fallback to Tesseract |
| Image extraction | Saves embedded images with automatic naming and referencing |
| Reading order | Preserves logical reading order across complex layouts |
| Batch processing | Process hundreds of PDFs concurrently |

Output Formats

Markdown — Clean, LLM-friendly format:

# Chapter 1: Introduction

The experiment results shown in **Table 1** demonstrate...

| Metric | Model A | Model B |
|--------|---------|---------|
| F1     | 0.92    | 0.87    |

The loss function is defined as:

$$L = -\sum_{i} y_i \log(p_i)$$
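For RAG use, Markdown output like the sample above is often split into heading-scoped sections before embedding. The following is a minimal sketch of that step, not part of MinerU itself; `split_by_heading` is a hypothetical helper, and it assumes ATX-style `#` headings as in the sample.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into (heading, body) pairs.

    Text before the first heading is returned with an empty heading.
    """
    sections: list[tuple[str, str]] = []
    title = ""
    body: list[str] = []

    def flush() -> None:
        # Emit the current section only if it has a title or non-blank body.
        if title or any(line.strip() for line in body):
            sections.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections
```

One limitation of this sketch: a line starting with `#` inside a fenced code block would be treated as a heading, so a production splitter would need to track fences.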

JSON — Structured blocks with metadata:

{
  "blocks": [
    {"type": "title", "text": "Chapter 1: Introduction", "level": 1},
    {"type": "text", "text": "The experiment results..."},
    {"type": "table", "cells": [["Metric", "Model A"], ["F1", "0.92"]]},
    {"type": "equation", "latex": "L = -\\sum_{i} y_i \\log(p_i)"}
  ]
}
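A downstream consumer might flatten these blocks into text chunks for embedding. The sketch below assumes the JSON has exactly the shape of the sample above (a top-level `blocks` list with `type`, `text`, `cells`, and `latex` fields); real output may carry more fields or different names, and `block_to_text` and `blocks_to_chunks` are hypothetical helpers.

```python
import json

# The sample from this section, used as illustrative input.
SAMPLE = r'''
{
  "blocks": [
    {"type": "title", "text": "Chapter 1: Introduction", "level": 1},
    {"type": "text", "text": "The experiment results..."},
    {"type": "table", "cells": [["Metric", "Model A"], ["F1", "0.92"]]},
    {"type": "equation", "latex": "L = -\\sum_{i} y_i \\log(p_i)"}
  ]
}
'''


def block_to_text(block: dict) -> str:
    """Turn one block into a plain-text chunk; unknown types yield ''."""
    kind = block.get("type")
    if kind == "title":
        return "#" * block.get("level", 1) + " " + block["text"]
    if kind == "text":
        return block["text"]
    if kind == "table":
        return "\n".join(" | ".join(row) for row in block["cells"])
    if kind == "equation":
        return "$$" + block["latex"] + "$$"
    return ""


def blocks_to_chunks(document: dict) -> list[str]:
    """Flatten a parsed document into non-empty text chunks, in order."""
    chunks = (block_to_text(b) for b in document.get("blocks", []))
    return [chunk for chunk in chunks if chunk]


if __name__ == "__main__":
    for chunk in blocks_to_chunks(json.loads(SAMPLE)):
        print(chunk, end="\n\n")
```

Skipping unknown block types rather than raising keeps the consumer tolerant of block kinds that this sample does not show.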

CLI Commands

# Auto-detect PDF type (text vs scanned)
magic-pdf -p paper.pdf -o output/ -m auto

# Force OCR mode for scanned documents
magic-pdf -p scan.pdf -o output/ -m ocr

# Text-only mode (faster, no OCR)
magic-pdf -p textbook.pdf -o output/ -m txt

# Batch process a directory
magic-pdf -p papers/ -o output/ -m auto
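The same commands can be driven from a script. This is a minimal sketch that uses only the `-p`, `-o`, and `-m` flags shown above and assumes `magic-pdf` is installed and on the `PATH`; `build_command` and `run_magic_pdf` are hypothetical wrapper names, not part of MinerU.

```python
import subprocess

# Modes taken from the commands listed in this section.
VALID_MODES = ("auto", "ocr", "txt")


def build_command(input_path: str, output_dir: str, mode: str = "auto") -> list[str]:
    """Build the argument list for one magic-pdf invocation."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return ["magic-pdf", "-p", str(input_path), "-o", str(output_dir), "-m", mode]


def run_magic_pdf(input_path: str, output_dir: str, mode: str = "auto") -> subprocess.CompletedProcess:
    """Run magic-pdf and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(
        build_command(input_path, output_dir, mode),
        check=True,
        capture_output=True,
        text=True,
    )
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.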

Performance Benchmarks

Tested on academic papers, financial reports, and legal documents:

Text extraction accuracy: 95%+ on clean PDFs
Table recognition: 90%+ F1 on complex tables
Processing speed: ~2-5 pages/second on GPU, ~0.5-1 page/second on CPU
Language support: 80+ languages via PaddleOCR
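The throughput ranges above translate directly into rough job-time estimates. The helper below is simple arithmetic under the assumption that throughput stays constant across a job, which real documents will not guarantee; `estimate_seconds` is a hypothetical name.

```python
def estimate_seconds(pages: int, slow_pps: float, fast_pps: float) -> tuple[float, float]:
    """Return (best_case, worst_case) processing time in seconds.

    slow_pps and fast_pps are the low and high ends of a pages-per-second range.
    """
    if pages < 0 or slow_pps <= 0 or fast_pps < slow_pps:
        raise ValueError("expected pages >= 0 and 0 < slow_pps <= fast_pps")
    return pages / fast_pps, pages / slow_pps


# Ranges quoted in this section: ~2-5 pages/s on GPU, ~0.5-1 page/s on CPU.
gpu_best, gpu_worst = estimate_seconds(1000, 2.0, 5.0)
cpu_best, cpu_worst = estimate_seconds(1000, 0.5, 1.0)
print(f"GPU: {gpu_best:.0f}-{gpu_worst:.0f} s, CPU: {cpu_best:.0f}-{cpu_worst:.0f} s")
```

For a 1,000-page batch this works out to roughly 200 to 500 seconds on GPU and 1,000 to 2,000 seconds on CPU.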

FAQ

Q: What is MinerU? A: MinerU is an open-source document extraction tool with 57,900+ GitHub stars that converts PDFs and scanned documents into clean Markdown or JSON for LLM and RAG applications, with high-fidelity layout detection, table extraction, and formula recognition.

Q: How is MinerU different from Docling or Marker? A: MinerU focuses on layout-aware extraction with deep learning models (LayoutLMv3) and excels at complex multi-column academic papers. Docling (IBM) supports a broader range of input formats. Marker is faster but less accurate on complex layouts. Of the three, MinerU has the highest GitHub star count (57K+).

Q: Is MinerU free? A: Yes. MinerU is open source under AGPL-3.0 and free for personal and academic use. Commercial use requires either compliance with the AGPL terms or a separate commercial license.
