Scripts · March 31, 2026 · 1 min read

Marker — Convert PDF to Markdown with High Accuracy

Fast, accurate PDF to Markdown + JSON converter. Handles tables, images, equations, code blocks, and multi-column layouts. GPU-accelerated. 33K+ GitHub stars.

TokRepo Featured · Community
Quick Start

Try it first, then decide whether to dig deeper.


pip install marker-pdf

# Convert a single PDF
marker_single input.pdf --output_dir output/ --output_format markdown

Or use in Python:

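from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

# Load the layout and OCR models once, then pass them to the converter
converter = PdfConverter(artifact_dict=create_model_dict())
result = converter("report.pdf")
print(result.markdown)  # the converted document as Markdown text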

Introduction

Marker converts PDF files to Markdown and JSON with high accuracy and speed. It handles complex layouts, including tables, images, equations, code blocks, multi-column text, headers/footers, and footnotes. It is GPU-accelerated for fast batch processing and built on the Surya OCR engine for multi-language support. The project has 33,000+ GitHub stars.

Best for: RAG pipelines, document ingestion, PDF data extraction, knowledge base building

Works with: any LLM pipeline, including LangChain, LlamaIndex, Haystack, and custom RAG systems
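For the RAG use case, the Markdown that Marker produces usually needs to be split into chunks before indexing. The following is a minimal sketch of one common approach, splitting on headings. It is not part of Marker; the function name `split_by_heading` and the chunk layout (`title`, `text`) are illustrative assumptions.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example easy to embed


def split_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into one chunk per heading section.

    Hypothetical helper: lines starting with '#' inside fenced code
    blocks are treated as code, not as headings.
    """
    chunks: List[Dict[str, str]] = []
    title, body = "", []
    in_fence = False

    def flush() -> None:
        # Emit the current section unless it is completely empty
        if title or any(line.strip() for line in body):
            chunks.append({"title": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk can then be embedded or passed to a LangChain, LlamaIndex, or Haystack document loader; heading-based splitting keeps a section's text together with the title that describes it.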


Key Features

Accurate Conversion

  • Tables — Preserved as Markdown tables with alignment
  • Images — Extracted and saved as separate files
  • Equations — Converted to LaTeX notation
  • Code blocks — Detected and formatted with syntax highlighting
  • Multi-column — Correctly reads multi-column layouts in order
  • Headers/footers — Automatically removed
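Because tables arrive as ordinary pipe-delimited Markdown, they can be turned back into structured rows with a few lines of code. The sketch below assumes a simple table (header row, separator row, data rows, no escaped pipes inside cells); `parse_markdown_table` is a hypothetical helper, not a Marker function.

```python
from typing import Dict, List


def parse_markdown_table(table: str) -> List[Dict[str, str]]:
    """Turn a pipe-delimited Markdown table into a list of row dicts."""

    def cells(line: str) -> List[str]:
        # Drop the outer pipes, then split on the inner ones
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    lines = [line for line in table.splitlines() if line.strip()]
    if len(lines) < 2:
        return []

    header = cells(lines[0])
    # lines[1] is the separator row (---, :---, ---:); it only carries alignment
    return [dict(zip(header, cells(line))) for line in lines[2:]]


if __name__ == "__main__":
    example = "| Name | Qty |\n|:-----|----:|\n| Bolt | 12 |\n| Nut | 7 |"
    for row in parse_markdown_table(example):
        print(row)
```

All values come back as strings; numeric conversion is left to the caller because the Markdown itself carries no type information.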

Performance

  • GPU-accelerated — 10x faster with CUDA
  • Batch processing — Convert entire directories
  • Multi-language — 90+ languages via Surya OCR engine
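Batch conversion can be sketched as a loop over the single-file command. The example below only builds the command lines, so they can be inspected before anything runs. The helper names `build_commands` and `run_all` are hypothetical, and the `--output_dir` flag is an assumption about the installed Marker version; check `marker_single --help` before relying on it.

```python
import subprocess
from pathlib import Path
from typing import List


def build_commands(
    input_dir: str, output_dir: str, output_format: str = "markdown"
) -> List[List[str]]:
    """Build one marker_single command per PDF found directly in input_dir."""
    pdfs = sorted(Path(input_dir).glob("*.pdf"))
    return [
        [
            "marker_single",
            str(pdf),
            "--output_dir", output_dir,        # assumed flag name
            "--output_format", output_format,
        ]
        for pdf in pdfs
    ]


def run_all(commands: List[List[str]]) -> None:
    """Run the commands one after another, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

A typical call would be `run_all(build_commands("pdfs", "output"))`. Running files sequentially is the simplest choice; each process reloads the models, so for large directories a batch entry point that loads them once is likely to be faster.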

Output Formats

  • Markdown (clean, LLM-ready)
  • JSON (structured with metadata)
  • HTML
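The JSON output is useful when downstream code needs structure rather than text. As a rough illustration, the sketch below walks a nested block tree and counts block types. The field names `block_type` and `children` are assumptions about the JSON layout, and the sample document is made up; inspect a real output file before depending on them.

```python
from collections import Counter
from typing import Any, Dict


def count_block_types(node: Dict[str, Any]) -> Counter:
    """Count how often each block type appears in a nested block tree."""
    counts: Counter = Counter()
    stack = [node]
    while stack:
        current = stack.pop()
        block_type = current.get("block_type")  # assumed field name
        if block_type:
            counts[block_type] += 1
        # "children" (assumed field name) may be missing or null on leaf blocks
        stack.extend(current.get("children") or [])
    return counts


if __name__ == "__main__":
    # Made-up sample tree, not real Marker output
    sample = {
        "block_type": "Document",
        "children": [
            {
                "block_type": "Page",
                "children": [
                    {"block_type": "Table", "children": None},
                    {"block_type": "Text"},
                    {"block_type": "Text"},
                ],
            }
        ],
    }
    print(count_block_types(sample))
```

An explicit stack is used instead of recursion so that deeply nested documents cannot hit Python's recursion limit.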

Comparison

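| Feature       | Marker | PyPDF | pdfplumber |
|---------------|--------|-------|------------|
| Tables        |        |       |            |
| Images        |        |       |            |
| Equations     |        |       |            |
| Multi-column  |        |       |            |
| OCR (scanned) |        |       |            |
| Speed (GPU)   | Fast   | Fast  | Medium     |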

FAQ

Q: What is Marker?
A: A fast, accurate PDF to Markdown converter that handles tables, images, equations, code blocks, and multi-column layouts. GPU-accelerated with 90+ language support. 33K+ GitHub stars.

Q: Can Marker handle scanned PDFs?
A: Yes, it includes OCR via the Surya engine, supporting 90+ languages for both native and scanned PDFs.



Source and Acknowledgments

Created by Datalab. Licensed under GPL-3.0. datalab-to/marker — 33,000+ GitHub stars
