Scripts · March 30, 2026 · 1 min read

Marker — Convert PDF to Markdown with High Accuracy

Fast, accurate PDF to Markdown + JSON converter. Handles tables, images, equations, code blocks, and multi-column layouts. GPU-accelerated. 33K+ GitHub stars.

Introduction

Marker converts PDF files to Markdown and JSON with high accuracy and speed. It handles complex layouts, including tables, images, equations, code blocks, multi-column text, headers and footers, and footnotes. GPU acceleration speeds up batch processing, and the Surya OCR engine it is built on provides multi-language support. The project has 33,000+ GitHub stars.
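As a rough illustration of the conversion described above, here is a minimal Python sketch. The three Marker imports follow the usage shown in the project's README at the time of writing and may differ between versions, so treat them as an assumption. The `output_path_for` helper and the `out` directory are hypothetical names introduced only for this example.

```python
from pathlib import Path


def output_path_for(pdf_path: str, out_dir: str = "out") -> Path:
    """Map an input PDF path to the Markdown path this sketch writes to.

    This naming convention is our own, not something Marker prescribes.
    """
    return Path(out_dir) / (Path(pdf_path).stem + ".md")


def convert_pdf(pdf_path: str, out_dir: str = "out") -> Path:
    """Convert one PDF to Markdown and write the result to disk."""
    # Imported lazily so this module still loads where marker-pdf is not
    # installed. Module paths are assumed from the project's README.
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict
    from marker.output import text_from_rendered

    converter = PdfConverter(artifact_dict=create_model_dict())
    rendered = converter(pdf_path)
    # text_from_rendered also returns the extracted images; this sketch
    # ignores them and keeps only the Markdown text.
    markdown, _, _images = text_from_rendered(rendered)

    target = output_path_for(pdf_path, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(markdown, encoding="utf-8")
    return target
```

Calling `convert_pdf("paper.pdf")` would write `out/paper.md`, assuming the `marker-pdf` package and its models are installed.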

Best for: RAG pipelines, document ingestion, PDF data extraction, knowledge base building
Works with: Any LLM pipeline, including LangChain, LlamaIndex, Haystack, and custom RAG systems
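For the RAG use case mentioned above, converted Markdown is usually split into chunks before indexing. The sketch below shows one simple approach, splitting on Markdown headings, using only the standard library. It is not part of Marker; the function name and chunk shape are illustrative assumptions.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per heading, keeping the heading as title.

    Text before the first heading becomes a chunk with an empty title.
    """
    chunks: list[dict] = []
    title, lines = "", []

    def flush() -> None:
        # Emit the current chunk unless it is completely empty.
        if title or any(line.strip() for line in lines):
            chunks.append({"title": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

This sketch does not track fenced code blocks, so a `#` comment at the start of a line inside a code fence would be treated as a heading. A production splitter would need to handle that case.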


Key Features

Accurate Conversion

  • Tables — Preserved as Markdown tables with alignment
  • Images — Extracted and saved as separate files
  • Equations — Converted to LaTeX notation
  • Code blocks — Detected and formatted with syntax highlighting
  • Multi-column — Correctly reads multi-column layouts in order
  • Headers/footers — Automatically removed
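Because tables come out as Markdown tables, they can be read back into structured rows downstream. The following is a minimal sketch that assumes standard pipe-delimited tables with a separator row; the exact table syntax Marker emits is not specified here, and the function name is hypothetical.

```python
def _cells(line: str) -> list[str]:
    """Split one pipe-delimited table line into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_markdown_table(block: str) -> list[dict]:
    """Parse a pipe table into a list of {header: value} rows.

    Returns an empty list if the block does not look like a table.
    Escaped pipes inside cells are not handled in this sketch.
    """
    lines = [ln for ln in block.strip().splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []
    header = _cells(lines[0])
    separator = _cells(lines[1])
    # The second row must be the alignment/separator row, e.g. |:---|---:|
    if not all(cell and set(cell) <= set("-:") for cell in separator):
        return []
    return [dict(zip(header, _cells(ln))) for ln in lines[2:]]
```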

Performance

  • GPU-accelerated — 10x faster with CUDA
  • Batch processing — Convert entire directories
  • Multi-language — 90+ languages via Surya OCR engine
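Batch conversion of a directory can be sketched as below. The directory walking and skip logic use only the standard library; `make_marker_convert` assumes the Python API shown in Marker's README and builds the converter once so models are loaded a single time for all files. All function names here are illustrative, and device selection is left to Marker's defaults.

```python
from pathlib import Path
from typing import Callable, Iterable


def find_pdfs(root: str) -> list[Path]:
    """Recursively collect PDF files under root, case-insensitive on extension."""
    return sorted(
        p for p in Path(root).rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
    )


def convert_all(
    pdfs: Iterable[Path], convert: Callable[[Path], str], out_dir: str
) -> list[Path]:
    """Run convert on each PDF and write <stem>.md into out_dir.

    Files whose output already exists are skipped, so reruns are cheap.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in pdfs:
        target = out / (pdf.stem + ".md")
        if target.exists():
            continue
        target.write_text(convert(pdf), encoding="utf-8")
        written.append(target)
    return written


def make_marker_convert() -> Callable[[Path], str]:
    """Build a Marker-backed convert function (API assumed from the README)."""
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict
    from marker.output import text_from_rendered

    # Created once and reused, so models are not reloaded per file.
    converter = PdfConverter(artifact_dict=create_model_dict())

    def convert(pdf: Path) -> str:
        text, _, _ = text_from_rendered(converter(str(pdf)))
        return text

    return convert
```

A typical call would be `convert_all(find_pdfs("pdfs"), make_marker_convert(), "out")`. Note that two PDFs with the same file name in different subdirectories would map to the same output file in this sketch.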

Output Formats

  • Markdown (clean, LLM-ready)
  • JSON (structured with metadata)
  • HTML
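The structured JSON output can be inspected programmatically. The sketch below assumes the JSON is a tree of blocks where each node has a `block_type` string and an optional `children` list; these key names are an assumption about the schema and should be checked against the actual output. The function itself uses only the standard library.

```python
from collections import Counter


def count_block_types(node: dict) -> Counter:
    """Walk a block tree and count how many blocks of each type it contains.

    Assumes each node is a dict with "block_type" and optional "children".
    """
    counts: Counter = Counter()
    stack = [node]
    while stack:
        current = stack.pop()
        block_type = current.get("block_type")
        if block_type:
            counts[block_type] += 1
        # "children" may be missing or None for leaf blocks.
        stack.extend(current.get("children") or [])
    return counts
```

A summary like this is a quick way to check, for example, how many tables or equations were detected in a document before feeding it into a pipeline.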

Comparison

Feature       | Marker | PyPDF | pdfplumber
Tables        |        |       |
Images        |        |       |
Equations     |        |       |
Multi-column  |        |       |
OCR (scanned) |        |       |
Speed (GPU)   | Fast   | Fast  | Medium

FAQ

Q: What is Marker?
A: A fast, accurate PDF to Markdown converter that handles tables, images, equations, code blocks, and multi-column layouts. GPU-accelerated with 90+ language support. 33K+ GitHub stars.

Q: Can Marker handle scanned PDFs?
A: Yes, it includes OCR via the Surya engine, supporting 90+ languages for both native and scanned PDFs.



Source and Acknowledgments

Created by Datalab. Licensed under GPL-3.0. datalab-to/marker — 33,000+ GitHub stars
