Scripts · Mar 30, 2026 · 2 min read

Marker — Convert PDF to Markdown with High Accuracy

Fast, accurate PDF to Markdown + JSON converter. Handles tables, images, equations, code blocks, and multi-column layouts. GPU-accelerated. 33K+ GitHub stars.

TL;DR
Marker converts PDFs to clean Markdown with accurate table, equation, and code block extraction, optionally GPU-accelerated.
§01

What it is

Marker is a Python tool that converts PDF documents to Markdown and JSON with high accuracy. It handles complex layouts including tables, images, equations, code blocks, and multi-column text. The tool uses deep learning models for layout detection and OCR, producing clean Markdown that preserves the document's logical structure.

Marker is designed for developers, researchers, and data engineers who need to extract structured text from PDFs for RAG pipelines, document processing, or content migration.

§02

How it saves time or tokens

Traditional PDF extraction tools often produce messy output that requires extensive post-processing. Marker's ML-based approach analyzes the document layout and identifies tables as tables, equations as LaTeX, and code blocks as fenced code. The extracted Markdown is therefore usable with little or no manual cleanup. For RAG pipelines, higher-quality extraction generally leads to better retrieval, which in turn reduces the risk of hallucinated answers.

§03

How to use

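  1. Install Marker:
pip install marker-pdf
  2. Convert a single PDF:
marker_single input.pdf --output_dir output/ --output_format markdown
  3. Or use the Python API:
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

# Load the layout and OCR models once, then reuse the converter
converter = PdfConverter(artifact_dict=create_model_dict())
result = converter('input.pdf')
print(result.markdown)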
§04

Example

Batch processing multiple PDFs:

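from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
import os

# Load the models once and reuse the converter for every file
converter = PdfConverter(artifact_dict=create_model_dict())

pdf_dir = './papers/'
out_dir = './output/'
os.makedirs(out_dir, exist_ok=True)  # open() fails if the folder is missing

for filename in os.listdir(pdf_dir):
    if filename.endswith('.pdf'):
        result = converter(os.path.join(pdf_dir, filename))
        # Swap only the extension, not every ".pdf" in the name
        md_path = os.path.join(out_dir, os.path.splitext(filename)[0] + '.md')
        with open(md_path, 'w', encoding='utf-8') as f:
            f.write(result.markdown)
        print(f'Converted {filename}: {len(result.markdown)} chars')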

Marker preserves headings, lists, tables, and code blocks. A research paper with a complex multi-column layout converts to a single-column Markdown document with the correct heading hierarchy.
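
For repeated runs, the loop above can be wrapped in a small helper that skips PDFs that already have output and matches the `.pdf` extension case-insensitively. This is only a sketch and not part of Marker: `convert_dir` and its `convert` parameter are hypothetical names, and `convert` stands for any callable that takes a PDF path and returns Markdown text.

```python
from pathlib import Path
from typing import Callable


def convert_dir(pdf_dir: str, out_dir: str, convert: Callable[[str], str]) -> list[str]:
    """Convert every PDF in pdf_dir, writing one .md file per PDF into out_dir.

    `convert` is a hypothetical hook: any callable mapping a PDF path to
    Markdown text. Returns the names of the files converted in this run.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the output folder if missing
    converted = []
    for pdf in sorted(Path(pdf_dir).iterdir()):
        if pdf.suffix.lower() != ".pdf":  # also matches .PDF
            continue
        target = out / (pdf.stem + ".md")
        if target.exists():  # skip work finished by an earlier run
            continue
        target.write_text(convert(str(pdf)), encoding="utf-8")
        converted.append(pdf.name)
    return converted
```

With Marker, `convert` could be something like `lambda path: converter(path).markdown`, reusing a single converter so the models are loaded only once.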

§06

Common pitfalls

  • Not setting up GPU support for large batch jobs. Marker runs on CPU, GPU, or Apple MPS, and it is significantly faster with CUDA GPU acceleration. For production workloads, make sure a CUDA-enabled PyTorch build is installed so Marker can use the GPU.
  • Expecting perfect extraction from scanned documents. Marker uses OCR for scanned pages, but OCR quality depends on scan resolution. High-resolution scans (300+ DPI) produce much better results.
  • Not checking the output format option. Marker supports both Markdown and JSON output. JSON output includes bounding boxes and metadata that are useful for downstream processing but not needed for simple text extraction.
  • Starting with an overly complex configuration instead of defaults. Begin with the minimal setup, verify it works, then customize incrementally. This approach catches configuration errors early and keeps troubleshooting straightforward.
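
For the JSON output mentioned above, a downstream step usually walks the block tree to pick out the parts it needs. The sketch below assumes each block is a dict with `block_type`, `bbox`, and an optional `children` list. That shape is an assumption for illustration, so check it against the JSON your Marker version actually emits. `iter_blocks` and the sample data are hypothetical.

```python
from typing import Any, Iterator


def iter_blocks(block: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield a block and all of its descendants, depth-first."""
    yield block
    for child in block.get("children") or []:
        yield from iter_blocks(child)


# Hand-written sample in the assumed shape (not real Marker output).
sample = {
    "block_type": "Page",
    "bbox": [0, 0, 612, 792],
    "children": [
        {"block_type": "SectionHeader", "bbox": [72, 72, 540, 96], "children": None},
        {"block_type": "Table", "bbox": [72, 120, 540, 300], "children": []},
    ],
}

# Collect the bounding boxes of every table block in the tree.
tables = [b["bbox"] for b in iter_blocks(sample) if b["block_type"] == "Table"]
print(tables)
```

If only the text is needed, the Markdown output avoids this traversal entirely.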

Frequently Asked Questions

How accurate is Marker compared to other PDF extractors?

Marker uses deep learning models for layout detection, which produces significantly better results than rule-based extractors for complex documents with tables, multi-column layouts, and mixed content types. Accuracy depends on the PDF quality and complexity.

Does Marker support OCR for scanned PDFs?

Yes. Marker includes OCR capabilities for scanned pages. It automatically detects whether a page is text-based or image-based and applies OCR when needed. Higher-resolution scans produce better OCR results.
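
Scan resolution can be estimated from a page image's pixel size and the physical page size, since DPI is pixels divided by inches. A minimal sketch, where `estimate_dpi` is a hypothetical helper and the 300 DPI threshold is the rule of thumb from the pitfalls section, not a limit enforced by Marker:

```python
def estimate_dpi(width_px: int, height_px: int, width_in: float, height_in: float) -> float:
    """Return the lower of the horizontal and vertical resolution in DPI."""
    if width_in <= 0 or height_in <= 0:
        raise ValueError("page dimensions must be positive")
    return min(width_px / width_in, height_px / height_in)


# US Letter page (8.5 x 11 in) scanned to 2550 x 3300 pixels.
dpi = estimate_dpi(2550, 3300, 8.5, 11.0)
print(dpi, "ok" if dpi >= 300 else "consider rescanning")
```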

Can Marker extract images from PDFs?

Yes. Marker can extract embedded images and save them as separate files alongside the Markdown output. Image references in the Markdown point to the extracted image files.
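
Because the Markdown refers to images by path, a quick check can confirm that every referenced file was actually written. This sketch uses only the standard library. `find_image_refs` and `missing_images` are hypothetical helpers, and the regular expression covers the basic `![alt](path)` form only.

```python
import re
from pathlib import Path

# Matches ![alt](path) and captures the path; an optional title after the path is ignored.
IMAGE_REF = re.compile(r'!\[[^\]]*\]\(\s*([^)\s]+)')


def find_image_refs(markdown: str) -> list[str]:
    """Return the image paths referenced in a Markdown string, in order."""
    return IMAGE_REF.findall(markdown)


def missing_images(markdown: str, base_dir: str) -> list[str]:
    """Return referenced local image paths that do not exist under base_dir."""
    refs = [r for r in find_image_refs(markdown) if "://" not in r]  # skip remote URLs
    return [r for r in refs if not (Path(base_dir) / r).is_file()]
```

Running `missing_images` on each converted file, with `base_dir` set to the folder holding the Markdown, flags broken references before the output is moved or indexed.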

Does Marker handle LaTeX equations?

Yes. Marker detects mathematical equations and converts them to LaTeX notation in the Markdown output. This works best with typeset equations; handwritten equations may not convert accurately.

Can I use Marker in a RAG pipeline?

Yes. Marker's clean Markdown output is well-suited for chunking and embedding in RAG pipelines. The preserved heading structure helps create semantically meaningful chunks. Many teams use Marker as the first step in their document ingestion pipeline.
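
Heading-based chunking can be sketched in a few lines of standard-library Python. The helper below is an illustration under stated assumptions, not part of Marker: `chunk_by_heading` is a hypothetical name, it recognizes only `#`-style headings, and it ignores `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code-fence marker


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per heading, keeping the heading path."""
    chunks, path, lines = [], [], []
    in_fence = False

    def flush():
        # Store the text collected so far under the current heading path.
        text = "\n".join(lines).strip()
        if text:
            chunks.append({"path": " > ".join(path), "text": text})
        lines.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # do not treat '#' inside code as a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk carries its heading path (for example `Methods > Dataset`), which can be stored as metadata alongside the embedding.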


Source & Thanks

Created by Datalab. Licensed under GPL-3.0. datalab-to/marker — 33,000+ GitHub stars
