Skills · Mar 30, 2026 · 2 min read

Marker — Convert PDF to Markdown with High Accuracy

Fast, accurate PDF to Markdown + JSON converter. Handles tables, images, equations, code blocks, and multi-column layouts. GPU-accelerated. 33K+ GitHub stars.

Script Depot · Community

Agent ready

Ready-to-run agent install

This asset can be installed after the agent chooses its runtime, checks the plan, and runs the matching command.

Native · 98/100 · Policy: allow

Agent surface

Any MCP/CLI agent

Kind

Skill

Install

Single

Trust

Trust: Established

Entrypoint

Marker — Convert PDF to Markdown with High Accuracy

Direct install command

npx -y tokrepo@latest install 42976daf-a56a-4152-9afb-d5b00d130a08 --target codex

Run after dry-run confirms the install plan.

TL;DR

Marker converts PDFs to clean Markdown with accurate table, equation, and code block extraction, optionally GPU-accelerated.

§01

What it is

Marker is a Python tool that converts PDF documents to Markdown and JSON with high accuracy. It handles complex layouts including tables, images, equations, code blocks, and multi-column text. The tool uses deep learning models for layout detection and OCR, producing clean Markdown that preserves the document's logical structure.

Marker is designed for developers, researchers, and data engineers who need to extract structured text from PDFs for RAG pipelines, document processing, or content migration.

§02

How it saves time or tokens

Traditional PDF extraction tools often produce messy output that needs extensive post-processing. Marker's ML-based approach models the document layout, so it emits tables as tables, equations as LaTeX, and code blocks as fenced code. The extracted Markdown is therefore usable with little manual cleanup. For RAG pipelines, cleaner extraction gives the retriever better chunks to match, which reduces the chance that a model answers from garbled context.

§03

How to use

Install Marker:

pip install marker-pdf

Convert a single PDF:

marker_single input.pdf --output_dir output/ --output_format markdown

Or use the Python API:

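from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

# Load the layout and OCR models once; PdfConverter takes them as artifact_dict.
converter = PdfConverter(artifact_dict=create_model_dict())
result = converter('input.pdf')
print(result.markdown)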

§04

Example

Batch processing multiple PDFs:

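import os

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

# Load the models once and reuse the converter for every file.
converter = PdfConverter(artifact_dict=create_model_dict())

pdf_dir = './papers/'
out_dir = './output/'
os.makedirs(out_dir, exist_ok=True)  # open() fails if the folder is missing

for filename in sorted(os.listdir(pdf_dir)):
    if not filename.lower().endswith('.pdf'):
        continue
    result = converter(os.path.join(pdf_dir, filename))
    # splitext swaps only the final extension, unlike str.replace
    md_path = os.path.join(out_dir, os.path.splitext(filename)[0] + '.md')
    with open(md_path, 'w', encoding='utf-8') as f:
        f.write(result.markdown)
    print(f'Converted {filename}: {len(result.markdown)} chars')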

Marker preserves headings, lists, tables, and code blocks. A research paper with a complex multi-column layout converts to a single-column Markdown document with the correct heading hierarchy.
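One way to make the batch loop above restartable is to plan the input and output paths up front and skip PDFs that already have a Markdown file. The sketch below uses only the standard library; `plan_batch` and the directory layout are hypothetical helpers for illustration, not part of Marker.

```python
from pathlib import Path


def plan_batch(pdf_dir, out_dir):
    """Return (pdf_path, md_path) pairs for PDFs that still need converting."""
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    jobs = []
    for pdf_path in sorted(pdf_dir.iterdir()):
        # Match the extension case-insensitively so 'Scan.PDF' is included.
        if not pdf_path.is_file() or pdf_path.suffix.lower() != ".pdf":
            continue
        md_path = out_dir / (pdf_path.stem + ".md")
        # Skip files that an earlier run already converted.
        if md_path.exists():
            continue
        jobs.append((pdf_path, md_path))
    return jobs
```

Each returned pair can then be fed to the converter, so an interrupted run resumes where it stopped instead of reprocessing every file.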

§05

Related on TokRepo

Documentation tools — Browse document processing tools
RAG tools — Explore retrieval-augmented generation tools

§06

Common pitfalls

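Running large batch jobs without a GPU. Marker works on CPU but is significantly faster with CUDA acceleration. It runs on whatever device PyTorch detects, so for production workloads install a CUDA-enabled PyTorch build; setting the TORCH_DEVICE environment variable (for example TORCH_DEVICE=cuda) overrides the automatic choice.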
Expecting perfect extraction from scanned documents. Marker uses OCR for scanned pages, but OCR quality depends on scan resolution. High-resolution scans (300+ DPI) produce much better results.
Not checking the output format option. Marker supports Markdown, JSON, and HTML output. JSON output includes bounding boxes and metadata that are useful for downstream processing but not needed for simple text extraction.
Starting with an overly complex configuration instead of defaults. Begin with the minimal setup, verify it works, then customize incrementally. This approach catches configuration errors early and keeps troubleshooting straightforward.
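For the JSON option mentioned above, downstream code usually walks a tree of blocks. The sketch below assumes each block is a dict with a `block_type`, a `bbox`, and an optional `children` list; those field names and the sample document are illustrative assumptions, so check them against the JSON your Marker version actually writes.

```python
def collect_blocks(node, block_type):
    """Depth-first search for blocks of one type in a nested block tree."""
    found = []
    if node.get("block_type") == block_type:
        found.append(node)
    # 'children' may be missing or None on leaf blocks.
    for child in node.get("children") or []:
        found.extend(collect_blocks(child, block_type))
    return found


# Hypothetical sample: one page holding a table and a text block.
sample = {
    "block_type": "Document",
    "children": [
        {
            "block_type": "Page",
            "children": [
                {"block_type": "Table", "bbox": [72, 100, 540, 300]},
                {"block_type": "Text", "bbox": [72, 320, 540, 400], "children": None},
            ],
        }
    ],
}

tables = collect_blocks(sample, "Table")
print([t["bbox"] for t in tables])
```

If only the text is needed, the Markdown output avoids this traversal entirely.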

Frequently Asked Questions

How accurate is Marker compared to other PDF extractors?

Marker uses deep learning models for layout detection, which produces significantly better results than rule-based extractors for complex documents with tables, multi-column layouts, and mixed content types. Accuracy depends on the PDF quality and complexity.

Does Marker support OCR for scanned PDFs?

Yes. Marker includes OCR capabilities for scanned pages. It automatically detects whether a page is text-based or image-based and applies OCR when needed. Higher-resolution scans produce better OCR results.
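Scan resolution can be estimated before conversion. As a rough sketch, the effective DPI is the pixel width of the page image divided by the physical page width in inches; the 300 DPI threshold is the rule of thumb from the pitfalls list, and `effective_dpi` is a hypothetical helper.

```python
def effective_dpi(pixel_width, page_width_inches):
    """Estimate scan resolution: pixels across the page / page width in inches."""
    if page_width_inches <= 0:
        raise ValueError("page width must be positive")
    return pixel_width / page_width_inches


# US Letter is 8.5 inches wide.
for pixels in (1275, 2550):
    dpi = effective_dpi(pixels, 8.5)
    verdict = "ok" if dpi >= 300 else "consider rescanning"
    print(f"{pixels}px wide -> {dpi:.0f} DPI ({verdict})")
```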

Can Marker extract images from PDFs?

Yes. Marker can extract embedded images and save them as separate files alongside the Markdown output. Image references in the Markdown point to the extracted image files.
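Because the Markdown refers to image files by path, it is worth checking that every referenced file travelled with the document. The sketch below scans standard Markdown image syntax and reports targets missing from a folder; `missing_images` is a hypothetical helper, and it assumes the references are local relative paths rather than URLs.

```python
import re
from pathlib import Path

# Markdown image syntax: ![alt text](target "optional title")
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")


def missing_images(markdown, base_dir):
    """Return image targets referenced in the Markdown that are absent on disk."""
    base_dir = Path(base_dir)
    targets = IMAGE_REF.findall(markdown)
    return [t for t in targets if not (base_dir / t).is_file()]
```

Running this after copying the output folder catches broken image links before the Markdown is published or indexed.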

Does Marker handle LaTeX equations?

Yes. Marker detects mathematical equations and converts them to LaTeX notation in the Markdown output. This works best with typeset equations; handwritten equations may not convert accurately.
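To spot-check converted equations, the LaTeX can be pulled back out of the Markdown. This sketch assumes display math is wrapped in `$$ ... $$` and inline math in `$ ... $`; that delimiter convention is an assumption about the output, and a literal dollar sign used as currency would produce false matches.

```python
import re

# Display math may span several lines; inline math stays on one line.
DISPLAY = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
INLINE = re.compile(r"\$([^$\n]+)\$")


def extract_math(markdown):
    """Return (display_equations, inline_equations) found in the Markdown."""
    display = [m.strip() for m in DISPLAY.findall(markdown)]
    # Remove display blocks first so '$$' is not read as inline delimiters.
    remainder = DISPLAY.sub(" ", markdown)
    inline = [m.strip() for m in INLINE.findall(remainder)]
    return display, inline


sample = "Energy is $E = mc^2$.\n\n$$\n\\frac{a}{b}\n$$\n"
display, inline = extract_math(sample)
print(display, inline)
```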

Can I use Marker in a RAG pipeline?

Yes. Marker's clean Markdown output is well-suited for chunking and embedding in RAG pipelines. The preserved heading structure helps create semantically meaningful chunks. Many teams use Marker as the first step in their document ingestion pipeline.
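Heading-based chunking can be sketched with the standard library alone. The function below starts a new chunk at each ATX heading (`#` through `######`) and ignores `#` lines inside fenced code; `chunk_by_heading` is a hypothetical helper, and a real pipeline would likely also cap chunk size before embedding.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def _has_content(chunk):
    """A chunk is worth keeping if it has a heading or any non-blank line."""
    return chunk["heading"] is not None or any(l.strip() for l in chunk["lines"])


def chunk_by_heading(markdown):
    """Split Markdown into chunks, each starting at a heading line."""
    chunks = []
    current = {"heading": None, "lines": []}
    in_fence = False
    for line in markdown.splitlines():
        # Toggle on fence lines so '# comment' inside code is not a heading.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            if _has_content(current):
                chunks.append(current)
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    if _has_content(current):
        chunks.append(current)
    return [
        {"heading": c["heading"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]
```

Keeping the heading alongside the text lets it be stored as metadata or prepended to the chunk before embedding.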

Source & Thanks

Created by Datalab. Licensed under GPL-3.0. datalab-to/marker — 33,000+ GitHub stars

Related Assets

MarkItDown — Convert Any File to Markdown for LLMs

Python library by Microsoft that converts PDF, DOCX, PPTX, XLSX, images, audio, and HTML to clean Markdown. Perfect for feeding documents into LLM context windows. 8,000+ stars.

Skills

Microsoft AI

Jina Reader — Convert Any URL to LLM-Ready Text

Convert any URL to clean, LLM-friendly markdown with a simple prefix. Just prepend r.jina.ai/ to any URL. Handles JS-rendered pages, PDFs, and images. 10K+ stars.

Skills

Script Depot

Stirling PDF — Self-Hosted PDF Editor & Toolkit

Stirling PDF is the #1 open-source PDF tool on GitHub. Merge, split, convert, compress, OCR, sign, and edit PDFs — all self-hosted with no data leaving your server.

Skills

Script Depot

Zerox — Zero-Shot PDF OCR for AI Pipelines

Extract text from any PDF using vision models as OCR. Zerox converts PDF pages to images then uses GPT-4o or Claude to extract clean markdown without training.

Skills

Script Depot