Marker — Convert PDF to Markdown with High Accuracy
Fast, accurate PDF to Markdown + JSON converter. Handles tables, images, equations, code blocks, and multi-column layouts. GPU-accelerated. 33K+ GitHub stars.
Ready-to-run agent install
This asset can be installed after the agent chooses its runtime, checks the plan, and runs the matching command.
npx -y tokrepo@latest install 42976daf-a56a-4152-9afb-d5b00d130a08 --target codex
Run after dry-run confirms the install plan.
What it is
Marker is a Python tool that converts PDF documents to Markdown and JSON with high accuracy. It handles complex layouts including tables, images, equations, code blocks, and multi-column text. The tool uses deep learning models for layout detection and OCR, producing clean Markdown that preserves the document's logical structure.
Marker is designed for developers, researchers, and data engineers who need to extract structured text from PDFs for RAG pipelines, document processing, or content migration.
How it saves time or tokens
Traditional PDF extraction tools produce messy output that requires extensive post-processing. Marker's ML-based approach understands document layout, correctly identifies tables as tables, equations as LaTeX, and code blocks as fenced code. This means the extracted Markdown is usable immediately, saving hours of manual cleanup. For RAG pipelines, higher-quality extraction leads to better retrieval and fewer hallucinations in AI responses.
How to use
- Install Marker:
pip install marker-pdf
- Convert a single PDF:
marker_single input.pdf --output_dir output/ --output_format markdown
- Or use the Python API:
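from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

# PdfConverter needs the loaded models passed in as artifact_dict
converter = PdfConverter(artifact_dict=create_model_dict())
result = converter('input.pdf')
print(result.markdown)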
Example
Batch processing multiple PDFs:
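import os

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

# Load the models once and reuse the same converter for every file
converter = PdfConverter(artifact_dict=create_model_dict())
pdf_dir = './papers/'
out_dir = './output/'
os.makedirs(out_dir, exist_ok=True)  # the output folder must exist before writing

for filename in os.listdir(pdf_dir):
    if filename.endswith('.pdf'):
        result = converter(os.path.join(pdf_dir, filename))
        md_path = os.path.join(out_dir, os.path.splitext(filename)[0] + '.md')
        with open(md_path, 'w', encoding='utf-8') as f:
            f.write(result.markdown)
        print(f'Converted {filename}: {len(result.markdown)} chars')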
Marker preserves headings, lists, tables, and code blocks. A research paper with a complex multi-column layout converts to a single-column Markdown document with the correct heading hierarchy.
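For larger folders it can help to decide which files still need converting before any model is loaded. The sketch below is independent of Marker and uses only the standard library; `plan_conversions` is a hypothetical helper name, and the rule "skip a PDF when a matching .md already exists" is an assumption about how reruns should behave, not something Marker does itself.

```python
from pathlib import Path


def plan_conversions(pdf_dir, out_dir):
    """Return (pdf_path, md_path) pairs for PDFs that have no Markdown output yet."""
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    jobs = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        # Same naming rule as the batch loop: paper.pdf -> paper.md
        md_path = out_dir / (pdf_path.stem + ".md")
        if not md_path.exists():  # skip files converted on an earlier run
            jobs.append((pdf_path, md_path))
    return jobs
```

Each returned pair can then be fed to the converter loop above, so an interrupted batch can resume without redoing finished files.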
Related on TokRepo
- Documentation tools — Browse document processing tools
- RAG tools — Explore retrieval-augmented generation tools
Common pitfalls
- Not planning for GPU acceleration on large batch jobs. Marker falls back to CPU when no GPU is available but is significantly faster with CUDA. For production workloads, make sure a CUDA-enabled PyTorch build is installed.
- Expecting perfect extraction from scanned documents. Marker uses OCR for scanned pages, but OCR quality depends on scan resolution. High-resolution scans (300+ DPI) produce much better results.
- Not checking the output format option. Marker supports both Markdown and JSON output. JSON output includes bounding boxes and metadata that are useful for downstream processing but not needed for simple text extraction.
- Starting with an overly complex configuration instead of defaults. Begin with the minimal setup, verify it works, then customize incrementally. This approach catches configuration errors early and keeps troubleshooting straightforward.
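To illustrate why the JSON output matters for downstream processing, here is a minimal sketch that walks a nested block tree and collects every block of one type along with its bounding box. The field names `block_type`, `children`, and `bbox` are assumptions about the schema and should be checked against real output; the sample data is hand-written, not produced by Marker, and `collect_blocks` is a hypothetical helper.

```python
def collect_blocks(node, block_type):
    """Recursively gather nodes whose 'block_type' matches, depth-first."""
    found = []
    if node.get("block_type") == block_type:
        found.append(node)
    # 'children' may be missing or None on leaf blocks (assumed schema)
    for child in node.get("children") or []:
        found.extend(collect_blocks(child, block_type))
    return found


# Hand-written sample shaped like the assumed schema; not real Marker output.
sample = {
    "block_type": "Document",
    "children": [
        {
            "block_type": "Page",
            "children": [
                {"block_type": "SectionHeader", "bbox": [72, 60, 300, 80], "children": None},
                {"block_type": "Table", "bbox": [72, 100, 540, 320], "children": None},
            ],
        },
    ],
}

tables = collect_blocks(sample, "Table")
print([t["bbox"] for t in tables])
```

For plain text extraction none of this is needed, which is why Markdown is usually the simpler choice of output format.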
Frequently Asked Questions
How accurate is Marker compared to other PDF extractors?
Marker uses deep learning models for layout detection, which produce significantly better results than rule-based extractors on complex documents with tables, multi-column layouts, and mixed content types. Accuracy depends on the quality and complexity of the PDF.
Does Marker work with scanned PDFs?
Yes. Marker includes OCR capabilities for scanned pages. It automatically detects whether a page is text-based or image-based and applies OCR when needed. Higher-resolution scans produce better OCR results.
Can Marker extract images from a PDF?
Yes. Marker can extract embedded images and save them as separate files alongside the Markdown output. Image references in the Markdown point to the extracted image files.
Does Marker handle mathematical equations?
Yes. Marker detects mathematical equations and converts them to LaTeX notation in the Markdown output. This works best with typeset equations; handwritten equations may not convert accurately.
Is Marker's output suitable for RAG pipelines?
Yes. Marker's clean Markdown output is well-suited for chunking and embedding in RAG pipelines. The preserved heading structure helps create semantically meaningful chunks. Many teams use Marker as the first step in their document ingestion pipeline.
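As a rough illustration of heading-based chunking, the sketch below splits a Markdown string into one chunk per heading and records the heading path for each chunk. It is a generic standard-library example, not part of Marker; `chunk_by_heading` is a hypothetical helper, and it only recognizes `#`-style headings and triple-backtick fences.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown):
    """Split Markdown into chunks, one per heading, keeping the heading path."""
    chunks, path, body = [], [], []
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": " > ".join(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # '#' lines inside code fences are not headings
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at the same or a deeper level
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Storing the heading path with each chunk gives the retriever some context about where a passage came from, which is the benefit of preserved heading structure described above.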
Source & Thanks
Created by Datalab. Licensed under GPL-3.0. datalab-to/marker — 33,000+ GitHub stars
Related Assets
MarkItDown — Convert Any File to Markdown for LLMs
Python library by Microsoft that converts PDF, DOCX, PPTX, XLSX, images, audio, and HTML to clean Markdown. Perfect for feeding documents into LLM context windows. 8,000+ stars.
Jina Reader — Convert Any URL to LLM-Ready Text
Convert any URL to clean, LLM-friendly markdown with a simple prefix. Just prepend r.jina.ai/ to any URL. Handles JS-rendered pages, PDFs, and images. 10K+ stars.
Stirling PDF — Self-Hosted PDF Editor & Toolkit
Stirling PDF is the #1 open-source PDF tool on GitHub. Merge, split, convert, compress, OCR, sign, and edit PDFs — all self-hosted with no data leaving your server.
Zerox — Zero-Shot PDF OCR for AI Pipelines
Extract text from any PDF using vision models as OCR. Zerox converts PDF pages to images then uses GPT-4o or Claude to extract clean markdown without training.