Zerox — Zero-Shot PDF OCR for AI Pipelines
Extract text from any PDF using vision models as the OCR engine. Zerox converts each PDF page to an image, then uses GPT-4o or Claude to produce clean markdown with no training required.
What it is
Zerox is a Python library that extracts text from PDFs by converting each page to an image and then using vision-capable LLMs (GPT-4o, Claude, etc.) as the OCR engine. Unlike traditional OCR tools that require trained models for specific fonts and layouts, Zerox leverages the visual understanding of large language models to read any document format without training.
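The page-to-image-to-LLM flow can be sketched with placeholder stages. The helper names below (`render_pages`, `vision_ocr`) are hypothetical stand-ins for illustration, not pyzerox internals:

```python
# Conceptual sketch of the Zerox pipeline: PDF -> page images -> vision LLM -> markdown.
# render_pages and vision_ocr are hypothetical placeholders, not pyzerox internals.

def render_pages(pdf_path: str) -> list[str]:
    # Real pipeline: rasterize each PDF page to an image file.
    return [f"{pdf_path}-page-{i}.png" for i in range(1, 3)]

def vision_ocr(image_path: str) -> str:
    # Real pipeline: send the page image to a vision LLM and get markdown back.
    return f"# Markdown for {image_path}"

def extract(pdf_path: str) -> list[str]:
    # One markdown string per page, in order.
    return [vision_ocr(img) for img in render_pages(pdf_path)]

print(len(extract("report.pdf")))  # 2 pages
```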
Data engineers processing scanned documents, researchers extracting text from academic papers, and developers building document processing pipelines use Zerox when traditional OCR produces poor results on complex layouts, tables, or handwritten content.
How it saves time or tokens
Traditional OCR pipelines require installing Tesseract, training custom models for specific document types, and writing post-processing logic to clean up OCR errors. Zerox replaces the entire pipeline with a single function call. Vision models handle complex layouts, tables, and multi-column documents that trip up conventional OCR. The output is clean markdown rather than raw text, reducing downstream parsing work.
How to use
- Install Zerox:

```shell
pip install py-zerox
```

- Extract text from a PDF:

```python
from pyzerox import zerox
import asyncio

async def main():
    result = await zerox(
        file_path='report.pdf',
        model='gpt-4o-mini',
    )
    for page in result.pages:
        print(page.content)

asyncio.run(main())
```
- The output is clean markdown for each page, ready for further processing or LLM consumption.
Example
```python
from pyzerox import zerox
import asyncio

async def extract_with_claude():
    result = await zerox(
        file_path='financial_report.pdf',
        model='claude-3-5-sonnet-20241022',
        custom_system_prompt='Extract all text preserving table structure as markdown tables.',
    )

    # Each page returns clean markdown
    for i, page in enumerate(result.pages):
        print(f'--- Page {i+1} ---')
        print(page.content)

    # Save all pages to a single file
    with open('extracted.md', 'w') as f:
        for page in result.pages:
            f.write(page.content + '\n\n')

asyncio.run(extract_with_claude())
```
Related on TokRepo
- Document Processing Tools -- explore tools for PDF and document handling
- AI Tools for Research -- discover tools for academic and data research workflows
Common pitfalls
- Vision model API calls cost more than traditional OCR. For large documents (100+ pages), estimate API costs before processing. GPT-4o-mini is cheaper but less accurate than GPT-4o on complex layouts.
- Zerox converts each page to an image before sending to the model. High-resolution settings produce better results but increase API costs and processing time.
- The library is async by default. Wrap calls in asyncio.run() for synchronous usage, or integrate into an existing async application.
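Before running a long document, it helps to estimate API spend from the page count. A minimal sketch, using the $0.01-0.02/page GPT-4o-mini figure above; the GPT-4o rate here is an assumption, so check current provider pricing before relying on either number:

```python
# Rough pre-flight cost estimate. Prices are illustrative assumptions,
# not official rates; verify against current provider pricing.
PER_PAGE_USD = {
    "gpt-4o-mini": 0.015,  # midpoint of the $0.01-0.02/page range above
    "gpt-4o": 0.10,        # assumed, roughly an order of magnitude higher
}

def estimate_cost(pages: int, model: str) -> float:
    # Linear estimate: every page is sent as one image to the vision model.
    return round(pages * PER_PAGE_USD[model], 2)

print(estimate_cost(50, "gpt-4o-mini"))  # 0.75
print(estimate_cost(100, "gpt-4o"))      # 10.0
```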
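The asyncio.run() pattern from the last bullet can be sketched with a stand-in coroutine; the `extract` function below is a placeholder for the real `await zerox(...)` call, so the example runs without API keys:

```python
import asyncio

async def extract(path: str) -> str:
    # Placeholder for: await zerox(file_path=path, model="gpt-4o-mini")
    await asyncio.sleep(0)
    return f"markdown for {path}"

def extract_sync(path: str) -> str:
    # asyncio.run() creates a fresh event loop, runs the coroutine to
    # completion, and closes the loop. Call it only from synchronous code;
    # inside an already-running loop, await the coroutine directly instead.
    return asyncio.run(extract(path))

print(extract_sync("report.pdf"))  # markdown for report.pdf
```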
Frequently Asked Questions

Which models does Zerox support?
Zerox supports any vision-capable LLM, including GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, and other models that accept image inputs. You specify the model name in the function call, and Zerox handles the image conversion and API interaction.

How does Zerox compare to Tesseract?
Tesseract is a traditional OCR engine that runs locally without API costs but struggles with complex layouts, tables, and handwritten text. Zerox uses vision LLMs that handle these cases much better but requires API calls with associated costs. Zerox produces markdown output, while Tesseract outputs raw text.

Can I customize how text is extracted?
Yes. Zerox accepts a custom_system_prompt parameter that lets you instruct the vision model on how to handle the extraction. For example, you can ask it to preserve table structures as markdown tables or to extract only specific sections of each page.

How much does it cost to process a document?
Cost depends on the model and page count. Each page is sent as an image to the vision model API. GPT-4o-mini costs roughly $0.01-0.02 per page, while GPT-4o costs more. For a 50-page document, expect $0.50-1.00 with GPT-4o-mini.

Does Zerox work on scanned or handwritten documents?
Yes. Because Zerox uses vision models that understand images, it handles scanned documents, photographs of text, and handwritten content. Accuracy depends on the vision model's capabilities and the image quality, but results are generally better than traditional OCR for these difficult cases.
Citations (3)
- Zerox GitHub — Vision model-based PDF OCR without training data
- OpenAI GPT-4o — GPT-4o vision capabilities for document understanding
- Anthropic Claude Vision — Claude vision model for image understanding
Source & Thanks
Created by getomni-ai. Licensed under MIT.
getomni-ai/zerox — 7k+ stars