What is Zerox?
Zerox is a zero-shot PDF OCR tool that uses vision language models instead of traditional OCR engines. It converts each PDF page into an image, sends it to a vision model (GPT-4o, Claude, Gemini), and returns the page content as clean markdown. It needs no training, templates, or layout-specific configuration, so the same call can be used across different document layouts.
Answer-Ready: Zerox is zero-shot PDF OCR using vision models. Converts PDF pages to images, extracts clean markdown via GPT-4o or Claude. No training or templates needed. Handles complex layouts, tables, and handwriting. 7k+ GitHub stars.
Best for: AI teams processing PDFs for RAG or data extraction. Works with: OpenAI GPT-4o, Anthropic Claude, Google Gemini. Setup time: Under 2 minutes.
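The page-by-page flow described above (page image in, markdown out) can be sketched in a few lines. This is an illustration of the idea, not Zerox's actual implementation: `pages_to_markdown`, `describe_page`, and `fake_model` are hypothetical names, and the model call is replaced by a stub so the example runs without an API key.

```python
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical stand-in for a vision-model request: takes one page image
# (as bytes) and returns the markdown the model produced for that page.
PageToMarkdown = Callable[[bytes], Awaitable[str]]


async def pages_to_markdown(
    page_images: List[bytes],
    describe_page: PageToMarkdown,
    concurrency: int = 10,
) -> str:
    """Send each page image to the model and join the results in page order."""
    limit = asyncio.Semaphore(concurrency)

    async def convert(image: bytes) -> str:
        async with limit:  # cap the number of in-flight model requests
            return await describe_page(image)

    # gather() returns results in input order, so pages stay in sequence
    pages = await asyncio.gather(*(convert(img) for img in page_images))
    return "\n\n".join(pages)


async def fake_model(image: bytes) -> str:
    # Placeholder for a real vision-model call (assumption for this sketch)
    return f"# Page with {len(image)} bytes"


markdown = asyncio.run(pages_to_markdown([b"abc", b"de"], fake_model))
print(markdown)
```

Processing pages concurrently while keeping their order is the main design point here: each page is an independent request, so a long PDF does not have to be converted one page at a time.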
Core Features
1. Multiple Model Support
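```python
from pyzerox import zerox  # installed with: pip install py-zerox

# Run these calls inside an async function. Each provider's API key is read
# from the environment (for example OPENAI_API_KEY).

# OpenAI
result = await zerox(file_path="doc.pdf", model="gpt-4o-mini")
# Anthropic Claude
result = await zerox(file_path="doc.pdf", model="claude-sonnet-4-20250514")
# Google Gemini
result = await zerox(file_path="doc.pdf", model="gemini/gemini-2.0-flash")
```

2. Page Selection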
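`select_pages` takes 1-indexed page numbers. When the pages come from user input such as "1-3,5,10", a small helper can expand that text into the list form used in this section. The sketch below is illustrative only: `parse_page_spec` is a hypothetical name and is not part of Zerox.

```python
from typing import List


def parse_page_spec(spec: str) -> List[int]:
    """Expand a spec like "1-3,5,10" into sorted, unique 1-indexed page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_page_spec("1-3, 5, 10"))  # [1, 2, 3, 5, 10]
```

The returned list can be passed directly as the `select_pages` argument.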
```python
from pyzerox import zerox

# Call from inside an async function.
result = await zerox(
    file_path="long_report.pdf",
    model="gpt-4o-mini",
    select_pages=[1, 3, 5, 10],  # Only process these pages (1-indexed)
)
```

3. Node.js SDK
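```sh
npm install zerox
```

```js
const { zerox } = require("zerox");

// CommonJS has no top-level await, so wrap the call in an async function.
(async () => {
  const result = await zerox({
    filePath: "report.pdf",
    openaiAPIKey: process.env.OPENAI_API_KEY,
  });
})();
```

4. Custom Prompts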
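```python
from pyzerox import zerox

# Call from inside an async function.
result = await zerox(
    file_path="invoice.pdf",
    model="gpt-4o-mini",
    # Replaces the default system prompt sent to the model
    custom_system_prompt="Extract all line items as a markdown table with columns: Item, Qty, Price, Total.",
)
```

Zerox vs Traditional OCR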
| Feature | Zerox | Tesseract | AWS Textract |
|---|---|---|---|
| Setup | pip install | System deps | AWS account |
| Complex layouts | Excellent | Poor | Good |
| Tables | Markdown tables | Raw text | JSON |
| Handwriting | Yes | Limited | Yes |
| Cost | Per API call | Free | Per page |
| Training needed | None | Sometimes | No |
FAQ
Q: How much does it cost? A: Depends on the vision model. GPT-4o-mini costs roughly $0.01 per page, and Claude is in a similar range. Self-hosted models have no per-call API fee, though you still pay for the hardware they run on.
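For budgeting, the per-page figure can be turned into a rough estimate. The sketch below assumes the approximate $0.01 per page quoted above as its default; actual cost depends on the model, image size, and how much text each page produces, and `estimate_cost` is a hypothetical helper, not part of Zerox.

```python
def estimate_cost(num_pages: int, cost_per_page: float = 0.01) -> float:
    """Rough API cost in dollars: page count times an assumed per-page price."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return round(num_pages * cost_per_page, 2)


print(estimate_cost(250))        # 2.5
print(estimate_cost(250, 0.02))  # 5.0
```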
Q: Can it handle scanned documents? A: Yes, that is its primary use case. Vision models can read scanned text, handwriting, and complex layouts.
Q: How does accuracy compare to Tesseract? A: Significantly better on complex layouts, tables, and mixed content. Tesseract may be better for simple, clean text.
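One way to check this on your own documents is to score each tool's output against a hand-typed reference for a few pages. The sketch below uses Python's standard `difflib` to compute a character-level similarity ratio. It is a simple stand-in for a proper OCR metric such as character error rate, and `similarity` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    # Collapse runs of whitespace so line-wrapping differences are ignored
    return " ".join(text.split())


def similarity(expected: str, actual: str) -> float:
    """Return a ratio from 0.0 (nothing in common) to 1.0 (identical)."""
    return SequenceMatcher(None, _normalize(expected), _normalize(actual)).ratio()


# Placeholder strings for illustration; substitute real OCR output and a
# manually verified transcription of the same page.
reference = "Total: 42"
ocr_output = "Tota1: 42"
print(similarity(reference, ocr_output))
```

Running the same comparison on Zerox output and Tesseract output for a representative sample of pages gives a like-for-like number for your own document mix.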