Esta página se muestra en inglés. Una traducción al español está en curso.
ScriptsApr 8, 2026·2 min de lectura

Zerox — Zero-Shot PDF OCR for AI Pipelines

Extract text from any PDF using vision models as OCR. Zerox converts PDF pages to images then uses GPT-4o or Claude to extract clean markdown without training.

What is Zerox?

Zerox is a zero-shot PDF OCR tool that uses vision language models instead of traditional OCR engines. It converts each PDF page into an image, sends it to a vision model (GPT-4o, Claude, Gemini), and extracts clean markdown text. No training, no templates, no configuration — it just works on any document layout.

Answer-Ready: Zerox is zero-shot PDF OCR using vision models. Converts PDF pages to images, extracts clean markdown via GPT-4o or Claude. No training or templates needed. Handles complex layouts, tables, and handwriting. 7k+ GitHub stars.

Best for: AI teams processing PDFs for RAG or data extraction. Works with: OpenAI GPT-4o, Anthropic Claude, Google Gemini. Setup time: Under 2 minutes.

Core Features

1. Multiple Model Support

# OpenAI
result = await zerox(file_path="doc.pdf", model="gpt-4o-mini")

# Anthropic Claude
result = await zerox(file_path="doc.pdf", model="claude-sonnet-4-20250514")

# Google Gemini
result = await zerox(file_path="doc.pdf", model="gemini/gemini-2.0-flash")

2. Page Selection

result = await zerox(
    file_path="long_report.pdf",
    model="gpt-4o-mini",
    select_pages=[1, 3, 5, 10],  # Only process specific pages
)

3. Node.js SDK

npm install zerox
const { zerox } = require("zerox");
const result = await zerox({
  filePath: "report.pdf",
  openaiAPIKey: process.env.OPENAI_API_KEY,
});

4. Custom Prompts

result = await zerox(
    file_path="invoice.pdf",
    model="gpt-4o-mini",
    custom_system_prompt="Extract all line items as a markdown table with columns: Item, Qty, Price, Total.",
)

Zerox vs Traditional OCR

Feature Zerox Tesseract AWS Textract
Setup pip install System deps AWS account
Complex layouts Excellent Poor Good
Tables Markdown tables Raw text JSON
Handwriting Yes Limited Yes
Cost Per API call Free Per page
Training needed None Sometimes No

FAQ

Q: How much does it cost? A: Depends on the vision model. GPT-4o-mini is ~$0.01/page, Claude is similar. Self-hosted models are free.

Q: Can it handle scanned documents? A: Yes, that is its primary use case. Vision models can read scanned text, handwriting, and complex layouts.

Q: How does accuracy compare to Tesseract? A: Significantly better on complex layouts, tables, and mixed content. Tesseract may be better for simple, clean text.

🙏

Fuente y agradecimientos

Created by getomni-ai. Licensed under MIT.

getomni-ai/zerox — 7k+ stars

Discusión

Inicia sesión para unirte a la discusión.
Aún no hay comentarios. Sé el primero en compartir tus ideas.

Activos relacionados