Scripts · Apr 8, 2026 · 2 min read

Zerox — Zero-Shot PDF OCR for AI Pipelines

Extract text from any PDF using vision models as the OCR engine. Zerox converts PDF pages to images, then uses GPT-4o or Claude to extract clean markdown without training.

TL;DR
Zerox converts PDF pages to images and uses vision LLMs to extract clean markdown text without any OCR training data.
§01

What it is

Zerox is a Python library that extracts text from PDFs by converting each page to an image and then using vision-capable LLMs (GPT-4o, Claude, etc.) as the OCR engine. Unlike traditional OCR tools that require trained models for specific fonts and layouts, Zerox leverages the visual understanding of large language models to read any document format without training.

Data engineers processing scanned documents, researchers extracting text from academic papers, and developers building document processing pipelines use Zerox when traditional OCR produces poor results on complex layouts, tables, or handwritten content.

§02

How it saves time or tokens

Traditional OCR pipelines require installing Tesseract, training custom models for specific document types, and writing post-processing logic to clean up OCR errors. Zerox replaces the entire pipeline with a single function call. Vision models handle complex layouts, tables, and multi-column documents that trip up conventional OCR. The output is clean markdown rather than raw text, reducing downstream parsing work.
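The claim that markdown output reduces downstream parsing can be seen in a short sketch: a markdown table, like those vision models emit for tabular pages, parses with plain string handling and no OCR-error cleanup pass. The sample string and `parse_markdown_table` helper are illustrative, not part of Zerox.

```python
# Illustrative only: a markdown table like those Zerox emits for tabular pages.
sample_page = """| Quarter | Revenue |
| --- | --- |
| Q1 | $1.2M |
| Q2 | $1.5M |"""

def parse_markdown_table(md: str) -> list[dict]:
    """Parse a simple markdown table into a list of row dicts."""
    lines = [l.strip() for l in md.splitlines() if l.strip().startswith("|")]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the |---| separator row
        cells = [c.strip() for c in line.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows

rows = parse_markdown_table(sample_page)
```

Compare this with raw OCR text, where column boundaries are usually lost and must be reconstructed heuristically.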

§03

How to use

  1. Install Zerox:
pip install py-zerox
  2. Extract text from a PDF:
from pyzerox import zerox
import asyncio

async def main():
    result = await zerox(
        file_path='report.pdf',
        model='gpt-4o-mini',
    )
    for page in result.pages:
        print(page.content)

asyncio.run(main())
  3. The output is clean markdown for each page, ready for further processing or LLM consumption.
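For further processing, per-page content is often merged into a single document. A minimal sketch of that step, using `SimpleNamespace` stand-ins in place of real Zerox page objects (the real ones come from a `zerox()` call and expose `.content` as in the loop above):

```python
from types import SimpleNamespace

def join_pages(pages, separator="\n\n---\n\n"):
    """Join per-page markdown content into one document with page breaks."""
    return separator.join(p.content for p in pages)

# Hypothetical stand-ins for result.pages, for illustration only.
pages = [SimpleNamespace(content="# Page one"), SimpleNamespace(content="# Page two")]
doc = join_pages(pages)
```

The horizontal-rule separator keeps page boundaries visible in the merged markdown, which helps when feeding the result back to an LLM.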
§04

Example

from pyzerox import zerox
import asyncio

async def extract_with_claude():
    result = await zerox(
        file_path='financial_report.pdf',
        model='claude-3-5-sonnet-20241022',
        custom_system_prompt='Extract all text preserving table structure as markdown tables.',
    )
    # Each page returns clean markdown
    for i, page in enumerate(result.pages):
        print(f'--- Page {i+1} ---')
        print(page.content)
    
    # Save all pages to a single file
    with open('extracted.md', 'w') as f:
        for page in result.pages:
            f.write(page.content + '\n\n')

asyncio.run(extract_with_claude())
§05

Common pitfalls

  • Vision model API calls cost more than traditional OCR. For large documents (100+ pages), estimate API costs before processing. GPT-4o-mini is cheaper but less accurate than GPT-4o on complex layouts.
  • Zerox converts each page to an image before sending to the model. High-resolution settings produce better results but increase API costs and processing time.
  • The library is async by default. Wrap calls in asyncio.run() for synchronous usage, or integrate into an existing async application.
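The first pitfall above can be turned into a quick back-of-envelope check before processing a large document. A hedged sketch; the per-page figures are rough assumptions for illustration, not published pricing:

```python
# Rough per-page cost assumptions (USD). Actual cost varies with image
# resolution, tokens per page, and current API pricing.
PER_PAGE_COST = {
    "gpt-4o-mini": 0.015,  # assumption, mid-range of the ~$0.01-0.02 estimate
    "gpt-4o": 0.10,        # assumption for illustration
}

def estimate_cost(page_count: int, model: str) -> float:
    """Back-of-envelope API cost estimate before running Zerox."""
    return round(page_count * PER_PAGE_COST[model], 2)

cost = estimate_cost(50, "gpt-4o-mini")  # 50 pages at the assumed rate
```

Running the check first makes the cheaper-model trade-off concrete: the same 50 pages through GPT-4o would cost several times more under these assumptions.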

Frequently Asked Questions

What models does Zerox support for OCR?

Zerox supports any vision-capable LLM including GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, and other models that accept image inputs. You specify the model name in the function call, and Zerox handles the image conversion and API interaction.

How does Zerox compare to Tesseract OCR?

Tesseract is a traditional OCR engine that works locally without API costs but struggles with complex layouts, tables, and handwritten text. Zerox uses vision LLMs that handle these cases much better but requires API calls with associated costs. Zerox produces markdown output while Tesseract outputs raw text.

Can I customize the extraction prompt?

Yes. Zerox accepts a custom_system_prompt parameter that lets you instruct the vision model on how to handle the extraction. For example, you can ask it to preserve table structures as markdown tables or extract only specific sections of each page.

How much does it cost to process a document with Zerox?

Cost depends on the model and page count. Each page is sent as an image to the vision model API. GPT-4o-mini costs roughly $0.01-0.02 per page, while GPT-4o costs more. For a 50-page document, expect $0.50-1.00 with GPT-4o-mini.

Does Zerox work with scanned documents and handwriting?

Yes. Because Zerox uses vision models that understand images, it handles scanned documents, photographs of text, and handwritten content. The accuracy depends on the vision model capabilities and image quality. Results are generally better than traditional OCR for these difficult cases.


Source & Thanks

Created by getomni-ai. Licensed under MIT.

getomni-ai/zerox — 7k+ stars
