Zerox — Zero-Shot PDF OCR for AI Pipelines
Extract text from any PDF using vision models as OCR. Zerox converts PDF pages to images then uses GPT-4o or Claude to extract clean markdown without training.
这个资产会安全暂存
这个资产会先安全暂存。复制的指令会要求 Agent 读取暂存文件,并在激活脚本、MCP 配置或全局配置前先确认。
npx -y tokrepo@latest install 3ac555d9-d75c-4208-ba46-974e4a717234 --target codex先暂存文件;激活前需要读取暂存 README 和安装计划。
What it is
Zerox is a Python library that extracts text from PDFs by converting each page to an image and then using vision-capable LLMs (GPT-4o, Claude, etc.) as the OCR engine. Unlike traditional OCR tools that require trained models for specific fonts and layouts, Zerox leverages the visual understanding of large language models to read any document format without training.
Data engineers processing scanned documents, researchers extracting text from academic papers, and developers building document processing pipelines use Zerox when traditional OCR produces poor results on complex layouts, tables, or handwritten content.
How it saves time or tokens
Traditional OCR pipelines require installing Tesseract, training custom models for specific document types, and writing post-processing logic to clean up OCR errors. Zerox replaces the entire pipeline with a single function call. Vision models handle complex layouts, tables, and multi-column documents that trip up conventional OCR. The output is clean markdown rather than raw text, reducing downstream parsing work.
How to use
- Install Zerox:
pip install py-zerox
- Extract text from a PDF:
from pyzerox import zerox
import asyncio
async def main():
result = await zerox(
file_path='report.pdf',
model='gpt-4o-mini',
)
for page in result.pages:
print(page.content)
asyncio.run(main())
- The output is clean markdown for each page, ready for further processing or LLM consumption.
Example
from pyzerox import zerox
import asyncio
async def extract_with_claude():
result = await zerox(
file_path='financial_report.pdf',
model='claude-3-5-sonnet-20241022',
custom_system_prompt='Extract all text preserving table structure as markdown tables.',
)
# Each page returns clean markdown
for i, page in enumerate(result.pages):
print(f'--- Page {i+1} ---')
print(page.content)
# Save all pages to a single file
with open('extracted.md', 'w') as f:
for page in result.pages:
f.write(page.content + '\n\n')
asyncio.run(extract_with_claude())
Related on TokRepo
- Document Processing Tools -- explore tools for PDF and document handling
- AI Tools for Research -- discover tools for academic and data research workflows
Common pitfalls
- Vision model API calls cost more than traditional OCR. For large documents (100+ pages), estimate API costs before processing. GPT-4o-mini is cheaper but less accurate than GPT-4o on complex layouts.
- Zerox converts each page to an image before sending to the model. High-resolution settings produce better results but increase API costs and processing time.
- The library is async by default. Wrap calls in asyncio.run() for synchronous usage, or integrate into an existing async application.
常见问题
Zerox supports any vision-capable LLM including GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, and other models that accept image inputs. You specify the model name in the function call, and Zerox handles the image conversion and API interaction.
Tesseract is a traditional OCR engine that works locally without API costs but struggles with complex layouts, tables, and handwritten text. Zerox uses vision LLMs that handle these cases much better but requires API calls with associated costs. Zerox produces markdown output while Tesseract outputs raw text.
Yes. Zerox accepts a custom_system_prompt parameter that lets you instruct the vision model on how to handle the extraction. For example, you can ask it to preserve table structures as markdown tables or extract only specific sections of each page.
Cost depends on the model and page count. Each page is sent as an image to the vision model API. GPT-4o-mini costs roughly $0.01-0.02 per page, while GPT-4o costs more. For a 50-page document, expect $0.50-1.00 with GPT-4o-mini.
Yes. Because Zerox uses vision models that understand images, it handles scanned documents, photographs of text, and handwritten content. The accuracy depends on the vision model capabilities and image quality. Results are generally better than traditional OCR for these difficult cases.
引用来源 (3)
- Zerox GitHub— Vision model-based PDF OCR without training data
- OpenAI GPT-4o— GPT-4o vision capabilities for document understanding
- Anthropic Claude Vision— Claude vision model for image understanding
来源与感谢
getomni-ai/zerox — 7k+ stars, MIT
讨论
相关资产
Segment Anything (SAM) — Zero-Shot Image Segmentation by Meta
A foundation model for promptable image segmentation that can segment any object in any image without additional training. SAM powers interactive annotation, downstream vision tasks, and zero-shot transfer.
Index TTS — Industrial Zero-Shot Text-to-Speech System
A controllable and efficient zero-shot text-to-speech system built for industrial use, supporting voice cloning and cross-lingual synthesis with high-quality output.
tsup — Bundle Your TypeScript Library with Zero Config
tsup is the zero-config TypeScript bundler built on esbuild. It emits ESM, CJS, IIFE, and DTS files from a single command — the fastest way to ship a polished TypeScript package without writing Rollup config.
vanilla-extract — Zero-Runtime Type-Safe CSS in TypeScript
A CSS-in-TypeScript framework that generates static CSS files at build time, giving you type-safe style authoring with zero runtime cost and standard CSS output.