Skills2026年4月8日·1 分钟阅读

Zerox — Zero-Shot PDF OCR for AI Pipelines

Extract text from any PDF using vision models as OCR. Zerox converts PDF pages to images then uses GPT-4o or Claude to extract clean markdown without training.

Script Depot · Community

Agent 就绪

这个资产会安全暂存

这个资产会先安全暂存。复制的指令会要求 Agent 读取暂存文件，并在激活脚本、MCP 配置或全局配置前先确认。

Stage only · 29/100策略：需暂存

Agent 入口

任意 MCP/CLI Agent

类型

Skill

安装

Stage only

信任

信任等级：Established

入口

Zerox — Zero-Shot PDF OCR for AI Pipelines

安全暂存命令

npx -y tokrepo@latest install 3ac555d9-d75c-4208-ba46-974e4a717234 --target codex

先暂存文件；激活前需要读取暂存 README 和安装计划。

TL;DR

Zerox converts PDF pages to images and uses vision LLMs to extract clean markdown text without any OCR training data.

§01

What it is

Zerox is a Python library that extracts text from PDFs by converting each page to an image and then using vision-capable LLMs (GPT-4o, Claude, etc.) as the OCR engine. Unlike traditional OCR tools that require trained models for specific fonts and layouts, Zerox leverages the visual understanding of large language models to read any document format without training.

Data engineers processing scanned documents, researchers extracting text from academic papers, and developers building document processing pipelines use Zerox when traditional OCR produces poor results on complex layouts, tables, or handwritten content.

§02

How it saves time or tokens

Traditional OCR pipelines require installing Tesseract, training custom models for specific document types, and writing post-processing logic to clean up OCR errors. Zerox replaces the entire pipeline with a single function call. Vision models handle complex layouts, tables, and multi-column documents that trip up conventional OCR. The output is clean markdown rather than raw text, reducing downstream parsing work.

§03

How to use

Install Zerox:

pip install py-zerox

Extract text from a PDF:

from pyzerox import zerox
import asyncio

async def main():
    result = await zerox(
        file_path='report.pdf',
        model='gpt-4o-mini',
    )
    for page in result.pages:
        print(page.content)

asyncio.run(main())

The output is clean markdown for each page, ready for further processing or LLM consumption.

§04

Example

from pyzerox import zerox
import asyncio

async def extract_with_claude():
    result = await zerox(
        file_path='financial_report.pdf',
        model='claude-3-5-sonnet-20241022',
        custom_system_prompt='Extract all text preserving table structure as markdown tables.',
    )
    # Each page returns clean markdown
    for i, page in enumerate(result.pages):
        print(f'--- Page {i+1} ---')
        print(page.content)
    
    # Save all pages to a single file
    with open('extracted.md', 'w') as f:
        for page in result.pages:
            f.write(page.content + '\n\n')

asyncio.run(extract_with_claude())

§05

Related on TokRepo

Document Processing Tools -- explore tools for PDF and document handling
AI Tools for Research -- discover tools for academic and data research workflows

§06

Common pitfalls

Vision model API calls cost more than traditional OCR. For large documents (100+ pages), estimate API costs before processing. GPT-4o-mini is cheaper but less accurate than GPT-4o on complex layouts.
Zerox converts each page to an image before sending to the model. High-resolution settings produce better results but increase API costs and processing time.
The library is async by default. Wrap calls in asyncio.run() for synchronous usage, or integrate into an existing async application.

常见问题

What models does Zerox support for OCR?+

Zerox supports any vision-capable LLM including GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, and other models that accept image inputs. You specify the model name in the function call, and Zerox handles the image conversion and API interaction.

How does Zerox compare to Tesseract OCR?+

Tesseract is a traditional OCR engine that works locally without API costs but struggles with complex layouts, tables, and handwritten text. Zerox uses vision LLMs that handle these cases much better but requires API calls with associated costs. Zerox produces markdown output while Tesseract outputs raw text.

Can I customize the extraction prompt?+

Yes. Zerox accepts a custom_system_prompt parameter that lets you instruct the vision model on how to handle the extraction. For example, you can ask it to preserve table structures as markdown tables or extract only specific sections of each page.

How much does it cost to process a document with Zerox?+

Cost depends on the model and page count. Each page is sent as an image to the vision model API. GPT-4o-mini costs roughly $0.01-0.02 per page, while GPT-4o costs more. For a 50-page document, expect $0.50-1.00 with GPT-4o-mini.

Does Zerox work with scanned documents and handwriting?+

Yes. Because Zerox uses vision models that understand images, it handles scanned documents, photographs of text, and handwritten content. The accuracy depends on the vision model capabilities and image quality. Results are generally better than traditional OCR for these difficult cases.

引用来源 (3)

Zerox GitHub— Vision model-based PDF OCR without training data
OpenAI GPT-4o— GPT-4o vision capabilities for document understanding
Anthropic Claude Vision— Claude vision model for image understanding

🙏

来源与感谢

getomni-ai/zerox — 7k+ stars, MIT

讨论

登录后参与讨论。

还没有评论，来写第一条吧。

Zerox — Zero-Shot PDF OCR for AI Pipelines

这个资产会安全暂存

What it is

How it saves time or tokens

How to use

Example

Related on TokRepo

Common pitfalls

常见问题

引用来源 (3)

TokRepo 相关

来源与感谢

讨论

相关资产

Segment Anything (SAM) — Zero-Shot Image Segmentation by Meta

Index TTS — Industrial Zero-Shot Text-to-Speech System

VoiceCraft — Zero-Shot Speech Editing and Text-to-Speech

Index-TTS — Industrial-Grade Zero-Shot Text-to-Speech