Document Processing

Best AI Document Processing Tools for 2026

OCR engines, PDF parsers, document understanding, and data extraction pipelines. Turn unstructured documents into structured, searchable data.

30 tools

MinerU — Extract LLM-Ready Data from Any Document

Convert PDFs, scans, and complex documents into clean Markdown or JSON for RAG and LLM pipelines. 57K+ GitHub stars.

Script Depot · 284 · Scripts

Unstructured — Document ETL for LLM Pipelines

Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.

MCP Hub · 269 · MCP Configs

Claude Official Skill: PDF — Read, Create & Edit PDFs

Claude Code skill for PDF files. Read content, extract data, create new PDFs, merge documents, and convert formats. Activates automatically.

Anthropic · 254 · Skills

Zerox — Zero-Shot PDF OCR for AI Pipelines

Extract text from any PDF using vision models as OCR. Zerox converts PDF pages to images then uses GPT-4o or Claude to extract clean markdown without training.

Script Depot · 247 · Skills

Kreuzberg — Polyglot Document Intelligence Framework with a Rust Core

An open-source document extraction framework that pulls text, metadata, images, and structured data from PDFs, Office files, images, and 97+ formats, with bindings for 11 programming languages.

Script Depot · 231 · Skills

OpenDataLoader PDF — AI-Ready Document Parser

An open-source PDF parser that automates document accessibility and extracts structured, AI-ready data including tables, text, bounding boxes, and tagged content.

AI Open Source · 158 · Skills

DeepSeek-OCR — High-Accuracy Optical Context Compression

An OCR model and toolkit from DeepSeek AI that extracts text from images and documents with high accuracy, designed for feeding structured content into LLM pipelines.

AI Open Source · 32 · Configs

LiteParse — Fast Open-Source Document Parser in Rust

A fast, helpful, and open-source document parser by LlamaIndex that extracts structured text from PDFs and other documents with high speed and accuracy for RAG and AI pipelines.

Script Depot · 31 · Scripts

Surya — Document OCR for 90+ Languages

Surya is a document OCR toolkit with 19.5K+ GitHub stars. It offers text recognition in 90+ languages, layout analysis, table detection, reading-order detection, and LaTeX OCR, and it benchmarks favorably against cloud OCR services.

Script Depot · 447 · Skills

Paperless-ngx — Self-Hosted Document Management with OCR

Paperless-ngx is an open-source document management system that scans, OCRs, indexes, and archives all your physical and digital documents for full-text search.

Script Depot · 320 · Skills

RAGFlow — Deep Document Understanding RAG Engine

Open-source RAG engine with deep document understanding. Parses complex PDFs, tables, images. Agent-powered Q&A with citations. Multi-model. 77K+ stars.

Script Depot · 315 · Skills

MarkItDown — Convert Any File to Markdown for LLMs

Python library by Microsoft that converts PDF, DOCX, PPTX, XLSX, images, audio, and HTML to clean Markdown. Perfect for feeding documents into LLM context windows. 8,000+ stars.

Microsoft AI · 301 · Skills

Kotaemon — Open-Source RAG Document Chat

Clean, open-source RAG tool for chatting with your documents. Supports PDF, DOCX, web pages. Multi-model, citation, and multi-user. Self-hostable. 25K+ stars.

Script Depot · 294 · Skills

Stirling PDF — Self-Hosted PDF Editor & Toolkit

Stirling PDF is the #1 open-source PDF tool on GitHub. Merge, split, convert, compress, OCR, sign, and edit PDFs — all self-hosted with no data leaving your server.

Script Depot · 289 · Skills

Documenso — Open Source Document Signing Platform

Documenso is an open-source DocuSign alternative for self-hosted document signing, with PDF e-signatures, audit trails, and a Next.js stack.

AI Open Source · 285 · Skills

MarkItDown — Convert Any Document to Markdown

Microsoft's Python tool to convert Office docs, PDFs, images, audio, and HTML to clean Markdown for LLM pipelines. Also available as MCP server.

Microsoft AI · 281 · MCP Configs · Scripts

Docling — Document Parsing for AI

IBM document parsing library. Converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON. Built for RAG pipelines and LLM ingestion.

Script Depot · 233 · Skills · CLI Tools

Tesseract OCR — Open Source Text Recognition Engine for 100+ Languages

Tesseract is an open-source OCR engine, originally developed at HP and later sponsored by Google, supporting over 100 languages. It converts images and scanned documents into machine-readable text with high accuracy across multiple output formats.

Script Depot · 221 · Skills

BentoPDF — Privacy-First Self-Hosted PDF Toolkit

BentoPDF is a self-hosted web application that provides a comprehensive set of PDF tools including merging, splitting, converting, and OCR without sending files to external services.

AI Open Source · 211 · Skills

Gotenberg — API-Driven Document Conversion and PDF Generation Server

Docker-powered API server for converting HTML, Markdown, Office documents, and URLs into PDFs using Chromium and LibreOffice.

Script Depot · 186 · Skills

Pandoc — Universal Document Format Converter

Pandoc is a universal document converter that reads and writes dozens of markup formats. It converts between Markdown, LaTeX, HTML, DOCX, EPUB, PDF, and many more with a single command.

Script Depot · 176 · Skills

Claude Office Skills — Docs/PDF/Sheets Skill Set

A curated repo of office-focused skills (docs, PDF, spreadsheets) and an Office MCP server; copy skills into Claude Code to standardize document workflows.

Skill Factory · 176 · Skills

KOReader — Document Viewer for E-Ink Devices and Beyond

KOReader is a free, open-source document viewer optimized for e-ink readers like Kindle, Kobo, and PocketBook. It supports PDF, EPUB, DJVU, and many other formats with fine-grained rendering controls.

AI Open Source · 174 · Skills

PaddleOCR — AI-Powered OCR Toolkit for 100+ Languages

A lightweight, production-ready OCR system supporting 100+ languages. Bridges documents and images to structured data for LLM pipelines.

Script Depot · 149 · Skills

Nougat — Neural Optical Understanding for Academic Documents

Nougat is a visual transformer model from Meta that converts academic PDF pages into structured Markdown, accurately preserving mathematical equations, tables, and text formatting.

AI Open Source · 106 · Skills

React PDF — Display PDF Documents in React Applications

A React component library for rendering PDF files in the browser using Mozilla pdf.js, with support for pagination, zoom, text selection, and annotations.

AI Open Source · 51 · Configs

jsPDF — Generate PDF Documents in JavaScript

A client-side JavaScript library for generating PDF documents programmatically in the browser and Node.js.

AI Open Source · 50 · Configs

Grimmory — Self-Hosted eBook and Comics Library Server

Grimmory is a self-hosted digital library server for managing and reading eBooks, comics, and documents. It supports EPUB, PDF, CBR, CBZ, and MOBI formats with metadata management, OPDS feeds, and a responsive web reader.

AI Open Source · 49 · Configs

pdfmake — Client-Server PDF Generation for JavaScript

Create complex PDF documents in the browser or Node.js using a declarative document-definition object.

Script Depot · 47 · Scripts

Chandra — OCR Model for Complex Tables, Forms, and Handwriting

High-accuracy OCR model that handles structured documents with complex tables, nested forms, and handwritten annotations while preserving full layout fidelity.

Script Depot · 33 · Scripts

AI Document Intelligence

AI document processing has leapfrogged traditional OCR. Modern tools don't just recognize characters; they understand document layout, hierarchy, tables, and semantic structure.

OCR & Text Extraction: Surya delivers state-of-the-art multilingual OCR with layout detection. Marker converts PDFs to clean Markdown while preserving structure. MinerU handles complex scientific papers with equations and diagrams.

Document ETL: DocETL and Unstructured build production pipelines that ingest PDFs, Word documents, scanned images, and HTML and emit normalized, chunked output ready for RAG or database storage.

Translation & Accessibility: PDFMathTranslate preserves mathematical notation while translating academic papers across 100+ languages.
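The "normalized, chunked output" that Document ETL tools emit can be illustrated with a minimal sketch. It assumes a document has already been converted to Markdown text and greedily packs paragraphs into chunks. The `Chunk` type, the `chunk_markdown` function, and the `max_chars` limit are hypothetical names chosen for illustration; they are not the API of DocETL or Unstructured.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # originating file name (hypothetical metadata field)
    index: int   # position of the chunk within the document
    text: str


def chunk_markdown(text: str, source: str, max_chars: int = 500) -> list[Chunk]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Chunks stay within max_chars unless a single paragraph is longer,
    in which case that paragraph becomes a chunk on its own.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[Chunk] = []
    buffer = ""
    for para in paragraphs:
        candidate = f"{buffer}\n\n{para}" if buffer else para
        if len(candidate) <= max_chars or not buffer:
            # Keep packing paragraphs into the current chunk.
            buffer = candidate
        else:
            # Current chunk is full: emit it and start a new one.
            chunks.append(Chunk(source, len(chunks), buffer))
            buffer = para
    if buffer:
        chunks.append(Chunk(source, len(chunks), buffer))
    return chunks
```

Splitting on paragraph boundaries rather than at fixed character offsets keeps headings and sentences intact, which tends to matter for retrieval quality; production tools typically add overlap and richer metadata on top of this.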

Knowledge Extraction: RAGFlow and Kotaemon combine document parsing with retrieval, letting you ask natural-language questions over your document collection and get answers with source citations. MarkItDown converts any Office format to Markdown for AI processing.
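To make "retrieval with source citations" concrete, here is a deliberately simple sketch that ranks text chunks by term overlap with a question and returns the citation key of each match. It is an assumption-laden toy: RAGFlow and Kotaemon use far more capable retrieval than this, and the `retrieve` function and the `file#page` citation keys are hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question: str, chunks: dict[str, str], top_k: int = 2) -> list[tuple[str, int]]:
    """Rank chunks by occurrences of question terms; return (citation, score) pairs."""
    terms = set(tokenize(question))
    scored = []
    for citation, text in chunks.items():
        counts = Counter(tokenize(text))
        score = sum(counts[t] for t in terms)
        if score > 0:
            scored.append((citation, score))
    # Highest score first; ties are broken alphabetically by citation key.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]
```

Because every chunk carries its citation key through ranking, whatever answer is generated downstream can point back to the exact file and page it came from.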

The world's knowledge is trapped in PDFs — AI document tools are the key that unlocks it.

Frequently Asked Questions

What is the best AI tool for extracting text from PDFs?

For general PDFs: Marker converts to clean Markdown with excellent layout preservation. For scanned documents: Surya OCR handles 90+ languages with superior accuracy on complex layouts. For scientific papers: MinerU specializes in equations, tables, and figure extraction. For production pipelines: Unstructured and DocETL provide end-to-end document processing with chunking and metadata extraction.

Can AI extract tables from PDFs accurately?

Yes. Modern tools like Surya, Marker, and MinerU use vision models that understand table structure (headers, merged cells, spanning rows), not just grid lines. Accuracy can exceed 95% on well-formatted tables. For complex or inconsistent tables, combining multiple tools (OCR, layout detection, and LLM post-processing) produces the best results.
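As an illustration of what "understanding merged cells and spanning rows" means in practice, the sketch below takes cells that a layout model might emit (each with a position and span) and expands them into a rectangular grid before rendering Markdown. The `Cell` structure and both functions are hypothetical; they do not reflect the output format of Surya, Marker, or MinerU.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    row: int
    col: int
    text: str
    row_span: int = 1
    col_span: int = 1


def cells_to_grid(cells: list[Cell]) -> list[list[str]]:
    """Expand spanning cells so every grid position holds a value."""
    n_rows = max(c.row + c.row_span for c in cells)
    n_cols = max(c.col + c.col_span for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        # Repeat the cell text across every row and column it spans.
        for r in range(c.row, c.row + c.row_span):
            for k in range(c.col, c.col + c.col_span):
                grid[r][k] = c.text
    return grid


def grid_to_markdown(grid: list[list[str]]) -> str:
    """Render the grid as a Markdown table, treating the first row as the header."""
    header, *body = grid
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

Repeating the text of a merged cell in each position it covers is one common normalization choice; it loses the visual merge but keeps every row self-describing, which is usually what an LLM post-processing step needs.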

How do I process thousands of documents with AI?

Use pipeline tools like DocETL or Unstructured that handle batching, parallel processing, and error recovery. They normalize different formats (PDF, DOCX, images, HTML) into a single output format, extract metadata, chunk content for RAG, and store results in your database or vector store. TokRepo hosts pre-configured pipeline configs for common document processing workflows.
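The batching, parallel processing, and error recovery described above can be sketched with the Python standard library alone. This is a minimal sketch, not how DocETL or Unstructured work internally: `process_batch` is a hypothetical helper, and `parse` stands in for whatever parser you plug in (any callable that takes a path and returns extracted content).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable


def process_batch(
    paths: list[str],
    parse: Callable[[str], Any],
    max_workers: int = 4,
    retries: int = 2,
) -> tuple[dict[str, Any], dict[str, str]]:
    """Parse documents in parallel, retrying failures and collecting errors."""

    def attempt(path: str) -> Any:
        last_error: Exception | None = None
        for _ in range(retries + 1):
            try:
                return parse(path)
            except Exception as exc:  # broad on purpose: parsers fail in many ways
                last_error = exc
        raise last_error

    results: dict[str, Any] = {}
    errors: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(attempt, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # One bad document is recorded and does not abort the batch.
                errors[path] = str(exc)
    return results, errors
```

A thread pool suits parsers that mostly wait on I/O or remote APIs; for CPU-bound local OCR, swapping in `ProcessPoolExecutor` is the usual alternative. Returning the errors alongside the results makes it easy to re-queue only the failed files.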
