Document Processing

Best AI Document Processing Tools of 2026

OCR engines, PDF parsers, document understanding, and data extraction pipelines. Turn unstructured documents into structured, searchable data.

30 tools

Docling — AI Document Parsing by IBM

Parse PDFs, DOCX, PPTX, and images into structured markdown or JSON. IBM's open-source document AI with OCR, table extraction, and figure understanding.

Script Depot | 6 | Scripts

Unstructured — Document ETL for LLM Pipelines

Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.

MCP Hub | 20 | MCP Configs

MinerU — Extract LLM-Ready Data from Any Document

Convert PDFs, scans, and complex documents into clean Markdown or JSON for RAG and LLM pipelines. 57K+ GitHub stars.

Script Depot | 20 | Scripts

Claude Official Skill: PDF — Read, Create & Edit PDFs

Claude Code skill for PDF files. Read content, extract data, create new PDFs, merge documents, and convert formats. Activates automatically.

Skill Factory | 19 | Skills

Zerox — Zero-Shot PDF OCR for AI Pipelines

Extract text from any PDF using vision models as OCR. Zerox converts PDF pages to images then uses GPT-4o or Claude to extract clean markdown without training.

Script Depot | 5 | Scripts

Surya — Document OCR for 90+ Languages

Surya is a document OCR toolkit with 19.5K+ GitHub stars. Text recognition in 90+ languages, layout analysis, table detection, reading order, and LaTeX OCR. Benchmarks favorably against cloud OCR serv

Script Depot | 48 | Scripts

Marker — Convert PDF to Markdown for AI Tools

High-accuracy PDF to Markdown converter optimized for AI pipelines. Marker handles tables, equations, code blocks, and multi-column layouts with deep learning OCR.

Script Depot | 31 | Scripts

Cursor Rules MDC Generator — Auto-Generate from Docs

Auto-generate Cursor .mdc rule files for any library using Exa semantic search and LLM-powered documentation extraction.

AI Open Source | 21 | Configs

Kotaemon — Open-Source RAG Document Chat

Clean, open-source RAG tool for chatting with your documents. Supports PDF, DOCX, web pages. Multi-model, citation, and multi-user. Self-hostable. 25K+ stars.

Script Depot | 20 | Scripts

MarkItDown — Convert Any File to Markdown for LLMs

Python library by Microsoft that converts PDF, DOCX, PPTX, XLSX, images, audio, and HTML to clean Markdown. Perfect for feeding documents into LLM context windows. 8,000+ stars.

AI Open Source | 19 | Configs

RAGFlow — Deep Document Understanding RAG Engine

Open-source RAG engine with deep document understanding. Parses complex PDFs, tables, images. Agent-powered Q&A with citations. Multi-model. 77K+ stars.

Script Depot | 19 | Scripts

Claude Official Skill: canvas-design

Create beautiful visual art in .png and .pdf documents using design philosophy. You should use this skill when the user asks to create a poster, piece of art, design, or other s...

Skill Factory | 18 | Skills

Documenso — Open Source Document Signing Platform

Documenso is an open-source DocuSign alternative for self-hosted document signing with PDF e-signatures, audit trails, and Next.js stack.

AI Open Source | 16 | Configs

Docling — Document Parsing for AI

IBM document parsing library. Converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON. Built for RAG pipelines and LLM ingestion.

Script Depot | 13 | Scripts

LlamaIndex — Data Framework for LLM Applications

Connect your data to large language models. The leading framework for RAG, document indexing, knowledge graphs, and structured data extraction.

Script Depot | 9 | Scripts

Stirling PDF — Self-Hosted PDF Editor & Toolkit

Stirling PDF is the #1 open-source PDF tool on GitHub. Merge, split, convert, compress, OCR, sign, and edit PDFs — all self-hosted with no data leaving your server.

Script Depot | 9 | Scripts

Docling — Document Parsing for AI Pipelines

Parse PDF, DOCX, PPTX, HTML, and images into clean Markdown or JSON for LLM ingestion. Handles tables, figures, equations, and complex layouts. By IBM Research. 18,000+ stars.

Script Depot | 8 | Scripts

MarkItDown — Convert Any Document to Markdown

Microsoft's Python tool to convert Office docs, PDFs, images, audio, and HTML to clean Markdown for LLM pipelines. Also available as MCP server.

MCP Hub | 8 | MCP Configs, Scripts

Paperless-ngx — Self-Hosted Document Management with OCR

Paperless-ngx is an open-source document management system that scans, OCRs, indexes, and archives all your physical and digital documents for full-text search.

Script Depot | Scripts

Lark CLI Skill: Wiki — Knowledge Base Management

Lark/Feishu CLI skill for knowledge base. Create and manage knowledge spaces, organize document nodes and shortcuts.

TokRepo Picks | 64 | Skills

Claude SEO — Complete SEO Skill for Claude Code

Universal SEO analysis skill with 15 sub-skills and 12 parallel subagents. Covers technical SEO, E-E-A-T, schema markup, GEO/AEO, local SEO, Google APIs, and PDF reporting. MIT license, 4,000+ stars.

Skill Factory | 44 | Skills

OpenDeepWiki — Turn Any Repo into AI Documentation

Self-hosted tool that converts GitHub, GitLab, and Gitea repositories into AI-powered knowledge bases with Mermaid diagrams and conversational AI. MIT license, 3,000+ stars.

Script Depot | 39 | Scripts

Awesome Claude Skills — 50+ Verified Agent Skills

Curated collection of 50+ verified Claude skills across 11 categories: document processing, testing, debugging, security, media creation, data analysis, and meta skills. Community-driven, MIT license.

Prompt Lab | 34 | Prompts

GitHub Copilot — Official Customization Collection

Official GitHub Copilot customization: agents, skills, instructions, plugins, hooks, and agentic workflows. Plus documentation.

Skill Factory | 31 | Skills

Jina Reader — Convert Any URL to LLM-Ready Text

Convert any URL to clean, LLM-friendly markdown with a simple prefix. Just prepend r.jina.ai/ to any URL. Handles JS-rendered pages, PDFs, and images. 10K+ stars.

Script Depot | 31 | Scripts
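
The URL-prefix approach described for Jina Reader can be sketched in a few lines of Python. This is a minimal illustration, not an official client: the helper name `to_reader_url` is hypothetical, and the `https://` scheme on the prefix is an assumption, since the listing only says to prepend `r.jina.ai/`. The function builds the reader URL and makes no network request.

```python
from urllib.parse import urlsplit

# Assumption: the reader endpoint is served over HTTPS.
READER_PREFIX = "https://r.jina.ai/"


def to_reader_url(url: str) -> str:
    """Build a reader URL by prepending the prefix to an absolute http(s) URL."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {url!r}")
    return READER_PREFIX + url
```

Fetching the resulting URL with any HTTP client would then return the page content as text. Validating the input first keeps relative paths and bare hostnames from producing a malformed request.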

Reactive Resume — AI-Powered Open-Source Resume Builder

Free open-source resume builder with AI integration. Supports Claude, GPT, Gemini for content generation. Drag-and-drop, PDF export, self-hostable, privacy-first. MIT, 36,000+ stars.

AI Open Source | 30 | Scripts

DocETL — LLM-Powered Document Processing Pipelines

Declarative YAML pipelines for LLM document analysis with map, reduce, and resolve operators. By UC Berkeley. 3.7K+ stars.

AI Open Source | 27 | Knowledge
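
To make the map, resolve, and reduce operators named in the DocETL entry concrete, here is a plain-Python sketch of the same three stages. It does not use DocETL or its YAML syntax. The sample invoices, the function names, and the string-based extraction are all hypothetical stand-ins for what would be LLM calls in a real pipeline.

```python
from collections import defaultdict

# Hypothetical input: one record per document.
docs = [
    {"id": 1, "text": "Invoice from ACME Corp. Total: 120"},
    {"id": 2, "text": "Invoice from Acme Corporation. Total: 80"},
    {"id": 3, "text": "Invoice from Globex. Total: 50"},
]

PREFIX = "Invoice from "


def map_extract(doc):
    """Map: derive structured fields from one document (stand-in for an LLM call)."""
    vendor_part, total_part = doc["text"].split(". Total: ")
    return {"id": doc["id"], "vendor": vendor_part[len(PREFIX):], "total": int(total_part)}


def resolve_vendor(name):
    """Resolve: collapse spelling variants of one entity to a canonical key."""
    key = name.lower()
    for suffix in (" corporation", " corp"):
        if key.endswith(suffix):
            key = key[: -len(suffix)]
    return key


def reduce_totals(records):
    """Reduce: group records by the resolved key and sum each group."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["vendor"]] += rec["total"]
    return dict(totals)


def run_pipeline(documents):
    """Run the three stages in order: map, then resolve, then reduce."""
    mapped = [map_extract(d) for d in documents]
    resolved = [{**r, "vendor": resolve_vendor(r["vendor"])} for r in mapped]
    return reduce_totals(resolved)
```

The resolve stage runs before the reduce stage so that "ACME Corp" and "Acme Corporation" land in the same group rather than being totaled separately.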

Claude Code Agent: API Architect — Design REST & GraphQL APIs

Claude Code agent for API design. REST endpoints, GraphQL schemas, authentication, rate limiting, versioning, and documentation.

Skill Factory | 27 | Skills

Anthropic Claude Official Skills — All 17 Skills Collection

All 17 official Agent Skills by Anthropic for Claude Code: document generation, dev tools, creative design, and enterprise workflows.

Skill Factory | 25 | Skills

Docusaurus — Documentation Sites Made Easy

Build fast, SEO-friendly documentation websites with React and Markdown. By Meta. Powers 10K+ sites. 64K+ GitHub stars.

AI Open Source | 24 | Knowledge

AI Document Intelligence

AI document processing has leapfrogged traditional OCR. Modern tools don't just recognize characters; they understand document layout, hierarchy, tables, and semantic structure.

OCR & Text Extraction: Surya delivers state-of-the-art multilingual OCR with layout detection. Marker converts PDFs to clean Markdown while preserving structure. MinerU handles complex scientific papers with equations and diagrams.

Document ETL: DocETL and Unstructured build production pipelines that ingest PDFs, Word docs, scanned images, and HTML into normalized, chunked output ready for RAG or database storage.

Translation & Accessibility: PDFMathTranslate preserves mathematical notation while translating academic papers across 100+ languages.

Knowledge Extraction: RAGFlow and Kotaemon combine document parsing with retrieval, letting you ask natural language questions over your document collection with source citations. MarkItDown converts any Office format to Markdown for AI processing.

The world's knowledge is trapped in PDFs — AI document tools are the key that unlocks it.

Frequently Asked Questions

What is the best AI tool for extracting text from PDFs?

For general PDFs: Marker converts to clean Markdown with excellent layout preservation. For scanned documents: Surya OCR handles 90+ languages with superior accuracy on complex layouts. For scientific papers: MinerU specializes in equations, tables, and figure extraction. For production pipelines: Unstructured and DocETL provide end-to-end document processing with chunking and metadata extraction.

Can AI extract tables from PDFs accurately?

Yes. Modern tools like Surya, Marker, and MinerU use vision models that understand table structure — headers, merged cells, spanning rows — not just grid lines. Accuracy exceeds 95% on well-formatted tables. For complex or inconsistent tables, combining multiple tools (OCR + layout detection + LLM post-processing) produces the best results.

How do I process thousands of documents with AI?

Use pipeline tools like DocETL or Unstructured that handle batching, parallel processing, and error recovery. They normalize different formats (PDF, DOCX, images, HTML) into a single output format, extract metadata, chunk content for RAG, and store results in your database or vector store. TokRepo hosts pre-configured pipeline configs for common document processing workflows.
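
The batching, parallel processing, error recovery, and chunking described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not how DocETL or Unstructured work internally: `fake_parse` is a hypothetical stub standing in for a real parser, and the chunk size, overlap, retry count, and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def chunk_text(text, size=200, overlap=20):
    """Split text into fixed-size character chunks that overlap slightly."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def process_with_retry(doc_id, parse, retries=2):
    """Parse one document, retrying on failure; return a result record either way."""
    last_error = None
    for _ in range(retries + 1):
        try:
            return {"id": doc_id, "ok": True, "chunks": chunk_text(parse(doc_id))}
        except Exception as exc:  # record the failure instead of stopping the batch
            last_error = exc
    return {"id": doc_id, "ok": False, "error": str(last_error)}


def run_batch(doc_ids, parse, workers=4):
    """Process documents in parallel and split results into successes and failures."""
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_with_retry, d, parse) for d in doc_ids]
        for fut in as_completed(futures):
            rec = fut.result()
            (done if rec["ok"] else failed).append(rec)
    return done, failed


def fake_parse(doc_id):
    """Hypothetical parser stub; a real one would call a document parsing tool."""
    if doc_id == "bad.pdf":
        raise ValueError("unreadable file")
    return f"text of {doc_id} " * 30
```

Calling `run_batch(["a.pdf", "bad.pdf", "b.docx"], fake_parse)` returns the two parsed documents in `done` and the unreadable one in `failed`, so a single bad file does not abort the run. Results arrive in completion order, so sort by `id` if order matters.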
