Best AI Tools for Document Processing (2026)
OCR engines, PDF parsers, document understanding, and data extraction pipelines. Turn unstructured documents into structured, searchable data.
MinerU — Extract LLM-Ready Data from Any Document
Convert PDFs, scans, and complex documents into clean Markdown or JSON for RAG and LLM pipelines. 57K+ GitHub stars.
Unstructured — Document ETL for LLM Pipelines
Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.
Claude Official Skill: PDF — Read, Create & Edit PDFs
Claude Code skill for PDF files. Read content, extract data, create new PDFs, merge documents, and convert formats. Activates automatically.
Zerox — Zero-Shot PDF OCR for AI Pipelines
Extract text from any PDF using vision models as OCR. Zerox converts PDF pages to images then uses GPT-4o or Claude to extract clean markdown without training.
Surya — Document OCR for 90+ Languages
Surya is a document OCR toolkit with 19.5K+ GitHub stars. Text recognition in 90+ languages, layout analysis, table detection, reading order, and LaTeX OCR. Benchmarks favorably against cloud OCR services.
Documenso — Open Source Document Signing Platform
Documenso is an open-source DocuSign alternative for self-hosted document signing with PDF e-signatures, audit trails, and Next.js stack.
RAGFlow — Deep Document Understanding RAG Engine
Open-source RAG engine with deep document understanding. Parses complex PDFs, tables, images. Agent-powered Q&A with citations. Multi-model. 77K+ stars.
MarkItDown — Convert Any File to Markdown for LLMs
Python library by Microsoft that converts PDF, DOCX, PPTX, XLSX, images, audio, and HTML to clean Markdown. Perfect for feeding documents into LLM context windows. 8,000+ stars.
Kotaemon — Open-Source RAG Document Chat
Clean, open-source RAG tool for chatting with your documents. Supports PDF, DOCX, web pages. Multi-model, citation, and multi-user. Self-hostable. 25K+ stars.
Cursor Rules MDC Generator — Auto-Generate from Docs
Auto-generate Cursor .mdc rule files for any library using Exa semantic search and LLM-powered documentation extraction.
Stirling PDF — Self-Hosted PDF Editor & Toolkit
Stirling PDF is the #1 open-source PDF tool on GitHub. Merge, split, convert, compress, OCR, sign, and edit PDFs — all self-hosted with no data leaving your server.
Paperless-ngx — Self-Hosted Document Management with OCR
Paperless-ngx is an open-source document management system that scans, OCRs, indexes, and archives all your physical and digital documents for full-text search.
LlamaIndex — Data Framework for LLM Applications
Connect your data to large language models. The leading framework for RAG, document indexing, knowledge graphs, and structured data extraction.
Claude Official Skill: canvas-design
Create beautiful visual art in .png and .pdf documents using design philosophy. You should use this skill when the user asks to create a poster, piece of art, design, or other s...
Docling — Document Parsing for AI
IBM document parsing library. Converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON. Built for RAG pipelines and LLM ingestion.
MarkItDown — Convert Any Document to Markdown
Microsoft's Python tool to convert Office docs, PDFs, images, audio, and HTML to clean Markdown for LLM pipelines. Also available as MCP server.
BentoPDF — Privacy-First Self-Hosted PDF Toolkit
BentoPDF is a self-hosted web application that provides a comprehensive set of PDF tools including merging, splitting, converting, and OCR without sending files to external services.
Gotenberg — API-Driven Document Conversion and PDF Generation Server
Docker-powered API server for converting HTML, Markdown, Office documents, and URLs into PDFs using Chromium and LibreOffice.
Claude SEO — Complete SEO Skill for Claude Code
Universal SEO analysis skill with 15 sub-skills and 12 parallel subagents. Covers technical SEO, E-E-A-T, schema markup, GEO/AEO, local SEO, Google APIs, and PDF reporting. MIT license, 4,000+ stars.
Lark CLI Skill: Wiki — Knowledge Base Management
Lark/Feishu CLI skill for knowledge base. Create and manage knowledge spaces, organize document nodes and shortcuts.
Docmost — Open Source Collaborative Wiki & Documentation
Docmost is an open-source Confluence and Notion alternative for team wikis and documentation, featuring real-time collaboration, rich editor, and permission management.
Awesome Claude Skills — 50+ Verified Agent Skills
Curated collection of 50+ verified Claude skills across 11 categories: document processing, testing, debugging, security, media creation, data analysis, and meta skills. Community-driven, MIT license.
Outline — Fast Knowledge Base for Growing Teams
Outline is a beautiful, real-time collaborative knowledge base and wiki. Markdown editor, nested documents, integrations with Slack and Figma, and full-text search.
Reactive Resume — AI-Powered Open-Source Resume Builder
Free open-source resume builder with AI integration. Supports Claude, GPT, Gemini for content generation. Drag-and-drop, PDF export, self-hostable, privacy-first. MIT, 36,000+ stars.
DocETL — LLM-Powered Document Processing Pipelines
Declarative YAML pipelines for LLM document analysis with map, reduce, and resolve operators. By UC Berkeley. 3.7K+ stars.
OpenDeepWiki — Turn Any Repo into AI Documentation
Self-hosted tool that converts GitHub, GitLab, and Gitea repositories into AI-powered knowledge bases with Mermaid diagrams and conversational AI. MIT license, 3,000+ stars.
Docusaurus — Documentation Sites Made Easy
Build fast, SEO-friendly documentation websites with React and Markdown. By Meta. Powers 10K+ sites. 64K+ GitHub stars.
Linkwarden — Self-Hosted Collaborative Bookmark Manager
Linkwarden is an open-source bookmark manager that saves, organizes, and preserves web pages with full-page screenshots, PDF snapshots, and collaborative collections.
mdBook — Create Books from Markdown Like Gitbook in Rust
mdBook creates online books from Markdown files, similar to Gitbook but implemented in Rust. Used for the official Rust Book, Cargo Book, Tokio Tutorial, and many open-source documentation sites. Fast builds and a clean default theme.
VHS — Record Terminal Sessions as GIFs and Videos
VHS by Charmbracelet lets you write terminal recordings as code. Define commands in a .tape file, and VHS generates beautiful GIFs, MP4s, or WebMs — perfect for documentation, README demos, and project showcases.
AI Document Intelligence
AI document processing has moved well beyond traditional OCR. Modern tools do more than recognize characters: they also model document layout, hierarchy, tables, and semantic structure.
OCR & Text Extraction: Surya delivers state-of-the-art multilingual OCR with layout detection. Marker converts PDFs to clean Markdown while preserving structure. MinerU handles complex scientific papers with equations and diagrams.
Document ETL: DocETL and Unstructured build production pipelines that ingest PDFs, Word docs, scanned images, and HTML into normalized, chunked output ready for RAG or database storage.
Translation & Accessibility: PDFMathTranslate preserves mathematical notation while translating academic papers across 100+ languages.
Knowledge Extraction: RAGFlow and Kotaemon combine document parsing with retrieval, letting you ask natural-language questions over your document collection with source citations. MarkItDown converts any Office format to Markdown for AI processing.
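To make the "normalized, chunked output ready for RAG" idea from the Document ETL step more concrete, here is a minimal sketch in plain Python. It does not use the API of DocETL, Unstructured, or any other tool listed here. The `Chunk` record, the `chunk_document` function, and the `max_chars` parameter are hypothetical names chosen for illustration, and real tools typically chunk by tokens or by document elements rather than by raw character count.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # file the text came from (hypothetical field)
    index: int   # position of the chunk within that file
    text: str


def chunk_document(source: str, text: str, max_chars: int = 500) -> list[Chunk]:
    """Pack paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own
    chunk rather than being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[Chunk] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # The next paragraph would overflow: close the current chunk.
            chunks.append(Chunk(source, len(chunks), current))
            current = para
        else:
            current = candidate
    if current:
        chunks.append(Chunk(source, len(chunks), current))
    return chunks


# Illustrative input; in practice this text would come from a parser's output.
sample = "Invoice 1042\n\nBilled to: Acme Corp\n\nTotal due: 300 USD"
chunks = chunk_document("invoice.pdf", sample, max_chars=40)
for c in chunks:
    print(c.index, repr(c.text))
```

Keeping the source file and chunk index alongside each piece of text is what later lets a retrieval tool cite where an answer came from.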
The world's knowledge is trapped in PDFs — AI document tools are the key that unlocks it.
Frequently Asked Questions
What is the best AI tool for extracting text from PDFs?
For general PDFs: Marker converts to clean Markdown with excellent layout preservation. For scanned documents: Surya OCR handles 90+ languages with superior accuracy on complex layouts. For scientific papers: MinerU specializes in equations, tables, and figure extraction. For production pipelines: Unstructured and DocETL provide end-to-end document processing with chunking and metadata extraction.
Can AI extract tables from PDFs accurately?
Yes. Modern tools like Surya, Marker, and MinerU use vision models that understand table structure — headers, merged cells, spanning rows — not just grid lines. Accuracy exceeds 95% on well-formatted tables. For complex or inconsistent tables, combining multiple tools (OCR + layout detection + LLM post-processing) produces the best results.
How do I process thousands of documents with AI?
Use pipeline tools like DocETL or Unstructured that handle batching, parallel processing, and error recovery. They normalize different formats (PDF, DOCX, images, HTML) into a single output format, extract metadata, chunk content for RAG, and store results in your database or vector store. TokRepo hosts pre-configured pipeline configs for common document processing workflows.
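The batching, parallel processing, and error recovery described above can be sketched with the Python standard library alone. This is an illustration of the pattern, not how DocETL or Unstructured implement it. `process_batch`, `fake_convert`, and the file names are hypothetical, and `convert` stands in for whatever parser call you actually use.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_batch(paths, convert, max_workers=4, retries=2):
    """Run convert(path) over many paths in parallel.

    Returns (results, failures): results maps path -> converted output,
    failures maps path -> the last error message once retries run out.
    """
    if retries < 0:
        raise ValueError("retries must be >= 0")

    def attempt(path):
        last_error = None
        for _ in range(retries + 1):
            try:
                return convert(path)
            except Exception as exc:  # retry, then report the last error
                last_error = exc
        raise last_error

    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(attempt, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # One bad file is recorded and does not stop the batch.
                failures[path] = str(exc)
    return results, failures


def fake_convert(path):
    """Stand-in for a real parser: fails on files ending in .bad."""
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")
    return f"# Markdown for {path}"


results, failures = process_batch(["a.pdf", "b.docx", "c.bad"], fake_convert)
print(sorted(results))  # ['a.pdf', 'b.docx']
print(failures)         # {'c.bad': 'cannot parse c.bad'}
```

Collecting failures in a separate mapping, rather than raising on the first error, is what makes a run over thousands of files resumable: the failed paths can be inspected and re-submitted on their own.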