Document Processing

Best AI Tools for Document Processing (2026)

OCR engines, PDF parsers, document understanding, and data extraction pipelines. Turn unstructured documents into structured, searchable data.

30 tools

MinerU — Extract LLM-Ready Data from Any Document

Convert PDFs, scans, and complex documents into clean Markdown or JSON for RAG and LLM pipelines. 57K+ GitHub stars.

Script Depot 73 Scripts

Unstructured — Document ETL for LLM Pipelines

Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.

MCP Hub 61 MCP Configs

Claude Official Skill: PDF — Read, Create & Edit PDFs

Claude Code skill for PDF files. Read content, extract data, create new PDFs, merge documents, and convert formats. Activates automatically.

Skill Factory 52 Skills

Zerox — Zero-Shot PDF OCR for AI Pipelines

Extract text from any PDF using vision models as OCR. Zerox converts PDF pages to images then uses GPT-4o or Claude to extract clean markdown without training.

Script Depot 40 Scripts

Surya — Document OCR for 90+ Languages

Surya is a document OCR toolkit with 19.5K+ GitHub stars. Text recognition in 90+ languages, layout analysis, table detection, reading order, and LaTeX OCR. Benchmarks favorably against cloud OCR services...

Script Depot 164 Scripts

Documenso — Open Source Document Signing Platform

Documenso is an open-source DocuSign alternative for self-hosted document signing with PDF e-signatures, audit trails, and Next.js stack.

AI Open Source 76 Configs

RAGFlow — Deep Document Understanding RAG Engine

Open-source RAG engine with deep document understanding. Parses complex PDFs, tables, images. Agent-powered Q&A with citations. Multi-model. 77K+ stars.

Script Depot 66 Scripts

MarkItDown — Convert Any File to Markdown for LLMs

Python library by Microsoft that converts PDF, DOCX, PPTX, XLSX, images, audio, and HTML to clean Markdown. Perfect for feeding documents into LLM context windows. 8,000+ stars.

AI Open Source 64 Configs

Kotaemon — Open-Source RAG Document Chat

Clean, open-source RAG tool for chatting with your documents. Supports PDF, DOCX, web pages. Multi-model, citation, and multi-user. Self-hostable. 25K+ stars.

Script Depot 64 Scripts

Cursor Rules MDC Generator — Auto-Generate from Docs

Auto-generate Cursor .mdc rule files for any library using Exa semantic search and LLM-powered documentation extraction.

AI Open Source 63 Configs

Stirling PDF — Self-Hosted PDF Editor & Toolkit

Stirling PDF is the #1 open-source PDF tool on GitHub. Merge, split, convert, compress, OCR, sign, and edit PDFs — all self-hosted with no data leaving your server.

Script Depot 60 Scripts

Paperless-ngx — Self-Hosted Document Management with OCR

Paperless-ngx is an open-source document management system that scans, OCRs, indexes, and archives all your physical and digital documents for full-text search.

Script Depot 49 Scripts

LlamaIndex — Data Framework for LLM Applications

Connect your data to large language models. The leading framework for RAG, document indexing, knowledge graphs, and structured data extraction.

Script Depot 48 Scripts

Claude Official Skill: canvas-design

Create beautiful visual art in .png and .pdf documents using design philosophy. You should use this skill when the user asks to create a poster, piece of art, design, or other s...

Skill Factory 48 Skills

Docling — Document Parsing for AI

IBM document parsing library. Converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON. Built for RAG pipelines and LLM ingestion.

Script Depot 46 CLI Tools

MarkItDown — Convert Any Document to Markdown

Microsoft's Python tool to convert Office docs, PDFs, images, audio, and HTML to clean Markdown for LLM pipelines. Also available as MCP server.

MCP Hub 45 MCP Configs, Scripts

BentoPDF — Privacy-First Self-Hosted PDF Toolkit

BentoPDF is a self-hosted web application that provides a comprehensive set of PDF tools including merging, splitting, converting, and OCR without sending files to external services.

AI Open Source 24 Configs

Gotenberg — API-Driven Document Conversion and PDF Generation Server

Docker-powered API server for converting HTML, Markdown, Office documents, and URLs into PDFs using Chromium and LibreOffice.

Script Depot 15 Scripts

Claude SEO — Complete SEO Skill for Claude Code

Universal SEO analysis skill with 15 sub-skills and 12 parallel subagents. Covers technical SEO, E-E-A-T, schema markup, GEO/AEO, local SEO, Google APIs, and PDF reporting. MIT license, 4,000+ stars.

Skill Factory 108 Skills

Lark CLI Skill: Wiki — Knowledge Base Management

Lark/Feishu CLI skill for knowledge base. Create and manage knowledge spaces, organize document nodes and shortcuts.

TokRepo Curated 102 Skills

Docmost — Open Source Collaborative Wiki & Documentation

Docmost is an open-source Confluence and Notion alternative for team wikis and documentation, featuring real-time collaboration, rich editor, and permission management.

Script Depot 97 Scripts

Awesome Claude Skills — 50+ Verified Agent Skills

Curated collection of 50+ verified Claude skills across 11 categories: document processing, testing, debugging, security, media creation, data analysis, and meta skills. Community-driven, MIT license.

Prompt Lab 91 Prompts

Outline — Fast Knowledge Base for Growing Teams

Outline is a beautiful, real-time collaborative knowledge base and wiki. Markdown editor, nested documents, integrations with Slack and Figma, and full-text search.

Script Depot 82 Scripts

Reactive Resume — AI-Powered Open-Source Resume Builder

Free open-source resume builder with AI integration. Supports Claude, GPT, Gemini for content generation. Drag-and-drop, PDF export, self-hostable, privacy-first. MIT, 36,000+ stars.

AI Open Source 79 Scripts

DocETL — LLM-Powered Document Processing Pipelines

Declarative YAML pipelines for LLM document analysis with map, reduce, and resolve operators. By UC Berkeley. 3.7K+ stars.

AI Open Source 78 Knowledge

OpenDeepWiki — Turn Any Repo into AI Documentation

Self-hosted tool that converts GitHub, GitLab, and Gitea repositories into AI-powered knowledge bases with Mermaid diagrams and conversational AI. MIT license, 3,000+ stars.

Script Depot 77 Scripts

Docusaurus — Documentation Sites Made Easy

Build fast, SEO-friendly documentation websites with React and Markdown. By Meta. Powers 10K+ sites. 64K+ GitHub stars.

AI Open Source 72 Knowledge

Linkwarden — Self-Hosted Collaborative Bookmark Manager

Linkwarden is an open-source bookmark manager that saves, organizes, and preserves web pages with full-page screenshots, PDF snapshots, and collaborative collections.

Script Depot 72 Scripts

mdBook — Create Books from Markdown Like Gitbook in Rust

mdBook creates online books from Markdown files, similar to Gitbook but implemented in Rust. Used for the official Rust Book, Cargo Book, Tokio Tutorial, and many open-source documentation sites. Fast builds and a clean default theme.

Script Depot 71 Scripts

VHS — Record Terminal Sessions as GIFs and Videos

VHS by Charmbracelet lets you write terminal recordings as code. Define commands in a .tape file, and VHS generates beautiful GIFs, MP4s, or WebMs — perfect for documentation, README demos, and project showcases.

Script Depot 69 Scripts

AI Document Intelligence


AI document processing has leapfrogged traditional OCR. Modern tools don't just recognize characters; they understand document layout, hierarchy, tables, and semantic structure.

OCR & Text Extraction: Surya delivers state-of-the-art multilingual OCR with layout detection. Marker converts PDFs to clean Markdown while preserving structure. MinerU handles complex scientific papers with equations and diagrams.

Document ETL: DocETL and Unstructured build production pipelines that ingest PDFs, Word docs, scanned images, and HTML into normalized, chunked output ready for RAG or database storage.

Translation & Accessibility: PDFMathTranslate preserves mathematical notation while translating academic papers across 100+ languages.
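To make "normalized, chunked output" concrete, here is a minimal sketch in plain Python. It is not the API of DocETL or Unstructured; the function names `normalize` and `chunk_text` and the default sizes are assumptions chosen for illustration. It collapses whitespace, then cuts the text into fixed-size windows that overlap so a sentence split at a boundary still appears whole in one chunk.

```python
def normalize(text: str) -> str:
    """Collapse runs of whitespace (newlines, tabs, spaces) into single spaces."""
    return " ".join(text.split())


def chunk_text(text: str, max_chars: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of at most max_chars that overlap by `overlap` chars.

    Hypothetical helper: real pipelines often split on sentences, headings,
    or token counts instead of raw characters.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    chunks: list[str] = []
    step = max_chars - overlap  # how far the window advances each time
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

Character windows are the simplest possible choice. They ignore document structure, which is exactly what layout-aware parsers are meant to improve on.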

Knowledge Extraction: RAGFlow and Kotaemon combine document parsing with retrieval, letting you ask natural language questions over your document collection with source citations. MarkItDown converts any Office format to Markdown for AI processing.
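The idea of answering questions "with source citations" can be sketched with a toy retriever. This is not how RAGFlow or Kotaemon are implemented; production systems typically rank chunks with embedding-based search, while this sketch uses simple keyword overlap as a stand-in. The `Chunk` type and the `retrieve` and `cite` functions are hypothetical names. The point it shows is that each chunk carries its source and page, so any answer built from it can be traced back.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source: str  # file the text came from
    page: int    # page number within that file
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[Chunk], top_k: int = 2) -> list[Chunk]:
    """Return up to top_k chunks sharing the most words with the question."""
    query = tokenize(question)
    scored = []
    for chunk in chunks:
        score = len(query & tokenize(chunk.text))
        if score > 0:
            scored.append((score, chunk))
    # sort() is stable, so chunks with equal scores keep their input order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def cite(chunk: Chunk) -> str:
    """Format a human-readable citation for a retrieved chunk."""
    return f"{chunk.source}, p. {chunk.page}"
```

In a full pipeline the retrieved chunks would be passed to a language model along with the question, and the citations shown next to the generated answer.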

The world's knowledge is trapped in PDFs — AI document tools are the key that unlocks it.

Frequently asked questions

What is the best AI tool for extracting text from PDFs?

For general PDFs: Marker converts to clean Markdown with excellent layout preservation. For scanned documents: Surya OCR handles 90+ languages with superior accuracy on complex layouts. For scientific papers: MinerU specializes in equations, tables, and figure extraction. For production pipelines: Unstructured and DocETL provide end-to-end document processing with chunking and metadata extraction.

Can AI extract tables from PDFs accurately?

Yes. Modern tools like Surya, Marker, and MinerU use vision models that understand table structure — headers, merged cells, spanning rows — not just grid lines. Accuracy exceeds 95% on well-formatted tables. For complex or inconsistent tables, combining multiple tools (OCR + layout detection + LLM post-processing) produces the best results.
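One way to combine tools for difficult tables is a fallback chain: try a fast extractor first, check the result for structural problems, and move on to a slower or more capable one only when the check fails. The sketch below assumes each tool has been wrapped in a function that takes document bytes and returns a list of rows; those wrappers, and the names `is_well_formed` and `extract_with_fallback`, are hypothetical and not part of Surya, Marker, or MinerU.

```python
from typing import Callable, Optional

Table = list[list[str]]               # first row is the header
Extractor = Callable[[bytes], Table]  # hypothetical wrapper around a real tool


def is_well_formed(table: Table) -> bool:
    """Cheap structural check: a header, at least one data row, no ragged rows."""
    if len(table) < 2:
        return False
    width = len(table[0])
    if width == 0 or any(not cell.strip() for cell in table[0]):
        return False  # empty header row or blank header cell
    return all(len(row) == width for row in table)


def extract_with_fallback(
    doc: bytes, extractors: list[tuple[str, Extractor]]
) -> Optional[tuple[str, Table]]:
    """Try extractors in order; return (name, table) for the first valid result."""
    for name, extractor in extractors:
        try:
            table = extractor(doc)
        except Exception:
            continue  # a crashing tool should not stop the chain
        if is_well_formed(table):
            return name, table
    return None  # nothing passed the check; flag the document for review
```

The structural check here is deliberately shallow. It catches ragged rows and missing headers, not wrong cell values, so sampled manual review is still worthwhile.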

How do I process thousands of documents with AI?

Use pipeline tools like DocETL or Unstructured that handle batching, parallel processing, and error recovery. They normalize different formats (PDF, DOCX, images, HTML) into a single output format, extract metadata, chunk content for RAG, and store results in your database or vector store. TokRepo hosts pre-configured pipeline configs for common document processing workflows.
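The batching, parallel processing, and error recovery mentioned above can be pictured with a short standard-library sketch. It does not use DocETL or Unstructured; `process_batch` is a hypothetical helper, and `extract` stands for whatever per-document function you plug in. Each document is retried a few times, and a document that keeps failing is recorded instead of aborting the whole run.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def process_batch(
    paths: list[str],
    extract: Callable[[str], str],
    max_workers: int = 4,
    max_attempts: int = 3,
) -> tuple[dict[str, str], dict[str, str]]:
    """Run `extract` over all paths in parallel; return (results, failures)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def run_with_retry(path: str) -> str:
        last_error: Exception | None = None
        for _ in range(max_attempts):
            try:
                return extract(path)
            except Exception as exc:  # keep the error and try again
                last_error = exc
        assert last_error is not None
        raise last_error

    results: dict[str, str] = {}
    failures: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_with_retry, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                failures[path] = str(exc)  # record it, keep the batch going
    return results, failures
```

Threads suit extraction steps that mostly wait on I/O or remote model calls. For CPU-bound local OCR, a process pool is usually the better fit.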
