CLI Tools · Mar 29, 2026 · 2 min read

Docling — Document Parsing for AI

IBM document parsing library. Converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON. Built for RAG pipelines and LLM ingestion.

Introduction

Docling is IBM's open-source document parsing library, designed for AI pipelines. It accurately converts PDFs (including scanned), Word docs, PowerPoints, images, and HTML into clean structured output — markdown, JSON, or document objects.

Best for: RAG pipeline document ingestion, PDF parsing, enterprise document processing

Works with: LangChain, LlamaIndex, any LLM pipeline
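As a minimal sketch of the conversion described above, the snippet below wraps Docling's `DocumentConverter` and renders the result as markdown or JSON. The helper names `convert` and `render` are hypothetical and exist only for this illustration. The `docling` import is deferred into `convert` so the rest of the module loads even where the package is not installed.

```python
import json


def convert(source):
    """Parse a local path or URL with Docling and return its document object.

    Assumes the `docling` package is installed; the import is deferred so
    that `render` below can be used and tested without it.
    """
    from docling.document_converter import DocumentConverter

    return DocumentConverter().convert(source).document


def render(document, fmt="markdown"):
    """Render a parsed document as a markdown string or a JSON string.

    `document` is expected to expose `export_to_markdown()` and
    `export_to_dict()`, as Docling's document object does.
    """
    if fmt == "markdown":
        return document.export_to_markdown()
    if fmt == "json":
        return json.dumps(document.export_to_dict(), indent=2)
    raise ValueError(f"unsupported output format: {fmt!r}")


# Example usage (hypothetical file name, requires docling):
#   doc = convert("report.pdf")
#   print(render(doc, "markdown"))
```

Keeping parsing and rendering separate means one parsed document can feed both a markdown view for an LLM prompt and a JSON dump for storage without converting the file twice.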


Supported Formats

  • PDF — Text, tables, images, scanned documents (OCR)
  • DOCX — Microsoft Word documents
  • PPTX — PowerPoint presentations
  • HTML — Web pages
  • Images — PNG, JPG with OCR
  • Markdown — Passthrough with metadata
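When ingesting a mixed folder, it can help to filter files against the list above before handing them to a parser. The sketch below is an assumption-laden illustration: the suffix table is built by hand from this list (with `.htm` and `.jpeg` added as common variants) and is not Docling's own format detection. `classify` and `partition` are hypothetical helper names.

```python
from pathlib import Path

# Hand-written mapping derived from the list above; illustrative only.
FORMAT_BY_SUFFIX = {
    ".pdf": "PDF",
    ".docx": "DOCX",
    ".pptx": "PPTX",
    ".html": "HTML",
    ".htm": "HTML",    # assumed variant, not named in the list
    ".png": "Image",
    ".jpg": "Image",
    ".jpeg": "Image",  # assumed variant, not named in the list
    ".md": "Markdown",
}


def classify(path):
    """Return the format family for a path, or None if it is not listed."""
    return FORMAT_BY_SUFFIX.get(Path(path).suffix.lower())


def partition(paths):
    """Split paths into (supported, unsupported) lists, preserving order."""
    supported, unsupported = [], []
    for p in paths:
        (supported if classify(p) else unsupported).append(p)
    return supported, unsupported
```

Files that land in the unsupported list can be logged or skipped up front rather than failing partway through a batch.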

Key Features

  • Table extraction — Accurate table parsing to structured data
  • Layout analysis — Understands headers, paragraphs, lists, captions
  • OCR — Built-in for scanned documents
  • Chunking — Hierarchical chunking that respects document structure
  • LangChain integration — DoclingLoader for direct pipeline use
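To make the idea of structure-aware chunking concrete, here is a small standalone sketch that splits markdown output by heading and tags each chunk with its heading path. It is a simplified illustration of the concept, not Docling's chunker. The names `Chunk` and `chunk_markdown` are hypothetical, and the sketch does not handle `#` lines inside fenced code blocks.

```python
import re
from dataclasses import dataclass

# Matches ATX-style markdown headings such as "## Install".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    headings: tuple  # heading path from the top level down to this chunk
    text: str        # body text found under the innermost heading


def chunk_markdown(markdown):
    """Split markdown into chunks, one per run of body text under a heading."""
    chunks = []
    path = []  # stack of (level, title) for the headings currently open
    body = []  # body lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(tuple(title for _, title in path), text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headings at the same or a deeper level before opening this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

Carrying the heading path with each chunk lets a retriever show where a passage came from, which is the main benefit of chunking along document structure instead of at fixed character counts.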

FAQ

Q: What is Docling? A: IBM document parsing library. Converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON. Built for RAG pipelines and LLM ingestion.

Q: How do I install Docling? A: Docling is distributed as a Python package on PyPI and is typically installed with pip install docling.

Source and Acknowledgments

Created by IBM. Licensed under MIT. DS4SD/docling — 15K+ GitHub stars
