CLI Tools · Mar 29, 2026 · 2 min read

Docling — Document Parsing for AI

IBM document parsing library. Converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON. Built for RAG pipelines and LLM ingestion.

Introduction

Docling is IBM's open-source document parsing library, designed for AI pipelines. It accurately converts PDFs (including scanned), Word docs, PowerPoints, images, and HTML into clean structured output — markdown, JSON, or document objects.

Best for: RAG pipeline document ingestion, PDF parsing, enterprise document processing
Works with: LangChain, LlamaIndex, any LLM pipeline
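
The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the docling package is installed; convert_document is a hypothetical wrapper name, and the fmt parameter is an assumption of this sketch rather than part of Docling.

```python
def convert_document(source: str, fmt: str = "markdown"):
    """Convert one file or URL with Docling; return markdown text or a dict."""
    if fmt not in ("markdown", "json"):
        raise ValueError(f"unsupported output format: {fmt}")
    # Imported lazily so this module still loads where docling is not installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    document = result.document  # Docling's in-memory document object
    if fmt == "markdown":
        return document.export_to_markdown()
    return document.export_to_dict()  # JSON-serializable dictionary
```

Calling convert_document("report.pdf") with a real path would return the document as markdown text; passing fmt="json" returns a dictionary that can be written out with json.dumps.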


Supported Formats

  • PDF — Text, tables, images, scanned documents (OCR)
  • DOCX — Microsoft Word documents
  • PPTX — PowerPoint presentations
  • HTML — Web pages
  • Images — PNG, JPG with OCR
  • Markdown — Passthrough with metadata
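
For batch ingestion, a folder can be filtered down to the formats listed above before conversion. The sketch below is one possible approach, not Docling's own format detection: SUPPORTED_SUFFIXES, select_supported, and convert_folder are hypothetical names, and the extension set (including .jpeg) is an assumption derived from the list.

```python
from pathlib import Path

# Extensions derived from the format list above. This mapping is illustrative
# and is not Docling's own format detection; ".jpeg" is an added assumption.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".pptx", ".html", ".png", ".jpg", ".jpeg", ".md"}


def select_supported(paths):
    """Return the paths whose extension is in SUPPORTED_SUFFIXES, sorted by path."""
    return sorted(p for p in map(Path, paths) if p.suffix.lower() in SUPPORTED_SUFFIXES)


def convert_folder(folder: str) -> dict:
    """Convert every supported file in a folder; map file name to markdown."""
    # Imported lazily so the filter above works without docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()  # reuse one converter across all files
    files = select_supported(Path(folder).iterdir())
    return {f.name: converter.convert(f).document.export_to_markdown() for f in files}
```

Filtering first keeps unsupported files such as spreadsheets or plain-text notes from reaching the converter at all.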

Key Features

  • Table extraction — Accurate table parsing to structured data
  • Layout analysis — Understands headers, paragraphs, lists, captions
  • OCR — Built-in for scanned documents
  • Chunking — Hierarchical chunking that respects document structure
  • LangChain integration — DoclingLoader for direct pipeline use
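
The chunking and LangChain features above might be used roughly as follows. This is a hedged sketch, assuming the docling, docling-core, and langchain-docling packages are installed; contextualize, chunk_document, and load_for_langchain are hypothetical helper names, and prefixing each chunk with its heading path is a choice made for this example, not something the page prescribes.

```python
def contextualize(text: str, headings=None) -> str:
    """Prefix a chunk with its heading path so the section context travels with it."""
    path = " > ".join(headings or [])
    return f"{path}\n{text}" if path else text


def chunk_document(source: str) -> list:
    """Convert a document, then split it along its structure."""
    # Lazy imports: both packages are only needed when this function runs.
    from docling.document_converter import DocumentConverter
    from docling_core.transforms.chunker import HierarchicalChunker

    document = DocumentConverter().convert(source).document
    chunks = HierarchicalChunker().chunk(document)
    # Each chunk carries its text plus metadata such as the enclosing headings.
    return [contextualize(c.text, c.meta.headings) for c in chunks]


def load_for_langchain(path: str) -> list:
    """Load a file as LangChain Document objects via langchain-docling."""
    from langchain_docling import DoclingLoader

    return DoclingLoader(file_path=path).load()
```

The strings returned by chunk_document can be passed to an embedding model, while load_for_langchain yields objects that drop straight into an existing LangChain pipeline.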

FAQ

Q: What is Docling? A: Docling is IBM's open-source document parsing library. It converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON, and it is built for RAG pipelines and LLM ingestion.

Q: How do I install Docling? A: Docling is published on PyPI, so it can be installed into a Python environment with pip install docling.

Source and acknowledgments

Created by IBM. Licensed under MIT. DS4SD/docling — 15K+ GitHub stars
