Scripts · Mar 29, 2026 · 1 min read

Docling — Document Parsing for AI

IBM document parsing library. Converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON. Built for RAG pipelines and LLM ingestion.

TokRepo精选 · Community
Quick Use


# Install from PyPI (shell)
pip install docling

# Convert a PDF and print it as Markdown (Python)
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")
print(result.document.export_to_markdown())

Intro

Docling is IBM's open-source document parsing library, designed for AI pipelines. It converts PDFs (including scanned ones, via OCR), Word documents, PowerPoint presentations, images, and HTML into clean, structured output: Markdown, JSON, or document objects.

Best for: RAG pipeline document ingestion, PDF parsing, enterprise document processing
Works with: LangChain, LlamaIndex, any LLM pipeline
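
The choice between Markdown and JSON output can be sketched as a small helper. This is a minimal illustration, not Docling's own interface: the names export_document and convert_file are hypothetical, and it assumes the converted document exposes export_to_markdown() (used in Quick Use) and export_to_dict().

```python
import json


def export_document(document, fmt="markdown"):
    """Serialize a converted document as Markdown or JSON text.

    `document` is assumed to expose export_to_markdown() and
    export_to_dict(), as a Docling document object is expected to.
    """
    if fmt == "markdown":
        return document.export_to_markdown()
    if fmt == "json":
        return json.dumps(document.export_to_dict(), indent=2, ensure_ascii=False)
    raise ValueError(f"unsupported format: {fmt!r}")


def convert_file(path, fmt="markdown"):
    """Convert one file with Docling and return the serialized text."""
    # Imported lazily so export_document stays usable without Docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(path)
    return export_document(result.document, fmt)
```

Calling convert_file("report.pdf", "json") would return the document as a JSON string, which is often easier to store alongside metadata than Markdown.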


Supported Formats

  • PDF — Text, tables, images, scanned documents (OCR)
  • DOCX — Microsoft Word documents
  • PPTX — PowerPoint presentations
  • HTML — Web pages
  • Images — PNG, JPG with OCR
  • Markdown — Passthrough with metadata
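
For a folder of mixed files, one way to apply the format list above is to filter by extension before converting. The sketch below is only an assumption-laden example: the extension set and the helper names select_supported and convert_folder are hypothetical, and the conversion call is the same one shown in Quick Use.

```python
from pathlib import Path

# Extensions assumed from the format list above; adjust for your installation.
SUPPORTED_SUFFIXES = {
    ".pdf", ".docx", ".pptx", ".html", ".htm", ".png", ".jpg", ".jpeg", ".md",
}


def select_supported(paths):
    """Keep paths whose extension matches a listed format, in sorted order."""
    return sorted(
        Path(p) for p in paths if Path(p).suffix.lower() in SUPPORTED_SUFFIXES
    )


def convert_folder(folder):
    """Yield (path, markdown) for each supported file directly inside `folder`."""
    # Imported lazily so select_supported works without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()  # reuse one converter for every file
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    for path in select_supported(files):
        result = converter.convert(path)
        yield path, result.document.export_to_markdown()
```

Filtering first keeps unsupported files (archives, plain text notes) out of the conversion loop instead of relying on error handling.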

Key Features

  • Table extraction — Accurate table parsing to structured data
  • Layout analysis — Understands headers, paragraphs, lists, captions
  • OCR — Built-in for scanned documents
  • Chunking — Hierarchical chunking that respects document structure
  • LangChain integration — DoclingLoader for direct pipeline use
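
The chunking feature above can feed a RAG index roughly as follows. Treat this as a sketch under assumptions: it assumes HierarchicalChunker is importable from docling.chunking, that its chunk() method accepts the document as dl_doc, and that each chunk has a text attribute. The helper names chunks_to_records and chunk_file are hypothetical.

```python
def chunks_to_records(chunks, source):
    """Turn chunk objects (anything with a .text attribute) into plain dicts."""
    records = []
    for index, chunk in enumerate(chunks):
        text = chunk.text.strip()
        if not text:
            continue  # skip chunks that contain only whitespace
        records.append({"source": source, "chunk_id": index, "text": text})
    return records


def chunk_file(path):
    """Convert a file with Docling and return index-ready chunk records."""
    # Lazy imports; the import path for the chunker is an assumption.
    from docling.chunking import HierarchicalChunker
    from docling.document_converter import DocumentConverter

    document = DocumentConverter().convert(path).document
    chunks = HierarchicalChunker().chunk(dl_doc=document)
    return chunks_to_records(chunks, source=str(path))
```

Each record keeps the source path and the chunk's position, so a retrieved passage can be traced back to its document.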


Source & Thanks

Created by IBM. Licensed under MIT. DS4SD/docling — 15K+ GitHub stars
