Scripts · Apr 6, 2026 · 2 min read

Docling — Document Parsing for AI Pipelines

Parse PDF, DOCX, PPTX, HTML, and images into clean Markdown or JSON for LLM ingestion. Handles tables, figures, equations, and complex layouts. By IBM Research. 18,000+ stars.

Script Depot · Community
Quick Use

Use it first, then decide how deep to go


Install:

pip install docling

Python usage:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")
print(result.document.export_to_markdown())

CLI usage:

docling report.pdf --to md

Intro

Docling is a document parsing library from IBM Research (18,000+ GitHub stars) that converts PDF, DOCX, PPTX, HTML, and images into clean Markdown or JSON for LLM ingestion. It handles complex layouts, including tables, figures, mathematical equations, code blocks, and multi-column text, which naive PDF parsers tend to scramble or drop. Best for teams building RAG pipelines that need high-fidelity document parsing. Works with any LLM pipeline, including Haystack, LangChain, and LlamaIndex. Setup time: under 2 minutes.


Supported Formats

Format     Tables         Figures   Equations   Code
PDF        Yes            Yes       Yes         Yes
DOCX       Yes            Yes       Yes         Yes
PPTX       Yes            Yes       N/A         Yes
HTML       Yes            Yes       Yes         Yes
Images     OCR            Yes       OCR         OCR
Markdown   Pass-through   Yes       Yes         Yes
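
Before handing a mixed folder to a parser, it can help to separate files whose format appears in the table above from files that do not. The following is a minimal sketch using only the standard library. The suffix set and the function name partition_by_support are illustrative assumptions, not an official Docling list or API.

```python
from pathlib import Path

# Assumption: these suffixes mirror the formats in the table above; the image
# suffixes in particular are illustrative, not an official Docling list.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".pptx", ".html", ".md", ".png", ".jpg", ".jpeg"}


def partition_by_support(paths):
    """Split paths into (supported, skipped) by case-insensitive file suffix."""
    supported, skipped = [], []
    for path in map(Path, paths):
        target = supported if path.suffix.lower() in SUPPORTED_SUFFIXES else skipped
        target.append(path)
    return supported, skipped
```

Skipped files can be logged rather than silently ignored, which makes gaps in an ingestion run easier to spot.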

Table Extraction

Docling extracts tables with their row and column structure intact. Each table can be exported as a pandas DataFrame:

result = converter.convert("financial_report.pdf")
for table in result.document.tables:
    print(table.export_to_dataframe())  # pandas DataFrame
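
To keep the extracted tables rather than only print them, one option is to write each one to a CSV file. This is a small sketch, not part of Docling: save_tables_as_csv is a hypothetical helper, and it assumes only that each item has an export_to_dataframe() method returning a pandas DataFrame, as in the loop above.

```python
from pathlib import Path


def save_tables_as_csv(tables, output_dir):
    """Write each table to table_<n>.csv and return the written paths.

    Each item only needs an export_to_dataframe() method that returns a
    pandas DataFrame (hypothetical helper, not a Docling function).
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    written = []
    for number, table in enumerate(tables, start=1):
        target = out / f"table_{number}.csv"
        table.export_to_dataframe().to_csv(target, index=False)
        written.append(target)
    return written
```

A call such as save_tables_as_csv(result.document.tables, "tables") would then produce one file per table, numbered in document order.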

Figure Handling

Figures are extracted with captions and alt-text:

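for i, figure in enumerate(result.document.pictures):
    print(f"Caption: {figure.caption_text(result.document)}")
    # get_image() returns None unless picture images were kept during conversion
    image = figure.get_image(result.document)
    if image is not None:
        image.save(f"figure_{i}.png")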

OCR for Scanned Documents

Built-in OCR for scanned PDFs and images:

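# DocumentConverter has no ocr= argument; OCR is enabled by default in the PDF
# pipeline and is tuned through PdfPipelineOptions (do_ocr, ocr_options).
converter = DocumentConverter()
result = converter.convert("scanned_document.pdf")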

Batch Processing

from docling.document_converter import DocumentConverter
from pathlib import Path

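converter = DocumentConverter()
output_dir = Path("./output")
output_dir.mkdir(parents=True, exist_ok=True)  # write_text fails if the folder is missing
for doc in Path("./documents").glob("*.pdf"):
    result = converter.convert(str(doc))
    markdown = result.document.export_to_markdown()
    (output_dir / f"{doc.stem}.md").write_text(markdown, encoding="utf-8")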
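
The loop above stops at the first file that fails to parse. For larger folders, a variant that records failures and carries on may be more practical. The sketch below is one way to do that under stated assumptions: convert_folder and its to_markdown parameter are hypothetical names, and the parser is passed in as a plain callable so the loop does not depend on any particular library.

```python
from pathlib import Path


def convert_folder(input_dir, output_dir, to_markdown, pattern="*.pdf"):
    """Convert each matching file and return (written_paths, failures).

    to_markdown is any callable mapping a file path to a Markdown string,
    so the loop itself does not depend on a particular parser.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the output folder up front
    written, failures = [], []
    for doc in sorted(Path(input_dir).glob(pattern)):
        try:
            markdown = to_markdown(doc)
        except Exception as exc:  # keep going; record what failed and why
            failures.append((doc, str(exc)))
            continue
        target = out / f"{doc.stem}.md"
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written, failures
```

With Docling, to_markdown could be lambda p: converter.convert(str(p)).document.export_to_markdown(), reusing the converter created above.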

Integration with RAG

# With Haystack (pip install docling-haystack)
from docling_haystack.converter import DoclingConverter
converter = DoclingConverter()
documents = converter.run(paths=["report.pdf"])["documents"]

# With LangChain (pip install langchain-docling)
from langchain_docling import DoclingLoader
loader = DoclingLoader(file_path="report.pdf")
docs = loader.load()
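
For pipelines that do not use one of these integrations, the exported Markdown usually still needs to be split into chunks before embedding. The following is a minimal standard-library sketch of heading-based splitting. split_by_heading is a hypothetical helper, not a Docling function, and it deliberately ignores edge cases such as lines starting with # inside code blocks.

```python
import re

# Matches ATX-style Markdown headings such as "# Title" or "### Section".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown):
    """Return a list of (heading, body) chunks from a Markdown string.

    Text before the first heading is returned under an empty heading.
    """
    chunks, heading, body = [], "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous chunk if it has a heading or any text.
            if heading or any(l.strip() for l in body):
                chunks.append((heading, "\n".join(body).strip()))
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    if heading or any(l.strip() for l in body):
        chunks.append((heading, "\n".join(body).strip()))
    return chunks
```

Keeping the heading with each chunk gives the retriever some context about where the text came from in the original document.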

Key Stats

  • 18,000+ GitHub stars
  • By IBM Research
  • 6 input formats supported
  • Table, figure, equation extraction
  • OCR for scanned documents
  • Haystack, LangChain integrations

FAQ

Q: What is Docling? A: Docling is a document parsing library by IBM Research that converts PDF, DOCX, PPTX, and other formats into clean Markdown or JSON for LLM consumption, handling tables, figures, and complex layouts.

Q: Is Docling free? A: Yes. It is open source under the MIT license.

Q: How accurate is Docling compared to PyPDF? A: Significantly more accurate for complex documents. Docling uses deep learning models to understand document layout, while PyPDF does simple text extraction.



Source & Thanks

Created by DS4SD / IBM Research. Licensed under MIT.

docling — ⭐ 18,000+

Thanks to IBM Research for solving document parsing for the AI era.
