Scripts · Apr 7, 2026 · 2 min read

Docling — AI Document Parsing by IBM

Parse PDFs, DOCX, PPTX, and images into structured markdown or JSON. IBM's open-source document AI with OCR, table extraction, and figure understanding.

Script Depot · Community
Quick Use

pip install docling

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")
print(result.document.export_to_markdown())

Convert a PDF to Markdown in a few lines of Python.
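
To keep the converted output, the Markdown string can be written to disk. The following is a minimal sketch: `markdown_path_for`, `save_markdown`, and the `out` directory are hypothetical names introduced here, and the helper only assumes that the document object exposes `export_to_markdown()` as in the Quick Use snippet.

```python
from pathlib import Path


def markdown_path_for(source: str, out_dir: str = "out") -> Path:
    """Map an input such as 'reports/report.pdf' to 'out/report.md'."""
    return Path(out_dir) / f"{Path(source).stem}.md"


def save_markdown(document, source: str, out_dir: str = "out") -> Path:
    """Write document.export_to_markdown() to disk and return the file path.

    `document` is assumed to be the `result.document` object from the
    Quick Use snippet; only its export_to_markdown() method is used.
    """
    target = markdown_path_for(source, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(document.export_to_markdown(), encoding="utf-8")
    return target
```

Usage, continuing from the snippet above: `save_markdown(result.document, "report.pdf")`.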

What is Docling?

Docling is IBM's open-source document parsing library that converts PDFs, DOCX, PPTX, images, and HTML into structured Markdown or JSON. It uses AI models for layout analysis, table extraction, OCR, and figure understanding, which makes documents ready for LLM consumption and RAG pipelines.

Best for: AI teams building RAG pipelines that need to ingest documents.
Works with: LangChain, LlamaIndex, any LLM framework.
Setup time: Under 2 minutes.

Core Features

1. Multi-Format Support

| Format | Features |
|---|---|
| PDF | Layout analysis, OCR, tables, figures |
| DOCX | Structure preservation, images |
| PPTX | Slide text and layout |
| Images | OCR with layout detection |
| HTML | Clean text extraction |
| AsciiDoc | Structure parsing |

2. Table Extraction

Docling extracts tables while preserving their row and column structure. Each table can be exported to a pandas DataFrame:

result = converter.convert("financial_report.pdf")
for table in result.document.tables:
    df = table.export_to_dataframe()
    print(df)
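
Extracted tables are often more useful on disk than on stdout. The sketch below writes each table to its own CSV file. `save_tables_as_csv` and the `tables` directory are hypothetical names, and the helper relies only on `document.tables` and `export_to_dataframe()` as used in the loop above.

```python
from pathlib import Path


def save_tables_as_csv(document, out_dir: str = "tables") -> list[Path]:
    """Write every table in `document.tables` to its own CSV file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, table in enumerate(document.tables, start=1):
        # export_to_dataframe() yields a pandas DataFrame, as in the loop above
        df = table.export_to_dataframe()
        target = out / f"table_{index}.csv"
        df.to_csv(target, index=False)
        written.append(target)
    return written
```

Usage: `save_tables_as_csv(result.document)` returns the list of CSV paths it wrote.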

3. OCR Integration

Built-in OCR for scanned documents:

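from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Assumes the Docling v2 API and the EasyOCR backend ("ch_sim" = Simplified Chinese).
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=EasyOcrOptions(lang=["en", "ch_sim"]),
)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)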

4. Chunking for RAG

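from docling.chunking import HybridChunker

# The tokenizer is given as a Hugging Face model id; match it to your embedding model.
chunker = HybridChunker(tokenizer="sentence-transformers/all-MiniLM-L6-v2", max_tokens=512)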
chunks = list(chunker.chunk(result.document))
# Ready for embedding and vector store
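
Before embedding, chunks usually need to be flattened into plain records that a vector store can ingest. A minimal sketch follows: `chunks_to_records` and the record fields (`id`, `text`, `source`) are hypothetical choices, and the only assumption about each chunk is that it has a `text` attribute.

```python
def chunks_to_records(chunks, source: str) -> list[dict]:
    """Turn chunk objects into plain dicts keyed by a stable per-source id."""
    records = []
    for index, chunk in enumerate(chunks):
        text = chunk.text.strip()
        if not text:
            continue  # skip chunks that contain only whitespace
        records.append({"id": f"{source}#{index}", "text": text, "source": source})
    return records
```

Usage: `records = chunks_to_records(chunks, "report.pdf")`, then embed each record's `text` and store the vector under its `id`.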

5. Framework Integration

# LangChain (pip install langchain-docling)
from langchain_docling import DoclingLoader
loader = DoclingLoader(file_path="report.pdf")
docs = loader.load()

# LlamaIndex (pip install llama-index-readers-docling)
from llama_index.readers.docling import DoclingReader
reader = DoclingReader()
docs = reader.load_data(file_path="report.pdf")

6. Batch Processing

results = converter.convert_all(["doc1.pdf", "doc2.docx", "doc3.pptx"])
for result in results:
    md = result.document.export_to_markdown()
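
In practice the input list for a batch often comes from a folder rather than being typed by hand. The sketch below gathers candidate files by extension. `collect_documents` and `SUPPORTED_EXTENSIONS` are hypothetical names, and the extension set is an assumption derived from the formats listed under Multi-Format Support, not an official list.

```python
from pathlib import Path

# Assumed extension set, based on the formats listed under Multi-Format Support.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".html", ".png", ".jpg", ".jpeg", ".adoc"}


def collect_documents(folder: str, extensions=SUPPORTED_EXTENSIONS) -> list[str]:
    """Return sorted paths of files under `folder` whose extension is supported."""
    root = Path(folder)
    return sorted(
        str(path)
        for path in root.rglob("*")  # walk subfolders as well
        if path.is_file() and path.suffix.lower() in extensions
    )
```

Usage: `results = converter.convert_all(collect_documents("inbox"))`, where `inbox` is a placeholder folder name.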

FAQ

Q: How accurate is the table extraction? A: Docling uses a dedicated TableFormer AI model. Accuracy is 90%+ on standard tables, handling merged cells and complex layouts.

Q: Does it work offline? A: Yes. All models run locally and no API calls are needed, although the model weights are downloaded the first time they are used.

Q: How does it compare to PyMuPDF or pdfplumber? A: Those are rule-based extractors. Docling uses AI models for layout understanding, handling complex documents much better.

Source & Thanks

Created by IBM Research. Licensed under MIT.

docling-project/docling — 15k+ stars
