Docling — AI Document Parsing by IBM

What is Docling?

Docling is IBM's open-source document parsing library that converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON. It uses AI models for layout analysis, table extraction, OCR, and figure understanding — making documents ready for LLM consumption and RAG pipelines.

Answer-Ready: Docling is IBM's open-source document AI that parses PDFs, DOCX, PPTX, and images into structured markdown or JSON. It features AI-powered layout analysis, table extraction, OCR, and figure understanding, and has 15k+ GitHub stars.

Best for: AI teams building RAG pipelines that need to ingest documents.
Works with: LangChain, LlamaIndex, any LLM framework.
Setup time: Under 2 minutes.
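As a minimal sketch of the basic workflow, the helper below converts one document and returns both output forms mentioned above. The function name `convert_document` and the lazy import are illustrative choices rather than part of Docling; the sketch assumes a Docling 2.x install (`pip install docling`) and relies on `DocumentConverter`, `export_to_markdown()`, and `export_to_dict()`.

```python
import json


def convert_document(source: str) -> tuple[str, str]:
    """Convert one file path or URL and return (markdown, json_text).

    Hypothetical helper for illustration; assumes Docling 2.x is installed.
    """
    # Imported lazily because loading Docling pulls in its model dependencies.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    markdown = result.document.export_to_markdown()
    json_text = json.dumps(result.document.export_to_dict(), indent=2)
    return markdown, json_text


# Example call (runs the layout and table models, so the first run is slow):
# markdown, json_text = convert_document("report.pdf")
```

The markdown form is usually the one fed to an LLM, while the JSON form keeps the document structure for downstream processing.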

Core Features

1. Multi-Format Support

Format	Features
PDF	Layout analysis, OCR, tables, figures
DOCX	Structure preservation, images
PPTX	Slide text and layout
Images	OCR with layout detection
HTML	Clean text extraction
AsciiDoc	Structure parsing
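
When a folder mixes supported and unsupported files, it can help to sort them before conversion. The sketch below groups paths by the format names in the table above. The extension mapping and the helper name `group_by_format` are assumptions made for illustration; the exact set of extensions Docling accepts may differ by version.

```python
from pathlib import Path

# Assumed mapping from file extension to the format rows in the table above.
# The exact set of extensions Docling accepts may differ by version.
FORMAT_BY_EXTENSION = {
    ".pdf": "PDF",
    ".docx": "DOCX",
    ".pptx": "PPTX",
    ".png": "Images",
    ".jpg": "Images",
    ".jpeg": "Images",
    ".html": "HTML",
    ".adoc": "AsciiDoc",
}


def group_by_format(paths):
    """Group paths by format name; unknown extensions go under 'unsupported'."""
    groups = {}
    for path in map(Path, paths):
        fmt = FORMAT_BY_EXTENSION.get(path.suffix.lower(), "unsupported")
        groups.setdefault(fmt, []).append(path)
    return groups


groups = group_by_format(["a.pdf", "b.DOCX", "notes.txt", "scan.png"])
```

Everything outside the `unsupported` group can then be passed to the converter.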

2. Table Extraction

Accurately extracts tables preserving structure:

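from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("financial_report.pdf")
# Each detected table can be exported to a pandas DataFrame
for table in result.document.tables:
    df = table.export_to_dataframe()
    print(df)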

3. OCR Integration

Built-in OCR for scanned documents:

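from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# OCR settings are passed per input format (Docling 2.x API).
# Language codes follow the OCR engine; EasyOCR uses "ch_sim" for Simplified Chinese.
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=EasyOcrOptions(lang=["en", "ch_sim"]),
)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)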

4. Chunking for RAG

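from docling.chunking import HybridChunker

# The tokenizer is a Hugging Face model name; use the one your embedding model uses.
chunker = HybridChunker(tokenizer="sentence-transformers/all-MiniLM-L6-v2", max_tokens=512)
# `result` is the output of converter.convert(...) from the earlier examples
chunks = list(chunker.chunk(result.document))
# Ready for embedding and vector store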
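
To show what a `max_tokens` budget means in practice, here is a deliberately simplified sketch of token-bounded packing. It is not Docling's `HybridChunker` algorithm: whitespace-separated words stand in for real tokens, and the function name `pack_paragraphs` is hypothetical.

```python
def pack_paragraphs(paragraphs, max_tokens):
    """Greedily merge consecutive paragraphs without exceeding max_tokens.

    Tokens are approximated by whitespace-separated words. A paragraph that
    is longer than max_tokens on its own is split into word windows.
    """
    chunks, current, current_len = [], [], 0
    for paragraph in paragraphs:
        words = paragraph.split()
        # Split oversized paragraphs into windows of at most max_tokens words.
        windows = [words[i:i + max_tokens] for i in range(0, len(words), max_tokens)]
        for window in windows:
            # Close the current chunk if this window would overflow the budget.
            if current and current_len + len(window) > max_tokens:
                chunks.append(" ".join(current))
                current, current_len = [], 0
            current.extend(window)
            current_len += len(window)
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = pack_paragraphs(
    ["one two three", "four five", "six seven eight nine"], max_tokens=5
)
```

A real chunker counts tokens with the embedding model's tokenizer and respects document structure such as headings and tables, which this sketch ignores.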

5. Framework Integration

# LangChain (pip install langchain-docling)
from langchain_docling import DoclingLoader
loader = DoclingLoader(file_path="report.pdf")
docs = loader.load()

# LlamaIndex
from llama_index.readers.docling import DoclingReader
reader = DoclingReader()
docs = reader.load_data(file_path="report.pdf")

6. Batch Processing

results = converter.convert_all(["doc1.pdf", "doc2.docx", "doc3.pptx"])
for result in results:
    md = result.document.export_to_markdown()
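
A batch run usually ends with the output written to disk. The sketch below saves one markdown file per input. The helper name `save_markdown` is hypothetical, and it assumes each result exposes the source path as `result.input.file`, as Docling's conversion results are expected to; it accepts any objects of that shape, so it can be tried without Docling installed.

```python
from pathlib import Path


def save_markdown(results, out_dir):
    """Write each conversion result to <out_dir>/<input stem>.md.

    `results` is any iterable of objects shaped like Docling's conversion
    results: `result.input.file` is the source path (an assumption here) and
    `result.document.export_to_markdown()` returns the markdown text.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for result in results:
        target = out_path / f"{Path(result.input.file).stem}.md"
        target.write_text(result.document.export_to_markdown(), encoding="utf-8")
        written.append(target)
    return written


# With Docling installed, this would be called as:
# save_markdown(converter.convert_all(["doc1.pdf", "doc2.docx"]), "out")
```

Because `convert_all` yields results one at a time, writing each file inside the loop avoids holding every converted document in memory.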

FAQ

Q: How accurate is the table extraction? A: Docling uses a dedicated TableFormer AI model. Accuracy is 90%+ on standard tables, handling merged cells and complex layouts.

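Q: Does it work offline? A: Yes, all models run locally, so no API calls are needed. The model weights are downloaded on first use, after which conversion can run without network access.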

Q: How does it compare to PyMuPDF or pdfplumber? A: Those are rule-based extractors. Docling uses AI models for layout understanding, handling complex documents much better.
