What is Docling — Document Parsing for AI Pipelines?

Docling parses PDF, DOCX, PPTX, HTML, and image files into clean Markdown or JSON for LLM ingestion. It handles tables, figures, equations, and complex layouts. Developed by IBM Research, with 18,000+ GitHub stars.

Is Docling — Document Parsing for AI Pipelines free to use?

Yes. Docling — Document Parsing for AI Pipelines is freely available on TokRepo. Check the Source & Thanks section on the asset page for the specific open-source license.

How do I install Docling — Document Parsing for AI Pipelines?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command; Docling itself is published on PyPI and installs with pip install docling.

Docling — Document Parsing for AI Pipelines

Supported Formats

Format	Tables	Figures	Equations	Code
PDF	Yes	Yes	Yes	Yes
DOCX	Yes	Yes	Yes	Yes
PPTX	Yes	Yes	N/A	Yes
HTML	Yes	Yes	Yes	Yes
Images	OCR	Yes	OCR	OCR
Markdown	Pass-through	Yes	Yes	Yes
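
When pointing Docling at a mixed folder, it can help to filter files down to the formats in the table above before converting. The following is a minimal sketch using only the standard library. The extension mapping and the helper name supported_files are assumptions made for illustration, not part of Docling's API, and the list of image extensions in particular may need adjusting.

```python
from pathlib import Path

# Assumed mapping from file extension to the format rows in the table above.
# Illustrative only; it is not read from Docling itself.
EXTENSION_TO_FORMAT = {
    ".pdf": "PDF",
    ".docx": "DOCX",
    ".pptx": "PPTX",
    ".html": "HTML",
    ".htm": "HTML",
    ".png": "Images",
    ".jpg": "Images",
    ".jpeg": "Images",
    ".tiff": "Images",
    ".md": "Markdown",
}


def supported_files(paths):
    """Return (path, format) pairs for paths whose extension is in the mapping."""
    pairs = []
    for path in map(Path, paths):
        # Compare case-insensitively so "REPORT.PDF" is treated like "report.pdf".
        fmt = EXTENSION_TO_FORMAT.get(path.suffix.lower())
        if fmt is not None:
            pairs.append((path, fmt))
    return pairs
```

Files with any other extension are skipped silently; a real pipeline might log them instead.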

Table Extraction

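Docling extracts tables while preserving their row and column structure:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("financial_report.pdf")
for table in result.document.tables:
    print(table.export_to_dataframe())  # pandas DataFrame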

Figure Handling

Figures are extracted with captions and alt-text:

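for i, figure in enumerate(result.document.pictures):
    print(f"Caption: {figure.caption_text(result.document)}")
    # get_image returns a PIL image, or None if picture images were not generated
    image = figure.get_image(result.document)
    if image is not None:
        image.save(f"figure_{i}.png")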

OCR for Scanned Documents

Built-in OCR for scanned PDFs and images:

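from docling.document_converter import DocumentConverter
converter = DocumentConverter()  # OCR is on by default in the PDF pipeline; tune it via PdfPipelineOptions
result = converter.convert("scanned_document.pdf")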

Batch Processing

from pathlib import Path

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
output_dir = Path("./output")
output_dir.mkdir(parents=True, exist_ok=True)  # write_text fails if the directory is missing
for doc in sorted(Path("./documents").glob("*.pdf")):
    result = converter.convert(doc)
    markdown = result.document.export_to_markdown()
    (output_dir / f"{doc.stem}.md").write_text(markdown, encoding="utf-8")
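
A single unreadable file would stop a loop like the one above. One way to make a batch run more forgiving is to record failures and continue. The sketch below is independent of Docling: convert_to_markdown is a hypothetical callable standing in for whatever wraps the conversion and Markdown export, and batch_convert is an illustrative name, not a Docling function.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def batch_convert(
    src_dir: Path,
    out_dir: Path,
    convert_to_markdown: Callable[[Path], str],
    pattern: str = "*.pdf",
) -> Tuple[List[Path], Dict[Path, str]]:
    """Convert every matching file, writing <stem>.md and recording failures."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    failures: Dict[Path, str] = {}
    for src in sorted(src_dir.glob(pattern)):
        try:
            markdown = convert_to_markdown(src)
        except Exception as exc:  # keep going; report the failure afterwards
            failures[src] = f"{type(exc).__name__}: {exc}"
            continue
        target = out_dir / f"{src.stem}.md"
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written, failures
```

With Docling, the callable could be something like lambda p: converter.convert(p).document.export_to_markdown(), assuming a DocumentConverter instance named converter. The returned failures dictionary maps each source path to a short error description for later review.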

Integration with RAG

# With Haystack (separate package: pip install docling-haystack)
from docling_haystack.converter import DoclingConverter
converter = DoclingConverter()
documents = converter.run(paths=["report.pdf"])["documents"]

# With LangChain (separate package: pip install langchain-docling)
from langchain_docling import DoclingLoader
loader = DoclingLoader(file_path="report.pdf")
docs = loader.load()
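
Before indexing, exported Markdown is usually split into chunks. The following is a minimal sketch of heading-based splitting using only the standard library. It is not Docling's own chunker, and the function name split_markdown_by_heading and the max_chars limit are assumptions chosen for illustration.

```python
import re
from typing import Dict, List

# Matches ATX-style Markdown headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown: str, max_chars: int = 2000) -> List[Dict[str, str]]:
    """Split Markdown into chunks, one per heading section, capped at max_chars."""
    chunks: List[Dict[str, str]] = []
    heading = ""
    buffer: List[str] = []

    def flush() -> None:
        text = "\n".join(buffer).strip()
        # Cut oversized sections into fixed-size pieces so no chunk exceeds the cap.
        for start in range(0, len(text), max_chars):
            chunks.append({"heading": heading, "text": text[start:start + max_chars]})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section under its own heading
            heading = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks
```

Each chunk keeps its section heading so it can be stored as metadata alongside the text. This sketch treats any line starting with # as a heading, so comment lines inside code blocks would be misread; a production splitter would need to track fenced regions.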

Key Stats

18,000+ GitHub stars
By IBM Research
6 input formats supported
Table, figure, equation extraction
OCR for scanned documents
Haystack, LangChain integrations

FAQ

Q: What is Docling? A: Docling is a document parsing library by IBM Research that converts PDF, DOCX, PPTX, and other formats into clean Markdown or JSON for LLM consumption, handling tables, figures, and complex layouts.

Q: Is Docling free? A: Yes. It is open source under the MIT license and developed by IBM Research.

Q: How accurate is Docling compared to PyPDF? A: Significantly more accurate for complex documents. Docling uses deep learning models to analyze page layout and table structure, while PyPDF extracts the embedded text layer without modeling layout.
