Docling — Document Parsing for AI
IBM document parsing library. Converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON. Built for RAG pipelines and LLM ingestion.
What it is
Docling is IBM's open-source document parsing library that converts PDFs, Word documents, PowerPoint files, images, and HTML into structured markdown or JSON. It handles complex layouts with tables, headers, figures, and multi-column text, producing clean output suitable for RAG pipelines and LLM ingestion. The library uses AI models for layout understanding and OCR when needed.
Developers building RAG applications, document Q&A systems, or any pipeline that needs to extract structured text from documents benefit from Docling. It replaces fragile PDF parsers with a model-based approach that handles real-world document complexity.
How it saves time or tokens
Docling produces clean, structured output that LLMs can process directly. Without it, developers chain together multiple tools (PDF parsers, table extractors, OCR engines) and write custom post-processing code. Docling handles the entire pipeline in one library call. Clean output also means fewer tokens wasted on HTML tags, layout artifacts, and parsing noise.
How to use
- Install Docling via pip: pip install docling
- Create a DocumentConverter instance
- Convert documents and get structured markdown or JSON
Example
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert('report.pdf')
# Get markdown output
markdown = result.document.export_to_markdown()
print(markdown)
# Get structured JSON
json_output = result.document.export_to_dict()
# Process multiple files
for path in ['doc1.pdf', 'doc2.docx', 'slides.pptx']:
    result = converter.convert(path)
    print(f'{path}: {len(result.document.pages)} pages')
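Once the markdown is exported, a RAG pipeline usually splits it into chunks before embedding. The sketch below shows one simple way to do that by cutting at heading lines. `split_markdown_by_heading` is a hypothetical helper written for illustration, not part of Docling's API, and it assumes headings are lines that start with `#`.

```python
def split_markdown_by_heading(markdown):
    """Split markdown text into (heading, body) chunks at lines starting with '#'."""
    chunks = []
    heading, body = "", []
    for line in markdown.splitlines():
        if line.startswith("#"):
            # Close the previous chunk before starting a new one.
            if heading or any(part.strip() for part in body):
                chunks.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    # Flush whatever is left after the last heading.
    if heading or any(part.strip() for part in body):
        chunks.append((heading, "\n".join(body).strip()))
    return chunks
```

Each chunk keeps its heading, which can be stored as metadata next to the embedding. This naive split would also cut at `#` lines inside fenced code blocks, so treat it as a starting point only.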
Related on TokRepo
- AI tools for documents — Browse document processing and parsing tools
- RAG tools — Explore retrieval-augmented generation frameworks
Common pitfalls
- Complex table layouts with merged cells may not parse perfectly; verify output for critical documents
- OCR accuracy depends on image quality; low-resolution scans produce lower quality text extraction
- Processing large documents (100+ pages) can be slow; batch processing with parallel workers improves throughput
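The parallel-worker advice above can be sketched with the standard library. This is a minimal sketch under assumptions: `convert_many` and `docling_to_markdown` are hypothetical names, each worker thread keeps its own `DocumentConverter`, and the source does not say whether threads or processes give the better speedup, so measure on your own hardware.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()

def docling_to_markdown(path):
    """Convert one file with Docling, reusing one converter per worker thread."""
    # Imported lazily so the batching helper below also works with other parsers.
    from docling.document_converter import DocumentConverter
    if not hasattr(_local, "converter"):
        _local.converter = DocumentConverter()
    return _local.converter.convert(path).document.export_to_markdown()

def convert_many(paths, convert_one=docling_to_markdown, max_workers=4):
    """Run convert_one over paths in parallel; return (results, errors) dicts."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {path: pool.submit(convert_one, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:
                # Keep going so one bad file does not abort the whole batch.
                errors[path] = exc
    return results, errors
```

Collecting errors per file, rather than raising on the first failure, makes it easier to retry or inspect only the documents that failed.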
Frequently Asked Questions
What file formats does Docling support?
Docling supports PDF, DOCX, PPTX, images (PNG, JPEG, TIFF), HTML, and AsciiDoc. PDF is the primary format with the most robust parsing support. Other formats have varying levels of layout preservation.
How does Docling handle tables?
Docling uses AI models to detect table boundaries and cell structure. Tables are converted to markdown tables or structured JSON with row and column information. Complex tables with merged cells may require post-processing.
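One cheap post-processing check is to scan the exported markdown for tables whose rows have different column counts. The sketch below is a generic sanity check, not a Docling feature. `find_ragged_tables` is a hypothetical helper, and the source does not say whether Docling ever emits ragged rows.

```python
def find_ragged_tables(markdown):
    """Return (start_line, column_counts) for each markdown table with uneven rows."""
    problems = []
    start, counts = None, []
    # A trailing empty line flushes a table that ends at the end of the text.
    for number, line in enumerate(markdown.splitlines() + [""], start=1):
        stripped = line.strip()
        if stripped.startswith("|"):
            if start is None:
                start = number
            # Cells are the pieces between the outer pipes.
            counts.append(len(stripped.strip("|").split("|")))
        elif start is not None:
            if len(set(counts)) > 1:
                problems.append((start, counts))
            start, counts = None, []
    return problems
```

Tables flagged here are worth a manual look. The check miscounts cells that contain escaped pipe characters, and it cannot detect a table that is evenly shaped but has the wrong contents.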
Can Docling process scanned documents?
Yes. Docling includes OCR capabilities for scanned documents and images. The OCR pipeline runs automatically when text is not extractable from the PDF. Quality depends on scan resolution.
How does Docling compare to LlamaParse?
Docling is open-source and runs locally with no API costs. LlamaParse is a cloud service with potentially better accuracy on complex layouts but requires an API key and per-page pricing. Docling gives you data privacy and no usage fees.
Is Docling ready for production use?
Yes. Docling is maintained by IBM Research and used in production document processing pipelines. It handles common document formats reliably. For edge cases, combine it with fallback parsers.
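Combining Docling with fallback parsers can be as simple as trying each parser in order. The sketch below assumes every parser is a plain function that takes a path and returns text. `parse_with_fallback` and the parser names in the usage note are hypothetical.

```python
def parse_with_fallback(path, parsers):
    """Try each (name, parse_fn) in order; return (name, text) from the first that yields text."""
    failures = []
    for name, parse in parsers:
        try:
            text = parse(path)
        except Exception as exc:
            failures.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        # Empty output counts as a failure, so the next parser gets a chance.
        failures.append(f"{name}: empty output")
    raise RuntimeError(f"all parsers failed for {path}: " + "; ".join(failures))
```

A call might look like `parse_with_fallback('report.pdf', [('docling', docling_parse), ('plain', plain_text_parse)])`, where both functions are ones you supply. Returning the parser name lets you log how often the fallback is actually used.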
Source & Thanks
Created by IBM. Licensed under MIT. DS4SD/docling — 15K+ GitHub stars