What is Docling?
Docling is IBM's open-source document parsing library that converts PDFs, DOCX, PPTX, images, and HTML into structured Markdown or JSON. It uses AI models for layout analysis, table extraction, OCR, and figure understanding, which makes documents ready for LLM consumption and RAG pipelines.
Answer-Ready: Docling is IBM's open-source document AI library that parses PDFs, DOCX, PPTX, and images into structured Markdown or JSON. It features AI-powered layout analysis, table extraction, OCR, and figure understanding, and has 15k+ GitHub stars.
Best for: AI teams building RAG pipelines that need to ingest documents. Works with: LangChain, LlamaIndex, any LLM framework. Setup time: Under 2 minutes.
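
As a rough sketch of the basic workflow (convert one document, then export it as Markdown or JSON), the helper below wraps the conversion in a single function. The function name `convert_document` and its `converter` and `as_json` parameters are hypothetical and exist only for illustration. The Docling import happens lazily inside the function, and the sketch assumes Docling is installed (`pip install docling`) and that the converted document exposes `export_to_markdown()` and `export_to_dict()`.

```python
import json


def convert_document(source, converter=None, as_json=False):
    """Convert one document and return it as Markdown text or a JSON string.

    `converter` is a hypothetical injection point: any object with a
    `convert(source)` method whose result has a `.document` attribute.
    """
    if converter is None:
        # Imported lazily so the helper can also be used with a stand-in converter.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    document = converter.convert(source).document
    if as_json:
        # Assumes the document can be exported as a plain dict.
        return json.dumps(document.export_to_dict(), indent=2)
    return document.export_to_markdown()
```

With Docling installed, `print(convert_document("report.pdf"))` would print the Markdown rendering of a local file named `report.pdf` (a placeholder path). Passing a `converter` explicitly lets one configured instance be reused across many calls.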
Core Features
1. Multi-Format Support
| Format | Features |
|---|---|
| PDF | Layout analysis, OCR, tables, figures |
| DOCX | Structure preservation, images |
| PPTX | Slide text and layout |
| Images | OCR with layout detection |
| HTML | Clean text extraction |
| AsciiDoc | Structure parsing |
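
When a folder mixes many file types, it can help to sort the inputs by the formats in the table above before handing them to Docling. The following is a minimal sketch using only the standard library. The extension mapping and the name `group_by_format` are assumptions made for illustration, not part of Docling, and the set of extensions a given Docling version accepts may differ.

```python
from pathlib import Path

# Assumed extension-to-format mapping based on the table above;
# adjust it to whatever your Docling version actually accepts.
EXTENSION_TO_FORMAT = {
    ".pdf": "PDF",
    ".docx": "DOCX",
    ".pptx": "PPTX",
    ".png": "Images",
    ".jpg": "Images",
    ".jpeg": "Images",
    ".tiff": "Images",
    ".html": "HTML",
    ".htm": "HTML",
    ".adoc": "AsciiDoc",
}


def group_by_format(paths):
    """Group file paths by format name; unknown extensions go under 'unsupported'."""
    groups = {}
    for path in map(Path, paths):
        # Compare extensions case-insensitively (".DOCX" and ".docx" are the same format).
        fmt = EXTENSION_TO_FORMAT.get(path.suffix.lower(), "unsupported")
        groups.setdefault(fmt, []).append(path)
    return groups
```

Files that land in the `"unsupported"` group can be logged or skipped, while the remaining paths can be passed to the converter together.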
2. Table Extraction
Accurately extracts tables preserving structure:
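
    from docling.document_converter import DocumentConverter

    # Convert the PDF, then export each detected table as a pandas DataFrame
    converter = DocumentConverter()
    result = converter.convert("financial_report.pdf")
    for table in result.document.tables:
        df = table.export_to_dataframe()
        print(df)

3. OCR Integration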
Built-in OCR for scanned documents:
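
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Docling 2.x style options; EasyOCR language codes ("ch_sim" is Simplified Chinese)
    pipeline_options = PdfPipelineOptions(
        do_ocr=True,
        ocr_options=EasyOcrOptions(lang=["en", "ch_sim"]),
    )
    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )

4. Chunking for RAG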
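
    import tiktoken
    from docling.chunking import HybridChunker
    from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer
    # cl100k_base is a tiktoken encoding, so it is wrapped for the chunker (recent Docling versions)
    tokenizer = OpenAITokenizer(tokenizer=tiktoken.get_encoding("cl100k_base"), max_tokens=512)
    chunker = HybridChunker(tokenizer=tokenizer)
    chunks = list(chunker.chunk(result.document))
    # Ready for embedding and vector store

5. Framework Integration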

    # LangChain (DoclingLoader ships in the langchain-docling package)
    from langchain_docling import DoclingLoader
    loader = DoclingLoader(file_path="report.pdf")
    docs = loader.load()

    # LlamaIndex (requires the llama-index-readers-docling package)
    from llama_index.readers.docling import DoclingReader
    reader = DoclingReader()
    docs = reader.load_data(file_path="report.pdf")

6. Batch Processing

    # convert_all yields one conversion result per input document
    results = converter.convert_all(["doc1.pdf", "doc2.docx", "doc3.pptx"])
    for result in results:
        md = result.document.export_to_markdown()

FAQ
Q: How accurate is the table extraction? A: Docling uses a dedicated TableFormer AI model. Accuracy is 90%+ on standard tables, handling merged cells and complex layouts.
Q: Does it work offline? A: Yes, all models run locally. No API calls needed.
Q: How does it compare to PyMuPDF or pdfplumber? A: Those are rule-based extractors. Docling uses AI models for layout understanding, handling complex documents much better.