Supported Formats
| Format | Tables | Figures | Equations | Code |
|---|---|---|---|---|
| PDF | Yes | Yes | Yes | Yes |
| DOCX | Yes | Yes | Yes | Yes |
| PPTX | Yes | Yes | N/A | Yes |
| HTML | Yes | Yes | Yes | Yes |
| Images | OCR | Yes | OCR | OCR |
| Markdown | Pass-through | Yes | Yes | Yes |
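
The support matrix above can also be kept as a small lookup when routing files in a pipeline. This is only a sketch: the names `SUPPORT`, `EXTENSIONS`, and `support_for` are hypothetical helpers and not part of Docling, the extension mapping is an assumption, and PDF is taken to be the first row of the table.

```python
from pathlib import Path

# Support matrix transcribed from the table; values are the table's own cell text.
SUPPORT = {
    "PDF": {"tables": "Yes", "figures": "Yes", "equations": "Yes", "code": "Yes"},
    "DOCX": {"tables": "Yes", "figures": "Yes", "equations": "Yes", "code": "Yes"},
    "PPTX": {"tables": "Yes", "figures": "Yes", "equations": "N/A", "code": "Yes"},
    "HTML": {"tables": "Yes", "figures": "Yes", "equations": "Yes", "code": "Yes"},
    "Images": {"tables": "OCR", "figures": "Yes", "equations": "OCR", "code": "OCR"},
    "Markdown": {"tables": "Pass-through", "figures": "Yes", "equations": "Yes", "code": "Yes"},
}

# Assumed extension-to-format mapping, for illustration only.
EXTENSIONS = {
    ".pdf": "PDF", ".docx": "DOCX", ".pptx": "PPTX",
    ".html": "HTML", ".htm": "HTML",
    ".png": "Images", ".jpg": "Images", ".jpeg": "Images",
    ".md": "Markdown",
}


def support_for(path, feature):
    """Return the table cell for this file and feature, or None for unlisted formats."""
    fmt = EXTENSIONS.get(Path(path).suffix.lower())
    if fmt is None:
        return None
    return SUPPORT[fmt][feature]
```

A call such as `support_for("slides.pptx", "equations")` returns the cell text from the table, so a pipeline can skip equation handling for formats marked `N/A`.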
Table Extraction
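Docling extracts tables with their row and column structure preserved, and each table can be exported to a pandas DataFrame:
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("financial_report.pdf")
for table in result.document.tables:
    print(table.export_to_dataframe())  # pandas DataFrame

Figure Handling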
Figures are extracted with captions and alt-text:
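for index, figure in enumerate(result.document.pictures):
    print(f"Caption: {figure.caption_text(result.document)}")
    image = figure.get_image(result.document)  # None unless picture images were generated
    if image is not None:
        image.save(f"figure_{index}.png")

OCR for Scanned Documents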
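OCR for scanned PDFs and images is built in and, in current Docling releases, runs as part of the default PDF pipeline:
# OCR is on by default; it can be toggled with PdfPipelineOptions(do_ocr=...)
converter = DocumentConverter()
result = converter.convert("scanned_document.pdf")

Batch Processing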
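
A batch run over many files usually benefits from two safeguards: pairing each input with its output path up front, and collecting failures instead of stopping at the first bad file. The following is a minimal sketch of that pattern, not Docling's own API. The names `plan_outputs` and `convert_all` are hypothetical, and the conversion step is passed in as a callable so the logic can be tried without Docling installed.

```python
from pathlib import Path


def plan_outputs(input_dir, output_dir, pattern="*.pdf"):
    """Pair each matching input file with the Markdown path it should be written to."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    # Create the target folder up front so later writes do not fail.
    output_dir.mkdir(parents=True, exist_ok=True)
    return [(src, output_dir / f"{src.stem}.md") for src in sorted(input_dir.glob(pattern))]


def convert_all(pairs, to_markdown):
    """Convert each (source, target) pair and return a list of (source, error) failures."""
    failures = []
    for src, dst in pairs:
        try:
            dst.write_text(to_markdown(src), encoding="utf-8")
        except Exception as exc:  # broad on purpose: one bad file should not abort the run
            failures.append((src, exc))
    return failures
```

With Docling, `to_markdown` could be `lambda path: converter.convert(str(path)).document.export_to_markdown()`, which uses only the calls that appear in this section.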
from pathlib import Path
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
output_dir = Path("./output")
output_dir.mkdir(parents=True, exist_ok=True)  # write_text fails if the folder is missing
docs = Path("./documents").glob("*.pdf")
for doc in docs:
    result = converter.convert(str(doc))
    markdown = result.document.export_to_markdown()
    (output_dir / f"{doc.stem}.md").write_text(markdown, encoding="utf-8")

Integration with RAG
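
Whichever framework does the indexing, a RAG pipeline usually splits the converted Markdown into chunks before embedding it. The sketch below is a framework-free illustration of one simple strategy, splitting on Markdown headings. It is an assumption about how one might chunk Docling's Markdown output, not the method used by Docling or its integrations, and `split_markdown_by_heading` is a hypothetical name.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code-fence marker


def split_markdown_by_heading(markdown):
    """Split Markdown into (heading, body) chunks; text before the first heading gets ''."""
    chunks = []
    heading, body, in_fence = "", [], False

    def flush():
        # Emit the current chunk unless it is completely empty.
        text = "\n".join(body).strip()
        if heading or text:
            chunks.append((heading, text))

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # ignore '#' lines inside fenced code
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading text, which can be stored as metadata next to the embedded body so that retrieved passages carry their section context.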
# With Haystack (pip install docling-haystack)
from docling_haystack.converter import DoclingConverter
converter = DoclingConverter()
documents = converter.run(paths=["report.pdf"])["documents"]
# With LangChain (pip install langchain-docling)
from langchain_docling import DoclingLoader
loader = DoclingLoader(file_path="report.pdf")
docs = loader.load()

Key Stats
- 18,000+ GitHub stars
- By IBM Research
- 6 input formats supported
- Table, figure, equation extraction
- OCR for scanned documents
- Haystack, LangChain integrations
FAQ
Q: What is Docling? A: Docling is a document parsing library by IBM Research that converts PDF, DOCX, PPTX, and other formats into clean Markdown or JSON for LLM consumption, handling tables, figures, and complex layouts.
Q: Is Docling free? A: Yes. It is open source, released under the MIT license by IBM Research.
Q: How accurate is Docling compared to PyPDF? A: For complex documents it is generally more accurate. Docling uses deep learning models to analyze page layout and table structure, whereas PyPDF performs plain text extraction without layout analysis.