# Docling: Document Parsing for AI Pipelines

> Parse PDF, DOCX, PPTX, HTML, and images into clean Markdown or JSON for LLM ingestion. Handles tables, figures, equations, and complex layouts. By IBM Research. 18,000+ stars.

## Install

```bash
pip install docling
```

## Quick Use

Save as a script file and run:

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")
print(result.document.export_to_markdown())
```

CLI usage:

```bash
docling report.pdf --to md
```

---

## Intro

Docling is a document parsing library by IBM Research, with 18,000+ GitHub stars, that converts PDF, DOCX, PPTX, HTML, and images into clean Markdown or JSON optimized for LLM ingestion. It handles complex layouts including tables, figures, mathematical equations, code blocks, and multi-column text, the elements that naive PDF parsers destroy.

Best for teams building RAG pipelines that need high-fidelity document parsing.

Works with: any LLM pipeline, Haystack, LangChain, LlamaIndex.

Setup time: under 2 minutes.

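
For RAG ingestion, the exported Markdown usually needs to be split into retrievable chunks before embedding. The sketch below is one simple way to do that with the standard library only: it splits a Markdown string into one chunk per heading and ignores `#` lines inside fenced code blocks. The function name `split_markdown_by_heading` and the chunk layout (`heading`, `text`) are illustrative assumptions, not part of Docling's API.

```python
import re
from typing import Dict, List


def split_markdown_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into chunks, one per heading-delimited section.

    Hypothetical helper for illustration; not a Docling function.
    """
    chunks: List[Dict[str, str]] = []
    heading = ""            # heading of the current section ("" before the first heading)
    body: List[str] = []    # lines collected for the current section
    in_fence = False        # True while inside a ``` fenced code block

    def flush() -> None:
        # Store the current section unless it is completely empty.
        if heading or any(line.strip() for line in body):
            chunks.append({"heading": heading, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        # Only treat "#" lines as headings when outside a code fence.
        match = None if in_fence else re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()
            heading = match.group(1).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


# Example input standing in for result.document.export_to_markdown()
sample = "# Report\n\nIntro text.\n\n## Results\n\n| a | b |\n|---|---|\n| 1 | 2 |\n"
for chunk in split_markdown_by_heading(sample):
    print(chunk["heading"], "->", len(chunk["text"]), "characters")
```

In a real pipeline the `sample` string would be replaced by the Markdown returned from `export_to_markdown()`, and each chunk would be passed to an embedding model together with its heading as metadata.
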

---

## Supported Formats

| Format | Tables | Figures | Equations | Code |
|--------|--------|---------|-----------|------|
| PDF | Yes | Yes | Yes | Yes |
| DOCX | Yes | Yes | Yes | Yes |
| PPTX | Yes | Yes | N/A | Yes |
| HTML | Yes | Yes | Yes | Yes |
| Images | OCR | Yes | OCR | OCR |
| Markdown | Pass-through | Yes | Yes | Yes |

### Table Extraction

Docling extracts tables with their row and column structure preserved. Each table can be exported as a pandas DataFrame:

```python
result = converter.convert("financial_report.pdf")
for table in result.document.tables:
    print(table.export_to_dataframe())  # pandas DataFrame
```

### Figure Handling

Figures are extracted together with their captions. Captions and images are resolved against the parent document:

```python
# Picture images are only kept when generate_picture_images is enabled in the PDF pipeline options
for i, picture in enumerate(result.document.pictures):
    print(f"Caption: {picture.caption_text(result.document)}")
    image = picture.get_image(result.document)  # PIL image, or None if not stored
    if image is not None:
        image.save(f"figure_{i}.png")
```

### OCR for Scanned Documents

OCR for scanned PDFs and images is built in. It is controlled by the `do_ocr` flag of `PdfPipelineOptions`, which is enabled by default:

```python
# OCR runs by default in the PDF pipeline, so no extra argument is needed
converter = DocumentConverter()
result = converter.convert("scanned_document.pdf")
```

### Batch Processing

```python
from pathlib import Path

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
output_dir = Path("./output")
output_dir.mkdir(parents=True, exist_ok=True)  # make sure the target folder exists

for doc in Path("./documents").glob("*.pdf"):
    result = converter.convert(doc)
    markdown = result.document.export_to_markdown()
    (output_dir / f"{doc.stem}.md").write_text(markdown, encoding="utf-8")
```

### Integration with RAG

The framework integrations ship as separate packages (`docling-haystack`, `langchain-docling`):

```python
# With Haystack (pip install docling-haystack)
from docling_haystack.converter import DoclingConverter

converter = DoclingConverter()
documents = converter.run(paths=["report.pdf"])["documents"]

# With LangChain (pip install langchain-docling)
from langchain_docling import DoclingLoader

loader = DoclingLoader(file_path="report.pdf")
docs = loader.load()
```

### Key Stats

- 18,000+ GitHub stars
- By IBM Research
- 6 input formats supported
- Table, figure, equation extraction
- OCR for scanned documents
- Haystack, LangChain integrations

### FAQ

**Q: What is Docling?**
A: Docling is a document parsing library by IBM Research that converts PDF, DOCX, PPTX, and other formats into clean Markdown or JSON for LLM
consumption, handling tables, figures, and complex layouts.

**Q: Is Docling free?**
A: Yes. It is open source under the MIT license.

**Q: How accurate is Docling compared to PyPDF?**
A: Significantly more accurate for complex documents. Docling uses deep learning models to understand document layout, while PyPDF performs plain text extraction without layout analysis.

---

## Source & Thanks

> Created by [DS4SD / IBM Research](https://github.com/DS4SD). Licensed under MIT.
>
> [docling](https://github.com/DS4SD/docling), ⭐ 18,000+

Thanks to IBM Research for solving document parsing for the AI era.

---

Source: https://tokrepo.com/en/workflows/a8327829-385d-47cf-9b22-fa9a5d2aafde
Author: Script Depot