# Docling — AI Document Parsing by IBM

> Parse PDFs, DOCX, PPTX, and images into structured markdown or JSON. IBM's open-source document AI with OCR, table extraction, and figure understanding.

## Install

```bash
pip install docling
```

## Quick Use

Save the following as a script file and run it:

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")
print(result.document.export_to_markdown())
```

Convert any PDF to clean markdown in 3 lines.
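Since Docling can emit both markdown and JSON, it is often handy to keep both on disk. The following is a minimal sketch, not part of Docling itself: `save_outputs` and its `stem` parameter are hypothetical helper names, and it assumes `document` is the `result.document` object from the snippet above, using its `export_to_markdown()` and `export_to_dict()` methods.

```python
import json
from pathlib import Path


def save_outputs(document, out_dir, stem):
    """Write a converted document to <stem>.md and <stem>.json in out_dir.

    `document` is expected to be the `result.document` object from the
    Quick Use example, i.e. anything that offers export_to_markdown()
    and export_to_dict().
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    md_path = out_dir / f"{stem}.md"
    json_path = out_dir / f"{stem}.json"

    md_path.write_text(document.export_to_markdown(), encoding="utf-8")
    json_path.write_text(
        json.dumps(document.export_to_dict(), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return md_path, json_path
```

With the Quick Use script, this would be called as `save_outputs(result.document, "out", "report")`.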

## What is Docling?

Docling is IBM's open-source document parsing library that converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON. It uses AI models for layout analysis, table extraction, OCR, and figure understanding — making documents ready for LLM consumption and RAG pipelines.

**Answer-Ready**: Docling is IBM's open-source document AI that parses PDFs, DOCX, PPTX, and images into structured markdown or JSON, with AI-powered layout analysis, table extraction, OCR, and figure understanding. The repository has 15k+ GitHub stars.

**Best for**: AI teams building RAG pipelines that need to ingest documents. **Works with**: LangChain, LlamaIndex, any LLM framework. **Setup time**: Under 2 minutes.

## Core Features

### 1. Multi-Format Support

| Format | Features |
|--------|----------|
| PDF | Layout analysis, OCR, tables, figures |
| DOCX | Structure preservation, images |
| PPTX | Slide text and layout |
| Images | OCR with layout detection |
| HTML | Clean text extraction |
| AsciiDoc | Structure parsing |
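When ingesting a whole folder, it can help to pick out only the files whose type appears in the table above before handing them to the converter. This is a rough sketch under stated assumptions: `SUPPORTED_SUFFIXES` and `find_convertible_files` are hypothetical names, and the extension list is inferred from the table rather than taken from Docling, so check it against the version you have installed.

```python
from pathlib import Path

# Assumed extension list, derived from the format table above; adjust it
# to match the formats your installed Docling version accepts.
SUPPORTED_SUFFIXES = {
    ".pdf", ".docx", ".pptx", ".html", ".adoc",
    ".png", ".jpg", ".jpeg", ".tiff",
}


def find_convertible_files(root):
    """Return files under root whose suffix is in SUPPORTED_SUFFIXES, sorted."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

The returned paths can be passed straight to the converter, one at a time or as a batch.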

### 2. Table Extraction
Tables are extracted with their row and column structure preserved, and each one can be exported to a pandas DataFrame:

```python
result = converter.convert("financial_report.pdf")
for table in result.document.tables:
    df = table.export_to_dataframe()
    print(df)
```
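Building on the loop above, extracted tables can be written out as CSV files for downstream tools. The sketch below is illustrative only: `export_tables_to_csv` and the `<stem>-<n>.csv` naming are hypothetical choices, and it assumes each table's `export_to_dataframe()` returns a pandas DataFrame, so pandas must be installed.

```python
from pathlib import Path


def export_tables_to_csv(document, out_dir, stem="table"):
    """Write each table in document.tables to its own CSV file.

    Assumes each table offers export_to_dataframe(), as in the loop above,
    and that the returned object is a pandas DataFrame (to_csv).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for index, table in enumerate(document.tables, start=1):
        df = table.export_to_dataframe()
        csv_path = out_dir / f"{stem}-{index}.csv"
        df.to_csv(csv_path, index=False)
        written.append(csv_path)
    return written
```

For the example above, `export_tables_to_csv(result.document, "tables", stem="financial_report")` would produce one CSV per detected table.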

### 3. OCR Integration
Built-in OCR for scanned documents:

```python
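from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Enable OCR and pick EasyOCR languages ("ch_sim" is Simplified Chinese).
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=EasyOcrOptions(lang=["en", "ch_sim"]),
)
# Pipeline options are attached per input format, not passed to the converter directly.
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)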
```

### 4. Chunking for RAG

```python
from docling.chunking import HybridChunker

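# The default tokenizer is a Hugging Face one. To control the token limit,
# pass a tokenizer that matches your embedding model ("cl100k_base" is a
# tiktoken encoding name, not a Hugging Face model ID).
chunker = HybridChunker()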
chunks = list(chunker.chunk(result.document))
# Ready for embedding and vector store
```
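Before embedding, the chunks usually need to be flattened into plain records. The following sketch shows one possible shape: `chunks_to_records`, the `source` argument, and the `id`/`text`/`headings` record layout are all hypothetical, and it assumes each chunk exposes `.text` and `.meta.headings` (a list of section headings, or `None`).

```python
def chunks_to_records(chunks, source):
    """Turn chunker output into plain dicts for an embedding step.

    Each chunk is expected to expose `.text` and `.meta.headings`
    (a list of section headings, or None).
    """
    records = []
    for index, chunk in enumerate(chunks):
        headings = getattr(chunk.meta, "headings", None) or []
        records.append({
            "id": f"{source}#{index}",
            "text": chunk.text,
            "headings": list(headings),
        })
    return records
```

Keeping the headings next to the text lets a retriever show where in the document a chunk came from.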

### 5. Framework Integration

```python
# LangChain
from langchain_docling import DoclingLoader  # pip install langchain-docling
loader = DoclingLoader(file_path="report.pdf")
docs = loader.load()

# LlamaIndex
from llama_index.readers.docling import DoclingReader
reader = DoclingReader()
docs = reader.load_data(file_path="report.pdf")
```

### 6. Batch Processing

```python
results = converter.convert_all(["doc1.pdf", "doc2.docx", "doc3.pptx"])
for result in results:
    md = result.document.export_to_markdown()
```
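In a larger batch, one unreadable file should not stop the rest. The sketch below is one way to handle that, with `convert_batch` as a hypothetical helper name. It assumes `convert_all` accepts `raises_on_error=False` and that each result carries `status`, `input.file`, and `document`; the status is compared by name so the snippet needs no Docling import.

```python
from pathlib import Path


def convert_batch(converter, paths, out_dir):
    """Convert many files, writing Markdown for successes and collecting failures.

    `converter` is a DocumentConverter as in the examples above.
    raises_on_error=False asks convert_all to report failures through
    result.status instead of raising on the first bad file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written, failed = [], []
    for result in converter.convert_all(paths, raises_on_error=False):
        source = Path(result.input.file)
        # Same check as comparing against ConversionStatus.SUCCESS from
        # docling.datamodel.base_models, written without the import.
        if result.status.name == "SUCCESS":
            target = out_dir / f"{source.stem}.md"
            target.write_text(result.document.export_to_markdown(), encoding="utf-8")
            written.append(target)
        else:
            failed.append(source)
    return written, failed
```

The `failed` list can then be logged or retried with different pipeline options, for example with OCR enabled.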

## FAQ

**Q: How accurate is the table extraction?**
A: Docling uses a dedicated TableFormer AI model. Accuracy is 90%+ on standard tables, handling merged cells and complex layouts.

**Q: Does it work offline?**
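A: Yes, all models run locally and no API calls are needed. The model weights are downloaded on first use, so run one conversion while online (or pre-download the models) before going offline.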

**Q: How does it compare to PyMuPDF or pdfplumber?**
A: Those are rule-based extractors. Docling uses AI models for layout understanding, handling complex documents much better.

## Source & Thanks

> Created by [IBM Research](https://github.com/docling-project). Licensed under MIT.
>
> [docling-project/docling](https://github.com/docling-project/docling) — 15k+ stars

<!-- ZH -->

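## Quick Use

```bash
pip install docling
```

Convert a PDF to structured Markdown in three lines of code.

## What is Docling?

Docling is IBM's open-source document parsing library. It converts PDFs, DOCX, PPTX, and images into structured Markdown or JSON, with built-in AI layout analysis, table extraction, and OCR.

**In one sentence**: IBM's open-source document AI that parses PDF/DOCX/PPTX into structured Markdown, with built-in table extraction and OCR. 15k+ GitHub stars.

**Best for**: AI teams building RAG pipelines that need document ingestion.

## Core Features

### 1. Multi-Format Support
Covers PDF, DOCX, PPTX, images, and HTML.

### 2. Table Extraction
AI models extract tables accurately, including merged cells.

### 3. OCR
Built-in multilingual OCR with support for scanned documents.

### 4. Chunking for RAG
A built-in chunker outputs text chunks ready for embedding.

### 5. Framework Integration
Integrates with LangChain and LlamaIndex in one line.

## FAQ

**Q: Does it work offline?**
A: Yes, all models run locally.

**Q: How does it compare to PyMuPDF?**
A: PyMuPDF is rule-based. Docling uses AI models to understand layout and handles complex documents better.

## Source & Thanks

> [docling-project/docling](https://github.com/docling-project/docling), 15k+ stars, MIT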

---
Source: https://tokrepo.com/en/workflows/8dc956cc-c20b-4d92-82dd-24a9280a6315
Author: Script Depot