# Docling: AI Document Parsing by IBM

> Parse PDFs, DOCX, PPTX, and images into structured markdown or JSON. IBM's open-source document AI with OCR, table extraction, and figure understanding.

## Install

```bash
pip install docling
```

## Quick Use

Save the following as a script file and run it:

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")
print(result.document.export_to_markdown())
```

Convert any PDF to clean markdown in a few lines.

## What is Docling?

Docling is IBM's open-source document parsing library that converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON. It uses AI models for layout analysis, table extraction, OCR, and figure understanding, which makes documents ready for LLM consumption and RAG pipelines.

**Answer-Ready**: Docling is IBM's open-source document AI that parses PDFs, DOCX, PPTX, and images into structured markdown or JSON. It features AI-powered layout analysis, table extraction, OCR, and figure understanding. 15k+ GitHub stars.

**Best for**: AI teams building RAG pipelines that need to ingest documents.

**Works with**: LangChain, LlamaIndex, any LLM framework.

**Setup time**: Under 2 minutes.

## Core Features

### 1. Multi-Format Support

| Format | Features |
|--------|----------|
| PDF | Layout analysis, OCR, tables, figures |
| DOCX | Structure preservation, images |
| PPTX | Slide text and layout |
| Images | OCR with layout detection |
| HTML | Clean text extraction |
| AsciiDoc | Structure parsing |

### 2. Table Extraction

Extracts tables while preserving their structure:

```python
# Reuses the converter from Quick Use; export_to_dataframe() requires pandas
result = converter.convert("financial_report.pdf")
for table in result.document.tables:
    df = table.export_to_dataframe()
    print(df)
```

### 3. OCR Integration

Built-in OCR for scanned documents:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Language codes follow the selected OCR engine (EasyOCR uses "ch_sim" for Simplified Chinese)
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=EasyOcrOptions(lang=["en", "ch_sim"]),
)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
```

### 4. Chunking for RAG

```python
from docling.chunking import HybridChunker

# The default tokenizer is a Hugging Face model; pass one that matches your embedding model
chunker = HybridChunker()
chunks = list(chunker.chunk(dl_doc=result.document))
# Ready for embedding and vector store
```

### 5. Framework Integration

```python
# LangChain (pip install langchain-docling)
from langchain_docling import DoclingLoader

loader = DoclingLoader(file_path="report.pdf")
docs = loader.load()

# LlamaIndex (pip install llama-index-readers-docling)
from llama_index.readers.docling import DoclingReader

reader = DoclingReader()
docs = reader.load_data(file_path="report.pdf")
```

### 6. Batch Processing

```python
# convert_all() yields one conversion result per input file
results = converter.convert_all(["doc1.pdf", "doc2.docx", "doc3.pptx"])
for result in results:
    md = result.document.export_to_markdown()
```

## FAQ

**Q: How accurate is the table extraction?**

A: Docling uses a dedicated TableFormer AI model. Reported accuracy is 90%+ on standard tables, and it handles merged cells and complex layouts.

**Q: Does it work offline?**

A: Yes, all models run locally once they have been downloaded. No API calls are needed.

**Q: How does it compare to PyMuPDF or pdfplumber?**

A: Those are rule-based extractors. Docling uses AI models for layout understanding, so it handles complex documents much better.

## Source & Thanks

> Created by [IBM Research](https://github.com/docling-project). Licensed under MIT.
>
> [docling-project/docling](https://github.com/docling-project/docling), 15k+ stars

## Quick Use

```bash
pip install docling
```

Convert a PDF to structured Markdown in a few lines of code.

## What is Docling?

Docling is IBM's open-source document parsing library. It converts PDFs, DOCX, PPTX, and images into structured Markdown or JSON, with built-in AI layout analysis, table extraction, and OCR.

**In one sentence**: IBM's open-source document AI parses PDF/DOCX/PPTX into structured Markdown, with built-in table extraction and OCR. 15k+ GitHub stars.

**Best for**: AI teams building RAG pipelines that need document ingestion.

## Core Features

### 1. Multi-Format Support

Covers PDF, DOCX, PPTX, images, and HTML.

### 2. Table Extraction

AI models extract tables accurately, including merged cells.

### 3. OCR

Built-in multilingual OCR with support for scanned documents.

### 4. RAG Chunking

A built-in chunker directly outputs text chunks suited to embedding.

### 5. Framework Integration

Connect to LangChain or LlamaIndex in one line.

## FAQ

**Q: Does it work offline?**

A: Yes, all models run locally.

**Q: How does it compare to PyMuPDF?**

A: PyMuPDF is rule-based. Docling uses AI models to understand layout and handles complex documents better.

## Source & Thanks

> [docling-project/docling](https://github.com/docling-project/docling), 15k+ stars, MIT

---

Source: https://tokrepo.com/en/workflows/8dc956cc-c20b-4d92-82dd-24a9280a6315

Author: Script Depot