Scripts · March 29, 2026 · 1 min read

Docling — Document Parsing for AI

IBM document parsing library. Converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON. Built for RAG pipelines and LLM ingestion.

TokRepo Picks · Community

Quick Start

Try it first, then decide whether to dig deeper


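# Install first (shell): pip install docling
from docling.document_converter import DocumentConverter

# Create a converter with default settings
converter = DocumentConverter()
# Convert a local file (the path is an example)
result = converter.convert("report.pdf")
# Print the parsed document as markdown
print(result.document.export_to_markdown())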
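Since the output can be markdown or JSON, a pipeline may want both on disk. The following is a minimal sketch under that assumption. `output_paths` and `convert_and_save` are hypothetical helper names, and the `out` directory is an arbitrary choice. Only `DocumentConverter`, `export_to_markdown()`, and `export_to_dict()` come from Docling.

```python
import json
from pathlib import Path


def output_paths(source: str, out_dir: str) -> tuple:
    """Return (markdown_path, json_path) for a source file. Hypothetical helper."""
    stem = Path(source).stem
    base = Path(out_dir)
    return base / f"{stem}.md", base / f"{stem}.json"


def convert_and_save(source: str, out_dir: str = "out") -> tuple:
    """Convert one file and write markdown and JSON next to each other."""
    # Imported lazily so output_paths() works without docling installed.
    from docling.document_converter import DocumentConverter

    md_path, json_path = output_paths(source, out_dir)
    md_path.parent.mkdir(parents=True, exist_ok=True)

    document = DocumentConverter().convert(source).document
    md_path.write_text(document.export_to_markdown(), encoding="utf-8")
    # export_to_dict() returns a JSON-serializable dict of the document.
    json_path.write_text(
        json.dumps(document.export_to_dict(), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return md_path, json_path
```

Calling `convert_and_save("report.pdf")` would then produce `out/report.md` and `out/report.json`, provided the conversion succeeds.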

Introduction

Docling is IBM's open-source document parsing library, designed for AI pipelines. It accurately converts PDFs (including scanned), Word docs, PowerPoints, images, and HTML into clean structured output — markdown, JSON, or document objects.

Best for: RAG pipeline document ingestion, PDF parsing, enterprise document processing
Works with: LangChain, LlamaIndex, any LLM pipeline


Supported Formats

  • PDF — Text, tables, images, scanned documents (OCR)
  • DOCX — Microsoft Word documents
  • PPTX — PowerPoint presentations
  • HTML — Web pages
  • Images — PNG, JPG with OCR
  • Markdown — Passthrough with metadata
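When a folder mixes these formats with unrelated files, one option is to filter by extension before converting in a batch. This is a sketch, not Docling's own format detection. The extension set is an assumption derived from the list above, and `is_supported`, `find_supported`, and `convert_folder` are hypothetical helper names. `convert_all` and `ConversionStatus` are Docling APIs.

```python
from pathlib import Path

# Assumed mapping from the format list above to file extensions.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".html", ".png", ".jpg", ".jpeg", ".md"}


def is_supported(path) -> bool:
    """True if the file extension is in the assumed set (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def find_supported(folder) -> list:
    """Recursively collect files whose extension is in the assumed set."""
    return sorted(p for p in Path(folder).rglob("*") if p.is_file() and is_supported(p))


def convert_folder(folder) -> dict:
    """Convert every supported file in a folder and return markdown keyed by input file."""
    # Imported lazily so the filtering helpers work without docling installed.
    from docling.datamodel.base_models import ConversionStatus
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    markdown_by_file = {}
    # raises_on_error=False lets the batch continue past files that fail.
    for result in converter.convert_all(find_supported(folder), raises_on_error=False):
        if result.status == ConversionStatus.SUCCESS:
            markdown_by_file[str(result.input.file)] = result.document.export_to_markdown()
    return markdown_by_file
```

Files that fail to convert are skipped silently here. A real pipeline would probably log them.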

Key Features

  • Table extraction — Accurate table parsing to structured data
  • Layout analysis — Understands headers, paragraphs, lists, captions
  • OCR — Built-in for scanned documents
  • Chunking — Hierarchical chunking that respects document structure
  • LangChain integration — DoclingLoader for direct pipeline use
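The chunking feature can be sketched as follows for a RAG ingestion step. This assumes `HierarchicalChunker` from the `docling_core` package, which is installed alongside Docling. The record layout (`id`, `text`, `source`) and the helper names `chunk_to_record` and `chunk_document` are illustrative assumptions, not part of Docling.

```python
def chunk_to_record(index: int, text: str, source: str) -> dict:
    """Shape one chunk as a plain dict for a vector store. Hypothetical layout."""
    return {"id": f"{source}#{index}", "text": text.strip(), "source": source}


def chunk_document(source: str) -> list:
    """Convert a file and split it into structure-aware chunks."""
    # Imported lazily so chunk_to_record() works without docling installed.
    from docling.document_converter import DocumentConverter
    from docling_core.transforms.chunker import HierarchicalChunker

    document = DocumentConverter().convert(source).document
    chunker = HierarchicalChunker()
    # chunk() yields chunk objects; each exposes its content as .text.
    return [
        chunk_to_record(i, chunk.text, source)
        for i, chunk in enumerate(chunker.chunk(dl_doc=document))
    ]
```

The resulting list of dicts can be passed to whatever embedding or indexing step the pipeline uses.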


Source and Acknowledgements

Created by IBM. Licensed under MIT. DS4SD/docling — 15K+ GitHub stars
