CLI Tools · March 29, 2026 · 1 min read

Docling — Document Parsing for AI

IBM document parsing library. Converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON. Built for RAG pipelines and LLM ingestion.

Introduction

Docling is IBM's open-source document parsing library, designed for AI pipelines. It accurately converts PDFs (including scanned), Word docs, PowerPoints, images, and HTML into clean structured output — markdown, JSON, or document objects.
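As a rough illustration of the conversion described above, the sketch below wraps Docling's DocumentConverter in a small helper. The helper name convert_to_markdown and its converter parameter are hypothetical, added here only so the function can be exercised without Docling installed; the default path assumes the docling package is available.

```python
from typing import Any, Optional


def convert_to_markdown(source: str, converter: Optional[Any] = None) -> str:
    """Convert one document (local path or URL) to Markdown text.

    `converter` is a hypothetical injection point for this sketch; when it
    is omitted, a Docling DocumentConverter is created.
    """
    if converter is None:
        # Imported lazily so the module loads even where Docling is absent.
        # Assumes `pip install docling` has been run.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    # convert() returns a result whose .document holds the parsed structure.
    result = converter.convert(source)

    # Markdown is one export form; the same document object can also be
    # exported as a dictionary for JSON output.
    return result.document.export_to_markdown()
```

Calling convert_to_markdown("report.pdf"), where report.pdf is a placeholder path, would return the document as a Markdown string. The first run may take longer if model files still need to be downloaded.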

Best for: RAG pipeline document ingestion, PDF parsing, enterprise document processing

Works with: LangChain, LlamaIndex, any LLM pipeline


Supported Formats

  • PDF — Text, tables, images, scanned documents (OCR)
  • DOCX — Microsoft Word documents
  • PPTX — PowerPoint presentations
  • HTML — Web pages
  • Images — PNG, JPG with OCR
  • Markdown — Passthrough with metadata
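When ingesting a mixed folder, it can help to filter files down to the formats listed above before handing them to a converter. The sketch below is plain Python and does not call Docling; the extension set is an assumption inferred from this list, not Docling's own format registry, and select_supported is a hypothetical helper name.

```python
from pathlib import Path
from typing import Iterable, List

# Assumption: extensions inferred from the format list above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".html", ".png", ".jpg", ".jpeg", ".md",
}


def select_supported(paths: Iterable[str]) -> List[Path]:
    """Keep only paths whose extension matches a listed format.

    Matching is case-insensitive, so "scan.PDF" is accepted.
    """
    selected = []
    for raw in paths:
        path = Path(raw)
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            selected.append(path)
    return selected
```

Filtering up front keeps unsupported files out of the conversion step, where failures tend to be slower and harder to report.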

Key Features

  • Table extraction — Accurate table parsing to structured data
  • Layout analysis — Understands headers, paragraphs, lists, captions
  • OCR — Built-in for scanned documents
  • Chunking — Hierarchical chunking that respects document structure
  • LangChain integration — DoclingLoader for direct pipeline use
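To make the idea of structure-aware chunking concrete, here is a minimal stand-in written in plain Python. It is not Docling's chunker: it only splits exported Markdown at headings and records the heading path for each chunk. The function name and the chunk dictionary shape are hypothetical.

```python
from typing import Dict, List


def chunk_markdown_by_headings(markdown: str) -> List[Dict[str, object]]:
    """Split Markdown into one chunk per heading section.

    Each chunk carries the list of headings above it, outermost first,
    so a retriever can show where the text came from.
    """
    chunks: List[Dict[str, object]] = []
    heading_path: List[str] = []  # current heading stack
    buffer: List[str] = []        # body lines of the current section

    def flush() -> None:
        # Emit the buffered section if it contains any text.
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"headings": list(heading_path), "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            level = len(stripped) - len(stripped.lstrip("#"))
            title = stripped[level:].strip()
            # A heading needs 1-6 hashes, a space, then a title.
            if 1 <= level <= 6 and title and stripped[level] == " ":
                flush()
                # Drop headings at this level or deeper, then push.
                del heading_path[level - 1:]
                heading_path.append(title)
                continue
        buffer.append(line)

    flush()
    return chunks
```

This simplification treats any line that looks like a heading as one, including "#" comment lines inside fenced code, so it is only suitable for illustrating the concept. Keeping the heading path with each chunk is the part that matters for retrieval: it preserves context that a fixed-size splitter would lose.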

FAQ

Q: What is Docling? A: IBM document parsing library. Converts PDFs, DOCX, PPTX, images, and HTML into structured markdown or JSON. Built for RAG pipelines and LLM ingestion.

Q: How do I install Docling? A: Docling is published on PyPI, so running pip install docling in a Python environment installs it.

Source and Acknowledgments

Created by IBM. Licensed under MIT. DS4SD/docling — 15K+ GitHub stars
