Supported Formats
- PDF — Text, tables, images, scanned documents (OCR)
- DOCX — Microsoft Word documents
- PPTX — PowerPoint presentations
- HTML — Web pages
- Images — PNG, JPG with OCR
- Markdown — Passthrough with metadata
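As a rough illustration of the list above, the sketch below filters a batch of file paths down to the listed formats before conversion. The extension mapping and both helper names (detect_format, filter_supported) are hypothetical and inferred from the list; they are not part of Docling's API.

```python
from pathlib import Path
from typing import Iterable, List, Optional

# Extensions inferred from the format list above. This mapping is an
# illustrative assumption, not something exported by Docling.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "DOCX",
    ".pptx": "PPTX",
    ".html": "HTML",
    ".png": "Image",
    ".jpg": "Image",
    ".md": "Markdown",
}


def detect_format(path: str) -> Optional[str]:
    """Return the format label for a path, or None if it is not listed."""
    return SUPPORTED_EXTENSIONS.get(Path(path).suffix.lower())


def filter_supported(paths: Iterable[str]) -> List[str]:
    """Keep only the paths whose extension appears in the mapping."""
    return [p for p in paths if detect_format(p) is not None]
```

Matching is case-insensitive on the extension, so Report.PDF and report.pdf are treated the same way.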
Key Features
- Table extraction — Accurate table parsing to structured data
- Layout analysis — Understands headers, paragraphs, lists, captions
- OCR — Built-in for scanned documents
- Chunking — Hierarchical chunking that respects document structure
- LangChain integration — DoclingLoader for direct pipeline use
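The features above might be exercised roughly as follows. This is a minimal sketch, assuming the docling and langchain-docling packages are installed; the three wrapper function names are hypothetical, and the choice of HierarchicalChunker with default settings is an assumption rather than a recommended configuration.

```python
from typing import List


def convert_to_markdown(source: str) -> str:
    """Convert one document (local path or URL) and return Markdown text."""
    # Imported lazily so this module still loads where docling is absent.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()


def chunk_texts(source: str) -> List[str]:
    """Convert a document and return the text of each structural chunk."""
    from docling.chunking import HierarchicalChunker
    from docling.document_converter import DocumentConverter

    doc = DocumentConverter().convert(source).document
    chunker = HierarchicalChunker()
    return [chunk.text for chunk in chunker.chunk(dl_doc=doc)]


def load_for_langchain(file_path: str) -> list:
    """Load a document as LangChain Document objects via DoclingLoader."""
    # Assumes the separate langchain-docling package is installed.
    from langchain_docling import DoclingLoader

    loader = DoclingLoader(file_path=file_path)
    return loader.load()
```

A call such as convert_to_markdown("report.pdf") (a hypothetical file name) would return the document as a single Markdown string, while load_for_langchain returns a list that can be passed on to a text splitter or vector store.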