Kotaemon — Open-Source RAG Document Chat
A clean, open-source RAG tool for chatting with your documents. Supports PDF, DOCX, and web pages, multiple LLM providers, cited answers, and multi-user access. Self-hostable. 25K+ GitHub stars.
What it is
Kotaemon is an open-source RAG (retrieval-augmented generation) tool that lets you chat with your documents. Upload PDFs, DOCX files, or web pages, and ask questions in natural language. Kotaemon retrieves relevant passages and generates answers with citations pointing back to the source documents. It supports multiple LLM providers and can be self-hosted.
It targets researchers, analysts, and knowledge workers who need to extract information from large document collections without reading everything manually.
How it saves time or tokens
Kotaemon handles the full RAG pipeline internally: document parsing, chunking, embedding, vector storage, retrieval, and answer generation with citations. Instead of building this stack from individual components, you run a single application. The citation feature is particularly valuable -- you can verify every answer against the source document.
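To make the pipeline stages concrete, here is a minimal sketch of chunking, embedding, retrieval, and citation-tagged context in plain Python. It is not Kotaemon's code: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector store, and the file name and page texts are made up for illustration.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    doc: str   # source document name (hypothetical)
    page: int  # page the text came from, kept so answers can cite it
    text: str


def chunk_pages(doc: str, pages: list[str], max_words: int = 40) -> list[Chunk]:
    """Split each page into fixed-size word windows, keeping the page number."""
    chunks = []
    for page_no, page_text in enumerate(pages, start=1):
        words = page_text.split()
        for i in range(0, len(words), max_words):
            chunks.append(Chunk(doc, page_no, " ".join(words[i:i + max_words])))
    return chunks


def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (assumption, not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list[tuple[Counter, Chunk]], k: int = 2) -> list[Chunk]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


# Made-up page texts for illustration only.
pages = [
    "The company grew revenue in all regions this year.",
    "Currency fluctuation in Asian markets is a key risk to earnings.",
    "The board approved a new dividend policy.",
]
index = [(embed(c.text), c) for c in chunk_pages("annual_report.pdf", pages)]
top = retrieve("What currency risk does the report mention?", index, k=1)

# The retrieved text, tagged with its source, is what an LLM would be asked to answer from.
context = "\n".join(f"[{c.doc}, page {c.page}] {c.text}" for c in top)
print(context)
```

Carrying the page number on every chunk is what makes citations possible later: the generation step only has to echo the tag of the passage it relied on.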
How to use
- Install and run:
    pip install kotaemon
    python -m kotaemon
  Or with Docker:
    docker run -p 7860:7860 ghcr.io/cinnamon/kotaemon:latest
- Open http://localhost:7860.
- Configure your LLM provider (OpenAI, Anthropic, Ollama) in Settings.
- Upload documents and start asking questions.
Example
User: What are the main risks identified in the annual report?
Kotaemon: The report identifies three main risks:
1. Currency fluctuation exposure in Asian markets [page 12]
2. Supply chain disruption from single-source dependencies [page 15]
3. Regulatory changes in data privacy requirements [page 23]
[Click citations to view source passages]
Each answer includes clickable citations that link to the exact source passages.
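One way to picture a cited answer like the one above is as a list of points, each carrying a reference to its source passage. The sketch below is illustrative only: the class names, field names, and placeholder passage text are assumptions, not Kotaemon's internal data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    document: str  # hypothetical file name
    page: int
    passage: str   # the retrieved text a click would reveal


@dataclass
class AnswerPoint:
    text: str
    citation: Citation


def render(points: list[AnswerPoint]) -> str:
    """Format answer points as a numbered list with page citations."""
    lines = []
    for number, point in enumerate(points, start=1):
        lines.append(f"{number}. {point.text} [page {point.citation.page}]")
    return "\n".join(lines)


points = [
    AnswerPoint(
        "Currency fluctuation exposure in Asian markets",
        Citation("annual_report.pdf", 12, "(source passage text)"),
    ),
    AnswerPoint(
        "Supply chain disruption from single-source dependencies",
        Citation("annual_report.pdf", 15, "(source passage text)"),
    ),
]
print(render(points))
```

Keeping the passage alongside the page number means the interface can show the exact text behind each claim rather than only a page reference.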
Related on TokRepo
- AI tools for RAG -- RAG tools and frameworks
- AI tools for documents -- document processing tools
Common pitfalls
- Document parsing quality varies by file type. PDFs with complex layouts (multi-column, tables, scanned images) may not parse correctly. Pre-process problematic PDFs with an OCR tool for better results.
- Embedding model choice affects retrieval quality. The default embedding model works for general text. For specialized domains (legal, medical), consider a domain-specific embedding model.
- Large document collections increase storage needs and retrieval latency. Once a collection grows to hundreds of documents, make sure there is enough disk space for the index and consider a more performant vector store backend.
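For the embedding-model pitfall above, a small labeled set of questions is often enough to compare models before committing to one. The sketch below computes recall@k from ranked chunk IDs. The chunk IDs, the rankings, and the two model labels are invented for illustration, and producing the rankings from a real embedding model is left out.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def mean_recall(results: dict[str, list[str]], gold: dict[str, set[str]], k: int) -> float:
    """Average recall@k over every question in the gold set."""
    scores = [recall_at_k(results[q], gold[q], k) for q in gold]
    return sum(scores) / len(scores)


# Gold labels: which chunks actually answer each question (hypothetical IDs).
gold = {"q1": {"c3"}, "q2": {"c7", "c9"}}

# Rankings as two different embedding models might return them (made-up data).
general_model = {"q1": ["c1", "c3", "c5"], "q2": ["c7", "c2", "c4"]}
domain_model = {"q1": ["c3", "c1", "c5"], "q2": ["c9", "c7", "c2"]}

print("general:", mean_recall(general_model, gold, k=2))
print("domain: ", mean_recall(domain_model, gold, k=2))
```

A few dozen questions drawn from your own documents usually say more about retrieval quality in your domain than a public benchmark does.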
Frequently Asked Questions
What file formats does Kotaemon support?
Kotaemon supports PDF, DOCX, TXT, Markdown, and web pages (via URL). PDFs are the primary use case and receive the most parsing attention. Other formats are converted to text before processing, and complex formatting in DOCX files is simplified during ingestion.
Can Kotaemon run with local models?
Yes. Kotaemon supports Ollama and other local LLM providers. Both the chat model and the embedding model can run locally, so no data leaves your machine. This is ideal for sensitive documents. Answer quality depends on the local model's capability.
How do citations work?
When Kotaemon generates an answer, it includes references to the specific document passages it used. Each citation links to the source document and highlights the relevant passage, so you can verify the answer's accuracy and read the original context. Citations are a core feature, not an add-on.
Does Kotaemon support multiple users?
Yes. Kotaemon supports multi-user access with separate accounts and document collections. Each user can upload their own documents and maintain private conversations. An admin can manage users and configure global settings. For team deployments, use the Docker version with persistent storage.
How does Kotaemon compare to AnythingLLM?
Both are RAG applications for document chat. Kotaemon focuses on clean document understanding with strong citation support. AnythingLLM is broader, including agents and a plugin system. Kotaemon has a more polished document experience with better PDF handling. AnythingLLM offers more flexibility with its agent and workspace features.
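For the local-model setup mentioned in the FAQ, it can help to confirm that Ollama is answering on your machine before pointing Kotaemon at it. The sketch below sends one prompt to Ollama's local HTTP API using only the standard library. It assumes Ollama is running on its default port (11434) and that the model you name has already been pulled. The helper names are hypothetical, and none of this is Kotaemon configuration.

```python
import json
import urllib.request

# Ollama's default local endpoint (assumes a stock install on this machine).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_model(model: str, prompt: str) -> str:
    """Send the prompt and return the generated text; raises if Ollama is not reachable."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask_local_model("llama3", "Reply with one word: ready")` should return a short string if the server is up. The model name here is only an example, so substitute whichever model you have installed.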
Source & Thanks
Created by Cinnamon. Licensed under Apache 2.0. Repository: Cinnamon/kotaemon (25,000+ GitHub stars).