# CocoIndex: Incremental Data Indexing Engine for AI Agents

> CocoIndex is an open-source framework for building incremental data indexing pipelines. It keeps embeddings and knowledge graphs in sync with source data using change-data-capture, enabling always-fresh context for AI agents and RAG applications.

## Quick Use

```bash
pip install cocoindex
# Define a flow in Python and run
cocoindex server start
```

## Introduction

CocoIndex is a data indexing framework designed for AI applications that need continuously fresh context. Instead of re-processing entire datasets on every update, CocoIndex tracks source changes and incrementally updates downstream indexes such as vector stores or knowledge graphs.

## What CocoIndex Does

- Tracks changes in source data and incrementally updates derived indexes
- Builds and maintains vector embeddings, knowledge graphs, and search indexes
- Connects to databases, file systems, and APIs as data sources
- Orchestrates multi-step transformation pipelines with built-in chunking and embedding
- Exposes a server mode for continuous background synchronization

## Architecture Overview

CocoIndex models data flows as directed acyclic graphs of transformation steps. Each step declares its inputs and outputs. A change-data-capture layer monitors sources for inserts, updates, and deletes, then propagates only the affected records through the pipeline. State is checkpointed in PostgreSQL so restarts resume without reprocessing.

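The incremental idea described in the architecture overview can be illustrated with a small, self-contained Python sketch. This is not CocoIndex's API: `IncrementalIndex`, `fingerprint`, and `embed` are hypothetical names, the in-memory dictionaries stand in for the PostgreSQL checkpoint and the target index, and change detection is done by diffing content hashes between snapshots rather than by a real change-data-capture feed.

```python
import hashlib
from dataclasses import dataclass, field


def fingerprint(content: str) -> str:
    """Stable content hash used to detect changed records."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def embed(content: str) -> list[float]:
    """Hypothetical stand-in for an expensive embedding call."""
    return [float(len(content)), float(sum(map(ord, content)) % 97)]


@dataclass
class IncrementalIndex:
    # key -> fingerprint of the content last processed (the "checkpoint")
    seen: dict[str, str] = field(default_factory=dict)
    # key -> derived value (the downstream index)
    index: dict[str, list[float]] = field(default_factory=dict)
    embed_calls: int = 0

    def sync(self, source: dict[str, str]) -> dict[str, list[str]]:
        """Apply inserts, updates, and deletes; skip unchanged records."""
        changes = {"inserted": [], "updated": [], "deleted": []}
        for key, content in source.items():
            fp = fingerprint(content)
            if self.seen.get(key) == fp:
                continue  # unchanged: no reprocessing
            changes["updated" if key in self.seen else "inserted"].append(key)
            self.index[key] = embed(content)
            self.embed_calls += 1
            self.seen[key] = fp
        # Anything checkpointed but missing from the source was deleted.
        for key in list(self.seen):
            if key not in source:
                changes["deleted"].append(key)
                del self.seen[key]
                del self.index[key]
        return changes


idx = IncrementalIndex()
first = idx.sync({"a.md": "hello", "b.md": "world"})
second = idx.sync({"a.md": "hello", "b.md": "world!", "c.md": "new"})
third = idx.sync({"b.md": "world!", "c.md": "new"})
print(first, second, third, idx.embed_calls)
```

Across the three syncs, `embed` runs only for new or changed records, so the unchanged `a.md` is never re-embedded and its removal in the third snapshot only deletes its index entry. A production system would persist the checkpoint durably and react to change events instead of rescanning the whole source.
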
## Self-Hosting & Configuration

- Install via pip and define indexing flows in Python scripts
- Configure a PostgreSQL instance for internal state management
- Point source connectors at your data (local files, databases, or cloud storage)
- Set target connectors for vector stores like Qdrant, Weaviate, or pgvector
- Run `cocoindex server` for continuous incremental updates or trigger one-shot builds

## Key Features

- True incremental processing avoids redundant embedding and transformation costs
- Declarative Python API for defining multi-step data flows
- Supports custom transformation functions for domain-specific logic
- Built-in connectors for popular vector databases and embedding providers
- Lightweight Rust core for efficient data processing with Python bindings

## Comparison with Similar Tools

- **LlamaIndex**: Focuses on query-time retrieval; CocoIndex focuses on keeping indexes incrementally fresh
- **LangChain**: General LLM orchestration framework without built-in incremental indexing
- **Airbyte**: General ELT platform for data warehouses, not optimized for embedding pipelines
- **Dagster**: Workflow orchestrator that can schedule jobs but lacks native CDC-based incremental updates
- **Unstructured**: Document parsing library without pipeline orchestration or incremental tracking

## FAQ

**Q: Does CocoIndex replace my vector database?**
A: No. CocoIndex sits upstream and keeps your vector database populated with fresh embeddings. It supports multiple vector DB targets.

**Q: What data sources does CocoIndex support?**
A: It supports local files, PostgreSQL, and custom source connectors. The connector list is growing with community contributions.

**Q: Can I use CocoIndex without a GPU?**
A: Yes. CocoIndex calls external embedding APIs by default. You can also run local models if a GPU is available.

**Q: How does CocoIndex handle schema changes?**
A: Changing a flow definition triggers a rebuild of affected downstream steps while preserving unaffected data.

## Sources

- https://github.com/cocoindex-io/cocoindex
- https://cocoindex.io

---

Source: https://tokrepo.com/en/workflows/asset-8424324d
Author: Script Depot