# CocoIndex — Incremental Data Indexing Engine for AI Agents

> CocoIndex is an open-source framework for building incremental data indexing pipelines. It keeps embeddings and knowledge graphs in sync with source data using change-data-capture, enabling always-fresh context for AI agents and RAG applications.

## Quick Use
```bash
pip install cocoindex
# Define a flow in Python and run
cocoindex server start
```

## Introduction
CocoIndex is a data indexing framework designed for AI applications that need continuously fresh context. Instead of re-processing entire datasets on every update, CocoIndex tracks source changes and incrementally updates downstream indexes such as vector stores or knowledge graphs.
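
The difference between full re-processing and incremental updates can be illustrated with a minimal sketch. This is not CocoIndex's API or its internal algorithm: `IncrementalIndex`, `fingerprint`, and `derive` are hypothetical names, and content hashing is only one possible way to detect changes.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a stable content hash used to detect changed records."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class IncrementalIndex:
    """Toy index that re-derives an entry only when its source content changes.

    Hypothetical illustration; not part of the CocoIndex library.
    """

    def __init__(self, derive):
        self.derive = derive    # expensive step, e.g. an embedding call
        self.fingerprints = {}  # source key -> last seen content hash
        self.entries = {}       # source key -> derived value

    def sync(self, source: dict) -> dict:
        """Bring the index in line with `source` and report the work done."""
        stats = {"processed": 0, "skipped": 0, "deleted": 0}
        for key, text in source.items():
            digest = fingerprint(text)
            if self.fingerprints.get(key) == digest:
                stats["skipped"] += 1  # unchanged: reuse the stored entry
                continue
            self.entries[key] = self.derive(text)
            self.fingerprints[key] = digest
            stats["processed"] += 1
        # Drop entries whose source record no longer exists.
        for key in list(self.entries):
            if key not in source:
                del self.entries[key]
                del self.fingerprints[key]
                stats["deleted"] += 1
        return stats
```

With `derive=len` standing in for an embedding call, a second `sync` over a snapshot where one of two records changed would process one record and skip the other.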

## What CocoIndex Does
- Tracks changes in source data and incrementally updates derived indexes
- Builds and maintains vector embeddings, knowledge graphs, and search indexes
- Connects to databases, file systems, and APIs as data sources
- Orchestrates multi-step transformation pipelines with built-in chunking and embedding
- Exposes a server mode for continuous background synchronization
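
To make the chunking step in the list above concrete, here is a small sketch of fixed-size chunking with overlap. The function name `chunk_text` and its parameters are assumptions made for illustration; CocoIndex's built-in splitter may use a different strategy and different defaults.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters.

    Hypothetical helper for illustration; not a CocoIndex function.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Overlap keeps sentences that straddle a boundary visible in both neighboring chunks, at the cost of embedding some text twice.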

## Architecture Overview
CocoIndex models data flows as directed acyclic graphs of transformation steps. Each step declares its inputs and outputs. A change-data-capture layer monitors sources for inserts, updates, and deletes, then propagates only the affected records through the pipeline. State is checkpointed in PostgreSQL so restarts resume without reprocessing.
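
The flow described above can be sketched in a few lines, under simplifying assumptions: the pipeline is a linear chain of steps rather than a general graph, the change log is an in-memory list, and the checkpoint is a plain integer standing in for state that CocoIndex keeps in PostgreSQL. `Change`, `run_steps`, and `apply_changes` are hypothetical names, not library calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Change:
    """One captured source change (hypothetical shape)."""
    op: str                      # "insert", "update", or "delete"
    key: str
    value: Optional[str] = None  # new content; unused for deletes


def run_steps(value: str, steps: list[Callable[[str], str]]) -> str:
    """Pass one record through the transformation steps in order."""
    for step in steps:
        value = step(value)
    return value


def apply_changes(log, index, steps, checkpoint=0):
    """Apply log entries from `checkpoint` onward and return the new checkpoint."""
    for change in log[checkpoint:]:
        if change.op == "delete":
            index.pop(change.key, None)
        elif change.op in ("insert", "update"):
            # Only the affected record is pushed through the pipeline.
            index[change.key] = run_steps(change.value, steps)
        else:
            raise ValueError(f"unknown operation: {change.op}")
        checkpoint += 1
    return checkpoint
```

Passing the returned checkpoint back in on the next call mimics a restart that resumes where the previous run stopped, so earlier changes are not reprocessed.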

## Self-Hosting & Configuration
- Install via pip and define indexing flows in Python scripts
- Configure a PostgreSQL instance for internal state management
- Point source connectors at your data (local files, databases, or cloud storage)
- Set target connectors for vector stores like Qdrant, Weaviate, or pgvector
- Run `cocoindex server` for continuous incremental updates, or trigger one-shot builds
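
As a rough illustration of the PostgreSQL configuration step, the sketch below validates a connection URL before a flow is started. It assumes the URL is supplied through an environment variable named `COCOINDEX_DATABASE_URL`; treat that name as an assumption and check the project documentation for the setting your version expects. `load_database_settings` is a hypothetical helper, not part of the library.

```python
import os
from urllib.parse import urlparse


def load_database_settings(env=None, var="COCOINDEX_DATABASE_URL"):
    """Read and sanity-check a PostgreSQL URL (variable name is an assumption)."""
    env = os.environ if env is None else env
    url = env.get(var)
    if not url:
        raise RuntimeError(f"{var} is not set")
    parsed = urlparse(url)
    if parsed.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"expected a PostgreSQL URL, got scheme {parsed.scheme!r}")
    database = parsed.path.lstrip("/")
    if not parsed.hostname or not database:
        raise ValueError("URL must include a host and a database name")
    return {
        "host": parsed.hostname,
        "port": parsed.port or 5432,  # PostgreSQL's default port
        "database": database,
        "user": parsed.username,
    }
```

Failing early on a missing or malformed URL tends to give a clearer error than a connection failure in the middle of a build.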

## Key Features
- True incremental processing avoids redundant embedding and transformation costs
- Declarative Python API for defining multi-step data flows
- Supports custom transformation functions for domain-specific logic
- Built-in connectors for popular vector databases and embedding providers
- Lightweight Rust core for efficient data processing with Python bindings

## Comparison with Similar Tools
- **LlamaIndex** — Focuses on query-time retrieval; CocoIndex focuses on keeping indexes incrementally fresh
- **LangChain** — General LLM orchestration framework without built-in incremental indexing
- **Airbyte** — General ELT platform for data warehouses, not optimized for embedding pipelines
- **Dagster** — Workflow orchestrator that can schedule jobs but lacks native CDC-based incremental updates
- **Unstructured** — Document parsing library without pipeline orchestration or incremental tracking

## FAQ
**Q: Does CocoIndex replace my vector database?**
A: No. CocoIndex sits upstream and keeps your vector database populated with fresh embeddings. It supports multiple vector DB targets.

**Q: What data sources does CocoIndex support?**
A: It supports local files, PostgreSQL, and custom source connectors. The connector list is growing with community contributions.

**Q: Can I use CocoIndex without a GPU?**
A: Yes. CocoIndex calls external embedding APIs by default. You can also run local models if a GPU is available.

**Q: How does CocoIndex handle schema changes?**
A: Changing a flow definition triggers a rebuild of affected downstream steps while preserving unaffected data.
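
One way to picture "affected downstream steps" is a graph traversal from the step whose definition changed. The sketch below is an assumption about the general idea, not CocoIndex's actual implementation; the step names and the `affected_steps` helper are hypothetical.

```python
from collections import deque


def affected_steps(dag: dict[str, list[str]], changed: str) -> set[str]:
    """Return the changed step plus every step that consumes its output.

    `dag` maps each step name to the steps that read from it.
    """
    affected = {changed}
    queue = deque([changed])
    while queue:
        step = queue.popleft()
        for child in dag.get(step, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


# Hypothetical flow: one source feeding an embedding branch and a graph branch.
flow = {
    "source": ["chunk", "metadata"],
    "chunk": ["embed"],
    "embed": ["vector_export"],
    "metadata": ["graph_export"],
}
```

In this example, changing `chunk` would mark `chunk`, `embed`, and `vector_export` for rebuild, while the `metadata` and `graph_export` branch is left untouched.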

## Sources
- https://github.com/cocoindex-io/cocoindex
- https://cocoindex.io

---
Source: https://tokrepo.com/en/workflows/asset-8424324d
Author: Script Depot