Scripts · May 19, 2026 · 3 min read

CocoIndex — Incremental Data Indexing Engine for AI Agents

CocoIndex is an open-source framework for building incremental data indexing pipelines. It keeps embeddings and knowledge graphs in sync with source data using change-data-capture, enabling always-fresh context for AI agents and RAG applications.

Agent ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, install contract, metadata JSON, adapter-aware plan, and raw content links so agents can judge fit, risk, and next actions.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Kind: Skill
Install: Single
Trust: Established
Entrypoint: CocoIndex Overview
Universal CLI install command:
npx tokrepo install 8424324d-5318-11f1-9bc6-00163e2b0d79

Introduction

CocoIndex is a data indexing framework designed for AI applications that need continuously fresh context. Instead of re-processing entire datasets on every update, CocoIndex tracks source changes and incrementally updates downstream indexes such as vector stores or knowledge graphs.

What CocoIndex Does

  • Tracks changes in source data and incrementally updates derived indexes
  • Builds and maintains vector embeddings, knowledge graphs, and search indexes
  • Connects to databases, file systems, and APIs as data sources
  • Orchestrates multi-step transformation pipelines with built-in chunking and embedding
  • Exposes a server mode for continuous background synchronization

Architecture Overview

CocoIndex models data flows as directed acyclic graphs of transformation steps. Each step declares its inputs and outputs. A change-data-capture layer monitors sources for inserts, updates, and deletes, then propagates only the affected records through the pipeline. State is checkpointed in PostgreSQL so restarts resume without reprocessing.
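
The propagation of inserts, updates, and deletes described above can be illustrated with a minimal sketch. This is plain Python, not the CocoIndex API: `sync`, `fingerprint`, and `toy_embed` are hypothetical names, change detection is assumed to work by comparing content hashes, and in-memory dictionaries stand in for the PostgreSQL checkpoint and the target index.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


def fingerprint(content: str) -> str:
    """Stable content hash used to detect changed records."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def sync(
    source: Dict[str, str],
    checkpoint: Dict[str, str],
    index: Dict[str, List[float]],
    embed: Callable[[str], List[float]],
) -> Tuple[List[str], List[str]]:
    """Apply inserts, updates, and deletes; return (processed keys, deleted keys)."""
    processed: List[str] = []
    deleted: List[str] = []

    # Inserts and updates: only records whose hash differs from the checkpoint.
    for key, content in source.items():
        digest = fingerprint(content)
        if checkpoint.get(key) != digest:
            index[key] = embed(content)
            checkpoint[key] = digest
            processed.append(key)

    # Deletes: records in the checkpoint that no longer exist in the source.
    for key in list(checkpoint):
        if key not in source:
            del checkpoint[key]
            index.pop(key, None)
            deleted.append(key)

    return processed, deleted


def toy_embed(text: str) -> List[float]:
    """Placeholder embedding for illustration only; not a real model."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


if __name__ == "__main__":
    checkpoint: Dict[str, str] = {}
    index: Dict[str, List[float]] = {}
    docs = {"a.md": "hello", "b.md": "world"}

    print(sync(docs, checkpoint, index, toy_embed))  # first run embeds both files
    print(sync(docs, checkpoint, index, toy_embed))  # nothing changed, nothing embedded

    docs["b.md"] = "world!"   # update
    del docs["a.md"]          # delete
    docs["c.md"] = "new"      # insert
    print(sync(docs, checkpoint, index, toy_embed))  # only b.md and c.md are embedded
```

Because the checkpoint survives between calls, a second run over unchanged data does no embedding work. A persistent store would play the same role across process restarts.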

Self-Hosting & Configuration

  • Install via pip and define indexing flows in Python scripts
  • Configure a PostgreSQL instance for internal state management
  • Point source connectors at your data (local files, databases, or cloud storage)
  • Set target connectors for vector stores like Qdrant, Weaviate, or pgvector
  • Run cocoindex server for continuous incremental updates or trigger one-shot builds

Key Features

  • Incremental processing re-embeds and re-transforms only changed records, avoiding redundant compute and API costs
  • Declarative Python API for defining multi-step data flows
  • Supports custom transformation functions for domain-specific logic
  • Built-in connectors for popular vector databases and embedding providers
  • Lightweight Rust core for efficient data processing with Python bindings

Comparison with Similar Tools

  • LlamaIndex — Focuses on query-time retrieval; CocoIndex focuses on keeping indexes incrementally fresh
  • LangChain — General LLM orchestration framework without built-in incremental indexing
  • Airbyte — General ELT platform for data warehouses, not optimized for embedding pipelines
  • Dagster — Workflow orchestrator that can schedule jobs but lacks native CDC-based incremental updates
  • Unstructured — Document parsing library without pipeline orchestration or incremental tracking

FAQ

Q: Does CocoIndex replace my vector database? A: No. CocoIndex sits upstream and keeps your vector database populated with fresh embeddings. It supports multiple vector DB targets.

Q: What data sources does CocoIndex support? A: It supports local files, PostgreSQL, and custom source connectors. The connector list is growing with community contributions.

Q: Can I use CocoIndex without a GPU? A: Yes. CocoIndex calls external embedding APIs by default. You can also run local models if a GPU is available.

Q: How does CocoIndex handle schema changes? A: Changing a flow definition triggers a rebuild of affected downstream steps while preserving unaffected data.
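
One way to picture the rebuild behavior in the last answer is as invalidation over the flow graph. The sketch below is an illustration under stated assumptions, not CocoIndex's actual mechanism: it assumes each step carries a version string that changes when its definition changes, and the step names and the `steps_to_rebuild` helper are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def steps_to_rebuild(
    downstream: Dict[str, List[str]],
    old_versions: Dict[str, str],
    new_versions: Dict[str, str],
) -> Set[str]:
    """Return the changed steps plus every step reachable downstream of them."""
    changed = [s for s, v in new_versions.items() if old_versions.get(s) != v]
    dirty: Set[str] = set(changed)
    queue = deque(changed)

    # Breadth-first walk: anything fed by a dirty step is dirty too.
    while queue:
        step = queue.popleft()
        for child in downstream.get(step, []):
            if child not in dirty:
                dirty.add(child)
                queue.append(child)
    return dirty


if __name__ == "__main__":
    # Hypothetical flow: one source feeding a vector branch and a graph branch.
    downstream = {
        "load": ["chunk", "metadata"],
        "chunk": ["embed"],
        "embed": ["export_vectors"],
        "metadata": ["export_graph"],
    }
    old = {"load": "v1", "chunk": "v1", "embed": "v1",
           "metadata": "v1", "export_vectors": "v1", "export_graph": "v1"}
    new = dict(old, chunk="v2")  # only the chunking step was edited

    print(sorted(steps_to_rebuild(downstream, old, new)))
```

In this example, editing the chunking step marks only the vector branch for rebuild; the `load`, `metadata`, and `export_graph` steps keep their existing outputs.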
