Introduction
CocoIndex is a data indexing framework designed for AI applications that need continuously fresh context. Instead of reprocessing entire datasets on every update, CocoIndex tracks source changes and incrementally updates downstream indexes such as vector stores or knowledge graphs.
What CocoIndex Does
- Tracks changes in source data and incrementally updates derived indexes
- Builds and maintains vector embeddings, knowledge graphs, and search indexes
- Connects to databases, file systems, and APIs as data sources
- Orchestrates multi-step transformation pipelines with built-in chunking and embedding
- Exposes a server mode for continuous background synchronization
Architecture Overview
CocoIndex models data flows as directed acyclic graphs of transformation steps. Each step declares its inputs and outputs. A change-data-capture (CDC) layer monitors sources for inserts, updates, and deletes, then propagates only the affected records through the pipeline. State is checkpointed in PostgreSQL, so after a restart the pipeline resumes from the last checkpoint instead of reprocessing everything.
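The change-tracking idea can be sketched in a few lines of plain Python. This is a minimal illustration, not CocoIndex's implementation or API: it assumes changes are found by comparing content fingerprints against a stored snapshot (a real CDC layer may read a change log instead), and it uses an in-memory dict as a stand-in for the PostgreSQL checkpoint. The names fingerprint, diff_snapshot, and apply_changes are hypothetical.

```python
from __future__ import annotations

import hashlib
from typing import Callable


def fingerprint(content: str) -> str:
    """Return a stable content hash used to detect changed records."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def diff_snapshot(checkpoint: dict[str, str], source: dict[str, str]) -> dict[str, list[str]]:
    """Classify source keys as inserts, updates, or deletes relative to the checkpoint."""
    changes: dict[str, list[str]] = {"insert": [], "update": [], "delete": []}
    for key, content in source.items():
        if key not in checkpoint:
            changes["insert"].append(key)
        elif checkpoint[key] != fingerprint(content):
            changes["update"].append(key)
    for key in checkpoint:
        if key not in source:
            changes["delete"].append(key)
    return changes


def apply_changes(
    checkpoint: dict[str, str],
    source: dict[str, str],
    changes: dict[str, list[str]],
    process: Callable[[str, str], None],
) -> None:
    """Run `process` only on inserted or updated records, then update the checkpoint."""
    for key in changes["insert"] + changes["update"]:
        process(key, source[key])
        checkpoint[key] = fingerprint(source[key])
    for key in changes["delete"]:
        del checkpoint[key]


if __name__ == "__main__":
    checkpoint: dict[str, str] = {}  # stand-in for state persisted in PostgreSQL
    first = {"a.md": "alpha", "b.md": "beta"}
    apply_changes(checkpoint, first, diff_snapshot(checkpoint, first), lambda k, c: None)

    second = {"a.md": "alpha", "b.md": "beta v2", "c.md": "gamma"}
    print(diff_snapshot(checkpoint, second))
    # {'insert': ['c.md'], 'update': ['b.md'], 'delete': []}
```

Because the checkpoint survives between runs, a restarted process that reloads it would see no changes for a.md and skip it, which is the behavior the paragraph above describes.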
Self-Hosting & Configuration
- Install via pip and define indexing flows in Python scripts
- Configure a PostgreSQL instance for internal state management
- Point source connectors at your data (local files, databases, or cloud storage)
- Set target connectors for vector stores like Qdrant, Weaviate, or pgvector
- Run the cocoindex server command for continuous incremental updates, or trigger a one-shot build instead
Key Features
- Incremental processing recomputes only changed records, avoiding redundant embedding and transformation costs
- Declarative Python API for defining multi-step data flows
- Supports custom transformation functions for domain-specific logic
- Built-in connectors for popular vector databases and embedding providers
- Lightweight Rust core with Python bindings for efficient data processing
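To illustrate how incremental processing can avoid redundant embedding costs, here is a small sketch in plain Python. It does not use the CocoIndex API. It assumes a naive fixed-size chunker and a cache keyed by chunk content hash, and toy_embedding is a hypothetical stand-in for a real embedding model or API call.

```python
from __future__ import annotations

import hashlib
from typing import Callable


def chunk_text(text: str, chunk_size: int = 40) -> list[str]:
    """Split text into fixed-size character chunks (a deliberately naive chunker)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


class CachingEmbedder:
    """Wraps an embedding function and caches results by chunk content hash."""

    def __init__(self, embed_fn: Callable[[str], list[float]]) -> None:
        self._embed_fn = embed_fn
        self._cache: dict[str, list[float]] = {}
        self.calls = 0  # how many times the underlying embed_fn actually ran

    def embed(self, chunk: str) -> list[float]:
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._embed_fn(chunk)
        return self._cache[key]


def toy_embedding(chunk: str) -> list[float]:
    """Hypothetical stand-in for a real model: four floats derived from a hash."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:4]]


def index_document(text: str, embedder: CachingEmbedder) -> list[tuple[str, list[float]]]:
    """Chunk a document and embed each chunk, reusing cached embeddings."""
    return [(chunk, embedder.embed(chunk)) for chunk in chunk_text(text)]


if __name__ == "__main__":
    embedder = CachingEmbedder(toy_embedding)
    index_document("A" * 40 + "B" * 40 + "C" * 40, embedder)
    print(embedder.calls)  # 3: every chunk is new
    index_document("A" * 40 + "B" * 40 + "D" * 40, embedder)
    print(embedder.calls)  # 4: only the changed third chunk was re-embedded
```

One caveat of this simplified design: with fixed-size chunks, inserting text near the start of a document shifts every later chunk boundary and defeats the cache. Splitting on structural boundaries such as paragraphs or headings keeps unchanged chunks stable.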
Comparison with Similar Tools
- LlamaIndex — Focuses on query-time retrieval; CocoIndex focuses on keeping indexes incrementally fresh
- LangChain — General LLM orchestration framework without built-in incremental indexing
- Airbyte — General ELT platform for data warehouses, not optimized for embedding pipelines
- Dagster — Workflow orchestrator that can schedule jobs but lacks native CDC-based incremental updates
- Unstructured — Document parsing library without pipeline orchestration or incremental tracking
FAQ
Q: Does CocoIndex replace my vector database? A: No. CocoIndex sits upstream and keeps your vector database populated with fresh embeddings. It supports multiple vector DB targets.
Q: What data sources does CocoIndex support? A: It supports local files, PostgreSQL, and custom source connectors. The connector list is growing with community contributions.
Q: Can I use CocoIndex without a GPU? A: Yes. CocoIndex can call external embedding APIs, so no local GPU is required. You can also run local models, which benefit from a GPU when one is available.
Q: How does CocoIndex handle schema changes? A: Changing a flow definition triggers a rebuild of affected downstream steps while preserving unaffected data.
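As a rough illustration of the last answer, the sketch below shows one way to decide which steps need rebuilding after a flow definition changes. It is not CocoIndex's actual mechanism: it assumes each step carries a logic version string, that the flow is acyclic, and that a step is stale when its own version or anything upstream of it has changed. The step names and the step_fingerprints and stale_steps helpers are hypothetical.

```python
from __future__ import annotations

import hashlib

# A flow maps step name -> (logic version, list of upstream step names).
Flow = dict[str, tuple[str, list[str]]]


def step_fingerprints(flow: Flow) -> dict[str, str]:
    """Fingerprint each step from its own version plus all upstream fingerprints."""
    result: dict[str, str] = {}

    def visit(name: str) -> str:
        if name not in result:
            version, upstream = flow[name]
            parts = [version] + [visit(u) for u in sorted(upstream)]
            result[name] = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
        return result[name]

    for name in flow:
        visit(name)
    return result


def stale_steps(old: Flow, new: Flow) -> list[str]:
    """Return steps in the new flow whose fingerprint differs from the old flow."""
    old_fp = step_fingerprints(old)
    new_fp = step_fingerprints(new)
    return sorted(name for name, fp in new_fp.items() if old_fp.get(name) != fp)


if __name__ == "__main__":
    old_flow: Flow = {
        "read": ("v1", []),
        "chunk": ("v1", ["read"]),
        "embed": ("v1", ["chunk"]),
        "extract_title": ("v1", ["read"]),
    }
    # Change only the chunking logic.
    new_flow = {**old_flow, "chunk": ("v2", ["read"])}
    print(stale_steps(old_flow, new_flow))  # ['chunk', 'embed']
```

Here read and extract_title keep their fingerprints, so their stored outputs can be preserved, while chunk and everything downstream of it are rebuilt.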