Why Choose It
If you’re already building a RAG application with LlamaIndex, you don’t need a separate memory library. LlamaIndex ships three production-ready memory modules that compose cleanly with its ChatEngine, AgentWorker, and query pipelines.
ChatMemoryBuffer is the simplest: a ring buffer of recent messages with token-aware trimming. VectorMemory embeds messages and retrieves them by similarity — useful when conversations go long and chronological recency isn’t enough. CompositeMemory (the SimpleComposableMemory class) combines both, plus an optional ChatSummaryMemoryBuffer that runs LLM-based rolling summarization.
Ceiling: these modules focus on conversation memory within a single session. For user-level persistent facts across sessions, either pair LlamaIndex with mem0/Zep, or use LlamaIndex agents with a custom long-term memory tool.
Quick Start — ChatEngine + CompositeMemory
CompositeMemory combines a recency buffer (primary) with semantic recall + summarization (secondary). The ChatEngine calls .get() on the composite before every turn to assemble context. Swap the VectorMemory backend for any LlamaIndex vector store (Chroma, Qdrant, PGVector, Pinecone).
```python
# pip install llama-index llama-index-vector-stores-chroma
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.memory import (
    ChatMemoryBuffer,
    VectorMemory,
    ChatSummaryMemoryBuffer,
    SimpleComposableMemory,
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Build the RAG index from local documents.
docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)

# Primary memory: recency buffer. Secondary: semantic recall + rolling summary.
primary = ChatMemoryBuffer.from_defaults(token_limit=3000)
secondary = [
    VectorMemory.from_defaults(
        embed_model=Settings.embed_model,
        retriever_kwargs={"similarity_top_k": 3},
    ),
    ChatSummaryMemoryBuffer.from_defaults(token_limit=2000),
]
memory = SimpleComposableMemory.from_defaults(
    primary_memory=primary,
    secondary_memory_sources=secondary,
)

chat_engine = index.as_chat_engine(chat_mode="condense_plus_context", memory=memory)
print(chat_engine.chat("I'm William, building a Nuxt app. Remember that."))
print(chat_engine.chat("What framework am I using?"))
```

Core Capabilities
ChatMemoryBuffer (recency)
Token-aware ring buffer of recent messages. Drop-in for any ChatEngine. Configurable token_limit, automatic trimming.
VectorMemory (semantic recall)
Embeds every message and retrieves semantically similar history. Useful when conversations span many topics and recent ≠ relevant.
ChatSummaryMemoryBuffer (summarization)
Runs LLM summarization when the buffer overflows. Summary persists; raw messages are dropped. Pair with ChatMemoryBuffer for hybrid recent + summary.
SimpleComposableMemory
Combines a primary memory with any number of secondary sources. Retrievals from all are merged into a single context block passed to the LLM.
Agent-friendly
Memory modules plug into LlamaIndex’s AgentWorker / FunctionAgent as first-class arguments. Agents keep memory state across tool calls and iterations.
Persistent state optional
Memory is per-ChatEngine instance by default. Serialize/deserialize via memory.to_dict() / from_dict() or pair with a persistent store for cross-process state.
Comparison
| Tool | Scope | Best Fit | Persistent Cross-session | Dedicated Library? |
|---|---|---|---|---|
| LlamaIndex Memory (this) | Session + RAG | RAG pipelines with chat | Manual (serialize + reload) | No — built into LlamaIndex |
| mem0 | User-level facts | Production chatbots | Yes (default) | Yes, standalone |
| Zep | Sessions + user facts | Production chat with UI | Yes | Yes, standalone service |
| LangMem | Thread-scoped + namespace | LangChain agents | Yes (namespace by user) | Yes, LangChain-native |
Real-World Use Cases
01. RAG-first chat apps
Apps where the primary workload is retrieval-over-documents and chat memory is secondary. LlamaIndex gives you both in one framework, no integration glue.
02. Research assistants
Long conversations over academic papers or codebases. VectorMemory + summarization keeps the agent grounded in both recent turns and earlier context.
03. Agentic workflows
LlamaIndex agents (FunctionAgent, ReActAgent) that maintain state across tool-calling loops — memory modules are how the agent "remembers" what it already tried.
Pricing & Licensing
LlamaIndex: MIT open source. Memory modules are part of llama-index-core — no extra install, no license cost. You pay only for LLM + embedding API calls and your chosen vector store.
Production storage: use llama-index-vector-stores-* integrations (Qdrant, PGVector, Chroma, Pinecone, Weaviate, Milvus). Memory state is stored alongside your RAG data.
Cost profile: similar to other SDK-only options. ChatSummaryMemoryBuffer adds summarization LLM calls proportional to conversation length.
Related TokRepo Assets
Pal MCP Server — Multi-Model AI Gateway for Claude Code
MCP server that lets Claude Code use Gemini, OpenAI, Grok, and Ollama as a unified AI dev team. Features model routing, CLI-to-CLI bridge, and conversation continuity across 7+ providers.
Ollama Model Library — Best AI Models for Local Use
Curated guide to the best models available on Ollama for coding, chat, and reasoning. Compare Llama, Mistral, Gemma, Phi, and Qwen models for local AI development.
Replicate — Run AI Models via Simple API Calls
Cloud platform to run open-source AI models with a simple API. Replicate hosts Llama, Stable Diffusion, Whisper, and thousands of models — no GPU setup or Docker required.
Open WebUI — Self-Hosted AI Chat Platform
Feature-rich, offline-capable AI interface for Ollama, OpenAI, and local LLMs. Built-in RAG, voice, model builder. 130K+ stars.
FAQ
LlamaIndex Memory vs mem0 — which for my project?
LlamaIndex Memory when you want memory inside a RAG pipeline with no extra dependencies. mem0 when you need cross-session user-level facts, managed cloud, or framework independence. They can coexist — LlamaIndex for session memory, mem0 for user profile.
Does LlamaIndex Memory work with LangChain?
The modules are LlamaIndex-specific (they hook into LlamaIndex’s ChatEngine/Agent abstractions). For LangChain, use LangMem. You can share a vector DB between the two frameworks though.
Can I use LlamaIndex Memory without RAG?
Yes. A ChatMemoryBuffer or CompositeMemory can wrap any LlamaIndex ChatEngine, including ones without a backing index. But if you don’t need RAG at all, the bare LLM SDK + mem0 is a lighter stack.
How does LlamaIndex handle long conversations?
Either (a) raise ChatMemoryBuffer’s token_limit and pay the context window cost, (b) add ChatSummaryMemoryBuffer to compress history, or (c) use CompositeMemory to combine recency + summary + semantic recall. (c) is the production-grade answer.
Is LlamaIndex Memory production-ready?
Yes — it ships in llama-index-core and is used in production by many LlamaIndex customers. API is stable; new modules are added incrementally without breaking existing ones.