LlamaIndex Memory — Built-in Memory for RAG Pipelines

LlamaIndex ships several built-in memory modules (ChatMemoryBuffer, VectorMemory, CompositeMemory); a single constructor argument adds memory to a RAG pipeline.

Why choose it

If you’re already building a RAG application with LlamaIndex, you don’t need a separate memory library. LlamaIndex ships three production-ready memory modules that compose cleanly with its ChatEngine, AgentWorker, and query pipelines.

ChatMemoryBuffer is the simplest: a ring buffer of recent messages with token-aware trimming. VectorMemory embeds messages and retrieves them by similarity — useful when conversations go long and chronological recency isn’t enough. CompositeMemory combines both, plus an optional ChatSummaryMemoryBuffer that runs LLM-based rolling summarization.

Ceiling: these modules focus on conversation memory within a single session. For user-level persistent facts across sessions, either pair LlamaIndex with mem0/Zep, or use LlamaIndex agents with a custom long-term memory tool.

Quick Start — ChatEngine + CompositeMemory

CompositeMemory combines a recency buffer (primary) with semantic recall + summarization (secondary). The ChatEngine calls .get() on the composite before every turn to assemble context. Swap the VectorMemory backend for any LlamaIndex vector store (Chroma, Qdrant, PGVector, Pinecone).

# pip install llama-index llama-index-vector-stores-chroma
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.memory import (
    ChatMemoryBuffer, VectorMemory, ChatSummaryMemoryBuffer, SimpleComposableMemory,
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)

# Primary: recency buffer; secondary: semantic recall + rolling summary
primary = ChatMemoryBuffer.from_defaults(token_limit=3000)
secondary = [
    VectorMemory.from_defaults(
        embed_model=Settings.embed_model,
        retriever_kwargs={"similarity_top_k": 3},
    ),
    ChatSummaryMemoryBuffer.from_defaults(token_limit=2000),
]
memory = SimpleComposableMemory.from_defaults(
    primary_memory=primary,
    secondary_memory_sources=secondary,
)

chat_engine = index.as_chat_engine(chat_mode="condense_plus_context", memory=memory)

print(chat_engine.chat("I'm William, building a Nuxt app. Remember that."))
print(chat_engine.chat("What framework am I using?"))

Core capabilities

ChatMemoryBuffer (recency)

Token-aware ring buffer of recent messages. Drop-in for any ChatEngine. Configurable token_limit, automatic trimming.

VectorMemory (semantic recall)

Embeds every message and retrieves semantically similar history. Useful when conversations span many topics and recent ≠ relevant.

ChatSummaryMemoryBuffer (summarization)

Runs LLM summarization when the buffer overflows. Summary persists; raw messages are dropped. Pair with ChatMemoryBuffer for hybrid recent + summary.

SimpleComposableMemory

Combines a primary memory with any number of secondary sources. Retrievals from all are merged into a single context block passed to the LLM.

Agent-friendly

Memory modules plug into LlamaIndex’s AgentWorker / FunctionAgent as first-class arguments. Agents keep memory state across tool calls and iterations.

Persistent state optional

Memory is per-ChatEngine instance by default. Serialize/deserialize via memory.to_dict() / from_dict() or pair with a persistent store for cross-process state.

Comparison

|  | Scope | Best fit | Persistent cross-session | Dedicated library? |
|---|---|---|---|---|
| LlamaIndex Memory | Session + RAG | RAG pipelines with chat | Manual (serialize + reload) | No (built into LlamaIndex) |
| mem0 | User-level facts | Production chatbots | Yes (default) | Yes, standalone |
| Zep | Sessions + user facts | Production chat with UI | Yes | Yes, standalone service |
| LangMem | Thread-scoped + namespace | LangChain agents | Yes (namespace by user) | Yes, LangChain-native |

Use cases

01. RAG-first chat apps

Apps where the primary workload is retrieval-over-documents and chat memory is secondary. LlamaIndex gives you both in one framework, no integration glue.

02. Research assistants

Long conversations over academic papers or codebases. VectorMemory + summarization keeps the agent grounded in both recent turns and earlier context.

03. Agentic workflows

LlamaIndex agents (FunctionAgent, ReActAgent) that maintain state across tool-calling loops — memory modules are how the agent "remembers" what it already tried.

Pricing & licensing

LlamaIndex: MIT open source. Memory modules are part of llama-index-core — no extra install, no license cost. You pay only for LLM + embedding API calls and your chosen vector store.

Production storage: use llama-index-vector-stores-* integrations (Qdrant, PGVector, Chroma, Pinecone, Weaviate, Milvus). Memory state is stored alongside your RAG data.

Cost profile: similar to other SDK-only options. ChatSummaryMemoryBuffer adds summarization LLM calls proportional to conversation length.

FAQ

LlamaIndex Memory vs mem0 — which for my project?

LlamaIndex Memory when you want memory inside a RAG pipeline with no extra dependencies. mem0 when you need cross-session user-level facts, managed cloud, or framework independence. They can coexist — LlamaIndex for session memory, mem0 for user profile.

Does LlamaIndex Memory work with LangChain?

The modules are LlamaIndex-specific (they hook into LlamaIndex’s ChatEngine/Agent abstractions). For LangChain, use LangMem. You can share a vector DB between the two frameworks though.

Can I use LlamaIndex Memory without RAG?

Yes. A ChatMemoryBuffer or CompositeMemory can wrap any LlamaIndex ChatEngine, including ones without a backing index. But if you don’t need RAG at all, the bare LLM SDK + mem0 is a lighter stack.

How does LlamaIndex handle long conversations?

Either (a) raise ChatMemoryBuffer’s token_limit and pay the context window cost, (b) add ChatSummaryMemoryBuffer to compress history, or (c) use CompositeMemory to combine recency + summary + semantic recall. (c) is the production-grade answer.

Is LlamaIndex Memory production-ready?

Yes — it ships in llama-index-core and is widely used in production across the LlamaIndex ecosystem. The API is stable; new modules are added incrementally without breaking existing ones.
