LlamaIndex Memory — Built-in Memory for RAG Pipelines
LlamaIndex ships first-class memory modules for chat engines and agents — ChatMemoryBuffer, VectorMemory, CompositeMemory — letting you add memory to a RAG pipeline with a single constructor arg.
Why LlamaIndex Memory
If you’re already building a RAG application with LlamaIndex, you don’t need a separate memory library. LlamaIndex ships three production-ready memory modules that compose cleanly with its ChatEngine, AgentWorker, and query pipelines.
ChatMemoryBuffer is the simplest: a ring buffer of recent messages with token-aware trimming. VectorMemory embeds messages and retrieves them by similarity — useful when conversations go long and chronological recency isn’t enough. CompositeMemory combines both, plus an optional ChatSummaryMemoryBuffer that runs LLM-based rolling summarization.
Ceiling: these modules focus on conversation memory within a single session. For user-level persistent facts across sessions, either pair LlamaIndex with mem0/Zep, or use LlamaIndex agents with a custom long-term memory tool.
Quick Start — ChatEngine + CompositeMemory
CompositeMemory combines a recency buffer (primary) with semantic recall + summarization (secondary). The ChatEngine calls .get() on the composite before every turn to assemble context. Swap the VectorMemory backend for any LlamaIndex vector store (Chroma, Qdrant, PGVector, Pinecone).
# pip install llama-index llama-index-vector-stores-chroma
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.memory import (
ChatMemoryBuffer, VectorMemory, ChatSummaryMemoryBuffer, SimpleComposableMemory,
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)
primary = ChatMemoryBuffer.from_defaults(token_limit=3000)
secondary = [
VectorMemory.from_defaults(embed_model=Settings.embed_model, retriever_kwargs={"similarity_top_k": 3}),
ChatSummaryMemoryBuffer.from_defaults(token_limit=2000),
]
memory = SimpleComposableMemory.from_defaults(primary_memory=primary, secondary_memory_sources=secondary)
chat_engine = index.as_chat_engine(chat_mode="condense_plus_context", memory=memory)
print(chat_engine.chat("I'm William, building a Nuxt app. Remember that."))
print(chat_engine.chat("What framework am I using?"))Key Features
ChatMemoryBuffer (recency)
Token-aware ring buffer of recent messages. Drop-in for any ChatEngine. Configurable token_limit, automatic trimming.
VectorMemory (semantic recall)
Embeds every message and retrieves semantically similar history. Useful when conversations span many topics and recent ≠ relevant.
ChatSummaryMemoryBuffer (summarization)
Runs LLM summarization when the buffer overflows. Summary persists; raw messages are dropped. Pair with ChatMemoryBuffer for hybrid recent + summary.
SimpleComposableMemory
Combines a primary memory with any number of secondary sources. Retrievals from all are merged into a single context block passed to the LLM.
Agent-friendly
Memory modules plug into LlamaIndex’s AgentWorker / FunctionAgent as first-class arguments. Agents keep memory state across tool calls and iterations.
Persistent state optional
Memory is per-ChatEngine instance by default. Serialize/deserialize via memory.to_dict() / from_dict() or pair with a persistent store for cross-process state.
Comparison
| Scope | Best Fit | Persistent Cross-session | Dedicated Library? | |
|---|---|---|---|---|
| LlamaIndex Memorythis | Session + RAG | RAG pipelines with chat | Manual (serialize + reload) | No — built into LlamaIndex |
| mem0 | User-level facts | Production chatbots | Yes (default) | Yes, standalone |
| Zep | Sessions + user facts | Production chat with UI | Yes | Yes, standalone service |
| LangMem | Thread-scoped + namespace | LangChain agents | Yes (namespace by user) | Yes, LangChain-native |
Use Cases
01. RAG-first chat apps
Apps where the primary workload is retrieval-over-documents and chat memory is secondary. LlamaIndex gives you both in one framework, no integration glue.
02. Research assistants
Long conversations over academic papers or codebases. VectorMemory + summarization keeps the agent grounded in both recent turns and earlier context.
03. Agentic workflows
LlamaIndex agents (FunctionAgent, ReActAgent) that maintain state across tool-calling loops — memory modules are how the agent "remembers" what it already tried.
Pricing & License
LlamaIndex: MIT open source. Memory modules are part of llama-index-core — no extra install, no license cost. You pay only for LLM + embedding API calls and your chosen vector store.
Production storage: use llama-index-vector-stores-* integrations (Qdrant, PGVector, Chroma, Pinecone, Weaviate, Milvus). Memory state is stored alongside your RAG data.
Cost profile: similar to other SDK-only options. ChatSummaryMemoryBuffer adds summarization LLM calls proportional to conversation length.
Related Assets on TokRepo
LLaMA-Factory — Fine-Tune 100+ LLMs with a Unified Interface
LLaMA-Factory provides a web UI and CLI to fine-tune large language models including LLaMA, Mistral, Qwen, and more using LoRA, QLoRA, and full-parameter methods without writing training scripts.
LLaMA-Factory — Unified LLM Fine-Tuning Framework
LLaMA-Factory offers a web UI and CLI for fine-tuning over 100 large language models using methods like LoRA, QLoRA, and full-parameter training, with built-in evaluation and export.
Llama Index — Data Framework for LLM Applications
Leading data framework for connecting LLMs to external data. LlamaIndex handles ingestion, indexing, retrieval, and query engines for building production RAG applications.
Llama Stack — Meta Official LLM App Framework
Official Meta framework for building LLM applications with Llama models. Inference, safety, RAG, agents, evals, and tool use. Standardized APIs. 8.3K+ stars.
Frequently Asked Questions
LlamaIndex Memory vs mem0 — which for my project?+
LlamaIndex Memory when you want memory inside a RAG pipeline with no extra dependencies. mem0 when you need cross-session user-level facts, managed cloud, or framework independence. They can coexist — LlamaIndex for session memory, mem0 for user profile.
Does LlamaIndex Memory work with LangChain?+
The modules are LlamaIndex-specific (they hook into LlamaIndex’s ChatEngine/Agent abstractions). For LangChain, use LangMem. You can share a vector DB between the two frameworks though.
Can I use LlamaIndex Memory without RAG?+
Yes. A ChatMemoryBuffer or CompositeMemory can wrap any LlamaIndex ChatEngine, including ones without a backing index. But if you don’t need RAG at all, the bare LLM SDK + mem0 is a lighter stack.
How does LlamaIndex handle long conversations?+
Either (a) raise ChatMemoryBuffer’s token_limit and pay the context window cost, (b) add ChatSummaryMemoryBuffer to compress history, or (c) use CompositeMemory to combine recency + summary + semantic recall. (c) is the production-grade answer.
Is LlamaIndex Memory production-ready?+
Yes — it ships in llama-index-core and is used in production by many LlamaIndex customers. API is stable; new modules are added incrementally without breaking existing ones.