AI Memory

LlamaIndex Memory — Built-in Memory for RAG Pipelines

LlamaIndex ships first-class memory modules for chat engines and agents — ChatMemoryBuffer, VectorMemory, CompositeMemory — letting you add memory to a RAG pipeline with a single constructor arg.

Official Site GitHub

Why LlamaIndex Memory

If you’re already building a RAG application with LlamaIndex, you don’t need a separate memory library. LlamaIndex ships three production-ready memory modules that compose cleanly with its ChatEngine, AgentWorker, and query pipelines.

ChatMemoryBuffer is the simplest: a ring buffer of recent messages with token-aware trimming. VectorMemory embeds messages and retrieves them by similarity — useful when conversations go long and chronological recency isn’t enough. CompositeMemory combines both, plus an optional ChatSummaryMemoryBuffer that runs LLM-based rolling summarization.

Ceiling: these modules focus on conversation memory within a single session. For user-level persistent facts across sessions, either pair LlamaIndex with mem0/Zep, or use LlamaIndex agents with a custom long-term memory tool.

Quick Start — ChatEngine + CompositeMemory

CompositeMemory combines a recency buffer (primary) with semantic recall + summarization (secondary). The ChatEngine calls .get() on the composite before every turn to assemble context. Swap the VectorMemory backend for any LlamaIndex vector store (Chroma, Qdrant, PGVector, Pinecone).

# pip install llama-index llama-index-vector-stores-chroma
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.memory import (
    ChatMemoryBuffer, VectorMemory, ChatSummaryMemoryBuffer, SimpleComposableMemory,
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)

primary = ChatMemoryBuffer.from_defaults(token_limit=3000)
secondary = [
    VectorMemory.from_defaults(embed_model=Settings.embed_model, retriever_kwargs={"similarity_top_k": 3}),
    ChatSummaryMemoryBuffer.from_defaults(token_limit=2000),
]
memory = SimpleComposableMemory.from_defaults(primary_memory=primary, secondary_memory_sources=secondary)

chat_engine = index.as_chat_engine(chat_mode="condense_plus_context", memory=memory)

print(chat_engine.chat("I'm William, building a Nuxt app. Remember that."))
print(chat_engine.chat("What framework am I using?"))

Key Features

ChatMemoryBuffer (recency)

Token-aware ring buffer of recent messages. Drop-in for any ChatEngine. Configurable token_limit, automatic trimming.

VectorMemory (semantic recall)

Embeds every message and retrieves semantically similar history. Useful when conversations span many topics and recent ≠ relevant.

ChatSummaryMemoryBuffer (summarization)

Runs LLM summarization when the buffer overflows. Summary persists; raw messages are dropped. Pair with ChatMemoryBuffer for hybrid recent + summary.

SimpleComposableMemory

Combines a primary memory with any number of secondary sources. Retrievals from all are merged into a single context block passed to the LLM.

Agent-friendly

Memory modules plug into LlamaIndex’s AgentWorker / FunctionAgent as first-class arguments. Agents keep memory state across tool calls and iterations.

Persistent state optional

Memory is per-ChatEngine instance by default. Serialize/deserialize via memory.to_dict() / from_dict() or pair with a persistent store for cross-process state.

Comparison

	Scope	Best Fit	Persistent Cross-session	Dedicated Library?
LlamaIndex Memorythis	Session + RAG	RAG pipelines with chat	Manual (serialize + reload)	No — built into LlamaIndex
mem0	User-level facts	Production chatbots	Yes (default)	Yes, standalone
Zep	Sessions + user facts	Production chat with UI	Yes	Yes, standalone service
LangMem	Thread-scoped + namespace	LangChain agents	Yes (namespace by user)	Yes, LangChain-native

Use Cases

01. RAG-first chat apps

Apps where the primary workload is retrieval-over-documents and chat memory is secondary. LlamaIndex gives you both in one framework, no integration glue.

02. Research assistants

Long conversations over academic papers or codebases. VectorMemory + summarization keeps the agent grounded in both recent turns and earlier context.

03. Agentic workflows

LlamaIndex agents (FunctionAgent, ReActAgent) that maintain state across tool-calling loops — memory modules are how the agent "remembers" what it already tried.

Pricing & License

LlamaIndex: MIT open source. Memory modules are part of llama-index-core — no extra install, no license cost. You pay only for LLM + embedding API calls and your chosen vector store.

Production storage: use llama-index-vector-stores-* integrations (Qdrant, PGVector, Chroma, Pinecone, Weaviate, Milvus). Memory state is stored alongside your RAG data.

Cost profile: similar to other SDK-only options. ChatSummaryMemoryBuffer adds summarization LLM calls proportional to conversation length.

Related Assets on TokRepo

LLaMA-Factory — Fine-Tune 100+ LLMs with a Unified Interface

LLaMA-Factory provides a web UI and CLI to fine-tune large language models including LLaMA, Mistral, Qwen, and more using LoRA, QLoRA, and full-parameter methods without writing training scripts.

LLaMA-Factory — Unified LLM Fine-Tuning Framework

LLaMA-Factory offers a web UI and CLI for fine-tuning over 100 large language models using methods like LoRA, QLoRA, and full-parameter training, with built-in evaluation and export.

Llama Index — Data Framework for LLM Applications

Leading data framework for connecting LLMs to external data. LlamaIndex handles ingestion, indexing, retrieval, and query engines for building production RAG applications.

Llama Stack — Meta Official LLM App Framework

Official Meta framework for building LLM applications with Llama models. Inference, safety, RAG, agents, evals, and tool use. Standardized APIs. 8.3K+ stars.

Frequently Asked Questions

LlamaIndex Memory vs mem0 — which for my project?+

LlamaIndex Memory when you want memory inside a RAG pipeline with no extra dependencies. mem0 when you need cross-session user-level facts, managed cloud, or framework independence. They can coexist — LlamaIndex for session memory, mem0 for user profile.

Does LlamaIndex Memory work with LangChain?+

The modules are LlamaIndex-specific (they hook into LlamaIndex’s ChatEngine/Agent abstractions). For LangChain, use LangMem. You can share a vector DB between the two frameworks though.

Can I use LlamaIndex Memory without RAG?+

Yes. A ChatMemoryBuffer or CompositeMemory can wrap any LlamaIndex ChatEngine, including ones without a backing index. But if you don’t need RAG at all, the bare LLM SDK + mem0 is a lighter stack.

How does LlamaIndex handle long conversations?+

Either (a) raise ChatMemoryBuffer’s token_limit and pay the context window cost, (b) add ChatSummaryMemoryBuffer to compress history, or (c) use CompositeMemory to combine recency + summary + semantic recall. (c) is the production-grade answer.

Is LlamaIndex Memory production-ready?+

Yes — it ships in llama-index-core and is used in production by many LlamaIndex customers. API is stable; new modules are added incrementally without breaking existing ones.

Compare Alternatives

LangMem — LangChain-Native Memory SDK mem0 — Long-term Memory for AI Agents (2026 Guide)Zep — Memory Service for LLM Apps with Built-in Summarization MemGPT — The Paper That Started Paged Agent Memory