Zep — Memory Service for LLM Apps with Built-in Summarization

Zep is a dedicated memory service for production LLM apps. It stores sessions, summarizes long histories, extracts facts, and retrieves them with hybrid vector + keyword + graph search.

Why Zep

Zep’s differentiator is the session abstraction. Where mem0 thinks in facts, Zep thinks in sessions: a conversation has a beginning, a middle that gets summarized, and a tail that stays verbatim. When you fetch memory, Zep returns a summary of the old + the latest messages + relevant long-term facts — already formatted for prompt injection.

Under the hood it runs a hybrid retrieval pipeline: dense vector search for semantic similarity, BM25 for exact terms, and a small entity graph for "who’s who" resolution. That extra machinery costs ~20ms per call but noticeably improves recall on real chat corpora — especially when users reference specific project names or past entities.
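Zep's internal ranking is not public, but the general pattern of blending several retrieval signals can be sketched with reciprocal rank fusion (RRF), a common technique for merging ranked lists. The memory IDs and rankings below are made up for illustration:

```python
# Illustrative only — not Zep's actual fusion logic. RRF blends ranked lists
# from several retrievers; k dampens the bonus for top-ranked items.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of memory IDs into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["m3", "m1", "m7"]   # dense-embedding neighbours
bm25_hits   = ["m1", "m9", "m3"]   # exact-term matches (e.g. a project name)
graph_hits  = ["m1", "m4"]         # memories linked to a resolved entity
print(rrf_fuse([vector_hits, bm25_hits, graph_hits]))  # "m1" wins: present in all three
```

A memory that appears in all three lists (here `m1`) outranks one that scores highly on a single signal, which is exactly why hybrid retrieval helps when users drop exact project names into otherwise fuzzy queries.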

Zep ships as both managed and self-hosted. The managed service (Zep Cloud) includes a web UI where you can inspect every stored memory, which turns out to be critical for debugging agent behavior. Self-host when you have data-residency needs; use Cloud when time-to-production matters more than per-token cost.

Quick Start — Python SDK

The key API is memory.get(session_id).context — it returns a single pre-formatted string containing the running summary, extracted user facts, and the last N messages. Drop it into your system prompt and the rest of your LLM code stays unchanged.

# pip install zep-python openai
from zep_python.client import Zep
from openai import OpenAI

zep = Zep(api_key="z_...")  # or base_url="http://localhost:8000" for self-host
oai = OpenAI()

user_id, session_id = "william", "s_2026_04_14"
zep.user.add(user_id=user_id, email="william@example.com")
zep.memory.add_session(session_id=session_id, user_id=user_id)

def chat(message: str) -> str:
    zep.memory.add(session_id=session_id,
                   messages=[{"role": "user", "content": message}])

    # Zep returns pre-formatted context: summary + facts + recent messages
    ctx = zep.memory.get(session_id=session_id).context

    resp = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": ctx},
            {"role": "user", "content": message},
        ],
    )
    answer = resp.choices[0].message.content

    zep.memory.add(session_id=session_id,
                   messages=[{"role": "assistant", "content": answer}])
    return answer

print(chat("I'm planning a trip to Tokyo in May"))
print(chat("What should I pack for that trip?"))

Key Features

Automatic session summarization

Zep runs a summarizer in the background as sessions grow. Once a session passes ~20 messages, older turns are collapsed into a running summary, so the context window never balloons.
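The rolling-summary pattern Zep applies server-side can be sketched as follows. The ~20-message threshold comes from the text above; the number of verbatim turns kept and the `summarize` stand-in (a placeholder for an LLM call) are assumptions:

```python
# Sketch of rolling summarization — not Zep's implementation.
SUMMARY_THRESHOLD = 20  # from the doc: "once a session passes ~20 messages"
KEEP_VERBATIM = 6       # assumption: recent turns that stay word-for-word

def summarize(prior_summary: str, old_turns: list[str]) -> str:
    # Placeholder: a real system calls an LLM to fold old turns into the summary.
    return f"{prior_summary} [+{len(old_turns)} turns summarized]".strip()

def compact(summary: str, turns: list[str]) -> tuple[str, list[str]]:
    """Collapse older turns into the running summary once the session grows."""
    if len(turns) <= SUMMARY_THRESHOLD:
        return summary, turns
    old, recent = turns[:-KEEP_VERBATIM], turns[-KEEP_VERBATIM:]
    return summarize(summary, old), recent

summary, turns = compact("", [f"turn {i}" for i in range(25)])
print(summary)      # running summary covering the older turns
print(len(turns))   # 6 — recent turns kept verbatim
```

The point of the pattern: prompt size is bounded by `KEEP_VERBATIM` plus one summary, regardless of how long the session runs.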

Hybrid search (vector + BM25 + graph)

Every memory is indexed three ways. Queries blend all three signals, which measurably outperforms vector-only retrieval on real chat data where exact term matches matter.

Knowledge graph extraction

Zep’s graph service extracts entities and relationships from conversations. Ask "who did the user mention working with" and the graph returns them directly — no LLM hallucination.

Fact extraction with dedup

Long-term user facts are extracted automatically and deduped against prior memories. Inspect and edit them in the Zep UI if your agent remembers something wrong.
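One common way to dedupe extracted facts — not necessarily Zep's — is to reject a new fact whose embedding is near-identical to one already stored. The toy 3-d vectors and the 0.95 threshold below are illustrative; a real system would use a sentence-embedding model:

```python
# Illustrative fact dedup via cosine similarity — not Zep's implementation.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def add_fact(store, fact, embedding, threshold=0.95):
    """Insert a fact only if no stored fact is near-identical."""
    for _, existing_emb in store:
        if cosine(embedding, existing_emb) >= threshold:
            return False  # duplicate: keep the existing fact
    store.append((fact, embedding))
    return True

store = []
add_fact(store, "User lives in Tokyo", [1.0, 0.0, 0.1])
added = add_fact(store, "User is based in Tokyo", [0.99, 0.01, 0.11])  # near-duplicate
print(added, len(store))  # the rephrased fact is rejected, store keeps one entry
```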

Low-latency SDK (10-30ms p99 managed)

The managed service sits geographically close to major LLM providers. Hot-path reads use a pre-computed session context object — single DB round trip.

Self-host option

Full stack (API, worker, Postgres, NATS) runs via docker-compose. Apache 2.0 licensed. Same SDK, just point base_url at your cluster.

Comparison

|  | Session Model | Summarization | Graph Support | Deployment |
|---|---|---|---|---|
| Zep | First-class sessions + users | Built-in (rolling) | Yes — native entity graph | Managed + self-host |
| mem0 | Facts only (no session concept) | No | Optional Neo4j plugin | SDK + optional platform |
| Letta | Agent state (not sessions) | Agent-driven paging | No | Self-host + cloud |
| LangMem | LangChain thread-based | Opt-in | No | SDK only |

Use Cases

01. Production customer support

Sessions map naturally to conversations. Summarization keeps month-long customer relationships queryable without ballooning token cost. Zep’s UI gives support engineers a window into what the bot "knows" about a customer.

02. Multi-agent teams sharing context

User-level facts are scoped to a user, not a session — so a handoff from sales bot to onboarding bot can share everything known about the user while keeping session histories separate.

03. Analytics-heavy assistants

When agents need to answer "show me everyone who mentioned feature X", the graph layer lets you traverse entity relationships directly, not fuzz-match across 500K embeddings.
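The difference is easy to see in miniature: with explicit edges, "everyone who mentioned feature X" is a direct filter rather than a fuzzy similarity search. The edge data below is hypothetical and the triple store is a plain list, standing in for Zep's graph service:

```python
# Why a graph query beats fuzzy matching: the answer is an exact traversal.
edges = [
    ("alice", "MENTIONS",   "feature_x"),
    ("bob",   "MENTIONS",   "feature_y"),
    ("carol", "MENTIONS",   "feature_x"),
    ("alice", "WORKS_WITH", "bob"),
]

def who(relation: str, target: str) -> list[str]:
    """All subjects with a `relation` edge into `target`."""
    return [s for s, r, o in edges if r == relation and o == target]

print(who("MENTIONS", "feature_x"))  # ['alice', 'carol']
```

No embedding is consulted and no threshold is tuned; the precision comes from extraction time, when the entities and relations were written down.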

Pricing & License

Zep Community Edition: Apache 2.0, self-host. Includes the full API, hybrid search, summarization, and graph. Run on your own Postgres + infra.

Zep Cloud: Free dev tier, then pay-as-you-go. Paid plans add the web UI, team management, SOC 2 reporting, and scale-out. Current pricing on getzep.com/pricing.

What you actually pay for: summarization LLM calls. Zep bills managed LLM use through their platform; self-host uses your own OpenAI/Claude key. Expect ~$0.0003 per added message turn on cheap models.
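The ~$0.0003 figure is easy to sanity-check. The token counts per background summarization call below are assumptions, and the prices are gpt-4o-mini list prices per million tokens at the time of writing (they may change):

```python
# Back-of-envelope check of "~$0.0003 per added message turn".
PRICE_IN, PRICE_OUT = 0.15 / 1e6, 0.60 / 1e6  # USD per token, gpt-4o-mini list prices
prompt_tokens, summary_tokens = 1200, 150      # assumed sizes per summarization call

cost_per_call = prompt_tokens * PRICE_IN + summary_tokens * PRICE_OUT
print(f"${cost_per_call:.5f} per summarization call")  # lands near the quoted figure
```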

Frequently Asked Questions

Zep vs mem0 — which should I pick?

Pick Zep when sessions are a first-class concept in your app (support, tutoring, booking) and you want summarization + graph + UI out of the box. Pick mem0 when you want a lighter-weight fact store and prefer to compose your own session logic.

Can Zep replace my vector database?

For memory yes — Zep stores embeddings internally. For general RAG over documents, no: keep a separate vector DB (Qdrant/Pinecone/Chroma). Zep is tuned for conversation memory, not arbitrary document corpora.

Does Zep work with local LLMs?

Yes. Self-hosted Zep supports Ollama, LiteLLM, and any OpenAI-compatible endpoint for summarization and extraction. The SDK is LLM-agnostic on the read path — it returns text/facts that you feed to whatever model you like.

How does Zep’s graph differ from Graphiti?

Zep’s graph is conversation-scoped: entities and relations mentioned in chat, extracted and updated as the session progresses. Graphiti is a temporal graph library — it tracks time-bounded validity of every edge. Use Zep for in-app memory; use Graphiti when you need to reason about "what was true when".

What’s the latency penalty of hybrid search?

Typically +10-20ms vs pure vector search on a 100K-memory corpus. Worth it for recall improvements on chat corpora (exact term matches, entity references). If you need sub-50ms p99, self-host close to your app.
