Why Choose It
Zep’s differentiator is the session abstraction. Where mem0 thinks in facts, Zep thinks in sessions: a conversation has a beginning, a middle that gets summarized, and a tail that stays verbatim. When you fetch memory, Zep returns a summary of the old + the latest messages + relevant long-term facts — already formatted for prompt injection.
Under the hood it runs a hybrid retrieval pipeline: dense vector search for semantic similarity, BM25 for exact terms, and a small entity graph for "who’s who" resolution. That extra machinery costs ~20ms per call but noticeably improves recall on real chat corpora — especially when users reference specific project names or past entities.
Zep ships as both managed and self-hosted. The managed service (Zep Cloud) includes a web UI where you can inspect every stored memory, which turns out to be critical for debugging agent behavior. Self-host when you have data-residency needs; use Cloud when time-to-production matters more than per-token cost.
Quick Start — Python SDK
The key API is memory.get(session_id).context — it returns a single pre-formatted string containing the running summary, extracted user facts, and the last N messages. Drop it into your system prompt and the rest of your LLM code stays unchanged.
```python
# pip install zep-python openai
from zep_python.client import Zep
from openai import OpenAI

zep = Zep(api_key="z_...")  # or base_url="http://localhost:8000" for self-host
oai = OpenAI()

user_id, session_id = "william", "s_2026_04_14"
zep.user.add(user_id=user_id, email="william@example.com")
zep.memory.add_session(session_id=session_id, user_id=user_id)

def chat(message: str) -> str:
    zep.memory.add(session_id=session_id,
                   messages=[{"role": "user", "content": message}])
    # Zep returns pre-formatted context: summary + facts + recent messages
    ctx = zep.memory.get(session_id=session_id).context
    resp = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": ctx},
            {"role": "user", "content": message},
        ],
    )
    answer = resp.choices[0].message.content
    zep.memory.add(session_id=session_id,
                   messages=[{"role": "assistant", "content": answer}])
    return answer

print(chat("I'm planning a trip to Tokyo in May"))
print(chat("What should I pack for that trip?"))
```

Core Capabilities
Automatic session summarization
Zep runs a summarizer in the background as sessions grow. Once a session passes ~20 messages, older turns are collapsed into a running summary — context window never balloons.
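The pattern can be sketched in a few lines. This is illustrative, not Zep's actual implementation: the threshold, tail size, and `summarize()` helper are all assumptions for demonstration.

```python
THRESHOLD = 20  # roughly where Zep starts collapsing older turns
KEEP_TAIL = 6   # recent turns kept verbatim

def summarize(summary: str, old_turns: list) -> str:
    # Stand-in for an LLM summarization call.
    prefix = summary + " | " if summary else ""
    return prefix + f"{len(old_turns)} turns condensed"

def add_turn(state: dict, turn: str) -> None:
    state["messages"].append(turn)
    if len(state["messages"]) > THRESHOLD:
        old = state["messages"][:-KEEP_TAIL]
        state["summary"] = summarize(state["summary"], old)
        state["messages"] = state["messages"][-KEEP_TAIL:]

state = {"summary": "", "messages": []}
for i in range(25):
    add_turn(state, f"turn {i}")
# The live window stays bounded no matter how long the session runs.
```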
Hybrid search (vector + BM25 + graph)
Every memory is indexed three ways. Queries blend all three signals, which measurably outperforms vector-only retrieval on real chat data where exact term matches matter.
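Conceptually, blended ranking looks like this. The weights and scores below are made-up illustrations, not Zep's internals:

```python
# Each candidate memory carries three scores: dense-vector similarity, BM25,
# and graph proximity. The final ranking blends them; weights are assumptions.

def blend(vec: float, bm25: float, graph: float,
          weights=(0.5, 0.3, 0.2)) -> float:
    return weights[0] * vec + weights[1] * bm25 + weights[2] * graph

candidates = {
    "mem_a": blend(vec=0.91, bm25=0.10, graph=0.0),  # semantically close only
    "mem_b": blend(vec=0.55, bm25=0.95, graph=0.8),  # exact term + entity hit
}
ranked = sorted(candidates, key=candidates.get, reverse=True)
# An exact-term + entity match can outrank a purely semantic neighbor.
```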
Knowledge graph extraction
Zep’s graph service extracts entities and relationships from conversations. Ask "who did the user mention working with" and the graph returns them directly — no LLM hallucination.
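A toy illustration of why this avoids hallucination: the answer is read off stored edges rather than generated. The entity and relation names here are made up:

```python
# Relations extracted from chat are stored as (subject, relation, object)
# edges; an entity question becomes a direct traversal.

edges = [
    ("user", "works_with", "Dana"),
    ("user", "works_with", "Priya"),
    ("user", "visited", "Tokyo"),
]

def who_does_user_work_with(edges):
    return [obj for subj, rel, obj in edges
            if subj == "user" and rel == "works_with"]

coworkers = who_does_user_work_with(edges)
```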
Fact extraction with dedup
Long-term user facts are extracted automatically and deduped against prior memories. Inspect and edit them in the Zep UI if your agent remembers something wrong.
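The dedup step can be sketched as a similarity gate. The threshold, embeddings, and cosine helper are illustrative assumptions, not Zep's pipeline:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def add_fact(store, fact, emb, threshold=0.9):
    # Skip storage if any existing fact is semantically near-identical.
    for _, existing in store:
        if cosine(emb, existing) >= threshold:
            return False
    store.append((fact, emb))
    return True

store = []
add_fact(store, "User lives in Tokyo", [1.0, 0.0])
added = add_fact(store, "User is based in Tokyo", [0.98, 0.05])  # near-duplicate
```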
Low-latency SDK (10-30ms p99 managed)
The managed service sits geographically close to major LLM providers. Hot-path reads use a pre-computed session context object — single DB round trip.
Self-host option
Full stack (API, worker, Postgres, NATS) runs via docker-compose. Apache 2.0 licensed. Same SDK, just point base_url at your cluster.
Comparison
| Tool | Session Model | Summarization | Graph Support | Deployment |
|---|---|---|---|---|
| Zep (this) | First-class sessions + users | Built-in (rolling) | Yes (native entity graph) | Managed + self-host |
| mem0 | Facts only (no session concept) | No | Optional Neo4j plugin | SDK + optional platform |
| Letta | Agent state (not sessions) | Agent-driven paging | No | Self-host + cloud |
| LangMem | LangChain thread-based | Opt-in | No | SDK only |
Use Cases
01. Production customer support
Sessions map naturally to conversations. Summarization keeps month-long customer relationships queryable without ballooning token cost. Zep’s UI gives support engineers a window into what the bot "knows" about a customer.
02. Multi-agent teams sharing context
User-level facts are scoped to a user, not a session — so a handoff from sales bot to onboarding bot can share everything known about the user while keeping session histories separate.
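A minimal sketch of that scoping model (the data structures are illustrative, not Zep's schema): facts hang off the user, transcripts hang off sessions, so a new session sees the user's facts but not another session's raw history.

```python
user_facts = {"william": ["Prefers annual billing", "Team of 12"]}

sessions = {
    "sales_2026_04":      {"user": "william", "messages": ["sales transcript"]},
    "onboarding_2026_04": {"user": "william", "messages": []},
}

def build_context(session_id):
    # User-level facts are shared; session messages stay isolated.
    s = sessions[session_id]
    return {"facts": user_facts[s["user"]], "messages": s["messages"]}

ctx = build_context("onboarding_2026_04")
# The onboarding bot inherits everything known about the user,
# with an empty session history of its own.
```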
03. Analytics-heavy assistants
When agents need to answer "show me everyone who mentioned feature X", the graph layer lets you traverse entity relationships directly, not fuzz-match across 500K embeddings.
Pricing & Licensing
Zep Community Edition: Apache 2.0, self-host. Includes the full API, hybrid search, summarization, and graph. Run on your own Postgres + infra.
Zep Cloud: Free dev tier, then pay-as-you-go. Paid plans add the web UI, team management, SOC 2 reporting, and scale-out. Current pricing on getzep.com/pricing.
What you actually pay for: summarization LLM calls. Zep bills managed LLM use through their platform; self-host uses your own OpenAI/Claude key. Expect ~$0.0003 per added message turn on cheap models.
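A back-of-envelope check using the ~$0.0003/turn figure above; the volume numbers are made-up inputs, not benchmarks:

```python
cost_per_turn = 0.0003        # quoted rough cost per added message turn
turns_per_session = 30        # assumed average session length
sessions_per_month = 10_000   # assumed monthly volume

monthly = cost_per_turn * turns_per_session * sessions_per_month
print(f"${monthly:,.2f}/month")  # -> $90.00/month
```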
Related TokRepo Assets
Graphiti — Real-Time Knowledge Graphs for AI Agents
Build real-time knowledge graphs for AI agents by Zep. Temporal awareness, entity extraction, community detection, and hybrid search. Production-ready. 24K+ stars.
Zep — Long-Term Memory for AI Agents and Assistants
Production memory layer for AI assistants. Zep stores conversation history, extracts facts, builds knowledge graphs, and provides temporal-aware retrieval for LLMs.
FAQ
Zep vs mem0 — which should I pick?
Pick Zep when sessions are a first-class concept in your app (support, tutoring, booking) and you want summarization + graph + UI out of the box. Pick mem0 when you want a lighter-weight fact store and prefer to compose your own session logic.
Can Zep replace my vector database?
For memory yes — Zep stores embeddings internally. For general RAG over documents, no: keep a separate vector DB (Qdrant/Pinecone/Chroma). Zep is tuned for conversation memory, not arbitrary document corpora.
Does Zep work with local LLMs?
Yes. Self-hosted Zep supports Ollama, LiteLLM, and any OpenAI-compatible endpoint for summarization and extraction. The SDK is LLM-agnostic on the read path — it returns text/facts that you feed to whatever model you like.
How does Zep’s graph differ from Graphiti?
Zep’s graph is conversation-scoped: entities and relations mentioned in chat, extracted and updated as the session progresses. Graphiti is a temporal graph library — it tracks time-bounded validity of every edge. Use Zep for in-app memory; use Graphiti when you need to reason about "what was true when".
What’s the latency penalty of hybrid search?
Typically +10-20ms vs pure vector search on a 100K-memory corpus. Worth it for recall improvements on chat corpora (exact term matches, entity references). If you need sub-50ms p99, self-host close to your app.