What is GPTCache — Semantic Cache for LLM API Calls?

A caching layer for LLM queries that uses semantic similarity to return cached responses, reducing API costs and latency.

Is GPTCache — Semantic Cache for LLM API Calls free to use?

Yes. GPTCache — Semantic Cache for LLM API Calls is freely available on TokRepo. Check the Source & Thanks section on the asset page for the specific open-source license.

How do I install GPTCache — Semantic Cache for LLM API Calls?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command.

GPTCache — Semantic Cache for LLM API Calls

Introduction

GPTCache is an open-source semantic caching library that stores and retrieves LLM responses based on the meaning of queries rather than exact string matching. When a new query is semantically similar to a previously cached one, GPTCache returns the cached response instead of calling the model. A cache hit avoids the per-call API cost and, according to the project, can speed up responses by up to 100x; the overall savings depend on how often queries hit the cache.

What GPTCache Does

Caches LLM API responses and matches new queries using embedding-based similarity
Supports multiple embedding providers including OpenAI, Hugging Face, and ONNX models
Integrates with LangChain, LlamaIndex, and direct OpenAI SDK usage
Provides pluggable storage backends including SQLite, MySQL, Redis, and Milvus
Handles cache eviction with LRU and TTL strategies

Architecture Overview

GPTCache processes incoming queries through an embedding model to produce vector representations. These vectors are compared against cached query embeddings using a configurable similarity search backend such as FAISS or Milvus. If a match exceeds the similarity threshold, the cached response is returned directly. Otherwise, the query is forwarded to the LLM and the response is stored with its embedding for future matches.
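The lookup flow described above can be sketched in a few lines of standard-library Python. This is an illustration of the idea, not GPTCache's implementation: the bag-of-words `embed` function stands in for a real embedding model, the linear scan stands in for a vector index such as FAISS or Milvus, and the names `SemanticCache`, `embed`, and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, llm, threshold=0.7):
        self.llm = llm              # callable: query text -> response
        self.threshold = threshold  # minimum similarity for a cache hit
        self.entries = []           # list of (embedding, response) pairs

    def query(self, text):
        """Return (response, was_cache_hit)."""
        vec = embed(text)
        # Linear scan; a real deployment would use a vector index here.
        best_score, best_response = 0.0, None
        for cached_vec, response in self.entries:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            return best_response, True        # similar enough: serve from cache
        response = self.llm(text)             # miss: forward to the LLM
        self.entries.append((vec, response))  # store with its embedding
        return response, False
```

With the default threshold, asking "What is the capital of France?" and then "Tell me what the capital of France is" triggers one model call and one cache hit, while an unrelated question goes to the model again.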

Self-Hosting & Configuration

Install via pip with optional extras for specific backends
Configure the embedding model, vector store, and similarity threshold in the init call
Use SQLite for lightweight local caching or Redis and MySQL for production deployments
Set cache eviction policies based on time-to-live or maximum cache size
Deploy as a standalone service or embed directly in application code
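As a rough sketch of the configuration steps above, the function below follows the initialization pattern shown in the GPTCache README (ONNX embeddings, SQLite for responses, FAISS for vectors). The wrapper name `init_semantic_cache` is hypothetical, the imports are deferred so the module loads even where the library is not installed, and the parameter names should be checked against the GPTCache version in use.

```python
def init_semantic_cache(similarity_threshold=0.8):
    """Initialise GPTCache's global cache with ONNX embeddings, SQLite and FAISS."""
    # Deferred imports: requires `pip install gptcache` plus the packages
    # for the chosen backends (for example onnxruntime and faiss-cpu).
    from gptcache import Config, cache
    from gptcache.embedding import Onnx
    from gptcache.manager import CacheBase, VectorBase, get_data_manager
    from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

    onnx = Onnx()  # local embedding model; weights are fetched on first use
    data_manager = get_data_manager(
        CacheBase("sqlite"),                            # stores cached responses
        VectorBase("faiss", dimension=onnx.dimension),  # indexes query embeddings
    )
    cache.init(
        embedding_func=onnx.to_embeddings,
        data_manager=data_manager,
        similarity_evaluation=SearchDistanceEvaluation(),
        config=Config(similarity_threshold=similarity_threshold),
    )
    return cache
```

Swapping `CacheBase("sqlite")` or `VectorBase("faiss", ...)` for another supported backend is how the storage layer would be changed for a production deployment.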

Key Features

Reduces LLM API costs in proportion to the cache hit rate, since every hit replaces a paid API call
Semantic matching catches paraphrased queries that exact-match caching would miss
Modular architecture allows swapping embedding models and storage backends independently
Pre-built adapters for OpenAI, LangChain, and LlamaIndex with minimal code changes
Supports both synchronous and asynchronous operation modes
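The relationship between hit rate and cost can be made concrete with a short calculation. The helper below is a hypothetical estimate, not part of GPTCache, and it ignores the cost of computing embeddings and running the cache itself.

```python
def estimated_llm_cost(requests, cost_per_call, hit_rate):
    """Expected API spend when a fraction `hit_rate` of requests is served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Only cache misses reach the paid API.
    return requests * (1.0 - hit_rate) * cost_per_call
```

For example, 10,000 requests at an assumed 0.002 per call cost about 20 without a cache and about 12 at a 40% hit rate.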

Comparison with Similar Tools

Redis — Key-value cache requiring exact matches; GPTCache adds semantic similarity matching
LangChain Cache — Basic caching built into LangChain; GPTCache offers more backends and similarity strategies
Portkey Cache — Cloud-hosted caching service; GPTCache is fully self-hosted and open source
Prompt caching (Anthropic/OpenAI) — Provider-side prefix caching; GPTCache works at the application layer across providers

FAQ

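Q: How accurate is semantic matching? A: Accuracy depends on the embedding model and the similarity threshold. A higher threshold reduces false positives (a cached answer returned for a query that only looks similar) but also lowers the hit rate, so the threshold is best tuned against a sample of real queries.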

Q: Does GPTCache work with streaming responses? A: Yes, it supports caching and replaying streamed LLM responses.
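One way to picture caching a streamed response is to record chunks while passing them through, then replay the recorded chunks on later requests. The generator below is a minimal sketch of that idea with a plain dictionary and exact keys; it is not GPTCache's adapter code, and `stream_with_cache` and `open_stream` are hypothetical names.

```python
def stream_with_cache(key, cache, open_stream):
    """Yield response chunks for `key`, replaying from `cache` when available."""
    if key in cache:
        yield from cache[key]       # replay stored chunks in original order
        return
    chunks = []
    for chunk in open_stream(key):  # live stream from the model
        chunks.append(chunk)
        yield chunk
    cache[key] = chunks             # store only once the stream has completed
```

Storing the chunks only after the stream finishes means a response that is interrupted midway is never cached in a partial state.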

Q: What happens when the cache is full? A: Configurable eviction policies remove entries automatically: LRU drops the least recently used entries once the size limit is reached, and TTL drops entries older than a set age.
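The two eviction policies can be combined in a small standard-library class. This is a generic sketch of LRU plus TTL eviction, not GPTCache's eviction manager; the class name `LruTtlCache` and its parameters are hypothetical.

```python
import time
from collections import OrderedDict


class LruTtlCache:
    def __init__(self, max_size, ttl_seconds, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._data = OrderedDict()  # key -> (expires_at, value), least recent first

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self.clock() >= expires_at:
            del self._data[key]      # TTL: entry has expired
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # LRU: drop least recently used
```

Reading an entry refreshes its position in the LRU order but not its expiry time, so a frequently used answer still ages out after the TTL.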

Q: Can I use GPTCache with local LLMs? A: Yes, it works with any LLM that provides a compatible API, including locally hosted models.
