Introduction
GPTCache is an open-source semantic caching library that stores and retrieves LLM responses based on the meaning of queries rather than exact string matching. When a new query is semantically similar to a previously cached one, GPTCache returns the cached response instead of calling the LLM. The project's README advertises up to 10x lower API costs and up to 100x faster responses; the actual gains depend on how often queries hit the cache.
What GPTCache Does
- Caches LLM API responses and matches new queries using embedding-based similarity
- Supports multiple embedding providers including OpenAI, Hugging Face, and ONNX models
- Integrates with LangChain, LlamaIndex, and direct OpenAI SDK usage
- Provides pluggable storage backends including SQLite, MySQL, Redis, and Milvus
- Handles cache eviction with LRU and TTL strategies
Architecture Overview
GPTCache processes incoming queries through an embedding model to produce vector representations. These vectors are compared against cached query embeddings using a configurable similarity search backend such as FAISS or Milvus. If a match exceeds the similarity threshold, the cached response is returned directly. Otherwise, the query is forwarded to the LLM and the response is stored with its embedding for future matches.
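The lookup flow described above can be sketched in a few lines of plain Python. This is an illustration of the technique only, not GPTCache's implementation or API: the `SemanticCache` class, the `embed` and `cosine` helpers, and the 0.8 threshold are all hypothetical, and the bag-of-words "embedding" is a crude stand-in for a real embedding model (it matches queries that share words, not true paraphrases).

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical cache: embed, search, compare to threshold, else call the LLM."""

    def __init__(self, llm, threshold=0.8):
        self.llm = llm              # callable: query text -> response text
        self.threshold = threshold  # minimum similarity that counts as a hit
        self.entries = []           # list of (embedding, response) pairs

    def query(self, text):
        vector = embed(text)
        # Linear scan over cached embeddings; a vector index such as FAISS
        # or Milvus would replace this loop in a real deployment.
        best_score, best_response = 0.0, None
        for cached_vector, response in self.entries:
            score = cosine(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            return best_response, True       # cache hit
        response = self.llm(text)            # cache miss: forward to the LLM
        self.entries.append((vector, response))
        return response, False
```

Calling `query` twice with the same question in different casing or punctuation returns the stored response on the second call without invoking `llm` again, while an unrelated question falls through to the model and is cached for later.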
Self-Hosting & Configuration
- Install via pip with optional extras for specific backends
- Configure the embedding model, vector store, and similarity threshold in the init call
- Use SQLite for lightweight local caching, or Redis or MySQL for production deployments
- Set cache eviction policies based on time-to-live or maximum cache size
- Deploy as a standalone service or embed directly in application code
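To make the eviction options in the list above concrete, here is a minimal sketch of a store that combines a maximum size (evicting the least recently used entry) with a time-to-live. The `EvictingCache` class and its parameters are hypothetical and written for illustration; GPTCache's own eviction is configured through its data manager rather than a class like this.

```python
import time
from collections import OrderedDict


class EvictingCache:
    """Hypothetical store combining a max-size LRU policy with a per-entry TTL."""

    def __init__(self, max_size=1000, ttl_seconds=3600.0, clock=time.monotonic):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self.clock = clock            # injectable so expiry can be tested
        self._data = OrderedDict()    # key -> (stored_at, value), least recent first

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self.clock() - stored_at > self.ttl_seconds:
            del self._data[key]       # TTL: the entry has expired
            return None
        self._data.move_to_end(key)   # LRU: mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (self.clock(), value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry

    def __len__(self):
        return len(self._data)
```

The two policies address different risks: the size limit bounds memory and storage, while the TTL keeps stale answers from being served indefinitely.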
Key Features
- Reduces LLM API costs in proportion to the cache hit rate, since every hit avoids a paid API call
- Semantic matching catches paraphrased queries that exact-match caching would miss
- Modular architecture allows swapping embedding models and storage backends independently
- Pre-built adapters for OpenAI, LangChain, and LlamaIndex with minimal code changes
- Supports both synchronous and asynchronous operation modes
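The relationship between hit rate and cost in the first feature can be written out as simple arithmetic. The function below is a back-of-the-envelope sketch; its name, parameters, and the optional `lookup_cost` term (for embedding and vector search overhead per query) are assumptions made for illustration.

```python
def expected_llm_cost(num_queries, cost_per_call, hit_rate, lookup_cost=0.0):
    """Estimate total spend when a fraction `hit_rate` of queries is served from cache.

    Only cache misses pay `cost_per_call`; every query pays `lookup_cost`.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    misses = num_queries * (1.0 - hit_rate)
    return misses * cost_per_call + num_queries * lookup_cost
```

With `lookup_cost` left at zero, a 40% hit rate removes about 40% of the API spend. A non-zero lookup cost reduces the savings, which matters when the embedding model is itself a paid API.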
Comparison with Similar Tools
- Redis — Key-value cache requiring exact matches; GPTCache adds semantic similarity matching
- LangChain Cache — Basic caching built into LangChain; GPTCache offers more backends and similarity strategies
- Portkey Cache — Cloud-hosted caching service; GPTCache is fully self-hosted and open source
- Prompt caching (Anthropic/OpenAI) — Provider-side prefix caching; GPTCache works at the application layer across providers
FAQ
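Q: How accurate is semantic matching? A: Accuracy depends on the embedding model and the similarity threshold. A stricter threshold yields fewer false positives (wrong cached answers returned for a different question) at the cost of a lower hit rate, so the threshold should be tuned against representative queries.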
Q: Does GPTCache work with streaming responses? A: Yes, it supports caching and replaying streamed LLM responses.
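Q: What happens when the cache is full? A: The configured eviction policy removes entries automatically: LRU evicts the least recently used entries once the size limit is reached, and TTL expires entries older than the configured lifetime.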
Q: Can I use GPTCache with local LLMs? A: Yes, it works with any LLM that provides a compatible API, including locally hosted models.
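On the accuracy question above, one way to choose a similarity threshold is to label a sample of candidate matches by hand and measure hit rate and precision at several cutoffs. The sketch below assumes a convention where a higher score means more similar; `evaluate_threshold` and the `SAMPLE` data are hypothetical and invented purely for illustration, not measurements from GPTCache.

```python
def evaluate_threshold(scored_pairs, threshold):
    """Summarize cache behavior at one threshold.

    scored_pairs: iterable of (similarity, is_correct_match) tuples, where
    is_correct_match says whether the cached answer was right for the query.
    """
    pairs = list(scored_pairs)
    hits = [correct for score, correct in pairs if score >= threshold]
    hit_rate = len(hits) / len(pairs) if pairs else 0.0
    # Precision is undefined when nothing clears the threshold.
    precision = sum(hits) / len(hits) if hits else None
    return {"threshold": threshold, "hit_rate": hit_rate, "precision": precision}


# Made-up labeled sample: (similarity score, was the cached answer correct?)
SAMPLE = [(0.95, True), (0.90, True), (0.85, False), (0.80, True), (0.60, False)]

if __name__ == "__main__":
    for cutoff in (0.8, 0.9):
        print(evaluate_threshold(SAMPLE, cutoff))
```

On this toy sample, raising the cutoff from 0.8 to 0.9 trades hit rate for precision, which is the tension a real tuning run would need to balance.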