# GPTCache: Semantic Cache for LLM API Calls

> A caching layer for LLM queries that uses semantic similarity to return cached responses, reducing API costs and latency.

## Quick Use

```bash
pip install gptcache
```

```python
from gptcache import cache
from gptcache.adapter import openai  # drop-in wrapper around the OpenAI SDK

# With no arguments, init() sets up exact-match caching. Pass an embedding
# function, data manager, and similarity evaluator to enable semantic matching.
cache.init()
cache.set_openai_key()  # reads the OPENAI_API_KEY environment variable

# Repeating this query returns the cached response instead of calling the API
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is Python?"}],
)
```

## Introduction

GPTCache is an open-source semantic caching library that stores and retrieves LLM responses based on the meaning of queries rather than exact string matching. When a new query is semantically similar to a previously cached one, GPTCache returns the cached response instead of calling the LLM, which can cut API costs and response latency by up to 100x for cached queries.

## What GPTCache Does

- Caches LLM API responses and matches new queries using embedding-based similarity
- Supports multiple embedding providers including OpenAI, Hugging Face, and ONNX models
- Integrates with LangChain, LlamaIndex, and direct OpenAI SDK usage
- Provides pluggable storage backends including SQLite, MySQL, Redis, and Milvus
- Handles cache eviction with LRU and TTL strategies

## Architecture Overview

GPTCache passes each incoming query through an embedding model to produce a vector representation. That vector is compared against the embeddings of cached queries using a configurable similarity search backend such as FAISS or Milvus. If the best match exceeds the similarity threshold, the cached response is returned directly. Otherwise, the query is forwarded to the LLM and the response is stored with its embedding for future matches.
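
The lookup flow described above can be illustrated with a minimal, library-free sketch. This is not GPTCache's implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector index such as FAISS, and the names `SemanticCache`, `embed`, and `cosine` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding (assumption): lowercase bag-of-words counts."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical cache: embed, search, threshold check, fall back to LLM."""

    def __init__(self, llm, threshold=0.8):
        self.llm = llm              # callable taking a prompt, returning a response
        self.threshold = threshold  # minimum similarity to count as a hit
        self.entries = []           # list of (embedding, response) pairs

    def query(self, prompt):
        vec = embed(prompt)
        best_score, best_response = 0.0, None
        # Linear scan; a real deployment would use a vector index instead.
        for cached_vec, response in self.entries:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            return best_response, True   # cache hit
        response = self.llm(prompt)      # cache miss: call the LLM
        self.entries.append((vec, response))
        return response, False
```

Raising `threshold` makes the cache stricter (fewer false hits, fewer savings); lowering it does the opposite. `query` returns a `(response, hit)` pair so callers can track the hit rate.
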
## Self-Hosting & Configuration

- Install via pip with optional extras for specific backends
- Configure the embedding model, vector store, and similarity threshold in the init call
- Use SQLite for lightweight local caching or Redis and MySQL for production deployments
- Set cache eviction policies based on time-to-live or maximum cache size
- Deploy as a standalone service or embed directly in application code

## Key Features

- Reduces LLM API costs in proportion to the cache hit rate
- Semantic matching catches paraphrased queries that exact-match caching would miss
- Modular architecture allows swapping embedding models and storage backends independently
- Pre-built adapters for OpenAI, LangChain, and LlamaIndex with minimal code changes
- Supports both synchronous and asynchronous operation modes

## Comparison with Similar Tools

- **Redis**: Key-value cache requiring exact matches; GPTCache adds semantic similarity matching
- **LangChain Cache**: Basic caching built into LangChain; GPTCache offers more backends and similarity strategies
- **Portkey Cache**: Cloud-hosted caching service; GPTCache is fully self-hosted and open source
- **Prompt caching (Anthropic/OpenAI)**: Provider-side prefix caching; GPTCache works at the application layer across providers

## FAQ

**Q: How accurate is semantic matching?**
A: Accuracy depends on the embedding model and the similarity threshold. A higher threshold reduces false positives at the cost of fewer cache hits, so the threshold should be tuned against representative queries.

**Q: Does GPTCache work with streaming responses?**
A: Yes, it supports caching and replaying streamed LLM responses.

**Q: What happens when the cache is full?**
A: Configurable eviction policies (LRU, TTL) automatically remove the least recently used or expired entries.

**Q: Can I use GPTCache with local LLMs?**
A: Yes, it works with any LLM that provides a compatible API, including locally hosted models.
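
To make the eviction answer above concrete, here is a small sketch of how LRU and TTL policies can be combined. It is an illustration under stated assumptions, not GPTCache's own eviction code: the class name `EvictingCache` and its parameters are hypothetical, and the clock is injectable only so the behaviour can be tested without waiting.

```python
import time
from collections import OrderedDict


class EvictingCache:
    """Hypothetical cache combining a size limit (LRU) with expiry (TTL)."""

    def __init__(self, max_size=1000, ttl_seconds=3600.0, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.clock = clock
        # key -> (stored_at, value); least recently used entries come first
        self._data = OrderedDict()

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self.clock() - stored_at > self.ttl:
            del self._data[key]          # TTL: entry is too old, drop it
            return None
        self._data.move_to_end(key)      # LRU: mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (self.clock(), value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # LRU: evict least recently used

    def __len__(self):
        return len(self._data)
```

In this sketch expired entries are removed lazily, when they are next read. A production cache would typically also purge them in the background so that stale entries do not occupy space.
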
## Sources

- https://github.com/zilliztech/GPTCache
- https://gptcache.readthedocs.io/

---

Source: https://tokrepo.com/en/workflows/asset-27d8d746

Author: AI Open Source