# GPTCache — Semantic Cache for LLM API Calls

> A caching layer for LLM queries that uses semantic similarity to return cached responses, reducing API costs and latency.

## Quick Use
```bash
pip install gptcache
```
```python
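from gptcache import cache
from gptcache.adapter import openai  # drop-in wrapper mirroring the legacy (pre-1.0) openai SDK interface

cache.init()            # default settings: exact-match caching only
cache.set_openai_key()  # reads the OPENAI_API_KEY environment variable

# Repeating the identical prompt is served from the cache. Matching *similar*
# prompts requires passing an embedding function and a vector store to cache.init().
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is Python?"}],
)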
```

## Introduction
GPTCache is an open-source semantic caching library that stores and retrieves LLM responses based on the meaning of queries rather than exact string matching. When a new query is semantically similar to a previously cached one, GPTCache returns the cached response instead of calling the LLM. The project advertises up to 10x lower API costs and up to 100x faster responses; actual savings depend on the cache hit rate of your workload.

## What GPTCache Does
- Caches LLM API responses and matches new queries using embedding-based similarity
- Supports multiple embedding providers including OpenAI, Hugging Face, and ONNX models
- Integrates with LangChain, LlamaIndex, and direct OpenAI SDK usage
- Provides pluggable storage backends including SQLite, MySQL, Redis, and Milvus
- Handles cache eviction with LRU and TTL strategies

## Architecture Overview
GPTCache processes incoming queries through an embedding model to produce vector representations. These vectors are compared against cached query embeddings using a configurable similarity search backend such as FAISS or Milvus. If a match exceeds the similarity threshold, the cached response is returned directly. Otherwise, the query is forwarded to the LLM and the response is stored with its embedding for future matches.
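
The lookup flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not GPTCache's implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector index such as FAISS, and the names `SemanticCache`, `answer`, and `call_llm` are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical in-memory cache keyed by query similarity."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (query embedding, response) pairs

    def lookup(self, query: str):
        """Return the best cached response at or above the threshold, else None."""
        q = embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:  # linear scan; a real setup uses a vector index
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))


def answer(cache: SemanticCache, query: str, call_llm) -> str:
    cached = cache.lookup(query)
    if cached is not None:
        return cached              # cache hit: skip the LLM call
    response = call_llm(query)     # cache miss: forward the query to the LLM
    cache.store(query, response)   # keep the response for future matches
    return response
```

Raising `threshold` makes the cache stricter (fewer hits, fewer wrong answers); lowering it does the opposite.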

## Self-Hosting & Configuration
- Install via pip with optional extras for specific backends
- Configure the embedding model, vector store, and similarity threshold in the init call
- Use SQLite for lightweight local caching or Redis and MySQL for production deployments
- Set cache eviction policies based on time-to-live or maximum cache size
- Deploy as a standalone service or embed directly in application code
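
To make the eviction settings above concrete, here is a small standard-library sketch that combines a maximum cache size (LRU) with a time-to-live. It illustrates the two policies only; the class name `EvictingCache` and its parameters are hypothetical and do not correspond to GPTCache's configuration options.

```python
import time
from collections import OrderedDict


class EvictingCache:
    """Toy cache combining a max-size LRU policy with a per-entry TTL."""

    def __init__(self, max_size: int, ttl_seconds: float, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so tests can control time
        self._data = OrderedDict()  # key -> (stored_at, value), oldest use first

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self.clock() - stored_at > self.ttl:  # TTL: entry is too old
            del self._data[key]
            return None
        self._data.move_to_end(key)              # mark as most recently used
        return value

    def put(self, key, value) -> None:
        self._data[key] = (self.clock(), value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:   # LRU: drop the least recently used entry
            self._data.popitem(last=False)
```

A short TTL suits answers that go stale quickly (prices, news), while a size cap bounds memory or storage regardless of entry age.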

## Key Features
- Reduces LLM API costs proportionally to cache hit rate
- Semantic matching catches paraphrased queries that exact-match caching would miss
- Modular architecture allows swapping embedding models and storage backends independently
- Pre-built adapters for OpenAI, LangChain, and LlamaIndex with minimal code changes
- Supports both synchronous and asynchronous operation modes
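
The first point can be made concrete with simple arithmetic: if a fraction of queries is answered from the cache, only the remainder reaches the LLM. The helper below is a rough sketch with a hypothetical name (`expected_llm_cost`); it assumes a flat cost per call and ignores the cost of computing embeddings and running the cache itself.

```python
def expected_llm_cost(num_queries: int, cost_per_call: float, hit_rate: float) -> float:
    """Estimated LLM spend when a fraction `hit_rate` of queries is served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    misses = num_queries * (1.0 - hit_rate)  # only misses reach the LLM
    return misses * cost_per_call


# Illustrative numbers only: 10,000 queries at $0.01 each.
without_cache = expected_llm_cost(10_000, 0.01, hit_rate=0.0)
with_cache = expected_llm_cost(10_000, 0.01, hit_rate=0.4)
```

With these made-up figures, a 40% hit rate lowers the estimated spend from about $100 to about $60.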

## Comparison with Similar Tools
- **Redis** — Key-value cache requiring exact matches; GPTCache adds semantic similarity matching
- **LangChain Cache** — Basic caching built into LangChain; GPTCache offers more backends and similarity strategies
- **Portkey Cache** — Cloud-hosted caching service; GPTCache is fully self-hosted and open source
- **Prompt caching (Anthropic/OpenAI)** — Provider-side prefix caching; GPTCache works at the application layer across providers

## FAQ
**Q: How accurate is semantic matching?**
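A: Accuracy depends on the embedding model and the similarity threshold. A stricter threshold reduces false positives (a cached answer returned for a query that actually differs) at the cost of fewer cache hits, so tune it against a sample of representative queries.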

**Q: Does GPTCache work with streaming responses?**
A: Yes, it supports caching and replaying streamed LLM responses.

**Q: What happens when the cache is full?**
A: Configurable eviction policies remove entries automatically: LRU evicts the least recently used entries once the size limit is reached, and TTL expires entries after a set age.

**Q: Can I use GPTCache with local LLMs?**
A: Yes, it works with any LLM that provides a compatible API, including locally hosted models.
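
For the streaming question above, caching and replaying a stream can be pictured as record-and-replay over chunks. The sketch below is an assumption-laden illustration, not GPTCache's code: `StreamCache` and `call_llm_stream` are hypothetical names, and it keys on the exact prompt string for brevity rather than on semantic similarity.

```python
from typing import Callable, Dict, Iterable, Iterator, List


class StreamCache:
    """Toy record-and-replay cache for chunked (streamed) responses."""

    def __init__(self) -> None:
        self._chunks: Dict[str, List[str]] = {}

    def stream(self, prompt: str, call_llm_stream: Callable[[str], Iterable[str]]) -> Iterator[str]:
        if prompt in self._chunks:
            yield from self._chunks[prompt]  # replay the recorded chunks
            return
        recorded: List[str] = []
        for chunk in call_llm_stream(prompt):
            recorded.append(chunk)           # record while passing chunks through
            yield chunk
        self._chunks[prompt] = recorded      # stored only after the stream completes
```

Because the entry is written only after the last chunk, a stream abandoned midway is never cached as a truncated answer.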

## Sources
- https://github.com/zilliztech/GPTCache
- https://gptcache.readthedocs.io/

---
Source: https://tokrepo.com/en/workflows/asset-27d8d746
Author: AI Open Source