Configs · Jun 2, 2026 · 3 min read

GPTCache — Semantic Cache for LLM API Calls

A caching layer for LLM queries that uses semantic similarity to return cached responses, reducing API costs and latency.

Agent-ready installation

This asset can be installed after you choose the runtime, review the plan, and run the corresponding command.

Native · 98/100
Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry: GPTCache Overview
Direct install command:
npx -y tokrepo@latest install 27d8d746-5e1a-11f1-9bc6-00163e2b0d79 --target codex

Run the command only after confirming the plan with a dry run.

Introduction

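GPTCache is an open-source semantic caching library that stores and retrieves LLM responses based on the meaning of queries rather than exact string matching. When a new query is semantically similar to a previously cached one, GPTCache returns the cached response instead of calling the LLM, which avoids the API cost of that call and usually shortens response time considerably. The project's README advertises up to 10x lower API costs and up to 100x faster responses; the actual gain depends on the workload's cache hit rate.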

What GPTCache Does

  • Caches LLM API responses and matches new queries using embedding-based similarity
  • Supports multiple embedding providers including OpenAI, Hugging Face, and ONNX models
  • Integrates with LangChain, LlamaIndex, and direct OpenAI SDK usage
  • Provides pluggable storage backends including SQLite, MySQL, Redis, and Milvus
  • Handles cache eviction with LRU and TTL strategies

Architecture Overview

GPTCache processes incoming queries through an embedding model to produce vector representations. These vectors are compared against cached query embeddings using a configurable similarity search backend such as FAISS or Milvus. If a match exceeds the similarity threshold, the cached response is returned directly. Otherwise, the query is forwarded to the LLM and the response is stored with its embedding for future matches.
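
The lookup flow described above can be sketched in a few lines of plain Python. This is a minimal illustration, not GPTCache's API: the bag-of-words `embed` function stands in for a real embedding model, the linear scan stands in for a vector index such as FAISS or Milvus, and the names `ToySemanticCache`, `embed`, and `cosine` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding (assumption): word counts instead of a real embedding model."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToySemanticCache:
    """Hypothetical stand-in for a semantic cache; not the GPTCache implementation."""

    def __init__(self, llm, threshold=0.8):
        self.llm = llm              # callable: query text -> response text
        self.threshold = threshold  # minimum similarity that counts as a hit
        self.entries = []           # list of (embedding, response) pairs
        self.hits = 0
        self.misses = 0

    def query(self, text):
        vec = embed(text)
        # Linear scan for the most similar cached query (a vector index would do this faster).
        best_score, best_response = 0.0, None
        for cached_vec, response in self.entries:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            self.hits += 1
            return best_response
        # Cache miss: call the LLM and store the response with its embedding.
        self.misses += 1
        response = self.llm(text)
        self.entries.append((vec, response))
        return response
```

With this sketch, asking "What is the capital of France?" after "what is the capital of France" is served from the cache, while an unrelated question falls through to the `llm` callable. A real deployment would pick the threshold empirically, since word-count similarity is far cruder than a learned embedding.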

Self-Hosting & Configuration

  • Install via pip with optional extras for specific backends
  • Configure the embedding model, vector store, and similarity threshold in the init call
  • Use SQLite for lightweight local caching or Redis and MySQL for production deployments
  • Set cache eviction policies based on time-to-live or maximum cache size
  • Deploy as a standalone service or embed directly in application code
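
To make the eviction options in the list above concrete, here is a small sketch of a store that combines a maximum size (least-recently-used eviction) with a time-to-live. It is an illustration under stated assumptions, not GPTCache's eviction code: the class name `LruTtlStore` and its parameters are hypothetical, and the injectable `clock` exists only to make the behaviour easy to test.

```python
import time
from collections import OrderedDict


class LruTtlStore:
    """Hypothetical key-value store with LRU eviction and per-entry TTL."""

    def __init__(self, max_size=1000, ttl_seconds=3600.0, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = OrderedDict()  # key -> (stored_at, value); least recently used first

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self.clock() - stored_at > self.ttl:
            del self._data[key]       # entry outlived its TTL
            return None
        self._data.move_to_end(key)   # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (self.clock(), value)
        self._data.move_to_end(key)
        # Evict least recently used entries until the size limit holds.
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)
```

The two limits address different risks: the size cap bounds memory or storage use, while the TTL bounds how stale a cached answer can become.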

Key Features

  • Reduces LLM API costs proportionally to cache hit rate
  • Semantic matching catches paraphrased queries that exact-match caching would miss
  • Modular architecture allows swapping embedding models and storage backends independently
  • Pre-built adapters for OpenAI, LangChain, and LlamaIndex with minimal code changes
  • Supports both synchronous and asynchronous operation modes
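
The first point above, that savings scale with the cache hit rate, can be checked with simple arithmetic. The helper below is a hypothetical estimator, not part of GPTCache; the `cost_per_lookup` parameter is an assumption that models the overhead of computing an embedding and searching the cache on every request.

```python
def expected_cost(requests, cost_per_call, hit_rate, cost_per_lookup=0.0):
    """Estimate total spend when a fraction `hit_rate` of requests is served from cache.

    Only cache misses reach the LLM; every request pays the (assumed) lookup overhead.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    misses = requests * (1.0 - hit_rate)
    return misses * cost_per_call + requests * cost_per_lookup
```

For example, with 10,000 requests at a hypothetical 0.002 per call and a 30% hit rate, only 7,000 calls reach the LLM, so spend drops from about 20 to about 14 before lookup overhead. If the lookup itself is paid for (for instance, a hosted embedding API), a low hit rate can erase the saving.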

Comparison with Similar Tools

  • Redis — Key-value cache requiring exact matches; GPTCache adds semantic similarity matching
  • LangChain Cache — Basic caching built into LangChain; GPTCache offers more backends and similarity strategies
  • Portkey Cache — Cloud-hosted caching service; GPTCache is fully self-hosted and open source
  • Prompt caching (Anthropic/OpenAI) — Provider-side prefix caching; GPTCache works at the application layer across providers

FAQ

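Q: How accurate is semantic matching? A: Accuracy depends on the embedding model and the similarity threshold. Raising the threshold reduces false positives (cached answers returned for queries that actually differ) at the cost of fewer cache hits, so the threshold is best tuned against a sample of real queries.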
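
One way to tune the threshold is to score a labeled sample of query pairs and see how precision and hit rate move as the threshold changes. The sketch below assumes you already have similarity scores and human labels; the function name `evaluate_threshold` and the sample data are hypothetical and not part of GPTCache.

```python
def evaluate_threshold(scored_pairs, threshold):
    """Return (precision, hit_rate) for a candidate similarity threshold.

    scored_pairs: list of (similarity, is_true_match) tuples, where is_true_match
    says whether the cached answer would really have been correct for the new query.
    """
    served = [is_match for score, is_match in scored_pairs if score >= threshold]
    hits = len(served)
    precision = sum(served) / hits if hits else 1.0  # no hits means no false positives
    hit_rate = hits / len(scored_pairs) if scored_pairs else 0.0
    return precision, hit_rate


# Hypothetical labeled sample: (similarity score, was the cached answer correct?)
sample = [(0.95, True), (0.90, True), (0.85, False), (0.70, True), (0.60, False)]
for candidate in (0.6, 0.8, 0.9):
    precision, hit_rate = evaluate_threshold(sample, candidate)
    print(f"threshold={candidate}: precision={precision:.2f}, hit_rate={hit_rate:.2f}")
```

On this toy sample, moving the threshold from 0.8 to 0.9 removes the one false positive but also lowers the share of queries served from cache, which is the trade-off to weigh for a given application.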

Q: Does GPTCache work with streaming responses? A: Yes, it supports caching and replaying streamed LLM responses.
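
The general idea behind caching a stream can be sketched as follows: chunks are passed through to the caller while being recorded, and a later hit replays the recorded chunks. This is an assumption-laden illustration rather than GPTCache's adapter code; `cached_stream` is a hypothetical helper, and it uses an exact key in a plain dict where a semantic cache would do a similarity lookup.

```python
def cached_stream(key, store, produce_chunks):
    """Yield response chunks, replaying from `store` on a hit.

    store: dict mapping key -> list of chunks (stand-in for a real cache backend).
    produce_chunks: zero-argument callable returning an iterable of chunks from the LLM.
    """
    if key in store:
        yield from store[key]   # replay the cached chunks in order
        return
    chunks = []
    for chunk in produce_chunks():
        chunks.append(chunk)    # record while streaming to the caller
        yield chunk
    store[key] = chunks         # stored only after the stream completes
```

Because the entry is written only after the last chunk, a stream that the caller abandons partway through is not cached, which avoids replaying truncated answers.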

Q: What happens when the cache is full? A: Configurable eviction policies automatically remove entries: LRU evicts the least recently used entries once the size limit is reached, and TTL expires entries older than a set age.

Q: Can I use GPTCache with local LLMs? A: Yes, it works with any LLM that provides a compatible API, including locally hosted models.
