Layered caching (exact, normalized, semantic) that cuts LLM costs/latency without stale answers.

LLM caching can cut costs by 60%+ and reduce latency by 90%.
Yet most teams don't cache LLM responses because "AI is non-deterministic."
This article shows caching strategies that work for non-deterministic systems.
Traditional caching: exact-match keys. LLM caching: semantic similarity.
Challenge: "How do I reset my password?" vs "I forgot my password, help!" are semantically identical but different strings.
Cache identical prompts:
from cachetools import TTLCache

class ExactMatchCache:
    def __init__(self):
        # In-process TTL store; swap for Redis to share across workers
        self.cache = TTLCache(maxsize=10_000, ttl=3600)

    def get(self, prompt):
        return self.cache.get(hash(prompt))

    def set(self, prompt, response):
        self.cache[hash(prompt)] = response  # entries expire after the 1-hour TTL
Hit rate: 8-12% (low, but easy to implement)
Use case: repeated exact queries (FAQs, common commands)
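Usage follows the classic cache-aside pattern (llm.complete is a hypothetical client call):

cache = ExactMatchCache()
prompt = "What are your support hours?"
response = cache.get(prompt)
if response is None:
    response = llm.complete(prompt)  # hypothetical LLM client
    cache.set(prompt, response)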
Normalize prompts before caching:
import re
from cachetools import TTLCache

STOP_WORDS = {"a", "an", "the", "i", "my", "do", "how", "to", "please"}  # illustrative list

def normalize(prompt):
    # Lowercase, trim whitespace, remove punctuation
    normalized = prompt.lower().strip()
    normalized = re.sub(r'[^\w\s]', '', normalized)
    # Remove stop words
    return ' '.join(w for w in normalized.split() if w not in STOP_WORDS)

class NormalizedCache:
    def __init__(self):
        self.cache = TTLCache(maxsize=10_000, ttl=3600)

    def get(self, prompt):
        return self.cache.get(hash(normalize(prompt)))

    def set(self, prompt, response):
        self.cache[hash(normalize(prompt))] = response
Hit rate: 20-30% (better, still simple)
Use case: queries with minor variations
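With the illustrative stop-word list above, variant phrasings collapse to the same key:

print(normalize("How do I reset my password?"))    # -> "reset password"
print(normalize("how do i RESET my password???"))  # -> "reset password" (same cache key)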
Cache by meaning, not string:
import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticCache:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.cache = {}
        self.embeddings = {}

    def get(self, prompt, similarity_threshold=0.95):
        prompt_embedding = self.model.encode(prompt)
        # Linear scan over cached prompts; use a vector index (e.g. FAISS) at scale
        for cached_prompt, cached_embedding in self.embeddings.items():
            similarity = np.dot(prompt_embedding, cached_embedding) / (
                np.linalg.norm(prompt_embedding) * np.linalg.norm(cached_embedding)
            )
            if similarity > similarity_threshold:
                return self.cache[cached_prompt]
        return None

    def set(self, prompt, response):
        self.embeddings[prompt] = self.model.encode(prompt)
        self.cache[prompt] = response
Hit rate: 40-55% (significant improvement)
Cost: +30ms latency for embedding calculation
Use case: semantically similar queries
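A usage sketch; real similarity scores depend on the embedding model, so the threshold here is illustrative:

cache = SemanticCache()
cache.set("How do I reset my password?", "Go to Settings > Security > Reset password.")
# Different wording, same intent: a hit if similarity clears the threshold
print(cache.get("I forgot my password, help!", similarity_threshold=0.8))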
Cache prompt templates with variables:
class ParameterizedCache:
    def __init__(self):
        self.cache = {}

    def get(self, template, params):
        # Key on the template, not the filled-in prompt
        cached_response_template = self.cache.get(template)
        if cached_response_template:
            return cached_response_template.format(**params)
        return None

    def set(self, template, response_template):
        self.cache[template] = response_template
Example:
template = "Summarize this document: {document}"
params = {"document": user_document}
# Cache "Summarize this document: {document}" template
# Fill with actual document on cache hit
Hit rate: 30-45% (depends on template reuse)
Use case: repetitive tasks with variable inputs
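This pattern pays off when the response is itself a template over the same variables; a toy sketch (all strings illustrative):

cache = ParameterizedCache()
cache.set(
    "Write a one-line welcome message for {name}",
    "Welcome aboard, {name}! We're glad you're here.",
)
print(cache.get("Write a one-line welcome message for {name}", {"name": "Ana"}))
# -> "Welcome aboard, Ana! We're glad you're here."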
Match the TTL to how quickly the underlying data changes (cache.set here assumes a Redis-style client):

# Short TTL for dynamic content
cache.set(key, value, ttl=300)    # 5 minutes

# Long TTL for static content
cache.set(key, value, ttl=86400)  # 24 hours
Invalidate on data changes (event_listener and delete_pattern stand in for your event bus and cache client):

@event_listener("document_updated")
def invalidate_document_cache(document_id):
    cache.delete_pattern(f"doc:{document_id}:*")
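With redis-py, for example, the pattern delete could look like this (assuming keys are namespaced as doc:<id>:<suffix>):

import redis

r = redis.Redis()

def invalidate_document_cache(document_id):
    # SCAN iterates incrementally without blocking Redis the way KEYS would
    for key in r.scan_iter(f"doc:{document_id}:*"):
        r.delete(key)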
Tag cache entries with the model version:

def cache_key(prompt, model_version):
    return f"{hash(prompt)}:v{model_version}"

# Invalidate the whole cache when the model updates
def on_model_update(new_version):
    cache.clear()  # or selectively delete old versions
Combine multiple cache layers:

class TieredCache:
    def __init__(self):
        self.exact_cache = ExactMatchCache()        # L1: fast, low hit rate
        self.normalized_cache = NormalizedCache()   # L2: medium speed / hit rate
        self.semantic_cache = SemanticCache()       # L3: slower, high hit rate

    def get(self, prompt):
        # Try L1 (fastest)
        result = self.exact_cache.get(prompt)
        if result:
            return result, "L1"
        # Try L2
        result = self.normalized_cache.get(prompt)
        if result:
            self.exact_cache.set(prompt, result)  # Promote to L1
            return result, "L2"
        # Try L3
        result = self.semantic_cache.get(prompt)
        if result:
            self.normalized_cache.set(prompt, result)  # Promote to L2
            return result, "L3"
        return None, "MISS"
Key metrics:
- Cost savings: (LLM calls avoided × model cost) − cache cost
- Latency reduction: avg miss latency − avg hit latency
- ROI: cost savings / cache infrastructure cost

Dashboard:
metrics = {
    "hit_rate": cache_hits / total_requests,
    "miss_rate": cache_misses / total_requests,
    "avg_hit_latency": 15,     # ms
    "avg_miss_latency": 2100,  # ms
    "cost_per_hit": 0.0001,    # cache lookup cost ($)
    "cost_per_miss": 0.05,     # LLM API cost ($)
    # Savings from avoided LLM calls, minus the lookup cost paid on every request
    "monthly_savings": (cache_hits * 0.05) - (total_requests * 0.0001),
}
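As a worked example with the numbers above: at 1M requests/month (an assumed volume) and the 48% hit rate reported below, monthly_savings = 480,000 × $0.05 − 1,000,000 × $0.0001 = $24,000 − $100 = $23,900.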
Before caching:
After semantic caching (48% hit rate):
Setup:
LLM caching isn't optional at scale: at high volume it can cut costs by 60%+ and reduce latency by 90%.

Implementation order:
1. Exact match caching (simplest, 8-12% hit rate)
2. Add normalization (20-30%)
3. Add semantic caching (40-55%)
4. Combine them in a tiered cache with promotion between layers