LLM Caching: The Underutilized Key to Cost and Latency Reduction
Layered caching (exact, normalized, semantic) that cuts LLM costs/latency without stale answers.

Introduction
At scale, LLM caching can cut API costs by 40-60% and serve cache hits more than 90% faster than a fresh model call.
Yet most teams don't cache LLM responses because "AI is non-deterministic."
This article shows caching strategies that work for non-deterministic systems.
Why LLM Caching Is Different
Traditional caching relies on exact-match keys; LLM caching must handle semantic similarity.
Challenge: "How do I reset my password?" vs "I forgot my password, help!" are semantically identical but different strings.
Caching Strategy Tiers
Tier 1: Exact Match Caching (Baseline)
Cache identical prompts:
class ExactMatchCache:
    # self.cache is any key-value store with get/set and TTL support (e.g. Redis).
    # Note: Python's built-in hash() is salted per process; use a stable digest
    # (e.g. hashlib.sha256) if the cache is shared across processes.
    def get(self, prompt):
        return self.cache.get(hash(prompt))

    def set(self, prompt, response):
        self.cache.set(hash(prompt), response, ttl=3600)
- Hit rate: 8-12% (low, but easy to implement)
- Use case: repeated exact queries (FAQs, common commands)
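The typical integration is cache-aside: check the cache before calling the model, populate it on a miss. A minimal self-contained sketch, where `DictStore` and `call_llm` are hypothetical stand-ins for a real backend and a real model call:

```python
import hashlib

class DictStore:
    """Hypothetical in-memory backend; TTL handling omitted for brevity."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def set(self, key, value, ttl=None):
        self.data[key] = value

def call_llm(prompt):
    # Placeholder for a real model API call
    return f"response to: {prompt}"

def cached_completion(prompt, store):
    # Stable digest so keys survive process restarts (unlike built-in hash())
    key = hashlib.sha256(prompt.encode()).hexdigest()
    hit = store.get(key)
    if hit is not None:
        return hit, True
    response = call_llm(prompt)
    store.set(key, response, ttl=3600)
    return response, False

store = DictStore()
_, was_hit_1 = cached_completion("How do I reset my password?", store)
_, was_hit_2 = cached_completion("How do I reset my password?", store)
# First call misses, identical second call hits
```

The second, byte-identical request is served from the store without touching the model.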
Tier 2: Normalized Caching
Normalize prompts before caching:
import re

STOP_WORDS = {"a", "an", "the", "please", "my"}  # illustrative, not exhaustive

def normalize(prompt):
    # Lowercase, remove punctuation, trim whitespace
    normalized = prompt.lower().strip()
    normalized = re.sub(r'[^\w\s]', '', normalized)
    # Remove stop words
    words = [w for w in normalized.split() if w not in STOP_WORDS]
    return ' '.join(words)
class NormalizedCache:
    def get(self, prompt):
        key = hash(normalize(prompt))
        return self.cache.get(key)
- Hit rate: 20-30% (better, still simple)
- Use case: queries with minor variations
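The hit-rate lift comes from surface variants collapsing to one key. A self-contained illustration, using a small illustrative stop-word list:

```python
import re

STOP_WORDS = {"please", "my", "the", "a", "an"}  # illustrative only

def normalize(prompt):
    # Lowercase, strip punctuation, drop stop words
    normalized = re.sub(r'[^\w\s]', '', prompt.lower().strip())
    return ' '.join(w for w in normalized.split() if w not in STOP_WORDS)

key_a = normalize("Reset my password!")
key_b = normalize("reset password")
# Both variants normalize to "reset password", so they share one cache entry
```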
Tier 3: Semantic Caching
Cache by meaning, not string:
import numpy as np
from sentence_transformers import SentenceTransformer

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

class SemanticCache:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.cache = {}
        self.embeddings = {}

    def get(self, prompt, similarity_threshold=0.95):
        prompt_embedding = self.model.encode(prompt)
        # Linear scan over cached prompts; use a vector index (e.g. FAISS) at scale
        for cached_prompt, cached_embedding in self.embeddings.items():
            similarity = cosine_similarity(prompt_embedding, cached_embedding)
            if similarity > similarity_threshold:
                return self.cache[cached_prompt]
        return None

    def set(self, prompt, response):
        embedding = self.model.encode(prompt)
        self.embeddings[prompt] = embedding
        self.cache[prompt] = response
- Hit rate: 40-55% (significant improvement)
- Cost: ~30ms added latency for embedding computation
- Use case: semantically similar queries
Tier 4: Parameterized Caching
Cache prompt templates with variables:
class ParameterizedCache:
    def get(self, template, params):
        # Key the cache by the template, not the filled prompt; this only
        # works when the response itself follows a fixed, parameterized shape
        cached_response_template = self.cache.get(template)
        if cached_response_template:
            return cached_response_template.format(**params)
        return None
Example:
template = "Summarize this document: {document}"
params = {"document": user_document}
# Cache "Summarize this document: {document}" template
# Fill with actual document on cache hit
- Hit rate: 30-45% (depends on template reuse)
- Use case: repetitive tasks with variable inputs and templated outputs
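The mechanics are easiest to see when the cached response has a fixed, parameterized shape. A self-contained sketch (hypothetical templates; a real deployment would cache this only for responses that genuinely follow a template):

```python
class ParameterizedCache:
    def __init__(self):
        self.cache = {}

    def set(self, template, response_template):
        # Store one response shape per prompt template
        self.cache[template] = response_template

    def get(self, template, params):
        cached = self.cache.get(template)
        if cached:
            # Fill the cached response shape with the caller's parameters
            return cached.format(**params)
        return None

pc = ParameterizedCache()
pc.set("What is the status of order {order_id}?",
       "Order {order_id} status is available on your orders page.")
answer = pc.get("What is the status of order {order_id}?",
                {"order_id": "A-1042"})
```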
Cache Invalidation Strategies
Strategy 1: Time-Based (TTL)
# Short TTL for dynamic content
cache.set(key, value, ttl=300) # 5 minutes
# Long TTL for static content
cache.set(key, value, ttl=86400) # 24 hours
Strategy 2: Event-Based
Invalidate on data changes:
@event_listener("document_updated")
def invalidate_document_cache(document_id):
    cache.delete_pattern(f"doc:{document_id}:*")
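delete_pattern assumes a backend that supports wildcard deletion (with Redis, typically SCAN plus DEL). A minimal in-memory sketch of the same idea, using the standard library's fnmatch for glob matching (`PatternCache` is hypothetical):

```python
from fnmatch import fnmatch

class PatternCache:
    """Hypothetical in-memory cache with glob-style invalidation."""
    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value

    def delete_pattern(self, pattern):
        # Collect matches first, then delete (avoid mutating while iterating)
        doomed = [k for k in self.data if fnmatch(k, pattern)]
        for k in doomed:
            del self.data[k]
        return len(doomed)

cache = PatternCache()
cache.set("doc:42:summary", "...")
cache.set("doc:42:qa", "...")
cache.set("doc:99:summary", "...")
removed = cache.delete_pattern("doc:42:*")
# Removes both doc:42 entries; doc:99 survives
```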
Strategy 3: Version-Based
Tag cache entries with version:
def cache_key(prompt, model_version):
    return f"{hash(prompt)}:v{model_version}"

# Invalidate all cache entries when the model updates
def on_model_update(new_version):
    cache.clear()  # Or selectively delete old versions
Tiered Caching Architecture
Combine multiple cache layers:
class TieredCache:
    def __init__(self):
        self.exact_cache = ExactMatchCache()       # L1: fast, low hit rate
        self.normalized_cache = NormalizedCache()  # L2: medium speed/hit rate
        self.semantic_cache = SemanticCache()      # L3: slower, high hit rate

    def get(self, prompt):
        # Try L1 (fastest)
        result = self.exact_cache.get(prompt)
        if result is not None:
            return result, "L1"
        # Try L2
        result = self.normalized_cache.get(prompt)
        if result is not None:
            self.exact_cache.set(prompt, result)  # Promote to L1
            return result, "L2"
        # Try L3
        result = self.semantic_cache.get(prompt)
        if result is not None:
            self.normalized_cache.set(prompt, result)  # Promote to L2
            return result, "L3"
        return None, "MISS"
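The promotion behavior can be exercised with dict-backed stand-ins for the tiers. A self-contained two-tier sketch (exact in front of normalized; `DictCache` and `TwoTierCache` are simplified, hypothetical implementations):

```python
import re

class DictCache:
    """Minimal stand-in for one tier: a dict with get/set."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def set(self, key, value):
        self.data[key] = value

def normalize(prompt):
    # Lowercase, strip punctuation and surrounding whitespace
    return re.sub(r'[^\w\s]', '', prompt.lower().strip())

class TwoTierCache:
    """Exact (L1) in front of normalized (L2), with promotion on L2 hits."""
    def __init__(self):
        self.l1 = DictCache()
        self.l2 = DictCache()

    def get(self, prompt):
        result = self.l1.get(prompt)
        if result is not None:
            return result, "L1"
        result = self.l2.get(normalize(prompt))
        if result is not None:
            self.l1.set(prompt, result)  # Promote to L1
            return result, "L2"
        return None, "MISS"

    def set(self, prompt, response):
        self.l1.set(prompt, response)
        self.l2.set(normalize(prompt), response)

tiers = TwoTierCache()
tiers.set("Reset my password", "Click 'Forgot password' on the login page.")
_, tier_first = tiers.get("reset my password!")   # variant: misses L1, hits L2
_, tier_second = tiers.get("reset my password!")  # now promoted, hits L1
```

After the first variant lookup is promoted, the repeat lookup is served by the fast exact tier.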
Measuring Cache Effectiveness
Key metrics:
- Hit Rate: cache hits / total requests
- Cost Savings: (hits × cost per model call) - cache infrastructure cost
- Latency Improvement: avg latency (miss) - avg latency (hit)
- ROI: cost savings / cache infrastructure cost
Dashboard:
metrics = {
    "hit_rate": cache_hits / total_requests,
    "miss_rate": cache_misses / total_requests,
    "avg_hit_latency": 15,     # ms
    "avg_miss_latency": 2100,  # ms
    "cost_per_hit": 0.0001,    # cache lookup cost
    "cost_per_miss": 0.05,     # LLM API cost
    "monthly_savings": (cache_hits * 0.05) - (total_requests * 0.0001),
}
Real Example: Customer Support Chatbot
Before caching:
- Requests: 1M/month
- Cost: $50K/month
- Avg latency: 2.1s
After semantic caching (48% hit rate):
- Cache hits: 480K
- Cache misses: 520K
- Cost: $26K LLM + $800 cache = $26.8K (46% savings)
- Avg latency: 1.1s (48% improvement)
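These figures follow directly from the hit rate and the per-request numbers; a quick arithmetic check using the article's $0.05 per model call, $800/month cache infrastructure, and 15 ms / 2100 ms hit/miss latencies:

```python
requests = 1_000_000
hit_rate = 0.48
cost_per_call = 0.05   # USD per LLM request
cache_infra = 800      # USD per month
hit_latency_ms = 15
miss_latency_ms = 2100

hits = int(requests * hit_rate)     # 480K served from cache
misses = requests - hits            # 520K still hit the model
llm_cost = misses * cost_per_call   # $26K in model spend
total_cost = llm_cost + cache_infra # $26.8K all-in

# Blended latency: hits are fast, misses pay the full model round trip
avg_latency_ms = hit_rate * hit_latency_ms + (1 - hit_rate) * miss_latency_ms
# ~1099 ms, i.e. the ~1.1s average quoted above
```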
Setup:
- Semantic cache (Tier 3)
- Similarity threshold: 0.92
- TTL: 1 hour
- Embedding model: all-MiniLM-L6-v2 (fast)
Conclusion
LLM caching isn't optional at scale:
- Exact match: 8-12% hit rate (baseline)
- Normalized: 20-30% hit rate (easy improvement)
- Semantic: 40-55% hit rate (best ROI)
- Parameterized: 30-45% hit rate (template-based)
Implementation order:
- Start: Exact match (Week 1)
- Add: Normalized (Week 2)
- Add: Semantic (Week 3-4)
- Optimize: Monitor and tune thresholds
Anderson Lima
AI Architect
Building the internet