#llm-caching#cache-strategy#cost-optimization#latency-reduction

LLM Caching: The Underutilized Key to Cost and Latency Reduction

Layered caching (exact, normalized, semantic) that cuts LLM costs/latency without stale answers.

Anderson Lima, AI Architect
January 25, 2026
4 min read

Introduction

LLM caching can cut costs by 60%+ and reduce latency by 90%.

Yet most teams don't cache LLM responses because "AI is non-deterministic."

This article shows caching strategies that work for non-deterministic systems.

Why LLM Caching Is Different

Traditional caching: exact-match keys
LLM caching: semantic similarity

Challenge: "How do I reset my password?" vs "I forgot my password, help!" are semantically identical but different strings.
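A two-line check makes the mismatch concrete: hashing both phrasings (a minimal sketch, using SHA-256 as the cache key) yields different keys, so a string-keyed cache misses even though the intent is identical.

```python
import hashlib

a = "How do I reset my password?"
b = "I forgot my password, help!"

# Any wording change produces a different exact-match key
key_a = hashlib.sha256(a.encode()).hexdigest()
key_b = hashlib.sha256(b.encode()).hexdigest()
print(key_a == key_b)  # False
```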

Caching Strategy Tiers

Tier 1: Exact Match Caching (Baseline)

Cache identical prompts:

class ExactMatchCache:
    # self.cache is assumed to be a TTL-capable key-value store,
    # e.g. a Redis client; hash(prompt) is the exact-match key
    def get(self, prompt):
        return self.cache.get(hash(prompt))
    
    def set(self, prompt, response):
        self.cache.set(hash(prompt), response, ttl=3600)

Hit rate: 8-12% (low but easy to implement)
Use case: Repeated exact queries (FAQs, common commands)

Tier 2: Normalized Caching

Normalize prompts before caching:

import re

STOP_WORDS = {"how", "do", "i", "my", "a", "the", "please"}  # illustrative list

def normalize(prompt):
    # Lowercase, remove punctuation, trim whitespace
    normalized = prompt.lower().strip()
    normalized = re.sub(r'[^\w\s]', '', normalized)
    # Remove stop words
    return " ".join(w for w in normalized.split() if w not in STOP_WORDS)

class NormalizedCache:
    # self.cache: a TTL-capable store, as in ExactMatchCache
    def get(self, prompt):
        key = hash(normalize(prompt))
        return self.cache.get(key)
    
    def set(self, prompt, response):
        self.cache.set(hash(normalize(prompt)), response, ttl=3600)

Hit rate: 20-30% (better, still simple)
Use case: Queries with minor variations
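To see the win in action, here is a self-contained toy version (the stop-word list is an illustrative assumption) that collapses two minor variations of the same request onto one key:

```python
import re

STOP_WORDS = {"how", "do", "i", "my"}  # toy list, an assumption

def normalize_toy(prompt):
    # lowercase, strip punctuation, drop stop words
    text = re.sub(r'[^\w\s]', '', prompt.lower().strip())
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

# Two surface variants collapse to the same cache key:
print(normalize_toy("How do I reset my password?"))  # reset password
print(normalize_toy("Reset my password!"))           # reset password
```

Note that true paraphrases with different vocabulary ("I forgot my password") still miss at this tier; catching those is what the semantic tier below is for.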

Tier 3: Semantic Caching

Cache by meaning, not string:

from sentence_transformers import SentenceTransformer, util

class SemanticCache:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.cache = {}
        self.embeddings = {}
    
    def get(self, prompt, similarity_threshold=0.95):
        prompt_embedding = self.model.encode(prompt)
        
        # Linear scan over cached prompts; use a vector index (e.g. FAISS) at scale
        for cached_prompt, cached_embedding in self.embeddings.items():
            similarity = util.cos_sim(prompt_embedding, cached_embedding).item()
            if similarity > similarity_threshold:
                return self.cache[cached_prompt]
        
        return None
    
    def set(self, prompt, response):
        self.embeddings[prompt] = self.model.encode(prompt)
        self.cache[prompt] = response

Hit rate: 40-55% (significant improvement)
Cost: +30ms latency for embedding calculation
Use case: Semantically similar queries

Tier 4: Parameterized Caching

Cache prompt templates with variables:

class ParameterizedCache:
    def get(self, template, params):
        # Cache template, not filled prompt
        cached_response_template = self.cache.get(template)
        if cached_response_template:
            return cached_response_template.format(**params)
        return None

Example:

template = "Summarize this document: {document}"
params = {"document": user_document}

# Cache "Summarize this document: {document}" template
# Fill with actual document on cache hit

Hit rate: 30-45% (depends on template reuse)
Use case: Repetitive tasks with variable inputs
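One way to make the mechanic concrete is a dict-backed sketch (hypothetical names; it assumes the cached response varies with the parameters in a purely mechanical way, which is the main limitation of this tier):

```python
class ParameterizedCacheSketch:
    """In-memory sketch: values are response templates that reuse
    the placeholders of the prompt template."""
    def __init__(self):
        self._store = {}

    def set(self, template, response_template):
        self._store[template] = response_template

    def get(self, template, params):
        cached = self._store.get(template)
        return cached.format(**params) if cached else None

cache = ParameterizedCacheSketch()
cache.set("Greet the user named {name}", "Hello, {name}! How can I help?")
print(cache.get("Greet the user named {name}", {"name": "Ada"}))
# Hello, Ada! How can I help?
```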

Cache Invalidation Strategies

Strategy 1: Time-Based (TTL)

# Short TTL for dynamic content
cache.set(key, value, ttl=300)  # 5 minutes

# Long TTL for static content
cache.set(key, value, ttl=86400)  # 24 hours

Strategy 2: Event-Based

Invalidate on data changes:

@event_listener("document_updated")
def invalidate_document_cache(document_id):
    cache.delete_pattern(f"doc:{document_id}:*")

Strategy 3: Version-Based

Tag cache entries with version:

def cache_key(prompt, model_version):
    return f"{hash(prompt)}:v{model_version}"

# Invalidate all cache when model updates
def on_model_update(new_version):
    cache.clear()  # Or selectively delete old versions

Tiered Caching Architecture

Combine multiple cache layers:

class TieredCache:
    def __init__(self):
        self.exact_cache = ExactMatchCache()          # L1: Fast, low hit rate
        self.normalized_cache = NormalizedCache()     # L2: Medium speed/hit rate
        self.semantic_cache = SemanticCache()         # L3: Slower, high hit rate
    
    def get(self, prompt):
        # Try L1 (fastest)
        result = self.exact_cache.get(prompt)
        if result is not None:  # truthiness would treat "" as a miss
            return result, "L1"
        
        # Try L2
        result = self.normalized_cache.get(prompt)
        if result is not None:
            self.exact_cache.set(prompt, result)  # Promote to L1
            return result, "L2"
        
        # Try L3
        result = self.semantic_cache.get(prompt)
        if result is not None:
            self.normalized_cache.set(prompt, result)  # Promote to L2
            return result, "L3"
        
        # Caller falls through to the LLM and populates the tiers
        return None, "MISS"
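The classes above are sketches, so here is a self-contained version of the miss path they imply (dict-backed stand-ins, hypothetical names): on a full miss the caller invokes the LLM once and populates every tier, so the next lookup hits the fastest layer.

```python
class DictCache:
    # Stand-in for one tier (no TTL, for illustration)
    def __init__(self):
        self._store = {}
    def get(self, prompt):
        return self._store.get(prompt)
    def set(self, prompt, response):
        self._store[prompt] = response

def cached_completion(tiers, prompt, call_llm):
    # Check tiers fastest-first
    for name, cache in tiers:
        hit = cache.get(prompt)
        if hit is not None:
            return hit, name
    # Full miss: call the model once, then populate every tier
    response = call_llm(prompt)
    for _, cache in tiers:
        cache.set(prompt, response)
    return response, "MISS"

tiers = [("L1", DictCache()), ("L2", DictCache())]
calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return "pong"

print(cached_completion(tiers, "ping", fake_llm))  # ('pong', 'MISS')
print(cached_completion(tiers, "ping", fake_llm))  # ('pong', 'L1')
print(len(calls))  # 1 — the model was only called once
```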

Measuring Cache Effectiveness

Key metrics:

  1. Hit Rate: Cached requests / Total requests
  2. Cost Savings: (Hits × Model Cost) - Cache Cost
  3. Latency Improvement: Avg Latency (Miss) - Avg Latency (Hit)
  4. ROI: Cost Savings / Cache Infrastructure Cost

Dashboard:

metrics = {
    "hit_rate": cache_hits / total_requests,
    "miss_rate": cache_misses / total_requests,
    "avg_hit_latency": 15,  # ms
    "avg_miss_latency": 2100,  # ms
    "cost_per_hit": 0.0001,  # Cache lookup cost
    "cost_per_miss": 0.05,   # LLM API cost
    "monthly_savings": (cache_hits * 0.05) - (total_requests * 0.0001)
}

Real Example: Customer Support Chatbot

Before caching:

  • Requests: 1M/month
  • Cost: $50K/month
  • Avg latency: 2.1s

After semantic caching (48% hit rate):

  • Cache hits: 480K
  • Cache misses: 520K
  • Cost: $26K LLM + $800 cache = $26.8K (46% savings)
  • Avg latency: 1.1s (48% improvement)

Setup:

  • Semantic cache (Tier 3)
  • Similarity threshold: 0.92
  • TTL: 1 hour
  • Embedding model: all-MiniLM-L6-v2 (fast)
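The numbers above can be reproduced from the stated hit rate (the 15 ms hit latency comes from the dashboard figures earlier in the article):

```python
requests = 1_000_000
hit_rate = 0.48
llm_cost_per_call = 0.05     # $50K / 1M requests before caching
cache_infra_cost = 800       # monthly cache cost from the case study
hit_latency = 0.015          # seconds (dashboard: avg_hit_latency = 15 ms)
miss_latency = 2.1           # seconds, pre-caching average

misses = requests - round(requests * hit_rate)               # 520,000
total_cost = misses * llm_cost_per_call + cache_infra_cost   # $26,800
savings = 1 - total_cost / (requests * llm_cost_per_call)    # ~46%
avg_latency = hit_rate * hit_latency + (1 - hit_rate) * miss_latency
print(round(total_cost), round(savings, 2), round(avg_latency, 1))
# 26800 0.46 1.1
```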

Conclusion

LLM caching isn't optional at scale:

  • Exact match: 8-12% hit rate (baseline)
  • Normalized: 20-30% hit rate (easy improvement)
  • Semantic: 40-55% hit rate (best ROI)
  • Parameterized: 30-45% hit rate (template-based)

Implementation order:

  1. Start: Exact match (Week 1)
  2. Add: Normalized (Week 2)
  3. Add: Semantic (Week 3-4)
  4. Optimize: Monitor and tune thresholds

Anderson Lima

AI Architect

Building the internet