System Design Deep Dive - 04 Caching Strategies for LLM Responses

Post by ailswan, May 24, 2026


🎯 Caching Strategies for LLM Responses


1️⃣ Core Framework

When discussing Caching Strategies for LLM Responses, I frame it as:

  1. Why caching matters for LLM systems
  2. Exact response caching
  3. Semantic caching
  4. Prompt / context caching
  5. Embedding and retrieval caching
  6. Streaming and partial caching
  7. Cache invalidation and safety
  8. Trade-offs: cost reduction vs correctness risk

2️⃣ Why Caching Matters

LLM inference is expensive.

Caching can reduce:

  • GPU cost
  • Response latency
  • Repeated token generation
  • Load on retrieval and tool systems

Basic Idea

User Request
→ Check Cache
→ Cache Hit: Return Cached Response
→ Cache Miss: Call LLM
→ Store Result
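
The flow above can be sketched as a cache-aside lookup. This is a minimal sketch, assuming an in-memory dict as the cache and a hypothetical `call_llm` callable standing in for a real model client; neither comes from a specific library.

```python
from typing import Callable, Dict, List


def cached_completion(
    prompt: str,
    cache: Dict[str, str],
    call_llm: Callable[[str], str],
) -> str:
    """Return a cached response on a hit; otherwise call the model and store the result."""
    if prompt in cache:            # cache hit: skip inference entirely
        return cache[prompt]
    response = call_llm(prompt)    # cache miss: pay for inference once
    cache[prompt] = response       # store for the next identical request
    return response


# Stand-in model (hypothetical) that records how often it is called.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"answer to: {prompt}"


cache: Dict[str, str] = {}
first = cached_completion("What is a KV cache?", cache, fake_llm)
second = cached_completion("What is a KV cache?", cache, fake_llm)
```

The second call is served from the dict, so the stand-in model runs only once. Keying on the raw prompt alone is only safe in a toy; a real key needs more inputs.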

👉 Interview Answer

Caching is important in LLM systems because inference is expensive and latency-sensitive.

A good cache can reduce GPU cost, improve response time, reduce repeated token generation, and lower load on retrieval and tool systems.


3️⃣ What Can Be Cached?


Cacheable Components

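LLM systems can cache multiple layers:

  • Prompt prefixes
  • Embeddings
  • Retrieval results
  • Tool results
  • Safety checks
  • Full model responses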

High-Level Cache Layers

Request Cache
→ Prompt Cache
→ Retrieval Cache
→ Embedding Cache
→ Tool Result Cache
→ Response Cache

👉 Interview Answer

Caching in LLM systems is not limited to final responses.

We can cache prompt prefixes, embeddings, retrieval results, tool results, safety checks, and full model responses.

Different cache layers reduce different parts of the latency and cost.


4️⃣ Exact Response Cache


What Is Exact Response Caching?

Exact response caching returns the same output when the exact same request appears again.

Same prompt + same model + same parameters
→ Same cached response

Cache Key

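A good key includes:

  • Model
  • Prompt
  • Parameters
  • System instructions
  • Tool definitions
  • Relevant user or tenant context

Best For

  • Deterministic, repeated queries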

👉 Interview Answer

Exact response caching stores the final answer for identical requests.

The cache key must include model, prompt, parameters, system instructions, tool definitions, and relevant user or tenant context.

It works best for deterministic and repeated queries.
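
One way to build such a key is to hash a canonical serialization of every input that can change the answer. This is a sketch under assumptions: the function name, the field names, and the choice of SHA-256 over sorted JSON are illustrative, not a prescribed format.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional


def exact_cache_key(
    model: str,
    system_prompt: str,
    prompt: str,
    params: Dict[str, Any],
    tools: Optional[List[Dict[str, Any]]] = None,
    tenant_id: Optional[str] = None,
) -> str:
    """Hash every input that can change the answer into one stable key."""
    payload = {
        "model": model,
        "system_prompt": system_prompt,
        "prompt": prompt,
        "params": params,        # e.g. temperature, max_tokens
        "tools": tools or [],
        "tenant_id": tenant_id,
    }
    # sort_keys makes the JSON independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


base = exact_cache_key("model-v1", "Be concise.", "What is TTL?",
                       {"temperature": 0, "max_tokens": 64})
reordered = exact_cache_key("model-v1", "Be concise.", "What is TTL?",
                            {"max_tokens": 64, "temperature": 0})
warmer = exact_cache_key("model-v1", "Be concise.", "What is TTL?",
                         {"temperature": 1, "max_tokens": 64})
```

Here `base` and `reordered` match because only the dict ordering differs, while `warmer` gets a different key because the temperature changed.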


5️⃣ Semantic Cache


What Is Semantic Caching?

Semantic caching returns cached answers for similar questions, not just identical prompts.

"How do I reset my password?"
≈
"I forgot my password. What should I do?"

How It Works

User Query
→ Query Embedding
→ Search Cache by Similarity
→ If Similar Enough, Return Cached Answer

Benefits

  • Lower cost for repeated natural-language questions, even when they are phrased differently

Risk

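Similar does not always mean equivalent. For example, "How do I cancel my order?" and "How do I cancel my subscription?" may score as near neighbors yet need different answers.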


👉 Interview Answer

Semantic caching uses embeddings to find cached answers for similar queries.

It can reduce cost for repeated natural-language questions, but it must be used carefully because semantically similar queries may still require different answers.
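
A minimal sketch of the lookup, assuming embeddings are already computed by some external model and passed in as plain lists of floats. The `SemanticCache` class, the linear scan, and the 0.9 threshold are illustrative assumptions; a production system would typically use a vector index and a tuned threshold.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a cached answer only when the best match clears a similarity threshold."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def put(self, embedding: Sequence[float], answer: str) -> None:
        self._entries.append((list(embedding), answer))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        best_score, best_answer = 0.0, None
        for vector, answer in self._entries:   # linear scan: fine for a sketch
            score = cosine_similarity(embedding, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


semantic_cache = SemanticCache(threshold=0.9)
semantic_cache.put([1.0, 0.0], "Use the 'Forgot password' link on the login page.")
near_hit = semantic_cache.get([0.99, 0.05])   # almost the same direction
far_miss = semantic_cache.get([0.0, 1.0])     # orthogonal query
```

The threshold is the main safety knob: raising it trades hit rate for fewer wrong answers on queries that are similar but not equivalent.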


6️⃣ Prompt Prefix Caching


What Is Prompt Prefix Caching?

Many LLM requests share the same beginning.

Examples:

  • System prompt
  • Instructions
  • Tool definitions
  • Static context

Idea

Shared prompt prefix
→ Cache internal representation
→ Reuse for multiple requests

Why It Helps

It reduces repeated prefill computation.


Best For


👉 Interview Answer

Prompt prefix caching reuses the computation for repeated prompt prefixes.

It is useful when many requests share the same system prompt, instructions, tools, or static context.

This mainly reduces prefill latency and cost.
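
Real prefix caching happens inside the inference engine or at the provider, where the reused state is the attention key/value cache for the prefix tokens. The toy below only illustrates the bookkeeping: `PrefixCache` and the `prefill` callable are hypothetical stand-ins, and the stored "state" is an opaque object rather than real tensors.

```python
import hashlib
from typing import Callable, Dict, List


class PrefixCache:
    """Memoize the expensive prefill step by a hash of the shared prompt prefix."""

    def __init__(self, prefill: Callable[[str], object]) -> None:
        self._prefill = prefill
        self._states: Dict[str, object] = {}

    def state_for(self, prefix: str) -> object:
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        if key not in self._states:
            # First request with this prefix: run prefill once and keep the state.
            self._states[key] = self._prefill(prefix)
        return self._states[key]


prefill_calls: List[str] = []


def fake_prefill(prefix: str) -> object:
    # Stand-in for the engine's prefill pass over the prefix tokens.
    prefill_calls.append(prefix)
    return {"prefix_length": len(prefix)}


SYSTEM_PROMPT = "You are a support assistant. Answer briefly."
prefix_cache = PrefixCache(fake_prefill)
state_a = prefix_cache.state_for(SYSTEM_PROMPT)   # computed
state_b = prefix_cache.state_for(SYSTEM_PROMPT)   # reused
```

Only the prefix is shared; each request still pays for prefill over its own suffix and for decoding.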


7️⃣ Embedding Cache


Why Cache Embeddings?

Embedding generation also costs money and time.

Repeated queries or documents may produce the same embeddings.


Cache Key

hash(text + embedding_model + preprocessing_version)

Best For

  • RAG systems
  • Document indexing pipelines

👉 Interview Answer

Embedding caching avoids recomputing vectors for repeated text.

The cache key should include the text, embedding model version, and preprocessing version.

This is especially useful in RAG systems and document indexing pipelines.
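
The key formula above can be made concrete as follows. This is a sketch: the helper names and the `embed` callable are assumptions, and the fields are serialized as a JSON list instead of being concatenated so that field boundaries stay unambiguous.

```python
import hashlib
import json
from typing import Callable, Dict, List


def embedding_cache_key(text: str, embedding_model: str, preprocessing_version: str) -> str:
    # A JSON list keeps ("ab", "c") and ("a", "bc") from hashing to the same key.
    raw = json.dumps([text, embedding_model, preprocessing_version])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def get_embedding(
    text: str,
    embedding_model: str,
    preprocessing_version: str,
    cache: Dict[str, List[float]],
    embed: Callable[[str], List[float]],
) -> List[float]:
    """Compute the embedding only when this exact (text, model, version) is new."""
    key = embedding_cache_key(text, embedding_model, preprocessing_version)
    if key not in cache:
        cache[key] = embed(text)
    return cache[key]


embed_calls: List[str] = []


def fake_embed(text: str) -> List[float]:
    # Stand-in for a real embedding model call.
    embed_calls.append(text)
    return [float(len(text)), 1.0]


embedding_cache: Dict[str, List[float]] = {}
v1 = get_embedding("reset password", "embed-model-a", "prep-v1", embedding_cache, fake_embed)
v2 = get_embedding("reset password", "embed-model-a", "prep-v1", embedding_cache, fake_embed)
v3 = get_embedding("reset password", "embed-model-b", "prep-v1", embedding_cache, fake_embed)
```

Switching the embedding model or the preprocessing version changes the key, so old vectors are never mixed with new ones.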


8️⃣ Retrieval Cache


What Is Retrieval Caching?

Retrieval cache stores search results for repeated queries.

Query + filters
→ Retrieved chunks
→ Cache result

Cache Key Should Include

  • Query
  • Filters
  • Permissions
  • Index version
  • Tenant
  • Freshness requirements

Important Risk

Caching retrieval without permissions can leak data.


👉 Interview Answer

Retrieval caching stores retrieved chunks or document IDs for repeated queries.

The cache key must include filters, permissions, index version, tenant, and freshness requirements.

Otherwise, retrieval caching can cause stale results or permission leaks.
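
A sketch of a permission-aware retrieval key. The function and parameter names (`permission_groups`, `index_version`, and so on) are assumptions for illustration; the point is that two users with different access never share a cache entry.

```python
import hashlib
import json
from typing import Any, Dict, Iterable


def retrieval_cache_key(
    query: str,
    filters: Dict[str, Any],
    tenant_id: str,
    permission_groups: Iterable[str],
    index_version: str,
) -> str:
    payload = {
        "query": query,
        "filters": filters,
        "tenant_id": tenant_id,
        # Sort and de-duplicate so group order does not fragment the cache.
        "permission_groups": sorted(set(permission_groups)),
        # Bumping the index version makes older entries unreachable.
        "index_version": index_version,
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


staff_key = retrieval_cache_key("q3 revenue", {"type": "report"}, "acme",
                                ["staff", "finance"], "idx-7")
same_groups_key = retrieval_cache_key("q3 revenue", {"type": "report"}, "acme",
                                      ["finance", "staff"], "idx-7")
intern_key = retrieval_cache_key("q3 revenue", {"type": "report"}, "acme",
                                 ["staff"], "idx-7")
```

`staff_key` and `same_groups_key` are identical, while `intern_key` differs, so a user without the finance group cannot hit an entry that was filled with finance-only chunks.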


9️⃣ Tool Result Cache


Why Cache Tool Results?

Agents often call the same tools repeatedly.

Examples:


Cache Carefully

Do not blindly cache:

  • Live user-specific state
  • Security-sensitive state

Rule

Cache stable read-only results.
Avoid caching sensitive live state.

👉 Interview Answer

Tool result caching can reduce repeated external calls, but it must respect freshness, permissions, and sensitivity.

Stable read-only results are good candidates, while live user-specific or security-sensitive state should be fetched fresh.
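
One way to encode this rule is an allowlist with a per-tool TTL: tools that are not listed always run fresh. The class below is a sketch, and the tool names in the usage note are hypothetical. The clock is injectable only so the expiry logic is easy to test.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class ToolResultCache:
    """TTL cache that stores results only for explicitly allowlisted tools."""

    def __init__(
        self,
        ttl_seconds_by_tool: Dict[str, float],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds_by_tool
        self._clock = clock
        self._store: Dict[Tuple[str, Tuple[Hashable, ...]], Tuple[float, Any]] = {}

    def call(
        self,
        tool_name: str,
        args: Tuple[Hashable, ...],
        tool_fn: Callable[..., Any],
    ) -> Any:
        ttl = self._ttl.get(tool_name)
        if ttl is None:
            # Not allowlisted (live or sensitive state): always fetch fresh.
            return tool_fn(*args)
        key = (tool_name, args)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < ttl:
            return entry[1]                  # still fresh: reuse
        result = tool_fn(*args)              # missing or expired: refetch
        self._store[key] = (now, result)
        return result
```

For example, `ToolResultCache({"get_product_info": 60.0})` would reuse product lookups for a minute, while a call to an unlisted tool such as `get_account_balance` would hit the backend every time.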


🔟 Cache Invalidation


Why Invalidation Is Hard

Cached answers can become stale.


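Invalidation Triggers

  • Document changes
  • Prompt changes
  • Model version changes
  • Permission changes
  • Tool changes
  • Business rule changes

Common Strategies

  • TTL
  • Versioned cache keys
  • Event-based invalidation
  • Namespace invalidation
  • Do-not-cache policies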

👉 Interview Answer

Cache invalidation is one of the hardest parts of LLM caching.

Cached responses should be invalidated when documents, prompts, model versions, permissions, tools, or business rules change.

Versioned cache keys and TTLs are commonly used.
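
Versioned keys can be sketched as a namespace counter that is part of every key. Bumping the counter makes all old entries unreachable without deleting them one by one. `VersionedCache` and its method names are illustrative assumptions; in a real store the orphaned entries would be reclaimed by TTL or eviction.

```python
from typing import Dict, Optional


class VersionedCache:
    """Keys embed a per-namespace version; bumping it invalidates the namespace."""

    def __init__(self) -> None:
        self._versions: Dict[str, int] = {}
        self._store: Dict[str, str] = {}

    def _full_key(self, namespace: str, key: str) -> str:
        version = self._versions.get(namespace, 0)
        return f"{namespace}:v{version}:{key}"

    def get(self, namespace: str, key: str) -> Optional[str]:
        return self._store.get(self._full_key(namespace, key))

    def put(self, namespace: str, key: str, value: str) -> None:
        self._store[self._full_key(namespace, key)] = value

    def invalidate_namespace(self, namespace: str) -> None:
        # Old entries stay in the store but can no longer be addressed.
        self._versions[namespace] = self._versions.get(namespace, 0) + 1


versioned = VersionedCache()
versioned.put("refund-policy", "how long do refunds take?", "5 business days")
before = versioned.get("refund-policy", "how long do refunds take?")
versioned.invalidate_namespace("refund-policy")   # e.g. the policy document changed
after = versioned.get("refund-policy", "how long do refunds take?")
```

The namespace bump would typically be triggered by an event such as a document update, a prompt release, or a model upgrade.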


1️⃣1️⃣ Safety and Privacy Risks


Main Risks

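Caching can introduce serious issues:

  • Stale answers
  • Permission leaks
  • Cross-user data leakage
  • Incorrect personalization
  • Unsafe cached outputs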

Example

User A asks about private account data.
Response cached.
User B gets same cached response.

This is a severe security bug.


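Controls

  • Include tenant, user, permission, and context signals in cache keys when needed
  • Skip caching for sensitive or user-specific responses where possible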

👉 Interview Answer

Caching LLM responses creates privacy and safety risks.

Cache keys must include tenant, user, permission, and context signals when needed.

Sensitive or user-specific responses should often not be cached at all.


1️⃣2️⃣ Determinism and Temperature


Why Determinism Matters

Caching works best when outputs are deterministic.


Low Temperature

temperature = 0
→ More stable output
→ Better cache hit correctness

High Temperature

temperature = 1
→ More creative output
→ Cached answer may feel wrong

Rule

Cache deterministic tasks more aggressively.


👉 Interview Answer

Caching works best for deterministic or low-temperature requests.

For creative, high-temperature, or user-specific generation, caching final responses is riskier and less useful.


1️⃣3️⃣ Streaming Cache


Streaming Makes Caching Harder

Responses are generated token by token.


Options

Cache Final Response

Stream to user
→ Store final completed response

Partial Cache

Cache prefix or completed chunks

Important Rule

Do not cache incomplete or failed responses as complete outputs.


👉 Interview Answer

For streaming responses, the simplest approach is to cache only the final completed response.

Partial caching is possible, but the system must clearly distinguish complete, partial, cancelled, and failed responses.
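
The "cache only the final completed response" option can be sketched with a generator that writes to the cache after the stream ends normally. This assumes the upstream stream signals failure by raising; `stream_and_cache` is a hypothetical helper, and a real client might also need to check a finish reason before treating the output as complete.

```python
from typing import Dict, Iterable, Iterator


def stream_and_cache(key: str, chunks: Iterable[str], cache: Dict[str, str]) -> Iterator[str]:
    """Forward chunks to the caller and cache the joined text only on clean completion."""
    buffer = []
    for chunk in chunks:
        buffer.append(chunk)
        yield chunk
    # Reached only if the upstream finished without raising and the consumer
    # read to the end; errors and early cancellation skip the write.
    cache[key] = "".join(buffer)


def broken_stream() -> Iterator[str]:
    # Stand-in for a stream that fails halfway through.
    yield "par"
    raise RuntimeError("connection dropped")


stream_cache: Dict[str, str] = {}
delivered = list(stream_and_cache("ok", iter(["Hel", "lo"]), stream_cache))

try:
    list(stream_and_cache("failed", broken_stream(), stream_cache))
except RuntimeError:
    pass  # the partial text was streamed but never stored
```

After this runs, the cache holds the complete response under `"ok"` and nothing under `"failed"`.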


1️⃣4️⃣ Cache Metrics


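What to Monitor

  • Hit rate
  • Latency saved
  • Token cost saved
  • Stale hits
  • Permission errors
  • Eviction rate
  • Storage cost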

Why Important

High hit rate is not enough.

The cache must be correct.


👉 Interview Answer

Cache metrics should measure both efficiency and correctness.

I would monitor hit rate, latency saved, token cost saved, stale hits, permission errors, eviction rate, and storage cost.
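
A small sketch of counters that track efficiency and correctness together. The `CacheMetrics` class and its fields are assumptions for illustration and cover only part of the list above. How a hit gets flagged as stale (for example, by sampling and re-running requests) is outside this sketch.

```python
from dataclasses import dataclass


@dataclass
class CacheMetrics:
    hits: int = 0
    misses: int = 0
    stale_hits: int = 0
    tokens_saved: int = 0

    def record_hit(self, tokens_saved: int, stale: bool = False) -> None:
        self.hits += 1
        self.tokens_saved += tokens_saved
        if stale:
            self.stale_hits += 1

    def record_miss(self) -> None:
        self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def stale_hit_rate(self) -> float:
        # Share of hits that served outdated content: the correctness signal.
        return self.stale_hits / self.hits if self.hits else 0.0


metrics = CacheMetrics()
metrics.record_hit(tokens_saved=120)
metrics.record_hit(tokens_saved=80, stale=True)
metrics.record_hit(tokens_saved=100)
metrics.record_miss()
```

A rising `hit_rate` with a rising `stale_hit_rate` means the cache is getting cheaper and more wrong at the same time.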


1️⃣5️⃣ Decision Framework


Cache Full Responses When

  • The task is stable
  • The output is deterministic
  • The risk of a wrong answer is low

Use Semantic Cache When


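Avoid Caching When

  • The output is sensitive
  • The underlying data changes fast
  • The output is user-specific
  • The output is high-risk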

👉 Interview Answer

I would cache aggressively for stable, deterministic, low-risk tasks.

I would avoid caching for sensitive, fast-changing, user-specific, or high-risk outputs.
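
This decision can be written down as a small policy function. The trait names are assumptions for illustration; how each request gets classified is the hard part and is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTraits:
    deterministic: bool
    stable_data: bool
    sensitive: bool
    user_specific: bool
    high_risk: bool


def should_cache_response(traits: RequestTraits) -> bool:
    """Cache only stable, deterministic, low-risk requests."""
    if traits.sensitive or traits.user_specific or traits.high_risk:
        return False
    return traits.deterministic and traits.stable_data


faq_lookup = RequestTraits(deterministic=True, stable_data=True,
                           sensitive=False, user_specific=False, high_risk=False)
account_query = RequestTraits(deterministic=True, stable_data=False,
                              sensitive=True, user_specific=True, high_risk=False)

cache_faq = should_cache_response(faq_lookup)          # True
cache_account = should_cache_response(account_query)   # False
```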


1️⃣6️⃣ Best Practices


Practical Rules


Design Principle

Cache stable computation,
not unstable truth.

👉 Interview Answer

The best LLM caching strategy is layered and risk-aware.

Cache deterministic and stable computation, but avoid caching sensitive, stale, permission-dependent, or correctness-critical live state.


🧠 Staff-Level Answer Final


👉 Interview Answer Full Version

Caching is important in LLM systems because model inference is expensive, slow, and GPU-intensive.

But LLM caching is more complex than normal web caching because responses depend on prompts, model versions, parameters, retrieved context, tools, user permissions, and safety policies.

There are several layers of caching.

Exact response caching stores final answers for identical requests.

Semantic caching uses embeddings to reuse answers for similar questions.

Prompt prefix caching reuses computation for shared system prompts, instructions, tools, or static context.

Embedding caching avoids recomputing vectors for repeated text.

Retrieval caching stores retrieved chunks for repeated queries.

Tool result caching avoids repeated external calls for stable read-only data.

Each layer has different risks.

The biggest risks are stale answers, permission leaks, cross-user data leakage, incorrect personalization, and unsafe cached outputs.

Cache keys must include model version, prompt version, temperature, output schema, tenant, permissions, index version, filters, and freshness requirements where relevant.

For RAG systems, retrieval caches must be permission-aware and index-version-aware.

For user-specific or sensitive data, caching should be avoided or heavily constrained.

Cache invalidation is critical.

TTL, versioned keys, event-based invalidation, namespace invalidation, and do-not-cache policies are common strategies.

Caching works best for deterministic, low-temperature, stable, repeated tasks.

It is risky for high-temperature creative tasks, live state, financial decisions, security decisions, or personalized private data.

The core design principle is: cache stable computation, not unstable truth.


⭐ Final Insight

The core of LLM caching is not simply:

"Return the same answer for the same question."

Real LLM caching is layered caching:

  • Response Cache
  • Semantic Cache
  • Prompt Prefix Cache
  • Embedding Cache
  • Retrieval Cache
  • Tool Result Cache

But the biggest risks are:

stale answers, permission leaks, and cross-user data leakage.

The single most important line:

Cache stable computation, not unstable truth.

