d&d-t System Design Deep Dive

🎯 Design Distributed Cache

1️⃣ Core Framework

When discussing Distributed Cache design, I frame it as:

Core purpose and access patterns
Cache placement and caching strategies
Data partitioning and replication
Eviction policy and TTL
Consistency and invalidation
Hot key and cache stampede handling
Scaling and failure handling
Trade-offs: latency vs consistency vs cost

2️⃣ Core Requirements

Functional Requirements

Store key-value data
Support fast read and write
Support TTL expiration
Support cache invalidation
Support distributed nodes
Support eviction when memory is full
Support high QPS
Support basic observability

Non-functional Requirements

Very low latency
High availability
Horizontal scalability
High throughput
Memory efficient
Graceful degradation
Eventually consistent cache is usually acceptable

👉 Interview Answer

A distributed cache stores frequently accessed data in memory to reduce database load and lower read latency.

The main challenges are partitioning data across nodes, handling failures, preventing hot keys, and keeping cached data reasonably consistent with the source of truth.

3️⃣ Main APIs

Get

GET /cache/{key}

Response:

{
  "key": "user:123",
  "value": {
    "name": "Alice",
    "tier": "premium"
  },
  "ttlSeconds": 300
}

Set

PUT /cache/{key}

Request:

{
  "value": {
    "name": "Alice",
    "tier": "premium"
  },
  "ttlSeconds": 300
}

Delete / Invalidate

DELETE /cache/{key}

Batch Get

POST /cache/batch-get

Request:

{
  "keys": ["user:123", "user:456", "product:999"]
}

👉 Interview Answer

A distributed cache usually exposes simple key-value operations: get, set, delete, and batch get.

In practice, the cache is usually accessed by application services through a client library, rather than directly through public APIs.

4️⃣ Cache Placement

Client-side Cache

Cache lives inside application process.

Pros

Extremely low latency
No network call
Good for small static data

Cons

Memory duplicated across clients
Harder to invalidate
Not globally consistent

Server-side Distributed Cache

Cache cluster shared by many services.

Examples:

Redis
Memcached
DynamoDB Accelerator (DAX)

Pros

Shared cache
Centralized control
Easier invalidation
More scalable capacity

Cons

Network hop
Cluster management complexity

Multi-layer Cache

Common production pattern:

Local in-process cache
→ Distributed cache
→ Database

👉 Interview Answer

I would usually use a multi-layer caching strategy.

Local in-process cache provides extremely low latency for hot small data, while distributed cache provides shared capacity across services.

The database remains the source of truth.

5️⃣ Caching Strategies

Strategy 1: Cache-aside / Lazy Loading

Flow:

Application checks cache
→ Cache miss
→ Read database
→ Write result to cache
→ Return result

Pros

Simple
Cache only stores requested data
Works well for read-heavy systems

Cons

Cache miss is slower
Risk of stale data
Cache stampede possible

👉 Interview Answer

Cache-aside is the most common pattern.

The application first checks the cache. On cache miss, it reads from the database, writes the result back to cache, and returns the data.

This keeps the cache simple, but we need to handle stale data and cache stampede.
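
The read path described above can be sketched in a few lines. This is a minimal illustration, not a production client: the in-memory dict stands in for the distributed cache, and `load_from_db`, `CacheAside`, and the 300-second default TTL are assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, List, Optional, Tuple


class CacheAside:
    """Cache-aside read path; a dict stands in for the cache cluster."""

    def __init__(self, load_from_db: Callable[[str], Optional[Any]],
                 ttl_seconds: float = 300.0) -> None:
        self._load_from_db = load_from_db  # hypothetical database loader
        self._ttl = ttl_seconds
        self._store: Dict[str, Tuple[Any, float]] = {}  # key -> (value, expires_at)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]                       # cache hit
        value = self._load_from_db(key)           # cache miss: read the database
        if value is not None:
            # Write the result back so the next read is a hit.
            self._store[key] = (value, time.monotonic() + self._ttl)
        return value


# Usage: the second read is served from the cache, so the database is read once.
database = {"user:123": {"name": "Alice", "tier": "premium"}}
db_reads: List[str] = []


def load(key: str) -> Optional[Any]:
    db_reads.append(key)
    return database.get(key)


cache = CacheAside(load)
first = cache.get("user:123")
second = cache.get("user:123")
```

Note that a missing key is never cached here, so repeated lookups for it keep reaching the database.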

Strategy 2: Read-through Cache

Flow:

Application → Cache
Cache loads from database on miss

Pros

Application logic is simpler
Cache layer owns loading logic

Cons

Cache becomes more complex
Tighter coupling between cache and database

Strategy 3: Write-through Cache

Flow:

Application writes cache
→ Cache writes database synchronously

Pros

Cache and database stay more consistent
Future reads are fast

Cons

Write latency increases
Cache depends on database write success

Strategy 4: Write-behind / Write-back Cache

Flow:

Application writes cache
→ Cache asynchronously writes database

Pros

Very fast writes
Good for high write throughput

Cons

Risk of data loss if cache fails
More complex durability requirements
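
To make the write-behind trade-off concrete, here is a small sketch under simplifying assumptions: the database is a plain dict, the pending queue lives in process memory, and `flush()` is called by hand rather than by a background worker. The class and method names are illustrative only.

```python
from collections import deque
from typing import Any, Deque, Dict, Optional, Tuple


class WriteBehindCache:
    """Writes are acknowledged from the cache and persisted later."""

    def __init__(self, database: Dict[str, Any]) -> None:
        self._database = database
        self._cache: Dict[str, Any] = {}
        self._pending: Deque[Tuple[str, Any]] = deque()

    def put(self, key: str, value: Any) -> None:
        self._cache[key] = value
        # The write returns before the database has seen it.
        self._pending.append((key, value))

    def get(self, key: str) -> Optional[Any]:
        return self._cache.get(key)

    def flush(self) -> int:
        """Drain pending writes to the database; returns how many were written."""
        flushed = 0
        while self._pending:
            key, value = self._pending.popleft()
            self._database[key] = value
            flushed += 1
        return flushed


database: Dict[str, Any] = {}
cache = WriteBehindCache(database)
cache.put("counter:page_views", 42)
visible_in_cache = cache.get("counter:page_views")
in_db_before_flush = "counter:page_views" in database  # still only in the cache
flushed = cache.flush()
in_db_after_flush = "counter:page_views" in database
```

If the process dies between `put` and `flush`, the pending writes are gone, which is the data-loss risk listed above. A write-through variant would write to the database inside `put` and only then return.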

6️⃣ Data Partitioning

Why Partition?

A single cache node cannot handle all data or traffic.

We need to split keys across many nodes.

Hash-based Partitioning

hash(key) % number_of_nodes

Pros

Simple
Even distribution

Cons

Adding or removing nodes remaps many keys
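
A quick way to see the remapping cost is to count how many keys change owner when a cluster grows from 4 to 5 nodes. The sketch below assumes MD5 as a stable hash (Python's built-in `hash` is randomized per process for strings) and synthetic `user:<n>` keys.

```python
import hashlib


def stable_hash(key: str) -> int:
    # First 8 bytes of MD5 as an integer; stable across processes.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def node_for(key: str, num_nodes: int) -> int:
    return stable_hash(key) % num_nodes


keys = [f"user:{i}" for i in range(10_000)]
moved = sum(1 for k in keys if node_for(k, 4) != node_for(k, 5))
moved_fraction = moved / len(keys)
```

A key keeps its owner only when `h % 4 == h % 5`, which holds for 4 of every 20 hash values, so roughly 80% of keys are expected to move. After such a change most reads would miss until the cache refills.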

Consistent Hashing

hash ring with virtual nodes

Pros

Reduces key movement during scaling
Better for dynamic clusters
Supports horizontal scaling

Cons

More complex
Needs good virtual node distribution
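
A hash ring with virtual nodes can be sketched as a sorted list of positions plus a lookup by binary search. The names (`HashRing`, `cache-1`, and so on), the MD5-based hash, and the choice of 100 virtual nodes per physical node are assumptions for illustration, not taken from any particular cache product.

```python
import bisect
import hashlib
from typing import Dict, List


def stable_hash(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: List[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: List[int] = []         # sorted virtual-node positions
        self._owner: Dict[int, str] = {}   # position -> physical node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each physical node is placed on the ring at many positions.
        for i in range(self._vnodes):
            pos = stable_hash(f"{node}#{i}")
            self._owner[pos] = node
            bisect.insort(self._ring, pos)

    def remove_node(self, node: str) -> None:
        self._ring = [p for p in self._ring if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key: str) -> str:
        # First virtual node clockwise from the key's position, wrapping around.
        idx = bisect.bisect_right(self._ring, stable_hash(key)) % len(self._ring)
        return self._owner[self._ring[idx]]


keys = [f"user:{i}" for i in range(2_000)]
ring = HashRing(["cache-1", "cache-2", "cache-3", "cache-4"])
before = {k: ring.node_for(k) for k in keys}
ring.add_node("cache-5")
after = {k: ring.node_for(k) for k in keys}
moved_keys = [k for k in keys if before[k] != after[k]]
moved_fraction = len(moved_keys) / len(keys)
```

Adding a fifth node should move roughly one fifth of the keys, and every moved key lands on the new node. Fewer virtual nodes make that share less even.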

Rendezvous Hashing

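An alternative to ring-based consistent hashing, also called highest-random-weight (HRW) hashing. Each key is assigned to the node with the highest hash(key, node) score.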

Pros

Simple client-side implementation
Good distribution
Minimal movement on membership changes
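
Rendezvous hashing fits in a few lines, which is why it is attractive for client-side routing. In this sketch the scoring function (MD5 over `node|key`) and the node names are assumptions for the example.

```python
import hashlib
from typing import List


def score(key: str, node: str) -> int:
    digest = hashlib.md5(f"{node}|{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def rendezvous_node(key: str, nodes: List[str]) -> str:
    # The key belongs to whichever node scores highest for it.
    return max(nodes, key=lambda node: score(key, node))


nodes = ["cache-1", "cache-2", "cache-3", "cache-4"]
keys = [f"product:{i}" for i in range(2_000)]
before = {k: rendezvous_node(k, nodes) for k in keys}

# Simulate losing cache-2: only the keys it owned should be reassigned.
remaining = [n for n in nodes if n != "cache-2"]
after = {k: rendezvous_node(k, remaining) for k in keys}
unaffected_stable = all(after[k] == before[k] for k in keys if before[k] != "cache-2")
```

The cost is that each lookup scores every node, so this suits clusters with a modest number of nodes.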

👉 Interview Answer

I would use consistent hashing or rendezvous hashing to distribute keys across cache nodes.

This avoids massive key movement when nodes are added or removed, and allows the cluster to scale horizontally.

7️⃣ Replication and Availability

Why Replicate?

Node failure should not lose hot data completely
Improve availability
Improve read throughput

Primary-replica Model

key → primary node + replica nodes

Writes go to primary.

Reads can go to:

primary only
replicas
nearest replica

Replication Factor

Example:

replication_factor = 2 or 3

Trade-off

Choice	Pros	Cons
No replication	Simple, cheaper	Cache loss on node failure
Async replication	Faster	Temporary inconsistency
Sync replication	More consistent	Higher write latency

👉 Interview Answer

I would replicate cache entries across multiple nodes to improve availability.

Since the database is still the source of truth, cache replication can usually be asynchronous.

If a cache node fails, the system can read from a replica or fall back to the database.

8️⃣ Eviction Policy and TTL

Why Eviction?

Cache memory is limited.

When memory is full, the cache must remove some entries.

Common Eviction Policies

Policy	Meaning	Use Case
LRU	Evict least recently used	General-purpose cache
LFU	Evict least frequently used	Stable hot keys
FIFO	Evict oldest item	Simple systems
Random	Evict random item	Low overhead
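
As an illustration of the LRU row, here is a compact single-node sketch built on `collections.OrderedDict`. The class name and the capacity of two entries are chosen for the example. Real caches often use approximations of LRU rather than an exact ordering.

```python
from collections import OrderedDict
from typing import Any, Optional


class LRUCache:
    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._items: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)          # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: Any) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)   # evict the least recently used entry


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" becomes the most recently used entry
cache.put("c", 3)    # over capacity, so "b" is evicted
```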

TTL

TTL automatically expires data.

Example:

user_profile TTL = 5 minutes
product_catalog TTL = 1 hour
feature_flags TTL = 30 seconds

TTL Jitter

Add randomization:

TTL = 300s ± random(0, 60s)

Why?

Prevent many keys from expiring at the same time
Reduce cache stampede
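
The jitter formula above translates directly into code. This sketch assumes a symmetric uniform jitter, so a 300-second base with 60 seconds of jitter yields TTLs between 240 and 360 seconds. The function name and the seeded generator are illustrative.

```python
import random


def ttl_with_jitter(base_seconds: float, max_jitter_seconds: float,
                    rng: random.Random) -> float:
    # Spread expirations uniformly over [base - jitter, base + jitter].
    return base_seconds + rng.uniform(-max_jitter_seconds, max_jitter_seconds)


rng = random.Random(42)  # seeded only to make the example repeatable
ttls = [ttl_with_jitter(300, 60, rng) for _ in range(1_000)]
```

Keys written in the same second then expire over a two-minute window instead of all at once.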

👉 Interview Answer

I would use TTLs to prevent stale data from living forever, and an eviction policy like LRU or LFU when memory is full.

I would also add TTL jitter so many hot keys do not expire at exactly the same time, which helps prevent cache stampede.

9️⃣ Consistency and Invalidation

Cache Consistency Problem

Database update happens, but cache may still contain old value.

Option 1: Delete Cache on Write

Flow:

Update database
→ Delete cache key

Next read reloads from DB.

Pros

Simple
Common with cache-aside

Cons

Small window of stale reads
Race conditions possible
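
The delete-on-write flow pairs with a cache-aside read. The sketch below uses two dicts as stand-ins for the database and the cache. It runs on a single thread, so it does not show the race in which a reader that loaded the old value before the write puts it back into the cache after the delete.

```python
from typing import Any, Dict, Optional

database: Dict[str, Any] = {"user:123": {"tier": "free"}}
cache: Dict[str, Any] = {}


def read(key: str) -> Optional[Any]:
    if key in cache:
        return cache[key]
    value = database.get(key)
    if value is not None:
        cache[key] = value
    return value


def write(key: str, value: Any) -> None:
    database[key] = value    # 1. update the source of truth
    cache.pop(key, None)     # 2. delete the cached copy; the next read reloads it


before_write = read("user:123")             # populates the cache
write("user:123", {"tier": "premium"})
cached_after_write = "user:123" in cache    # the key was deleted
after_write = read("user:123")              # reloads the new value
```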

Option 2: Update Cache on Write

Flow:

Update database
→ Update cache value

Pros

Cache stays warm
Fewer misses

Cons

More complex
Risk of cache/database mismatch

Option 3: Event-based Invalidation

Flow:

Database update
→ Publish event
→ Cache invalidation worker deletes keys

Pros

Decoupled
Good for multi-service systems

Cons

Event delay
Event loss must be handled

👉 Interview Answer

For cache-aside, I would usually update the database first, then delete the cache key.

This keeps the database as the source of truth and avoids writing stale values into the cache.

In larger systems, cache invalidation can be event-driven, where database updates publish invalidation events.

🔟 Cache Stampede and Thundering Herd

Problem

A hot key expires.

Many requests miss cache at the same time.

All requests hit the database.

hot key expires
→ thousands of requests miss
→ database overload

Solutions

1. Request Coalescing

Only one request rebuilds cache.

Other requests wait for that result or serve a stale value.

2. Distributed Lock

acquire lock for key
→ only lock holder loads DB
→ others wait/retry

3. Serve Stale While Revalidate

return stale cache value
→ refresh cache asynchronously

4. TTL Jitter

Prevent simultaneous expiration.
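
Request coalescing within one process can be sketched with a lock and an event per in-flight key. This is an assumption-laden illustration: `SingleFlight`, `slow_db_load`, and the 0.3-second simulated query are made up for the example. It coalesces only within one process; across many application servers a distributed lock would be needed as well.

```python
import threading
import time
from typing import Any, Callable, Dict, List, Optional


class _Call:
    """State shared between the leading request and those waiting on it."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.value: Any = None
        self.error: Optional[BaseException] = None


class SingleFlight:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, _Call] = {}

    def do(self, key: str, loader: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._inflight.get(key)
            is_leader = call is None
            if is_leader:
                call = _Call()
                self._inflight[key] = call
        if is_leader:
            try:
                call.value = loader()          # only the leader hits the database
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()                # wake up the waiting requests
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.value


db_loads: List[str] = []


def slow_db_load() -> str:
    db_loads.append("hot_key")
    time.sleep(0.3)                            # simulated slow query
    return "value-from-db"


flight = SingleFlight()
results: List[str] = []


def worker() -> None:
    results.append(flight.do("hot_key", slow_db_load))


threads = [threading.Thread(target=worker) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All twenty requests receive the same value, while the loader runs once as long as they arrive while the first load is still in flight.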

👉 Interview Answer

Cache stampede happens when a hot key expires and many requests hit the database at the same time.

I would use request coalescing, distributed locks, stale-while-revalidate, and TTL jitter to prevent database overload.

1️⃣1️⃣ Hot Key Problem

What Is a Hot Key?

One key receives extremely high traffic.

Examples:

celebrity_profile:123
product:iphone_launch
homepage_config

Problems

One cache node overloaded
Increased latency
Node failure affects many requests

Solutions

1. Replicate Hot Keys

Store hot key on multiple nodes.

2. Local Cache

Cache hot key inside application process.

3. Key Splitting

Create multiple physical keys:

hot_key:1
hot_key:2
hot_key:3

Requests randomly read one copy.

4. CDN / Edge Cache

For public data.
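
Key splitting can be sketched as writing the same value under several suffixed keys and picking one at random on each read. The helper names and the three-copy setup are assumptions for the example. In a real cluster the suffixed keys would hash to different nodes, which is what spreads the load.

```python
import random
from collections import Counter
from typing import Any, Dict, Optional


def write_hot_key(cache: Dict[str, Any], key: str, value: Any, copies: int) -> None:
    # Every copy must be written (and invalidated) together.
    for i in range(1, copies + 1):
        cache[f"{key}:{i}"] = value


def pick_copy(key: str, copies: int, rng: random.Random) -> str:
    return f"{key}:{rng.randint(1, copies)}"


def read_hot_key(cache: Dict[str, Any], key: str, copies: int,
                 rng: random.Random) -> Optional[Any]:
    return cache.get(pick_copy(key, copies, rng))


cache: Dict[str, Any] = {}
write_hot_key(cache, "hot_key", {"title": "launch"}, copies=3)

rng = random.Random(7)  # seeded only to make the example repeatable
reads = Counter(pick_copy("hot_key", 3, rng) for _ in range(3_000))
value = read_hot_key(cache, "hot_key", 3, rng)
```

The trade-off is write amplification: an update or invalidation has to touch every copy.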

👉 Interview Answer

Hot keys can overload a single cache node even if the cluster is large.

To handle this, I would replicate hot keys, use local in-process cache, split hot keys into multiple physical keys, or cache public data at the edge.

1️⃣2️⃣ Cache Penetration

Problem

Requests repeatedly ask for non-existing keys.

Example:

user:invalid_id
product:not_found

Each request misses cache and hits DB.

Solutions

1. Negative Caching

Cache “not found” result.

user:invalid_id → NULL, TTL = 60s

2. Bloom Filter

Before querying DB, check whether key may exist.

3. Input Validation

Reject invalid keys early.
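
The first two defenses can be combined in one lookup path. The sketch below is illustrative: the Bloom filter sizes, the SHA-256 based positions, and the `NOT_FOUND` sentinel are assumptions, and the negative entry has no TTL here, whereas a real one would expire after a short time such as the 60 seconds mentioned above.

```python
import hashlib
from typing import Any, Dict, Iterator, List, Optional


class BloomFilter:
    """May report false positives, never false negatives."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4) -> None:
        self._size = size_bits
        self._num_hashes = num_hashes
        self._bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str) -> Iterator[int]:
        for i in range(self._num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self._size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self._bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


NOT_FOUND = object()  # sentinel stored in the cache for missing keys

database: Dict[str, Any] = {"user:1": {"name": "Alice"}}
known_keys = BloomFilter()
for existing_key in database:
    known_keys.add(existing_key)

cache: Dict[str, Any] = {}
db_reads: List[str] = []


def lookup(key: str) -> Optional[Any]:
    if not known_keys.might_contain(key):
        return None                       # definitely absent: skip cache and DB
    if key in cache:
        cached = cache[key]
        return None if cached is NOT_FOUND else cached
    db_reads.append(key)
    value = database.get(key)
    cache[key] = NOT_FOUND if value is None else value   # negative caching
    return value


found = lookup("user:1")
missing_first = lookup("user:invalid_id")
missing_second = lookup("user:invalid_id")
```

Even if the filter returns a false positive for a missing key, the negative entry keeps the second lookup from reaching the database.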

👉 Interview Answer

Cache penetration happens when many requests ask for keys that do not exist.

I would use negative caching, Bloom filters, and input validation to avoid repeatedly hitting the database for invalid keys.

1️⃣3️⃣ Scaling Patterns

Pattern 1: Consistent Hashing

Distribute keys across nodes with minimal movement.

Pattern 2: Client-side Routing

Cache client decides which node owns a key.

Pros:

Avoids proxy bottleneck
Lower latency

Cons:

Client needs cluster membership info

Pattern 3: Proxy-based Routing

Application talks to cache proxy.

Proxy routes request to correct node.

Pros:

Simpler clients
Centralized routing

Cons:

Proxy can become bottleneck

Pattern 4: Multi-layer Cache

Local cache → Distributed cache → Database

Pattern 5: Shard by Tenant or Region

Useful for isolation and compliance.

👉 Interview Answer

To scale a distributed cache, I would shard keys using consistent hashing, replicate important data, use multi-layer caching, and choose between client-side routing and proxy-based routing.

Client-side routing gives lower latency, while proxy routing simplifies application clients.

1️⃣4️⃣ Failure Handling

Common Failures

Cache node unavailable
Network timeout
Cache cluster partition
Hot key overload
Memory pressure
Redis failover
Cache data loss
Database overload after cache failure

Strategies

Fall back to database
Use circuit breaker
Use timeout budget
Serve stale value
Use replicas
Apply load shedding
Warm up cache after failure
Protect database with rate limiting

Cache Failure Rule

Cache is an optimization, not the source of truth.

👉 Interview Answer

The system should work when cache fails, although with higher latency.

I would use short timeouts, circuit breakers, fallback to database, stale reads when acceptable, and rate limiting to protect the database.
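
A fallback path with a simple circuit breaker might look like the sketch below. The threshold of three failures, the cool-down, and the `cache_get` / `db_get` callables are assumptions for illustration. Rate limiting in front of the database is left out.

```python
import time
from typing import Any, Callable, List, Optional


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3,
                 reset_after_seconds: float = 5.0) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cool-down, let a trial request through (half-open).
        return time.monotonic() - self._opened_at >= self._reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = time.monotonic()


def get_with_fallback(key: str,
                      cache_get: Callable[[str], Optional[Any]],
                      db_get: Callable[[str], Optional[Any]],
                      breaker: CircuitBreaker) -> Optional[Any]:
    if breaker.allow():
        try:
            value = cache_get(key)
            breaker.record_success()
            if value is not None:
                return value
        except (TimeoutError, ConnectionError):
            breaker.record_failure()
    return db_get(key)                    # the cache is only an optimization


cache_calls: List[str] = []


def failing_cache_get(key: str) -> Optional[Any]:
    cache_calls.append(key)
    raise TimeoutError("cache timed out")


def db_get(key: str) -> Optional[Any]:
    return {"key": key, "source": "db"}


breaker = CircuitBreaker(failure_threshold=3, reset_after_seconds=60.0)
responses = [get_with_fallback("user:123", failing_cache_get, db_get, breaker)
             for _ in range(5)]
```

Once the breaker opens after the third timeout, the remaining requests skip the cache and go straight to the database, so callers stop paying the timeout on every request.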

Cache should not be treated as the source of truth unless we are explicitly designing a durable cache.

1️⃣5️⃣ Observability

Key Metrics

Cache hit rate
Cache miss rate
Latency p50 / p95 / p99
Eviction count
Expired key count
Memory usage
Hot key distribution
Error rate
Replication lag
Database fallback rate
Stampede events

Important Dashboards

Cluster health
Node memory usage
Hit/miss ratio
Hot keys
Evictions
DB fallback traffic
Cache latency

👉 Interview Answer

Observability is critical for cache systems.

I would monitor hit rate, miss rate, cache latency, eviction count, memory usage, hot keys, replication lag, and database fallback traffic.

A falling hit rate or sudden DB fallback spike can indicate cache failure or bad TTL configuration.

1️⃣6️⃣ Consistency Model

Stronger Consistency Needed For

Financial balances
Inventory counts
Authentication / authorization decisions
Critical configuration

Eventual Consistency Acceptable For

User profile display
Product details
Feed objects
Search results
Recommendation features
Analytics counters

👉 Interview Answer

Cache data is usually eventually consistent.

For most read-heavy data, slightly stale values are acceptable.

But for critical data like financial balances, inventory, or authorization, we should either avoid caching, use very short TTLs, or enforce stronger invalidation and verify the value against the source of truth on read.

1️⃣7️⃣ Security and Access Control

Requirements

Encrypt traffic between clients and cache
Restrict cache access by service
Avoid storing secrets unless necessary
Support tenant isolation
Audit admin operations
Protect against cache poisoning

👉 Interview Answer

A distributed cache can contain sensitive data, so access control matters.

I would restrict which services can access which key namespaces, encrypt traffic, avoid storing secrets, and protect against cache poisoning by validating keys and values before writing.

1️⃣8️⃣ End-to-End Flow

Cache-aside Read Flow

Application receives request
→ Check local cache
→ Check distributed cache
→ Cache miss
→ Read database
→ Write value to distributed cache
→ Optionally write local cache
→ Return response

Write Flow with Invalidation

Application updates database
→ Delete cache key
→ Publish invalidation event
→ Other services remove local cache

Hot Key Flow

Detect hot key
→ Replicate to multiple cache nodes
→ Enable local cache
→ Add TTL jitter
→ Serve stale while revalidate if needed

Key Insight

Distributed Cache is not just faster storage — it is a consistency and traffic-shaping layer in front of the source of truth.

🧠 Staff-Level Answer (Final)

👉 Interview Answer (Full Version)

When designing a distributed cache, I think of it as a low-latency key-value layer that reduces database load and improves read performance.

The database remains the source of truth, while the cache stores frequently accessed data.

I would usually use a cache-aside pattern: the application first checks the cache, reads from the database on cache miss, then writes the result back to the cache.

For data distribution, I would use consistent hashing or rendezvous hashing to spread keys across cache nodes while minimizing key movement during scaling.

To improve availability, important cache entries can be replicated asynchronously.

I would use TTLs and eviction policies like LRU or LFU to control memory usage, and add TTL jitter to avoid many keys expiring at the same time.

Cache consistency is one of the hardest parts. For cache-aside, I would update the database first, then delete or invalidate the cache key.

For large systems, invalidation can be event-driven.

To handle cache stampede, I would use request coalescing, distributed locks, stale-while-revalidate, and TTL jitter.

To handle hot keys, I would use local cache, hot key replication, key splitting, or edge caching for public data.

The main trade-offs are latency, consistency, availability, memory cost, and operational complexity.

Ultimately, the goal is to reduce backend load and serve hot data with very low latency, while keeping stale data and failure impact under control.

⭐ Final Insight

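The core of a distributed cache is not "a faster database". It is a low-latency, scalable, gracefully degradable traffic-protection layer built in front of the source of truth.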

