🎯 Design Distributed Cache
1️⃣ Core Framework
When discussing Distributed Cache design, I frame it as:
- Core purpose and access patterns
- Cache placement and caching strategies
- Data partitioning and replication
- Eviction policy and TTL
- Consistency and invalidation
- Hot key and cache stampede handling
- Scaling and failure handling
- Trade-offs: latency vs consistency vs cost
2️⃣ Core Requirements
Functional Requirements
- Store key-value data
- Support fast read and write
- Support TTL expiration
- Support cache invalidation
- Support distributed nodes
- Support eviction when memory is full
- Support high QPS
- Support basic observability
Non-functional Requirements
- Very low latency
- High availability
- Horizontal scalability
- High throughput
- Memory efficient
- Graceful degradation
- Eventual consistency with the source of truth is usually acceptable
👉 Interview Answer
A distributed cache stores frequently accessed data in memory to reduce database load and lower read latency.
The main challenges are partitioning data across nodes, handling failures, preventing hot keys, and keeping cached data reasonably consistent with the source of truth.
3️⃣ Main APIs
Get
GET /cache/{key}
Response:
{
  "key": "user:123",
  "value": {
    "name": "Alice",
    "tier": "premium"
  },
  "ttlSeconds": 300
}
Set
PUT /cache/{key}
Request:
{
  "value": {
    "name": "Alice",
    "tier": "premium"
  },
  "ttlSeconds": 300
}
Delete / Invalidate
DELETE /cache/{key}
Batch Get
POST /cache/batch-get
Request:
{
  "keys": ["user:123", "user:456", "product:999"]
}
👉 Interview Answer
A distributed cache usually exposes simple key-value operations: get, set, delete, and batch get.
In practice, the cache is usually accessed by application services through a client library, rather than directly through public APIs.
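To make the four operations concrete, here is a minimal single-node sketch. The class name `InMemoryCache`, the `ttl_seconds` parameter, and the injectable `clock` are my own assumptions for illustration; a real client library would add networking, serialization, and routing on top.

```python
import time
from typing import Any, Callable, Dict, Iterable, Optional, Tuple


class InMemoryCache:
    """Single-node stand-in for a cache client (hypothetical, for illustration)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # key -> (value, absolute expiry time, or None for "never expires")
        self._data: Dict[str, Tuple[Any, Optional[float]]] = {}

    def set(self, key: str, value: Any, ttl_seconds: Optional[float] = None) -> None:
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            # Lazy expiration: drop the entry on the first read after expiry.
            del self._data[key]
            return None
        return value

    def delete(self, key: str) -> bool:
        # True if the key was present, False otherwise.
        return self._data.pop(key, None) is not None

    def batch_get(self, keys: Iterable[str]) -> Dict[str, Any]:
        # Only hits are returned; missing or expired keys are simply absent.
        result: Dict[str, Any] = {}
        for key in keys:
            value = self.get(key)
            if value is not None:
                result[key] = value
        return result
```

Injecting the clock keeps TTL behavior testable without sleeping.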
4️⃣ Cache Placement
Client-side Cache
Cache lives inside application process.
Pros
- Extremely low latency
- No network call
- Good for small static data
Cons
- Memory duplicated across clients
- Harder to invalidate
- Not globally consistent
Server-side Distributed Cache
Cache cluster shared by many services.
Examples:
Redis
Memcached
DynamoDB Accelerator (DAX)
Pros
- Shared cache
- Centralized control
- Easier invalidation
- More scalable capacity
Cons
- Network hop
- Cluster management complexity
Multi-layer Cache
Common production pattern:
Local in-process cache
→ Distributed cache
→ Database
👉 Interview Answer
I would usually use a multi-layer caching strategy.
Local in-process cache provides extremely low latency for hot small data, while distributed cache provides shared capacity across services.
The database remains the source of truth.
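The lookup order can be sketched as follows. Plain dicts stand in for the local and distributed tiers, and `load_from_db` is a hypothetical loader callback; none of these names come from a real library.

```python
from typing import Any, Callable, Dict, Optional


def tiered_get(
    key: str,
    local: Dict[str, Any],
    distributed: Dict[str, Any],
    load_from_db: Callable[[str], Optional[Any]],
) -> Optional[Any]:
    """Look a key up tier by tier, filling the faster tiers on the way back."""
    if key in local:
        return local[key]  # local hit: no network call
    if key in distributed:
        value = distributed[key]  # one network hop in a real system
        local[key] = value
        return value
    value = load_from_db(key)  # miss in both tiers: go to the source of truth
    if value is not None:
        distributed[key] = value
        local[key] = value
    return value
```

In a real deployment the local tier would need its own small TTL or an invalidation channel, since it is the hardest tier to keep fresh.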
5️⃣ Caching Strategies
Strategy 1: Cache-aside / Lazy Loading
Flow:
Application checks cache
→ Cache miss
→ Read database
→ Write result to cache
→ Return result
Pros
- Simple
- Cache only stores requested data
- Works well for read-heavy systems
Cons
- Cache miss is slower
- Risk of stale data
- Cache stampede possible
👉 Interview Answer
Cache-aside is the most common pattern.
The application first checks the cache. On cache miss, it reads from the database, writes the result back to cache, and returns the data.
This keeps the cache simple, but we need to handle stale data and cache stampede.
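A minimal sketch of the cache-aside read path with a TTL on write-back. The function name `cache_aside_get`, the dict-based cache layout, and the `read_db` callback are assumptions for illustration only.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

# key -> (value, absolute expiry time)
CacheStore = Dict[str, Tuple[Any, float]]


def cache_aside_get(
    key: str,
    cache: CacheStore,
    read_db: Callable[[str], Optional[Any]],
    ttl_seconds: float = 300.0,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[Any]:
    entry = cache.get(key)
    if entry is not None and clock() < entry[1]:
        return entry[0]  # cache hit
    value = read_db(key)  # miss or expired: read the source of truth
    if value is not None:
        cache[key] = (value, clock() + ttl_seconds)  # write the result back
    return value
```

Note that this version does nothing about concurrent misses on the same key, which is exactly the stampede risk listed in the cons.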
Strategy 2: Read-through Cache
Flow:
Application → Cache
Cache loads from database on miss
Pros
- Application logic is simpler
- Cache layer owns loading logic
Cons
- Cache becomes more complex
- Tighter coupling between cache and database
Strategy 3: Write-through Cache
Flow:
Application writes cache
→ Cache writes database synchronously
Pros
- Cache and database stay more consistent
- Future reads are fast
Cons
- Write latency increases
- Cache depends on database write success
Strategy 4: Write-behind / Write-back Cache
Flow:
Application writes cache
→ Cache asynchronously writes database
Pros
- Very fast writes
- Good for high write throughput
Cons
- Risk of data loss if cache fails
- More complex durability requirements
Recommended
For most backend systems:
Cache-aside for read-heavy data
Write-through only when consistency is more important than write latency
Write-behind only when data loss risk is acceptable or durable queue exists
👉 Interview Answer
For most systems, I would start with cache-aside because it is simple and keeps the database as the source of truth.
If stronger consistency is needed, write-through can be used.
Write-behind is faster, but it requires careful durability guarantees.
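The difference between the two write strategies is easiest to see side by side. This is a sketch under simplifying assumptions: dicts stand in for the database and the cache, the class names are hypothetical, and the write-through variant writes the database before the cache so that a failed database write never leaves a value only in the cache.

```python
from collections import deque
from typing import Any, Deque, Dict, Tuple


class WriteThroughCache:
    def __init__(self, db: Dict[str, Any]) -> None:
        self.db = db
        self.cache: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self.db[key] = value     # synchronous write; the caller waits for it
        self.cache[key] = value  # cache is updated only after the DB write succeeded


class WriteBehindCache:
    def __init__(self, db: Dict[str, Any]) -> None:
        self.db = db
        self.cache: Dict[str, Any] = {}
        # In-memory buffer: anything still queued is lost if the process dies.
        self.pending: Deque[Tuple[str, Any]] = deque()

    def put(self, key: str, value: Any) -> None:
        self.cache[key] = value
        self.pending.append((key, value))  # returns without touching the DB

    def flush(self) -> int:
        """Drain queued writes to the database; returns how many were applied."""
        flushed = 0
        while self.pending:
            key, value = self.pending.popleft()
            self.db[key] = value
            flushed += 1
        return flushed
```

Replacing `pending` with a durable queue is what turns write-behind from "fast but lossy" into something safe to run in production.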
6️⃣ Data Partitioning
Why Partition?
A single cache node cannot handle all data or traffic.
We need to split keys across many nodes.
Hash-based Partitioning
hash(key) % number_of_nodes
Pros
- Simple
- Even distribution
Cons
- Adding or removing a node remaps most keys (going from N to N+1 nodes moves roughly N/(N+1) of them)
Consistent Hashing
hash ring with virtual nodes
Pros
- Reduces key movement during scaling
- Better for dynamic clusters
- Supports horizontal scaling
Cons
- More complex
- Needs good virtual node distribution
Rendezvous Hashing
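An alternative to ring-based consistent hashing (also called highest-random-weight hashing): each client scores every node with hash(key, node) and picks the node with the highest score.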
Pros
- Simple client-side implementation
- Good distribution
- Minimal movement on membership changes
👉 Interview Answer
I would use consistent hashing or rendezvous hashing to distribute keys across cache nodes.
This avoids massive key movement when nodes are added or removed, and allows the cluster to scale horizontally.
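Both schemes fit in a few lines. The sketch below assumes MD5 purely as a stable, well-distributed hash (not for security) and 100 virtual nodes per physical node; `HashRing` and `rendezvous_node` are illustrative names, not a library API.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(value: str) -> int:
    # Stable across processes, unlike Python's built-in hash() for str.
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: List[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._points: List[int] = []      # sorted ring positions
        self._owner: Dict[int, str] = {}  # ring position -> physical node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key: str) -> str:
        if not self._points:
            raise ValueError("ring has no nodes")
        # First virtual node clockwise from the key's position, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


def rendezvous_node(key: str, nodes: List[str]) -> str:
    # Highest-random-weight: every node scores the key and the top score wins.
    return max(nodes, key=lambda node: _hash(f"{node}|{key}"))
```

With either scheme, adding a node only moves keys onto the new node, and removing a node only moves the keys that node owned.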
7️⃣ Replication and Availability
Why Replicate?
- Node failure should not lose hot data completely
- Improve availability
- Improve read throughput
Primary-replica Model
key → primary node + replica nodes
Writes go to primary.
Reads can go to:
- primary only
- replicas
- nearest replica
Replication Factor
Example:
replication_factor = 2 or 3
Trade-off
| Choice | Pros | Cons |
|---|---|---|
| No replication | Simple, cheaper | Cache loss on node failure |
| Async replication | Faster | Temporary inconsistency |
| Sync replication | More consistent | Higher write latency |
👉 Interview Answer
I would replicate cache entries across multiple nodes to improve availability.
Since the database is still the source of truth, cache replication can usually be asynchronous.
If a cache node fails, the system can read from a replica or fall back to the database.
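One way to pick the primary and its replicas is to rank nodes per key and take the top `replication_factor`. This sketch assumes rendezvous-style scoring with SHA-256; `replica_set` and `pick_read_node` are hypothetical helpers, and real systems often derive replicas from ring successors instead.

```python
import hashlib
from typing import List, Optional, Set


def replica_set(key: str, nodes: List[str], replication_factor: int = 3) -> List[str]:
    """Return [primary, replica, ...]: the top-scoring nodes for this key."""

    def score(node: str) -> int:
        digest = hashlib.sha256(f"{node}|{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    ranked = sorted(nodes, key=score, reverse=True)
    return ranked[:replication_factor]


def pick_read_node(replicas: List[str], healthy: Set[str]) -> Optional[str]:
    """Prefer the primary, then replicas in order; None means fall back to the DB."""
    for node in replicas:
        if node in healthy:
            return node
    return None
```

Because the ranking is per key, losing the primary promotes the first replica without reshuffling the rest of the set.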
8️⃣ Eviction Policy and TTL
Why Eviction?
Cache memory is limited.
When memory is full, the cache must remove some entries.
Common Eviction Policies
| Policy | Meaning | Use Case |
|---|---|---|
| LRU | Evict least recently used | General-purpose cache |
| LFU | Evict least frequently used | Stable hot keys |
| FIFO | Evict oldest item | Simple systems |
| Random | Evict random item | Low overhead |
TTL
TTL automatically expires data.
Example:
user_profile TTL = 5 minutes
product_catalog TTL = 1 hour
feature_flags TTL = 30 seconds
TTL Jitter
Add randomization:
TTL = 300s ± random(0, 60s)
Why?
- Prevent many keys from expiring at the same time
- Reduce cache stampede
👉 Interview Answer
I would use TTLs to prevent stale data from living forever, and an eviction policy like LRU or LFU when memory is full.
I would also add TTL jitter so many hot keys do not expire at exactly the same time, which helps prevent cache stampede.
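A compact sketch of LRU eviction and TTL jitter. `LRUCache` is built on `collections.OrderedDict`, and `jittered_ttl` reads the formula above as a uniform draw in [base - jitter, base + jitter]; both names and that interpretation are my assumptions.

```python
import random
from collections import OrderedDict
from typing import Any, Optional


class LRUCache:
    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def set(self, key: str, value: Any) -> None:
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


def jittered_ttl(base_seconds: float = 300.0, jitter_seconds: float = 60.0) -> float:
    """Spread expirations so keys written together do not expire together."""
    return base_seconds + random.uniform(-jitter_seconds, jitter_seconds)
```

With the defaults, every TTL lands between 240 and 360 seconds.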
9️⃣ Consistency and Invalidation
Cache Consistency Problem
Database update happens, but cache may still contain old value.
Option 1: Delete Cache on Write
Flow:
Update database
→ Delete cache key
Next read reloads from DB.
Pros
- Simple
- Common with cache-aside
Cons
- Small window of stale reads
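- Race conditions possible (a reader that missed the cache just before the update can write the old value back after the delete)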
Option 2: Update Cache on Write
Flow:
Update database
→ Update cache value
Pros
- Cache stays warm
- Fewer misses
Cons
- More complex
- Risk of cache/database mismatch
Option 3: Event-based Invalidation
Flow:
Database update
→ Publish event
→ Cache invalidation worker deletes keys
Pros
- Decoupled
- Good for multi-service systems
Cons
- Event delay
- Event loss must be handled
👉 Interview Answer
For cache-aside, I would usually update the database first, then delete the cache key.
This keeps the database as the source of truth and avoids writing stale values into the cache.
In larger systems, cache invalidation can be event-driven, where database updates publish invalidation events.
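The write path can be sketched as "database first, then delete, then notify". `InvalidationBus` below is an in-process stand-in for a message broker and is entirely hypothetical; dicts stand in for the database and caches.

```python
from typing import Any, Callable, Dict, List


class InvalidationBus:
    """In-process stand-in for a pub/sub broker (hypothetical)."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, handler: Callable[[str], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, key: str) -> None:
        for handler in self._subscribers:
            handler(key)


def write_with_invalidation(
    key: str,
    value: Any,
    db: Dict[str, Any],
    shared_cache: Dict[str, Any],
    bus: InvalidationBus,
) -> None:
    db[key] = value              # 1. update the source of truth first
    shared_cache.pop(key, None)  # 2. delete (not update) the shared cache entry
    bus.publish(key)             # 3. tell other processes to drop local copies
```

A real broker delivers events asynchronously and can drop them, so TTLs remain the safety net for missed invalidations.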
🔟 Cache Stampede and Thundering Herd
Problem
A hot key expires.
Many requests miss cache at the same time.
All requests hit the database.
hot key expires
→ thousands of requests miss
→ database overload
Solutions
1. Request Coalescing
Only one request rebuilds cache.
Others wait or serve stale value.
2. Distributed Lock
acquire lock for key
→ only lock holder loads DB
→ others wait/retry
3. Serve Stale While Revalidate
return stale cache value
→ refresh cache asynchronously
4. TTL Jitter
Prevent simultaneous expiration.
👉 Interview Answer
Cache stampede happens when a hot key expires and many requests hit the database at the same time.
I would use request coalescing, distributed locks, stale-while-revalidate, and TTL jitter to prevent database overload.
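Request coalescing within one process can be sketched with a lock and an event per in-flight key. The `SingleFlight` class and its `waiting` helper are illustrative assumptions; coalescing across processes would need a distributed lock instead.

```python
import threading
from typing import Any, Callable, Dict, Optional


class _Call:
    """State shared by the leader and the followers of one in-flight load."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.value: Any = None
        self.error: Optional[Exception] = None
        self.waiters = 0  # followers piggybacking on this load


class SingleFlight:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, _Call] = {}

    def do(self, key: str, loader: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._inflight.get(key)
            is_leader = call is None
            if is_leader:
                call = _Call()
                self._inflight[key] = call
            else:
                call.waiters += 1
        if is_leader:
            try:
                call.value = loader()  # the only backend read for this key
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()  # wake every follower
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.value

    def waiting(self, key: str) -> int:
        """Number of followers currently waiting on this key (useful as a stampede metric)."""
        with self._lock:
            call = self._inflight.get(key)
            return 0 if call is None else call.waiters
```

Only the first caller for a key runs `loader`; everyone who arrives while it is running gets the same result or the same exception.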
1️⃣1️⃣ Hot Key Problem
What Is a Hot Key?
One key receives extremely high traffic.
Examples:
celebrity_profile:123
product:iphone_launch
homepage_config
Problems
- One cache node overloaded
- Increased latency
- Node failure affects many requests
Solutions
1. Replicate Hot Keys
Store hot key on multiple nodes.
2. Local Cache
Cache hot key inside application process.
3. Key Splitting
Create multiple physical keys:
hot_key:1
hot_key:2
hot_key:3
Requests randomly read one copy; writes and invalidations must touch every copy.
4. CDN / Edge Cache
For public data.
👉 Interview Answer
Hot keys can overload a single cache node even if the cluster is large.
To handle this, I would replicate hot keys, use local in-process cache, split hot keys into multiple physical keys, or cache public data at the edge.
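Key splitting can be sketched as follows, using the `hot_key:1` to `hot_key:3` naming from above. The helper names are hypothetical and a dict stands in for the cache cluster; in a real cluster the suffix is what spreads the copies across different nodes.

```python
import random
from typing import Any, Dict, List, Optional


def shard_keys(key: str, copies: int) -> List[str]:
    """Physical keys for one logical hot key: key:1 .. key:copies."""
    return [f"{key}:{i}" for i in range(1, copies + 1)]


def write_hot_key(cache: Dict[str, Any], key: str, value: Any, copies: int = 3) -> None:
    # Every copy must be written (or deleted) together, or readers see mixed values.
    for physical_key in shard_keys(key, copies):
        cache[physical_key] = value


def read_hot_key(cache: Dict[str, Any], key: str, copies: int = 3) -> Optional[Any]:
    # Each read picks one copy at random, spreading load across the copies.
    physical_key = f"{key}:{random.randint(1, copies)}"
    return cache.get(physical_key)
```

The cost is write amplification: one logical update becomes `copies` physical writes.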
1️⃣2️⃣ Cache Penetration
Problem
Requests repeatedly ask for non-existing keys.
Example:
user:invalid_id
product:not_found
Each request misses cache and hits DB.
Solutions
1. Negative Caching
Cache “not found” result.
user:invalid_id → NULL, TTL = 60s
2. Bloom Filter
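Before querying DB, check whether key may exist. A Bloom filter can return false positives but never false negatives, so a "no" answer safely skips the DB.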
3. Input Validation
Reject invalid keys early.
👉 Interview Answer
Cache penetration happens when many requests ask for keys that do not exist.
I would use negative caching, Bloom filters, and input validation to avoid repeatedly hitting the database for invalid keys.
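Negative caching and a Bloom filter combine naturally in the read path. This is a sketch with assumed parameters (8192 bits, 4 hash functions derived from SHA-256); `BloomFilter`, `lookup`, and the `_NOT_FOUND` sentinel are illustrative names, and a real system would give negative entries a short TTL.

```python
import hashlib
from typing import Any, Callable, Dict, Iterator, Optional

_NOT_FOUND = object()  # sentinel: "we asked the DB and the key does not exist"


class BloomFilter:
    def __init__(self, size_bits: int = 8192, num_hashes: int = 4) -> None:
        self._size = size_bits
        self._num_hashes = num_hashes
        self._bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        for i in range(self._num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self._size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self._bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def lookup(
    key: str,
    cache: Dict[str, Any],
    bloom: BloomFilter,
    read_db: Callable[[str], Optional[Any]],
) -> Optional[Any]:
    if not bloom.might_contain(key):
        return None  # definitely absent: skip both cache and DB
    cached = cache.get(key)
    if cached is _NOT_FOUND:
        return None  # negative cache hit
    if cached is not None:
        return cached
    value = read_db(key)
    cache[key] = _NOT_FOUND if value is None else value
    return value
```

The Bloom filter has to be kept in sync with inserts into the database, otherwise newly created keys would be rejected as absent.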
1️⃣3️⃣ Scaling Patterns
Pattern 1: Consistent Hashing
Distribute keys across nodes with minimal movement.
Pattern 2: Client-side Routing
Cache client decides which node owns a key.
Pros:
- Avoids proxy bottleneck
- Lower latency
Cons:
- Client needs cluster membership info
Pattern 3: Proxy-based Routing
Application talks to cache proxy.
Proxy routes request to correct node.
Pros:
- Simpler clients
- Centralized routing
Cons:
- Proxy can become bottleneck
Pattern 4: Multi-layer Cache
Local cache → Distributed cache → Database
Pattern 5: Shard by Tenant or Region
Useful for isolation and compliance.
👉 Interview Answer
To scale a distributed cache, I would shard keys using consistent hashing, replicate important data, use multi-layer caching, and choose between client-side routing and proxy-based routing.
Client-side routing gives lower latency, while proxy routing simplifies application clients.
1️⃣4️⃣ Failure Handling
Common Failures
- Cache node unavailable
- Network timeout
- Cache cluster partition
- Hot key overload
- Memory pressure
- Redis failover
- Cache data loss
- Database overload after cache failure
Strategies
- Fall back to database
- Use circuit breaker
- Use timeout budget
- Serve stale value
- Use replicas
- Apply load shedding
- Warm up cache after failure
- Protect database with rate limiting
Cache Failure Rule
Cache is an optimization, not the source of truth.
👉 Interview Answer
The system should work when cache fails, although with higher latency.
I would use short timeouts, circuit breakers, fallback to database, stale reads when acceptable, and rate limiting to protect the database.
Cache should not be treated as the source of truth unless we are explicitly designing a durable cache.
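A circuit breaker around the cache call is one way to make "fall back to database" cheap when the cache is down. The sketch assumes a simple consecutive-failure threshold and a fixed reset window; `CircuitBreaker` and `read_with_fallback` are hypothetical names, and rate limiting on the database path is left out.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True  # closed: cache calls flow normally
        # Open: block until the reset window has passed, then allow a trial call.
        return self._clock() - self._opened_at >= self._reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()


def read_with_fallback(
    key: str,
    cache_get: Callable[[str], Optional[Any]],
    read_db: Callable[[str], Optional[Any]],
    breaker: CircuitBreaker,
) -> Optional[Any]:
    if breaker.allow():
        try:
            value = cache_get(key)
            breaker.record_success()
            if value is not None:
                return value
        except Exception:
            breaker.record_failure()
    # Cache miss, cache error, or breaker open: the database is the source of truth.
    return read_db(key)
```

While the breaker is open, requests skip the cache entirely instead of each paying a timeout.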
1️⃣5️⃣ Observability
Key Metrics
- Cache hit rate
- Cache miss rate
- Latency p50 / p95 / p99
- Eviction count
- Expired key count
- Memory usage
- Hot key distribution
- Error rate
- Replication lag
- Database fallback rate
- Stampede events
Important Dashboards
- Cluster health
- Node memory usage
- Hit/miss ratio
- Hot keys
- Evictions
- DB fallback traffic
- Cache latency
👉 Interview Answer
Observability is critical for cache systems.
I would monitor hit rate, miss rate, cache latency, eviction count, memory usage, hot keys, replication lag, and database fallback traffic.
A falling hit rate or sudden DB fallback spike can indicate cache failure or bad TTL configuration.
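The headline metric is simple to compute: hit rate = hits / (hits + misses). A minimal counter sketch, with `CacheStats` as a hypothetical name; a production system would export these counters to its metrics backend rather than keep them in process.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        # Report 0.0 before any traffic instead of dividing by zero.
        return self.hits / total if total else 0.0
```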
1️⃣6️⃣ Consistency Model
Stronger Consistency Needed For
- Financial balances
- Inventory counts
- Authentication / authorization decisions
- Critical configuration
Eventual Consistency Acceptable For
- User profile display
- Product details
- Feed objects
- Search results
- Recommendation features
- Analytics counters
👉 Interview Answer
Cache data is usually eventually consistent.
For most read-heavy data, slightly stale values are acceptable.
But for critical data like financial balances, inventory, or authorization, we should either avoid caching, use very short TTLs, or enforce stronger invalidation plus a check against the source of truth on read.
1️⃣7️⃣ Security and Access Control
Requirements
- Encrypt traffic between clients and cache
- Restrict cache access by service
- Avoid storing secrets unless necessary
- Support tenant isolation
- Audit admin operations
- Protect against cache poisoning
👉 Interview Answer
A distributed cache can contain sensitive data, so access control matters.
I would restrict which services can access which key namespaces, encrypt traffic, avoid storing secrets, and protect against cache poisoning by validating keys and values before writing.
1️⃣8️⃣ End-to-End Flow
Cache-aside Read Flow
Application receives request
→ Check local cache
→ Check distributed cache
→ Cache miss
→ Read database
→ Write value to distributed cache
→ Optionally write local cache
→ Return response
Write Flow with Invalidation
Application updates database
→ Delete cache key
→ Publish invalidation event
→ Other services remove local cache
Hot Key Flow
Detect hot key
→ Replicate to multiple cache nodes
→ Enable local cache
→ Add TTL jitter
→ Serve stale while revalidate if needed
Key Insight
A distributed cache is not just faster storage; it is a consistency and traffic-shaping layer in front of the source of truth.
🧠 Staff-Level Answer (Final)
👉 Interview Answer (Full Version)
When designing a distributed cache, I think of it as a low-latency key-value layer that reduces database load and improves read performance.
The database remains the source of truth, while the cache stores frequently accessed data.
I would usually use a cache-aside pattern: the application first checks the cache, reads from the database on cache miss, then writes the result back to the cache.
For data distribution, I would use consistent hashing or rendezvous hashing to spread keys across cache nodes while minimizing key movement during scaling.
To improve availability, important cache entries can be replicated asynchronously.
I would use TTLs and eviction policies like LRU or LFU to control memory usage, and add TTL jitter to avoid many keys expiring at the same time.
Cache consistency is one of the hardest parts. For cache-aside, I would update the database first, then delete or invalidate the cache key.
For large systems, invalidation can be event-driven.
To handle cache stampede, I would use request coalescing, distributed locks, stale-while-revalidate, and TTL jitter.
To handle hot keys, I would use local cache, hot key replication, key splitting, or edge caching for public data.
The main trade-offs are latency, consistency, availability, memory cost, and operational complexity.
Ultimately, the goal is to reduce backend load and serve hot data with very low latency, while keeping stale data and failure impact under control.
⭐ Final Insight
The core of a distributed cache is not "a faster database"; it is a low-latency, scalable, gracefully degradable traffic-protection layer built in front of the source of truth.