System Design Deep Dive - 10 Design Distributed Cache

Post by ailswan May. 03, 2026

中文 ↓

🎯 Design Distributed Cache


1️⃣ Core Framework

When discussing Distributed Cache design, I frame it as:

  1. Core purpose and access patterns
  2. Cache placement and caching strategies
  3. Data partitioning and replication
  4. Eviction policy and TTL
  5. Consistency and invalidation
  6. Hot key and cache stampede handling
  7. Scaling and failure handling
  8. Trade-offs: latency vs consistency vs cost

2️⃣ Core Requirements


Functional Requirements


Non-functional Requirements


👉 Interview Answer

A distributed cache stores frequently accessed data in memory to reduce database load and improve read latency.

The main challenges are partitioning data across nodes, handling failures, preventing hot keys, and keeping cached data reasonably consistent with the source of truth.


3️⃣ Main APIs


Get

GET /cache/{key}

Response:

{
  "key": "user:123",
  "value": {
    "name": "Alice",
    "tier": "premium"
  },
  "ttlSeconds": 300
}

Set

PUT /cache/{key}

Request:

{
  "value": {
    "name": "Alice",
    "tier": "premium"
  },
  "ttlSeconds": 300
}

Delete / Invalidate

DELETE /cache/{key}

Batch Get

POST /cache/batch-get

Request:

{
  "keys": ["user:123", "user:456", "product:999"]
}

👉 Interview Answer

A distributed cache usually exposes simple key-value operations: get, set, delete, and batch get.

In practice, the cache is usually accessed by application services through a client library, rather than directly through public APIs.


4️⃣ Cache Placement


Client-side Cache

Cache lives inside application process.

Pros

Cons


Server-side Distributed Cache

Cache cluster shared by many services.

Examples:

Redis
Memcached
DynamoDB Accelerator

Pros

Cons


Multi-layer Cache

Common production pattern:

Local in-process cache
→ Distributed cache
→ Database

👉 Interview Answer

I would usually use a multi-layer caching strategy.

Local in-process cache provides extremely low latency for hot small data, while distributed cache provides shared capacity across services.

The database remains the source of truth.


5️⃣ Caching Strategies


Strategy 1: Cache-aside / Lazy Loading

Flow:

Application checks cache
→ Cache miss
→ Read database
→ Write result to cache
→ Return result

Pros


Cons


👉 Interview Answer

Cache-aside is the most common pattern.

The application first checks the cache. On cache miss, it reads from the database, writes the result back to cache, and returns the data.

This keeps the cache simple, but we need to handle stale data and cache stampede.


Strategy 2: Read-through Cache

Flow:

Application → Cache
Cache loads from database on miss

Pros


Cons


Strategy 3: Write-through Cache

Flow:

Application writes cache
→ Cache writes database synchronously

Pros


Cons


Strategy 4: Write-behind / Write-back Cache

Flow:

Application writes cache
→ Cache asynchronously writes database

Pros


Cons


For most backend systems:

Cache-aside for read-heavy data
Write-through only when consistency is more important
Write-behind only when data loss risk is acceptable or durable queue exists

👉 Interview Answer

For most systems, I would start with cache-aside because it is simple and keeps the database as the source of truth.

If stronger consistency is needed, write-through can be used.

Write-behind is faster, but it requires careful durability guarantees.


6️⃣ Data Partitioning


Why Partition?

A single cache node cannot handle all data or traffic.

We need to split keys across many nodes.


Hash-based Partitioning

hash(key) % number_of_nodes

Pros

Cons


Consistent Hashing

hash ring with virtual nodes

Pros

Cons


Rendezvous Hashing

Alternative consistent hashing strategy.

Pros


👉 Interview Answer

I would use consistent hashing or rendezvous hashing to distribute keys across cache nodes.

This avoids massive key movement when nodes are added or removed, and allows the cluster to scale horizontally.


7️⃣ Replication and Availability


Why Replicate?


Primary-replica Model

key → primary node + replica nodes

Writes go to primary.

Reads can go to:


Replication Factor

Example:

replication_factor = 2 or 3

Trade-off

Choice Pros Cons
No replication Simple, cheaper Cache loss on node failure
Async replication Faster Temporary inconsistency
Sync replication More consistent Higher write latency

👉 Interview Answer

I would replicate cache entries across multiple nodes to improve availability.

Since the database is still the source of truth, cache replication can usually be asynchronous.

If a cache node fails, the system can read from a replica or fall back to the database.


8️⃣ Eviction Policy and TTL


Why Eviction?

Cache memory is limited.

When memory is full, the cache must remove some entries.


Common Eviction Policies

Policy Meaning Use Case
LRU Evict least recently used General-purpose cache
LFU Evict least frequently used Stable hot keys
FIFO Evict oldest item Simple systems
Random Evict random item Low overhead

TTL

TTL automatically expires data.

Example:

user_profile TTL = 5 minutes
product_catalog TTL = 1 hour
feature_flags TTL = 30 seconds

TTL Jitter

Add randomization:

TTL = 300s ± random(0, 60s)

Why?


👉 Interview Answer

I would use TTLs to prevent stale data from living forever, and an eviction policy like LRU or LFU when memory is full.

I would also add TTL jitter so many hot keys do not expire at exactly the same time, which helps prevent cache stampede.


9️⃣ Consistency and Invalidation


Cache Consistency Problem

Database update happens, but cache may still contain old value.


Option 1: Delete Cache on Write

Flow:

Update database
→ Delete cache key

Next read reloads from DB.

Pros

Cons


Option 2: Update Cache on Write

Flow:

Update database
→ Update cache value

Pros

Cons


Option 3: Event-based Invalidation

Flow:

Database update
→ Publish event
→ Cache invalidation worker deletes keys

Pros

Cons


👉 Interview Answer

For cache-aside, I would usually update the database first, then delete the cache key.

This keeps the database as the source of truth and avoids writing stale values into the cache.

In larger systems, cache invalidation can be event-driven, where database updates publish invalidation events.


🔟 Cache Stampede and Thundering Herd


Problem

A hot key expires.

Many requests miss cache at the same time.

All requests hit the database.

hot key expires
→ thousands of requests miss
→ database overload

Solutions

1. Request Coalescing

Only one request rebuilds cache.

Others wait or serve stale value.


2. Distributed Lock

acquire lock for key
→ only lock holder loads DB
→ others wait/retry

3. Serve Stale While Revalidate

return stale cache value
→ refresh cache asynchronously

4. TTL Jitter

Prevent simultaneous expiration.


👉 Interview Answer

Cache stampede happens when a hot key expires and many requests hit the database at the same time.

I would use request coalescing, distributed locks, stale-while-revalidate, and TTL jitter to prevent database overload.


1️⃣1️⃣ Hot Key Problem


What Is a Hot Key?

One key receives extremely high traffic.

Examples:

celebrity_profile:123
product:iphone_launch
homepage_config

Problems


Solutions

1. Replicate Hot Keys

Store hot key on multiple nodes.


2. Local Cache

Cache hot key inside application process.


3. Key Splitting

Create multiple physical keys:

hot_key:1
hot_key:2
hot_key:3

Requests randomly read one copy.


4. CDN / Edge Cache

For public data.


👉 Interview Answer

Hot keys can overload a single cache node even if the cluster is large.

To handle this, I would replicate hot keys, use local in-process cache, split hot keys into multiple physical keys, or cache public data at the edge.


1️⃣2️⃣ Cache Penetration


Problem

Requests repeatedly ask for non-existing keys.

Example:

user:invalid_id
product:not_found

Each request misses cache and hits DB.


Solutions

1. Negative Caching

Cache “not found” result.

user:invalid_id → NULL, TTL = 60s

2. Bloom Filter

Before querying DB, check whether key may exist.


3. Input Validation

Reject invalid keys early.


👉 Interview Answer

Cache penetration happens when many requests ask for keys that do not exist.

I would use negative caching, Bloom filters, and input validation to avoid repeatedly hitting the database for invalid keys.


1️⃣3️⃣ Scaling Patterns


Pattern 1: Consistent Hashing

Distribute keys across nodes with minimal movement.


Pattern 2: Client-side Routing

Cache client decides which node owns a key.

Pros:

Cons:


Pattern 3: Proxy-based Routing

Application talks to cache proxy.

Proxy routes request to correct node.

Pros:

Cons:


Pattern 4: Multi-layer Cache

Local cache → Distributed cache → Database

Pattern 5: Shard by Tenant or Region

Useful for isolation and compliance.


👉 Interview Answer

To scale a distributed cache, I would shard keys using consistent hashing, replicate important data, use multi-layer caching, and choose between client-side routing and proxy-based routing.

Client-side routing gives lower latency, while proxy routing simplifies application clients.


1️⃣4️⃣ Failure Handling


Common Failures


Strategies


Cache Failure Rule

Cache is an optimization, not the source of truth.


👉 Interview Answer

The system should work when cache fails, although with higher latency.

I would use short timeouts, circuit breakers, fallback to database, stale reads when acceptable, and rate limiting to protect the database.

Cache should not be treated as the source of truth unless we are explicitly designing a durable cache.


1️⃣5️⃣ Observability


Key Metrics


Important Dashboards


👉 Interview Answer

Observability is critical for cache systems.

I would monitor hit rate, miss rate, cache latency, eviction count, memory usage, hot keys, replication lag, and database fallback traffic.

A falling hit rate or sudden DB fallback spike can indicate cache failure or bad TTL configuration.


1️⃣6️⃣ Consistency Model


Stronger Consistency Needed For


Eventual Consistency Acceptable For


👉 Interview Answer

Cache data is usually eventually consistent.

For most read-heavy data, slightly stale values are acceptable.

But for critical data like financial balances, inventory, or authorization, we should either avoid caching, use very short TTLs, or enforce stronger invalidation and read-through checks.


1️⃣7️⃣ Security and Access Control


Requirements


👉 Interview Answer

A distributed cache can contain sensitive data, so access control matters.

I would restrict which services can access which key namespaces, encrypt traffic, avoid storing secrets, and protect against cache poisoning by validating keys and values before writing.


1️⃣8️⃣ End-to-End Flow


Cache-aside Read Flow

Application receives request
→ Check local cache
→ Check distributed cache
→ Cache miss
→ Read database
→ Write value to distributed cache
→ Optionally write local cache
→ Return response

Write Flow with Invalidation

Application updates database
→ Delete cache key
→ Publish invalidation event
→ Other services remove local cache

Hot Key Flow

Detect hot key
→ Replicate to multiple cache nodes
→ Enable local cache
→ Add TTL jitter
→ Serve stale while revalidate if needed

Key Insight

Distributed Cache is not just faster storage — it is a consistency and traffic-shaping layer in front of the source of truth.


🧠 Staff-Level Answer (Final)


👉 Interview Answer (Full Version)

When designing a distributed cache, I think of it as a low-latency key-value layer that reduces database load and improves read performance.

The database remains the source of truth, while the cache stores frequently accessed data.

I would usually use a cache-aside pattern: the application first checks the cache, reads from the database on cache miss, then writes the result back to the cache.

For data distribution, I would use consistent hashing or rendezvous hashing to spread keys across cache nodes while minimizing key movement during scaling.

To improve availability, important cache entries can be replicated asynchronously.

I would use TTLs and eviction policies like LRU or LFU to control memory usage, and add TTL jitter to avoid many keys expiring at the same time.

Cache consistency is one of the hardest parts. For cache-aside, I would update the database first, then delete or invalidate the cache key.

For large systems, invalidation can be event-driven.

To handle cache stampede, I would use request coalescing, distributed locks, stale-while-revalidate, and TTL jitter.

To handle hot keys, I would use local cache, hot key replication, key splitting, or edge caching for public data.

The main trade-offs are latency, consistency, availability, memory cost, and operational complexity.

Ultimately, the goal is to reduce backend load and serve hot data with very low latency, while keeping stale data and failure impact under control.


⭐ Final Insight

Distributed Cache 的核心不是“更快的数据库”, 而是在 source of truth 前面建立一个低延迟、可扩展、可降级的流量保护层。



中文部分


🎯 Design Distributed Cache


1️⃣ 核心框架

在设计 Distributed Cache 时,我通常从以下几个方面来分析:

  1. 核心目的和访问模式
  2. Cache 位置和缓存策略
  3. 数据分片和副本
  4. Eviction policy 和 TTL
  5. 一致性和 invalidation
  6. Hot key 和 cache stampede 处理
  7. 扩展和故障处理
  8. 核心权衡:延迟 vs 一致性 vs 成本

2️⃣ 核心需求


功能需求


非功能需求


👉 面试回答

Distributed Cache 会将频繁访问的数据存储在内存中, 用来降低数据库负载并提升读取延迟。

核心挑战包括如何在多个节点间分配数据、 如何处理节点失败、 如何解决 hot key, 以及如何让缓存数据和 source of truth 保持合理一致。


3️⃣ 主要 API


Get

GET /cache/{key}

Response:

{
  "key": "user:123",
  "value": {
    "name": "Alice",
    "tier": "premium"
  },
  "ttlSeconds": 300
}

Set

PUT /cache/{key}

Request:

{
  "value": {
    "name": "Alice",
    "tier": "premium"
  },
  "ttlSeconds": 300
}

Delete / Invalidate

DELETE /cache/{key}

Batch Get

POST /cache/batch-get

Request:

{
  "keys": ["user:123", "user:456", "product:999"]
}

👉 面试回答

Distributed Cache 通常提供简单的 key-value 操作: get、set、delete 和 batch get。

在实际系统中, 应用服务通常通过 client library 访问 cache, 而不是直接暴露成 public API。


4️⃣ Cache Placement


Client-side Cache

Cache 存在 application process 内部。

优点

缺点


Server-side Distributed Cache

多个服务共享的 cache cluster。

例如:

Redis
Memcached
DynamoDB Accelerator

优点

缺点


Multi-layer Cache

生产系统常见模式:

Local in-process cache
→ Distributed cache
→ Database

👉 面试回答

我通常会使用 multi-layer caching strategy。

Local in-process cache 可以为小型热点数据提供极低延迟, distributed cache 则提供跨服务共享的缓存能力。

Database 仍然是 source of truth。


5️⃣ 缓存策略


Strategy 1: Cache-aside / Lazy Loading

流程:

Application checks cache
→ Cache miss
→ Read database
→ Write result to cache
→ Return result

优点


缺点


👉 面试回答

Cache-aside 是最常见的缓存模式。

应用先检查 cache。 如果 cache miss, 就读取数据库, 再将结果写回 cache, 最后返回数据。

这种方式简单, 但需要处理 stale data 和 cache stampede。


Strategy 2: Read-through Cache

流程:

Application → Cache
Cache loads from database on miss

优点


缺点


Strategy 3: Write-through Cache

流程:

Application writes cache
→ Cache writes database synchronously

优点


缺点


Strategy 4: Write-behind / Write-back Cache

流程:

Application writes cache
→ Cache asynchronously writes database

优点


缺点


推荐方案

对大多数后端系统:

Cache-aside for read-heavy data
Write-through only when consistency is more important
Write-behind only when data loss risk is acceptable or durable queue exists

👉 面试回答

对大多数系统, 我会先使用 cache-aside, 因为它简单,并且让 database 保持 source of truth。

如果需要更强一致性, 可以使用 write-through。

Write-behind 速度更快, 但需要非常谨慎地处理持久性。


6️⃣ 数据分片


为什么需要分片?

单个 cache node 无法承载所有数据和流量。

我们需要将 keys 分布到多个节点。


Hash-based Partitioning

hash(key) % number_of_nodes

优点

缺点


Consistent Hashing

hash ring with virtual nodes

优点

缺点


Rendezvous Hashing

另一种 consistent hashing 策略。

优点


👉 面试回答

我会使用 consistent hashing 或 rendezvous hashing 将 keys 分布到不同 cache nodes。

这样在添加或删除节点时, 可以避免大量 key 重新映射, 并支持 cache cluster 水平扩展。


7️⃣ Replication and Availability


为什么需要副本?


Primary-replica Model

key → primary node + replica nodes

Writes 写入 primary。

Reads 可以读:


Replication Factor

示例:

replication_factor = 2 or 3

Trade-off

Choice 优点 缺点
No replication 简单、便宜 节点失败导致 cache 丢失
Async replication 可能短暂不一致
Sync replication 更一致 写入延迟更高

👉 面试回答

我会将 cache entries 复制到多个节点, 用来提升可用性。

因为 database 仍然是 source of truth, cache replication 通常可以是异步的。

如果某个 cache node 失败, 系统可以从 replica 读取, 或者回退到 database。


8️⃣ Eviction Policy and TTL


为什么需要 Eviction?

Cache memory 是有限的。

当内存满时, cache 必须移除一部分 entries。


常见 Eviction Policies

Policy 含义 使用场景
LRU 移除最近最少使用 通用 cache
LFU 移除最不常使用 稳定 hot keys
FIFO 移除最早进入的数据 简单系统
Random 随机移除 低开销

TTL

TTL 自动让数据过期。

示例:

user_profile TTL = 5 minutes
product_catalog TTL = 1 hour
feature_flags TTL = 30 seconds

TTL Jitter

添加随机扰动:

TTL = 300s ± random(0, 60s)

原因:


👉 面试回答

我会使用 TTL 防止 stale data 永久存在, 并在内存满时使用 LRU 或 LFU 这类 eviction policy。

我也会加入 TTL jitter, 避免大量热点 key 在同一时间过期, 从而降低 cache stampede 风险。


9️⃣ Consistency and Invalidation


Cache Consistency Problem

数据库更新后, cache 里可能仍然有旧值。


Option 1: Delete Cache on Write

流程:

Update database
→ Delete cache key

下一次读请求重新从 DB 加载。

优点

缺点


Option 2: Update Cache on Write

流程:

Update database
→ Update cache value

优点

缺点


Option 3: Event-based Invalidation

流程:

Database update
→ Publish event
→ Cache invalidation worker deletes keys

优点

缺点


👉 面试回答

对于 cache-aside, 我通常会先更新 database, 然后删除 cache key。

这样可以让 database 保持 source of truth, 并避免把旧值写进 cache。

在更大的系统中, cache invalidation 可以用 event-driven 方式实现, 由 database update 事件触发缓存删除。


🔟 Cache Stampede and Thundering Herd


问题

一个 hot key 过期。

大量请求同时 cache miss。

所有请求都打到 database。

hot key expires
→ thousands of requests miss
→ database overload

解决方案

1. Request Coalescing

只允许一个请求重建 cache。

其他请求等待或返回 stale value。


2. Distributed Lock

acquire lock for key
→ only lock holder loads DB
→ others wait/retry

3. Serve Stale While Revalidate

return stale cache value
→ refresh cache asynchronously

4. TTL Jitter

防止同时过期。


👉 面试回答

Cache stampede 发生在热点 key 过期时, 大量请求同时 miss cache, 导致 database 被打爆。

我会使用 request coalescing、 distributed locks、stale-while-revalidate 和 TTL jitter 来保护数据库。


1️⃣1️⃣ Hot Key Problem


什么是 Hot Key?

一个 key 收到极高流量。

示例:

celebrity_profile:123
product:iphone_launch
homepage_config

问题


解决方案

1. Replicate Hot Keys

将 hot key 存到多个节点。


2. Local Cache

在 application process 内缓存 hot key。


3. Key Splitting

创建多个 physical keys:

hot_key:1
hot_key:2
hot_key:3

请求随机读一个副本。


4. CDN / Edge Cache

适合 public data。


👉 面试回答

Hot key 会让单个 cache node 过载, 即使整个 cache cluster 很大也没有用。

为了解决这个问题, 我会复制 hot keys、 使用 local in-process cache、 将 hot key 拆成多个 physical keys, 或者对 public data 使用 edge cache。


1️⃣2️⃣ Cache Penetration


问题

请求不断访问不存在的 keys。

示例:

user:invalid_id
product:not_found

每次都 miss cache, 然后打到 DB。


解决方案

1. Negative Caching

缓存 “not found” 结果。

user:invalid_id → NULL, TTL = 60s

2. Bloom Filter

查询 DB 前, 先判断 key 是否可能存在。


3. Input Validation

提前拒绝非法 key。


👉 面试回答

Cache penetration 指的是大量请求访问不存在的 key, 导致每次都 miss cache 并查询 database。

我会使用 negative caching、Bloom filter 和 input validation, 避免无效 key 重复打到数据库。


1️⃣3️⃣ Scaling Patterns


Pattern 1: Consistent Hashing

使用 consistent hashing 分布 keys, 减少节点变化时的数据迁移。


Pattern 2: Client-side Routing

Cache client 决定某个 key 属于哪个 node。

优点:

缺点:


Pattern 3: Proxy-based Routing

Application 访问 cache proxy。

Proxy 再路由到正确 node。

优点:

缺点:


Pattern 4: Multi-layer Cache

Local cache → Distributed cache → Database

Pattern 5: Shard by Tenant or Region

适合 isolation 和 compliance。


👉 面试回答

为了扩展 distributed cache, 我会使用 consistent hashing 对 keys 分片, 对重要数据做副本, 使用 multi-layer caching, 并在 client-side routing 和 proxy-based routing 之间做选择。

Client-side routing 延迟更低, proxy routing 则可以简化应用侧逻辑。


1️⃣4️⃣ Failure Handling


常见故障


策略


Cache Failure Rule

Cache 是优化层, 不是 source of truth。


👉 面试回答

系统应该在 cache 失败时仍然可以工作, 只是延迟会更高。

我会使用短 timeout、circuit breaker、 fallback to database、可接受时读取 stale value, 并用 rate limiting 保护 database。

除非明确设计 durable cache, 否则 cache 不应该被当作 source of truth。


1️⃣5️⃣ Observability


Key Metrics


Important Dashboards


👉 面试回答

可观测性对 cache system 非常关键。

我会监控 hit rate、miss rate、 cache latency、eviction count、 memory usage、hot keys、replication lag 和 database fallback traffic。

如果 hit rate 下降或 DB fallback 突然升高, 可能说明 cache 故障或 TTL 配置有问题。


1️⃣6️⃣ Consistency Model


需要较强一致性的场景


可以最终一致的场景


👉 面试回答

Cache data 通常是最终一致的。

对大多数 read-heavy 数据来说, 轻微 stale value 是可以接受的。

但是对于 financial balance、inventory 或 authorization 这类关键数据, 要么避免缓存, 要么使用很短 TTL, 要么使用更强的 invalidation 和 read-through checks。


1️⃣7️⃣ Security and Access Control


Requirements


👉 面试回答

Distributed cache 可能包含敏感数据, 所以 access control 很重要。

我会限制不同服务能访问的 key namespaces, 加密网络传输, 避免存储 secrets, 并通过写入前校验 key 和 value 来防止 cache poisoning。


1️⃣8️⃣ End-to-End Flow


Cache-aside Read Flow

Application receives request
→ Check local cache
→ Check distributed cache
→ Cache miss
→ Read database
→ Write value to distributed cache
→ Optionally write local cache
→ Return response

Write Flow with Invalidation

Application updates database
→ Delete cache key
→ Publish invalidation event
→ Other services remove local cache

Hot Key Flow

Detect hot key
→ Replicate to multiple cache nodes
→ Enable local cache
→ Add TTL jitter
→ Serve stale while revalidate if needed

Key Insight

Distributed Cache 不只是更快的存储, 它是 source of truth 前面的 consistency 和 traffic-shaping layer。


🧠 Staff-Level Answer(最终版)


👉 面试回答(完整背诵版)

在设计 Distributed Cache 时, 我会把它看作一个低延迟 key-value layer, 用来降低 database load 并提升读取性能。

Database 仍然是 source of truth, cache 只保存频繁访问的数据。

我通常会使用 cache-aside pattern: 应用先检查 cache, cache miss 时读取 database, 然后将结果写回 cache。

对于数据分布, 我会使用 consistent hashing 或 rendezvous hashing 将 keys 分散到多个 cache nodes, 并在扩容或缩容时减少 key movement。

为了提升可用性, 重要 cache entries 可以进行异步复制。

我会使用 TTL 和 LRU / LFU 等 eviction policies 控制内存使用, 并加入 TTL jitter, 避免大量 key 同时过期。

Cache consistency 是最难的问题之一。 对于 cache-aside, 我会先更新 database, 再删除或 invalidate cache key。

对大型系统来说, invalidation 可以通过 event-driven 方式完成。

为了解决 cache stampede, 我会使用 request coalescing、distributed locks、 stale-while-revalidate 和 TTL jitter。

为了解决 hot key, 我会使用 local cache、hot key replication、 key splitting 或 edge caching。

核心权衡包括延迟、一致性、可用性、 内存成本和运维复杂度。

最终目标是在控制 stale data 和故障影响的前提下, 降低后端负载, 并以极低延迟服务热点数据。


⭐ Final Insight

Distributed Cache 的核心不是“更快的数据库”, 而是在 source of truth 前面建立一个低延迟、可扩展、可降级的流量保护层。

Implement