System Design Deep Dive - 06 Design Rate Limiter

Posted by ailswan, April 30, 2026

中文 ↓

🎯 Design Rate Limiter


1️⃣ Core Framework

When discussing Rate Limiter design, I frame it as:

  1. Core purpose and requirements
  2. Rate limiting algorithms
  3. Key design dimensions: user, IP, API, tenant
  4. Storage and counter strategy
  5. Distributed rate limiting
  6. Handling burst traffic
  7. Failure handling and fallback
  8. Trade-offs: accuracy vs latency vs availability

2️⃣ Core Requirements


Functional Requirements


Non-functional Requirements


👉 Interview Answer

A rate limiter controls how many requests a client can make within a given time window.

It protects backend services from abuse, traffic spikes, accidental client bugs, and unfair resource usage.

The main challenge is enforcing limits accurately while keeping latency very low and availability high.


3️⃣ Main APIs / Integration Points


Check Limit API

POST /api/rate-limit/check

Request:

{
  "key": "user:u123",
  "api": "/checkout",
  "cost": 1
}

Response:

{
  "allowed": true,
  "remaining": 99,
  "resetAt": "2026-05-02T10:01:00Z"
}

Gateway Integration

Most commonly, the rate limiter is integrated at the gateway layer:

Client
→ API Gateway / Load Balancer
→ Rate Limiter
→ Backend Service

Response When Limited

HTTP/1.1 429 Too Many Requests
Retry-After: 30

👉 Interview Answer

I would usually place the rate limiter at the API gateway layer, because it can block abusive traffic before it reaches backend services.

When a request exceeds the limit, the system should return HTTP 429 with a Retry-After header.


4️⃣ Rate Limiting Algorithms


Algorithm 1: Fixed Window Counter

Example:

100 requests per minute

Implementation:

key = user_id + ":" + current_minute
counter[key]++
reject if counter[key] > limit

Pros


Cons

Example:

100 requests at 00:59
100 requests at 01:00
→ 200 requests in 2 seconds

👉 Interview Answer

Fixed window counter is simple and efficient, but it has a boundary problem.

A client can send requests at the end of one window and the beginning of the next window, causing a burst much larger than the intended rate.
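
To make the boundary problem concrete, here is a minimal Python sketch of a fixed window counter. The class name `FixedWindowLimiter`, the in-memory dictionary, and the explicit `now` argument (seconds) are assumptions for illustration; a real deployment would keep the counters in a shared store and expire old windows.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Fixed window counter: one counter per (key, window index)."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        # (key, window index) -> count. Old windows are never cleaned up here;
        # a real store would expire them, for example with a TTL.
        self.counters = defaultdict(int)

    def allow(self, key: str, now: float) -> bool:
        window_index = int(now // self.window_seconds)
        bucket = (key, window_index)
        if self.counters[bucket] >= self.limit:
            return False
        self.counters[bucket] += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_seconds=60)
# 100 requests at t=59s (end of window 0), then 100 at t=60s (start of window 1).
accepted = sum(limiter.allow("u123", 59.0) for _ in range(100))
accepted += sum(limiter.allow("u123", 60.0) for _ in range(100))
print(accepted)  # 200: both bursts pass because they fall in different windows
```

The limiter behaves correctly inside each window, yet the client still gets twice the intended rate across the boundary.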


Algorithm 2: Sliding Window Log

Store timestamps of each request.

user:u123 → [t1, t2, t3, ...]

On each request:

  1. Remove timestamps older than window
  2. Count remaining timestamps
  3. Allow if count < limit

Pros


Cons


👉 Interview Answer

Sliding window log is accurate because it tracks individual request timestamps.

However, it is expensive in memory and compute, especially for high-QPS clients.
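
The three steps above can be sketched in Python as follows. This is an illustrative in-memory version: `SlidingWindowLog`, the per-key `deque`, and the choice not to log rejected requests are assumptions, not something the design prescribes.

```python
from collections import defaultdict, deque


class SlidingWindowLog:
    """Sliding window log: keeps one timestamp per accepted request."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs = defaultdict(deque)  # key -> timestamps, oldest first

    def allow(self, key: str, now: float) -> bool:
        log = self.logs[key]
        # 1. Remove timestamps that have fallen out of the window.
        cutoff = now - self.window_seconds
        while log and log[0] <= cutoff:
            log.popleft()
        # 2. Count what remains; 3. allow only if the count is under the limit.
        if len(log) < self.limit:
            log.append(now)
            return True
        return False


limiter = SlidingWindowLog(limit=3, window_seconds=10)
decisions = [limiter.allow("user:u123", t) for t in (0.0, 1.0, 2.0, 3.0, 10.5)]
print(decisions)  # [True, True, True, False, True]
```

Memory grows with the number of accepted requests per key per window, which is the cost the answer above refers to.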


Algorithm 3: Sliding Window Counter

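Approximate the sliding window from the current and previous fixed-window counts.

Formula:

effective_count = current_window_count + previous_window_count * overlap_ratio

where overlap_ratio is the fraction of the sliding window that still falls inside the previous fixed window (1 - elapsed_time_in_current_window / window_size).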

Pros


Cons


👉 Interview Answer

Sliding window counter is a practical compromise.

It avoids most fixed-window burst problems while using much less memory than sliding window log.

It is commonly used when we need reasonable accuracy at scale.
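
A small Python sketch of the formula, assuming an in-memory dictionary of per-window counts and an explicit `now` in seconds (both hypothetical choices for illustration):

```python
class SlidingWindowCounter:
    """Approximates a sliding window from two fixed-window counters."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = {}  # (key, window index) -> accepted requests

    def allow(self, key: str, now: float) -> bool:
        index = int(now // self.window_seconds)
        elapsed = (now % self.window_seconds) / self.window_seconds
        # Share of the sliding window still covered by the previous fixed window.
        overlap_ratio = 1.0 - elapsed
        current = self.counts.get((key, index), 0)
        previous = self.counts.get((key, index - 1), 0)
        effective_count = current + previous * overlap_ratio
        if effective_count >= self.limit:
            return False
        self.counts[(key, index)] = current + 1
        return True


limiter = SlidingWindowCounter(limit=100, window_seconds=60)
end_of_window = sum(limiter.allow("u123", 59.0) for _ in range(100))
start_of_next = sum(limiter.allow("u123", 60.0) for _ in range(100))
halfway = sum(limiter.allow("u123", 90.0) for _ in range(100))
print(end_of_window, start_of_next, halfway)  # 100 0 50
```

Unlike the fixed window, the burst at the start of the next window is rejected, and capacity returns gradually as the previous window's weight decays.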


Algorithm 4: Token Bucket

Each key has a bucket.

bucket capacity = burst size
refill rate = allowed rate

Example:

capacity = 100 tokens
refill = 10 tokens / second

Each request consumes one token.


Pros


Cons


👉 Interview Answer

Token bucket is useful when we want to allow controlled bursts while enforcing a long-term average rate.

Each request consumes tokens, and tokens are refilled at a fixed rate.

If the bucket is empty, the request is rejected or delayed.
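
A minimal token bucket sketch in Python, using the example numbers above (capacity 100, refill 10 tokens per second). The class name, the lazy refill on each call, and the explicit `now` argument are illustrative assumptions; this version rejects rather than delays.

```python
class TokenBucket:
    """Token bucket with lazy refill: tokens are topped up on each call."""

    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = max(self.last_refill, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=100, refill_per_second=10)
burst = sum(bucket.allow(0.0) for _ in range(150))
after_one_second = sum(bucket.allow(1.0) for _ in range(20))
print(burst, after_one_second)  # 100 10
```

The burst is bounded by the capacity, and the sustained rate is bounded by the refill rate.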


Algorithm 5: Leaky Bucket

Requests enter a queue and are processed at a fixed rate.


Pros


Cons


👉 Interview Answer

Leaky bucket smooths traffic by processing requests at a constant rate.

It is useful for protecting downstream systems, but it may add latency because requests can wait in a queue.
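
One way to sketch the queue-based behavior in Python is shown below. The names `LeakyBucket`, `offer`, and `drain` are hypothetical, and the caller is assumed to invoke `drain` periodically with the current time in seconds.

```python
from collections import deque


class LeakyBucket:
    """Bounded queue that releases requests at a fixed rate."""

    def __init__(self, capacity: int, leak_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.leak_interval = 1.0 / leak_per_second
        self.queue = deque()
        self.next_leak = now  # earliest time the next request may be released

    def offer(self, request, now: float) -> bool:
        """Enqueue a request; returns False if the queue is full."""
        if len(self.queue) >= self.capacity:
            return False
        if not self.queue:
            # Idle time earns no credit: the next slot is no earlier than now.
            self.next_leak = max(self.next_leak, now)
        self.queue.append(request)
        return True

    def drain(self, now: float) -> list:
        """Release every queued request whose slot has arrived by `now`."""
        released = []
        while self.queue and self.next_leak <= now:
            released.append(self.queue.popleft())
            self.next_leak += self.leak_interval
        return released


bucket = LeakyBucket(capacity=3, leak_per_second=2)
accepted = [bucket.offer(name, 0.0) for name in ("a", "b", "c", "d")]
print(accepted)           # [True, True, True, False]
print(bucket.drain(0.0))  # ['a']
print(bucket.drain(1.0))  # ['b', 'c']
```

Requests "b" and "c" are accepted at t=0 but only released later, which is the queueing latency mentioned above.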


Recommended Algorithm

For most API rate limiters:

Token Bucket or Sliding Window Counter

Core Insight

Fixed window is simple, sliding window is more accurate, and token bucket handles bursts better.


5️⃣ Rate Limit Key Design


Common Keys

user_id
api_key
ip_address
tenant_id
endpoint
device_id

Composite Keys

Examples:

user:u123:/checkout
ip:1.2.3.4:/login
tenant:t1:/api/orders
api_key:k123:/search

Why Key Design Matters


👉 Interview Answer

Rate limit key design matters because the key decides which requests share a counter.

We may limit by user ID, IP address, API key, endpoint, tenant, or a combination of them.

For example, login APIs may be limited by IP and username, while paid APIs may be limited by tenant or API key.
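
As an illustration, composite keys in the `scope:identity:endpoint` shape shown earlier could be built like this. The helper names `build_key` and `login_keys` are hypothetical, and the `username` scope is an assumption based on the login example.

```python
from typing import List, Optional


def build_key(scope: str, identity: str, endpoint: Optional[str] = None) -> str:
    """Join the parts into a key such as 'user:u123:/checkout'."""
    parts = [scope, identity]
    if endpoint is not None:
        parts.append(endpoint)
    return ":".join(parts)


def login_keys(ip: str, username: str) -> List[str]:
    """A login request is checked against two independent counters."""
    return [
        build_key("ip", ip, "/login"),
        build_key("username", username, "/login"),
    ]


print(build_key("user", "u123", "/checkout"))  # user:u123:/checkout
print(login_keys("1.2.3.4", "alice"))
```

Checking both keys means a single IP cannot hammer many usernames, and a single username cannot be attacked from many IPs without hitting a limit.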


6️⃣ Data Storage


In-memory Local Counter

gateway node memory

Pros

Cons


Redis Counter

Redis key → counter / token bucket state

Pros

Cons


Persistent Database

Usually not used for hot path.

Use Cases


👉 Interview Answer

For the hot path, I would usually use Redis or local memory.

Redis gives centralized counters with atomic operations and TTL support, while local memory is faster but less accurate in distributed systems.

Persistent databases are better for configuration, auditing, and billing, not per-request rate limit checks.


7️⃣ Distributed Rate Limiting


The Challenge

In distributed systems, traffic is spread across many gateway nodes.

Client → Gateway A
Client → Gateway B
Client → Gateway C

Each node only sees part of the traffic.


Option 1: Centralized Redis

All gateways check Redis.

Gateway → Redis → decision

Pros

Cons


Option 2: Local Limiters

Each gateway enforces a fraction of the limit.

global limit = 1000/sec
10 gateways
each gateway allows 100/sec

Pros

Cons


Option 3: Hybrid

Use:

local limiter for fast protection
global Redis limiter for accuracy

👉 Interview Answer

In distributed systems, a single user’s traffic may hit multiple gateway nodes.

A centralized Redis limiter gives better global accuracy, but adds latency and dependency.

A local limiter is faster and more available, but less accurate.

In production, I would often use a hybrid approach: local limiting for fast protection, and Redis-based global limiting for stronger enforcement.
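
The hybrid idea can be sketched in Python with an in-memory stand-in for the shared store. Everything here is an assumption for illustration: `WindowCounter` plays the role of both node memory and Redis, and the local limit is set slightly below the global one.

```python
class WindowCounter:
    """In-memory counter; stands in for node memory or a shared store."""

    def __init__(self):
        self.counts = {}

    def try_acquire(self, key: str, limit: int) -> bool:
        current = self.counts.get(key, 0)
        if current >= limit:
            return False
        self.counts[key] = current + 1
        return True


class Gateway:
    def __init__(self, shared: WindowCounter, local_limit: int, global_limit: int):
        self.local = WindowCounter()
        self.shared = shared
        self.local_limit = local_limit
        self.global_limit = global_limit

    def allow(self, key: str, window_index: int) -> bool:
        window_key = f"{key}:{window_index}"
        # Cheap local check first: sheds obvious excess without a network call.
        if not self.local.try_acquire(window_key, self.local_limit):
            return False
        # Global check: enforces the limit across all gateway nodes.
        return self.shared.try_acquire(window_key, self.global_limit)


shared = WindowCounter()
gateway_a = Gateway(shared, local_limit=8, global_limit=10)
gateway_b = Gateway(shared, local_limit=8, global_limit=10)
allowed_a = sum(gateway_a.allow("user:u123", 0) for _ in range(9))
allowed_b = sum(gateway_b.allow("user:u123", 0) for _ in range(5))
print(allowed_a, allowed_b)  # 8 2
```

Gateway A is stopped by its local limit, while gateway B is stopped by the shared counter, so the total across both nodes never exceeds the global limit of 10.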


8️⃣ Redis + Lua Atomic Operation


Why Lua?

A rate limiter needs atomic operations:

read current count
increment count
set TTL
return allow/reject

Without atomicity, two concurrent requests can both read the same count, both pass the check, and together exceed the limit.


Example Logic

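-- KEYS[1] = counter key; ARGV[1] = limit; ARGV[2] = window length in seconds
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call("GET", key)

-- GET returns false when the key is missing: start a new window with a TTL
if current == false then
  redis.call("SET", key, 1, "EX", window)
  return 1
end

-- INCR keeps the TTL set above, so the window still expires on schedule
if tonumber(current) < limit then
  redis.call("INCR", key)
  return 1
else
  return 0
end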

👉 Interview Answer

I would use Redis Lua scripts to make the rate limit check atomic.

The script reads the counter, increments it, sets TTL if needed, and returns allow or reject in one atomic operation.

This prevents race conditions under high concurrency.


9️⃣ Handling Burst Traffic


Why Burst Matters

Some clients legitimately send bursts.

Examples:


Strategies


Example

Short-term: 20 requests / second
Long-term: 1000 requests / hour

👉 Interview Answer

I would handle bursts using token bucket or layered limits.

For example, a user may be allowed 20 requests per second but also capped at 1000 requests per hour.

This supports normal bursts while preventing long-term abuse.
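
A layered limit can be sketched in Python as a set of counters that must all have room before a request is counted. The `LayeredLimiter` class and its fixed-window counters are assumptions chosen to keep the example short; each layer could use any of the algorithms above.

```python
from collections import defaultdict


class LayeredLimiter:
    """Allows a request only if every (limit, window_seconds) layer has room."""

    def __init__(self, layers):
        self.layers = layers  # e.g. [(20, 1), (1000, 3600)]
        self.counts = defaultdict(int)

    def allow(self, key: str, now: float) -> bool:
        buckets = [(key, window, int(now // window)) for _, window in self.layers]
        # Check every layer before counting, so a rejection consumes no quota.
        for (limit, _), bucket in zip(self.layers, buckets):
            if self.counts[bucket] >= limit:
                return False
        for bucket in buckets:
            self.counts[bucket] += 1
        return True


limiter = LayeredLimiter([(20, 1), (1000, 3600)])
first_second = sum(limiter.allow("u123", 0.0) for _ in range(30))
print(first_second)  # 20: the per-second layer caps the burst
```

A client that sustains 20 requests per second exhausts the hourly layer after 50 seconds and is then rejected until the next hour window.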


🔟 Config Management


Config Examples

{
  "endpoint": "/checkout",
  "limit": 100,
  "windowSeconds": 60,
  "algorithm": "token_bucket",
  "scope": "user_id",
  "priority": "critical"
}

Requirements


👉 Interview Answer

Rate limits should be configurable, because different APIs and user tiers have different traffic patterns.

I would store configs in a central config service and cache them in gateway nodes, with safe rollout and emergency override support.
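
As a sketch of the gateway side, the config above could be parsed and cached like this. `RateLimitRule`, `parse_rule`, and `RuleCache` are hypothetical names, and the override mechanism is one possible reading of "emergency override"; fetching from the config service and rollout safety are out of scope here.

```python
import json
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RateLimitRule:
    endpoint: str
    limit: int
    window_seconds: int
    algorithm: str
    scope: str
    priority: str


def parse_rule(raw_json: str) -> RateLimitRule:
    """Validate and convert one JSON config entry."""
    data = json.loads(raw_json)
    if data["limit"] <= 0 or data["windowSeconds"] <= 0:
        raise ValueError("limit and windowSeconds must be positive")
    return RateLimitRule(
        endpoint=data["endpoint"],
        limit=data["limit"],
        window_seconds=data["windowSeconds"],
        algorithm=data["algorithm"],
        scope=data["scope"],
        priority=data["priority"],
    )


class RuleCache:
    """Gateway-local cache; emergency overrides win over regular rules."""

    def __init__(self):
        self.rules = {}
        self.overrides = {}

    def put(self, rule: RateLimitRule, override: bool = False) -> None:
        target = self.overrides if override else self.rules
        target[rule.endpoint] = rule

    def resolve(self, endpoint: str):
        return self.overrides.get(endpoint) or self.rules.get(endpoint)


raw = ('{"endpoint": "/checkout", "limit": 100, "windowSeconds": 60, '
       '"algorithm": "token_bucket", "scope": "user_id", "priority": "critical"}')
cache = RuleCache()
cache.put(parse_rule(raw))
normal_limit = cache.resolve("/checkout").limit
cache.put(replace(cache.resolve("/checkout"), limit=10), override=True)
print(normal_limit, cache.resolve("/checkout").limit)  # 100 10
```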


1️⃣1️⃣ Response Behavior


When Allowed

Forward request to backend.

Add headers:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 87
X-RateLimit-Reset: 1714653600

When Limited

Return:

HTTP/1.1 429 Too Many Requests
Retry-After: 30

Hard Block vs Soft Limit


👉 Interview Answer

When the request is allowed, the system forwards it to the backend and may return rate-limit headers.

When the limit is exceeded, it should return HTTP 429 with Retry-After.

Before enforcing new limits, I would often run them in shadow mode to understand impact.
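
The allowed, limited, and shadow-mode outcomes could be expressed in Python roughly as follows. The function `build_decision` and the shape of its return value are assumptions for illustration, and `Retry-After` is derived here from the window reset time.

```python
import math


def build_decision(allowed, limit, remaining, reset_at, now, shadow_mode=False):
    """Translate a limiter result into an action plus response headers."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_at)),
    }
    if allowed:
        return {"action": "forward", "status": None,
                "headers": headers, "would_reject": False}
    if shadow_mode:
        # Shadow mode: record the would-be rejection but still forward.
        return {"action": "forward", "status": None,
                "headers": headers, "would_reject": True}
    headers["Retry-After"] = str(max(1, math.ceil(reset_at - now)))
    return {"action": "reject", "status": 429,
            "headers": headers, "would_reject": True}


limited = build_decision(False, 100, 0, reset_at=1714653600, now=1714653570)
print(limited["status"], limited["headers"]["Retry-After"])  # 429 30
```

Running a new rule with `shadow_mode=True` first lets the `would_reject` flag be counted in metrics before any user actually receives a 429.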


1️⃣2️⃣ Failure Handling


Common Failures


Fail Open vs Fail Closed

Mode        | Meaning                           | Use Case
Fail open   | Allow traffic when limiter fails  | User-facing low-risk APIs
Fail closed | Reject traffic when limiter fails | Security-sensitive APIs


👉 Interview Answer

Failure behavior depends on the API.

For security-sensitive APIs like login or payment, I may fail closed or use a strict local fallback.

For normal user-facing APIs, I may fail open to avoid blocking legitimate users.

I would also cache configs locally and use circuit breakers around Redis or external dependencies.
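
A minimal sketch of the fail-open versus fail-closed choice in Python. The endpoint set, the function names, and the built-in exception types are assumptions; a real client library may raise its own exception classes, and a circuit breaker would wrap the same call.

```python
# Hypothetical set of endpoints that must reject traffic when the limiter is down.
FAIL_CLOSED_ENDPOINTS = {"/login", "/payment"}


def check_with_fallback(endpoint: str, check) -> bool:
    """Run the limiter check; on store failure apply the endpoint's policy."""
    try:
        return check()
    except (ConnectionError, TimeoutError):
        # Fail closed for sensitive endpoints, fail open for everything else.
        return endpoint not in FAIL_CLOSED_ENDPOINTS


def broken_limiter() -> bool:
    raise ConnectionError("rate limit store unreachable")


print(check_with_fallback("/search", broken_limiter))  # True (fail open)
print(check_with_fallback("/login", broken_limiter))   # False (fail closed)
```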


1️⃣3️⃣ Observability


Metrics


Logs

Track:

request_id
rate_limit_key
algorithm
limit
remaining
decision
reason

👉 Interview Answer

Observability is important because rate limiters directly affect user traffic.

I would track allowed and rejected requests, Redis latency, hot keys, limit hit rates, and the impact on backend errors.

Detailed logs help debug false positives and tune limits safely.
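
For illustration, a decision could be logged as one structured JSON line carrying the fields listed above, with a counter per outcome. The function name, the metric names, and the in-process `Counter` are assumptions standing in for a real logging and metrics pipeline.

```python
import json
from collections import Counter

metrics = Counter()  # stand-in for a real metrics client


def record_decision(request_id, rate_limit_key, algorithm, limit,
                    remaining, allowed, reason):
    """Count the decision and return a structured log line."""
    decision = "allow" if allowed else "reject"
    metrics[f"rate_limit.{decision}"] += 1
    return json.dumps({
        "request_id": request_id,
        "rate_limit_key": rate_limit_key,
        "algorithm": algorithm,
        "limit": limit,
        "remaining": remaining,
        "decision": decision,
        "reason": reason,
    }, sort_keys=True)


line = record_decision("req-1", "user:u123:/checkout", "token_bucket",
                       100, 0, False, "bucket_empty")
print(line)
```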


1️⃣4️⃣ Consistency Model


Stronger Consistency Needed For


Eventual / Approximate Consistency Acceptable For


👉 Interview Answer

Rate limiting usually does not require perfect consistency.

For most APIs, approximate limits are acceptable because the goal is protection and fairness.

But for security-sensitive or billing-related limits, stronger consistency and auditability are more important.


1️⃣5️⃣ End-to-End Flow


Request Flow

Client request
→ API Gateway
→ Build rate limit key
→ Load config
→ Check local limiter
→ Check global Redis limiter
→ Allow or reject
→ Forward to backend if allowed

Redis Limiter Flow

Gateway
→ Execute Redis Lua script
→ Atomically update counter/token state
→ Return allow/reject decision

Limited Flow

Request exceeds limit
→ Return HTTP 429
→ Include Retry-After header
→ Log decision
→ Emit metric

Key Insight

A rate limiter is not just a counter; it is a low-latency distributed control system.


🧠 Staff-Level Answer (Final)


👉 Interview Answer (Full Version)

When designing a rate limiter, I think of it as a low-latency distributed control system that protects backend services from abuse, traffic spikes, and unfair usage.

I would usually place it at the API gateway layer so abusive traffic can be blocked before reaching backend services.

The system should support different limits by user, IP, API key, tenant, endpoint, and user tier.

For algorithms, fixed window is simple but has boundary burst issues. Sliding window log is accurate but expensive. Sliding window counter is a good compromise. Token bucket is often preferred when we want to allow controlled bursts while enforcing a long-term average rate.

In a distributed setup, I would use a hybrid design: local limiters for low-latency protection and Redis-based global limiters for stronger accuracy.

Redis Lua scripts can make counter updates atomic and prevent race conditions under high concurrency.

Rate limit configs should be centrally managed and cached locally in gateway nodes.

When a request exceeds the limit, the system should return HTTP 429 with Retry-After.

Failure behavior depends on the API: security-sensitive APIs may fail closed, while normal user-facing APIs may fail open with local fallback.

The main trade-offs are accuracy, latency, availability, memory usage, and operational complexity.

Ultimately, the goal is to protect backend systems while minimizing impact on legitimate users.


⭐ Final Insight

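The core of a rate limiter is not simple counting; it is achieving fairness, protection, and controllability in a distributed system at very low latency.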



中文部分


🎯 Design Rate Limiter


1️⃣ 核心框架

在设计 Rate Limiter 时,我通常从以下几个方面来分析:

  1. 核心目的和需求
  2. 限流算法
  3. 限流维度:用户、IP、API、租户
  4. 存储和计数策略
  5. 分布式限流
  6. 突发流量处理
  7. 故障处理和 fallback
  8. 核心权衡:准确性 vs 延迟 vs 可用性

2️⃣ 核心需求


功能需求


非功能需求


👉 面试回答

Rate Limiter 用来控制某个 client 在指定时间窗口内可以发送多少请求。

它可以保护后端服务, 防止恶意攻击、突发流量、客户端 bug 以及不公平的资源使用。

核心挑战是在保持极低延迟和高可用的同时, 尽可能准确地执行限流规则。


3️⃣ 主要 API / 集成点


Check Limit API

POST /api/rate-limit/check

Request:

{
  "key": "user:u123",
  "api": "/checkout",
  "cost": 1
}

Response:

{
  "allowed": true,
  "remaining": 99,
  "resetAt": "2026-05-02T10:01:00Z"
}

Gateway 集成

Rate limiter 通常集成在:

Client
→ API Gateway / Load Balancer
→ Rate Limiter
→ Backend Service

超限响应

HTTP/1.1 429 Too Many Requests
Retry-After: 30

👉 面试回答

我通常会将 rate limiter 放在 API Gateway 层, 因为这样可以在流量进入后端服务之前 就拦截异常或恶意请求。

当请求超过限制时, 系统应该返回 HTTP 429, 并带上 Retry-After header。


4️⃣ 限流算法


算法 1:Fixed Window Counter

示例:

100 requests per minute

实现:

key = user_id + current_minute
counter++

优点


缺点

示例:

00:59 发送 100 个请求
01:00 再发送 100 个请求
→ 2 秒内实际发送 200 个请求

👉 面试回答

Fixed window counter 简单高效, 但存在窗口边界问题。

客户端可以在一个窗口末尾和下一个窗口开头连续发送请求, 造成超过预期的突发流量。


算法 2:Sliding Window Log

存储每个请求的时间戳:

user:u123 → [t1, t2, t3, ...]

每次请求时:

  1. 删除窗口外的旧 timestamp
  2. 统计窗口内请求数
  3. 如果 count < limit,则允许

优点


缺点


👉 面试回答

Sliding window log 非常准确, 因为它记录每个请求的时间戳。

但是它的内存和计算成本较高, 特别是对于高 QPS 用户。


算法 3:Sliding Window Counter

使用当前窗口和上一个窗口近似计算。

公式:

effective_count =
current_window_count +
previous_window_count * overlap_ratio

优点


缺点


👉 面试回答

Sliding window counter 是一个实用的折中方案。

它可以避免大部分 fixed window 的边界突发问题, 同时比 sliding window log 使用更少内存。

当系统需要在大规模下保持合理准确性时, 这个方案很常用。


算法 4:Token Bucket

每个 key 有一个 bucket:

bucket capacity = burst size
refill rate = allowed rate

示例:

capacity = 100 tokens
refill = 10 tokens / second

每个请求消耗一个 token。


优点


缺点


👉 面试回答

Token bucket 适合在允许一定突发流量的同时, 控制长期平均请求速率。

每个请求消耗 token, token 会按照固定速率补充。

如果 bucket 为空, 请求会被拒绝或延迟。


算法 5:Leaky Bucket

请求进入队列,并按照固定速率处理。


优点


缺点


👉 面试回答

Leaky bucket 通过固定速率处理请求来平滑流量。

它适合保护下游系统, 但可能引入排队延迟。


推荐算法

对于大多数 API rate limiter:

Token Bucket or Sliding Window Counter

核心理解

Fixed window 简单, sliding window 更准确, token bucket 更适合处理突发流量。


5️⃣ Rate Limit Key 设计


常见 Key

user_id
api_key
ip_address
tenant_id
endpoint
device_id

组合 Key

示例:

user:u123:/checkout
ip:1.2.3.4:/login
tenant:t1:/api/orders
api_key:k123:/search

为什么 Key 设计重要?


👉 面试回答

Rate limit key 的设计非常重要。

我们可以按 user ID、IP、API key、 endpoint、tenant 或它们的组合进行限流。

例如 login API 可以按 IP 和 username 限流, 而付费 API 更适合按 tenant 或 API key 限流。


6️⃣ 数据存储


Local In-memory Counter

gateway node memory

优点

缺点


Redis Counter

Redis key → counter / token bucket state

优点

缺点


Persistent Database

通常不用于热路径。

适合:


👉 面试回答

对于热路径, 我通常会使用 Redis 或本地内存。

Redis 可以提供集中式计数、 原子操作和 TTL; 本地内存更快, 但在分布式系统中不够准确。

持久化数据库更适合存储配置、审计和 billing 数据, 不适合每个请求都查询。


7️⃣ 分布式限流


挑战

在分布式系统中, 流量会被分散到多个 gateway 节点:

Client → Gateway A
Client → Gateway B
Client → Gateway C

每个节点只能看到一部分流量。


方案 1:集中式 Redis

所有 gateway 都检查 Redis。

Gateway → Redis → decision

优点

缺点


方案 2:Local Limiters

每个 gateway 执行一部分限额。

global limit = 1000/sec
10 gateways
each gateway allows 100/sec

优点

缺点


方案 3:Hybrid

使用:

local limiter for fast protection
global Redis limiter for accuracy

👉 面试回答

在分布式系统中, 同一个用户的流量可能打到多个 gateway 节点。

集中式 Redis limiter 可以提供更好的全局准确性, 但会增加延迟和外部依赖。

Local limiter 更快、更高可用, 但准确性较弱。

在生产系统中, 我通常会采用 hybrid 方案: 使用 local limiter 做快速保护, 再用 Redis-based global limiter 做更强的全局限制。


8️⃣ Redis + Lua 原子操作


为什么用 Lua?

Rate limiter 需要原子操作:

读取当前 count
增加 count
设置 TTL
返回 allow / reject

如果不是原子操作, 高并发下可能会超过限制。


示例逻辑

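-- KEYS[1] = counter key; ARGV[1] = limit; ARGV[2] = window length in seconds
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call("GET", key)

-- GET returns false when the key is missing: start a new window with a TTL
if current == false then
  redis.call("SET", key, 1, "EX", window)
  return 1
end

-- INCR keeps the TTL set above, so the window still expires on schedule
if tonumber(current) < limit then
  redis.call("INCR", key)
  return 1
else
  return 0
end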

👉 面试回答

我会使用 Redis Lua script 来保证 rate limit check 的原子性。

脚本会在一次原子操作中完成读取 counter、 increment、设置 TTL, 并返回 allow 或 reject。

这样可以避免高并发下的 race condition。


9️⃣ 突发流量处理


为什么 Burst 重要?

有些 client 合理地会产生突发请求。

例如:


策略


示例

Short-term: 20 requests / second
Long-term: 1000 requests / hour

👉 面试回答

我会使用 token bucket 或 layered limits 来处理突发流量。

例如,一个用户每秒最多可以发 20 个请求, 但每小时最多只能发 1000 个请求。

这样既能支持正常突发, 又能防止长期滥用。


🔟 配置管理


Config Examples

{
  "endpoint": "/checkout",
  "limit": 100,
  "windowSeconds": 60,
  "algorithm": "token_bucket",
  "scope": "user_id",
  "priority": "critical"
}

Requirements


👉 面试回答

Rate limits 应该是可配置的, 因为不同 API 和不同用户等级有不同流量模式。

我会将配置存储在中心化 config service 中, 并缓存到 gateway 节点。

同时需要支持安全发布和紧急 override。


1️⃣1️⃣ 响应行为


当请求被允许

请求转发到后端。

可以添加 header:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 87
X-RateLimit-Reset: 1714653600

当请求被限流

返回:

HTTP 429 Too Many Requests
Retry-After: 30

Hard Block vs Soft Limit


👉 面试回答

当请求被允许时, 系统会将请求转发到后端, 并可以返回 rate-limit headers。

当超过限制时, 系统应该返回 HTTP 429 和 Retry-After。

在强制执行新的限流规则之前, 我通常会先用 shadow mode 观察影响。


1️⃣2️⃣ 故障处理


常见故障


Fail Open vs Fail Closed

Mode        | Meaning | Use Case
Fail open   | limiter 失败时允许请求 | 低风险用户 API
Fail closed | limiter 失败时拒绝请求 | 安全敏感 API

推荐策略


👉 面试回答

故障时的行为取决于 API 类型。

对于 login、payment 等安全敏感 API, 可以 fail closed 或使用严格的本地 fallback。

对于普通用户 API, 可以 fail open, 避免误伤正常用户。

同时我会缓存配置, 并对 Redis 等外部依赖使用 circuit breaker。


1️⃣3️⃣ 可观测性


Metrics


Logs

追踪:

request_id
rate_limit_key
algorithm
limit
remaining
decision
reason

👉 面试回答

可观测性非常重要, 因为 rate limiter 会直接影响用户流量。

我会追踪 allowed / rejected requests、 Redis latency、hot keys、limit hit rate, 以及对后端错误率的影响。

详细日志可以帮助排查 false positives, 并安全地调整限流规则。


1️⃣4️⃣ 一致性模型


需要更强一致性的场景


可以接受近似 / 最终一致的场景


👉 面试回答

Rate limiting 通常不需要完美一致性。

对大多数 API 来说, 近似限制是可以接受的, 因为目标是保护系统和保证公平性。

但对于安全敏感或 billing 相关的限制, 更强一致性和可审计性会更重要。


1️⃣5️⃣ End-to-End Flow


Request Flow

Client request
→ API Gateway
→ Build rate limit key
→ Load config
→ Check local limiter
→ Check global Redis limiter
→ Allow or reject
→ Forward to backend if allowed

Redis Limiter Flow

Gateway
→ Execute Redis Lua script
→ Atomically update counter/token state
→ Return allow/reject decision

Limited Flow

Request exceeds limit
→ Return HTTP 429
→ Include Retry-After header
→ Log decision
→ Emit metric

Key Insight

Rate Limiter 不是简单计数器, 它是一个低延迟的分布式流量控制系统。


🧠 Staff-Level Answer(最终版)


👉 面试回答(完整背诵版)

在设计 Rate Limiter 时, 我会把它看作一个低延迟的分布式流量控制系统, 用来保护后端服务免受滥用、突发流量和不公平资源使用的影响。

我通常会将它放在 API Gateway 层, 这样可以在请求到达后端之前进行拦截。

系统需要支持按 user、IP、API key、tenant、 endpoint 和 user tier 等不同维度限流。

在算法选择上, fixed window 简单但有窗口边界突发问题; sliding window log 准确但成本高; sliding window counter 是一个较好的折中; token bucket 则适合允许受控突发, 同时限制长期平均速率。

在分布式环境中, 我会采用 hybrid 设计: 使用 local limiter 做低延迟保护, 使用 Redis-based global limiter 做更准确的全局限制。

Redis Lua script 可以保证 counter 更新的原子性, 避免高并发下的 race condition。

Rate limit config 应该集中管理, 并缓存在 gateway 节点本地。

当请求超过限制时, 系统应该返回 HTTP 429 和 Retry-After。

故障处理取决于 API 类型: 安全敏感 API 可以 fail closed, 普通用户 API 可以 fail open 并使用本地 fallback。

核心权衡包括准确性、延迟、可用性、 内存使用和运维复杂度。

最终目标是在尽量不影响正常用户的前提下, 保护后端系统并保证资源使用公平。


⭐ Final Insight

Rate Limiter 的核心不是简单计数, 而是在分布式系统中用极低延迟实现公平性、保护性和可控性。
