d&d-t System Design Deep Dive

🎯 Design Rate Limiter

1️⃣ Core Framework

When discussing Rate Limiter design, I frame it as:

Core purpose and requirements
Rate limiting algorithms
Key design dimensions: user, IP, API, tenant
Storage and counter strategy
Distributed rate limiting
Handling burst traffic
Failure handling and fallback
Trade-offs: accuracy vs latency vs availability

2️⃣ Core Requirements

Functional Requirements

Limit number of requests per user / IP / API key
Support different limits for different APIs
Support different limits for different user tiers
Return proper error when limit is exceeded
Support burst control
Support distributed deployment

Non-functional Requirements

Very low latency
High availability
Scalable to high QPS
Accurate enough under distributed traffic
Configurable limits
Minimal impact on normal request path

👉 Interview Answer

A rate limiter controls how many requests a client can make within a given time window.

It protects backend services from abuse, traffic spikes, accidental client bugs, and unfair resource usage.

The main challenge is enforcing limits accurately while keeping latency very low and availability high.

3️⃣ Main APIs / Integration Points

Check Limit API

POST /api/rate-limit/check

Request:

{
  "key": "user:u123",
  "api": "/checkout",
  "cost": 1
}

Response:

{
  "allowed": true,
  "remaining": 99,
  "resetAt": "2026-05-02T10:01:00Z"
}

Gateway Integration

Most commonly, the rate limiter is integrated at the gateway, in front of the backend:

Client
→ API Gateway / Load Balancer
→ Rate Limiter
→ Backend Service

Response When Limited

HTTP/1.1 429 Too Many Requests
Retry-After: 30

👉 Interview Answer

I would usually place the rate limiter at the API gateway layer, because it can block abusive traffic before it reaches backend services.

When a request exceeds the limit, the system should return HTTP 429 with a Retry-After header.

4️⃣ Rate Limiting Algorithms

Algorithm 1: Fixed Window Counter

Example:

100 requests per minute

Implementation:

key = user_id + ":" + current_minute
counter[key]++   (reject once counter[key] exceeds the limit; expire the key after the window)

Pros

Simple
Fast
Memory efficient

Cons

Boundary problem
Allows burst around window edges

Example:

100 requests at 00:00:59
100 requests at 00:01:00
→ 200 requests within about 2 seconds, double the intended 100 per minute

👉 Interview Answer

Fixed window counter is simple and efficient, but it has a boundary problem.

A client can send requests at the end of one window and the beginning of the next window, causing a burst much larger than the intended rate.
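
To make the boundary problem concrete, here is a minimal in-memory sketch in Python. The class name `FixedWindowLimiter` and the explicit `now` argument (seconds, passed in so the example is deterministic) are assumptions for illustration; a real limiter would also expire old counters, for example with a TTL.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Fixed window counter: one counter per (key, window index)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        # Old windows are never cleaned up in this sketch.
        self.counters = defaultdict(int)

    def allow(self, key, now):
        window_index = int(now // self.window_seconds)
        bucket = (key, window_index)
        if self.counters[bucket] >= self.limit:
            return False
        self.counters[bucket] += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_seconds=60)
# 100 requests at t=59s (end of window 0), then 100 at t=60s (start of window 1).
late = sum(limiter.allow("user:u123", now=59.0) for _ in range(100))
early = sum(limiter.allow("user:u123", now=60.0) for _ in range(100))
print(late + early)  # 200: double the limit, accepted around the window edge
```

Each window on its own respects the limit, which is exactly why the burst across the edge goes unnoticed.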

Algorithm 2: Sliding Window Log

Store timestamps of each request.

user:u123 → [t1, t2, t3, ...]

On each request:

Remove timestamps older than window
Count remaining timestamps
Allow if count < limit

Pros

Very accurate
No boundary burst problem

Cons

High memory usage
Expensive for high QPS users

👉 Interview Answer

Sliding window log is accurate because it tracks individual request timestamps.

However, it is expensive in memory and compute, especially for high-QPS clients.
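
The three steps above (remove old timestamps, count, compare) can be sketched as follows. This is an in-memory illustration only; the class name and the `now` parameter are assumptions, and a shared deployment would typically keep the log in an external store instead.

```python
from collections import defaultdict, deque


class SlidingWindowLogLimiter:
    """Sliding window log: keeps one timestamp per accepted request."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs = defaultdict(deque)  # key -> timestamps, oldest first

    def allow(self, key, now):
        log = self.logs[key]
        # Remove timestamps that have fallen out of the window.
        while log and log[0] <= now - self.window_seconds:
            log.popleft()
        # Count what is left and compare against the limit.
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True


limiter = SlidingWindowLogLimiter(limit=3, window_seconds=10)
print([limiter.allow("user:u123", t) for t in (0.0, 1.0, 2.0, 3.0)])
# [True, True, True, False]
print(limiter.allow("user:u123", 10.5))  # True: the request from t=0 has expired
```

Memory grows with the limit: a key allowed 10,000 requests per window can hold up to 10,000 timestamps, which is the cost noted above.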

Algorithm 3: Sliding Window Counter

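Approximates a sliding window from two fixed-window counters: the current window and the previous one.

Formula:

effective_count =
current_window_count +
previous_window_count * overlap_ratio

where overlap_ratio = 1 - (time elapsed in current window / window size), the fraction of the previous window still covered by the sliding window.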

Pros

More accurate than fixed window
Cheaper than sliding log
Good practical compromise

Cons

Approximate
Slightly more complex than fixed window

👉 Interview Answer

Sliding window counter is a practical compromise.

It avoids most fixed-window burst problems while using much less memory than sliding window log.

It is commonly used when we need reasonable accuracy at scale.
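
A worked example of the formula, as a hedged Python sketch. The class name, the `now` parameter, and the choice to reject when adding the request would push `effective_count` past the limit are assumptions for illustration; real implementations differ on that last detail.

```python
class SlidingWindowCounterLimiter:
    """Sliding window counter: current count plus a weighted previous count."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = {}  # (key, window_index) -> count

    def effective_count(self, key, now):
        window_index = int(now // self.window_seconds)
        elapsed = now - window_index * self.window_seconds
        # Fraction of the previous window still inside the sliding window.
        overlap_ratio = 1.0 - elapsed / self.window_seconds
        current = self.counts.get((key, window_index), 0)
        previous = self.counts.get((key, window_index - 1), 0)
        return current + previous * overlap_ratio

    def allow(self, key, now):
        # Reject if counting this request would exceed the limit.
        if self.effective_count(key, now) + 1 > self.limit:
            return False
        bucket = (key, int(now // self.window_seconds))
        self.counts[bucket] = self.counts.get(bucket, 0) + 1
        return True


limiter = SlidingWindowCounterLimiter(limit=100, window_seconds=60)
for _ in range(80):
    limiter.allow("user:u123", now=30.0)  # 80 requests in the previous window

# 15 seconds into the next window: overlap_ratio = 1 - 15/60 = 0.75
print(limiter.effective_count("user:u123", now=75.0))  # 60.0
print(sum(limiter.allow("user:u123", now=75.0) for _ in range(50)))  # 40
```

Only two counters per key are needed, regardless of how high the limit is.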

Algorithm 4: Token Bucket

Each key has a bucket.

bucket capacity = burst size
refill rate = allowed rate

Example:

capacity = 100 tokens
refill = 10 tokens / second

Each request consumes one token.

Pros

Supports burst traffic
Smooth average rate
Widely used in production

Cons

Requires tracking tokens and last refill time
Slightly more complex

👉 Interview Answer

Token bucket is useful when we want to allow controlled bursts while enforcing a long-term average rate.

Each request consumes tokens, and tokens are refilled at a fixed rate.

If the bucket is empty, the request is rejected or delayed.
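
A minimal token bucket sketch using the example numbers above (capacity 100, refill 10 tokens per second). The class name, the lazy refill on each call, and the `cost` argument (mirroring the `cost` field of the check API) are assumptions for illustration.

```python
class TokenBucket:
    """Token bucket with lazy refill: tokens are topped up when a request arrives."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.last_refill = now

    def allow(self, now, cost=1):
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = max(self.last_refill, now)
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True


bucket = TokenBucket(capacity=100, refill_per_second=10, now=0.0)
print(sum(bucket.allow(now=0.0) for _ in range(120)))  # 100: the burst drains the bucket
print(sum(bucket.allow(now=1.0) for _ in range(20)))   # 10: one second of refill
```

Only two values per key are stored (token count and last refill time), which keeps the state small enough for Redis or local memory.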

Algorithm 5: Leaky Bucket

Requests enter a bounded queue and are processed at a fixed rate. Requests that arrive while the queue is full are dropped.

Pros

Smooths traffic
Protects downstream systems

Cons

Adds queueing latency
Queue can overflow
Less suitable when low latency is required

👉 Interview Answer

Leaky bucket smooths traffic by processing requests at a constant rate.

It is useful for protecting downstream systems, but it may add latency because requests can wait in a queue.
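
A rough sketch of the queue-based behavior. The names `LeakyBucket`, `offer`, and `drain` are hypothetical; in practice a worker or scheduler would call `drain` periodically.

```python
from collections import deque


class LeakyBucket:
    """Bounded queue drained at a fixed rate."""

    def __init__(self, capacity, leak_per_second, now=0.0):
        self.capacity = capacity
        self.leak_per_second = leak_per_second
        self.queue = deque()
        self.last_leak = now

    def offer(self, request):
        """Enqueue a request; return False if the queue is full (overflow)."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def drain(self, now):
        """Return the requests that are due for processing by time `now`."""
        due = int((now - self.last_leak) * self.leak_per_second)
        if due <= 0:
            return []
        # Advance by whole leak intervals so fractional time carries over.
        self.last_leak += due / self.leak_per_second
        return [self.queue.popleft() for _ in range(min(due, len(self.queue)))]


bucket = LeakyBucket(capacity=3, leak_per_second=1.0)
print([bucket.offer(r) for r in "abcd"])  # [True, True, True, False]
print(bucket.drain(now=2.0))              # ['a', 'b']
print(bucket.offer("e"))                  # True: draining freed space
```

Request `c` waits more than two seconds before it is processed, which is the queueing latency listed in the cons.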

Recommended Algorithm

For most API rate limiters:

Token Bucket or Sliding Window Counter

Core Insight

Fixed window is simple; sliding window is more accurate; token bucket handles bursts better.

5️⃣ Rate Limit Key Design

Common Keys

user_id
api_key
ip_address
tenant_id
endpoint
device_id

Composite Keys

Examples:

user:u123:/checkout
ip:1.2.3.4:/login
tenant:t1:/api/orders
api_key:k123:/search

Why Key Design Matters

Controls fairness
Prevents abuse
Supports per-endpoint limits
Supports tenant-specific quotas

👉 Interview Answer

Rate limit key design is very important.

We may limit by user ID, IP address, API key, endpoint, tenant, or a combination of them.

For example, login APIs may be limited by IP and username, while paid APIs may be limited by tenant or API key.
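
A small sketch of how composite keys like the ones above might be built. The helper names are hypothetical, and the `scope:identity:endpoint` layout simply follows the examples in this section.

```python
def build_rate_limit_key(scope, identity, endpoint=None):
    """Build a key such as 'user:u123:/checkout' or 'tenant:t1'."""
    if endpoint is None:
        return f"{scope}:{identity}"
    return f"{scope}:{identity}:{endpoint}"


def login_keys(ip, username):
    """Login is limited on two dimensions; a request must pass both keys."""
    return [
        build_rate_limit_key("ip", ip, "/login"),
        build_rate_limit_key("user", username, "/login"),
    ]


print(build_rate_limit_key("user", "u123", "/checkout"))  # user:u123:/checkout
print(login_keys("1.2.3.4", "alice"))
# ['ip:1.2.3.4:/login', 'user:alice:/login']
```

Limiting login on both keys covers one IP trying many usernames as well as many IPs targeting one username.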

6️⃣ Data Storage

In-memory Local Counter

gateway node memory

Pros

Extremely fast
No network call

Cons

Not globally accurate
Each node sees partial traffic
Lost on restart

Redis Counter

Redis key → counter / token bucket state

Pros

Centralized state
Fast
TTL support
Atomic operations with Lua

Cons

Network latency
Redis can become bottleneck
Redis failure affects limiter

Persistent Database

Usually not used on the hot path, because a database round trip per request adds too much latency.

Use Cases

Config storage
Audit logs
Usage reporting
Billing quotas

👉 Interview Answer

For the hot path, I would usually use Redis or local memory.

Redis gives centralized counters with atomic operations and TTL support, while local memory is faster but less accurate in distributed systems.

Persistent databases are better for configuration, auditing, and billing, not per-request rate limit checks.

7️⃣ Distributed Rate Limiting

The Challenge

In distributed systems, traffic is spread across many gateway nodes.

Client → Gateway A
Client → Gateway B
Client → Gateway C

Each node only sees part of the traffic.

Option 1: Centralized Redis

All gateways check Redis.

Gateway → Redis → decision

Pros

More accurate
Simple global limit

Cons

Adds latency
Redis hot keys
Redis dependency

Option 2: Local Limiters

Each gateway enforces a fraction of the limit.

global limit = 1000/sec
10 gateways
each gateway allows 100/sec

Pros

Very fast
Highly available

Cons

Less accurate
Bad with uneven traffic distribution

Option 3: Hybrid

Use:

local limiter for fast protection
global Redis limiter for accuracy

👉 Interview Answer

In distributed systems, a single user’s traffic may hit multiple gateway nodes.

A centralized Redis limiter gives better global accuracy, but adds latency and dependency.

A local limiter is faster and more available, but less accurate.

In production, I would often use a hybrid approach: local limiting for fast protection, and Redis-based global limiting for stronger enforcement.
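
The hybrid idea can be sketched as a cheap local check followed by a shared check. Everything here is an assumption for illustration: `FixedWindowCounter` stands in for both local memory and the shared store (which would be Redis in practice), windows are one second long, and the local cap is set looser than an even share to tolerate uneven traffic.

```python
class FixedWindowCounter:
    """Per-window counter; used here for both the local and the shared limiter."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = {}

    def try_acquire(self, key, window_index):
        bucket = (key, window_index)
        if self.counts.get(bucket, 0) >= self.limit:
            return False
        self.counts[bucket] = self.counts.get(bucket, 0) + 1
        return True


class HybridGateway:
    def __init__(self, local_limit, shared_global):
        self.local = FixedWindowCounter(local_limit)
        self.shared_global = shared_global  # the same object on every gateway

    def allow(self, key, now):
        window_index = int(now)  # one-second windows
        # Local check first: rejects floods without a network call.
        if not self.local.try_acquire(key, window_index):
            return False
        # Global check second: enforces the limit across all gateways.
        return self.shared_global.try_acquire(key, window_index)


shared = FixedWindowCounter(limit=10)  # global limit: 10 per second
gateway_a = HybridGateway(local_limit=6, shared_global=shared)
gateway_b = HybridGateway(local_limit=6, shared_global=shared)

print(sum(gateway_a.allow("user:u123", now=0.2) for _ in range(8)))  # 6: local cap
print(sum(gateway_b.allow("user:u123", now=0.7) for _ in range(8)))  # 4: global cap
```

Gateway B is stopped by the global limit even though its local limiter still had room, which is the accuracy the shared store adds.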

8️⃣ Redis + Lua Atomic Operation

Why Lua?

A rate limiter needs atomic operations:

read current count
increment count
set TTL
return allow/reject

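Without atomicity, concurrent requests can exceed the limit: with a limit of 100, two requests can both read a count of 99, both pass the check, and both increment, leaving the counter at 101.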

Example Logic

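-- KEYS[1] = counter key; ARGV[1] = limit; ARGV[2] = window length in seconds
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call("GET", key)

-- First request in this window: create the counter with a TTL.
if current == false then
  redis.call("SET", key, 1, "EX", window)
  return 1
end

-- Under the limit: count the request (INCR keeps the existing TTL).
if tonumber(current) < limit then
  redis.call("INCR", key)
  return 1
end

-- Limit reached: reject.
return 0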

👉 Interview Answer

I would use Redis Lua scripts to make the rate limit check atomic.

The script reads the counter, increments it, sets TTL if needed, and returns allow or reject in one atomic operation.

This prevents race conditions under high concurrency.
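
The same check-and-increment idea can be shown in a single process, with a lock playing the role that the Lua script plays inside Redis. This is only an analogy to illustrate atomicity, not a Redis client example; the class name is hypothetical.

```python
import threading


class AtomicCounterLimiter:
    """Check and increment under one lock, so no two requests interleave."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0
        self._lock = threading.Lock()

    def allow(self):
        with self._lock:
            if self.count >= self.limit:
                return False
            self.count += 1
            return True


limiter = AtomicCounterLimiter(limit=100)
results = []


def worker():
    for _ in range(10):
        results.append(limiter.allow())


# 50 threads x 10 requests = 500 concurrent attempts against a limit of 100.
threads = [threading.Thread(target=worker) for _ in range(50)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(sum(results))  # 100
```

Because the read and the increment happen as one unit, exactly 100 requests are allowed no matter how the threads are scheduled.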

9️⃣ Handling Burst Traffic

Why Burst Matters

Some clients legitimately send bursts.

Examples:

Page load triggers multiple API calls
Mobile app reconnects
Batch job starts
User refreshes dashboard

Strategies

Token bucket with burst capacity
Separate short-term and long-term limits
Per-endpoint limits
Queue low-priority requests
Return Retry-After

Example

Short-term: 20 requests / second
Long-term: 1000 requests / hour

👉 Interview Answer

I would handle bursts using token bucket or layered limits.

For example, a user may be allowed 20 requests per second but also capped at 1000 requests per hour.

This supports normal bursts while preventing long-term abuse.
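
A sketch of layered limits using the numbers above (20 per second, 1000 per hour). The names are hypothetical. One design choice worth calling out: every layer is checked before any layer is updated, so a request rejected by one layer does not use up quota in another.

```python
class WindowCounter:
    """Fixed-window counter with separate check and record steps."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = {}

    def _bucket(self, key, now):
        return (key, int(now // self.window_seconds))

    def has_room(self, key, now):
        return self.counts.get(self._bucket(key, now), 0) < self.limit

    def record(self, key, now):
        bucket = self._bucket(key, now)
        self.counts[bucket] = self.counts.get(bucket, 0) + 1


def allow_layered(layers, key, now):
    # Check every layer first so a rejection consumes no quota anywhere.
    if not all(layer.has_room(key, now) for layer in layers):
        return False
    for layer in layers:
        layer.record(key, now)
    return True


layers = [
    WindowCounter(limit=20, window_seconds=1),       # short-term
    WindowCounter(limit=1000, window_seconds=3600),  # long-term
]
burst = sum(allow_layered(layers, "user:u123", now=0.0) for _ in range(25))
print(burst)  # 20: the per-second layer caps the burst
```

A client that sustains 20 requests per second exhausts the hourly quota after 50 seconds and is then rejected until the hour rolls over.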

🔟 Config Management

Config Examples

{
  "endpoint": "/checkout",
  "limit": 100,
  "windowSeconds": 60,
  "algorithm": "token_bucket",
  "scope": "user_id",
  "priority": "critical"
}

Requirements

Dynamic config updates
Per-user-tier limits
Per-endpoint limits
Emergency override
Safe rollout

👉 Interview Answer

Rate limits should be configurable, because different APIs and user tiers have different traffic patterns.

I would store configs in a central config service and cache them in gateway nodes, with safe rollout and emergency override support.
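
One way a gateway might resolve the effective limit from cached config. The `tierLimits` field and the `"*"` catch-all endpoint are assumptions added for this sketch; they are not part of the config example above.

```python
import json

# Config as it might be cached on a gateway node.
# "tierLimits" and the "*" default rule are hypothetical additions.
RAW_CONFIG = """
[
  {"endpoint": "/checkout", "limit": 100, "windowSeconds": 60,
   "algorithm": "token_bucket", "scope": "user_id", "priority": "critical",
   "tierLimits": {"premium": 500}},
  {"endpoint": "*", "limit": 1000, "windowSeconds": 60,
   "algorithm": "sliding_window_counter", "scope": "api_key", "priority": "normal"}
]
"""


def load_rules(raw):
    """Index rules by endpoint for constant-time lookup on the request path."""
    return {rule["endpoint"]: rule for rule in json.loads(raw)}


def resolve_limit(rules, endpoint, tier):
    """Pick the endpoint rule (or the default), then apply any tier override."""
    rule = rules.get(endpoint, rules["*"])
    return rule.get("tierLimits", {}).get(tier, rule["limit"])


rules = load_rules(RAW_CONFIG)
print(resolve_limit(rules, "/checkout", "free"))     # 100
print(resolve_limit(rules, "/checkout", "premium"))  # 500
print(resolve_limit(rules, "/search", "free"))       # 1000
```

Keeping the lookup as a local dictionary read means a config service outage does not affect the request path until the cache is refreshed.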

1️⃣1️⃣ Response Behavior

When Allowed

Forward request to backend.

Add headers:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 87
X-RateLimit-Reset: 1714653600

When Limited

Return:

HTTP/1.1 429 Too Many Requests
Retry-After: 30

Hard Block vs Soft Limit

Hard block: reject immediately
Soft limit: allow but log / degrade
Shadow mode: evaluate without blocking

👉 Interview Answer

When the request is allowed, the system forwards it to the backend and may return rate-limit headers.

When the limit is exceeded, it should return HTTP 429 with Retry-After.

Before enforcing new limits, I would often run them in shadow mode to understand impact.
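
A small sketch of how the headers above could be produced. The function names are hypothetical, and deriving `Retry-After` from the window reset time is one reasonable choice, not the only one.

```python
import math


def rate_limit_headers(limit, remaining, reset_epoch):
    """Informational headers attached to allowed responses."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
    }


def rejection_response(limit, reset_epoch, now_epoch):
    """Status and headers for a request that exceeded the limit."""
    headers = rate_limit_headers(limit, 0, reset_epoch)
    # Seconds until the window resets, rounded up, never less than 1.
    headers["Retry-After"] = str(max(1, math.ceil(reset_epoch - now_epoch)))
    return 429, headers


status, headers = rejection_response(
    limit=100, reset_epoch=1714653600, now_epoch=1714653570
)
print(status, headers["Retry-After"])  # 429 30
```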

1️⃣2️⃣ Failure Handling

Common Failures

Redis unavailable
Config service unavailable
Gateway restart
Hot key overload
Clock skew
Network timeout

Fail Open vs Fail Closed

Mode	Meaning	Use Case
Fail open	Allow traffic when limiter fails	User-facing low-risk APIs
Fail closed	Reject traffic when limiter fails	Security-sensitive APIs

👉 Interview Answer

Failure behavior depends on the API.

For security-sensitive APIs like login or payment, I may fail closed or use a strict local fallback.

For normal user-facing APIs, I may fail open to avoid blocking legitimate users.

I would also cache configs locally and use circuit breakers around Redis or external dependencies.
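
A minimal sketch of per-endpoint failure policy. The exception type, the endpoint set, and the function name are hypothetical; a fuller version would add timeouts, a circuit breaker, and possibly a strict local limiter instead of a flat reject.

```python
class LimiterUnavailable(Exception):
    """Raised when the backing store (for example Redis) cannot be reached."""


# Hypothetical policy: these endpoints reject traffic when the limiter is down.
FAIL_CLOSED_ENDPOINTS = {"/login", "/payment"}


def check_with_fallback(check, key, endpoint):
    """Run the limiter check; on failure, apply the endpoint's failure mode."""
    try:
        return check(key)
    except LimiterUnavailable:
        # Fail closed for sensitive endpoints, fail open for everything else.
        return endpoint not in FAIL_CLOSED_ENDPOINTS


def broken_check(key):
    raise LimiterUnavailable("redis timeout")


print(check_with_fallback(broken_check, "ip:1.2.3.4:/login", "/login"))    # False
print(check_with_fallback(broken_check, "user:u123:/search", "/search"))  # True
```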

1️⃣3️⃣ Observability

Metrics

Allowed request count
Rejected request count
Rate limit hit rate
Redis latency
Redis error rate
Hot keys
Config update latency
False positive complaints
Backend protection effectiveness

Logs

Track:

request_id
rate_limit_key
algorithm
limit
remaining
decision
reason

👉 Interview Answer

Observability is important because rate limiters directly affect user traffic.

I would track allowed and rejected requests, Redis latency, hot keys, limit hit rates, and the impact on backend errors.

Detailed logs help debug false positives and tune limits safely.
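
A sketch of emitting one structured log line and one metric per decision, using the fields listed above. The function name and the in-process `Counter` are stand-ins; a real gateway would send these to its logging and metrics pipeline.

```python
import json
from collections import Counter

metrics = Counter()  # stand-in for a real metrics client


def record_decision(request_id, rate_limit_key, algorithm, limit, remaining,
                    allowed, reason):
    """Count the decision and return a JSON log line describing it."""
    metrics["allowed" if allowed else "rejected"] += 1
    return json.dumps({
        "request_id": request_id,
        "rate_limit_key": rate_limit_key,
        "algorithm": algorithm,
        "limit": limit,
        "remaining": remaining,
        "decision": "allow" if allowed else "reject",
        "reason": reason,
    })


line = record_decision("req-1", "user:u123:/checkout", "token_bucket",
                       100, 0, False, "bucket_empty")
print(line)
print(metrics["rejected"])  # 1
```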

1️⃣4️⃣ Consistency Model

Stronger Consistency Needed For

Security-sensitive limits
Payment or login abuse prevention
Billing quota enforcement

Eventual / Approximate Consistency Acceptable For

General API traffic shaping
Soft quotas
Best-effort abuse prevention
Internal service protection

👉 Interview Answer

Rate limiting usually does not require perfect consistency.

For most APIs, approximate limits are acceptable because the goal is protection and fairness.

But for security-sensitive or billing-related limits, stronger consistency and auditability are more important.

1️⃣5️⃣ End-to-End Flow

Request Flow

Client request
→ API Gateway
→ Build rate limit key
→ Load config
→ Check local limiter
→ Check global Redis limiter
→ Allow or reject
→ Forward to backend if allowed

Redis Limiter Flow

Gateway
→ Execute Redis Lua script
→ Atomically update counter/token state
→ Return allow/reject decision

Limited Flow

Request exceeds limit
→ Return HTTP 429
→ Include Retry-After header
→ Log decision
→ Emit metric

Key Insight

A rate limiter is not just a counter — it is a low-latency distributed control system.

🧠 Staff-Level Answer (Final)

👉 Interview Answer (Full Version)

When designing a rate limiter, I think of it as a low-latency distributed control system that protects backend services from abuse, traffic spikes, and unfair usage.

I would usually place it at the API gateway layer so abusive traffic can be blocked before reaching backend services.

The system should support different limits by user, IP, API key, tenant, endpoint, and user tier.

For algorithms, fixed window is simple but has boundary burst issues. Sliding window log is accurate but expensive. Sliding window counter is a good compromise. Token bucket is often preferred when we want to allow controlled bursts while enforcing a long-term average rate.

In a distributed setup, I would use a hybrid design: local limiters for low-latency protection and Redis-based global limiters for stronger accuracy.

Redis Lua scripts can make counter updates atomic and prevent race conditions under high concurrency.

Rate limit configs should be centrally managed and cached locally in gateway nodes.

When a request exceeds the limit, the system should return HTTP 429 with Retry-After.

Failure behavior depends on the API: security-sensitive APIs may fail closed, while normal user-facing APIs may fail open with local fallback.

The main trade-offs are accuracy, latency, availability, memory usage, and operational complexity.

Ultimately, the goal is to protect backend systems while minimizing impact on legitimate users.

⭐ Final Insight

The core of a rate limiter is not simple counting; it is delivering fairness, protection, and control in a distributed system at very low latency.
