🎯 Design API Gateway
1️⃣ Core Framework
When discussing API Gateway design, I frame it as:
- Request routing and service discovery
- Authentication and authorization
- Rate limiting and throttling
- Request / response transformation
- Load balancing and resilience
- Observability and logging
- Security and policy enforcement
- Trade-offs: latency vs control vs reliability
2️⃣ Core Requirements
Functional Requirements
- Route client requests to backend services
- Support path-based and host-based routing
- Authenticate requests
- Authorize access to APIs
- Enforce rate limits
- Support TLS termination
- Support request validation
- Support request / response transformation
- Support API versioning
- Support logging, metrics, and tracing
Non-functional Requirements
- Low latency
- High availability
- High throughput
- Scalable routing
- Secure by default
- Fault isolation
- Good observability
- Graceful degradation
👉 Interview Answer
An API Gateway is the entry point for client traffic.
It handles routing, authentication, authorization, rate limiting, TLS termination, request validation, observability, and resilience policies.
The main challenge is enforcing cross-cutting concerns without adding too much latency or becoming a single point of failure.
3️⃣ Core Concepts
API Gateway
A centralized entry layer between clients and backend services.
Client → API Gateway → Backend Services
Route
A route maps an incoming request to a backend service.
Example:
GET /api/orders/{id} → order-service
POST /api/payments → payment-service
Policy
A policy defines behavior applied at the gateway.
Examples:
- Auth policy
- Rate limit policy
- Retry policy
- Timeout policy
- Logging policy
Upstream Service
The backend service that receives the request.
👉 Interview Answer
I would treat the API Gateway as a policy enforcement and routing layer.
It should centralize cross-cutting concerns like auth, rate limiting, TLS, observability, and traffic control, while keeping business logic inside backend services.
4️⃣ Main APIs / Config
Route Config
{
  "routeId": "orders-get",
  "method": "GET",
  "path": "/api/orders/{orderId}",
  "upstreamService": "order-service",
  "authRequired": true,
  "rateLimitPolicy": "standard-user"
}
Rate Limit Policy
{
  "policyId": "standard-user",
  "limit": 1000,
  "window": "1m",
  "scope": "userId"
}
Service Registry Entry
{
  "serviceName": "order-service",
  "instances": [
    {
      "host": "10.0.1.10",
      "port": 8080,
      "healthy": true
    }
  ]
}
Gateway Admin API
POST /api/gateway/routes
PATCH /api/gateway/routes/{routeId}
GET /api/gateway/metrics
👉 Interview Answer
The gateway is mostly driven by configuration.
Route configs define where traffic goes, policies define how requests are handled, and service discovery tells the gateway which backend instances are healthy.
5️⃣ High-Level Architecture
Client
→ DNS / Global Load Balancer
→ API Gateway Cluster
→ Auth / Policy Engine
→ Rate Limiter
→ Router
→ Load Balancer
→ Backend Services
Gateway Logs / Metrics / Traces
→ Observability Pipeline
Main Components
Listener
- Accepts HTTP / HTTPS requests
- Handles TLS termination
Auth Module
- Validates tokens or API keys
- Extracts user / tenant context
Policy Engine
- Applies route-specific rules
- Enforces auth, quotas, validation, and transformations
Rate Limiter
- Protects backend services
- Enforces user / tenant / IP limits
Router
- Matches request path and method
- Selects upstream service
Load Balancer
- Chooses healthy backend instance
👉 Interview Answer
The gateway receives the request, terminates TLS, authenticates the caller, applies policies, checks rate limits, routes the request, load balances to a healthy backend, and records logs, metrics, and traces.
6️⃣ Request Flow
Client sends request
→ Gateway terminates TLS
→ Match route
→ Authenticate request
→ Authorize access
→ Apply rate limit
→ Validate request
→ Transform request if needed
→ Select backend service
→ Forward request
→ Receive response
→ Transform response if needed
→ Return response to client
→ Emit logs/metrics/traces
👉 Interview Answer
Request processing should be modular.
Each stage handles one responsibility: route matching, authentication, authorization, rate limiting, validation, transformation, forwarding, and observability.
This makes policies easier to configure and reason about.
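The modular stages above can be sketched as a chain of small functions. This is only an illustration under assumptions: `Request`, `GatewayReject`, `run_pipeline`, and the two stub stages are hypothetical names, and a real gateway would forward to an upstream instead of returning 200.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Request:
    method: str
    path: str
    headers: Dict[str, str] = field(default_factory=dict)
    context: Dict[str, str] = field(default_factory=dict)  # filled in by stages


class GatewayReject(Exception):
    """Raised by a stage to stop processing with an HTTP status."""

    def __init__(self, status: int, reason: str):
        super().__init__(reason)
        self.status = status


Stage = Callable[[Request], None]


def authenticate(req: Request) -> None:
    # Stub: a real stage would validate a token or API key here.
    if "Authorization" not in req.headers:
        raise GatewayReject(401, "missing credentials")
    req.context["user_id"] = "u1"  # placeholder identity (assumption)


def check_content_type(req: Request) -> None:
    # Stub validation stage: POST bodies must be JSON.
    if req.method == "POST" and req.headers.get("Content-Type") != "application/json":
        raise GatewayReject(415, "unsupported content type")


def run_pipeline(req: Request, stages: List[Stage]) -> int:
    """Run stages in order; the first rejection short-circuits the rest."""
    try:
        for stage in stages:
            stage(req)
    except GatewayReject as exc:
        return exc.status
    return 200  # stand-in for "forwarded to the upstream"
```

Because each stage has the same signature, a route's policy list can simply be an ordered list of stages built from config.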
7️⃣ Routing
Routing Types
Path-based Routing
/api/users/* → user-service
/api/orders/* → order-service
Host-based Routing
api.example.com → public-api
admin.example.com → admin-api
Header-based Routing
X-Version: v2 → service-v2
Weighted Routing
90% → service-v1
10% → service-v2
Used for:
- Canary release
- Blue-green deployment
- A/B testing
👉 Interview Answer
The gateway should support path-based, host-based, header-based, and weighted routing.
Weighted routing is useful for canary deployments and gradual rollout of new service versions.
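A minimal sketch of path-template matching and weighted selection, assuming an in-memory route table. `ROUTES`, `compile_template`, `match_route`, and `pick_weighted` are hypothetical names; production gateways usually precompile routes into a trie or radix tree rather than scanning a list.

```python
import random
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical route table: (method, path template, upstream service).
ROUTES: List[Tuple[str, str, str]] = [
    ("GET", "/api/orders/{orderId}", "order-service"),
    ("POST", "/api/payments", "payment-service"),
]


def compile_template(template: str) -> re.Pattern:
    """Turn '/api/orders/{orderId}' into a regex with a named group."""
    parts = re.split(r"(\{\w+\})", template)
    pattern = "".join(
        f"(?P<{p[1:-1]}>[^/]+)" if p.startswith("{") else re.escape(p)
        for p in parts
    )
    return re.compile(f"^{pattern}$")


def match_route(method: str, path: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (upstream, path params) for the first matching route, else None."""
    for route_method, template, upstream in ROUTES:
        m = compile_template(template).match(path)
        if route_method == method and m:
            return upstream, m.groupdict()
    return None


def pick_weighted(weights: Dict[str, int], rng: random.Random) -> str:
    """Pick a backend version in proportion to its weight, e.g. 90/10."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

For a canary, `pick_weighted({"service-v1": 90, "service-v2": 10}, rng)` sends roughly one request in ten to the new version; hashing a stable user ID instead of drawing randomly would keep each user on one version.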
8️⃣ Authentication and Authorization
Authentication
Common methods:
- JWT
- OAuth2
- API key
- mTLS
- Session cookie
- Service-to-service token
Gateway Responsibilities
- Validate token signature
- Check token expiration
- Extract claims
- Attach identity context to request
- Reject invalid requests early
Authorization
Can happen at:
Gateway level: coarse-grained access
Service level: fine-grained business permission
👉 Interview Answer
The gateway should handle authentication and coarse-grained authorization.
It can validate JWTs or API keys, extract user and tenant context, and reject unauthorized requests early.
Fine-grained business authorization should still live in backend services.
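To make the gateway's token checks concrete, here is a sketch that verifies signature and expiry and returns the claims. It assumes an HS256 token with a shared secret and uses only the standard library; `sign_hs256` and `verify_hs256` are hypothetical helpers. A production gateway would more likely verify asymmetric signatures against cached JWKS keys with a vetted JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Any, Dict, Optional


def _b64url_decode(data: str) -> bytes:
    return base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def sign_hs256(claims: Dict[str, Any], secret: bytes) -> str:
    """Demo helper only: build a signed HS256 token."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"


def verify_hs256(token: str, secret: bytes, now: Optional[float] = None) -> Dict[str, Any]:
    """Return the claims if signature and expiry are valid, else raise ValueError."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(sig_b64)
        if not isinstance(header, dict) or not isinstance(claims, dict):
            raise ValueError("not JSON objects")
    except Exception as exc:
        raise ValueError("malformed token") from exc
    if header.get("alg") != "HS256":  # never let the token choose the algorithm
        raise ValueError("unexpected algorithm")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    current = time.time() if now is None else now
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= current:  # tokens without exp are rejected
        raise ValueError("token expired")
    return claims
```

The returned claims (for example `sub` and a tenant claim) are what the gateway would attach to the request as identity context.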
9️⃣ Rate Limiting and Throttling
Why Needed?
Protect system from:
- Abuse
- DDoS-like traffic
- Buggy clients
- Noisy tenants
- Backend overload
Common Limit Dimensions
- IP address
- User ID
- Tenant ID
- API key
- Route
- Service
- Region
Algorithms
- Token bucket
- Leaky bucket
- Fixed window
- Sliding window
- Distributed counters
Example
tenant t123:
1000 requests/minute for /api/orders
👉 Interview Answer
Rate limiting protects backend services and enforces fairness.
I would support limits by user, tenant, API key, IP, and route.
Token bucket is a good default because it allows controlled bursts while enforcing average rate.
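A minimal in-process token bucket, keyed by limit dimension, shows how bursts and average rate interact. `TokenBucket` and `RateLimiter` are hypothetical names, and the injected `clock` is an assumption made for testability; limits shared across gateway nodes would need a shared store instead of local memory.

```python
import time
from typing import Callable, Dict, Hashable


class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec      # long-run average rate
        self.capacity = capacity      # maximum burst size
        self.clock = clock
        self.tokens = capacity        # start full so an initial burst is allowed
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class RateLimiter:
    """One bucket per key, e.g. ('t123', '/api/orders')."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.buckets: Dict[Hashable, TokenBucket] = {}

    def allow(self, key: Hashable) -> bool:
        if key not in self.buckets:
            self.buckets[key] = TokenBucket(self.rate_per_sec, self.capacity, self.clock)
        return self.buckets[key].allow()
```

The tenant example above (1000 requests/minute on /api/orders) would correspond to a rate of about 16.7 tokens per second; the capacity then decides how large a burst is tolerated.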
🔟 Request Validation and Transformation
Request Validation
Validate:
- Required headers
- Query parameters
- JSON schema
- Payload size
- Content type
- API version
Request Transformation
Examples:
- Add user context headers
- Rewrite path
- Convert external API format to internal format
- Remove sensitive headers
- Add correlation ID
Response Transformation
Examples:
- Normalize error response
- Remove internal fields
- Compress response
- Add caching headers
👉 Interview Answer
The gateway can validate request shape and apply lightweight transformations.
However, heavy business logic should not live in the gateway, because that makes it harder to maintain and scale independently.
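As an example of a lightweight transformation, the sketch below strips headers a client should not be able to set, attaches identity context, and adds a correlation ID. The header names and the `INTERNAL_HEADERS` set are assumptions for illustration, not a standard.

```python
import uuid
from typing import Dict

# Hypothetical headers that only the gateway may set.
INTERNAL_HEADERS = {"x-user-id", "x-tenant-id", "x-internal-debug"}


def transform_request_headers(headers: Dict[str, str], user_id: str, tenant_id: str) -> Dict[str, str]:
    """Return the headers to forward upstream."""
    # Drop spoofable internal headers regardless of letter case.
    clean = {k: v for k, v in headers.items() if k.lower() not in INTERNAL_HEADERS}
    # Attach identity context derived from the validated token.
    clean["X-User-Id"] = user_id
    clean["X-Tenant-Id"] = tenant_id
    # Keep a caller-supplied correlation ID, otherwise generate one.
    if not any(k.lower() == "x-correlation-id" for k in clean):
        clean["X-Correlation-Id"] = str(uuid.uuid4())
    return clean
```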
1️⃣1️⃣ Service Discovery and Load Balancing
Service Discovery Options
- Static config
- DNS
- Consul / Eureka
- Kubernetes service discovery
- Cloud service registry
Load Balancing Strategies
- Round robin
- Least connections
- Random
- Weighted
- Locality-aware routing
- Health-aware routing
Health Checks
The gateway should avoid unhealthy instances and route only to healthy endpoints.
👉 Interview Answer
The gateway needs service discovery to know where backend services are running.
It should use health-aware load balancing and avoid sending traffic to unhealthy instances.
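Health-aware round robin can be sketched in a few lines. `Instance` mirrors the registry entry shown earlier; `RoundRobinBalancer` is a hypothetical name, and the `healthy` flag is assumed to be updated elsewhere by health checks.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Optional


@dataclass
class Instance:
    host: str
    port: int
    healthy: bool = True


class RoundRobinBalancer:
    def __init__(self, instances: List[Instance]):
        self.instances = instances
        self._counter = count()

    def pick(self) -> Optional[Instance]:
        """Rotate over healthy instances only; None means no capacity."""
        healthy = [i for i in self.instances if i.healthy]
        if not healthy:
            return None  # caller should return a controlled error such as 503
        return healthy[next(self._counter) % len(healthy)]
```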
1️⃣2️⃣ Resilience Policies
Timeout
Every upstream call should have a timeout.
Retry
Retry only operations that are idempotent, so a repeated request cannot apply the same change twice.
Good candidates:
GET
idempotent PUT
idempotent DELETE
Be careful with:
POST payment
POST order
Circuit Breaker
Stop sending traffic to failing service temporarily.
Bulkhead
Limit how many resources one backend can consume.
👉 Interview Answer
The gateway should enforce resilience policies like timeouts, limited retries, circuit breakers, and bulkheads.
Retries must be used carefully, especially for non-idempotent operations like payments or order creation.
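The retry rule and the circuit breaker can be sketched as follows. This is a simplified model under assumptions: `may_retry` and `CircuitBreaker` are hypothetical names, the method set assumes upstreams honor HTTP idempotency semantics, and the half-open state here lets every request through after the cool-down rather than a single probe.

```python
import time
from typing import Callable, Optional

# Methods defined as idempotent by HTTP semantics (assumes upstreams honor them).
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}


def may_retry(method: str, attempt: int, max_attempts: int = 3) -> bool:
    """Allow a retry only for idempotent methods and within the attempt budget."""
    return method.upper() in IDEMPOTENT_METHODS and attempt < max_attempts


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_sec: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_sec = reset_after_sec
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Open: fail fast until the cool-down has passed, then allow trial traffic.
        return self.clock() - self.opened_at >= self.reset_after_sec

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed trial
```

A POST that creates a payment or order would only become retryable if the backend deduplicates it, for example through an idempotency key.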
1️⃣3️⃣ API Versioning
Versioning Approaches
Path Versioning
/api/v1/orders
/api/v2/orders
Header Versioning
Accept-Version: v2
Weighted Version Routing
5% traffic → v2
95% traffic → v1
👉 Interview Answer
The gateway can help with API versioning by routing different versions to different backend services.
Path versioning is simple, while header-based or weighted routing gives more flexibility for gradual migration.
1️⃣4️⃣ Observability
Gateway Should Emit
- Access logs
- Request count
- Error rate
- Latency
- Upstream latency
- Rate limit rejections
- Auth failures
- Route match failures
- Circuit breaker state
- Trace IDs
Important Fields
request_id
trace_id
user_id
tenant_id
route_id
upstream_service
status_code
latency_ms
👉 Interview Answer
The gateway is a great place for observability because all external traffic passes through it.
I would emit access logs, metrics, and distributed traces with request ID, route ID, user ID, tenant ID, upstream service, status code, and latency.
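One way to emit the fields listed above is a single JSON object per request. The function name `access_log_line` and the choice of JSON lines are assumptions; the field names follow the list in this section.

```python
import json
from typing import Optional


def access_log_line(request_id: str, trace_id: str, route_id: str,
                    upstream_service: str, status_code: int,
                    started_at: float, finished_at: float,
                    user_id: Optional[str] = None,
                    tenant_id: Optional[str] = None) -> str:
    """Build one structured access-log record as a JSON string."""
    record = {
        "request_id": request_id,
        "trace_id": trace_id,
        "user_id": user_id,        # None for unauthenticated requests
        "tenant_id": tenant_id,
        "route_id": route_id,
        "upstream_service": upstream_service,
        "status_code": status_code,
        "latency_ms": round((finished_at - started_at) * 1000, 1),
    }
    return json.dumps(record, sort_keys=True)
```

Timestamps would typically come from a monotonic clock taken when the request arrives and when the response is written.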
1️⃣5️⃣ Security
Security Responsibilities
- TLS termination
- mTLS for internal services
- JWT / API key validation
- WAF integration
- Request size limits
- Header sanitization
- CORS policy
- IP allowlist / denylist
- DDoS protection integration
Important Rule
Gateway is not the only security boundary.
Backend services should still validate critical permissions.
👉 Interview Answer
The gateway should enforce common security controls, including TLS, token validation, rate limits, CORS, request size limits, and header sanitization.
But backend services should still validate sensitive business permissions.
1️⃣6️⃣ Caching
What Can Be Cached?
- Public GET responses
- Static metadata
- Auth public keys / JWKS
- Route config
- Service discovery data
- Rate limit counters
Cache Rules
- Respect cache-control headers
- Do not cache user-specific sensitive data accidentally
- Include tenant/user context in cache key if needed
- Use short TTL for dynamic data
👉 Interview Answer
Gateway caching can reduce backend load, especially for public GET requests.
But caching must be safe.
Cache keys must include user or tenant context when responses are personalized.
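The cache-key rule can be illustrated with a small helper. `cache_key` and its `personalized` flag are hypothetical; the sketch refuses to cache a personalized response when no user identity is available, which is one conservative choice among several.

```python
import hashlib
from typing import Optional
from urllib.parse import parse_qsl, urlencode


def cache_key(method: str, path: str, query: str, personalized: bool,
              user_id: Optional[str] = None,
              tenant_id: Optional[str] = None) -> Optional[str]:
    """Return a cache key, or None when the response must not be cached."""
    if method != "GET":
        return None
    # Normalize parameter order so equivalent queries share one entry.
    normalized_query = urlencode(sorted(parse_qsl(query)))
    parts = [method, path, normalized_query]
    if personalized:
        if user_id is None:
            return None  # no identity: skip caching rather than risk a leak
        parts += [tenant_id or "", user_id]
    return hashlib.sha256("\n".join(parts).encode()).hexdigest()
```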
1️⃣7️⃣ Config Management
Config Includes
- Routes
- Upstreams
- Auth policies
- Rate limit policies
- Timeout / retry policies
- Transform rules
- CORS rules
Requirements
- Versioned config
- Validated before publish
- Rollback support
- Gradual rollout
- Audit trail
- Environment-specific config
👉 Interview Answer
Gateway behavior is configuration-driven.
Config changes can affect production traffic immediately, so they should be validated, versioned, audited, and easy to roll back.
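A sketch of "validate, version, roll back" for route config, assuming routes are plain dicts shaped like the Route Config example. `validate_routes` and `ConfigStore` are hypothetical names; auditing and gradual rollout are left out.

```python
import copy
from typing import Dict, List

REQUIRED_FIELDS = ("routeId", "method", "path", "upstreamService")


def validate_routes(routes: List[Dict]) -> List[str]:
    """Return a list of problems; an empty list means the config is publishable."""
    errors, seen = [], set()
    for i, route in enumerate(routes):
        missing = [k for k in REQUIRED_FIELDS if k not in route]
        if missing:
            errors.append(f"route {i}: missing {missing}")
        route_id = route.get("routeId")
        if route_id is not None:
            if route_id in seen:
                errors.append(f"route {i}: duplicate routeId {route_id}")
            seen.add(route_id)
    return errors


class ConfigStore:
    def __init__(self) -> None:
        self.history: List[List[Dict]] = []  # every published version, oldest first
        self.active_version = 0              # 1-based; 0 means nothing published

    def publish(self, routes: List[Dict]) -> int:
        errors = validate_routes(routes)
        if errors:
            raise ValueError("; ".join(errors))  # invalid config never becomes active
        self.history.append(copy.deepcopy(routes))
        self.active_version = len(self.history)
        return self.active_version

    def rollback(self) -> int:
        if self.active_version <= 1:
            raise RuntimeError("no earlier version to roll back to")
        self.active_version -= 1
        return self.active_version

    def active(self) -> List[Dict]:
        if self.active_version == 0:
            raise RuntimeError("no config published yet")
        return self.history[self.active_version - 1]
```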
1️⃣8️⃣ Scaling Patterns
Pattern 1: Stateless Gateway Nodes
Easy horizontal scaling.
Pattern 2: Global Load Balancer
Routes users to nearest healthy region.
Pattern 3: Local Caches
Cache config, JWKS, service discovery, and policies.
Pattern 4: Distributed Rate Limiter
Needed for global limits across gateway nodes.
Pattern 5: Multi-region Deployment
Avoid single-region dependency.
👉 Interview Answer
API Gateway nodes should be mostly stateless, so they can scale horizontally.
Config and discovery data can be cached locally.
For global rate limits, we need a distributed rate limiter or regional limits with reconciliation.
1️⃣9️⃣ Failure Handling
Common Failures
- Backend service down
- Service discovery stale
- Auth provider unavailable
- Rate limiter unavailable
- Gateway config bad
- Upstream timeout
- Partial regional outage
- DDoS traffic spike
Strategies
- Circuit breaker
- Health-aware routing
- Fallback to cached auth keys
- Last-known-good config
- Graceful degradation
- Retry safe requests
- Regional failover
- Emergency deny / allow rules
👉 Interview Answer
The gateway should fail safely.
If config service is unavailable, use last-known-good config.
If auth key fetching fails, use cached public keys until TTL expires.
If an upstream is unhealthy, route around it or return a controlled error.
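The last-known-good behavior might look like the sketch below. `ConfigClient` and the injected `fetch` callable are assumptions; the idea is to keep serving the most recent successful config when the config service fails, and to fail loudly only if no config was ever loaded.

```python
from typing import Callable, Dict, Optional


class ConfigClient:
    def __init__(self, fetch: Callable[[], Dict]):
        self._fetch = fetch                      # hypothetical call to the config service
        self._last_good: Optional[Dict] = None

    def current(self) -> Dict:
        """Return fresh config, or the last successfully fetched one on failure."""
        try:
            self._last_good = self._fetch()
        except Exception:
            if self._last_good is None:
                raise  # nothing to fall back to: refuse to start serving
        return self._last_good
```

The same pattern applies to cached auth public keys, with the difference that those should stop being trusted once their TTL expires.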
2️⃣0️⃣ Consistency Model
Stronger Consistency Needed For
- Security policy changes
- Auth revocation
- Emergency denylist
- Critical route changes
- Audit logs
Eventual Consistency Acceptable For
- Normal route config propagation
- Metrics
- Logs aggregation
- Service discovery updates
- Non-critical rate limit dashboards
👉 Interview Answer
API Gateways use mixed consistency.
Security-sensitive policies and emergency deny rules need fast and reliable propagation.
Normal config changes, metrics, and logs can be eventually consistent.
2️⃣1️⃣ End-to-End Flow
Normal Request Flow
Client request
→ DNS / Load Balancer
→ API Gateway
→ TLS termination
→ Route match
→ Auth validation
→ Rate limit check
→ Request validation
→ Load balance to backend
→ Backend response
→ Gateway emits logs/metrics/traces
→ Return response
Config Update Flow
Admin updates route config
→ Config validation
→ Versioned config saved
→ Config published
→ Gateway nodes pull or receive update
→ Gateways apply new config
→ Metrics monitored
Failure Flow
Backend errors increase
→ Circuit breaker opens
→ Gateway temporarily stops routing to that backend
→ Requests fail fast or use fallback
→ Health checks detect that the service has recovered
→ Circuit breaker closes
Key Insight
An API Gateway is not just a reverse proxy; it is a centralized traffic control, policy enforcement, and resilience layer.
🧠 Staff-Level Answer (Final)
👉 Interview Answer (Full Version)
When designing an API Gateway, I think of it as the entry point and traffic control layer for backend services.
The gateway handles cross-cutting concerns such as routing, TLS termination, authentication, authorization, rate limiting, request validation, transformation, observability, and resilience policies.
A request first reaches the gateway through DNS or a load balancer. The gateway terminates TLS, matches the route, validates the caller’s token or API key, extracts user and tenant context, checks rate limits, validates the request, and forwards it to a healthy backend instance.
Routing can be path-based, host-based, header-based, or weighted for canary releases.
For authentication, the gateway can validate JWTs, API keys, or mTLS certificates, but backend services should still enforce fine-grained business authorization.
Rate limiting should support dimensions such as IP, user, tenant, API key, route, and service.
Token bucket is a good default because it supports bursts while controlling average rate.
The gateway should enforce timeouts, retries only for idempotent requests, circuit breakers, and health-aware load balancing.
API Gateway nodes should be mostly stateless and horizontally scalable.
Route config, service discovery data, auth public keys, and policies can be cached locally.
For failure handling, the gateway should use last-known-good config, cached auth keys, health checks, circuit breakers, and regional failover.
The main trade-offs are latency, reliability, security, operational complexity, and how much logic belongs in the gateway versus backend services.
Ultimately, the goal is to provide a secure, reliable, observable, and scalable entry point for all API traffic without turning the gateway into a business-logic bottleneck.
⭐ Final Insight
The core of an API Gateway is not a simple reverse proxy; it is an entry control layer that combines routing, auth, rate limiting, observability, resilience, and traffic control.
Implement