
🎯 How OpenAI Serves LLM APIs at Scale

1️⃣ Core LLM Serving Framework (Staff-Level)

When discussing an OpenAI-like LLM API serving system, I frame it as:

API gateway and authentication
Quotas and admission control
Model routing
Prompt preprocessing and safety checks
GPU scheduling and batching
Streaming token generation
Observability and billing
Trade-offs: latency vs throughput vs fairness vs cost

2️⃣ Core Problem

LLM serving is expensive, and the cost of a single request is highly variable.

Requests differ by:

model
prompt length
output length
streaming vs non-streaming
tenant priority
latency expectations
safety requirements
tool or multimodal usage

👉 Interview Answer

An OpenAI-like API platform is both an API system and a scarce compute scheduling system. The hard part is protecting accelerator capacity while still giving users low-latency, fair, and reliable responses.

3️⃣ High-Level Architecture

Client Request
   ↓
API Gateway
   ↓
Auth / Quota / Rate Limit
   ↓
Request Validation
   ↓
Model Router
   ↓
Safety / Policy Layer
   ↓
Inference Scheduler
   ↓
GPU Workers
   ↓
Streaming Response
   ↓
Usage Metering and Logs
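The flow above can be read as a chain of cheap checks that run before any GPU work. The sketch below is a minimal illustration under that assumption; `Request`, `Rejected`, the stage functions, and `PIPELINE` are hypothetical names, not part of any real API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Request:
    tenant: str
    model: str
    prompt: str
    trace: List[str] = field(default_factory=list)  # stages passed so far


class Rejected(Exception):
    """Raised by a stage to stop the pipeline before GPU work starts."""


def authenticate(req: Request) -> Request:
    if not req.tenant:
        raise Rejected("unauthenticated")
    req.trace.append("auth")
    return req


def validate(req: Request) -> Request:
    if not req.prompt:
        raise Rejected("empty prompt")
    req.trace.append("validate")
    return req


def route(req: Request) -> Request:
    # A real router would pick a pool; here we only record the decision.
    req.trace.append(f"route:{req.model}")
    return req


# Cheap stages come first so bad requests never reach the scheduler.
PIPELINE: List[Callable[[Request], Request]] = [authenticate, validate, route]


def handle(req: Request) -> Request:
    for stage in PIPELINE:
        req = stage(req)
    return req
```

Ordering the stages from cheapest to most expensive means a request that fails authentication or validation is rejected without consuming accelerator capacity.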

4️⃣ Admission Control

Admission protects scarce compute.

Signals:

tenant quota
request rate
token budget
model capacity
queue depth
priority tier
abuse risk

Possible outcomes:

accept
queue
reject with a rate-limit error (for example, HTTP 429)
route to a smaller model
ask the client to retry later

👉 Interview Answer

Admission control should be token-aware, not only request-count based. A single long prompt with a long output can consume far more GPU time than many small requests.
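One way to make admission token-aware is a per-tenant token bucket that is charged in model tokens rather than in requests. The sketch below assumes a worst-case reservation of `prompt_tokens + max_output_tokens`; `TokenBudget`, `admit`, and the outcome strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    capacity: float        # largest burst of tokens a tenant may spend
    refill_per_sec: float  # sustained tokens per second
    available: float       # tokens currently spendable
    last_refill: float = 0.0

    def refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.last_refill)
        self.available = min(self.capacity,
                             self.available + elapsed * self.refill_per_sec)
        self.last_refill = now


def admit(budget: TokenBudget, prompt_tokens: int, max_output_tokens: int,
          queue_depth: int, max_queue_depth: int, now: float) -> str:
    """Return 'accept', 'reject_rate_limit', or 'retry_later'."""
    # Model pool is saturated: shed load regardless of tenant budget.
    if queue_depth >= max_queue_depth:
        return "retry_later"
    budget.refill(now)
    # Reserve the worst case; unused output tokens could be refunded later.
    cost = prompt_tokens + max_output_tokens
    if cost > budget.available:
        return "reject_rate_limit"
    budget.available -= cost
    return "accept"
```

With a 1000-token budget, one request asking for 900 tokens is treated the same as nine requests asking for 100 each, which a request-count limiter would not capture.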

5️⃣ Model Routing

Routing considers:

requested model
model version
region
capacity
latency
fallback policy
tenant restrictions
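The routing inputs above can be combined into a simple selection rule. The sketch below assumes a policy of "requested model first, then configured fallbacks; in-region pools first, then the pool with the most free capacity". `Pool`, `pick_pool`, and the `fallbacks` mapping are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Pool:
    name: str
    model: str
    region: str
    free_slots: int


def pick_pool(pools: List[Pool], model: str, region: str,
              allowed_models: Set[str],
              fallbacks: Dict[str, List[str]]) -> Optional[Pool]:
    """Try the requested model, then its fallbacks, skipping models the
    tenant may not use and pools without capacity."""
    for candidate in [model] + fallbacks.get(model, []):
        if candidate not in allowed_models:
            continue  # tenant restriction
        usable = [p for p in pools
                  if p.model == candidate and p.free_slots > 0]
        if not usable:
            continue  # no capacity for this model, try the next fallback
        # In-region pools sort first (False < True), then most free slots.
        usable.sort(key=lambda p: (p.region != region, -p.free_slots))
        return usable[0]
    return None
```

Returning `None` leaves the decision to the caller, which can queue the request or reject it with retry metadata.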

6️⃣ GPU Scheduling and Batching

Inference schedulers optimize:

prefill phase
decode phase
batching
KV cache memory
max sequence length
priority queues
cancellation

Core tension:

Larger batches improve throughput but can increase per-request latency.

👉 Interview Answer

GPU scheduling is the core data-plane problem. The scheduler tries to batch compatible requests for utilization while respecting latency targets, token limits, memory pressure, and tenant priority.
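A minimal sketch of batch formation under these constraints follows. It assumes a single priority queue ordered by (priority, arrival time), a cap on batch size, and a per-batch token budget standing in for KV cache memory. `Pending` and `form_batch` are hypothetical names, and production schedulers are considerably more involved.

```python
import heapq
from dataclasses import dataclass, field
from typing import List


@dataclass(order=True)
class Pending:
    priority: int                        # lower value = more urgent
    arrival: float                       # tie-breaker: earlier first
    request_id: str = field(compare=False)
    tokens: int = field(compare=False)   # prompt + max output tokens


def form_batch(queue: List[Pending], max_batch_size: int,
               token_budget: int) -> List[Pending]:
    """Pop requests in (priority, arrival) order until the batch is full.
    Requests that do not fit the remaining token budget are put back."""
    batch: List[Pending] = []
    skipped: List[Pending] = []
    used = 0
    while queue and len(batch) < max_batch_size:
        item = heapq.heappop(queue)
        if used + item.tokens <= token_budget:
            batch.append(item)
            used += item.tokens
        else:
            skipped.append(item)  # too large for this round
    for item in skipped:
        heapq.heappush(queue, item)
    return batch
```

Letting smaller requests pass a large one keeps utilization high but can starve long requests; a fuller design would add aging or a reserved slot for them.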

7️⃣ Streaming Tokens

Streaming improves perceived latency:

Request accepted
  ↓
First token returned
  ↓
Tokens streamed incrementally
  ↓
Finish reason and usage returned

Benefits:

better UX
early feedback
supports interactive apps

Costs:

long-lived connections
cancellation handling
partial output safety and logging
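The streaming flow and its cancellation cost can be sketched with a generator. This is an illustration only: `stream_completion`, the event dictionaries, and the finish-reason strings are assumptions made for the example, and `tokens` stands in for a model's decode loop.

```python
from typing import Callable, Dict, Iterable, Iterator


def stream_completion(tokens: Iterable[str], max_tokens: int,
                      is_cancelled: Callable[[], bool]) -> Iterator[Dict]:
    """Yield one event per token, then a final event carrying the
    finish reason and the number of tokens actually emitted."""
    emitted = 0
    finish_reason = "stop"  # the source ran out of tokens on its own
    for tok in tokens:
        if is_cancelled():
            finish_reason = "cancelled"  # e.g. client disconnected
            break
        if emitted >= max_tokens:
            finish_reason = "length"     # hit the output cap
            break
        emitted += 1
        yield {"type": "token", "text": tok}
    # Usage is reported from what was emitted, not what was requested.
    yield {"type": "done", "finish_reason": finish_reason,
           "completion_tokens": emitted}
```

Checking `is_cancelled()` before each token lets the server stop decoding soon after a client disconnects, which frees capacity for other requests.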

8️⃣ Safety, Billing, and Observability

Production concerns:

content policy checks
abuse detection
prompt and output logging policy
token accounting
latency metrics
GPU utilization
queue time
error rates
per-model cost
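Several of these metrics come from a handful of per-request timestamps. The sketch below assumes four timestamps are recorded per request and uses a nearest-rank percentile; `RequestTiming` and `percentile` are hypothetical helpers, not a specific metrics library.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTiming:
    enqueued_at: float      # request admitted and queued
    started_at: float       # scheduler handed it to a GPU worker
    first_token_at: float   # first token sent to the client
    finished_at: float      # final event sent
    completion_tokens: int

    @property
    def queue_time(self) -> float:
        return self.started_at - self.enqueued_at

    @property
    def time_to_first_token(self) -> float:
        return self.first_token_at - self.enqueued_at


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Tracking queue time separately from time to first token shows whether slow responses come from admission and scheduling or from the model itself.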

9️⃣ Staff-Level Trade-offs

Decision	Benefit	Cost
Aggressive batching	Higher throughput	Higher latency
Strict quotas	Fairness and cost control	Rejections during spikes
Model fallback	Better availability	Quality may differ
Streaming	Better perceived latency	More connection complexity
Long context support	More capability	Higher memory and cost
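The memory cost in the last row can be made concrete with the standard KV cache size estimate for a transformer: two tensors (keys and values) per layer, each of size `kv_heads * head_dim` per token. The model dimensions used in the usage note are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size for one sequence.

    The factor 2 covers the key tensor and the value tensor;
    bytes_per_elem=2 assumes 16-bit storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem)
```

For an assumed model with 32 layers, 8 KV heads, and a head dimension of 128, `kv_cache_bytes(32, 8, 128, 1)` is 131072 bytes (128 KiB) per token, so a single 8192-token sequence needs 1 GiB of cache. Memory grows linearly with sequence length, which is why long contexts reduce how many requests fit in one batch.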

🔟 Failure Handling

Failures:

GPU worker crash
model pool overloaded
long queue time
client disconnect
partial generation failure
quota service timeout

Fallbacks:

retry safe pre-generation failures
fail fast once generation has started, since the client may already hold partial output
shed low-priority traffic
route to alternate pool
return clear retry-after metadata
meter usage only for completed or billable tokens
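The first two fallbacks depend on knowing whether any tokens have been produced yet. The sketch below assumes the serving layer can tell the two cases apart and signals them with two exception types; `PreGenerationError`, `MidGenerationError`, `run_with_retry`, and the response fields are hypothetical.

```python
from typing import Any, Callable, Dict


class PreGenerationError(Exception):
    """Failure before any token was produced; safe to retry."""


class MidGenerationError(Exception):
    """Failure after tokens were produced; retrying could duplicate output."""


def run_with_retry(attempt: Callable[[], Any],
                   max_attempts: int = 3) -> Dict[str, Any]:
    last_error = None
    for _ in range(max_attempts):
        try:
            return {"status": "ok", "result": attempt()}
        except PreGenerationError as err:
            last_error = err  # nothing reached the client, try again
        except MidGenerationError as err:
            # Fail fast: surface the error instead of retrying silently.
            return {"status": "failed", "retryable": False,
                    "error": str(err)}
    # Attempts exhausted: tell the client when it may try again.
    return {"status": "failed", "retryable": True,
            "retry_after_s": 1, "error": str(last_error)}
```

A fuller version would add backoff between attempts and route each retry to an alternate pool; both are omitted here to keep the control flow visible.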

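Recap

Quick Notes

In One Sentence

OpenAI LLM API serving is essentially an API platform plus a scheduling system for scarce GPU compute. The core trade-off is among latency, throughput, fairness, and cost.

Key Points to Memorize

admission control must be token-aware, not based on request count alone
model routing must consider model, capacity, region, tenant, and fallback
the GPU scheduler must balance batching against latency
streaming tokens improves perceived latency
usage metering, safety, observability, and billing are all production requirements

Interview Answer (Recap)

I would split an OpenAI-like LLM API platform into a control plane and a GPU data plane. The API gateway handles authentication, quota, rate limiting, payload validation, and request admission. Admission control cannot look only at request count; it must also consider prompt tokens, max output tokens, model type, tenant priority, and current queue depth, because the GPU time consumed varies widely from one request to another.

Once a request passes through the model router into the right model pool, the inference scheduler dispatches it to a GPU worker. The scheduler has to handle prefill, decode, batching, KV cache memory, sequence length, priority queues, and cancellation. Larger batches raise throughput but can worsen user-facing latency, so this is the core data-plane trade-off.

The staff-level point is that LLM serving is not just a REST API. It is a fair-scheduling problem over expensive compute. A good system balances low latency, high throughput, tenant fairness, cost control, safety checks, and observability.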

✅ Final Interview Answer

An OpenAI-like LLM API platform has a control plane and a GPU-heavy data plane. The API gateway authenticates requests, enforces quotas, validates payloads, and routes traffic to the right model pool. Admission control must be token-aware because prompt and output lengths drive compute cost. The inference scheduler batches compatible requests onto GPUs while balancing throughput, latency, memory pressure, and tenant fairness.

Responses are often streamed token by token to reduce perceived latency, while usage metering, safety checks, abuse detection, and observability run around the serving path. At staff level, the main trade-off is latency versus throughput under scarce accelerator capacity. A good design protects fairness and reliability while keeping GPU utilization high.

System Design Deep Dive - 20 How OpenAI Serves LLM APIs at Scale
