
System Design Deep Dive - 20: How OpenAI Serves LLM APIs at Scale

Post by ailswan May. 26, 2026


🎯 How OpenAI Serves LLM APIs at Scale


1️⃣ Core LLM Serving Framework (Staff-Level)

When discussing an OpenAI-like LLM API serving system, I frame it as:

  1. API gateway and authentication
  2. Quotas and admission control
  3. Model routing
  4. Prompt preprocessing and safety checks
  5. GPU scheduling and batching
  6. Streaming token generation
  7. Observability and billing
  8. Trade-offs: latency vs throughput vs fairness vs cost

2️⃣ Core Problem

LLM serving is expensive and highly variable.

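Requests differ by prompt length, requested output length, and target model, so the GPU time consumed per request varies widely.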


👉 Interview Answer

An OpenAI-like API platform is both an API system and a scarce compute scheduling system. The hard part is protecting accelerator capacity while still giving users low-latency, fair, and reliable responses.


3️⃣ High-Level Architecture

Client Request
   ↓
API Gateway
   ↓
Auth / Quota / Rate Limit
   ↓
Request Validation
   ↓
Model Router
   ↓
Safety / Policy Layer
   ↓
Inference Scheduler
   ↓
GPU Workers
   ↓
Streaming Response
   ↓
Usage Metering and Logs
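
The first few boxes of the diagram can be read as a chain of cheap checks that run before any GPU is touched. Here is a minimal sketch of that idea, assuming hypothetical stage functions, model names, and pool names (none of them come from a real API); the safety layer and the scheduler are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Request:
    api_key: str
    model: str
    prompt: str
    trace: List[str] = field(default_factory=list)  # stages passed so far


class Rejected(Exception):
    """Raised by a stage to stop the request before it reaches a GPU."""


def authenticate(req: Request) -> Request:
    # Hypothetical check: a real gateway would look the key up in a key store.
    if not req.api_key:
        raise Rejected("missing API key")
    req.trace.append("auth")
    return req


def validate(req: Request) -> Request:
    if not req.prompt.strip():
        raise Rejected("empty prompt")
    req.trace.append("validate")
    return req


def route(req: Request) -> Request:
    # Hypothetical model pools; the names are placeholders.
    pools = {"small-model": "pool-a", "large-model": "pool-b"}
    if req.model not in pools:
        raise Rejected(f"unknown model: {req.model}")
    req.trace.append(f"route:{pools[req.model]}")
    return req


STAGES: List[Callable[[Request], Request]] = [authenticate, validate, route]


def run_control_plane(req: Request) -> Request:
    """Run the cheap checks in order; any stage may reject early."""
    for stage in STAGES:
        req = stage(req)
    return req
```

The point of the ordering is cost: a request that fails authentication or validation should never occupy a scheduler queue slot.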

4️⃣ Admission Control

Admission protects scarce compute.

Signals: prompt tokens, max output tokens, model type, tenant priority, and current queue depth.

Possible outcomes:


👉 Interview Answer

Admission control should be token-aware, not only request-count based. A single long prompt with a long output can consume far more GPU time than many small requests.
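
To make "token-aware" concrete, here is a minimal sketch of a per-tenant token bucket that is charged in LLM tokens instead of request count. The class name, the `admit` function, and the three outcomes (`admit`, `throttle`, `reject`) are assumptions for illustration, not a description of any real platform's behavior.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenBudget:
    """Token bucket measured in LLM tokens rather than request count."""
    capacity: float          # maximum tokens that can accumulate
    refill_per_sec: float    # tokens restored per second
    available: float = 0.0
    last_refill: float = 0.0

    def refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.last_refill)
        self.available = min(self.capacity,
                             self.available + elapsed * self.refill_per_sec)
        self.last_refill = now


def admit(budget: TokenBudget, prompt_tokens: int, max_output_tokens: int,
          queue_depth: int, max_queue_depth: int,
          now: Optional[float] = None) -> str:
    """Return 'admit', 'throttle', or 'reject' (hypothetical outcome names)."""
    now = time.monotonic() if now is None else now
    budget.refill(now)
    # Charge the worst case: the prompt plus the full requested output.
    cost = prompt_tokens + max_output_tokens
    if cost > budget.capacity:
        return "reject"      # can never fit in this tenant's budget
    if queue_depth >= max_queue_depth:
        return "reject"      # shed load when the scheduler queue is full
    if cost > budget.available:
        return "throttle"    # caller may retry after the bucket refills
    budget.available -= cost
    return "admit"
```

Charging `prompt_tokens + max_output_tokens` up front is deliberately pessimistic; a real system could refund the unused part once the actual completion length is known.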


5️⃣ Model Routing

Routing considers:


6️⃣ GPU Scheduling and Batching

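Inference schedulers optimize GPU utilization subject to latency targets, token limits, memory pressure (including KV cache), and tenant priority.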

Core tension:

Larger batches improve throughput
but can increase latency.

👉 Interview Answer

GPU scheduling is the core data-plane problem. The scheduler tries to batch compatible requests for utilization while respecting latency targets, token limits, memory pressure, and tenant priority.
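
The batching tension can be sketched as a queue that releases a batch either when it is full enough or when the oldest request has waited too long. This is a simplified request-level sketch with hypothetical names and limits; production inference engines often batch at the decode-step level (continuous batching), which this does not model.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List


@dataclass(order=True)
class Pending:
    priority: int                          # lower value is served first
    seq: int                               # tie-breaker: arrival order
    request_id: str = field(compare=False)
    tokens: int = field(compare=False)     # prompt + max output tokens
    arrived_at: float = field(compare=False)


class BatchScheduler:
    def __init__(self, max_batch_tokens: int, max_batch_size: int,
                 max_wait_s: float):
        self.max_batch_tokens = max_batch_tokens
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self._heap: List[Pending] = []
        self._counter = itertools.count()

    def submit(self, request_id: str, tokens: int, priority: int,
               now: float) -> None:
        if tokens > self.max_batch_tokens:
            raise ValueError("request can never fit in a batch")
        heapq.heappush(self._heap, Pending(priority, next(self._counter),
                                           request_id, tokens, now))

    def next_batch(self, now: float) -> List[str]:
        """Release a batch if it is full enough or the oldest request is late."""
        if not self._heap:
            return []
        oldest = min(p.arrived_at for p in self._heap)
        total = sum(p.tokens for p in self._heap)
        full = (total >= self.max_batch_tokens
                or len(self._heap) >= self.max_batch_size)
        if not full and now - oldest < self.max_wait_s:
            return []  # keep waiting to build a larger batch
        batch, used, skipped = [], 0, []
        while self._heap and len(batch) < self.max_batch_size:
            p = heapq.heappop(self._heap)
            if used + p.tokens <= self.max_batch_tokens:
                batch.append(p.request_id)
                used += p.tokens
            else:
                skipped.append(p)  # does not fit; retry in a later batch
        for p in skipped:
            heapq.heappush(self._heap, p)
        return batch
```

Here `max_wait_s` is the latency knob and `max_batch_tokens` is the throughput and memory knob: raising the first builds larger batches at the cost of time to first token.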


7️⃣ Streaming Tokens

Streaming improves perceived latency:

Request accepted
  ↓
First token returned
  ↓
Tokens streamed incrementally
  ↓
Finish reason and usage returned
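
The flow above maps naturally onto a generator: yield each token as it is produced, then emit one final event carrying the finish reason and usage. This is a minimal sketch; the event shape, the `stop`/`length` finish reasons, and the `token_source` stand-in for a model are illustrative assumptions.

```python
from typing import Dict, Iterable, Iterator


def stream_completion(token_source: Iterable[str], prompt_tokens: int,
                      max_output_tokens: int) -> Iterator[Dict]:
    """Yield token events, then a final event with finish reason and usage."""
    emitted = 0
    finish_reason = "stop"  # the source ran out of tokens on its own
    for tok in token_source:
        if emitted >= max_output_tokens:
            finish_reason = "length"  # cut off by the output limit
            break
        emitted += 1
        yield {"type": "token", "text": tok}
    yield {
        "type": "done",
        "finish_reason": finish_reason,
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": emitted,
            "total_tokens": prompt_tokens + emitted,
        },
    }
```

Because usage is only known at the end, the final event is where metering hooks in; a client that disconnects mid-stream still needs its partial usage recorded on the server side.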

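Benefits: better perceived latency, because the user sees the first token before generation finishes.

Costs: more connection complexity, because each response holds a connection open for the duration of generation.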


8️⃣ Safety, Billing, and Observability

Production concerns: usage metering, safety checks, abuse detection, and observability all run around the serving path.
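
As one example of these concerns, usage metering can be sketched as a per-tenant, per-model token counter with a price table. The class names and the prices (expressed in integer micro-dollars per token to avoid float rounding) are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Price:
    input_micro_usd: int   # hypothetical price per prompt token
    output_micro_usd: int  # hypothetical price per completion token


class UsageMeter:
    def __init__(self, prices: Dict[str, Price]):
        self.prices = prices
        # (tenant, model) -> [prompt_tokens, completion_tokens]
        self.usage: Dict[Tuple[str, str], List[int]] = defaultdict(lambda: [0, 0])

    def record(self, tenant: str, model: str, prompt_tokens: int,
               completion_tokens: int) -> None:
        if model not in self.prices:
            raise KeyError(f"no price configured for model {model!r}")
        totals = self.usage[(tenant, model)]
        totals[0] += prompt_tokens
        totals[1] += completion_tokens

    def cost_micro_usd(self, tenant: str) -> int:
        total = 0
        for (t, model), (prompt, completion) in self.usage.items():
            if t == tenant:
                price = self.prices[model]
                total += (prompt * price.input_micro_usd
                          + completion * price.output_micro_usd)
        return total
```

Pricing input and output tokens separately mirrors the compute asymmetry discussed earlier: prompt and output lengths both drive cost, but not equally.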


9️⃣ Staff-Level Trade-offs

| Decision | Benefit | Cost |
|---|---|---|
| Aggressive batching | Higher throughput | Higher latency |
| Strict quotas | Fairness and cost control | Rejections during spikes |
| Model fallback | Better availability | Quality may differ |
| Streaming | Better perceived latency | More connection complexity |
| Long context support | More capability | Higher memory and cost |
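
The "higher memory" cost of long context comes largely from the KV cache, which grows linearly with sequence length. Here is a rough back-of-the-envelope sketch, assuming a standard transformer decoder that stores one key and one value vector per layer, per KV head, per token; the model dimensions in the usage note are hypothetical and do not describe any specific model.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Estimate KV cache size; bytes_per_value=2 assumes fp16/bf16 storage."""
    # The leading 2 accounts for storing both keys and values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def gib(n_bytes: int) -> float:
    return n_bytes / 2**30
```

For a hypothetical model with 32 layers, 8 KV heads, and a head dimension of 128, `kv_cache_bytes(32, 8, 128, seq_len=1)` is 131072 bytes (128 KiB) per token, so a single 8192-token sequence needs 1 GiB of cache. That memory is held for the lifetime of the request, which is why long-context requests reduce how many sequences fit in a batch.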

🔟 Failure Handling

Failures:

Fallbacks:


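Recap (from the Chinese section)

Quick Notes

In One Sentence

OpenAI LLM API serving is essentially an API platform plus a scheduling system for scarce GPU compute; the core trade-off is among latency, throughput, fairness, and cost.


Key Points to Memorize


Interview Answer

I would split an OpenAI-like LLM API platform into a control plane and a GPU data plane. The API gateway handles authentication, quotas, rate limits, payload validation, and request admission. Admission control cannot look only at request counts; it must also consider prompt tokens, max output tokens, model type, tenant priority, and current queue depth, because the GPU time consumed varies greatly from one request to another.

Once a request passes through the model router into the appropriate model pool, the inference scheduler dispatches it to a GPU worker. The scheduler has to handle prefill, decode, batching, KV cache memory, sequence length, priority queues, and cancellation. Larger batches raise throughput but can worsen user-facing latency, so this is the core data-plane trade-off.

The staff-level point is that LLM serving is not just a REST API. It is a fair-scheduling problem over expensive compute resources. A good system balances low latency, high throughput, tenant fairness, cost control, safety checks, and observability.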


✅ Final Interview Answer

An OpenAI-like LLM API platform has a control plane and a GPU-heavy data plane. The API gateway authenticates requests, enforces quotas, validates payloads, and routes traffic to the right model pool. Admission control must be token-aware because prompt and output lengths drive compute cost. The inference scheduler batches compatible requests onto GPUs while balancing throughput, latency, memory pressure, and tenant fairness.

Responses are often streamed token by token to reduce perceived latency, while usage metering, safety checks, abuse detection, and observability run around the serving path. At staff level, the main trade-off is latency versus throughput under scarce accelerator capacity. A good design protects fairness and reliability while keeping GPU utilization high.
