🎯 How OpenAI Serves LLM APIs at Scale
1️⃣ Core LLM Serving Framework (Staff-Level)
When discussing an OpenAI-like LLM API serving system, I frame it as:
- API gateway and authentication
- Quotas and admission control
- Model routing
- Prompt preprocessing and safety checks
- GPU scheduling and batching
- Streaming token generation
- Observability and billing
- Trade-offs: latency vs throughput vs fairness vs cost
2️⃣ Core Problem
LLM serving is expensive, and the cost of an individual request varies widely.
Requests differ by:
- model
- prompt length
- output length
- streaming vs non-streaming
- tenant priority
- latency expectations
- safety requirements
- tool or multimodal usage
👉 Interview Answer
An OpenAI-like API platform is both an API system and a scarce compute scheduling system. The hard part is protecting accelerator capacity while still giving users low-latency, fair, and reliable responses.
3️⃣ High-Level Architecture
Client Request
↓
API Gateway
↓
Auth / Quota / Rate Limit
↓
Request Validation
↓
Model Router
↓
Safety / Policy Layer
↓
Inference Scheduler
↓
GPU Workers
↓
Streaming Response
↓
Usage Metering and Logs
4️⃣ Admission Control
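Admission control protects scarce compute by deciding, before any GPU work starts, whether a request runs now, waits, or is turned away.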
Signals:
- tenant quota
- request rate
- token budget
- model capacity
- queue depth
- priority tier
- abuse risk
Possible outcomes:
- accept
- queue
- reject with rate limit
- route to smaller model
- ask client to retry
👉 Interview Answer
Admission control should be token-aware, not only request-count based. A single long prompt with a long output can consume far more GPU time than many small requests.
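To make the token-aware point concrete, here is a minimal sketch of an admission check that charges a tenant's budget in tokens rather than in requests. Everything in it is an assumption for illustration: the `TenantBudget` token bucket, the `admit` function, the soft and hard queue limits, and the choice to charge the worst case (`prompt_tokens + max_output_tokens`) up front are hypothetical, not a description of any real provider's implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    QUEUE = "queue"
    REJECT_RATE_LIMIT = "reject_rate_limit"
    RETRY_LATER = "retry_later"


@dataclass
class TenantBudget:
    """Token bucket measured in LLM tokens, not requests (hypothetical model)."""
    capacity: float            # maximum tokens the bucket can hold
    refill_per_sec: float      # tokens restored per second
    available: float           # tokens currently available
    last_refill: float = 0.0   # timestamp of the last refill, in seconds

    def refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.last_refill)
        self.available = min(self.capacity,
                             self.available + elapsed * self.refill_per_sec)
        self.last_refill = now


def admit(budget: TenantBudget, prompt_tokens: int, max_output_tokens: int,
          queue_depth: int, soft_limit: int, hard_limit: int,
          now: float) -> Decision:
    """Decide an outcome from the tenant's token budget and the pool's queue depth."""
    budget.refill(now)
    # Assumption: charge the worst case up front; a real system might refund
    # unused output tokens once generation finishes.
    cost = prompt_tokens + max_output_tokens
    if cost > budget.available:
        return Decision.REJECT_RATE_LIMIT
    if queue_depth >= hard_limit:
        # The tenant has budget, but the model pool is saturated.
        return Decision.RETRY_LATER
    budget.available -= cost
    if queue_depth >= soft_limit:
        return Decision.QUEUE
    return Decision.ACCEPT
```

With this shape, one request with a 6,000-token prompt and a 2,000-token output cap consumes as much budget as 400 requests that use 20 tokens each, which is the asymmetry a request-count limiter cannot see.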
5️⃣ Model Routing
Routing considers:
- requested model
- model version
- region
- capacity
- latency
- fallback policy
- tenant restrictions
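The routing inputs above can be sketched as a small selection function. This is an illustrative sketch only: `ModelPool`, `RoutingRequest`, `route`, the `free_slots` capacity signal, and the model names in the usage note are all hypothetical, and the tie-breaking rule (prefer the caller's region, then lower latency) is one reasonable assumption among many.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class ModelPool:
    model: str
    version: str
    region: str
    free_slots: int          # hypothetical capacity signal
    p50_latency_ms: float    # recent median latency for this pool


@dataclass(frozen=True)
class RoutingRequest:
    model: str
    region: str
    allowed_regions: FrozenSet[str]         # tenant restriction
    fallback_models: Tuple[str, ...] = ()   # fallback policy, in preference order


def route(request: RoutingRequest, pools: List[ModelPool]) -> Optional[ModelPool]:
    """Return the best pool for the request, or None if nothing is eligible."""
    for model in (request.model, *request.fallback_models):
        candidates = [
            p for p in pools
            if p.model == model
            and p.free_slots > 0
            and p.region in request.allowed_regions
        ]
        if candidates:
            # Prefer the caller's own region, then the lowest recent latency.
            return min(candidates,
                       key=lambda p: (p.region != request.region, p.p50_latency_ms))
    return None
```

For example, if the requested `large-model` pool in the caller's region has no free slots, the function first looks for the same model in another allowed region, and only then falls back to a `small-model` pool. Trying the requested model everywhere before falling back reflects the trade-off that fallback improves availability but may change output quality.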
6️⃣ GPU Scheduling and Batching
Inference schedulers optimize:
- prefill phase
- decode phase
- batching
- KV cache memory
- max sequence length
- priority queues
- cancellation
Core tension:
Larger batches improve throughput but can increase per-request latency.
👉 Interview Answer
GPU scheduling is the core data-plane problem. The scheduler tries to batch compatible requests for utilization while respecting latency targets, token limits, memory pressure, and tenant priority.
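A minimal sketch of the batching decision, assuming a single priority queue and a per-batch token budget as a stand-in for memory pressure. The `BatchScheduler` class, its limits, and the "lower number runs first" priority convention are hypothetical. Real inference schedulers are considerably more involved (for example, they treat prefill and decode differently), so read this as an illustration of the tension rather than a production design.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count
from typing import List


@dataclass(order=True)
class QueuedRequest:
    priority: int                      # lower value is scheduled first
    seq: int                           # FIFO tie-break within a priority
    request_id: str = field(compare=False)
    prompt_tokens: int = field(compare=False)
    max_output_tokens: int = field(compare=False)
    cancelled: bool = field(default=False, compare=False)


class BatchScheduler:
    def __init__(self, max_batch_size: int, max_batch_tokens: int) -> None:
        self.max_batch_size = max_batch_size
        self.max_batch_tokens = max_batch_tokens   # proxy for KV cache memory
        self._heap: List[QueuedRequest] = []
        self._seq = count()

    def submit(self, request_id: str, priority: int,
               prompt_tokens: int, max_output_tokens: int) -> QueuedRequest:
        req = QueuedRequest(priority, next(self._seq), request_id,
                            prompt_tokens, max_output_tokens)
        heapq.heappush(self._heap, req)
        return req

    def next_batch(self) -> List[QueuedRequest]:
        """Pick requests in priority order until a size or token limit is hit."""
        batch, deferred, used = [], [], 0
        while self._heap and len(batch) < self.max_batch_size:
            req = heapq.heappop(self._heap)
            if req.cancelled:
                continue               # drop cancelled work without spending GPU time
            need = req.prompt_tokens + req.max_output_tokens
            if used + need > self.max_batch_tokens:
                deferred.append(req)   # does not fit now; try again next round
                continue
            batch.append(req)
            used += need
        for req in deferred:
            heapq.heappush(self._heap, req)
        return batch
```

Two simplifications are worth naming. A request whose own token need exceeds `max_batch_tokens` would be deferred forever, so this sketch assumes admission control rejects such requests earlier. Letting smaller requests jump ahead of a deferred large one raises utilization but can starve the large request, which is the fairness side of the trade-off.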
7️⃣ Streaming Tokens
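Streaming improves perceived latency because the client sees output as soon as the first token is generated instead of waiting for the full completion: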
Request accepted
↓
First token returned
↓
Tokens streamed incrementally
↓
Finish reason and usage returned
Benefits:
- better UX
- early feedback
- supports interactive apps
Costs:
- long-lived connections
- cancellation handling
- partial output safety and logging
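The streaming flow and its cancellation cost can be sketched with an async generator. This is a toy under stated assumptions: `fake_decode` stands in for the model, the event dictionaries and the `"length"` and `"cancelled"` finish reasons are hypothetical shapes chosen for illustration, and a real server would detect client disconnects through its HTTP framework rather than an `asyncio.Event`.

```python
import asyncio
from typing import AsyncIterator, Dict, List, Optional


async def fake_decode(prompt: str, max_tokens: int) -> AsyncIterator[str]:
    """Stand-in for the model: yields one token per decode step."""
    for i in range(max_tokens):
        await asyncio.sleep(0)   # yield control, as a real decode step would
        yield f"tok{i}"


async def stream_completion(prompt: str, max_tokens: int,
                            cancelled: asyncio.Event) -> AsyncIterator[Dict]:
    """Emit token events, then a final event with finish reason and usage."""
    emitted = 0
    finish_reason = "length"     # the fake decoder always runs to max_tokens
    async for token in fake_decode(prompt, max_tokens):
        if cancelled.is_set():
            finish_reason = "cancelled"
            break                # stop decoding to free capacity for other requests
        emitted += 1
        yield {"type": "token", "text": token}
    # Usage counts only the tokens that were actually sent to the client.
    yield {"type": "done", "finish_reason": finish_reason,
           "usage": {"completion_tokens": emitted}}


async def collect(max_tokens: int, cancel_after: Optional[int] = None) -> List[Dict]:
    """Consume the stream, optionally simulating a disconnect after N events."""
    cancelled = asyncio.Event()
    events = []
    async for event in stream_completion("hello", max_tokens, cancelled):
        events.append(event)
        if cancel_after is not None and len(events) == cancel_after:
            cancelled.set()
    return events
```

Calling `asyncio.run(collect(5, cancel_after=2))` yields two token events followed by a final event whose usage reports two completion tokens, which shows why cancellation handling and token accounting have to be designed together.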
8️⃣ Safety, Billing, and Observability
Production concerns:
- content policy checks
- abuse detection
- prompt and output logging policy
- token accounting
- latency metrics
- GPU utilization
- queue time
- error rates
- per-model cost
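Several of these metrics fall out of four timestamps and two token counts per request. The sketch below is one possible record shape. The `RequestTrace` name and its fields are hypothetical, and "time to first token" and "time per output token" are common serving metrics that the list above does not name explicitly.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Per-request timestamps (seconds) and token counts; hypothetical schema."""
    received_at: float
    scheduled_at: float      # when the scheduler placed the request on a GPU
    first_token_at: float
    finished_at: float
    prompt_tokens: int
    completion_tokens: int

    @property
    def queue_time(self) -> float:
        return self.scheduled_at - self.received_at

    @property
    def time_to_first_token(self) -> float:
        return self.first_token_at - self.received_at

    @property
    def time_per_output_token(self) -> float:
        # Average gap between tokens after the first one.
        if self.completion_tokens <= 1:
            return 0.0
        return (self.finished_at - self.first_token_at) / (self.completion_tokens - 1)

    @property
    def total_tokens(self) -> int:
        # Input to token accounting; what is billable is a policy decision.
        return self.prompt_tokens + self.completion_tokens
```

Separating queue time from time to first token helps distinguish "the pool is overloaded" from "prefill is slow for long prompts", which call for different remedies.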
9️⃣ Staff-Level Trade-offs
| Decision | Benefit | Cost |
|---|---|---|
| Aggressive batching | Higher throughput | Higher latency |
| Strict quotas | Fairness and cost control | Rejections during spikes |
| Model fallback | Better availability | Quality may differ |
| Streaming | Better perceived latency | More connection complexity |
| Long context support | More capability | Higher memory and cost |
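
The memory cost of long context in the last row can be made concrete with KV cache arithmetic. In a standard transformer decoder, each token stores one key and one value vector per layer and per KV head, so cache size grows linearly with sequence length. The model dimensions used in the usage note are hypothetical and chosen only to keep the arithmetic round.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * seq_len


def max_concurrent_sequences(memory_budget_bytes: int, seq_len: int,
                             num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """How many sequences of a given length fit in a KV cache memory budget."""
    per_sequence = kv_cache_bytes(seq_len, num_layers, num_kv_heads,
                                  head_dim, bytes_per_value)
    return memory_budget_bytes // per_sequence
```

With 32 layers, 32 KV heads, a head dimension of 128, and 2-byte values, each token costs 512 KiB of cache, so an 8,192-token sequence needs 4 GiB. Under a 40 GiB cache budget that allows 10 such sequences at once, but only 2 sequences at 32,768 tokens, which is why long context reduces achievable batch size and raises cost per request.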
🔟 Failure Handling
Failures:
- GPU worker crash
- model pool overloaded
- long queue time
- client disconnect
- partial generation failure
- quota service timeout
Fallbacks:
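- retry failures that occur before generation starts, since no tokens have been produced and a retry is safe
- fail fast once generation has started, rather than silently restarting a partially streamed response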
- shed low-priority traffic
- route to alternate pool
- return clear retry-after metadata
- account usage only for completed or billable tokens
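A minimal sketch of the retry and accounting rules, assuming failures can be classified by whether generation had started. The exception classes, the result dictionary, and the exponential `retry_after_s` value are hypothetical. Whether partially generated tokens are billable is a policy choice, and this sketch assumes that they are.

```python
from typing import Callable, Dict


class PreGenerationError(Exception):
    """Failure before any token was produced (for example, a quota service timeout)."""


class MidGenerationError(Exception):
    """Failure after generation started (for example, a GPU worker crash)."""

    def __init__(self, tokens_emitted: int) -> None:
        super().__init__(f"failed after {tokens_emitted} tokens")
        self.tokens_emitted = tokens_emitted


def run_with_retry(attempt_fn: Callable[[], int], max_attempts: int = 3) -> Dict:
    """Call attempt_fn, which returns the number of tokens generated or raises."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            tokens = attempt_fn()
            return {"status": "ok", "billable_tokens": tokens, "attempts": attempt}
        except PreGenerationError:
            # Safe to retry: the client has seen nothing and no tokens were used.
            if attempt == max_attempts:
                return {"status": "retry_later", "billable_tokens": 0,
                        "attempts": attempt, "retry_after_s": 2 ** attempt}
        except MidGenerationError as err:
            # Fail fast: the client may already hold partial output.
            return {"status": "failed_mid_stream",
                    "billable_tokens": err.tokens_emitted, "attempts": attempt}
    raise AssertionError("unreachable: every loop path returns by the last attempt")
```

The asymmetry is deliberate. A pre-generation retry is invisible to the client, while a mid-stream retry could duplicate or contradict tokens the client has already received.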
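Recap Section
Quick Notes
In One Sentence
OpenAI LLM API serving is essentially an API platform plus a scheduling system for scarce GPU compute; the core trade-off is among latency, throughput, fairness, and cost.
Key Points to Memorize
- Admission control must be token-aware; it cannot rely on request count alone
- Model routing must consider model, capacity, region, tenant, and fallback
- The GPU scheduler must balance batching against latency
- Streaming tokens improves perceived latency
- Usage metering, safety, observability, and billing are all production requirements
Interview Answer (Recap)
I would split an OpenAI-like LLM API platform into a control plane and a GPU data plane. The API gateway handles authentication, quota, rate limiting, payload validation, and request admission. Admission control cannot look only at request count; it must also consider prompt tokens, max output tokens, model type, tenant priority, and current queue depth, because the GPU time consumed varies widely across requests.
Once a request passes through the model router into the appropriate model pool, the inference scheduler dispatches it to a GPU worker. The scheduler has to handle prefill, decode, batching, KV cache memory, sequence length, priority queues, and cancellation. Larger batches yield higher throughput but can worsen user-facing latency, so this is the core data-plane trade-off.
The staff-level point is that LLM serving is not just a REST API. It is a fair-scheduling problem over expensive compute. A good system balances low latency, high throughput, tenant fairness, cost control, safety checks, and observability.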
✅ Final Interview Answer
An OpenAI-like LLM API platform has a control plane and a GPU-heavy data plane. The API gateway authenticates requests, enforces quotas, validates payloads, and routes traffic to the right model pool. Admission control must be token-aware because prompt and output lengths drive compute cost. The inference scheduler batches compatible requests onto GPUs while balancing throughput, latency, memory pressure, and tenant fairness.
Responses are often streamed token by token to reduce perceived latency, while usage metering, safety checks, abuse detection, and observability run around the serving path. At staff level, the main trade-off is latency versus throughput under scarce accelerator capacity. A good design protects fairness and reliability while keeping GPU utilization high.