aaa-llm LLM Infrastructure ·

🎯 How OpenAI-like LLM APIs Are Built

1️⃣ Core Framework

When discussing OpenAI-like LLM API architecture, I frame it as:

API gateway and request validation
Authentication and rate limiting
Prompt / message processing
Model routing and scheduling
Inference serving
Streaming response delivery
Safety, logging, and monitoring
Trade-offs: latency vs throughput vs cost

2️⃣ What Is an LLM API?

An LLM API exposes large language model capabilities through a backend service.

Clients send requests like:

Prompt / messages
→ LLM API
→ Model inference
→ Generated response

Example API Use Cases

Chat completion
Text generation
Summarization
Code generation
Embeddings
Tool calling
Structured output
Multimodal input

👉 Interview Answer

An LLM API is a production service that exposes model inference capabilities to applications.

It accepts user prompts or messages, validates requests, routes them to the right model, runs inference, streams or returns outputs, and enforces safety, rate limits, logging, and monitoring.

3️⃣ High-Level Architecture

Architecture

Client
→ API Gateway
→ Auth / Rate Limit
→ Request Validator
→ Prompt Processor
→ Model Router
→ Inference Scheduler
→ Model Server
→ Response Streamer
→ Safety / Logging
→ Client Response

Core Components

API Gateway

Handles external traffic.

Auth Layer

Verifies API keys and permissions.

Model Router

Chooses the right model or cluster.

Inference Scheduler

Batches and schedules requests.

Model Server

Runs actual model inference on GPUs.

Response Streamer

Streams generated tokens back to clients.

👉 Interview Answer

A typical LLM API includes an API gateway, authentication, rate limiting, request validation, prompt processing, model routing, inference scheduling, GPU model serving, response streaming, safety checks, and observability.

4️⃣ API Gateway

Role of API Gateway

The API gateway is the front door.

It handles:

TLS termination
Request routing
API versioning
Request size limits
Basic validation
Load balancing
Abuse protection

Example

POST /v1/chat/completions
→ API Gateway
→ Chat Completion Service

Why Important

The gateway protects backend inference systems from bad or abusive traffic.

👉 Interview Answer

The API gateway is the entry point for LLM API traffic.

It handles request routing, TLS, API versioning, request limits, load balancing, and basic protection before requests reach expensive inference infrastructure.

5️⃣ Authentication and Authorization

Authentication

Most LLM APIs use API keys or tokens.

Authorization: Bearer API_KEY

Authorization Checks

The system checks:

Is the key valid?
Which organization owns it?
Which models are allowed?
What quota applies?
Is the account active?
Are there policy restrictions?

Why Important

LLM inference is expensive and sensitive.

👉 Interview Answer

Authentication verifies who is calling the API.

Authorization determines which models, features, quotas, and capabilities the caller can access.

This is critical because LLM inference is expensive and may involve sensitive data.

6️⃣ Rate Limiting and Quotas

Why Rate Limiting Is Needed

LLM APIs must protect:

GPU capacity
Cost budgets
Service reliability
Fairness across tenants
Abuse prevention

Common Limits

Requests per minute
Tokens per minute
Concurrent requests
Daily spend
Model-specific quotas
Organization-level quotas

Example

User exceeds tokens per minute
→ Return 429 Too Many Requests

👉 Interview Answer

Rate limiting protects GPU capacity and service reliability.

LLM APIs usually limit requests, tokens, concurrency, and spending at the user, organization, and model level.

7️⃣ Request Validation

What Gets Validated?

Before inference, the system validates:

JSON schema
Model name
Message format
Token limits
Tool definitions
Output format
Safety constraints
Unsupported parameters

Example

Input too long
→ Reject before inference

Why Important

Bad requests should fail before consuming GPU resources.

👉 Interview Answer

Request validation ensures the input is well-formed and allowed before reaching the model.

This prevents invalid requests, unsupported parameters, excessive token usage, and unnecessary GPU cost.

8️⃣ Prompt and Message Processing

Processing Steps

The system may process:

System messages
Developer instructions
User messages
Conversation history
Tool definitions
Retrieved context
Output schema

Tokenization

Before inference, text is tokenized.

Text
→ Tokenizer
→ Token IDs
→ Model

Context Window Check

Input tokens + max output tokens
≤ model context limit

👉 Interview Answer

Prompt processing prepares messages for inference.

It formats system, developer, user, tool, and context messages, tokenizes the input, and checks context-window limits before sending the request to the model.

9️⃣ Model Routing

Why Routing Is Needed

Different requests may use different models.

Routing depends on:

Requested model
Tenant permission
Region
Model availability
Load
Latency target
Cost policy
Fallback strategy

Example

Request model = fast-small
→ Route to small model cluster

Request model = reasoning-large
→ Route to large model cluster

👉 Interview Answer

Model routing decides which model cluster should handle a request.

It considers requested model, permissions, availability, load, latency target, cost policy, and fallback rules.

🔟 Inference Scheduling

Why Scheduling Is Hard

GPU inference is expensive.

The scheduler must optimize:

Throughput
Latency
GPU utilization
Fairness
Priority
Batch size
Memory usage

Dynamic Batching

Multiple requests can be batched together.

Request A
Request B
Request C
→ Batch
→ GPU inference

Trade-off

Larger batch
→ Better throughput
→ Higher latency

Smaller batch
→ Lower latency
→ Lower GPU efficiency

👉 Interview Answer

Inference scheduling is responsible for efficiently using GPU capacity.

It batches requests, manages priorities, controls concurrency, and balances latency against throughput and cost.

1️⃣1️⃣ Model Serving

Model Server Role

The model server runs inference.

It handles:

Loading model weights
GPU memory management
Token generation
KV cache management
Batch execution
Sampling
Streaming tokens

Token Generation

Input tokens
→ Model forward pass
→ Next token
→ Repeat until stop condition

Stop Conditions

Max tokens reached
Stop sequence matched
End-of-text token
Safety stop
Client disconnect

👉 Interview Answer

The model server is responsible for running the neural network inference.

It loads model weights, manages GPU memory, handles token generation, maintains KV cache, applies sampling, and produces output tokens.

1️⃣2️⃣ Streaming Responses

Why Streaming Matters

LLM generation can take time.

Streaming improves user experience.

Token 1 → Client
Token 2 → Client
Token 3 → Client

Streaming Flow

Model generates token
→ Server sends token event
→ Client renders partial response

Benefits

Lower perceived latency
Better UX
Useful for long responses
Allows early cancellation

👉 Interview Answer

Streaming sends generated tokens to the client as they are produced.

This reduces perceived latency, improves user experience, and allows clients to display partial responses before generation completes.

1️⃣3️⃣ Safety and Policy Layer

Why Safety Is Needed

LLM APIs may generate unsafe or policy-violating content.

The system may check:

Input safety
Output safety
Tool-use safety
PII leakage
Abuse patterns
Prompt injection risks

Safety Flow

Request
→ Input safety checks
→ Model inference
→ Output safety checks
→ Response

Important Point

Safety can happen before, during, or after inference.

👉 Interview Answer

LLM APIs need safety and policy layers to detect unsafe input, unsafe output, abuse, data leakage, and risky tool use.

Safety checks can happen before inference, during generation, or after the output is produced.

1️⃣4️⃣ Observability

What to Log

Request ID
Organization ID
Model name
Prompt token count
Output token count
Latency
Queue time
GPU time
Error type
Rate limit events
Safety flags
Cost estimate

Why Important

Observability helps debug:

Slow requests
Failed requests
Model regressions
Cost spikes
Abuse patterns
Capacity issues

👉 Interview Answer

Observability is critical for LLM APIs.

I would monitor request volume, token usage, latency, queue time, GPU utilization, errors, safety flags, rate limits, and cost per model.

This makes the platform debuggable and scalable.

1️⃣5️⃣ Reliability and Fallbacks

Common Failures

LLM APIs can fail because of:

GPU overload
Model server crash
Queue timeout
Rate limit spike
Bad request
Network failure
Safety block
Region outage

Fallback Strategies

Retry safe failures
Route to another replica
Route to fallback model
Return partial response
Graceful error message
Circuit breaker
Load shedding

👉 Interview Answer

LLM APIs need strong reliability controls.

The system should handle overload, model server failures, timeouts, and regional issues with retries, fallback routing, circuit breakers, and load shedding.

1️⃣6️⃣ Cost Control

Main Cost Drivers

Model size
Input tokens
Output tokens
Batch size
GPU time
KV cache memory
Streaming duration
Retry count

Cost Controls

Token limits
Rate limits
Model routing
Caching
Prompt compression
Smaller model fallback
Quotas and budgets

👉 Interview Answer

LLM API cost is mainly driven by model size, token count, GPU time, retries, and concurrency.

Production systems control cost through rate limits, token limits, routing policies, caching, quotas, and smaller-model fallbacks.

🧠 Staff-Level Answer Final

👉 Interview Answer Full Version

An OpenAI-like LLM API is a production inference platform that exposes model capabilities through a scalable API.

The system starts with an API gateway that handles routing, TLS, request limits, versioning, and basic protection.

Then authentication and authorization verify the API key, organization, model permissions, quotas, and feature access.

Before inference, the request validator checks JSON format, model name, message structure, tool definitions, token limits, and unsupported parameters.

Prompt processing then formats messages, applies system and developer instructions, tokenizes input, and checks context-window limits.

The model router chooses the right model cluster based on the requested model, permissions, load, availability, latency target, region, and cost policy.

The inference scheduler is critical because GPU inference is expensive.

It batches requests, manages priorities, controls concurrency, and balances latency, throughput, fairness, and GPU utilization.

The model server loads model weights, manages GPU memory, maintains KV cache, performs token generation, applies sampling, and streams tokens back to the response layer.

Streaming improves user experience by returning tokens as they are generated.

Around the inference path, the platform needs safety checks, abuse detection, logging, monitoring, rate limits, quota enforcement, billing, and cost controls.

Reliability requires retries, fallback routing, circuit breakers, load shedding, and regional failover.

The main engineering trade-off is balancing latency, throughput, quality, cost, and safety.

A good LLM API is not just model inference.

It is a full distributed system around expensive GPU serving.

⭐ Final Insight

OpenAI-like LLM API 的核心不是：

“HTTP API 调一个模型”

而是：

API Gateway

Auth

Rate Limit

Request Validation

Prompt Processing

Model Routing

Inference Scheduling

GPU Model Serving

Streaming

Safety

Observability

Billing

Cost Control。

真正难的是：

如何在高并发下平衡：

latency、throughput、quality、cost、safety。

最重要的一句话：

LLM API is not just model inference.

It is GPU-scale distributed systems engineering.

中文部分

🎯 How OpenAI-like LLM APIs Are Built

1️⃣ 核心框架

讨论 OpenAI-like LLM API architecture 时，我通常从这些方面分析：

API gateway and request validation
Authentication and rate limiting
Prompt / message processing
Model routing and scheduling
Inference serving
Streaming response delivery
Safety, logging, and monitoring
核心权衡：latency vs throughput vs cost

2️⃣ 什么是 LLM API？

LLM API 是把 large language model capabilities 通过 backend service 暴露给 applications。

Clients 发送请求：

Prompt / messages
→ LLM API
→ Model inference
→ Generated response

Example API Use Cases

Chat completion
Text generation
Summarization
Code generation
Embeddings
Tool calling
Structured output
Multimodal input

👉 面试回答

LLM API 是一个 production service，用来向 applications 暴露 model inference capabilities。

它接收 user prompts 或 messages， validate requests， route 到正确 model，执行 inference， stream 或 return outputs，并执行 safety、rate limits、 logging 和 monitoring。

3️⃣ High-Level Architecture

Architecture

Client
→ API Gateway
→ Auth / Rate Limit
→ Request Validator
→ Prompt Processor
→ Model Router
→ Inference Scheduler
→ Model Server
→ Response Streamer
→ Safety / Logging
→ Client Response

Core Components

API Gateway

处理 external traffic。

Auth Layer

验证 API keys 和 permissions。

Model Router

选择正确 model 或 cluster。

Inference Scheduler

批处理和调度 requests。

Model Server

在 GPUs 上运行真正 model inference。

Response Streamer

把 generated tokens stream 回 clients。

👉 面试回答

典型 LLM API 包含 API gateway、 authentication、rate limiting、 request validation、prompt processing、 model routing、inference scheduling、 GPU model serving、response streaming、 safety checks 和 observability。

4️⃣ API Gateway

API Gateway 的作用

API gateway 是入口。

它处理：

TLS termination
Request routing
API versioning
Request size limits
Basic validation
Load balancing
Abuse protection

Example

POST /v1/chat/completions
→ API Gateway
→ Chat Completion Service

为什么重要？

Gateway 保护 backend inference systems，避免 bad 或 abusive traffic 直接打到昂贵资源。

👉 面试回答

API gateway 是 LLM API traffic 的入口。

它处理 request routing、TLS、 API versioning、request limits、 load balancing 和 basic protection，避免请求直接打到 expensive inference infrastructure。

5️⃣ Authentication and Authorization

Authentication

大多数 LLM APIs 使用 API keys 或 tokens。

Authorization: Bearer API_KEY

Authorization Checks

系统检查：

Key 是否 valid？
属于哪个 organization？
允许哪些 models？
适用什么 quota？
Account 是否 active？
是否有 policy restrictions？

为什么重要？

LLM inference 昂贵且可能涉及 sensitive data。

👉 面试回答

Authentication 验证谁在调用 API。

Authorization 决定 caller 可以访问哪些 models、 features、quotas 和 capabilities。

这很重要，因为 LLM inference 昂贵，且可能涉及 sensitive data。

6️⃣ Rate Limiting and Quotas

为什么需要 Rate Limiting？

LLM APIs 必须保护：

GPU capacity
Cost budgets
Service reliability
Fairness across tenants
Abuse prevention

Common Limits

Requests per minute
Tokens per minute
Concurrent requests
Daily spend
Model-specific quotas
Organization-level quotas

Example

User exceeds tokens per minute
→ Return 429 Too Many Requests

👉 面试回答

Rate limiting 保护 GPU capacity 和 service reliability。

LLM APIs 通常会在 user、organization 和 model level 限制 requests、tokens、 concurrency 和 spending。

7️⃣ Request Validation

验证什么？

Inference 前，系统会验证：

JSON schema
Model name
Message format
Token limits
Tool definitions
Output format
Safety constraints
Unsupported parameters

Example

Input too long
→ Reject before inference

为什么重要？

Bad requests 应该在消耗 GPU 前失败。

👉 面试回答

Request validation 确保 input 在到达 model 前是 well-formed 且 allowed 的。

这能避免 invalid requests、 unsupported parameters、 excessive token usage 和不必要的 GPU cost。

8️⃣ Prompt and Message Processing

Processing Steps

系统可能处理：

System messages
Developer instructions
User messages
Conversation history
Tool definitions
Retrieved context
Output schema

Tokenization

Inference 前， text 会被 tokenized。

Text
→ Tokenizer
→ Token IDs
→ Model

Context Window Check

Input tokens + max output tokens
≤ model context limit

👉 面试回答

Prompt processing 会为 inference 准备 messages。

它格式化 system、developer、user、 tool 和 context messages， tokenize input，并检查 context-window limits，然后再发送给 model。

9️⃣ Model Routing

为什么需要 Routing？

不同 requests 可能使用不同 models。

Routing 取决于：

Requested model
Tenant permission
Region
Model availability
Load
Latency target
Cost policy
Fallback strategy

Example

Request model = fast-small
→ Route to small model cluster

Request model = reasoning-large
→ Route to large model cluster

👉 面试回答

Model routing 决定哪个 model cluster 处理 request。

它考虑 requested model、permissions、 availability、load、latency target、 cost policy 和 fallback rules。

🔟 Inference Scheduling

为什么 Scheduling 很难？

GPU inference 很昂贵。

Scheduler 必须优化：

Throughput
Latency
GPU utilization
Fairness
Priority
Batch size
Memory usage

Dynamic Batching

多个 requests 可以 batch 在一起。

Request A
Request B
Request C
→ Batch
→ GPU inference

Trade-off

Larger batch
→ Better throughput
→ Higher latency

Smaller batch
→ Lower latency
→ Lower GPU efficiency

👉 面试回答

Inference scheduling 负责高效使用 GPU capacity。

它 batch requests、管理 priorities、控制 concurrency，并在 latency、throughput 和 cost 之间做平衡。

1️⃣1️⃣ Model Serving

Model Server Role

Model server 运行 inference。

它处理：

Loading model weights
GPU memory management
Token generation
KV cache management
Batch execution
Sampling
Streaming tokens

Token Generation

Input tokens
→ Model forward pass
→ Next token
→ Repeat until stop condition

Stop Conditions

Max tokens reached
Stop sequence matched
End-of-text token
Safety stop
Client disconnect

👉 面试回答

Model server 负责运行 neural network inference。

它加载 model weights，管理 GPU memory，处理 token generation，维护 KV cache，应用 sampling，并生成 output tokens。

1️⃣2️⃣ Streaming Responses

为什么 Streaming 重要？

LLM generation 可能需要时间。

Streaming 改善用户体验。

Token 1 → Client
Token 2 → Client
Token 3 → Client

Streaming Flow

Model generates token
→ Server sends token event
→ Client renders partial response

Benefits

Lower perceived latency
Better UX
Useful for long responses
Allows early cancellation

👉 面试回答

Streaming 会在 generated tokens 产生时立即发送给 client。

这降低 perceived latency，改善 user experience，并允许 client 在 generation 完成前展示 partial response。

1️⃣3️⃣ Safety and Policy Layer

为什么需要 Safety？

LLM APIs 可能生成 unsafe 或 policy-violating content。

系统可能检查：

Input safety
Output safety
Tool-use safety
PII leakage
Abuse patterns
Prompt injection risks

Safety Flow

Request
→ Input safety checks
→ Model inference
→ Output safety checks
→ Response

Important Point

Safety 可以发生在 inference 前、过程中或之后。

👉 面试回答

LLM APIs 需要 safety 和 policy layers，用来检测 unsafe input、unsafe output、 abuse、data leakage 和 risky tool use。

Safety checks 可以发生在 inference 前、 generation 中或 output 产生后。

1️⃣4️⃣ Observability

What to Log

Request ID
Organization ID
Model name
Prompt token count
Output token count
Latency
Queue time
GPU time
Error type
Rate limit events
Safety flags
Cost estimate

为什么重要？

Observability 帮助 debug：

Slow requests
Failed requests
Model regressions
Cost spikes
Abuse patterns
Capacity issues

👉 面试回答

Observability 对 LLM APIs 非常关键。

我会监控 request volume、token usage、 latency、queue time、GPU utilization、 errors、safety flags、rate limits 和 cost per model。

这样平台才可以 debug 和 scale。

1️⃣5️⃣ Reliability and Fallbacks

Common Failures

LLM APIs 可能因为这些原因失败：

GPU overload
Model server crash
Queue timeout
Rate limit spike
Bad request
Network failure
Safety block
Region outage

Fallback Strategies

Retry safe failures
Route to another replica
Route to fallback model
Return partial response
Graceful error message
Circuit breaker
Load shedding

👉 面试回答

LLM APIs 需要强 reliability controls。

系统应该用 retries、fallback routing、 circuit breakers 和 load shedding 处理 overload、model server failures、 timeouts 和 regional issues。

1️⃣6️⃣ Cost Control

Main Cost Drivers

Model size
Input tokens
Output tokens
Batch size
GPU time
KV cache memory
Streaming duration
Retry count

Cost Controls

Token limits
Rate limits
Model routing
Caching
Prompt compression
Smaller model fallback
Quotas and budgets

👉 面试回答

LLM API cost 主要由 model size、 token count、GPU time、retries 和 concurrency 驱动。

Production systems 通过 rate limits、 token limits、routing policies、caching、 quotas 和 smaller-model fallbacks 控制 cost。

🧠 Staff-Level Answer Final

👉 面试回答完整版本

OpenAI-like LLM API 是一个通过 scalable API 暴露 model capabilities 的 production inference platform。

系统从 API gateway 开始，处理 routing、TLS、request limits、 versioning 和 basic protection。

接着 authentication 和 authorization 会验证 API key、organization、 model permissions、quotas 和 feature access。

在 inference 前， request validator 会检查 JSON format、 model name、message structure、 tool definitions、token limits 和 unsupported parameters。

Prompt processing 会格式化 messages，应用 system 和 developer instructions， tokenize input，并检查 context-window limits。

Model router 根据 requested model、 permissions、load、availability、 latency target、region 和 cost policy 选择正确 model cluster。

Inference scheduler 非常关键，因为 GPU inference 昂贵。

它会 batch requests、管理 priorities、控制 concurrency，并平衡 latency、throughput、 fairness 和 GPU utilization。

Model server 加载 model weights，管理 GPU memory，维护 KV cache，执行 token generation，应用 sampling，并把 tokens stream 回 response layer。

Streaming 通过在 tokens 生成时返回，改善 user experience。

在 inference path 周围，平台需要 safety checks、abuse detection、 logging、monitoring、rate limits、 quota enforcement、billing 和 cost controls。

Reliability 需要 retries、fallback routing、 circuit breakers、load shedding 和 regional failover。

主要工程权衡是： latency、throughput、quality、cost 和 safety。

一个好的 LLM API 不只是 model inference。

它是围绕 expensive GPU serving 构建的完整 distributed system。

⭐ Final Insight

OpenAI-like LLM API 的核心不是：

“HTTP API 调一个模型”

而是：

API Gateway

Auth

Rate Limit

Request Validation

Prompt Processing

Model Routing

Inference Scheduling

GPU Model Serving

Streaming

Safety

Observability

Billing

Cost Control。

真正难的是：

如何在高并发下平衡：

latency、throughput、quality、cost、safety。

最重要的一句话：

LLM API is not just model inference.

It is GPU-scale distributed systems engineering.