🎯 LLM Inference Pipeline Explained
1️⃣ Core Framework
When discussing the LLM Inference Pipeline, I frame it as:
- Request intake and validation
- Prompt formatting and tokenization
- Model routing and scheduling
- Prefill phase
- Decode phase
- Sampling and stopping
- Streaming response
- Trade-offs: latency vs throughput vs cost
2️⃣ What Is LLM Inference?
LLM inference is the process of running a trained model to generate output for a user request.
User Input
→ Tokens
→ Model Forward Pass
→ Next Token Prediction
→ Generated Output
Important Point
Inference is not training.
Training updates model weights.
Inference uses fixed model weights to generate responses.
👉 Interview Answer
LLM inference is the process of using a trained model to generate output.
The model receives input tokens, runs forward passes, predicts the next token, and repeats this process until the response is complete.
Unlike training, inference does not update model weights.
3️⃣ High-Level Inference Pipeline
Pipeline
Client Request
→ API Gateway
→ Request Validation
→ Prompt Builder
→ Tokenizer
→ Model Router
→ Inference Scheduler
→ GPU Model Server
→ Token Generation
→ Response Streamer
→ Client
Key Components
Prompt Builder
Formats messages into model input.
Tokenizer
Converts text into token IDs.
Scheduler
Batches and schedules requests.
Model Server
Runs the model on GPUs.
Streamer
Sends output tokens back to client.
👉 Interview Answer
The LLM inference pipeline includes request validation, prompt formatting, tokenization, model routing, scheduling, GPU execution, token generation, and response streaming.
Each step affects latency, cost, throughput, and reliability.
4️⃣ Request Intake and Validation
What Happens First?
The API receives the request.
The system validates:
- API key
- Model name
- Message schema
- Token limits
- Tool definitions
- Output format
- Rate limits
- Safety constraints
Why Validate Early?
Bad requests should fail before using GPU resources.
Example
Input exceeds context window
→ Reject before inference
👉 Interview Answer
The inference pipeline starts with request validation.
The system checks authentication, model access, message schema, token limits, rate limits, and unsupported parameters before sending anything to expensive GPU infrastructure.
5️⃣ Prompt Formatting
Prompt Formatting
User messages must be converted into the model’s expected format.
Input May Include
- System instructions
- Developer instructions
- User messages
- Conversation history
- Tool definitions
- Retrieved context
- Output schema
Example
System: You are a helpful assistant.
User: Explain caching.
Assistant:
Why Important
Different models may require different prompt formats.
👉 Interview Answer
Prompt formatting converts structured messages into the exact input format expected by the model.
This may include system instructions, user messages, conversation history, tool definitions, retrieved context, and output schemas.
6️⃣ Tokenization
What Is Tokenization?
Tokenization converts text into token IDs.
"Hello world"
→ [15496, 995]
The model works on tokens, not raw text.
Tokenization Step
Formatted Prompt
→ Tokenizer
→ Input Token IDs
Why It Matters
Token count affects:
- Context length
- Latency
- Cost
- Memory usage
- Maximum output length
👉 Interview Answer
Tokenization converts formatted text into token IDs that the model can process.
Token count is important because it affects context-window limits, latency, memory usage, and cost.
7️⃣ Model Routing
Why Routing Is Needed
The platform may serve many models.
Routing decides where the request should go.
Routing Factors
- Requested model
- User permission
- Region
- Load
- Model availability
- Latency target
- Cost policy
- Fallback strategy
Example
Small summarization request
→ Route to smaller model
Complex reasoning request
→ Route to larger model
👉 Interview Answer
Model routing chooses the right model server or cluster for the request.
It considers model name, permissions, load, availability, latency target, cost policy, and fallback strategy.
8️⃣ Inference Scheduling
Why Scheduling Matters
GPU inference is expensive.
The scheduler decides:
- Which request runs next
- Which requests can be batched
- How to allocate GPU memory
- How to meet latency targets
- How to enforce fairness
Dynamic Batching
Request A
Request B
Request C
→ Batch
→ GPU
Trade-off
| Batch Size | Throughput | Latency |
|---|---|---|
| Small | Lower | Lower |
| Large | Higher | Higher |
👉 Interview Answer
Inference scheduling decides how requests are batched and executed on GPU resources.
The scheduler balances throughput, latency, fairness, priority, memory usage, and GPU utilization.
9️⃣ Prefill Phase
What Is Prefill?
Prefill processes the input prompt.
The model reads all input tokens and builds internal state.
Input Tokens
→ Model Forward Pass
→ KV Cache Created
Why Prefill Can Be Expensive
Long prompts require processing many tokens.
Examples:
- Long conversation history
- Large retrieved context
- Many tool results
- Long documents
Cost Driver
Longer input
→ More prefill compute
→ Higher latency
👉 Interview Answer
Prefill is the phase where the model processes the input prompt tokens and builds the KV cache.
Long prompts increase prefill latency and memory usage, which is why context management is important.
🔟 KV Cache
What Is KV Cache?
KV cache stores intermediate attention states from previous tokens.
It avoids recomputing the entire prompt every time a new token is generated.
Without KV Cache
Generate token 1
→ Recompute all previous tokens
Generate token 2
→ Recompute all previous tokens again
With KV Cache
Generate token 1
→ Reuse previous attention state
Generate token 2
→ Reuse cached state
Why It Matters
KV cache improves decode efficiency, but uses GPU memory.
👉 Interview Answer
KV cache stores attention key-value states from previous tokens.
It allows the model to generate new tokens without recomputing the entire context.
This improves inference speed but increases GPU memory usage.
1️⃣1️⃣ Decode Phase
What Is Decode?
Decode is the token-by-token generation phase.
Previous Tokens
→ Model predicts next token
→ Append token
→ Repeat
Decode Loop
while not stopped:
run model forward pass
sample next token
append token
stream token
Why Decode Is Slow
Tokens are generated sequentially.
The next token depends on previous tokens.
👉 Interview Answer
Decode is the phase where the model generates output tokens one at a time.
It is often latency-sensitive because generation is sequential: each token depends on previously generated tokens.
1️⃣2️⃣ Sampling
What Is Sampling?
Sampling decides which token to output next.
The model produces probabilities.
The sampler chooses a token.
Common Parameters
- Temperature
- Top-p
- Top-k
- Frequency penalty
- Presence penalty
- Stop sequences
Example
Low temperature
→ More deterministic
High temperature
→ More creative
👉 Interview Answer
Sampling converts model probability distributions into actual output tokens.
Parameters like temperature, top-p, top-k, and penalties control randomness, creativity, and repetition.
1️⃣3️⃣ Stop Conditions
When Does Generation Stop?
Generation stops when:
- Max output tokens reached
- Stop sequence appears
- End-of-text token generated
- Safety rule triggers
- Client disconnects
- Timeout reached
Example
Stop sequence = "</answer>"
→ Stop when model generates it
👉 Interview Answer
The decode loop stops when a stopping condition is reached.
This may be max token limit, stop sequence, end-of-text token, timeout, safety stop, or client cancellation.
1️⃣4️⃣ Streaming Response
Why Streaming Is Used
LLM responses may take seconds.
Streaming improves perceived latency.
Token 1 → Client
Token 2 → Client
Token 3 → Client
Streaming Flow
Generated token
→ Serialize event
→ Send to client
→ Client renders partial output
Benefits
- Better user experience
- Faster first-token visibility
- Long responses feel responsive
- Client can cancel early
👉 Interview Answer
Streaming returns tokens as they are generated.
It reduces perceived latency, improves user experience, and allows clients to display partial responses or cancel early.
1️⃣5️⃣ Latency Metrics
Important Metrics
- Time to first token
- Tokens per second
- Total latency
- Queue time
- Prefill latency
- Decode latency
- GPU utilization
- Error rate
What They Mean
Time to First Token
How long before the first output token appears.
Tokens per Second
How fast the model generates.
Total Latency
End-to-end request duration.
👉 Interview Answer
I would measure inference latency using time to first token, tokens per second, total latency, queue time, prefill latency, and decode latency.
These metrics help identify whether bottlenecks come from scheduling, input length, GPU execution, or output generation.
1️⃣6️⃣ Best Practices
Practical Rules
- Validate requests before GPU execution
- Keep prompts concise
- Use routing for model selection
- Batch requests carefully
- Monitor prefill and decode separately
- Use KV cache efficiently
- Stream long responses
- Enforce token limits
- Add fallback and timeout handling
- Track cost per request
Design Principle
Inference performance is dominated by tokens,
batching,
and GPU memory.
👉 Interview Answer
A production LLM inference pipeline should optimize prompt length, batching, scheduling, KV cache usage, streaming, token limits, and GPU utilization.
The main trade-off is latency, throughput, memory, and cost.
🧠 Staff-Level Answer Final
👉 Interview Answer Full Version
LLM inference is the process of running a trained model to generate output.
Unlike training, inference does not update model weights.
It takes input text, formats it into the model’s expected prompt structure, tokenizes it, runs the model, generates tokens, and returns the output.
A production inference pipeline starts with request intake and validation.
The system checks authentication, model access, message schema, token limits, rate limits, tool definitions, and unsupported parameters before using expensive GPU resources.
Then the prompt builder formats system, developer, user, conversation, tool, and context messages.
The tokenizer converts the formatted prompt into token IDs.
After that, the model router sends the request to the right model cluster based on model name, permissions, region, load, latency target, and cost policy.
The inference scheduler then decides how requests are batched and executed on GPUs.
This is important because GPU inference is expensive, and the scheduler must balance latency, throughput, fairness, priority, memory, and utilization.
The actual model execution has two major phases: prefill and decode.
Prefill processes all input tokens and builds the KV cache.
Long prompts increase prefill latency and memory usage.
Decode generates output tokens one by one.
Decode is often latency-sensitive because token generation is sequential.
KV cache is critical because it stores attention states from previous tokens, allowing the model to avoid recomputing the full context during generation.
Sampling chooses the next token from the model’s probability distribution using parameters like temperature, top-p, top-k, penalties, and stop sequences.
Generation stops when max tokens, stop sequence, end token, timeout, safety rule, or client cancellation occurs.
Streaming sends tokens back as they are generated, which improves perceived latency and user experience.
The key metrics are time to first token, tokens per second, total latency, queue time, prefill latency, decode latency, GPU utilization, and cost per request.
The core engineering trade-off is balancing latency, throughput, memory, and cost.
⭐ Final Insight
LLM Inference Pipeline 不是简单的:
“把 prompt 丢给模型,然后拿 response”
真正的 pipeline 是:
Request Validation
- Prompt Formatting
- Tokenization
- Model Routing
- Scheduling
- Prefill
- KV Cache
- Decode
- Sampling
- Streaming
- Monitoring。
其中最重要的性能因素是:
tokens、batching、GPU memory。
最重要的一句话:
Inference performance is dominated by tokens, batching, and GPU memory.
中文部分
🎯 LLM Inference Pipeline Explained
1️⃣ 核心框架
讨论 LLM Inference Pipeline 时,我通常从这些方面分析:
- Request intake and validation
- Prompt formatting and tokenization
- Model routing and scheduling
- Prefill phase
- Decode phase
- Sampling and stopping
- Streaming response
- 核心权衡:latency vs throughput vs cost
2️⃣ 什么是 LLM Inference?
LLM inference 是使用训练好的 model 为 user request 生成 output 的过程。
User Input
→ Tokens
→ Model Forward Pass
→ Next Token Prediction
→ Generated Output
Important Point
Inference 不是 training。
Training 会更新 model weights。
Inference 使用固定 model weights 生成 responses。
👉 面试回答
LLM inference 是使用 trained model 生成 output 的过程。
Model 接收 input tokens, 执行 forward passes, 预测 next token, 并重复这个过程直到 response 完成。
不同于 training, inference 不会更新 model weights。
3️⃣ High-Level Inference Pipeline
Pipeline
Client Request
→ API Gateway
→ Request Validation
→ Prompt Builder
→ Tokenizer
→ Model Router
→ Inference Scheduler
→ GPU Model Server
→ Token Generation
→ Response Streamer
→ Client
Key Components
Prompt Builder
把 messages 格式化成 model input。
Tokenizer
把 text 转换成 token IDs。
Scheduler
批处理并调度 requests。
Model Server
在 GPUs 上运行 model。
Streamer
把 output tokens 返回给 client。
👉 面试回答
LLM inference pipeline 包括 request validation、 prompt formatting、tokenization、 model routing、scheduling、 GPU execution、token generation 和 response streaming。
每个步骤都会影响 latency、cost、 throughput 和 reliability。
4️⃣ Request Intake and Validation
最开始发生什么?
API 接收 request。
系统验证:
- API key
- Model name
- Message schema
- Token limits
- Tool definitions
- Output format
- Rate limits
- Safety constraints
为什么要早验证?
Bad requests 应该在使用 GPU 前失败。
Example
Input exceeds context window
→ Reject before inference
👉 面试回答
Inference pipeline 从 request validation 开始。
系统会在请求进入昂贵 GPU infrastructure 前, 检查 authentication、model access、 message schema、token limits、 rate limits 和 unsupported parameters。
5️⃣ Prompt Formatting
Prompt Formatting
User messages 必须转换成 model 期望的格式。
Input May Include
- System instructions
- Developer instructions
- User messages
- Conversation history
- Tool definitions
- Retrieved context
- Output schema
Example
System: You are a helpful assistant.
User: Explain caching.
Assistant:
为什么重要?
不同 models 可能需要不同 prompt formats。
👉 面试回答
Prompt formatting 会把 structured messages 转换成 model 需要的 input format。
这可能包括 system instructions、 user messages、conversation history、 tool definitions、retrieved context 和 output schemas。
6️⃣ Tokenization
什么是 Tokenization?
Tokenization 把 text 转换成 token IDs。
"Hello world"
→ [15496, 995]
Model 处理的是 tokens, 不是 raw text。
Tokenization Step
Formatted Prompt
→ Tokenizer
→ Input Token IDs
为什么重要?
Token count 影响:
- Context length
- Latency
- Cost
- Memory usage
- Maximum output length
👉 面试回答
Tokenization 会把 formatted text 转换成 model 可以处理的 token IDs。
Token count 很重要, 因为它影响 context-window limits、 latency、memory usage 和 cost。
7️⃣ Model Routing
为什么需要 Routing?
平台可能同时服务很多 models。
Routing 决定 request 发送到哪里。
Routing Factors
- Requested model
- User permission
- Region
- Load
- Model availability
- Latency target
- Cost policy
- Fallback strategy
Example
Small summarization request
→ Route to smaller model
Complex reasoning request
→ Route to larger model
👉 面试回答
Model routing 选择正确的 model server 或 cluster 来处理 request。
它会考虑 model name、permissions、 load、availability、latency target、 cost policy 和 fallback strategy。
8️⃣ Inference Scheduling
为什么 Scheduling 重要?
GPU inference 很昂贵。
Scheduler 决定:
- 哪个 request 先执行
- 哪些 requests 可以 batch
- 如何分配 GPU memory
- 如何满足 latency targets
- 如何保证 fairness
Dynamic Batching
Request A
Request B
Request C
→ Batch
→ GPU
Trade-off
| Batch Size | Throughput | Latency |
|---|---|---|
| Small | Lower | Lower |
| Large | Higher | Higher |
👉 面试回答
Inference scheduling 决定 requests 如何 batch 并在 GPU resources 上执行。
Scheduler 需要平衡 throughput、latency、 fairness、priority、memory usage 和 GPU utilization。
9️⃣ Prefill Phase
什么是 Prefill?
Prefill 负责处理 input prompt。
Model 读取所有 input tokens, 并构建 internal state。
Input Tokens
→ Model Forward Pass
→ KV Cache Created
为什么 Prefill 可能很贵?
Long prompts 需要处理很多 tokens。
Examples:
- Long conversation history
- Large retrieved context
- Many tool results
- Long documents
Cost Driver
Longer input
→ More prefill compute
→ Higher latency
👉 面试回答
Prefill 是 model 处理 input prompt tokens 并构建 KV cache 的阶段。
Long prompts 会增加 prefill latency 和 memory usage, 所以 context management 很重要。
🔟 KV Cache
什么是 KV Cache?
KV cache 存储 previous tokens 的 intermediate attention states。
它避免每生成一个新 token 都重新计算整个 prompt。
Without KV Cache
Generate token 1
→ Recompute all previous tokens
Generate token 2
→ Recompute all previous tokens again
With KV Cache
Generate token 1
→ Reuse previous attention state
Generate token 2
→ Reuse cached state
为什么重要?
KV cache 提升 decode efficiency, 但会占用 GPU memory。
👉 面试回答
KV cache 存储 previous tokens 的 attention key-value states。
它让 model 在生成新 tokens 时, 不需要重新计算整个 context。
这提升 inference speed, 但会增加 GPU memory usage。
1️⃣1️⃣ Decode Phase
什么是 Decode?
Decode 是 token-by-token generation phase。
Previous Tokens
→ Model predicts next token
→ Append token
→ Repeat
Decode Loop
while not stopped:
run model forward pass
sample next token
append token
stream token
为什么 Decode 慢?
Tokens 是 sequentially generated。
下一个 token 依赖之前的 tokens。
👉 面试回答
Decode 是 model 一个 token 一个 token 生成 output 的阶段。
它通常 latency-sensitive, 因为 generation 是 sequential: 每个 token 依赖之前生成的 tokens。
1️⃣2️⃣ Sampling
什么是 Sampling?
Sampling 决定下一个输出哪个 token。
Model 产生 probabilities。
Sampler 选择 token。
Common Parameters
- Temperature
- Top-p
- Top-k
- Frequency penalty
- Presence penalty
- Stop sequences
Example
Low temperature
→ More deterministic
High temperature
→ More creative
👉 面试回答
Sampling 把 model probability distribution 转换成实际 output tokens。
Temperature、top-p、top-k 和 penalties 等参数 控制 randomness、creativity 和 repetition。
1️⃣3️⃣ Stop Conditions
Generation 什么时候停止?
Generation stops when:
- Max output tokens reached
- Stop sequence appears
- End-of-text token generated
- Safety rule triggers
- Client disconnects
- Timeout reached
Example
Stop sequence = "</answer>"
→ Stop when model generates it
👉 面试回答
Decode loop 会在 stopping condition 达成时停止。
可能是 max token limit、stop sequence、 end-of-text token、timeout、 safety stop 或 client cancellation。
1️⃣4️⃣ Streaming Response
为什么使用 Streaming?
LLM responses 可能需要几秒。
Streaming 改善 perceived latency。
Token 1 → Client
Token 2 → Client
Token 3 → Client
Streaming Flow
Generated token
→ Serialize event
→ Send to client
→ Client renders partial output
Benefits
- Better user experience
- Faster first-token visibility
- Long responses feel responsive
- Client can cancel early
👉 面试回答
Streaming 会在 tokens 生成时返回。
它降低 perceived latency, 改善 user experience, 并允许 clients 显示 partial responses 或提前 cancel。
1️⃣5️⃣ Latency Metrics
Important Metrics
- Time to first token
- Tokens per second
- Total latency
- Queue time
- Prefill latency
- Decode latency
- GPU utilization
- Error rate
What They Mean
Time to First Token
第一个 output token 出现前的时间。
Tokens per Second
Model 生成速度。
Total Latency
End-to-end request duration。
👉 面试回答
我会用 time to first token、 tokens per second、total latency、 queue time、prefill latency 和 decode latency 衡量 inference latency。
这些 metrics 可以帮助判断 bottleneck 来自 scheduling、input length、 GPU execution 还是 output generation。
1️⃣6️⃣ Best Practices
Practical Rules
- Validate requests before GPU execution
- Keep prompts concise
- Use routing for model selection
- Batch requests carefully
- Monitor prefill and decode separately
- Use KV cache efficiently
- Stream long responses
- Enforce token limits
- Add fallback and timeout handling
- Track cost per request
Design Principle
Inference performance is dominated by tokens,
batching,
and GPU memory.
👉 面试回答
Production LLM inference pipeline 应优化 prompt length、batching、 scheduling、KV cache usage、 streaming、token limits 和 GPU utilization。
核心权衡是 latency、throughput、 memory 和 cost。
🧠 Staff-Level Answer Final
👉 面试回答完整版本
LLM inference 是使用 trained model 生成 output 的过程。
不同于 training, inference 不会更新 model weights。
它接收 input text, 将其格式化成 model 需要的 prompt structure, tokenize, 运行 model, 生成 tokens, 并返回 output。
Production inference pipeline 从 request intake 和 validation 开始。
系统在使用 expensive GPU resources 前, 会检查 authentication、model access、 message schema、token limits、 rate limits、tool definitions 和 unsupported parameters。
然后 prompt builder 格式化 system、developer、user、 conversation、tool 和 context messages。
Tokenizer 把 formatted prompt 转换成 token IDs。
之后 model router 根据 model name、permissions、 region、load、latency target 和 cost policy, 把 request 发送到正确 model cluster。
Inference scheduler 决定 requests 如何 batch 并在 GPUs 上执行。
这很重要, 因为 GPU inference 昂贵, scheduler 必须平衡 latency、throughput、 fairness、priority、memory 和 utilization。
真正的 model execution 有两个主要阶段: prefill 和 decode。
Prefill 处理所有 input tokens, 并构建 KV cache。
Long prompts 会增加 prefill latency 和 memory usage。
Decode 一个 token 一个 token生成 output。
Decode 通常 latency-sensitive, 因为 token generation 是 sequential。
KV cache 很关键, 因为它存储 previous tokens 的 attention states, 让 model 在 generation 时 不需要重新计算完整 context。
Sampling 使用 temperature、top-p、top-k、 penalties 和 stop sequences 从 model probability distribution 中选择 next token。
Generation 会在 max tokens、stop sequence、 end token、timeout、safety rule 或 client cancellation 时停止。
Streaming 会在 tokens 生成时返回, 改善 perceived latency 和 user experience。
关键 metrics 包括 time to first token、 tokens per second、total latency、 queue time、prefill latency、 decode latency、GPU utilization 和 cost per request。
核心 engineering trade-off 是: latency、throughput、memory 和 cost。
⭐ Final Insight
LLM Inference Pipeline 不是简单的:
“把 prompt 丢给模型,然后拿 response”
真正的 pipeline 是:
Request Validation
- Prompt Formatting
- Tokenization
- Model Routing
- Scheduling
- Prefill
- KV Cache
- Decode
- Sampling
- Streaming
- Monitoring。
其中最重要的性能因素是:
tokens、batching、GPU memory。
最重要的一句话:
Inference performance is dominated by tokens, batching, and GPU memory.
Implement