aaa-llm LLM Infrastructure ·

🎯 LLM Inference Pipeline Explained

1️⃣ Core Framework

When discussing the LLM Inference Pipeline, I frame it as:

Request intake and validation
Prompt formatting and tokenization
Model routing and scheduling
Prefill phase
Decode phase
Sampling and stopping
Streaming response
Trade-offs: latency vs throughput vs cost

2️⃣ What Is LLM Inference?

LLM inference is the process of running a trained model to generate output for a user request.

User Input
→ Tokens
→ Model Forward Pass
→ Next Token Prediction
→ Generated Output

Important Point

Inference is not training.

Training updates model weights.

Inference uses fixed model weights to generate responses.

👉 Interview Answer

LLM inference is the process of using a trained model to generate output.

The model receives input tokens, runs forward passes, predicts the next token, and repeats this process until the response is complete.

Unlike training, inference does not update model weights.

3️⃣ High-Level Inference Pipeline

Pipeline

Client Request
→ API Gateway
→ Request Validation
→ Prompt Builder
→ Tokenizer
→ Model Router
→ Inference Scheduler
→ GPU Model Server
→ Token Generation
→ Response Streamer
→ Client

Key Components

Prompt Builder

Formats messages into model input.

Tokenizer

Converts text into token IDs.

Scheduler

Batches and schedules requests.

Model Server

Runs the model on GPUs.

Streamer

Sends output tokens back to client.

👉 Interview Answer

The LLM inference pipeline includes request validation, prompt formatting, tokenization, model routing, scheduling, GPU execution, token generation, and response streaming.

Each step affects latency, cost, throughput, and reliability.

4️⃣ Request Intake and Validation

What Happens First?

The API receives the request.

The system validates:

API key
Model name
Message schema
Token limits
Tool definitions
Output format
Rate limits
Safety constraints

Why Validate Early?

Bad requests should fail before using GPU resources.

Example

Input exceeds context window
→ Reject before inference

👉 Interview Answer

The inference pipeline starts with request validation.

The system checks authentication, model access, message schema, token limits, rate limits, and unsupported parameters before sending anything to expensive GPU infrastructure.

5️⃣ Prompt Formatting

Prompt Formatting

User messages must be converted into the model’s expected format.

Input May Include

System instructions
Developer instructions
User messages
Conversation history
Tool definitions
Retrieved context
Output schema

Example

System: You are a helpful assistant.
User: Explain caching.
Assistant:

Why Important

Different models may require different prompt formats.

👉 Interview Answer

Prompt formatting converts structured messages into the exact input format expected by the model.

This may include system instructions, user messages, conversation history, tool definitions, retrieved context, and output schemas.

6️⃣ Tokenization

What Is Tokenization?

Tokenization converts text into token IDs.

"Hello world"
→ [15496, 995]

The model works on tokens, not raw text.

Tokenization Step

Formatted Prompt
→ Tokenizer
→ Input Token IDs

Why It Matters

Token count affects:

Context length
Latency
Cost
Memory usage
Maximum output length

👉 Interview Answer

Tokenization converts formatted text into token IDs that the model can process.

Token count is important because it affects context-window limits, latency, memory usage, and cost.

7️⃣ Model Routing

Why Routing Is Needed

The platform may serve many models.

Routing decides where the request should go.

Routing Factors

Requested model
User permission
Region
Load
Model availability
Latency target
Cost policy
Fallback strategy

Example

Small summarization request
→ Route to smaller model

Complex reasoning request
→ Route to larger model

👉 Interview Answer

Model routing chooses the right model server or cluster for the request.

It considers model name, permissions, load, availability, latency target, cost policy, and fallback strategy.

8️⃣ Inference Scheduling

Why Scheduling Matters

GPU inference is expensive.

The scheduler decides:

Which request runs next
Which requests can be batched
How to allocate GPU memory
How to meet latency targets
How to enforce fairness

Dynamic Batching

Request A
Request B
Request C
→ Batch
→ GPU

Trade-off

Batch Size	Throughput	Latency
Small	Lower	Lower
Large	Higher	Higher

👉 Interview Answer

Inference scheduling decides how requests are batched and executed on GPU resources.

The scheduler balances throughput, latency, fairness, priority, memory usage, and GPU utilization.

9️⃣ Prefill Phase

What Is Prefill?

Prefill processes the input prompt.

The model reads all input tokens and builds internal state.

Input Tokens
→ Model Forward Pass
→ KV Cache Created

Why Prefill Can Be Expensive

Long prompts require processing many tokens.

Examples:

Long conversation history
Large retrieved context
Many tool results
Long documents

Cost Driver

Longer input
→ More prefill compute
→ Higher latency

👉 Interview Answer

Prefill is the phase where the model processes the input prompt tokens and builds the KV cache.

Long prompts increase prefill latency and memory usage, which is why context management is important.

🔟 KV Cache

What Is KV Cache?

KV cache stores intermediate attention states from previous tokens.

It avoids recomputing the entire prompt every time a new token is generated.

Without KV Cache

Generate token 1
→ Recompute all previous tokens

Generate token 2
→ Recompute all previous tokens again

With KV Cache

Generate token 1
→ Reuse previous attention state

Generate token 2
→ Reuse cached state

Why It Matters

KV cache improves decode efficiency, but uses GPU memory.

👉 Interview Answer

KV cache stores attention key-value states from previous tokens.

It allows the model to generate new tokens without recomputing the entire context.

This improves inference speed but increases GPU memory usage.

1️⃣1️⃣ Decode Phase

What Is Decode?

Decode is the token-by-token generation phase.

Previous Tokens
→ Model predicts next token
→ Append token
→ Repeat

Decode Loop

while not stopped:
    run model forward pass
    sample next token
    append token
    stream token

Why Decode Is Slow

Tokens are generated sequentially.

The next token depends on previous tokens.

👉 Interview Answer

Decode is the phase where the model generates output tokens one at a time.

It is often latency-sensitive because generation is sequential: each token depends on previously generated tokens.

1️⃣2️⃣ Sampling

What Is Sampling?

Sampling decides which token to output next.

The model produces probabilities.

The sampler chooses a token.

Common Parameters

Temperature
Top-p
Top-k
Frequency penalty
Presence penalty
Stop sequences

Example

Low temperature
→ More deterministic

High temperature
→ More creative

👉 Interview Answer

Sampling converts model probability distributions into actual output tokens.

Parameters like temperature, top-p, top-k, and penalties control randomness, creativity, and repetition.

1️⃣3️⃣ Stop Conditions

When Does Generation Stop?

Generation stops when:

Max output tokens reached
Stop sequence appears
End-of-text token generated
Safety rule triggers
Client disconnects
Timeout reached

Example

Stop sequence = "</answer>"
→ Stop when model generates it

👉 Interview Answer

The decode loop stops when a stopping condition is reached.

This may be max token limit, stop sequence, end-of-text token, timeout, safety stop, or client cancellation.

1️⃣4️⃣ Streaming Response

Why Streaming Is Used

LLM responses may take seconds.

Streaming improves perceived latency.

Token 1 → Client
Token 2 → Client
Token 3 → Client

Streaming Flow

Generated token
→ Serialize event
→ Send to client
→ Client renders partial output

Benefits

Better user experience
Faster first-token visibility
Long responses feel responsive
Client can cancel early

👉 Interview Answer

Streaming returns tokens as they are generated.

It reduces perceived latency, improves user experience, and allows clients to display partial responses or cancel early.

1️⃣5️⃣ Latency Metrics

Important Metrics

Time to first token
Tokens per second
Total latency
Queue time
Prefill latency
Decode latency
GPU utilization
Error rate

What They Mean

Time to First Token

How long before the first output token appears.

Tokens per Second

How fast the model generates.

Total Latency

End-to-end request duration.

👉 Interview Answer

I would measure inference latency using time to first token, tokens per second, total latency, queue time, prefill latency, and decode latency.

These metrics help identify whether bottlenecks come from scheduling, input length, GPU execution, or output generation.

1️⃣6️⃣ Best Practices

Practical Rules

Validate requests before GPU execution
Keep prompts concise
Use routing for model selection
Batch requests carefully
Monitor prefill and decode separately
Use KV cache efficiently
Stream long responses
Enforce token limits
Add fallback and timeout handling
Track cost per request

Design Principle

Inference performance is dominated by tokens,
batching,
and GPU memory.

👉 Interview Answer

A production LLM inference pipeline should optimize prompt length, batching, scheduling, KV cache usage, streaming, token limits, and GPU utilization.

The main trade-off is latency, throughput, memory, and cost.

🧠 Staff-Level Answer Final

👉 Interview Answer Full Version

LLM inference is the process of running a trained model to generate output.

Unlike training, inference does not update model weights.

It takes input text, formats it into the model’s expected prompt structure, tokenizes it, runs the model, generates tokens, and returns the output.

A production inference pipeline starts with request intake and validation.

The system checks authentication, model access, message schema, token limits, rate limits, tool definitions, and unsupported parameters before using expensive GPU resources.

Then the prompt builder formats system, developer, user, conversation, tool, and context messages.

The tokenizer converts the formatted prompt into token IDs.

After that, the model router sends the request to the right model cluster based on model name, permissions, region, load, latency target, and cost policy.

The inference scheduler then decides how requests are batched and executed on GPUs.

This is important because GPU inference is expensive, and the scheduler must balance latency, throughput, fairness, priority, memory, and utilization.

The actual model execution has two major phases: prefill and decode.

Prefill processes all input tokens and builds the KV cache.

Long prompts increase prefill latency and memory usage.

Decode generates output tokens one by one.

Decode is often latency-sensitive because token generation is sequential.

KV cache is critical because it stores attention states from previous tokens, allowing the model to avoid recomputing the full context during generation.

Sampling chooses the next token from the model’s probability distribution using parameters like temperature, top-p, top-k, penalties, and stop sequences.

Generation stops when max tokens, stop sequence, end token, timeout, safety rule, or client cancellation occurs.

Streaming sends tokens back as they are generated, which improves perceived latency and user experience.

The key metrics are time to first token, tokens per second, total latency, queue time, prefill latency, decode latency, GPU utilization, and cost per request.

The core engineering trade-off is balancing latency, throughput, memory, and cost.

⭐ Final Insight

LLM Inference Pipeline 不是简单的：

“把 prompt 丢给模型，然后拿 response”

真正的 pipeline 是：

Request Validation

Prompt Formatting

Tokenization

Model Routing

Scheduling

Prefill

KV Cache

Decode

Sampling

Streaming

Monitoring。

其中最重要的性能因素是：

tokens、batching、GPU memory。

最重要的一句话：

Inference performance is dominated by tokens, batching, and GPU memory.

中文部分

🎯 LLM Inference Pipeline Explained

1️⃣ 核心框架

讨论 LLM Inference Pipeline 时，我通常从这些方面分析：

Request intake and validation
Prompt formatting and tokenization
Model routing and scheduling
Prefill phase
Decode phase
Sampling and stopping
Streaming response
核心权衡：latency vs throughput vs cost

2️⃣ 什么是 LLM Inference？

LLM inference 是使用训练好的 model 为 user request 生成 output 的过程。

User Input
→ Tokens
→ Model Forward Pass
→ Next Token Prediction
→ Generated Output

Important Point

Inference 不是 training。

Training 会更新 model weights。

Inference 使用固定 model weights 生成 responses。

👉 面试回答

LLM inference 是使用 trained model 生成 output 的过程。

Model 接收 input tokens，执行 forward passes，预测 next token，并重复这个过程直到 response 完成。

不同于 training， inference 不会更新 model weights。

3️⃣ High-Level Inference Pipeline

Pipeline

Client Request
→ API Gateway
→ Request Validation
→ Prompt Builder
→ Tokenizer
→ Model Router
→ Inference Scheduler
→ GPU Model Server
→ Token Generation
→ Response Streamer
→ Client

Key Components

Prompt Builder

把 messages 格式化成 model input。

Tokenizer

把 text 转换成 token IDs。

Scheduler

批处理并调度 requests。

Model Server

在 GPUs 上运行 model。

Streamer

把 output tokens 返回给 client。

👉 面试回答

LLM inference pipeline 包括 request validation、 prompt formatting、tokenization、 model routing、scheduling、 GPU execution、token generation 和 response streaming。

每个步骤都会影响 latency、cost、 throughput 和 reliability。

4️⃣ Request Intake and Validation

最开始发生什么？

API 接收 request。

系统验证：

API key
Model name
Message schema
Token limits
Tool definitions
Output format
Rate limits
Safety constraints

为什么要早验证？

Bad requests 应该在使用 GPU 前失败。

Example

Input exceeds context window
→ Reject before inference

👉 面试回答

Inference pipeline 从 request validation 开始。

系统会在请求进入昂贵 GPU infrastructure 前，检查 authentication、model access、 message schema、token limits、 rate limits 和 unsupported parameters。

5️⃣ Prompt Formatting

Prompt Formatting

User messages 必须转换成 model 期望的格式。

Input May Include

System instructions
Developer instructions
User messages
Conversation history
Tool definitions
Retrieved context
Output schema

Example

System: You are a helpful assistant.
User: Explain caching.
Assistant:

为什么重要？

不同 models 可能需要不同 prompt formats。

👉 面试回答

Prompt formatting 会把 structured messages 转换成 model 需要的 input format。

这可能包括 system instructions、 user messages、conversation history、 tool definitions、retrieved context 和 output schemas。

6️⃣ Tokenization

什么是 Tokenization？

Tokenization 把 text 转换成 token IDs。

"Hello world"
→ [15496, 995]

Model 处理的是 tokens，不是 raw text。

Tokenization Step

Formatted Prompt
→ Tokenizer
→ Input Token IDs

为什么重要？

Token count 影响：

Context length
Latency
Cost
Memory usage
Maximum output length

👉 面试回答

Tokenization 会把 formatted text 转换成 model 可以处理的 token IDs。

Token count 很重要，因为它影响 context-window limits、 latency、memory usage 和 cost。

7️⃣ Model Routing

为什么需要 Routing？

平台可能同时服务很多 models。

Routing 决定 request 发送到哪里。

Routing Factors

Requested model
User permission
Region
Load
Model availability
Latency target
Cost policy
Fallback strategy

Example

Small summarization request
→ Route to smaller model

Complex reasoning request
→ Route to larger model

👉 面试回答

Model routing 选择正确的 model server 或 cluster 来处理 request。

它会考虑 model name、permissions、 load、availability、latency target、 cost policy 和 fallback strategy。

8️⃣ Inference Scheduling

为什么 Scheduling 重要？

GPU inference 很昂贵。

Scheduler 决定：

哪个 request 先执行
哪些 requests 可以 batch
如何分配 GPU memory
如何满足 latency targets
如何保证 fairness

Dynamic Batching

Request A
Request B
Request C
→ Batch
→ GPU

Trade-off

Batch Size	Throughput	Latency
Small	Lower	Lower
Large	Higher	Higher

👉 面试回答

Inference scheduling 决定 requests 如何 batch 并在 GPU resources 上执行。

Scheduler 需要平衡 throughput、latency、 fairness、priority、memory usage 和 GPU utilization。

9️⃣ Prefill Phase

什么是 Prefill？

Prefill 负责处理 input prompt。

Model 读取所有 input tokens，并构建 internal state。

Input Tokens
→ Model Forward Pass
→ KV Cache Created

为什么 Prefill 可能很贵？

Long prompts 需要处理很多 tokens。

Examples:

Long conversation history
Large retrieved context
Many tool results
Long documents

Cost Driver

Longer input
→ More prefill compute
→ Higher latency

👉 面试回答

Prefill 是 model 处理 input prompt tokens 并构建 KV cache 的阶段。

Long prompts 会增加 prefill latency 和 memory usage，所以 context management 很重要。

🔟 KV Cache

什么是 KV Cache？

KV cache 存储 previous tokens 的 intermediate attention states。

它避免每生成一个新 token 都重新计算整个 prompt。

Without KV Cache

Generate token 1
→ Recompute all previous tokens

Generate token 2
→ Recompute all previous tokens again

With KV Cache

Generate token 1
→ Reuse previous attention state

Generate token 2
→ Reuse cached state

为什么重要？

KV cache 提升 decode efficiency，但会占用 GPU memory。

👉 面试回答

KV cache 存储 previous tokens 的 attention key-value states。

它让 model 在生成新 tokens 时，不需要重新计算整个 context。

这提升 inference speed，但会增加 GPU memory usage。

1️⃣1️⃣ Decode Phase

什么是 Decode？

Decode 是 token-by-token generation phase。

Previous Tokens
→ Model predicts next token
→ Append token
→ Repeat

Decode Loop

while not stopped:
    run model forward pass
    sample next token
    append token
    stream token

为什么 Decode 慢？

Tokens 是 sequentially generated。

下一个 token 依赖之前的 tokens。

👉 面试回答

Decode 是 model 一个 token 一个 token 生成 output 的阶段。

它通常 latency-sensitive，因为 generation 是 sequential：每个 token 依赖之前生成的 tokens。

1️⃣2️⃣ Sampling

什么是 Sampling？

Sampling 决定下一个输出哪个 token。

Model 产生 probabilities。

Sampler 选择 token。

Common Parameters

Temperature
Top-p
Top-k
Frequency penalty
Presence penalty
Stop sequences

Example

Low temperature
→ More deterministic

High temperature
→ More creative

👉 面试回答

Sampling 把 model probability distribution 转换成实际 output tokens。

Temperature、top-p、top-k 和 penalties 等参数控制 randomness、creativity 和 repetition。

1️⃣3️⃣ Stop Conditions

Generation 什么时候停止？

Generation stops when:

Max output tokens reached
Stop sequence appears
End-of-text token generated
Safety rule triggers
Client disconnects
Timeout reached

Example

Stop sequence = "</answer>"
→ Stop when model generates it

👉 面试回答

Decode loop 会在 stopping condition 达成时停止。

可能是 max token limit、stop sequence、 end-of-text token、timeout、 safety stop 或 client cancellation。

1️⃣4️⃣ Streaming Response

为什么使用 Streaming？

LLM responses 可能需要几秒。

Streaming 改善 perceived latency。

Token 1 → Client
Token 2 → Client
Token 3 → Client

Streaming Flow

Generated token
→ Serialize event
→ Send to client
→ Client renders partial output

Benefits

Better user experience
Faster first-token visibility
Long responses feel responsive
Client can cancel early

👉 面试回答

Streaming 会在 tokens 生成时返回。

它降低 perceived latency，改善 user experience，并允许 clients 显示 partial responses 或提前 cancel。

1️⃣5️⃣ Latency Metrics

Important Metrics

Time to first token
Tokens per second
Total latency
Queue time
Prefill latency
Decode latency
GPU utilization
Error rate

What They Mean

Time to First Token

第一个 output token 出现前的时间。

Tokens per Second

Model 生成速度。

Total Latency

End-to-end request duration。

👉 面试回答

我会用 time to first token、 tokens per second、total latency、 queue time、prefill latency 和 decode latency 衡量 inference latency。

这些 metrics 可以帮助判断 bottleneck 来自 scheduling、input length、 GPU execution 还是 output generation。

1️⃣6️⃣ Best Practices

Practical Rules

Validate requests before GPU execution
Keep prompts concise
Use routing for model selection
Batch requests carefully
Monitor prefill and decode separately
Use KV cache efficiently
Stream long responses
Enforce token limits
Add fallback and timeout handling
Track cost per request

Design Principle

Inference performance is dominated by tokens,
batching,
and GPU memory.

👉 面试回答

Production LLM inference pipeline 应优化 prompt length、batching、 scheduling、KV cache usage、 streaming、token limits 和 GPU utilization。

核心权衡是 latency、throughput、 memory 和 cost。

🧠 Staff-Level Answer Final

👉 面试回答完整版本

LLM inference 是使用 trained model 生成 output 的过程。

不同于 training， inference 不会更新 model weights。

它接收 input text，将其格式化成 model 需要的 prompt structure， tokenize，运行 model，生成 tokens，并返回 output。

Production inference pipeline 从 request intake 和 validation 开始。

系统在使用 expensive GPU resources 前，会检查 authentication、model access、 message schema、token limits、 rate limits、tool definitions 和 unsupported parameters。

然后 prompt builder 格式化 system、developer、user、 conversation、tool 和 context messages。

Tokenizer 把 formatted prompt 转换成 token IDs。

之后 model router 根据 model name、permissions、 region、load、latency target 和 cost policy，把 request 发送到正确 model cluster。

Inference scheduler 决定 requests 如何 batch 并在 GPUs 上执行。

这很重要，因为 GPU inference 昂贵， scheduler 必须平衡 latency、throughput、 fairness、priority、memory 和 utilization。

真正的 model execution 有两个主要阶段： prefill 和 decode。

Prefill 处理所有 input tokens，并构建 KV cache。

Long prompts 会增加 prefill latency 和 memory usage。

Decode 一个 token 一个 token生成 output。

Decode 通常 latency-sensitive，因为 token generation 是 sequential。

KV cache 很关键，因为它存储 previous tokens 的 attention states，让 model 在 generation 时不需要重新计算完整 context。

Sampling 使用 temperature、top-p、top-k、 penalties 和 stop sequences 从 model probability distribution 中选择 next token。

Generation 会在 max tokens、stop sequence、 end token、timeout、safety rule 或 client cancellation 时停止。

Streaming 会在 tokens 生成时返回，改善 perceived latency 和 user experience。

关键 metrics 包括 time to first token、 tokens per second、total latency、 queue time、prefill latency、 decode latency、GPU utilization 和 cost per request。

核心 engineering trade-off 是： latency、throughput、memory 和 cost。

⭐ Final Insight

LLM Inference Pipeline 不是简单的：

“把 prompt 丢给模型，然后拿 response”

真正的 pipeline 是：

Request Validation

Prompt Formatting

Tokenization

Model Routing

Scheduling

Prefill

KV Cache

Decode

Sampling

Streaming

Monitoring。

其中最重要的性能因素是：

tokens、batching、GPU memory。

最重要的一句话：

Inference performance is dominated by tokens, batching, and GPU memory.