🎯 Load Balancing for GPU-based Systems

1️⃣ Core Framework

When discussing GPU-based load balancing, I frame it as:

Why GPU load balancing is different
Request routing
Model-aware routing
GPU capacity and memory constraints
Queue-based scheduling
Dynamic batching
Failure handling and fallback
Trade-offs: latency vs throughput vs utilization

2️⃣ Why GPU Load Balancing Is Different

Traditional backend load balancing usually balances HTTP requests.

GPU systems must balance:

GPU memory
Model size
Batch size
Token length
Queue time
KV cache usage
Request priority
Inference latency
Throughput

Key Difference

CPU service:
Request cost is often similar.

GPU inference:
Request cost can vary massively by tokens, model, and output length.

👉 Interview Answer

GPU load balancing is harder than normal service load balancing because requests have very different costs.

A short prompt on a small model is cheap, while a long-context request on a large model can consume much more GPU memory and time.

The load balancer must understand model, token count, queue depth, GPU memory, and latency targets.

3️⃣ High-Level Architecture

Architecture

Client
→ API Gateway
→ Request Validator
→ Model Router
→ GPU Load Balancer
→ Inference Queue
→ Scheduler
→ GPU Workers
→ Response Streamer

Core Components

Model Router

Chooses model family or cluster.

GPU Load Balancer

Chooses the best GPU worker or queue.

Scheduler

Batches and schedules inference.

GPU Worker

Runs model inference.

👉 Interview Answer

A GPU inference platform usually has an API gateway, model router, GPU-aware load balancer, inference queue, scheduler, GPU workers, and response streamer.

The load balancer should route requests based on model availability, queue depth, GPU memory, and expected token cost.

4️⃣ Basic Load Balancing Strategies

Round Robin

Request 1 → Worker A
Request 2 → Worker B
Request 3 → Worker C

Simple, but not ideal for GPU inference.

Least Connections

Routes to the worker with the fewest active requests.

Better, but still incomplete.

Least Queue Time

Routes to the worker with the shortest expected queue wait time.

More useful for inference systems.

GPU-aware Routing

Routes based on:

GPU memory
Current batch
Queue depth
Expected token cost
Model loaded
Worker health

👉 Interview Answer

Basic strategies like round robin are usually not enough for GPU systems.

Production inference systems need GPU-aware routing that considers queue depth, model placement, GPU memory, expected token count, and worker health.
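
A minimal sketch of GPU-aware worker selection, assuming each worker reports the fields listed above. The `Worker` fields, the `pick_worker` function, and the "earliest estimated finish" score are hypothetical choices for illustration, not the API of any particular serving framework.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Worker:
    name: str
    model: str              # model currently loaded on this worker
    healthy: bool
    free_memory_gb: float   # GPU memory still available for new requests
    queued_tokens: int      # estimated tokens already waiting on this worker
    tokens_per_s: float     # measured throughput of this worker

def pick_worker(workers: List[Worker], model: str,
                est_tokens: int, est_memory_gb: float) -> Optional[Worker]:
    """Return the eligible worker with the earliest estimated finish time."""
    eligible = [
        w for w in workers
        if w.healthy and w.model == model and w.free_memory_gb >= est_memory_gb
    ]
    if not eligible:
        return None  # caller should queue, shed, or fall back
    # Estimated seconds until this request would finish on each worker.
    return min(eligible, key=lambda w: (w.queued_tokens + est_tokens) / w.tokens_per_s)

workers = [
    Worker("a", "large-model", True, 10.0, 40_000, 5_000.0),
    Worker("b", "large-model", True, 12.0, 5_000, 2_500.0),
    Worker("c", "large-model", False, 30.0, 0, 5_000.0),   # unhealthy
    Worker("d", "small-model", True, 30.0, 0, 9_000.0),    # wrong model
]
chosen = pick_worker(workers, "large-model", est_tokens=2_000, est_memory_gb=4.0)
print(chosen.name if chosen else "no eligible worker")
```

Worker "b" wins here even though it is slower, because its backlog is much smaller. Round robin or least connections would not see that difference.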

5️⃣ Model-aware Routing

Why Model Awareness Matters

Different GPUs may host different models.

Worker A → small model
Worker B → large model
Worker C → embedding model

Routing Flow

Request model = large-model
→ Find cluster serving large-model
→ Choose best worker in that cluster

Why Important

Loading a model into GPU memory is expensive.

You do not want to move models frequently.

👉 Interview Answer

Model-aware routing ensures requests go to workers that already have the required model loaded.

Since model loading is expensive and GPU memory is limited, production systems usually keep models warm on specific GPU pools.
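
The routing flow above can be sketched as a two-step lookup: find the pool that already has the model warm, then pick the least-loaded worker in it. The placement table, worker names, and queue depths are made-up values for illustration.

```python
from typing import Dict, List

# Hypothetical placement table: model name -> workers that keep it warm.
MODEL_POOLS: Dict[str, List[str]] = {
    "small-model": ["worker-a"],
    "large-model": ["worker-b", "worker-e"],
    "embedding-model": ["worker-c"],
}

# Hypothetical live queue depths reported by each worker.
QUEUE_DEPTH: Dict[str, int] = {
    "worker-a": 3, "worker-b": 7, "worker-c": 0, "worker-e": 2,
}

def route(model: str) -> str:
    """Pick the least-loaded worker from the pool that already serves `model`."""
    pool = MODEL_POOLS.get(model)
    if not pool:
        # No warm replica: the caller decides whether to trigger a cold load.
        raise LookupError(f"no warm pool for model {model!r}")
    return min(pool, key=lambda w: QUEUE_DEPTH[w])

print(route("large-model"))
```

Note that `worker-c` is idle but is never considered for `large-model`, because sending the request there would force an expensive model load.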

6️⃣ Token-aware Routing

Why Tokens Matter

Request cost depends heavily on:

Input tokens
Output tokens
Context length
Max generation length

Example

Request A:
500 input tokens + 100 output tokens

Request B:
50,000 input tokens + 2,000 output tokens

These should not be treated equally.

Token-aware Routing

Estimate token cost
→ Route large requests to suitable queues
→ Protect latency-sensitive small requests

👉 Interview Answer

Token-aware routing estimates request cost before scheduling.

Long-context or long-generation requests should be routed differently from short interactive requests, otherwise large requests can block small ones and hurt latency.
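
A small sketch of token-aware queue selection using the two example requests above. The thresholds and queue names are assumptions and should be tuned from real traffic. Since the true output length is unknown before generation, the sketch uses the request's maximum output length as a worst-case bound.

```python
from dataclasses import dataclass

@dataclass
class Request:
    input_tokens: int
    max_output_tokens: int  # upper bound; actual output length is unknown upfront

# Assumed thresholds for illustration.
LONG_CONTEXT_TOKENS = 8_000
LONG_GENERATION_TOKENS = 1_000

def estimated_tokens(req: Request) -> int:
    """Worst-case number of tokens the request may hold in the KV cache."""
    return req.input_tokens + req.max_output_tokens

def choose_queue(req: Request) -> str:
    if (req.input_tokens > LONG_CONTEXT_TOKENS
            or req.max_output_tokens > LONG_GENERATION_TOKENS):
        return "long-context"
    return "short-context"

a = Request(input_tokens=500, max_output_tokens=100)
b = Request(input_tokens=50_000, max_output_tokens=2_000)
print(choose_queue(a), estimated_tokens(a))
print(choose_queue(b), estimated_tokens(b))
```

Request B is roughly 87 times larger than Request A by worst-case token count (52,000 vs 600), which is why counting both as "one request" misleads the load balancer.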

7️⃣ Queue-based Load Balancing

Why Use Queues?

GPU workers cannot accept unlimited requests.

A queue provides:

Backpressure
Fairness
Retry handling
Priority handling
Batch formation
Overload protection

Queue Flow

Request
→ Model-specific queue
→ Scheduler
→ GPU batch
→ Response

Queue Types

Per-model queue
Per-priority queue
Per-tenant queue
Short-context queue
Long-context queue

👉 Interview Answer

Queue-based load balancing is common in GPU systems.

Requests are placed into model-specific or priority-specific queues, and the scheduler decides how to batch and execute them on GPUs.

This enables backpressure, fairness, and better GPU utilization.
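
As a rough illustration of backpressure and priority handling, here is a bounded priority queue built on the standard `heapq` module. The class name, the priority values, and the idea of answering a rejected request with an HTTP 429 are assumptions for the sketch.

```python
import heapq
import itertools
from typing import List, Optional, Tuple

class BoundedPriorityQueue:
    """Per-model queue: lower priority number is served first, FIFO within a priority."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._heap: List[Tuple[int, int, str]] = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def offer(self, request_id: str, priority: int) -> bool:
        """Enqueue, or return False so the caller can reject (for example with HTTP 429)."""
        if len(self._heap) >= self.max_size:
            return False  # backpressure: the queue is full
        heapq.heappush(self._heap, (priority, next(self._seq), request_id))
        return True

    def poll(self) -> Optional[str]:
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

q = BoundedPriorityQueue(max_size=3)
accepted = [
    q.offer("batch-1", priority=5),
    q.offer("chat-1", priority=0),
    q.offer("chat-2", priority=0),
    q.offer("batch-2", priority=5),  # rejected: queue is full
]
order = [q.poll(), q.poll(), q.poll()]
print(accepted, order)
```

The chat requests are served before the batch job that arrived first, and the fourth request is rejected instead of growing the queue without bound.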

8️⃣ Dynamic Batching

What Is Dynamic Batching?

Dynamic batching combines multiple requests into one GPU batch.

Request A
Request B
Request C
→ One GPU batch

Why It Helps

Improves throughput
Increases GPU utilization
Reduces cost per token

Trade-off

Wait longer for a bigger batch
→ Better throughput
→ Higher latency

👉 Interview Answer

Dynamic batching improves GPU utilization by grouping multiple requests into one batch.

It increases throughput and lowers cost, but can increase latency if the scheduler waits too long to form a batch.
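
The trade-off can be made concrete with two knobs: a maximum batch size and a maximum wait. The sketch below is a simplified request-level batcher, not the continuous batching used by some LLM servers. The class and parameter names are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
from collections import deque
from typing import Deque, List, Tuple

class DynamicBatcher:
    """Flush a batch when it is full or when the oldest request has waited long enough."""

    def __init__(self, max_batch_size: int, max_wait_s: float) -> None:
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self._pending: Deque[Tuple[float, str]] = deque()  # (arrival_time, request_id)

    def add(self, request_id: str, now: float) -> None:
        self._pending.append((now, request_id))

    def maybe_flush(self, now: float) -> List[str]:
        """Return the next batch, or an empty list if it is worth waiting longer."""
        if not self._pending:
            return []
        full = len(self._pending) >= self.max_batch_size
        oldest_wait = now - self._pending[0][0]
        if not full and oldest_wait < self.max_wait_s:
            return []
        size = min(self.max_batch_size, len(self._pending))
        return [self._pending.popleft()[1] for _ in range(size)]

batcher = DynamicBatcher(max_batch_size=3, max_wait_s=0.010)
batcher.add("A", now=0.000)
batcher.add("B", now=0.002)
early = batcher.maybe_flush(now=0.005)  # not full, oldest has waited 5 ms
late = batcher.maybe_flush(now=0.012)   # oldest has waited 12 ms: flush a partial batch
print(early, late)
```

Raising `max_wait_s` tends to produce fuller batches and better throughput, at the price of added latency for the first request in each batch.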

9️⃣ Latency-aware Routing

Different Workloads Need Different Latency

Interactive Chat

Needs low time-to-first-token.

Batch Summarization

Can tolerate higher latency.

Embeddings

Usually high-throughput and batch-friendly.

Routing Strategy

Interactive traffic → low-latency queue
Batch jobs → throughput-optimized queue
Embeddings → batch-heavy queue

👉 Interview Answer

GPU systems should separate latency-sensitive traffic from throughput-oriented workloads.

Interactive chat should use low-latency queues, while batch jobs and embeddings can use larger batches for better GPU efficiency.
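
One way to express this separation is a per-workload batching policy. The numbers below are assumed values for illustration only, and `QueuePolicy` and `policy_for` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass(frozen=True)
class QueuePolicy:
    max_batch_size: int
    max_wait_ms: int  # how long the scheduler may hold requests to fill a batch

# Assumed values: small batches and short waits for chat, larger ones elsewhere.
POLICIES: Dict[str, QueuePolicy] = {
    "interactive": QueuePolicy(max_batch_size=8, max_wait_ms=5),
    "batch": QueuePolicy(max_batch_size=64, max_wait_ms=200),
    "embedding": QueuePolicy(max_batch_size=256, max_wait_ms=50),
}

def policy_for(workload: str) -> QueuePolicy:
    # Unknown workloads default to the latency-sensitive policy.
    return POLICIES.get(workload, POLICIES["interactive"])

print(policy_for("batch"))
```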

🔟 Memory-aware Load Balancing

Why GPU Memory Matters

GPU memory is often the limiting factor.

Memory is used by:

Model weights
KV cache
Batch activations
Long context (larger KV cache per request)
Parallel requests (one KV cache per request)

Failure Example

Too many long-context requests
→ GPU memory exhausted
→ Worker crashes or rejects requests

Controls

Admission control
Token limits
KV cache budgeting
Max concurrent requests
Memory-aware scheduling

👉 Interview Answer

GPU memory is a key constraint in inference serving.

The load balancer and scheduler should estimate memory usage from model size, input length, output length, batch size, and KV cache usage before admitting requests.
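
A back-of-the-envelope sketch of KV cache budgeting, assuming a standard decoder-only transformer with an unquantized, non-paged KV cache: each token stores one key and one value vector per layer and per KV head. The model shape below is a made-up example, and `admit` is a hypothetical admission check.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelShape:
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int  # 2 for fp16 or bf16

def kv_bytes_per_token(m: ModelShape) -> int:
    # Factor 2: one key vector and one value vector per layer and per KV head.
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.bytes_per_value

def admit(m: ModelShape, input_tokens: int, max_output_tokens: int,
          free_kv_bytes: int) -> bool:
    """Admit only if the worst-case KV cache for this request fits the budget."""
    needed = kv_bytes_per_token(m) * (input_tokens + max_output_tokens)
    return needed <= free_kv_bytes

shape = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)
GIB = 1024 ** 3
print(kv_bytes_per_token(shape))                           # bytes per token
print(admit(shape, 500, 100, free_kv_bytes=4 * GIB))       # short request
print(admit(shape, 50_000, 2_000, free_kv_bytes=4 * GIB))  # long-context request
```

With this assumed shape, each token costs 128 KiB of KV cache, so the 52,000-token request needs about 6.3 GiB and is refused against a 4 GiB budget, while the 600-token request needs only 75 MiB.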

1️⃣1️⃣ Hot Model and Cold Model Problem

Hot Models

Popular models receive heavy traffic.

90% of traffic → one model

This can overload one GPU pool.

Cold Models

Rarely used models waste GPU memory if kept loaded.

Solutions

Dedicated hot model pools
Autoscaling hot pools
Lazy loading cold models
Model eviction
Multi-model serving
Request queuing for cold starts

👉 Interview Answer

GPU systems often face hot model and cold model problems.

Popular models need dedicated scalable pools, while rarely used models may be lazily loaded or evicted to save GPU memory.
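
Lazy loading and eviction can be sketched as a least-recently-used cache over the models resident on one GPU. LRU is only one possible eviction policy, chosen here for simplicity. The class name and the sizes are hypothetical, and the actual weight loading is left out.

```python
from collections import OrderedDict
from typing import List

class ModelCache:
    """Tracks which models are loaded on one GPU and evicts the least recently used."""

    def __init__(self, capacity_gb: float) -> None:
        self.capacity_gb = capacity_gb
        self._loaded: "OrderedDict[str, float]" = OrderedDict()  # model -> size in GB

    def used_gb(self) -> float:
        return sum(self._loaded.values())

    def ensure_loaded(self, model: str, size_gb: float) -> List[str]:
        """Mark `model` as used, loading it if needed. Returns the models evicted."""
        if model in self._loaded:
            self._loaded.move_to_end(model)  # warm hit: refresh recency
            return []
        if size_gb > self.capacity_gb:
            raise ValueError(f"{model} does not fit on this GPU")
        evicted: List[str] = []
        while self.used_gb() + size_gb > self.capacity_gb:
            victim, _ = self._loaded.popitem(last=False)  # least recently used
            evicted.append(victim)
        self._loaded[model] = size_gb  # a real system would load weights here
        return evicted

cache = ModelCache(capacity_gb=40.0)
cache.ensure_loaded("hot-model", 20.0)
cache.ensure_loaded("cold-model-1", 15.0)
cache.ensure_loaded("hot-model", 20.0)  # warm hit
evicted = cache.ensure_loaded("cold-model-2", 15.0)
print(evicted)
```

Because the hot model keeps being touched, the cold model is the one evicted when space is needed.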

1️⃣2️⃣ Autoscaling GPU Workers

Why Autoscaling Is Hard

GPU workers are expensive and slow to start.

Scaling Signals

Queue depth
Queue wait time
GPU utilization
Token throughput
Time to first token
Requests per second
Memory pressure

Challenges

GPU availability
Model loading time
Warmup latency
Cost spikes
Over-provisioning

👉 Interview Answer

Autoscaling GPU workers is harder than autoscaling normal services because GPUs are expensive, scarce, and slow to warm up.

Scaling should consider queue time, GPU utilization, token throughput, memory pressure, and model loading time.
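
A simplified sketch of a scaling decision that combines token throughput with queue wait time. The function, its parameters, and the 80% target utilization are assumptions. The wait-based term uses the same proportional shape as the Kubernetes Horizontal Pod Autoscaler rule (current replicas scaled by observed metric over target metric). It ignores warm-up time and scale-down damping, which a real autoscaler would need.

```python
import math

def desired_replicas(current: int,
                     incoming_tokens_per_s: float,
                     tokens_per_s_per_replica: float,
                     queue_wait_s: float,
                     target_wait_s: float,
                     target_utilization: float = 0.8,
                     min_replicas: int = 1,
                     max_replicas: int = 16) -> int:
    # Replicas needed to absorb demand while staying below target utilization.
    by_throughput = math.ceil(
        incoming_tokens_per_s / (target_utilization * tokens_per_s_per_replica))
    # If requests wait longer than the target, grow in proportion to the overshoot.
    by_wait = 0
    if queue_wait_s > target_wait_s:
        by_wait = math.ceil(current * queue_wait_s / target_wait_s)
    return max(min_replicas, min(max_replicas, max(by_throughput, by_wait)))

print(desired_replicas(4, 30_000, 5_000, queue_wait_s=3.0, target_wait_s=1.0))
print(desired_replicas(4, 10_000, 5_000, queue_wait_s=0.2, target_wait_s=1.0))
```

In the first call the queue wait is three times its target, so the wait term (12 replicas) dominates the throughput term (8). In the second call the system is under target and the throughput term alone suggests 3 replicas.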

1️⃣3️⃣ Fairness and Multi-tenancy

Why Fairness Matters

A few tenants should not monopolize GPU capacity.

Controls

Per-tenant quotas
Weighted fair queues
Priority classes
Token-per-minute limits
Concurrency limits
Budget-based throttling

Example

Tenant A sends a huge batch job
→ Without fairness, Tenant B's chat latency suffers

👉 Interview Answer

Multi-tenant GPU systems need fairness controls.

Per-tenant quotas, priority queues, token limits, and weighted fair scheduling prevent one tenant or workload from starving others.
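
A token-per-minute limit is commonly implemented as a token bucket per tenant. The sketch below is one minimal version. The class, the budget sizes, and the tenant names are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass

@dataclass
class TokenBucket:
    """Per-tenant budget of model tokens, refilled continuously."""
    tokens_per_minute: float
    capacity: float
    available: float
    last_refill: float = 0.0

    def try_consume(self, tokens: int, now: float) -> bool:
        elapsed = now - self.last_refill
        refill = elapsed * self.tokens_per_minute / 60.0
        self.available = min(self.capacity, self.available + refill)
        self.last_refill = now
        if tokens > self.available:
            return False  # throttle: tenant is over its budget
        self.available -= tokens
        return True

buckets = {
    "tenant-a": TokenBucket(tokens_per_minute=60_000, capacity=60_000, available=60_000),
    "tenant-b": TokenBucket(tokens_per_minute=60_000, capacity=60_000, available=60_000),
}
first = buckets["tenant-a"].try_consume(52_000, now=0.0)   # huge batch request
second = buckets["tenant-a"].try_consume(52_000, now=1.0)  # over budget, throttled
chat = buckets["tenant-b"].try_consume(600, now=1.0)       # other tenant unaffected
print(first, second, chat)
```

Tenant A's second large request is throttled until its bucket refills, while Tenant B's small chat request goes through immediately.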

1️⃣4️⃣ Failure Handling

Common Failures

GPU worker crash
Out-of-memory error
Model load failure
Queue timeout
Worker overload
Network failure
Region outage

Fallback Strategies

Retry safe requests
Route to another replica
Use smaller fallback model
Return partial response
Shed low-priority traffic
Circuit breaker for unhealthy workers

👉 Interview Answer

GPU load balancing must handle worker failures, out-of-memory errors, overload, queue timeouts, and regional failures.

Common strategies include retries, replica routing, fallback models, circuit breakers, and load shedding.
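
Replica routing and circuit breaking can be combined in a small failover loop. This is a sketch under stated assumptions: the request is safe to retry, failures surface as `RuntimeError`, and the threshold and cooldown values are arbitrary. `CircuitBreaker`, `call_with_failover`, and `fake_send` are hypothetical names.

```python
from typing import Callable, Dict, List, Optional

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; allows probes after `cooldown_s`."""

    def __init__(self, threshold: int = 3, cooldown_s: float = 30.0) -> None:
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        return now - self.opened_at >= self.cooldown_s  # probe after cooldown

    def record(self, ok: bool, now: float) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = now  # open, or re-open after a failed probe

def call_with_failover(replicas: List[str], breakers: Dict[str, CircuitBreaker],
                       send: Callable[[str], str], now: float) -> str:
    for name in replicas:
        if not breakers[name].allow(now):
            continue  # skip replicas whose breaker is open
        try:
            result = send(name)
        except RuntimeError:
            breakers[name].record(False, now)
            continue  # try the next replica
        breakers[name].record(True, now)
        return result
    raise RuntimeError("all replicas unavailable; shed load or use a fallback model")

def fake_send(name: str) -> str:
    if name == "replica-1":
        raise RuntimeError("simulated out-of-memory error")
    return f"ok from {name}"

names = ["replica-1", "replica-2"]
breakers = {n: CircuitBreaker(threshold=2) for n in names}
results = [call_with_failover(names, breakers, fake_send, now=t) for t in (0.0, 1.0, 2.0)]
print(results)
```

After two consecutive failures the breaker for `replica-1` opens, so the third call skips it entirely instead of paying for another failed attempt.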

1️⃣5️⃣ Observability

What to Monitor

Queue depth
Queue wait time
GPU utilization
GPU memory usage
Time to first token
Tokens per second
Batch size
Request latency
OOM errors
Worker health
Model load time
Cost per token

Debugging Questions

Is latency caused by queueing, prefill, or decoding?
Are GPUs underutilized?
Are batches too small?
Are long requests blocking short ones?
Is a model pool overloaded?

👉 Interview Answer

Observability is essential for GPU load balancing.

I would monitor queue wait time, GPU utilization, memory usage, batch size, time to first token, tokens per second, OOM errors, worker health, and cost per token.
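
To answer the debugging questions above, it helps to record a few timestamps per request and split latency into phases. The sketch below assumes four timestamps are available and treats everything between scheduling and the first token as prefill, which is an approximation. The field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class RequestTrace:
    enqueued_at: float
    scheduled_at: float    # when the scheduler placed the request into a batch
    first_token_at: float
    finished_at: float
    output_tokens: int

def breakdown(t: RequestTrace) -> Dict[str, float]:
    decode_s = t.finished_at - t.first_token_at
    return {
        "queue_wait_s": t.scheduled_at - t.enqueued_at,
        "prefill_s": t.first_token_at - t.scheduled_at,
        "time_to_first_token_s": t.first_token_at - t.enqueued_at,
        "decode_s": decode_s,
        # Tokens after the first one are produced during decode.
        "decode_tokens_per_s": (t.output_tokens - 1) / decode_s if decode_s > 0 else 0.0,
    }

trace = RequestTrace(enqueued_at=0.0, scheduled_at=1.5, first_token_at=2.0,
                     finished_at=4.0, output_tokens=101)
metrics = breakdown(trace)
print(metrics)
```

In this made-up trace, 1.5 of the 2.0 seconds to first token are queue wait, which points at capacity or scheduling rather than at the model itself.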

1️⃣6️⃣ Best Practices

Practical Rules

Use model-aware routing
Use token-aware routing
Separate low-latency and batch workloads
Use dynamic batching carefully
Add admission control
Track GPU memory pressure
Use per-tenant fairness controls
Autoscale based on queue time and token throughput
Add fallback models
Monitor queueing and decoding separately

Design Principle

Balance GPU systems by expected compute cost,
not just request count.

👉 Interview Answer

The best GPU load balancing systems are model-aware, token-aware, memory-aware, and latency-aware.

They balance expected compute cost, not just request count.

🧠 Final Staff-Level Answer

👉 Interview Answer (Full Version)

Load balancing for GPU-based systems is harder than normal HTTP load balancing because requests do not have uniform cost.

In LLM inference, request cost depends on model size, input tokens, output tokens, batch size, KV cache usage, GPU memory, and latency target.

A production architecture usually includes an API gateway, model router, GPU-aware load balancer, inference queues, scheduler, GPU workers, and response streamer.

The first step is model-aware routing.

Requests should go to GPU workers or pools that already have the requested model loaded, because loading models into GPU memory is expensive.

The second step is token-aware routing.

Long-context or long-generation requests should not be treated the same as short interactive requests.

The system should estimate token cost and route requests to appropriate queues.

Queue-based scheduling is important because it provides backpressure, fairness, priority handling, and batch formation.

Dynamic batching improves GPU utilization by grouping requests together, but it introduces a latency-throughput trade-off.

Larger batches improve throughput, but waiting too long to form a batch hurts latency.

GPU memory is another key constraint.

The scheduler must account for model weights, KV cache, context length, batch activations, and concurrent requests.

Without memory-aware admission control, long-context requests can cause out-of-memory errors.

Multi-tenant fairness is also critical.

Per-tenant quotas, priority queues, weighted fair scheduling, token-per-minute limits, and concurrency limits prevent one tenant from consuming all GPU capacity.

Autoscaling GPU workers is difficult because GPUs are expensive, scarce, and slow to warm up.

Scaling should be based on queue wait time, GPU utilization, token throughput, memory pressure, and model loading time.

Finally, observability must separate queueing, prefill, and decode latency, and it must track GPU utilization, memory pressure, batch size, OOM errors, and cost per token.

The core principle is: balance GPU systems by expected compute cost, not just request count.

⭐ Final Insight

GPU load balancing is not simply:

"Send the request to whichever worker has the fewest requests."

Real GPU load balancing has to consider:

Model placement
Token count
Queue time
GPU memory
KV cache
Batch size
Tenant fairness
Latency target
Cost

An ordinary backend looks at request count.

A GPU inference system looks at expected compute cost.

The most important takeaway:

Balance GPU systems by expected compute cost, not just request count.
