🎯 Load Balancing for GPU-based Systems
1️⃣ Core Framework
When discussing GPU-based load balancing, I frame it as:
- Why GPU load balancing is different
- Request routing
- Model-aware routing
- GPU capacity and memory constraints
- Queue-based scheduling
- Dynamic batching
- Failure handling and fallback
- Trade-offs: latency vs throughput vs utilization
2️⃣ Why GPU Load Balancing Is Different
Traditional backend load balancing usually balances HTTP requests.
GPU systems must balance:
- GPU memory
- Model size
- Batch size
- Token length
- Queue time
- KV cache usage
- Request priority
- Inference latency
- Throughput
Key Difference
CPU service:
Request cost is often similar.
GPU inference:
Request cost can vary massively by tokens, model, and output length.
👉 Interview Answer
GPU load balancing is harder than normal service load balancing because requests have very different costs.
A short prompt on a small model is cheap, while a long-context request on a large model can consume much more GPU memory and time.
The load balancer must understand model, token count, queue depth, GPU memory, and latency targets.
3️⃣ High-Level Architecture
Architecture
Client
→ API Gateway
→ Request Validator
→ Model Router
→ GPU Load Balancer
→ Inference Queue
→ Scheduler
→ GPU Workers
→ Response Streamer
Core Components
Model Router
Chooses model family or cluster.
GPU Load Balancer
Chooses the best GPU worker or queue.
Scheduler
Batches and schedules inference.
GPU Worker
Runs model inference.
👉 Interview Answer
A GPU inference platform usually has an API gateway, model router, GPU-aware load balancer, inference queue, scheduler, GPU workers, and response streamer.
The load balancer should route requests based on model availability, queue depth, GPU memory, and expected token cost.
4️⃣ Basic Load Balancing Strategies
Round Robin
Request 1 → Worker A
Request 2 → Worker B
Request 3 → Worker C
Simple, but it ignores per-request cost, so it is a poor fit for GPU inference.
Least Connections
Routes to the worker with the fewest active requests.
Better, but still incomplete: it counts requests, not how expensive each one is.
Least Queue Time
Routes to the worker with the shortest expected queue wait.
More useful for inference systems.
GPU-aware Routing
Routes based on:
- GPU memory
- Current batch
- Queue depth
- Expected token cost
- Model loaded
- Worker health
👉 Interview Answer
Basic strategies like round robin are usually not enough for GPU systems.
Production inference systems need GPU-aware routing that considers queue depth, model placement, GPU memory, expected token count, and worker health.
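A minimal sketch of GPU-aware worker selection, assuming each worker reports its health, loaded models, free memory, and queued token backlog. The `Worker` fields and `pick_worker` are hypothetical names for illustration, not taken from any particular serving framework.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    healthy: bool
    loaded_models: frozenset
    free_memory_gb: float
    queued_tokens: int  # token backlog already waiting on this worker


def pick_worker(workers, model, est_memory_gb):
    """Return the eligible worker with the smallest token backlog, or None."""
    eligible = [
        w for w in workers
        if w.healthy
        and model in w.loaded_models
        and w.free_memory_gb >= est_memory_gb
    ]
    if not eligible:
        return None  # caller can queue, shed, or fall back
    return min(eligible, key=lambda w: w.queued_tokens)
```

Ranking by queued tokens instead of queued requests is the point: two workers with the same request count can have very different backlogs.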
5️⃣ Model-aware Routing
Why Model Awareness Matters
Different GPUs may host different models.
Worker A → small model
Worker B → large model
Worker C → embedding model
Routing Flow
Request model = large-model
→ Find cluster serving large-model
→ Choose best worker in that cluster
Why Important
Loading a model into GPU memory is expensive.
You do not want to move models frequently.
👉 Interview Answer
Model-aware routing ensures requests go to workers that already have the required model loaded.
Since model loading is expensive and GPU memory is limited, production systems usually keep models warm on specific GPU pools.
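One way to sketch the routing flow above is an index from model name to the workers that already have it warm. The worker and model names below are made up for illustration.

```python
from collections import defaultdict


def build_model_index(placements):
    """Map each model name to the workers that already have it loaded."""
    index = defaultdict(list)
    for worker, models in placements.items():
        for model in models:
            index[model].append(worker)
    return dict(index)


def warm_workers(index, model):
    """Workers that can serve the model without a cold load (may be empty)."""
    return index.get(model, [])


placements = {
    "worker-a": ["small-model"],
    "worker-b": ["large-model"],
    "worker-c": ["embedding-model", "small-model"],
}
index = build_model_index(placements)
```

An empty result is the signal to queue the request for a cold start or reject it, not to pick an arbitrary worker.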
6️⃣ Token-aware Routing
Why Tokens Matter
Request cost depends heavily on:
- Input tokens
- Output tokens
- Context length
- Max generation length
Example
Request A:
500 input tokens + 100 output tokens
Request B:
50,000 input tokens + 2,000 output tokens
These should not be treated equally.
Token-aware Routing
Estimate token cost
→ Route large requests to suitable queues
→ Protect latency-sensitive small requests
👉 Interview Answer
Token-aware routing estimates request cost before scheduling.
Long-context or long-generation requests should be routed differently from short interactive requests, otherwise large requests can block small ones and hurt latency.
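A rough sketch of token-aware queue selection, using the two example requests above. The cost formula, the `decode_weight` of 4, and the 8,000 threshold are illustrative assumptions; a real system would calibrate them from measured prefill and decode times.

```python
def estimate_cost(input_tokens, max_output_tokens, decode_weight=4.0):
    """Crude cost estimate; output tokens are weighted more heavily because
    decoding is sequential (the weight itself is an assumption)."""
    return input_tokens + decode_weight * max_output_tokens


def choose_queue(input_tokens, max_output_tokens, long_threshold=8000):
    """Send expensive requests to a separate queue so they cannot block cheap ones."""
    cost = estimate_cost(input_tokens, max_output_tokens)
    return "long-context" if cost >= long_threshold else "short-context"


queue_a = choose_queue(500, 100)       # Request A
queue_b = choose_queue(50_000, 2_000)  # Request B
```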
7️⃣ Queue-based Load Balancing
Why Use Queues?
GPU workers cannot accept unlimited requests.
A queue provides:
- Backpressure
- Fairness
- Retry handling
- Priority handling
- Batch formation
- Overload protection
Queue Flow
Request
→ Model-specific queue
→ Scheduler
→ GPU batch
→ Response
Queue Types
- Per-model queue
- Per-priority queue
- Per-tenant queue
- Short-context queue
- Long-context queue
👉 Interview Answer
Queue-based load balancing is common in GPU systems.
Requests are placed into model-specific or priority-specific queues, and the scheduler decides how to batch and execute them on GPUs.
This enables backpressure, fairness, and better GPU utilization.
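A small sketch of a bounded priority queue that shows backpressure and priority handling together. The class name and the convention that a lower number means higher priority are assumptions for this example.

```python
import heapq
import itertools


class BoundedPriorityQueue:
    def __init__(self, max_size):
        self.max_size = max_size
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: FIFO within a priority

    def offer(self, request, priority):
        """Return False when full, so the caller can reject or retry later."""
        if len(self._heap) >= self.max_size:
            return False
        heapq.heappush(self._heap, (priority, next(self._counter), request))
        return True

    def poll(self):
        """Pop the highest-priority request, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Rejecting at `offer` time is what turns overload into a fast, explicit error instead of unbounded queue wait.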
8️⃣ Dynamic Batching
What Is Dynamic Batching?
Dynamic batching combines multiple requests into one GPU batch.
Request A
Request B
Request C
→ One GPU batch
Why It Helps
- Improves throughput
- Increases GPU utilization
- Reduces cost per token
Trade-off
Wait longer for a bigger batch
→ Better throughput
→ Higher latency
👉 Interview Answer
Dynamic batching improves GPU utilization by grouping multiple requests into one batch.
It increases throughput and lowers cost, but can increase latency if the scheduler waits too long to form a batch.
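The trade-off can be sketched as a batcher with two knobs: a maximum batch size and a maximum wait. This version replays a list of arrival timestamps so it stays deterministic; `form_batches` and its parameters are hypothetical, and a real scheduler would run against a live queue and a clock.

```python
def form_batches(arrivals, max_batch_size, max_wait_s):
    """Group (arrival_time, request) pairs, sorted by time, into batches.

    A batch closes when it is full or when the next request arrives after
    the deadline set by the batch's first request.
    """
    batches = []
    current = []
    deadline = None
    for t, request in arrivals:
        if current and (t > deadline or len(current) >= max_batch_size):
            batches.append(current)
            current = []
        if not current:
            deadline = t + max_wait_s  # the first request starts the wait window
        current.append(request)
    if current:
        batches.append(current)
    return batches


arrivals = [(0.00, "A"), (0.01, "B"), (0.02, "C"), (0.50, "D")]
```

Raising `max_wait_s` tends to produce larger batches and more throughput, at the price of added latency for the first request in each batch.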
9️⃣ Latency-aware Routing
Different Workloads Need Different Latency
Interactive Chat
Needs low time-to-first-token.
Batch Summarization
Can tolerate higher latency.
Embeddings
Usually high-throughput and batch-friendly.
Routing Strategy
Interactive traffic → low-latency queue
Batch jobs → throughput-optimized queue
Embeddings → batch-heavy queue
👉 Interview Answer
GPU systems should separate latency-sensitive traffic from throughput-oriented workloads.
Interactive chat should use low-latency queues, while batch jobs and embeddings can use larger batches for better GPU efficiency.
🔟 Memory-aware Load Balancing
Why GPU Memory Matters
GPU memory is often the limiting factor.
Memory is used by:
- Model weights
- KV cache
- Batch activations
- Long context
- Parallel requests
Failure Example
Too many long-context requests
→ GPU memory exhausted
→ Worker crashes or rejects requests
Controls
- Admission control
- Token limits
- KV cache budgeting
- Max concurrent requests
- Memory-aware scheduling
👉 Interview Answer
GPU memory is a key constraint in inference serving.
The load balancer and scheduler should estimate memory usage from model size, input length, output length, batch size, and KV cache usage before admitting requests.
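A back-of-the-envelope sketch of KV cache budgeting for admission control. For standard transformer attention, each token stores one key and one value vector per layer per KV head, which gives the per-token formula below. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) and the 8 GiB budget are assumed values, and reserving the full `max_output_tokens` up front is a deliberately conservative choice.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache bytes per token: the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def can_admit(input_tokens, max_output_tokens, used_kv_bytes,
              kv_budget_bytes, bytes_per_token):
    """Admit only if the request's worst-case KV cache fits in the remaining budget."""
    needed = (input_tokens + max_output_tokens) * bytes_per_token
    return used_kv_bytes + needed <= kv_budget_bytes


GIB = 1024 ** 3
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
```

Under these assumptions each token costs 128 KiB of KV cache, so a single 52,000-token request needs roughly 6.3 GiB on its own.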
1️⃣1️⃣ Hot Model and Cold Model Problem
Hot Models
Popular models receive heavy traffic.
90% traffic → one model
This can overload one GPU pool.
Cold Models
Rarely used models waste GPU memory if kept loaded.
Solutions
- Dedicated hot model pools
- Autoscaling hot pools
- Lazy loading cold models
- Model eviction
- Multi-model serving
- Request queuing for cold starts
👉 Interview Answer
GPU systems often face hot model and cold model problems.
Popular models need dedicated scalable pools, while rarely used models may be lazily loaded or evicted to save GPU memory.
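Lazy loading plus eviction can be sketched as a least-recently-used cache over loaded models. LRU is just one possible eviction policy, chosen here for simplicity; `ModelCache` and the sizes are hypothetical.

```python
from collections import OrderedDict


class ModelCache:
    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self._loaded = OrderedDict()  # model -> size in GB, least recently used first

    def used_gb(self):
        return sum(self._loaded.values())

    def touch(self, model, size_gb):
        """Mark a model as used, loading it if needed. Returns evicted model names."""
        evicted = []
        if model in self._loaded:
            self._loaded.move_to_end(model)  # now the most recently used
            return evicted
        if size_gb > self.capacity_gb:
            raise ValueError("model does not fit on this worker")
        while self.used_gb() + size_gb > self.capacity_gb:
            name, _ = self._loaded.popitem(last=False)  # evict least recently used
            evicted.append(name)
        self._loaded[model] = size_gb
        return evicted
```

A hot model that keeps getting touched stays resident, while cold models cycle in and out.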
1️⃣2️⃣ Autoscaling GPU Workers
Why Autoscaling Is Hard
GPU workers are expensive and slow to start.
Scaling Signals
- Queue depth
- Queue wait time
- GPU utilization
- Token throughput
- Time to first token
- Requests per second
- Memory pressure
Challenges
- GPU availability
- Model loading time
- Warmup latency
- Cost spikes
- Over-provisioning
👉 Interview Answer
Autoscaling GPU workers is harder than autoscaling normal services because GPUs are expensive, scarce, and slow to warm up.
Scaling should consider queue time, GPU utilization, token throughput, memory pressure, and model loading time.
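A minimal sketch of a scaling decision driven by queue wait time, using a simple proportional rule: scale the replica count by the ratio of observed to target wait. The target, bounds, and function name are assumptions; a production policy would also smooth the signal and account for model load time before counting new replicas as ready.

```python
import math


def desired_replicas(current, queue_wait_s, target_wait_s,
                     min_replicas=1, max_replicas=16):
    """Proportional scaling: more replicas when the wait exceeds the target."""
    raw = math.ceil(current * (queue_wait_s / target_wait_s))
    # Clamp so a noisy signal cannot scale to zero or cause a cost spike.
    return max(min_replicas, min(max_replicas, raw))
```

The `max_replicas` cap is what bounds cost when the queue signal spikes.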
1️⃣3️⃣ Fairness and Multi-tenancy
Why Fairness Matters
A few tenants should not monopolize GPU capacity.
Controls
- Per-tenant quotas
- Weighted fair queues
- Priority classes
- Token-per-minute limits
- Concurrency limits
- Budget-based throttling
Example
Tenant A sends huge batch job
→ Without fairness, Tenant B chat latency suffers
👉 Interview Answer
Multi-tenant GPU systems need fairness controls.
Per-tenant quotas, priority queues, token limits, and weighted fair scheduling prevent one tenant or workload from starving others.
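A token-per-minute limit can be sketched as one token bucket per tenant. The class below is illustrative; `now` is passed in explicitly to keep the example deterministic, where real code would likely read a monotonic clock.

```python
class TokenBucket:
    def __init__(self, tokens_per_minute, burst):
        self.rate = tokens_per_minute / 60.0  # refill rate in tokens per second
        self.capacity = burst
        self.available = burst
        self.last = 0.0

    def try_consume(self, tokens, now):
        """Refill for the elapsed time, then spend tokens if enough are available."""
        elapsed = now - self.last
        self.available = min(self.capacity, self.available + elapsed * self.rate)
        self.last = now
        if tokens > self.available:
            return False  # over quota: throttle or queue this tenant's request
        self.available -= tokens
        return True
```

Charging the bucket in LLM tokens instead of requests keeps one tenant's long-context jobs from counting the same as another tenant's short chats.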
1️⃣4️⃣ Failure Handling
Common Failures
- GPU worker crash
- Out-of-memory error
- Model load failure
- Queue timeout
- Worker overload
- Network failure
- Region outage
Fallback Strategies
- Retry safe requests
- Route to another replica
- Use smaller fallback model
- Return partial response
- Shed low-priority traffic
- Circuit breaker for unhealthy workers
👉 Interview Answer
GPU load balancing must handle worker failures, out-of-memory errors, overload, queue timeouts, and regional failures.
Common strategies include retries, replica routing, fallback models, circuit breakers, and load shedding.
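A minimal circuit breaker sketch for an unhealthy worker: after a number of consecutive failures the worker is skipped, and after a cooldown one probe request is allowed through. The thresholds are arbitrary example values, and `now` is passed in to keep the example deterministic.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_s=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def allow(self, now):
        """True if a request may be sent to this worker."""
        if self.opened_at is None:
            return True
        # After the cooldown, let a probe through (half-open state).
        return now - self.opened_at >= self.cooldown_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open, or re-open after a failed probe
```

While the breaker is open, the load balancer routes to another replica or a fallback model instead of retrying the same failing worker.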
1️⃣5️⃣ Observability
What to Monitor
- Queue depth
- Queue wait time
- GPU utilization
- GPU memory usage
- Time to first token
- Tokens per second
- Batch size
- Request latency
- OOM errors
- Worker health
- Model load time
- Cost per token
Debugging Questions
- Is latency caused by queueing or decoding?
- Are GPUs underutilized?
- Are batches too small?
- Are long requests blocking short ones?
- Is a model pool overloaded?
👉 Interview Answer
Observability is essential for GPU load balancing.
I would monitor queue wait time, GPU utilization, memory usage, batch size, time to first token, tokens per second, OOM errors, worker health, and cost per token.
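To answer "is latency caused by queueing or decoding?", per-request timestamps can be split into phases. The field names below are hypothetical, and treating the time from scheduling to first token as prefill is an approximation.

```python
from dataclasses import dataclass


@dataclass
class RequestTimeline:
    enqueued_at: float
    started_at: float      # scheduler picked the request up
    first_token_at: float
    finished_at: float
    output_tokens: int


def breakdown(t):
    """Split end-to-end latency into queueing, prefill, and decode phases."""
    decode_s = t.finished_at - t.first_token_at
    return {
        "queue_wait_s": t.started_at - t.enqueued_at,
        "prefill_s": t.first_token_at - t.started_at,
        "ttft_s": t.first_token_at - t.enqueued_at,
        "decode_s": decode_s,
        # The first token is produced by prefill, so decode covers the rest.
        "decode_tokens_per_s": (t.output_tokens - 1) / decode_s if decode_s > 0 else 0.0,
    }


stats = breakdown(RequestTimeline(0.0, 2.0, 2.5, 6.5, 101))
```

In this made-up timeline, most of the time-to-first-token is queue wait and very little is prefill, which points at capacity or routing instead of model speed.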
1️⃣6️⃣ Best Practices
Practical Rules
- Use model-aware routing
- Use token-aware routing
- Separate low-latency and batch workloads
- Use dynamic batching carefully
- Add admission control
- Track GPU memory pressure
- Use per-tenant fairness controls
- Autoscale based on queue time and token throughput
- Add fallback models
- Monitor queueing and decoding separately
Design Principle
Balance GPU systems by expected compute cost,
not just request count.
👉 Interview Answer
The best GPU load balancing systems are model-aware, token-aware, memory-aware, and latency-aware.
They balance expected compute cost, not just request count.
🧠 Final Staff-Level Answer
👉 Interview Answer (Full Version)
Load balancing for GPU-based systems is harder than normal HTTP load balancing because requests do not have uniform cost.
In LLM inference, request cost depends on model size, input tokens, output tokens, batch size, KV cache usage, GPU memory, and latency target.
A production architecture usually includes an API gateway, model router, GPU-aware load balancer, inference queues, scheduler, GPU workers, and response streamer.
The first step is model-aware routing.
Requests should go to GPU workers or pools that already have the requested model loaded, because loading models into GPU memory is expensive.
The second step is token-aware routing.
Long-context or long-generation requests should not be treated the same as short interactive requests.
The system should estimate token cost and route requests to appropriate queues.
Queue-based scheduling is important because it provides backpressure, fairness, priority handling, and batch formation.
Dynamic batching improves GPU utilization by grouping requests together, but it introduces a latency-throughput trade-off.
Larger batches improve throughput, but waiting too long to form a batch hurts latency.
GPU memory is another key constraint.
The scheduler must account for model weights, KV cache, context length, batch activations, and concurrent requests.
Without memory-aware admission control, long-context requests can cause out-of-memory errors.
Multi-tenant fairness is also critical.
Per-tenant quotas, priority queues, weighted fair scheduling, token-per-minute limits, and concurrency limits prevent one tenant from consuming all GPU capacity.
Autoscaling GPU workers is difficult because GPUs are expensive, scarce, and slow to warm up.
Scaling should be based on queue wait time, GPU utilization, token throughput, memory pressure, and model loading time.
Finally, observability must separate queueing latency, prefill latency, decode latency, GPU utilization, memory pressure, batch size, OOM errors, and cost per token.
The core principle is: balance GPU systems by expected compute cost, not just request count.
⭐ Final Insight
GPU load balancing is not as simple as:
"Send the request to whichever worker has the fewest requests."
Real GPU load balancing has to consider:
- Model Placement
- Token Count
- Queue Time
- GPU Memory
- KV Cache
- Batch Size
- Tenant Fairness
- Latency Target
- Cost
A typical backend looks at request count.
A GPU inference system looks at expected compute cost.
The single most important sentence:
Balance GPU systems by expected compute cost, not just request count.