
System Design Deep Dive - 05 Load Balancing for GPU-based Systems

Post by ailswan May. 24, 2026


🎯 Load Balancing for GPU-based Systems


1️⃣ Core Framework

When discussing GPU-based load balancing, I frame it as:

  1. Why GPU load balancing is different
  2. Request routing
  3. Model-aware routing
  4. GPU capacity and memory constraints
  5. Queue-based scheduling
  6. Dynamic batching
  7. Failure handling and fallback
  8. Trade-offs: latency vs throughput vs utilization

2️⃣ Why GPU Load Balancing Is Different

Traditional backend load balancing usually balances HTTP requests.

GPU systems must balance:

  • Model placement
  • Token count
  • Queue depth
  • GPU memory
  • Latency targets


Key Difference

CPU service:
Request cost is often similar.

GPU inference:
Request cost can vary by orders of magnitude depending on model size, input tokens, and output length.

👉 Interview Answer

GPU load balancing is harder than normal service load balancing because requests have very different costs.

A short prompt on a small model is cheap, while a long-context request on a large model can consume much more GPU memory and time.

The load balancer must understand model, token count, queue depth, GPU memory, and latency targets.


3️⃣ High-Level Architecture


Architecture

Client
→ API Gateway
→ Request Validator
→ Model Router
→ GPU Load Balancer
→ Inference Queue
→ Scheduler
→ GPU Workers
→ Response Streamer

Core Components

Model Router

Chooses model family or cluster.


GPU Load Balancer

Chooses the best GPU worker or queue.


Scheduler

Batches and schedules inference.


GPU Worker

Runs model inference.


👉 Interview Answer

A GPU inference platform usually has an API gateway, model router, GPU-aware load balancer, inference queue, scheduler, GPU workers, and response streamer.

The load balancer should route requests based on model availability, queue depth, GPU memory, and expected token cost.


4️⃣ Basic Load Balancing Strategies


Round Robin

Request 1 → Worker A
Request 2 → Worker B
Request 3 → Worker C

Simple, but not ideal for GPU inference, because it ignores how much work each request carries.


Least Connections

Routes to the worker with the fewest active requests.

Better, but still incomplete: a count of active requests says nothing about how expensive each one is.


Least Queue Time

Routes to the worker with the shortest expected queue wait.

More useful for inference systems.


GPU-aware Routing

Routes based on:

  • Queue depth
  • Model placement
  • GPU memory
  • Expected token count
  • Worker health


👉 Interview Answer

Basic strategies like round robin are usually not enough for GPU systems.

Production inference systems need GPU-aware routing that considers queue depth, model placement, GPU memory, expected token count, and worker health.
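To make the contrast with round robin concrete, here is a minimal sketch of GPU-aware worker selection. The `Worker` fields and the `pick_worker` helper are hypothetical names I made up for illustration, and the scoring rule (filter by health, model, and free memory, then take the least queued tokens) is one simple option among many, not a reference implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Worker:
    name: str
    loaded_models: set       # models already resident in GPU memory
    queued_tokens: int       # estimated tokens waiting in this worker's queue
    free_memory_gb: float    # GPU memory still available for new requests
    healthy: bool = True


def pick_worker(workers, model, needed_memory_gb) -> Optional[Worker]:
    """Return the eligible worker with the least queued work, or None."""
    eligible = [
        w for w in workers
        if w.healthy
        and model in w.loaded_models
        and w.free_memory_gb >= needed_memory_gb
    ]
    if not eligible:
        return None  # caller can queue, shed load, or fall back
    return min(eligible, key=lambda w: w.queued_tokens)


workers = [
    Worker("A", {"small-model"}, queued_tokens=10, free_memory_gb=40.0),
    Worker("B", {"large-model"}, queued_tokens=5000, free_memory_gb=20.0),
    Worker("C", {"large-model"}, queued_tokens=200, free_memory_gb=20.0),
    Worker("D", {"large-model"}, queued_tokens=0, free_memory_gb=20.0, healthy=False),
]
choice = pick_worker(workers, "large-model", needed_memory_gb=8.0)
print(choice.name if choice else "no eligible worker")
```

Worker D has the emptiest queue but is unhealthy, and Worker A does not host the model, so both are filtered out before queue depth is even compared.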


5️⃣ Model-aware Routing


Why Model Awareness Matters

Different GPUs may host different models.

Worker A → small model
Worker B → large model
Worker C → embedding model

Routing Flow

Request model = large-model
→ Find cluster serving large-model
→ Choose best worker in that cluster

Why Important

Loading a model into GPU memory is expensive.

You do not want to load and unload models frequently.


👉 Interview Answer

Model-aware routing ensures requests go to workers that already have the required model loaded.

Since model loading is expensive and GPU memory is limited, production systems usually keep models warm on specific GPU pools.
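The routing flow above can be sketched as a small registry that maps each model to its warm pool. `ModelRouter` and its method names are assumptions for illustration; a real router would also track health and memory, which I leave out here to keep the model lookup visible.

```python
class ModelRouter:
    """Maps each model to the workers that already have it loaded."""

    def __init__(self):
        self._pools = {}  # model -> {worker name: active request count}

    def register(self, model, worker):
        self._pools.setdefault(model, {})[worker] = 0

    def route(self, model):
        pool = self._pools.get(model)
        if not pool:
            # No warm worker: the caller decides whether to load the model
            # somewhere or reject the request.
            raise LookupError(f"no warm worker for model {model!r}")
        worker = min(pool, key=pool.get)  # least active requests in the pool
        pool[worker] += 1
        return worker

    def release(self, model, worker):
        self._pools[model][worker] -= 1


router = ModelRouter()
router.register("small-model", "A")
router.register("large-model", "B")
router.register("large-model", "C")
print(router.route("large-model"))
```

The point of the sketch is the order of decisions: pick the pool by model first, then balance inside the pool.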


6️⃣ Token-aware Routing


Why Tokens Matter

Request cost depends heavily on:

  • Input tokens (context length)
  • Output tokens (generation length)
  • Model size


Example

Request A:
500 input tokens + 100 output tokens

Request B:
50,000 input tokens + 2,000 output tokens

These should not be treated equally.


Token-aware Routing

Estimate token cost
→ Route large requests to suitable queues
→ Protect latency-sensitive small requests

👉 Interview Answer

Token-aware routing estimates request cost before scheduling.

Long-context or long-generation requests should be routed differently from short interactive requests, otherwise large requests can block small ones and hurt latency.
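As a rough sketch of the idea, the function below classifies a request into a queue from its token counts. The queue names and both thresholds are assumptions I picked for the example; in practice they would be tuned per model and per latency target.

```python
def choose_queue(input_tokens, max_output_tokens,
                 long_context_tokens=8_000, long_generation_tokens=1_000):
    """Send long-context or long-generation requests to a separate queue."""
    if (input_tokens >= long_context_tokens
            or max_output_tokens >= long_generation_tokens):
        return "heavy"
    return "interactive"


# Request A and Request B from the example above.
print(choose_queue(500, 100))
print(choose_queue(50_000, 2_000))
```

Request A stays on the interactive queue and Request B is diverted, so the long request cannot sit in front of short chat traffic.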


7️⃣ Queue-based Load Balancing


Why Use Queues?

GPU workers cannot accept unlimited requests.

A queue provides:

  • Backpressure
  • Fairness
  • Priority handling
  • Batch formation


Queue Flow

Request
→ Model-specific queue
→ Scheduler
→ GPU batch
→ Response

Queue Types

  • Model-specific queues
  • Priority-specific queues


👉 Interview Answer

Queue-based load balancing is common in GPU systems.

Requests are placed into model-specific or priority-specific queues, and the scheduler decides how to batch and execute them on GPUs.

This enables backpressure, fairness, and better GPU utilization.
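Backpressure is the easiest of these properties to show in code. The sketch below keeps one bounded FIFO queue per model using Python's standard `queue` module and refuses new work when a queue is full. `ModelQueues` and `max_depth` are hypothetical names, and a real system would add priorities and timeouts on top.

```python
import queue


class ModelQueues:
    """One bounded FIFO queue per model; a full queue rejects new work."""

    def __init__(self, max_depth):
        if max_depth <= 0:
            # queue.Queue treats maxsize <= 0 as unbounded, which would
            # silently disable backpressure.
            raise ValueError("max_depth must be positive")
        self._max_depth = max_depth
        self._queues = {}  # model name -> queue.Queue

    def submit(self, model, request):
        q = self._queues.get(model)
        if q is None:
            q = self._queues[model] = queue.Queue(maxsize=self._max_depth)
        try:
            q.put_nowait(request)
            return True
        except queue.Full:
            return False  # caller should tell the client to retry later

    def next_request(self, model):
        q = self._queues.get(model)
        if q is None:
            return None
        try:
            return q.get_nowait()
        except queue.Empty:
            return None


queues = ModelQueues(max_depth=2)
print([queues.submit("large-model", f"req-{i}") for i in range(3)])
```

Rejecting at the queue boundary is what turns overload into a fast, explicit error instead of an ever-growing wait.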


8️⃣ Dynamic Batching


What Is Dynamic Batching?

Dynamic batching combines multiple requests into one GPU batch.

Request A
Request B
Request C
→ One GPU batch

Why It Helps

  • Higher GPU utilization
  • Higher throughput
  • Lower cost


Trade-off

Wait longer for bigger batch
→ Better throughput
→ Higher latency

👉 Interview Answer

Dynamic batching improves GPU utilization by grouping multiple requests into one batch.

It increases throughput and lowers cost, but can increase latency if the scheduler waits too long to form a batch.
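The trade-off usually comes down to two knobs: a maximum batch size and a maximum wait. Here is a minimal sketch of that policy as a pure function, so the timing logic is easy to follow. `form_batch`, `max_batch_size`, and `max_wait_s` are names I chose for illustration, not the API of any particular serving engine.

```python
from collections import deque


def form_batch(pending, now, max_batch_size, max_wait_s):
    """Return a batch if it is full or the oldest request waited long enough.

    `pending` is a deque of (arrival_time, request) pairs, oldest first.
    An empty list means the scheduler should keep waiting.
    """
    if not pending:
        return []
    oldest_arrival = pending[0][0]
    batch_is_full = len(pending) >= max_batch_size
    waited_too_long = now - oldest_arrival >= max_wait_s
    if not (batch_is_full or waited_too_long):
        return []
    batch = []
    while pending and len(batch) < max_batch_size:
        batch.append(pending.popleft()[1])
    return batch


pending = deque([(0.0, "A"), (0.01, "B"), (0.02, "C")])
print(form_batch(pending, now=0.03, max_batch_size=4, max_wait_s=0.05))
print(form_batch(pending, now=0.05, max_batch_size=4, max_wait_s=0.05))
```

Raising `max_wait_s` gives the scheduler more time to fill a batch, which helps throughput, and adds up to that much latency to the first request in the batch.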


9️⃣ Latency-aware Routing


Different Workloads Need Different Latency

Interactive Chat

Needs low time-to-first-token.


Batch Summarization

Can tolerate higher latency.


Embeddings

Usually high-throughput and batch-friendly.


Routing Strategy

Interactive traffic → low-latency queue
Batch jobs → throughput-optimized queue
Embeddings → batch-heavy queue

👉 Interview Answer

GPU systems should separate latency-sensitive traffic from throughput-oriented workloads.

Interactive chat should use low-latency queues, while batch jobs and embeddings can use larger batches for better GPU efficiency.


🔟 Memory-aware Load Balancing


Why GPU Memory Matters

GPU memory is often the limiting factor.

Memory is used by:

  • Model weights
  • KV cache
  • Context length
  • Batch activations
  • Concurrent requests


Failure Example

Too many long-context requests
→ GPU memory exhausted
→ Worker crashes or rejects requests

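Controls

  • Memory-aware admission control
  • Per-request memory estimates from model size, input length, output length, batch size, and KV cache usage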


👉 Interview Answer

GPU memory is a key constraint in inference serving.

The load balancer and scheduler should estimate memory usage from model size, input length, output length, batch size, and KV cache usage before admitting requests.
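For a standard transformer, the KV cache stores one key and one value vector per layer, per KV head, per token, so its size grows linearly with sequence length. The sketch below uses that formula for a simple admission check. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) and the 8 GiB budget are assumptions for illustration, and `KVAdmission` is a hypothetical class. It reserves the worst case up front, which is conservative compared with engines that grow the cache incrementally.

```python
def kv_cache_bytes(tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    # Factor 2: one key vector and one value vector per token, layer, and head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * tokens


class KVAdmission:
    """Admit a request only if its worst-case KV cache fits in the budget."""

    def __init__(self, budget_bytes, num_layers, num_kv_heads, head_dim):
        self.budget_bytes = budget_bytes
        self.shape = (num_layers, num_kv_heads, head_dim)
        self.reserved_bytes = 0

    def _need(self, input_tokens, max_output_tokens):
        return kv_cache_bytes(input_tokens + max_output_tokens, *self.shape)

    def try_admit(self, input_tokens, max_output_tokens):
        need = self._need(input_tokens, max_output_tokens)
        if self.reserved_bytes + need > self.budget_bytes:
            return False  # would risk an out-of-memory error
        self.reserved_bytes += need
        return True

    def release(self, input_tokens, max_output_tokens):
        self.reserved_bytes -= self._need(input_tokens, max_output_tokens)


admission = KVAdmission(8 * 2**30, num_layers=32, num_kv_heads=8, head_dim=128)
print(admission.try_admit(50_000, 2_000))  # first long-context request
print(admission.try_admit(50_000, 2_000))  # second one no longer fits
print(admission.try_admit(500, 100))       # a short request still fits
```

With this assumed shape each token costs 128 KiB of KV cache, so one 52,000-token request reserves roughly 6.8 GB and a second one exceeds the 8 GiB budget.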


1️⃣1️⃣ Hot Model and Cold Model Problem


Hot Models

Popular models receive heavy traffic.

90% traffic → one model

This can overload one GPU pool.


Cold Models

Rarely used models waste GPU memory if kept loaded.


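Solutions

  • Dedicated, scalable pools for hot models
  • Lazy loading for rarely used models
  • Eviction of cold models to free GPU memory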


👉 Interview Answer

GPU systems often face hot model and cold model problems.

Popular models need dedicated scalable pools, while rarely used models may be lazily loaded or evicted to save GPU memory.
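One common way to combine lazy loading and eviction is a least-recently-used policy over the models resident on a worker. The sketch below is an illustration under that assumption. `WarmModelCache`, the sizes in GB, and the model names are all hypothetical, and the source does not say which eviction policy to use.

```python
from collections import OrderedDict


class WarmModelCache:
    """Tracks models resident on one GPU and evicts the least recently used."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self._loaded = OrderedDict()  # model -> size in GB, oldest use first

    def use(self, model, size_gb):
        """Mark `model` as used, loading it if needed. Returns evicted models."""
        if model in self._loaded:
            self._loaded.move_to_end(model)  # now the most recently used
            return []
        if size_gb > self.capacity_gb:
            raise ValueError(f"{model!r} does not fit on this GPU")
        evicted = []
        while sum(self._loaded.values()) + size_gb > self.capacity_gb:
            victim, _ = self._loaded.popitem(last=False)  # coldest model
            evicted.append(victim)
        self._loaded[model] = size_gb  # lazy load happens here
        return evicted

    def loaded(self):
        return list(self._loaded)


cache = WarmModelCache(capacity_gb=40)
cache.use("chat-model", 20)
cache.use("embedding-model", 5)
cache.use("chat-model", 20)           # hot model stays warm
print(cache.use("rare-model", 20))    # the coldest model is evicted to make room
```

Hot models keep getting moved to the warm end of the ordering, so under this policy they are the last candidates for eviction.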


1️⃣2️⃣ Autoscaling GPU Workers


Why Autoscaling Is Hard

GPU workers are expensive and slow to start.


Scaling Signals

  • Queue wait time
  • GPU utilization
  • Token throughput
  • Memory pressure
  • Model loading time


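Challenges

  • GPUs are expensive
  • GPUs are scarce
  • Workers are slow to warm up, since the model must be loaded first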


👉 Interview Answer

Autoscaling GPU workers is harder than autoscaling normal services because GPUs are expensive, scarce, and slow to warm up.

Scaling should consider queue time, GPU utilization, token throughput, memory pressure, and model loading time.
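As a sketch of a queue-time-driven scaling rule, the function below borrows the proportional formula documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × observed / target)) and applies it to queue wait. Using p95 queue wait as the metric, the replica bounds, and the function name are my assumptions. Queue wait does not scale linearly with replicas, so treat this as a starting heuristic rather than a tuned policy.

```python
import math


def desired_replicas(current, p95_queue_wait_s, target_queue_wait_s,
                     min_replicas=1, max_replicas=16):
    """Proportional scaling on queue wait, clamped to [min, max] replicas."""
    ratio = p95_queue_wait_s / target_queue_wait_s
    desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, p95_queue_wait_s=3.0, target_queue_wait_s=1.0))
print(desired_replicas(4, p95_queue_wait_s=0.5, target_queue_wait_s=1.0))
```

Because GPU workers are slow to warm up, a rule like this is usually paired with a generous `min_replicas` so the pool never has to start from cold under load.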


1️⃣3️⃣ Fairness and Multi-tenancy


Why Fairness Matters

A few tenants should not monopolize GPU capacity.


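Controls

  • Per-tenant quotas
  • Priority queues
  • Token-per-minute limits
  • Concurrency limits
  • Weighted fair scheduling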


Example

Tenant A sends huge batch job
→ Without fairness, Tenant B chat latency suffers

👉 Interview Answer

Multi-tenant GPU systems need fairness controls.

Per-tenant quotas, priority queues, token limits, and weighted fair scheduling prevent one tenant or workload from starving others.
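A token-per-minute limit can be sketched as a per-tenant token bucket that is charged in model tokens rather than in requests. `TenantTokenBudget` is a hypothetical class, and passing `now` explicitly is a choice I made so the example is deterministic; a real limiter would read a monotonic clock.

```python
class TenantTokenBudget:
    """Per-tenant token bucket, refilled continuously up to one minute's quota."""

    def __init__(self, tokens_per_minute):
        self.capacity = float(tokens_per_minute)
        self.refill_per_second = tokens_per_minute / 60.0
        self._state = {}  # tenant -> (available tokens, last update time)

    def try_consume(self, tenant, tokens, now):
        available, last = self._state.get(tenant, (self.capacity, now))
        # Refill for the elapsed time, never above the bucket capacity.
        available = min(self.capacity,
                        available + (now - last) * self.refill_per_second)
        if tokens > available:
            self._state[tenant] = (available, now)
            return False  # tenant is over budget; queue or reject the request
        self._state[tenant] = (available - tokens, now)
        return True


budget = TenantTokenBudget(tokens_per_minute=60_000)
print(budget.try_consume("tenant-a", 50_000, now=0.0))  # huge batch job
print(budget.try_consume("tenant-a", 20_000, now=0.0))  # over budget
print(budget.try_consume("tenant-b", 500, now=0.0))     # chat is unaffected
```

Charging by tokens instead of request count matches the cost model in the rest of this post: Tenant A's large job drains its own bucket and leaves Tenant B's untouched.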


1️⃣4️⃣ Failure Handling


Common Failures

  • Worker failures
  • Out-of-memory errors
  • Overload
  • Queue timeouts
  • Regional failures


Fallback Strategies

  • Retries
  • Replica routing
  • Fallback models
  • Circuit breakers
  • Load shedding


👉 Interview Answer

GPU load balancing must handle worker failures, out-of-memory errors, overload, queue timeouts, and regional failures.

Common strategies include retries, replica routing, fallback models, circuit breakers, and load shedding.
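Two of these strategies, replica routing and a fallback model, fit in a few lines. In the sketch below, each replica is just a callable, and `WorkerError`, `infer_with_fallback`, and `max_attempts` are hypothetical names. Circuit breakers and load shedding are left out to keep the example short.

```python
class WorkerError(Exception):
    """Raised by a replica on OOM, timeout, or crash (hypothetical)."""


def infer_with_fallback(request, replicas, fallback=None, max_attempts=2):
    """Try up to `max_attempts` replicas in order, then the fallback model."""
    last_error = WorkerError("no replicas available")
    for call_replica in replicas[:max_attempts]:
        try:
            return call_replica(request)
        except WorkerError as exc:
            last_error = exc  # remember the failure and try the next replica
    if fallback is not None:
        return fallback(request)  # e.g. a smaller model with spare capacity
    raise last_error


def failing_replica(request):
    raise WorkerError("out of memory")


def healthy_replica(request):
    return f"large-model answer to {request!r}"


def small_model(request):
    return f"small-model answer to {request!r}"


print(infer_with_fallback("hello", [failing_replica, healthy_replica]))
print(infer_with_fallback("hello", [failing_replica], fallback=small_model))
```

Capping attempts matters here: unbounded retries against an overloaded pool add load exactly when the pool can least absorb it.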


1️⃣5️⃣ Observability


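What to Monitor

  • Queue wait time
  • GPU utilization
  • Memory usage
  • Batch size
  • Time to first token
  • Tokens per second
  • OOM errors
  • Worker health
  • Cost per token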


Debugging Questions


👉 Interview Answer

Observability is essential for GPU load balancing.

I would monitor queue wait time, GPU utilization, memory usage, batch size, time to first token, tokens per second, OOM errors, worker health, and cost per token.
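Several of these metrics fall out of four timestamps per request. The sketch below shows one way to derive them. `RequestTrace` and its field names are assumptions for illustration, and measuring time to first token from enqueue time (so it includes queue wait) is a choice on my part; some teams measure it from when the request starts executing.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    enqueued_at: float      # request entered the inference queue
    started_at: float       # scheduler placed it in a GPU batch
    first_token_at: float   # first output token was produced
    finished_at: float      # last output token was produced
    output_tokens: int

    @property
    def queue_wait(self):
        return self.started_at - self.enqueued_at

    @property
    def prefill_latency(self):
        return self.first_token_at - self.started_at

    @property
    def time_to_first_token(self):
        # Measured from enqueue, so it includes queue wait.
        return self.first_token_at - self.enqueued_at

    @property
    def decode_tokens_per_second(self):
        decode_time = self.finished_at - self.first_token_at
        if decode_time <= 0 or self.output_tokens <= 1:
            return 0.0
        # The first token belongs to prefill, so it is excluded here.
        return (self.output_tokens - 1) / decode_time


trace = RequestTrace(enqueued_at=0.0, started_at=0.5, first_token_at=1.0,
                     finished_at=3.0, output_tokens=101)
print(trace.queue_wait, trace.time_to_first_token, trace.decode_tokens_per_second)
```

Splitting latency this way tells you where to look: high queue wait points at capacity or routing, high prefill latency at long contexts, and low decode throughput at batch size or memory pressure.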


1️⃣6️⃣ Best Practices


Practical Rules


Design Principle

Balance GPU systems by expected compute cost,
not just request count.

👉 Interview Answer

The best GPU load balancing systems are model-aware, token-aware, memory-aware, and latency-aware.

They balance expected compute cost, not just request count.


🧠 Final Staff-Level Answer


👉 Interview Answer (Full Version)

Load balancing for GPU-based systems is harder than normal HTTP load balancing because requests do not have uniform cost.

In LLM inference, request cost depends on model size, input tokens, output tokens, batch size, KV cache usage, GPU memory, and latency target.

A production architecture usually includes an API gateway, model router, GPU-aware load balancer, inference queues, scheduler, GPU workers, and response streamer.

The first step is model-aware routing.

Requests should go to GPU workers or pools that already have the requested model loaded, because loading models into GPU memory is expensive.

The second step is token-aware routing.

Long-context or long-generation requests should not be treated the same as short interactive requests.

The system should estimate token cost and route requests to appropriate queues.

Queue-based scheduling is important because it provides backpressure, fairness, priority handling, and batch formation.

Dynamic batching improves GPU utilization by grouping requests together, but it introduces a latency-throughput trade-off.

Larger batches improve throughput, but waiting too long to form a batch hurts latency.

GPU memory is another key constraint.

The scheduler must account for model weights, KV cache, context length, batch activations, and concurrent requests.

Without memory-aware admission control, long-context requests can cause out-of-memory errors.

Multi-tenant fairness is also critical.

Per-tenant quotas, priority queues, weighted fair scheduling, token-per-minute limits, and concurrency limits prevent one tenant from consuming all GPU capacity.

Autoscaling GPU workers is difficult because GPUs are expensive, scarce, and slow to warm up.

Scaling should be based on queue wait time, GPU utilization, token throughput, memory pressure, and model loading time.

Finally, observability must break latency down into queueing, prefill, and decode time, and track GPU utilization, memory pressure, batch size, OOM errors, and cost per token.

The core principle is: balance GPU systems by expected compute cost, not just request count.


⭐ Final Insight

GPU load balancing is not simply:

"Send the request to whichever worker has the fewest requests."

Real GPU load balancing has to consider:

  • Model Placement
  • Token Count
  • Queue Time
  • GPU Memory
  • KV Cache
  • Batch Size
  • Tenant Fairness
  • Latency Target
  • Cost

An ordinary backend looks at request count.

A GPU inference system looks at expected compute cost.

The single most important sentence:

Balance GPU systems by expected compute cost, not just request count.



