🎯 Model Routing Systems: GPT-4 vs Mini Models

1️⃣ Core Framework

When discussing model routing systems, I frame the topic around:

Why model routing is needed
Large models vs mini models
Routing signals
Cost and latency trade-offs
Quality and fallback strategy
Confidence-based escalation
Multi-model orchestration
Trade-offs: quality vs cost vs speed

2️⃣ Why Model Routing Is Needed

Not every request needs the largest model.

Some tasks are simple.

Some tasks are complex.

A good system routes each request to the right model.

Basic Idea

User Request
→ Classify task complexity
→ Choose model
→ Run inference
→ Validate output
→ Escalate if needed

👉 Interview Answer

Model routing is the process of selecting the right model for each request.

The goal is to balance quality, latency, and cost.

Simple tasks can use smaller models, while complex reasoning, high-risk, or ambiguous tasks should use larger models.

3️⃣ Large Models vs Mini Models

Large Models

Large models are stronger at:

Complex reasoning
Ambiguous tasks
Long context
Multi-step planning
Code generation
Hard instruction following
High-risk decisions

Mini Models

Mini models are better for:

Simple classification
Short summarization
Data extraction
Routing decisions
Formatting
Lightweight Q&A
Cheap high-volume workloads

Comparison

Dimension	Large Model	Mini Model
Quality	Higher	Lower
Cost	Higher	Lower
Latency	Higher	Lower
Reasoning	Stronger	Weaker
Throughput	Lower	Higher
Best for	Complex tasks	Simple tasks

👉 Interview Answer

Large models provide better reasoning and instruction following, but they are slower and more expensive.

Mini models are cheaper and faster, but weaker on complex or ambiguous tasks.

Model routing lets the system use each model where it fits best.

4️⃣ High-Level Architecture

Architecture

User Request
→ API Gateway
→ Request Validator
→ Router / Classifier
→ Model Policy Engine
→ Model Selection
→ Inference Service
→ Output Validator
→ Fallback / Escalation
→ Final Response

Core Components

Router

Classifies request type and complexity.

Policy Engine

Applies business rules, cost limits, risk rules, and model permissions.

Inference Service

Runs the selected model.

Output Validator

Checks correctness, format, safety, and confidence.
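As a rough illustration of how a policy engine can sit between the router and model selection, here is a minimal sketch. The function name `apply_policy`, the tier names, and the specific rules are all hypothetical; the notes only say that the engine applies business rules, cost limits, risk rules, and model permissions.

```python
def apply_policy(proposed: str, user_tier: str, risk: str,
                 budget_remaining: float) -> str:
    """Adjust the router's proposed model using business and risk rules."""
    # Assumption: risk rules win over cost and permission rules.
    if risk == "high":
        return "large"
    # Hypothetical permission rule: free-tier users cannot use the large model.
    if user_tier == "free":
        return "mini"
    # Hypothetical cost limit: downgrade once the budget is spent.
    if proposed == "large" and budget_remaining <= 0:
        return "mini"
    return proposed
```

The ordering of the checks is the design choice worth discussing: whichever rule comes first wins when rules conflict, so here risk outranks cost.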

👉 Interview Answer

A model routing system usually includes a router, policy engine, inference service, output validator, and fallback logic.

The router chooses an initial model, and the system can escalate to a stronger model when needed.

5️⃣ Routing Signals

What Signals Should Be Used?

A router may consider:

Task type
Prompt length
User tier
Required latency
Required quality
Risk level
Tool usage
Output format
Domain complexity
Previous failure history

Example

Short JSON extraction
→ Mini model

Legal policy interpretation
→ Large model

Complex debugging question
→ Large model

👉 Interview Answer

Model routing should use signals such as task type, prompt length, risk level, latency requirement, user tier, output format, and expected complexity.

The goal is to choose the cheapest model that can reliably solve the task.

6️⃣ Rule-based Routing

What Is Rule-based Routing?

Rule-based routing uses explicit rules.

Example Rules

If task = summarization and input < 2,000 tokens
→ Use mini model

If task = complex reasoning
→ Use large model

If risk = high
→ Use large model

If output must be strict JSON
→ Use mini model first, validate, then fall back to large model on failure
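The example rules above can be sketched as a small function. This is only one possible encoding: the `Request` fields, the task labels, and the default for unmatched requests are assumptions, not part of the rules themselves.

```python
from dataclasses import dataclass


@dataclass
class Request:
    task: str                 # e.g. "summarization", "complex_reasoning"
    input_tokens: int
    risk: str = "low"         # "low" | "medium" | "high"
    strict_json: bool = False


def route_by_rules(req: Request) -> str:
    """Return "mini" or "large" by checking explicit rules in priority order."""
    # Risk is checked first so a cheap-looking task cannot bypass it.
    if req.risk == "high":
        return "large"
    if req.task == "complex_reasoning":
        return "large"
    if req.task == "summarization" and req.input_tokens < 2000:
        return "mini"
    if req.strict_json:
        # Mini first; the caller is expected to validate and fall back.
        return "mini"
    # Assumed default: anything no rule matches goes to the stronger model.
    return "large"
```

Rule order matters here. A short summarization request marked high risk still goes to the large model because the risk rule is evaluated first.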

Advantages

Easy to understand
Easy to debug
Predictable
Good for early systems

Disadvantages

Rigid
Requires maintenance
May miss subtle complexity

👉 Interview Answer

Rule-based routing is simple and explainable.

It works well when task types are clear.

But it can become brittle as use cases grow, because request complexity is not always easy to capture with static rules.

7️⃣ Classifier-based Routing

What Is Classifier-based Routing?

A lightweight model classifies the request first.

User Request
→ Small classifier model
→ Task type + complexity + risk
→ Route to selected model

Classifier Output

{
  "task_type": "code_debugging",
  "complexity": "high",
  "risk": "medium",
  "recommended_model": "large"
}
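A classifier's output should not be trusted blindly, since the classifier is itself a small model. The sketch below parses output shaped like the JSON above and picks a model; the function name `model_from_classifier` and the "fail safe to large" behaviour are assumptions for illustration.

```python
import json

ALLOWED_MODELS = {"mini", "large"}


def model_from_classifier(raw: str) -> str:
    """Parse the classifier's JSON and pick a model, failing safe to "large"."""
    try:
        out = json.loads(raw)
    except json.JSONDecodeError:
        return "large"  # unreadable classifier output: do not guess cheap
    if not isinstance(out, dict):
        return "large"
    # High complexity or high risk overrides whatever was recommended.
    if out.get("complexity") == "high" or out.get("risk") == "high":
        return "large"
    recommended = out.get("recommended_model")
    return recommended if recommended in ALLOWED_MODELS else "large"
```

Defaulting to the large model on malformed output trades some cost for protection against under-routing.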

Why Useful

It handles more flexible inputs than static rules.

👉 Interview Answer

Classifier-based routing uses a small model or classifier to predict task type, complexity, and risk.

This allows more flexible routing than hard-coded rules, while keeping routing cost low.

8️⃣ Confidence-based Escalation

Key Pattern

Start with a cheaper model.

Escalate only when needed.

Flow

Mini model
→ Generate answer
→ Validate confidence / format / quality
→ If pass, return
→ If fail, escalate to large model
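The flow above can be sketched in a few lines. The `Model` type is a stand-in for a real model client (any function from prompt to text), and `is_valid_json_object` is just one example validator; both are assumptions made for this sketch.

```python
import json
from typing import Callable, Tuple

Model = Callable[[str], str]  # prompt -> raw text; stands in for a real client


def is_valid_json_object(text: str) -> bool:
    """One example validator: the output must parse as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def answer_with_escalation(prompt: str, mini: Model, large: Model,
                           validate: Callable[[str], bool]) -> Tuple[str, str]:
    """Try the mini model first; escalate once if validation fails."""
    draft = mini(prompt)
    if validate(draft):
        return draft, "mini"
    # Single escalation step; the large model's output is returned as-is here.
    return large(prompt), "large"
```

Returning the model name alongside the answer makes it easy to log which requests escalated. A fuller version would probably validate the large model's output as well.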

When to Escalate

Low confidence
Invalid JSON
Failed safety check
Contradictory answer
User asks for high accuracy
Task is more complex than expected

👉 Interview Answer

A common production pattern is confidence-based escalation.

The system tries a smaller model first, validates the output, and escalates to a larger model only if the result is low-confidence, invalid, unsafe, or incomplete.

9️⃣ Fallback Strategy

Why Fallbacks Matter

Model calls can fail.

Failures include:

Timeout
Overload
Rate limit
Invalid output
Safety block
Low confidence
Model unavailable

Fallback Options

Large model unavailable
→ Use another large model

Mini model fails validation
→ Retry with large model

All models fail
→ Return graceful error
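One way to express these fallback options is an ordered chain that is walked at most once. In this sketch `ModelError` is a hypothetical exception standing in for timeouts, rate limits, and outages, and the returned dict shape is made up for illustration.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[str], str]


class ModelError(Exception):
    """Hypothetical error type for timeouts, rate limits, or outages."""


def call_with_fallback(prompt: str,
                       chain: Sequence[Tuple[str, Model]],
                       validate: Callable[[str], bool]) -> dict:
    """Try each model in order, at most once, then fail gracefully."""
    errors = []
    for name, model in chain:  # a finite, ordered chain cannot loop
        try:
            output = model(prompt)
        except ModelError as exc:
            errors.append(f"{name}: {exc}")
            continue
        if validate(output):
            return {"ok": True, "model": name, "output": output}
        errors.append(f"{name}: failed validation")
    return {"ok": False, "model": None, "output": None, "errors": errors}
```

Because each model appears once in the chain and is never revisited, this structure avoids fallback loops by construction.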

👉 Interview Answer

Model routing systems need fallback strategies.

If a model fails, times out, or produces invalid output, the system can retry, route to another model, escalate to a stronger model, or return a graceful failure.

🔟 Cost-aware Routing

Why Cost Matters

Large models are expensive.

Using them for every request wastes money.

Cost-aware Policy

Low-value simple request
→ Mini model

High-value enterprise request
→ Large model

Batch job
→ Cheaper model or async processing

Cost Controls

Model budgets
User tier limits
Token limits
Fallback thresholds
Batch processing
Cache before model call
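To see when a mini-first policy actually saves money, a back-of-the-envelope model helps. The sketch assumes every request pays for one mini call and that escalated requests additionally pay for one large call. Costs are in arbitrary units, and the function names are hypothetical.

```python
def expected_cost_mini_first(mini_cost: float, large_cost: float,
                             escalation_rate: float) -> float:
    """Average cost per request when every request tries the mini model
    and a fraction `escalation_rate` is re-run on the large model."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return mini_cost + escalation_rate * large_cost


def break_even_escalation_rate(mini_cost: float, large_cost: float) -> float:
    """Escalation rate above which mini-first costs more than always-large."""
    if large_cost <= 0:
        raise ValueError("large_cost must be positive")
    return 1.0 - mini_cost / large_cost
```

For example, with a mini call costing 2 units and a large call costing 8, a 25% escalation rate gives an average of 4 units per request instead of 8. Under the same assumptions, mini-first stops paying off once more than 75% of requests escalate.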

👉 Interview Answer

Cost-aware routing chooses models based on task value, user tier, complexity, and budget.

The system should avoid using large models for simple tasks that smaller models can solve reliably.

1️⃣1️⃣ Latency-aware Routing

Why Latency Matters

Some use cases need fast responses.

Examples:

Chat UI
Autocomplete
Real-time agent
Customer support bot

Latency Strategy

Low-latency request
→ Mini model

Complex background task
→ Large model

Timeout approaching
→ Fall back to faster model
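The latency strategy above can be sketched as a deadline check. The function name and the idea of using p95 latency as the budget estimate are assumptions for this example; any latency numbers passed in would come from your own measurements.

```python
from typing import Dict, List, Optional


def pick_model_for_deadline(remaining_ms: float,
                            p95_latency_ms: Dict[str, float],
                            preference: List[str]) -> Optional[str]:
    """Return the first preferred model whose p95 latency fits the budget."""
    for name in preference:  # ordered strongest-first
        if p95_latency_ms[name] <= remaining_ms:
            return name
    return None  # nothing fits: the caller should degrade gracefully
```

Listing models strongest-first means the router keeps quality when there is time and only drops to the faster model as the deadline approaches.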

Trade-off

Faster models are usually smaller, so quality can drop on harder requests.

👉 Interview Answer

Latency-aware routing sends time-sensitive requests to faster models, while allowing slower large models for complex or asynchronous tasks.

This helps balance user experience against output quality.

1️⃣2️⃣ Risk-aware Routing

Why Risk Matters

High-risk tasks need stronger models and stricter controls.

High-risk Examples

Financial advice
Legal interpretation
Medical information
Production changes
Security decisions
Customer-facing actions

Risk-aware Flow

Risk classifier
→ If low risk: mini model
→ If high risk: large model + validation + human review
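A risk-aware route is more than a model choice, since it also decides which controls wrap the call. In the sketch below, the category names mirror the high-risk examples above, while `RoutePlan` and `plan_for_category` are hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical category labels based on the high-risk examples above.
HIGH_RISK_CATEGORIES = {"financial", "legal", "medical",
                        "production_change", "security", "customer_action"}


@dataclass(frozen=True)
class RoutePlan:
    model: str
    extra_validation: bool
    human_review: bool


def plan_for_category(category: str) -> RoutePlan:
    """Map a risk category to a model plus the controls around it."""
    if category in HIGH_RISK_CATEGORIES:
        return RoutePlan(model="large", extra_validation=True, human_review=True)
    return RoutePlan(model="mini", extra_validation=False, human_review=False)
```

In practice the category would come from a risk classifier rather than a fixed label, and a misclassified category is exactly the failure this plan cannot catch on its own.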

👉 Interview Answer

Risk-aware routing routes high-impact or sensitive tasks to stronger models, often with extra validation or human review.

Low-risk tasks can be handled by cheaper models.

1️⃣3️⃣ Multi-stage Model Routing

Multi-stage Pattern

Some systems use multiple models in sequence.

Mini model → classify task
Mini model → draft response
Large model → verify or improve
Validator → final check

Example

Mini model extracts structured fields.
Large model handles ambiguous reasoning.
Mini model formats final JSON.
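The extract, reason, format example can be sketched as a three-stage pipeline. `Model` is again a stand-in for a real client, and the prompt strings are placeholders, not recommended prompts.

```python
from typing import Callable

Model = Callable[[str], str]  # prompt -> text; a stand-in for a real client


def run_multi_stage(text: str, mini: Model, large: Model) -> dict:
    """Run extract -> reason -> format, recording which model ran each stage."""
    trace = []

    fields = mini(f"Extract structured fields:\n{text}")
    trace.append(("extract", "mini"))

    analysis = large(f"Resolve ambiguities given these fields:\n{fields}")
    trace.append(("reason", "large"))

    final = mini(f"Format as JSON:\n{analysis}")
    trace.append(("format", "mini"))

    return {"output": final, "trace": trace}
```

Keeping a per-stage trace makes it possible to attribute cost and errors to a specific stage rather than to the pipeline as a whole.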

Why Useful

Each model does what it is good at.

👉 Interview Answer

Multi-stage routing uses different models for different stages.

A small model may classify, extract, or format, while a larger model handles reasoning or verification.

This can reduce cost while preserving quality.

1️⃣4️⃣ Evaluation and Metrics

What to Measure

Accuracy
Latency
Cost per request
Escalation rate
Fallback rate
Invalid output rate
User satisfaction
Model-specific error rate
Quality difference by task type

Important Metric

Cost saved without quality loss
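Several of these metrics can be computed directly from routing logs. The sketch below assumes a hypothetical `RoutingRecord` log schema in which each request carries a correctness label from an offline eval or human review.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RoutingRecord:
    task_type: str
    final_model: str   # model that produced the returned answer
    escalated: bool    # True if a cheaper model was tried first and rejected
    cost: float        # total cost of all calls for this request
    correct: bool      # label from an offline eval or human review


def summarize(records: List[RoutingRecord]) -> dict:
    """Aggregate escalation rate, cost per request, and accuracy by task type."""
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)
    by_task: dict = {}
    for r in records:
        hits, total = by_task.get(r.task_type, (0, 0))
        by_task[r.task_type] = (hits + int(r.correct), total + 1)
    return {
        "escalation_rate": sum(r.escalated for r in records) / n,
        "cost_per_request": sum(r.cost for r in records) / n,
        "accuracy_by_task": {t: h / c for t, (h, c) in by_task.items()},
    }
```

Breaking accuracy out by task type is what exposes silent quality loss: an overall average can look healthy while one task type that was moved to the mini model degrades.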

Why Evaluation Matters

Bad routing can silently reduce quality.

👉 Interview Answer

Model routing must be evaluated continuously.

I would track accuracy, latency, cost, escalation rate, fallback rate, invalid output rate, and quality by task type.

The goal is to reduce cost without hurting quality.

1️⃣5️⃣ Common Failure Modes

Failure Modes

Model routing can fail because:

Router misclassifies task
Mini model used for hard problem
Large model overused
Escalation threshold miscalibrated (escalates too rarely or too often)
Fallback loops
Cost policy overrides quality
Risk classification is wrong
Evaluation data is weak

Example

Router labels complex legal question as simple Q&A
→ Mini model answers confidently
→ User gets incorrect answer

👉 Interview Answer

The biggest risk in model routing is under-routing: sending a hard or risky task to a weak model.

Production systems need validation, escalation, risk detection, and continuous evaluation.

1️⃣6️⃣ Best Practices

Practical Rules

Start with rule-based routing
Add classifier-based routing over time
Use mini models for simple tasks
Use large models for complex or risky tasks
Validate mini model outputs
Escalate on low confidence
Avoid fallback loops
Track routing decisions
Evaluate quality by task type
Optimize for cost saved without quality loss

Design Principle

Use the cheapest model that can reliably solve the task.

👉 Interview Answer

The best model routing systems use the cheapest model that can reliably solve the task.

They combine rules, classifiers, validation, fallback, risk detection, and continuous evaluation.

🧠 Final Staff-Level Answer

👉 Interview Answer (Full Version)

Model routing is the process of choosing the right model for each request.

The main goal is to balance quality, latency, and cost.

Not every request needs the largest model.

Simple classification, extraction, formatting, and short summarization can often be handled by mini models.

Complex reasoning, long-context analysis, ambiguous questions, code debugging, or high-risk decisions should use larger models.

A production model routing system usually includes a request router, task classifier, policy engine, model selector, inference service, output validator, fallback logic, and evaluation pipeline.

Routing decisions can be rule-based, classifier-based, or multi-stage.

Rule-based routing is predictable and easy to debug.

Classifier-based routing is more flexible because it can infer task type, complexity, and risk.

A common pattern is confidence-based escalation: try a cheaper mini model first, validate the output, and escalate to a larger model if the result is invalid, low-confidence, unsafe, or incomplete.

The system should also be cost-aware, latency-aware, and risk-aware.

For low-latency workloads, route to faster models.

For high-risk or correctness-critical tasks, route to stronger models and add validation or human review.

The biggest failure mode is under-routing: sending a complex or risky task to a weak model.

The opposite failure is over-routing: sending everything to the largest model and wasting cost.

Model routing must be evaluated continuously using accuracy, latency, cost per request, escalation rate, fallback rate, invalid output rate, and quality by task type.

The core principle is: use the cheapest model that can reliably solve the task.

⭐ Final Insight

The core of model routing is not:

"Large models are smarter, so use large models for everything."

Real production routing is:

Task Classification

Complexity Detection

Risk Detection

Cost Policy

Latency Policy

Output Validation

Fallback / Escalation

Mini models handle the cheap and fast work.

Large models handle the hard and risky work.

The most important sentence:

Use the cheapest model that can reliably solve the task.
