🎯 Model Routing Systems: GPT-4 vs Mini Models
1️⃣ Core Framework
When discussing model routing systems, I frame the discussion around:
- Why model routing is needed
- Large models vs mini models
- Routing signals
- Cost and latency trade-offs
- Quality and fallback strategy
- Confidence-based escalation
- Multi-model orchestration
- Trade-offs: quality vs cost vs speed
2️⃣ Why Model Routing Is Needed
Not every request needs the largest model.
Some tasks are simple.
Some tasks are complex.
A good system routes each request to the right model.
Basic Idea
User Request
→ Classify task complexity
→ Choose model
→ Run inference
→ Validate output
→ Escalate if needed
👉 Interview Answer
Model routing is the process of selecting the right model for each request.
The goal is to balance quality, latency, and cost.
Simple tasks can use smaller models, while complex reasoning, high-risk, or ambiguous tasks should use larger models.
3️⃣ Large Models vs Mini Models
Large Models
Large models are stronger at:
- Complex reasoning
- Ambiguous tasks
- Long context
- Multi-step planning
- Code generation
- Hard instruction following
- High-risk decisions
Mini Models
Mini models are better for:
- Simple classification
- Short summarization
- Data extraction
- Routing decisions
- Formatting
- Lightweight Q&A
- Cheap high-volume workloads
Comparison
| Dimension | Large Model | Mini Model |
|---|---|---|
| Quality | Higher | Lower |
| Cost | Higher | Lower |
| Latency | Higher | Lower |
| Reasoning | Stronger | Weaker |
| Throughput | Lower | Higher |
| Best for | Complex tasks | Simple tasks |
👉 Interview Answer
Large models provide better reasoning and instruction following, but they are slower and more expensive.
Mini models are cheaper and faster, but weaker on complex or ambiguous tasks.
Model routing lets the system use each model where it fits best.
4️⃣ High-Level Architecture
Architecture
User Request
→ API Gateway
→ Request Validator
→ Router / Classifier
→ Model Policy Engine
→ Model Selection
→ Inference Service
→ Output Validator
→ Fallback / Escalation
→ Final Response
Core Components
Router
Classifies request type and complexity.
Policy Engine
Applies business rules, cost limits, risk rules, and model permissions.
Inference Service
Runs the selected model.
Output Validator
Checks correctness, format, safety, and confidence.
👉 Interview Answer
A model routing system usually includes a router, policy engine, inference service, output validator, and fallback logic.
The router chooses an initial model, and the system can escalate to a stronger model when needed.
5️⃣ Routing Signals
What Signals Should Be Used?
A router may consider:
- Task type
- Prompt length
- User tier
- Required latency
- Required quality
- Risk level
- Tool usage
- Output format
- Domain complexity
- Previous failure history
Example
Short JSON extraction
→ Mini model
Legal policy interpretation
→ Large model
Complex debugging question
→ Large model
👉 Interview Answer
Model routing should use signals such as task type, prompt length, risk level, latency requirement, user tier, output format, and expected complexity.
The goal is to choose the cheapest model that can reliably solve the task.
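The signals above can be collected into one record and turned into a routing decision. The following is a minimal sketch, not a definitive design: the field names, the `HARD_TASKS` set, the point weights, and the threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RoutingSignals:
    """Hypothetical container for the signals a router might inspect."""
    task_type: str
    prompt_tokens: int
    risk_level: str = "low"       # "low" | "medium" | "high"
    needs_tools: bool = False
    previous_failures: int = 0    # past mini-model failures on similar requests


# Illustrative set of task types treated as inherently hard (an assumption).
HARD_TASKS = ("legal_interpretation", "code_debugging", "complex_reasoning")


def complexity_score(s: RoutingSignals) -> int:
    """Add up hand-picked points; a higher score suggests a harder request."""
    score = 0
    if s.task_type in HARD_TASKS:
        score += 3
    if s.prompt_tokens > 2000:
        score += 1
    if s.risk_level == "high":
        score += 3
    if s.needs_tools:
        score += 1
    score += min(s.previous_failures, 2)  # cap the influence of failure history
    return score


def choose_model(s: RoutingSignals, threshold: int = 3) -> str:
    """Return "large" when the score reaches the threshold, else "mini"."""
    return "large" if complexity_score(s) >= threshold else "mini"
```

With these assumed weights, `choose_model(RoutingSignals("json_extraction", 300))` picks the mini model, while a `legal_interpretation` or `code_debugging` request goes to the large model, matching the examples above.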
6️⃣ Rule-based Routing
What Is Rule-based Routing?
Rule-based routing uses explicit rules.
Example Rules
If task = summarization and input < 2,000 tokens
→ Use mini model
If task = complex reasoning
→ Use large model
If risk = high
→ Use large model
If output must be strict JSON
→ Use mini model first, validate, then fall back to large model if validation fails
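The example rules can be written as a small function. This is a sketch under assumptions: the rule ordering (risk checked first), the returned dictionary shape, and the conservative default for unmatched requests are illustrative choices, not part of the rules themselves.

```python
def route_by_rules(task: str, input_tokens: int, risk: str = "low",
                   strict_json: bool = False) -> dict:
    """Apply the example rules in priority order; the first match wins."""
    if risk == "high":
        return {"model": "large", "validate": True, "rule": "high_risk"}
    if task == "complex_reasoning":
        return {"model": "large", "validate": False, "rule": "complex_reasoning"}
    if strict_json:
        # Mini first; the caller validates and falls back to the large model.
        return {"model": "mini", "validate": True, "rule": "strict_json"}
    if task == "summarization" and input_tokens < 2000:
        return {"model": "mini", "validate": False, "rule": "short_summary"}
    # Assumed default: anything the rules do not recognize goes to the large model.
    return {"model": "large", "validate": False, "rule": "default"}
```

Returning the name of the matched rule alongside the model keeps each decision easy to log and debug, which is the main appeal of rule-based routing.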
Advantages
- Easy to understand
- Easy to debug
- Predictable
- Good for early systems
Disadvantages
- Rigid
- Requires maintenance
- May miss subtle complexity
👉 Interview Answer
Rule-based routing is simple and explainable.
It works well when task types are clear.
But it can become brittle as use cases grow, because request complexity is not always easy to capture with static rules.
7️⃣ Classifier-based Routing
What Is Classifier-based Routing?
A lightweight model classifies the request first.
User Request
→ Small classifier model
→ Task type + complexity + risk
→ Route to selected model
Classifier Output
{
  "task_type": "code_debugging",
  "complexity": "high",
  "risk": "medium",
  "recommended_model": "large"
}
Why Useful
It handles a wider variety of inputs than static rules can.
👉 Interview Answer
Classifier-based routing uses a small model or classifier to predict task type, complexity, and risk.
This allows more flexible routing than hard-coded rules, while keeping routing cost low.
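Because the classifier is itself a model, its output should be parsed defensively. The sketch below assumes the classifier returns the JSON shape shown above; the allowed values, the safe default, and the override for high-risk requests are assumptions added for illustration.

```python
import json

ALLOWED_MODELS = ("mini", "large")
ALLOWED_LEVELS = ("low", "medium", "high")

# Assumed safe default: when the classifier output is unusable, treat the request as hard.
SAFE_DEFAULT = {"task_type": "unknown", "complexity": "high",
                "risk": "high", "recommended_model": "large"}


def parse_classifier_output(raw: str) -> dict:
    """Parse the classifier's JSON; return the safe default if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return dict(SAFE_DEFAULT)
    if not isinstance(data, dict):
        return dict(SAFE_DEFAULT)
    if data.get("recommended_model") not in ALLOWED_MODELS:
        return dict(SAFE_DEFAULT)
    if data.get("complexity") not in ALLOWED_LEVELS or data.get("risk") not in ALLOWED_LEVELS:
        return dict(SAFE_DEFAULT)
    return data


def select_model(raw_classifier_output: str) -> str:
    """Follow the recommendation, but never send a high-risk request to the mini model."""
    decision = parse_classifier_output(raw_classifier_output)
    if decision["risk"] == "high":
        return "large"
    return decision["recommended_model"]
```

Failing toward the large model trades some cost for safety: a broken classifier then causes over-routing rather than under-routing.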
8️⃣ Confidence-based Escalation
Key Pattern
Start with a cheaper model.
Escalate only when needed.
Flow
Mini model
→ Generate answer
→ Validate confidence / format / quality
→ If pass, return
→ If fail, escalate to large model
When to Escalate
- Low confidence
- Invalid JSON
- Failed safety check
- Contradictory answer
- User asks for high accuracy
- Task is more complex than expected
👉 Interview Answer
A common production pattern is confidence-based escalation.
The system tries a smaller model first, validates the output, and escalates to a larger model only if the result is low-confidence, invalid, unsafe, or incomplete.
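The escalation flow can be sketched as follows. It assumes each model call returns a text plus a confidence score in [0, 1]; how that score is obtained (for example from token log-probabilities or a separate self-check) is left open, and the stub models exist only to make the example runnable.

```python
import json
from typing import Callable, Tuple

# Assumed model interface: prompt in, (text, confidence) out.
Model = Callable[[str], Tuple[str, float]]


def is_valid_json(text: str) -> bool:
    """Format check used as the validator in this sketch."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def answer_with_escalation(prompt: str, mini: Model, large: Model,
                           min_confidence: float = 0.8) -> dict:
    """Try the mini model first; escalate once if confidence or format fails."""
    text, confidence = mini(prompt)
    if confidence >= min_confidence and is_valid_json(text):
        return {"text": text, "model": "mini", "escalated": False}
    text, _ = large(prompt)
    return {"text": text, "model": "large", "escalated": True}


# Stub models for illustration only.
def stub_mini(prompt: str) -> Tuple[str, float]:
    if len(prompt) < 50:
        return '{"answer": "42"}', 0.9
    return "not sure", 0.3


def stub_large(prompt: str) -> Tuple[str, float]:
    return '{"answer": "detailed"}', 0.95
```

An escalated request pays for both calls, so this pattern only saves money while the escalation rate stays well below 100%.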
9️⃣ Fallback Strategy
Why Fallbacks Matter
Model calls can fail.
Failures include:
- Timeout
- Overload
- Rate limit
- Invalid output
- Safety block
- Low confidence
- Model unavailable
Fallback Options
Large model unavailable
→ Use another large model
Mini model fails validation
→ Retry with large model
All models fail
→ Return graceful error
👉 Interview Answer
Model routing systems need fallback strategies.
If a model fails, times out, or produces invalid output, the system can retry, route to another model, escalate to a stronger model, or return a graceful failure.
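A fallback chain can be sketched as an ordered list of models that is walked exactly once. The `ModelError` type, the chain format, and the stub models are hypothetical; a real system would map its client library's timeout, rate-limit, and availability errors onto something similar.

```python
from typing import Callable, Sequence, Tuple


class ModelError(Exception):
    """Hypothetical error standing in for timeouts, rate limits, and outages."""


def call_with_fallback(prompt: str,
                       chain: Sequence[Tuple[str, Callable[[str], str]]],
                       is_valid: Callable[[str], bool]) -> dict:
    """Try each model in order exactly once, so the chain cannot loop."""
    errors = []
    for name, model in chain:
        try:
            output = model(prompt)
        except ModelError as exc:
            errors.append(f"{name}: {exc}")
            continue
        if is_valid(output):
            return {"ok": True, "model": name, "output": output, "errors": errors}
        errors.append(f"{name}: invalid output")
    # All models failed: return a graceful error instead of raising.
    return {"ok": False, "model": None, "output": None, "errors": errors}


# Stub models for illustration only.
def flaky_mini(prompt: str) -> str:
    raise ModelError("timeout")


def bad_large(prompt: str) -> str:
    return ""


def backup_large(prompt: str) -> str:
    return f"answer to: {prompt}"
```

Because the chain is a finite list that is iterated once, the number of model calls per request is bounded by the chain length.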
🔟 Cost-aware Routing
Why Cost Matters
Large models are expensive.
Using them for every request wastes money.
Cost-aware Policy
Low-value simple request
→ Mini model
High-value enterprise request
→ Large model
Batch job
→ Cheaper model or async processing
Cost Controls
- Model budgets
- User tier limits
- Token limits
- Fallback thresholds
- Batch processing
- Cache before model call
👉 Interview Answer
Cost-aware routing chooses models based on task value, user tier, complexity, and budget.
The system should avoid using large models for simple tasks that smaller models can solve reliably.
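The cost argument can be made concrete with a little arithmetic. The prices below are hypothetical values picked only to make the numbers easy to follow; they are not real vendor prices.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost of one call given per-1,000-token prices."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k


def expected_cost_with_escalation(mini_cost: float, large_cost: float,
                                  escalation_rate: float) -> float:
    """Every request pays for the mini call; escalated ones also pay for the large call."""
    return mini_cost + escalation_rate * large_cost


# Hypothetical prices in dollars per 1,000 tokens (assumptions, not real pricing).
MINI = {"in": 0.0002, "out": 0.0008}
LARGE = {"in": 0.0050, "out": 0.0150}

mini_cost = request_cost(1000, 500, MINI["in"], MINI["out"])
large_cost = request_cost(1000, 500, LARGE["in"], LARGE["out"])
blended = expected_cost_with_escalation(mini_cost, large_cost, escalation_rate=0.2)

# Fraction saved compared with sending every request to the large model.
savings = 1 - blended / large_cost

# Mini-first stops paying off once the escalation rate exceeds this value.
break_even_rate = 1 - mini_cost / large_cost
```

Under these assumed prices and a 20% escalation rate, mini-first routing costs roughly a quarter of always using the large model. The break-even formula shows why the escalation rate is worth tracking: as it approaches `1 - mini_cost / large_cost`, the savings disappear.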
1️⃣1️⃣ Latency-aware Routing
Why Latency Matters
Some use cases need fast responses.
Examples:
- Chat UI
- Autocomplete
- Real-time agent
- Customer support bot
Latency Strategy
Low-latency request
→ Mini model
Complex background task
→ Large model
Timeout approaching
→ Fallback to faster model
Trade-off
Faster models are usually smaller, so routing for speed can reduce quality on harder requests.
👉 Interview Answer
Latency-aware routing sends time-sensitive requests to faster models, while allowing slower large models for complex or asynchronous tasks.
This helps balance user experience against output quality.
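One way to express the latency strategy is to pick the strongest model whose expected latency still fits the remaining time budget. The p95 latency figures below are hypothetical placeholders; in practice they would come from measured latency per model.

```python
def choose_by_deadline(remaining_ms: float, complexity: str,
                       mini_p95_ms: float = 400.0,
                       large_p95_ms: float = 3000.0) -> str:
    """Pick the strongest model whose assumed p95 latency fits the remaining budget."""
    if complexity == "high" and remaining_ms >= large_p95_ms:
        return "large"
    if remaining_ms >= mini_p95_ms:
        # Either the task is simple, or the deadline is too close for the large model.
        return "mini"
    # Not even the mini model fits: the caller should degrade gracefully or go async.
    return "none"
```

For example, a complex request with 5,000 ms left goes to the large model, but the same request with only 1,000 ms left falls back to the mini model.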
1️⃣2️⃣ Risk-aware Routing
Why Risk Matters
High-risk tasks need stronger models and stricter controls.
High-risk Examples
- Financial advice
- Legal interpretation
- Medical information
- Production changes
- Security decisions
- Customer-facing actions
Risk-aware Flow
Risk classifier
→ If low risk: mini model
→ If high risk: large model + validation + human review
👉 Interview Answer
Risk-aware routing routes high-impact or sensitive tasks to stronger models, often with extra validation or human review.
Low-risk tasks can be handled by cheaper models.
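The risk-aware flow can be sketched with a deliberately crude keyword matcher standing in for the risk classifier. The keyword lists and category names are assumptions for illustration; substring matching produces false positives, and a real system would more likely use a trained classifier or a model call.

```python
from typing import Optional, Tuple

# Hypothetical keyword lists; a placeholder for a real risk classifier.
HIGH_RISK_KEYWORDS = {
    "financial": ("invest", "loan", "tax"),
    "legal": ("contract", "lawsuit", "liability"),
    "medical": ("diagnosis", "dosage", "symptom"),
    "production": ("deploy", "rollback", "drop table"),
}


def classify_risk(prompt: str) -> Tuple[str, Optional[str]]:
    """Return ("high", category) on the first keyword hit, else ("low", None)."""
    text = prompt.lower()
    for category, keywords in HIGH_RISK_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return "high", category
    return "low", None


def risk_aware_plan(prompt: str) -> dict:
    """High risk: large model plus validation and human review. Low risk: mini model."""
    risk, category = classify_risk(prompt)
    if risk == "high":
        return {"model": "large", "validate": True,
                "human_review": True, "category": category}
    return {"model": "mini", "validate": False,
            "human_review": False, "category": None}
```

The plan returns controls as well as a model name, because for high-risk requests the extra validation and review matter as much as the choice of model.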
1️⃣3️⃣ Multi-stage Model Routing
Multi-stage Pattern
Some systems use multiple models in sequence.
Mini model → classify task
Mini model → draft response
Large model → verify or improve
Validator → final check
Example
Mini model extracts structured fields.
Large model handles ambiguous reasoning.
Mini model formats final JSON.
Why Useful
Each model does what it is good at.
👉 Interview Answer
Multi-stage routing uses different models for different stages.
A small model may classify, extract, or format, while a larger model handles reasoning or verification.
This can reduce cost while preserving quality.
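The extract, reason, and format example can be sketched as a pipeline of three callables. The stage functions below are stubs standing in for real model calls, and the ticket format (`customer: issue`) is an assumption made up for the example.

```python
import json
from typing import Callable


def run_multi_stage(ticket: str,
                    extract: Callable[[str], dict],
                    reason: Callable[[dict], str],
                    fmt: Callable[[dict, str], str]) -> str:
    """Stage 1 (mini) extracts, stage 2 (large) reasons, stage 3 (mini) formats."""
    fields = extract(ticket)
    decision = reason(fields)
    output = fmt(fields, decision)
    json.loads(output)  # final validator: raises ValueError if the JSON is malformed
    return output


# Stub stages for illustration only.
def mini_extract(ticket: str) -> dict:
    customer, _, issue = ticket.partition(":")
    return {"customer": customer.strip(), "issue": issue.strip()}


def large_reason(fields: dict) -> str:
    return "escalate" if "refund" in fields["issue"].lower() else "auto_reply"


def mini_format(fields: dict, decision: str) -> str:
    return json.dumps({**fields, "decision": decision}, sort_keys=True)
```

Passing the stages in as callables keeps the pipeline independent of any particular model, so a stage can be moved to a cheaper or stronger model without touching the orchestration.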
1️⃣4️⃣ Evaluation and Metrics
What to Measure
- Accuracy
- Latency
- Cost per request
- Escalation rate
- Fallback rate
- Invalid output rate
- User satisfaction
- Model-specific error rate
- Quality difference by task type
Important Metric
Cost saved without quality loss
Why Evaluation Matters
Bad routing can silently reduce quality.
👉 Interview Answer
Model routing must be evaluated continuously.
I would track accuracy, latency, cost, escalation rate, fallback rate, invalid output rate, and quality by task type.
The goal is to reduce cost without hurting quality.
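Most of these metrics fall out of a per-request routing log. A minimal sketch, assuming a hypothetical `RoutingRecord` schema in which correctness comes from an offline evaluation:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RoutingRecord:
    """One logged request; the field names are assumptions for illustration."""
    task_type: str
    first_model: str          # model the router picked first
    escalated: bool           # True if a stronger model was called afterwards
    first_output_valid: bool  # did the first model's output pass validation?
    correct: bool             # final answer judged correct by an offline eval
    cost: float               # total cost of the request, in any fixed unit
    latency_ms: float


def summarize(records: List[RoutingRecord]) -> Dict[str, float]:
    """Aggregate the headline routing metrics over a batch of records."""
    n = len(records)
    if n == 0:
        return {}
    return {
        "accuracy": sum(r.correct for r in records) / n,
        "escalation_rate": sum(r.escalated for r in records) / n,
        "invalid_output_rate": sum(not r.first_output_valid for r in records) / n,
        "cost_per_request": sum(r.cost for r in records) / n,
        "avg_latency_ms": sum(r.latency_ms for r in records) / n,
    }


def accuracy_by_task(records: List[RoutingRecord]) -> Dict[str, float]:
    """Break accuracy down by task type so a weak slice cannot hide in the average."""
    hits: Dict[str, List[bool]] = defaultdict(list)
    for r in records:
        hits[r.task_type].append(r.correct)
    return {task: sum(v) / len(v) for task, v in hits.items()}
```

The per-task breakdown is the part that catches silent quality loss: overall accuracy can look healthy while one task type routed to the mini model degrades.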
1️⃣5️⃣ Common Failure Modes
Failure Modes
Model routing can fail because:
- Router misclassifies task
- Mini model used for hard problem
- Large model overused
- Escalation threshold miscalibrated (escalates too rarely or too often)
- Fallback loops
- Cost policy overrides quality
- Risk classification is wrong
- Evaluation data is weak
Example
Router labels complex legal question as simple Q&A
→ Mini model answers confidently
→ User gets incorrect answer
👉 Interview Answer
The biggest risk in model routing is under-routing: sending a hard or risky task to a weak model.
Production systems need validation, escalation, risk detection, and continuous evaluation.
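Under-routing can be measured offline if an evaluation set records which model each example actually needs. The sketch below assumes such labels exist, for instance "the cheapest model that solved the example reliably in offline testing"; how the labels are produced is outside the scope of this example.

```python
from typing import Dict, Sequence, Tuple


def routing_error_rates(examples: Sequence[Tuple[str, str]]) -> Dict[str, float]:
    """Each example is a (routed_model, required_model) pair from a labeled eval set."""
    n = len(examples)
    if n == 0:
        return {"under_routing_rate": 0.0, "over_routing_rate": 0.0}
    # Under-routing: a hard example was sent to the mini model.
    under = sum(1 for routed, required in examples
                if routed == "mini" and required == "large")
    # Over-routing: an easy example was sent to the large model.
    over = sum(1 for routed, required in examples
               if routed == "large" and required == "mini")
    return {"under_routing_rate": under / n, "over_routing_rate": over / n}
```

The two rates pull in opposite directions: raising the router's bar for "simple" lowers under-routing at the price of more over-routing, so both should be reported together.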
1️⃣6️⃣ Best Practices
Practical Rules
- Start with rule-based routing
- Add classifier-based routing over time
- Use mini models for simple tasks
- Use large models for complex or risky tasks
- Validate mini model outputs
- Escalate on low confidence
- Avoid fallback loops
- Track routing decisions
- Evaluate quality by task type
- Optimize for cost saved without quality loss
Design Principle
Use the cheapest model that can reliably solve the task.
👉 Interview Answer
The best model routing systems use the cheapest model that can reliably solve the task.
They combine rules, classifiers, validation, fallback, risk detection, and continuous evaluation.
🧠 Final Staff-Level Answer
👉 Interview Answer (Full Version)
Model routing is the process of choosing the right model for each request.
The main goal is to balance quality, latency, and cost.
Not every request needs the largest model.
Simple classification, extraction, formatting, and short summarization can often be handled by mini models.
Complex reasoning, long-context analysis, ambiguous questions, code debugging, or high-risk decisions should use larger models.
A production model routing system usually includes a request router, task classifier, policy engine, model selector, inference service, output validator, fallback logic, and evaluation pipeline.
Routing decisions can be rule-based, classifier-based, or multi-stage.
Rule-based routing is predictable and easy to debug.
Classifier-based routing is more flexible because it can infer task type, complexity, and risk.
A common pattern is confidence-based escalation: try a cheaper mini model first, validate the output, and escalate to a larger model if the result is invalid, low-confidence, unsafe, or incomplete.
The system should also be cost-aware, latency-aware, and risk-aware.
For low-latency workloads, route to faster models.
For high-risk or correctness-critical tasks, route to stronger models and add validation or human review.
The biggest failure mode is under-routing: sending a complex or risky task to a weak model.
The opposite failure is over-routing: sending everything to the largest model and wasting cost.
Model routing must be evaluated continuously using accuracy, latency, cost per request, escalation rate, fallback rate, invalid output rate, and quality by task type.
The core principle is: use the cheapest model that can reliably solve the task.
⭐ Final Insight
The core of model routing is not:
“Large models are smarter, so use large models for everything.”
Real production routing combines:
- Task Classification
- Complexity Detection
- Risk Detection
- Cost Policy
- Latency Policy
- Output Validation
- Fallback / Escalation
Mini models handle the cheap and fast work.
Large models handle the hard and risky work.
The single most important sentence:
Use the cheapest model that can reliably solve the task.