🎯 Building Multi-Agent Systems: Architecture Explained
1️⃣ Core Framework
When discussing multi-agent systems, I frame the discussion around:
- Single agent vs multi-agent
- Agent roles and responsibilities
- Coordinator / supervisor design
- Communication patterns
- Shared memory and state
- Tool access and permissions
- Reliability and failure handling
- Trade-offs: specialization vs complexity
2️⃣ What Is a Multi-Agent System?
A multi-agent system uses multiple AI agents that work together to complete a larger task.
Each agent usually has a specialized role.
User Goal
→ Coordinator Agent
→ Specialist Agents
→ Tools / Memory / APIs
→ Final Result
Example
Research Agent
→ Coding Agent
→ Testing Agent
→ Review Agent
→ Reporting Agent
👉 Interview Answer
A multi-agent system is an architecture where multiple specialized AI agents collaborate on a task.
Instead of one agent doing everything, different agents handle planning, research, execution, validation, and reporting.
This improves specialization, but adds coordination complexity.
3️⃣ Why Use Multiple Agents?
Single Agent Limitation
A single agent may struggle with:
- Large tasks
- Many tools
- Long context
- Complex planning
- Parallel work
- Specialized reasoning
Multi-Agent Benefits
- Separation of concerns
- Better specialization
- Parallel execution
- Easier modularity
- Better validation
- More scalable workflows
Example
Incident Investigation
→ Metrics Agent checks dashboards
→ Logs Agent searches logs
→ Deploy Agent checks recent releases
→ Summary Agent generates final report
👉 Interview Answer
Multi-agent systems are useful when a task is too broad for one agent.
By splitting responsibilities across specialized agents, the system can improve modularity, parallelism, and validation quality.
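To make the parallelism benefit concrete, here is a minimal sketch of the incident investigation example. The three specialist functions are hypothetical stand-ins for LLM-backed agents (their names and return strings are assumptions, not part of any framework); the point is only that independent specialists can run concurrently before a summary step merges their findings.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical specialists; each stands in for an LLM-backed agent.
def metrics_agent(incident: str) -> str:
    return f"metrics: error rate spike during {incident}"

def logs_agent(incident: str) -> str:
    return f"logs: timeout errors found for {incident}"

def deploy_agent(incident: str) -> str:
    return f"deploys: one release shortly before {incident}"

def summary_agent(findings: list[str]) -> str:
    return "Incident report\n" + "\n".join(f"- {f}" for f in findings)

def investigate(incident: str) -> str:
    specialists = [metrics_agent, logs_agent, deploy_agent]
    with ThreadPoolExecutor(max_workers=len(specialists)) as pool:
        # map() returns results in the same order as `specialists`.
        findings = list(pool.map(lambda agent: agent(incident), specialists))
    return summary_agent(findings)
```

Threads fit here because real agents spend most of their time waiting on model and tool I/O; the summary step stays sequential because it depends on every finding.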
4️⃣ Core Architecture
High-Level Architecture
User
→ API Layer
→ Coordinator Agent
→ Task Router
→ Specialist Agents
→ Tools / APIs / Memory
→ Validator Agent
→ Final Response
Core Components
Coordinator Agent
Responsible for:
- Understanding the goal
- Breaking down the task
- Assigning work
- Tracking progress
- Combining results
Specialist Agents
Each agent handles one domain.
Examples:
- Research Agent
- Code Agent
- Data Agent
- Validation Agent
- Reporting Agent
Shared Memory
Stores:
- Task state
- Intermediate results
- Agent outputs
- Final artifacts
Tool Layer
Provides access to:
- Databases
- Search APIs
- GitHub
- Logs
- Metrics
- Documents
- Internal services
👉 Interview Answer
A typical multi-agent architecture has a coordinator agent, multiple specialist agents, shared memory, tool access, and validation.
The coordinator manages the workflow, while specialist agents perform focused tasks.
5️⃣ Coordinator Pattern
What Is the Coordinator Pattern?
One main agent controls the workflow and delegates tasks.
Coordinator
→ Research Agent
→ Code Agent
→ Test Agent
→ Summary Agent
Advantages
- Centralized control
- Easier orchestration
- Clear ownership
- Better safety enforcement
Disadvantages
- Coordinator can become a bottleneck
- Less parallelism if poorly designed
- Coordinator failure affects whole workflow
👉 Interview Answer
The coordinator pattern uses one central agent to manage task planning and delegation.
It is easier to control and debug, but the coordinator can become a bottleneck or single point of failure.
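A minimal sketch of the coordinator pattern, assuming each specialist is just a callable that takes a task string. The `Coordinator` class and its fixed `plan` method are illustrative assumptions; a real coordinator would typically ask an LLM to decompose the goal.

```python
from typing import Callable

Specialist = Callable[[str], str]

class Coordinator:
    def __init__(self, specialists: dict[str, Specialist]) -> None:
        self.specialists = specialists
        self.progress: dict[str, str] = {}

    def plan(self, goal: str) -> list[tuple[str, str]]:
        # Assumption: one task per specialist, in registration order.
        return [(role, f"{role} work for: {goal}") for role in self.specialists]

    def run(self, goal: str) -> str:
        results = []
        for role, task in self.plan(goal):
            self.progress[role] = "running"   # track progress centrally
            results.append(self.specialists[role](task))
            self.progress[role] = "done"
        return " | ".join(results)            # combine results

coordinator = Coordinator({
    "research": lambda task: f"[research] notes on '{task}'",
    "code": lambda task: f"[code] patch for '{task}'",
    "test": lambda task: f"[test] checks for '{task}'",
})
```

Calling `coordinator.run("fix login bug")` walks the plan in order. Because every delegation passes through `run`, this single method is also where the bottleneck and single-point-of-failure risks live.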
6️⃣ Peer-to-Peer Pattern
What Is Peer-to-Peer?
Agents communicate directly with each other.
Research Agent ↔ Code Agent ↔ Test Agent ↔ Review Agent
Advantages
- More flexible
- More decentralized
- Better for collaborative reasoning
Disadvantages
- Harder to debug
- Harder to control
- Risk of message loops
- More complex state management
👉 Interview Answer
In a peer-to-peer multi-agent system, agents communicate directly with each other.
This can improve flexibility, but it makes coordination, observability, and safety much harder.
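The message-loop risk is easy to reproduce. In this sketch two hypothetical peers keep handing a message back and forth, and a hop budget carried on the message itself (the `MAX_HOPS` value is an assumption) is what finally stops the exchange.

```python
from dataclasses import dataclass

@dataclass
class Message:
    sender: str
    recipient: str
    body: str
    hops: int = 0

MAX_HOPS = 4  # assumed budget; tune per workflow

def research(msg: Message) -> Message:
    return Message("research", "code", msg.body + " -> researched", msg.hops + 1)

def code(msg: Message) -> Message:
    # Sends the work back to research, which creates a loop.
    return Message("code", "research", msg.body + " -> coded", msg.hops + 1)

PEERS = {"research": research, "code": code}

def deliver(msg: Message) -> Message:
    """Route messages peer to peer until the hop budget is spent."""
    while msg.hops < MAX_HOPS:
        msg = PEERS[msg.recipient](msg)
    return msg
```

Without the hop counter this exchange would never terminate, which is why peer-to-peer designs usually need a budget or an external supervisor even though no coordinator is in the message path.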
7️⃣ Hierarchical Pattern
What Is Hierarchical Multi-Agent Design?
Agents are organized in layers.
Manager Agent
→ Team Lead Agents
→ Specialist Agents
→ Tools
Example
Manager Agent
→ Engineering Lead Agent
  → Code Agent
  → Test Agent
→ Research Lead Agent
  → Web Research Agent
  → Document Agent
Best For
- Large workflows
- Enterprise automation
- Complex task decomposition
- Multi-domain projects
👉 Interview Answer
A hierarchical multi-agent system organizes agents into layers.
Higher-level agents manage planning and delegation, while lower-level agents execute specialized tasks.
This pattern works well for large and complex workflows.
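As a rough sketch, the hierarchy can be modeled as a nested dictionary and walked top-down. The structure mirrors the example above; the `delegate` function is a hypothetical illustration in which only leaf agents execute work.

```python
# Hypothetical hierarchy: keys are agents, values are their direct reports.
HIERARCHY = {
    "Manager Agent": {
        "Engineering Lead Agent": {"Code Agent": {}, "Test Agent": {}},
        "Research Lead Agent": {"Web Research Agent": {}, "Document Agent": {}},
    }
}

def delegate(name: str, children: dict, task: str, depth: int = 0) -> list[str]:
    """Walk the tree top-down; agents with no reports do the actual work."""
    indent = "  " * depth
    if not children:
        return [f"{indent}{name} executes: {task}"]
    lines = [f"{indent}{name} delegates: {task}"]
    for child, grandchildren in children.items():
        lines += delegate(child, grandchildren, task, depth + 1)
    return lines

trace = delegate("Manager Agent", HIERARCHY["Manager Agent"], "ship feature X")
```

Printing `"\n".join(trace)` shows one delegation line per manager or lead and one execution line per specialist.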
8️⃣ Communication Between Agents
Communication Methods
Agents can communicate through:
- Direct messages
- Shared memory
- Event queues
- Task queues
- State machines
- Workflow engines
Example with Shared Memory
Research Agent writes findings
→ Shared Memory
→ Summary Agent reads findings
→ Final report
Example with Queue
Coordinator creates task
→ Queue
→ Worker Agent picks task
→ Writes result
→ Coordinator reviews result
👉 Interview Answer
Agents need a communication mechanism.
For simple systems, shared memory may be enough.
For production systems, queues, workflow engines, and structured state machines are often better because they improve reliability and observability.
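The queue example can be sketched with Python's standard `queue` module. This is an in-process illustration only; the function names and message shape are assumptions, and a production system would more likely use a durable broker or workflow engine.

```python
import queue

tasks = queue.Queue()    # coordinator -> workers
results = queue.Queue()  # workers -> coordinator

def coordinator_create(task_id: int, payload: str) -> None:
    tasks.put({"id": task_id, "payload": payload})

def worker_step() -> bool:
    """Pick one task if available; return False when the queue is empty."""
    try:
        task = tasks.get_nowait()
    except queue.Empty:
        return False
    # Stand-in for real agent work.
    results.put({"id": task["id"], "output": task["payload"].upper()})
    tasks.task_done()
    return True

def coordinator_review() -> list[dict]:
    reviewed = []
    while not results.empty():
        reviewed.append(results.get_nowait())
    return reviewed
```

The coordinator and worker never call each other directly. Each only knows the queue and the message shape, which is what makes it possible to add workers or retry a task without changing the coordinator.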
9️⃣ Memory and State Management
Why State Matters
Multi-agent systems need to track:
- Current goal
- Assigned tasks
- Completed tasks
- Intermediate outputs
- Agent decisions
- Tool results
- Errors and retries
State Store Options
| Option | Use Case |
|---|---|
| In-memory state | Simple demos |
| Database | Persistent workflows |
| Object storage | Large artifacts |
| Vector DB / RAG | Semantic memory |
| Queue / workflow engine | Async execution |
Important Rule
Do not rely only on LLM memory.
Use explicit state storage.
👉 Interview Answer
Multi-agent systems should not rely only on the model context window for memory.
Production systems need explicit state storage, such as databases, queues, object storage, or vector databases, so task progress can be tracked and recovered.
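A minimal sketch of explicit state storage using the standard-library `sqlite3` module. The table layout and the `TaskStateStore` class are assumptions for illustration; the idea is that task progress lives outside the model context and can be queried after a crash.

```python
import json
import sqlite3

class TaskStateStore:
    def __init__(self, path: str = ":memory:") -> None:
        # Use a file path instead of ":memory:" to survive restarts.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS tasks ("
            "task_id TEXT PRIMARY KEY, agent TEXT, status TEXT, output TEXT)"
        )

    def save(self, task_id: str, agent: str, status: str, output=None) -> None:
        # INSERT OR REPLACE overwrites the previous row for the same task_id.
        self.db.execute(
            "INSERT OR REPLACE INTO tasks VALUES (?, ?, ?, ?)",
            (task_id, agent, status, json.dumps(output)),
        )
        self.db.commit()

    def pending(self) -> list[str]:
        rows = self.db.execute(
            "SELECT task_id FROM tasks WHERE status != 'done' ORDER BY task_id"
        )
        return [row[0] for row in rows]
```

After a restart, a coordinator could call `pending()` to find unfinished tasks instead of trying to reconstruct progress from a conversation history.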
🔟 Tool Access and Permissions
Why Tool Control Matters
Different agents should not have the same permissions.
Example:
Research Agent → read-only search tools
Code Agent → repository access
Deploy Agent → deployment tools with approval
Permission Design
- Least privilege
- Tool allowlist
- Read/write separation
- Human approval for risky tools
- Audit logs for tool calls
👉 Interview Answer
In multi-agent systems, tool access should be scoped by agent role.
A research agent may only need read-only access, while a deployment agent needs stronger approval controls.
This reduces risk and improves security.
1️⃣1️⃣ Validation Agent
Why Use a Validation Agent?
One agent can check another agent’s work.
Code Agent writes code
→ Test Agent runs tests
→ Review Agent checks quality
→ Coordinator approves
What Validation Can Check
- Output format
- Factual correctness
- Policy compliance
- Test results
- Tool result consistency
- Hallucination risk
👉 Interview Answer
A validation agent improves reliability by reviewing outputs from other agents.
However, validation should not rely only on another LLM.
It should also use deterministic checks, schemas, tests, and business rules.
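Here is a small sketch of the deterministic side of validation: a schema check plus a consistency check against the raw tool result. The report fields and the `validate_report` function are assumptions chosen for illustration.

```python
REPORT_SCHEMA = {"summary": str, "error_count": int, "tests_passed": bool}

def validate_report(report: dict, tool_results: dict) -> list[str]:
    """Return a list of problems; an empty list means the report passed."""
    errors = []
    # 1. Schema check: every field present with the expected type.
    for field, expected in REPORT_SCHEMA.items():
        if not isinstance(report.get(field), expected):
            errors.append(f"{field}: expected {expected.__name__}")
    if errors:
        return errors
    # 2. Consistency check: the agent's claim must match the tool output.
    if report["error_count"] != tool_results["error_count"]:
        errors.append("error_count does not match tool result")
    # 3. Business rule: failing tests block approval.
    if not report["tests_passed"]:
        errors.append("tests failed")
    return errors
```

None of these checks needs a model call, so they are cheap, repeatable, and cannot be talked out of a failure. An LLM reviewer can then focus on what rules cannot express, such as clarity or reasoning quality.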
1️⃣2️⃣ Failure Handling
Common Failures
Multi-agent systems can fail because of:
- Agent loops
- Conflicting outputs
- Tool failures
- State inconsistency
- Bad delegation
- Context loss
- Cost explosion
Controls
- Max step limits
- Retry limits
- Timeouts
- Dead-letter queues
- Human escalation
- Deterministic validators
- Workflow checkpoints
👉 Interview Answer
Failure handling is critical in multi-agent systems.
The system should limit retries, prevent infinite loops, checkpoint progress, and escalate to humans when agents cannot resolve the task safely.
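Several of these controls fit in one small runner. The sketch below assumes each step is a zero-argument callable that raises `RuntimeError` on a transient failure; the names `run_with_limits` and `EscalateToHuman` are hypothetical, and timeouts are left out to keep it short.

```python
class EscalateToHuman(Exception):
    def __init__(self, reason: str, checkpoint: list) -> None:
        super().__init__(reason)
        self.checkpoint = list(checkpoint)  # work completed before the failure

def run_with_limits(steps, max_steps: int = 10, max_retries: int = 2) -> list:
    checkpoint = []
    for index, step in enumerate(steps):
        if index >= max_steps:
            raise EscalateToHuman("step budget exhausted", checkpoint)
        for attempt in range(max_retries + 1):
            try:
                checkpoint.append(step())  # checkpoint each completed step
                break
            except RuntimeError as exc:
                if attempt == max_retries:
                    raise EscalateToHuman(
                        f"step {index} failed after {attempt + 1} attempts: {exc}",
                        checkpoint,
                    ) from exc
    return checkpoint
```

The exception carries the checkpoint so a human or a later run can resume from the last completed step instead of starting over.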
1️⃣3️⃣ Observability
What to Log
- Agent name
- Agent role
- Task assignment
- Prompt version
- Model version
- Tool calls
- Tool results
- State transitions
- Cost
- Latency
- Failures
- Final output
Agent Trace
User Goal
→ Coordinator creates plan
→ Research Agent gathers context
→ Data Agent queries metrics
→ Validation Agent checks result
→ Reporting Agent creates final answer
👉 Interview Answer
Observability is essential because multi-agent systems have dynamic execution paths.
I would log each agent step, tool call, state transition, cost, latency, and validation result.
Without tracing, debugging multi-agent systems becomes very difficult.
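A minimal tracing sketch, assuming each agent step can be wrapped in a callable. The `Tracer` class, the event fields, and the default version strings are illustrative assumptions; a production setup would more likely export spans to an existing tracing backend.

```python
import json
import time
from dataclasses import asdict, dataclass

@dataclass
class TraceEvent:
    agent: str
    role: str
    action: str
    prompt_version: str
    model_version: str
    latency_ms: float
    cost_usd: float
    ok: bool

class Tracer:
    def __init__(self) -> None:
        self.events: list[TraceEvent] = []

    def record(self, agent, role, action, fn, *, prompt_version="v1",
               model_version="model-x", cost_usd=0.0):
        """Run fn() and log one event whether it succeeds or raises."""
        start = time.perf_counter()
        ok = False
        try:
            result = fn()
            ok = True
            return result
        finally:
            self.events.append(TraceEvent(
                agent, role, action, prompt_version, model_version,
                latency_ms=(time.perf_counter() - start) * 1000,
                cost_usd=cost_usd, ok=ok,
            ))

    def to_jsonl(self) -> str:
        # One JSON object per line is easy to ship to a log pipeline.
        return "\n".join(json.dumps(asdict(event)) for event in self.events)
```

Recording in the `finally` block means failed steps still leave a trace event, which is usually the event that matters most when debugging.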
1️⃣4️⃣ When Not to Use Multi-Agent Systems
Avoid Multi-Agent If
- Task is simple
- Workflow is deterministic
- Latency must be very low
- Cost must be tightly controlled
- One agent is enough
- Rules can be implemented directly in code
Bad Example
User asks for account balance
→ Multi-agent workflow
This is unnecessary.
A simple backend API is better.
👉 Interview Answer
I would not use multi-agent systems for simple or deterministic tasks.
Multi-agent architecture adds latency, cost, coordination complexity, and debugging difficulty.
It should only be used when specialization and orchestration provide clear value.
1️⃣5️⃣ Best Design Principle
Start Simple
Recommended evolution:
Single LLM call
→ Single agent with tools
→ Agentic workflow
→ Multi-agent system
Do Not Over-Engineer
Use multi-agent only when:
- Task needs specialization
- Parallel work matters
- Validation is valuable
- Context is too large for one agent
- Workflow is naturally multi-role
👉 Interview Answer
I would start with the simplest architecture that solves the problem.
If a single agent with tools works, I would not immediately build a multi-agent system.
Multi-agent design is useful, but it should be introduced only when the task complexity justifies it.
🧠 Final Staff-Level Answer
👉 Interview Answer (Full Version)
A multi-agent system is an architecture where multiple specialized AI agents collaborate to complete a larger task.
Instead of one agent doing everything, we split responsibilities across agents such as coordinator, researcher, coder, tester, validator, and reporter.
The main benefit is specialization.
Each agent can focus on a smaller responsibility, use a narrower set of tools, and produce more focused outputs.
This can improve modularity, parallelism, and validation quality.
A common architecture uses a coordinator agent. The coordinator understands the user goal, decomposes it into tasks, assigns work to specialist agents, tracks progress, and combines the final result.
Specialist agents perform focused work, such as searching documents, querying metrics, writing code, running tests, or validating outputs.
Production systems also need explicit memory and state management. We should not rely only on the LLM context window. Task state, intermediate outputs, tool results, retries, and final artifacts should be stored in databases, queues, object storage, or vector databases.
The hardest parts are coordination, reliability, safety, and observability.
Multi-agent systems can create loops, conflicting outputs, stale state, tool failures, and cost explosion.
So I would add max step limits, retry limits, tool permissions, checkpoints, deterministic validators, human escalation, and detailed agent traces.
I would not use multi-agent systems for simple deterministic workflows. I would start with a single LLM call, then a single tool-using agent, and only introduce multiple agents when specialization, parallelism, or validation clearly justifies the complexity.
⭐ Final Insight
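The core of a multi-agent system is not "putting many agents in a chat together."
The real core is:
Coordinator + Specialist Agents + Shared State + Tools + Validation + Guardrails.
It suits complex, multi-step, multi-role tasks.
But it also brings higher coordination complexity, latency, cost, and debugging difficulty.
So the best principle is: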
Start simple.
Add multiple agents only when the problem truly needs them.
📌 Staff Memorization Pack
30-Second Answer
A multi-agent system splits complex work across specialized agents, usually coordinated by a supervisor that assigns tasks, controls permissions, merges results, and handles failures.
In production, I would design it with explicit boundaries around planning, execution, validation, permissions, state, observability, and fallback behavior.
2-Minute Staff Answer
When designing a multi-agent system, I would start by separating the model’s reasoning role from the system’s execution guarantees.
The LLM can interpret ambiguous intent, produce plans, choose tools, summarize context, and adapt to observations. But the surrounding platform must enforce deterministic controls: schemas, permissions, timeouts, retries, idempotency, audit logging, and policy checks.
My design would include a clear orchestration layer, bounded tool access, managed state, validation after important steps, and human approval for high-risk actions. I would also add tracing for every model call, tool call, decision point, and failure so the system can be debugged and improved.
The staff-level trade-off is autonomy versus control. More autonomy improves flexibility, but it increases cost, latency, unpredictability, and safety risk. A production design should give the agent enough freedom to solve ambiguous tasks while keeping irreversible or correctness-critical actions inside deterministic backend systems.
Architecture Points to Memorize
- Supervisor receives the goal and decomposes work
- Specialized agents own narrow responsibilities such as research, coding, analysis, or validation
- Shared task state tracks progress and dependencies
- Message bus or orchestrator controls communication
- Tool permissions are scoped per agent role
- Validator or critic agent reviews outputs
- Final synthesizer merges partial results into one answer
- Trace system records every handoff and decision
Failure Modes to Call Out
- too much inter-agent chatter
- unclear ownership
- conflicting outputs
- shared memory corruption
- duplicated work
- higher latency and token cost
- harder debugging
- coordination failure
Guardrails and Controls
A strong production answer should mention:
- tool allowlists and per-tool permissions
- input and output schema validation
- max step limits and cost budgets
- timeout and retry policy
- idempotency keys for side-effecting actions
- human approval for high-risk operations
- prompt, model, and tool version tracking
- agent trace logging
- evaluation datasets and regression tests
- fallback to deterministic backend or manual review
Common Follow-up Questions
How do you make it reliable?
I would constrain the action space, validate every tool call, make side effects idempotent, add step limits, log full traces, and convert production failures into eval cases. Reliability comes from the system around the model, not from trusting the model blindly.
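One way to make side effects idempotent is to derive a key from the task, tool, and arguments, and skip the call if that key has already completed. This is a sketch under stated assumptions: the in-memory `_completed` dict stands in for a durable store, and `execute_once` is a hypothetical helper.

```python
import hashlib
import json

_completed: dict[str, str] = {}  # assumption: replace with a durable store

def idempotency_key(task_id: str, tool: str, args: dict) -> str:
    # sort_keys makes the key independent of argument order.
    payload = json.dumps({"task": task_id, "tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def execute_once(task_id: str, tool: str, args: dict, side_effect) -> str:
    """Run the side effect at most once per (task, tool, args)."""
    key = idempotency_key(task_id, tool, args)
    if key not in _completed:
        _completed[key] = side_effect(**args)
    return _completed[key]
```

With this in place, an agent that retries a step after a timeout gets the stored result back instead of sending the same email or creating the same ticket twice.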
How do you control cost and latency?
I would use smaller models for simple steps, cache stable context, limit retrieval size, set max iterations, parallelize safe independent work, and stop early when confidence is high enough. I would track cost per task, tokens per step, tool latency, and timeout rate.
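Two of these controls, max iterations and model routing, can be sketched in a few lines. The limits, the step categories, and the model names `small-model` and `large-model` are placeholders, not real model identifiers.

```python
class BudgetExceeded(Exception):
    pass

class TaskBudget:
    def __init__(self, max_steps: int, max_tokens: int) -> None:
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.steps = 0
        self.tokens = 0

    def charge(self, tokens: int) -> None:
        """Record one model call and stop the task once a limit is crossed."""
        self.steps += 1
        self.tokens += tokens
        if self.steps > self.max_steps:
            raise BudgetExceeded("max iterations reached")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token budget exhausted")

SIMPLE_STEPS = {"classify", "extract", "format"}  # assumed categories

def pick_model(step_kind: str) -> str:
    # Route cheap, well-defined steps to the smaller model.
    return "small-model" if step_kind in SIMPLE_STEPS else "large-model"
```

The budget object also gives the metrics mentioned above for free: `steps` and `tokens` can be logged per task when the run ends.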
How do you handle unsafe actions?
I would classify actions by risk. Read-only actions can be more automated, but writes, money movement, permission changes, deletion, external communication, and compliance-sensitive actions should require deterministic validation or human approval.
How do you debug failures?
I would inspect the agent trace: user goal, prompt version, retrieved context, plan, tool calls, observations, validation results, and final output. Without step-level traces, agent failures are almost impossible to debug at production quality.
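Short Recitation Version
The core of a staff-level answer on building multi-agent systems is not how smart the model is, but how to turn agents into a controllable production system.
The LLM is responsible for understanding the goal, decomposing tasks, choosing tools, summarizing context, and adjusting the plan based on observations. The deterministic backend, however, must own permissions, schema validation, business rules, idempotency, transactions, auditing, and compliance.
I would split the system into an orchestrator, planner, tool router, execution layer, memory/state store, validator, guardrails, observability, and a fallback path. Every step needs a trace, every tool call needs permission and argument validation, and high-risk actions need human review or deterministic validation.
The staff-level trade-off is autonomy versus control. The higher the autonomy, the more flexible the system, but latency, cost, debugging difficulty, and safety risk rise as well. A production design should therefore restrict the agent's action space and leave irreversible and correctness-critical actions to the traditional backend.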
Staff-Level Final Sentence
At staff level, I would not create multiple agents unless specialization reduces complexity. Multi-agent design is useful when roles, permissions, and validation boundaries are clear; otherwise a single well-instrumented agent is usually simpler.