🎯 Design an AI Chat Assistant System like ChatGPT
1️⃣ Core Framework
When designing an AI Chat Assistant System, I frame it as:
- Product requirements
- Chat session and message storage
- Prompt building and context management
- Model routing and inference
- Tool calling and retrieval
- Streaming response
- Safety and moderation
- Trade-offs: quality vs latency vs cost
2️⃣ Product Goal
An AI chat assistant allows users to ask questions, continue conversations, upload context, use tools, and receive generated answers.
Basic Flow
User Message
→ Chat Backend
→ Context Builder
→ LLM
→ Streaming Response
→ Store Conversation
👉 Interview Answer
An AI chat assistant is a conversational system built around an LLM.
It receives user messages, retrieves conversation context, builds a prompt, calls the model, streams the response, stores the conversation, and applies safety, permissions, logging, and cost controls.
3️⃣ Functional Requirements
Core Features
The system should support:
- User login
- Create chat session
- Send message
- Stream assistant response
- Store conversation history
- Continue previous conversation
- Regenerate response
- Delete conversation
- Attach files or images
- Use tools when needed
Advanced Features
- RAG retrieval
- Tool calling
- Web search
- Code execution
- Memory
- Voice input
- Multimodal input
- Team workspace
👉 Interview Answer
The core requirements are chat sessions, message storage, context retrieval, model inference, streaming responses, and conversation history.
Advanced requirements include tool calling, RAG, file uploads, memory, multimodal input, and workspace features.
4️⃣ Non-functional Requirements
Important System Qualities
The system should optimize for:
- Low latency
- High availability
- Scalability
- Safety
- Privacy
- Cost control
- Reliability
- Observability
- Multi-tenancy
Key Trade-off
Better answers often require more context, more tools, and larger models.
But that increases latency and cost.
👉 Interview Answer
Non-functional requirements include low latency, scalability, high availability, safety, privacy, observability, and cost efficiency.
The main trade-off is balancing answer quality against latency and cost.
5️⃣ High-Level Architecture
Architecture
Client
→ API Gateway
→ Auth Service
→ Chat Service
→ Conversation Store
→ Context Builder
→ Tool / Retrieval Layer
→ Model Router
→ LLM Inference Service
→ Response Streamer
→ Safety / Logging
Core Components
Chat Service
Handles sessions and messages.
Context Builder
Builds prompt context from conversation history, memory, files, and tools.
Model Router
Chooses a model for each request based on task, latency, and cost.
Inference Service
Calls the LLM.
Response Streamer
Streams generated tokens back to the client.
👉 Interview Answer
A high-level AI assistant architecture includes a client, API gateway, authentication, chat service, conversation store, context builder, tool and retrieval layer, model router, LLM inference service, response streamer, safety layer, and observability pipeline.
6️⃣ Data Model
Main Entities
User
Workspace
Conversation
Message
Attachment
ToolCall
ModelRun
UsageRecord
Conversation Table
{
"conversation_id": "conv_123",
"user_id": "user_456",
"title": "System design discussion",
"created_at": "2026-05-24T10:00:00Z",
"updated_at": "2026-05-24T10:10:00Z"
}
Message Table
{
"message_id": "msg_123",
"conversation_id": "conv_123",
"role": "user",
"content": "Explain caching",
"created_at": "2026-05-24T10:01:00Z"
}
👉 Interview Answer
I would model users, conversations, messages, attachments, tool calls, model runs, and usage records.
Conversation and message storage is the source of truth for chat history and context reconstruction.
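A minimal sketch of the Conversation and Message entities as Python dataclasses. The field names mirror the JSON examples above; the `Role` values other than `user`, the `utc_now` helper, and the `add_message` method are assumptions added for illustration, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal

# Assumed role set; the notes only show "user" explicitly.
Role = Literal["system", "user", "assistant", "tool"]


def utc_now() -> datetime:
    """Timezone-aware timestamp, matching the ISO 8601 'Z' style in the examples."""
    return datetime.now(timezone.utc)


@dataclass
class Message:
    message_id: str
    conversation_id: str
    role: Role
    content: str
    created_at: datetime = field(default_factory=utc_now)


@dataclass
class Conversation:
    conversation_id: str
    user_id: str
    title: str
    created_at: datetime = field(default_factory=utc_now)
    updated_at: datetime = field(default_factory=utc_now)
    messages: list[Message] = field(default_factory=list)

    def add_message(self, message: Message) -> None:
        # Reject messages that point at another conversation.
        if message.conversation_id != self.conversation_id:
            raise ValueError("message belongs to a different conversation")
        self.messages.append(message)
        self.updated_at = message.created_at
```

In a real deployment the messages would live in their own table keyed by `conversation_id` rather than in an in-memory list.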
7️⃣ Message Flow
End-to-End Flow
User sends message
→ Backend validates request
→ Save user message
→ Load conversation history
→ Build prompt
→ Call model
→ Stream response
→ Save assistant message
→ Emit usage and logs
Why Save User Message First?
If model generation fails, the user message still exists and can be retried.
👉 Interview Answer
The system should save the user message before calling the model.
Then it builds context, calls the model, streams the answer, saves the assistant response, and emits usage and observability events.
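The ordering above can be sketched as follows. `InMemoryStore`, `fake_model`, and `failing_model` are hypothetical stand-ins for the conversation store and the inference service; the point is only that the user message is persisted before the model is called, so a failed generation leaves something to retry.

```python
from typing import Callable, Iterable, Iterator


class InMemoryStore:
    """Hypothetical stand-in for the conversation store."""

    def __init__(self) -> None:
        self.messages: list[dict] = []

    def save(self, conversation_id: str, role: str, content: str) -> None:
        self.messages.append(
            {"conversation_id": conversation_id, "role": role, "content": content}
        )

    def history(self, conversation_id: str) -> list[dict]:
        return [m for m in self.messages if m["conversation_id"] == conversation_id]


def handle_message(
    store: InMemoryStore,
    conversation_id: str,
    text: str,
    call_model: Callable[[list[dict]], Iterable[str]],
) -> Iterator[str]:
    if not text.strip():
        raise ValueError("empty message")
    # 1. Persist the user message before any model work happens.
    store.save(conversation_id, "user", text)
    # 2. Load history (which now includes the new message) as the prompt.
    prompt = store.history(conversation_id)

    def stream() -> Iterator[str]:
        chunks: list[str] = []
        # 3. Stream chunks to the caller while accumulating the full answer.
        for chunk in call_model(prompt):
            chunks.append(chunk)
            yield chunk
        # 4. Save the assistant message only after generation completes.
        store.save(conversation_id, "assistant", "".join(chunks))

    return stream()


def fake_model(prompt: list[dict]) -> Iterator[str]:
    yield "Hel"
    yield "lo"


def failing_model(prompt: list[dict]) -> Iterator[str]:
    raise RuntimeError("model unavailable")
```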
8️⃣ Context Management
Why Context Management Matters
LLMs have context window limits.
A long conversation may not fit.
Context Builder Inputs
- Recent messages
- System instructions
- User profile
- Memory
- Retrieved documents
- Tool results
- Uploaded files
- Safety instructions
Context Strategy
Keep recent messages
+ summarize older messages
+ retrieve relevant history
+ add tool results
👉 Interview Answer
Context management decides what information is sent to the model.
The system should include recent conversation, relevant retrieved history, memory, tool results, and uploaded context while staying within the model context window.
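A rough sketch of a context builder that keeps the newest messages within a token budget. The four-characters-per-token estimate is an assumption for illustration only; a real system would count tokens with the target model's tokenizer. The function and parameter names are hypothetical.

```python
from typing import Optional


def estimate_tokens(text: str) -> int:
    # Crude heuristic (assumption): roughly 4 characters per token.
    return max(1, len(text) // 4)


def build_context(
    system_prompt: str,
    summary: Optional[str],
    messages: list[dict],
    budget: int,
) -> list[dict]:
    """Return prompt messages whose estimated size fits within `budget` tokens."""
    fixed = [{"role": "system", "content": system_prompt}]
    if summary:
        fixed.append(
            {"role": "system", "content": f"Summary of earlier conversation: {summary}"}
        )
    used = sum(estimate_tokens(m["content"]) for m in fixed)

    recent: list[dict] = []
    # Walk from newest to oldest and stop at the first message that does not fit.
    for msg in reversed(messages):
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        recent.append(msg)
        used += cost

    # Restore chronological order for the prompt.
    return fixed + list(reversed(recent))
```

Retrieved documents, memory, and tool results would be added the same way, each with its own share of the budget.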
9️⃣ Conversation Summarization
Why Summarize?
Long conversations can exceed context limits.
Strategy
Old messages
→ Summarizer
→ Conversation summary
→ Store summary
→ Use summary in future prompts
Trade-off
Summaries reduce tokens, but may lose details.
👉 Interview Answer
For long conversations, I would summarize older messages and keep recent messages verbatim.
This reduces token usage while preserving continuity, but summaries must be refreshed carefully to avoid losing important details.
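One way to sketch the rolling-summary strategy: fold everything except the last few messages into a stored summary. The `summarize` callable is an assumption; in practice it would be an LLM call, and `naive_summarize` here is only a placeholder so the example runs.

```python
from typing import Callable


def naive_summarize(previous: str, old: list[dict]) -> str:
    """Placeholder summarizer (assumption): concatenates truncated messages."""
    lines = [f'{m["role"]}: {m["content"][:80]}' for m in old]
    return "\n".join(([previous] if previous else []) + lines)


def compact_history(
    messages: list[dict],
    summary: str,
    keep_recent: int,
    summarize: Callable[[str, list[dict]], str],
) -> tuple[str, list[dict]]:
    """Fold all but the last `keep_recent` messages into the summary."""
    if len(messages) <= keep_recent:
        return summary, messages
    split = len(messages) - keep_recent
    old, recent = messages[:split], messages[split:]
    # The previous summary is passed in so earlier details are carried forward.
    return summarize(summary, old), recent
```

Passing the previous summary into each refresh is what keeps continuity, but every refresh is lossy, which is the trade-off noted above.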
🔟 Model Routing
Why Route Models?
Not every message needs the largest model.
Routing Examples
Simple question
→ Mini model
Complex reasoning
→ Large model
Code debugging
→ Strong coding model
Image input
→ Multimodal model
Routing Signals
- Task type
- Prompt length
- User tier
- Risk level
- Tool requirement
- Latency target
- Cost budget
👉 Interview Answer
Model routing chooses the right model based on task type, complexity, prompt length, risk, user tier, latency, and cost.
This reduces cost while preserving quality.
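A minimal rule-based router, assuming a small set of routing signals. The model names, the `RouteRequest` fields, and the 8,000-token threshold are placeholders chosen for the example; a production router might use a classifier instead of fixed rules.

```python
from dataclasses import dataclass

LONG_PROMPT_TOKENS = 8000  # assumed threshold for "long" prompts


@dataclass
class RouteRequest:
    task_type: str          # e.g. "chat", "reasoning", "code"
    prompt_tokens: int
    has_image: bool = False


def route_model(req: RouteRequest) -> str:
    """Pick a model name from simple routing signals."""
    if req.has_image:
        return "multimodal-model"
    if req.task_type == "code":
        return "coding-model"
    if req.task_type == "reasoning" or req.prompt_tokens > LONG_PROMPT_TOKENS:
        return "large-model"
    # Default: the cheapest model handles simple questions.
    return "mini-model"
```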
1️⃣1️⃣ Streaming Response
Why Streaming Is Needed
LLM generation can take seconds.
Streaming improves user experience.
Streaming Flow
LLM generates token
→ Backend sends event
→ Client renders partial response
→ Final message saved
Benefits
- Lower perceived latency
- Better UX
- Early cancellation
- Useful for long responses
👉 Interview Answer
The assistant should stream responses token by token.
Streaming reduces perceived latency and lets users see partial output immediately, while the backend stores the final completed response.
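The streaming flow can be sketched with Server-Sent Events framing, which is one common transport choice (WebSockets would also work). The event names `token` and `done` and the `on_complete` callback are assumptions for this example.

```python
import json
from typing import Callable, Iterable, Iterator


def sse_event(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame: event line, data line, blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_response(
    tokens: Iterable[str],
    on_complete: Callable[[str], None],
) -> Iterator[str]:
    parts: list[str] = []
    for tok in tokens:
        parts.append(tok)
        # Each chunk is sent to the client as soon as it is generated.
        yield sse_event("token", {"text": tok})
    full = "".join(parts)
    # Persist the completed message once generation has finished.
    on_complete(full)
    yield sse_event("done", {"length": len(full)})
```

If the client disconnects early, the generator stops being consumed and `on_complete` is never called, so a real service would need to decide whether to save the partial answer.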
1️⃣2️⃣ Tool Calling
Why Tools Are Needed
The assistant may need external capabilities.
Examples:
- Search documents
- Browse web
- Run code
- Read files
- Query database
- Call calendar
- Send email
Tool Flow
Model proposes tool call
→ Backend validates tool call
→ Execute tool
→ Return tool result to model
→ Generate final answer
Important Rule
The model proposes.
The backend executes.
👉 Interview Answer
Tool calling allows the assistant to interact with external systems.
The model should only propose structured tool calls.
The backend validates permissions, executes tools, and returns results to the model.
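A sketch of "the model proposes, the backend executes". The tool registry, permission strings, and the shape of the proposed call (`name` plus `arguments`) are assumptions for illustration; the tool bodies are stubs.

```python
from typing import Any

# Hypothetical registry: required arguments, needed permission, and implementation.
TOOLS: dict[str, dict[str, Any]] = {
    "search_documents": {
        "required": {"query"},
        "permission": "docs:read",
        "run": lambda args: f"results for {args['query']}",
    },
    "send_email": {
        "required": {"to", "body"},
        "permission": "email:send",
        "run": lambda args: f"email queued for {args['to']}",
    },
}


def execute_tool_call(call: dict, user_permissions: set[str]) -> dict:
    """Validate a model-proposed tool call, then execute it on the backend."""
    spec = TOOLS.get(call.get("name"))
    if spec is None:
        return {"ok": False, "error": "unknown tool"}
    if spec["permission"] not in user_permissions:
        return {"ok": False, "error": "permission denied"}
    args = call.get("arguments")
    if not isinstance(args, dict):
        return {"ok": False, "error": "arguments must be an object"}
    missing = spec["required"] - set(args)
    if missing:
        return {"ok": False, "error": f"missing arguments: {sorted(missing)}"}
    # Only validated calls reach the tool implementation.
    return {"ok": True, "result": spec["run"](args)}
```

The returned dictionary, including any error, is what would be fed back to the model as the tool result.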
1️⃣3️⃣ RAG Integration
Why RAG Is Useful
The assistant may need private or updated knowledge.
RAG Flow
User question
→ Retrieve relevant documents
→ Add chunks to prompt
→ LLM answers with sources
Use Cases
- Internal documentation
- Help center
- Company policies
- Uploaded files
- Knowledge base Q&A
👉 Interview Answer
RAG lets the assistant answer using external knowledge.
The system retrieves relevant chunks, applies permission filters, builds context, and asks the LLM to answer based on retrieved sources.
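A toy version of the RAG flow with the permission filter applied before ranking. Keyword overlap stands in for vector similarity, and the chunk fields (`source`, `text`, `allowed_groups`) are assumptions; a real system would use an embedding index.

```python
def retrieve(
    question: str,
    chunks: list[dict],
    user_groups: set[str],
    top_k: int = 2,
) -> list[dict]:
    """Return the top_k permitted chunks ranked by keyword overlap."""
    q_terms = set(question.lower().split())
    scored: list[tuple[int, dict]] = []
    for chunk in chunks:
        # Access control first: chunks the user cannot see are never scored.
        if not (chunk["allowed_groups"] & user_groups):
            continue
        overlap = len(q_terms & set(chunk["text"].lower().split()))
        if overlap:
            scored.append((overlap, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def build_rag_prompt(question: str, retrieved: list[dict]) -> str:
    sources = "\n".join(f"[{c['source']}] {c['text']}" for c in retrieved)
    return (
        "Answer using only the sources below and cite them.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
```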
1️⃣4️⃣ Safety and Moderation
Safety Layer
The system should check:
- User input safety
- Model output safety
- Tool action safety
- Prompt injection
- PII exposure
- Abuse patterns
- Policy violations
Safety Flow
User message
→ Input safety check
→ Model call
→ Tool safety check (if the model proposes a tool call)
→ Output safety check
👉 Interview Answer
Safety should be applied before, during, and after model execution.
The system should moderate inputs, outputs, tool calls, prompt injection, PII leakage, and abuse behavior.
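A deliberately simple sketch of input and output checks around a model call. The marker list and the email regex are illustrative assumptions only; real moderation would rely on trained classifiers and policy engines rather than string matching.

```python
import re
from typing import Callable

# Illustrative patterns (assumptions), not a complete policy.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
INJECTION_MARKERS = ("ignore previous instructions",)


def check_input(text: str) -> tuple[bool, str]:
    """Input-side check: flag obvious prompt-injection phrases."""
    lowered = text.lower()
    for marker in INJECTION_MARKERS:
        if marker in lowered:
            return False, "possible prompt injection"
    return True, "ok"


def redact_output(text: str) -> str:
    """Output-side check: mask email addresses before returning text."""
    return EMAIL_RE.sub("[redacted email]", text)


def safe_generate(user_text: str, call_model: Callable[[str], str]) -> str:
    allowed, reason = check_input(user_text)
    if not allowed:
        # Blocked requests never reach the model.
        return f"Request blocked: {reason}"
    return redact_output(call_model(user_text))
```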
1️⃣5️⃣ Cost Control and Billing
Cost Drivers
- Model size
- Input tokens
- Output tokens
- Tool calls
- RAG retrieval
- File processing
- Long conversations
Controls
- Token limits
- Model routing
- Caching
- Quotas
- Rate limits
- Prompt compression
- Usage tracking
👉 Interview Answer
Cost control is essential because chat assistants can consume large numbers of tokens.
I would track input tokens, output tokens, model usage, tool calls, and use model routing, caching, quotas, and token limits to control cost.
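A small sketch of per-request cost estimation and a token quota. The prices are made-up placeholder values, not any provider's actual rates, and `UsageTracker` is a hypothetical in-memory stand-in for a usage store.

```python
# Hypothetical prices in USD per 1,000 tokens (placeholder values).
PRICES = {
    "mini-model": {"input": 0.0005, "output": 0.0015},
    "large-model": {"input": 0.005, "output": 0.015},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one request, charging input and output separately."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1000


class UsageTracker:
    """Tracks tokens consumed per user against a fixed quota."""

    def __init__(self, daily_token_quota: int) -> None:
        self.quota = daily_token_quota
        self.used: dict[str, int] = {}

    def try_consume(self, user_id: str, tokens: int) -> bool:
        current = self.used.get(user_id, 0)
        if current + tokens > self.quota:
            return False  # over quota: reject before calling the model
        self.used[user_id] = current + tokens
        return True
```

Checking the quota before the model call, using the estimated prompt size plus the output token limit, avoids paying for requests that would be rejected anyway.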
1️⃣6️⃣ Observability
What to Monitor
- Request latency
- Time to first token
- Token usage
- Model errors
- Tool errors
- Safety blocks
- Retrieval quality
- User feedback
- Cost per request
- Conversation failure rate
Why Important
AI failures are hard to debug without traces.
👉 Interview Answer
Observability should capture request traces, prompt versions, model runs, tool calls, token usage, latency, errors, safety decisions, retrieval results, and user feedback.
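A sketch of how time to first token and total latency could be captured around a token stream. The `trace` dictionary keys are assumptions; a real system would emit them to a tracing backend. The `clock` parameter exists only so the timing can be tested deterministically.

```python
import time
from typing import Callable, Iterable, Iterator


def traced_stream(
    tokens: Iterable[str],
    trace: dict,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Pass tokens through unchanged while recording latency metrics in `trace`."""
    start = clock()
    trace["token_chunks"] = 0
    trace["time_to_first_token_s"] = None
    try:
        for tok in tokens:
            if trace["time_to_first_token_s"] is None:
                trace["time_to_first_token_s"] = clock() - start
            trace["token_chunks"] += 1
            yield tok
        trace["status"] = "ok"
    except Exception as exc:
        # Record model or transport failures, then let them propagate.
        trace["status"] = "error"
        trace["error"] = repr(exc)
        raise
    finally:
        trace["total_latency_s"] = clock() - start
```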
1️⃣7️⃣ Common Failure Modes
Failure Modes
AI chat assistants can fail because of:
- Hallucination
- Wrong context selection
- Long conversation drift
- Tool call errors
- Prompt injection
- Stale memory
- High latency
- Cost spikes
- Safety false positives
- Bad model routing
Example
Assistant retrieves wrong document
→ Model answers confidently
→ User receives incorrect answer
👉 Interview Answer
Chat assistant failures often come from wrong context, hallucination, tool failures, unsafe outputs, stale memory, or bad routing.
The system needs validation, observability, safety checks, and user feedback loops.
1️⃣8️⃣ Best Practices
Practical Rules
- Store every message durably
- Save user message before model call
- Use streaming responses
- Keep context within token budget
- Summarize long conversations
- Use RAG for external knowledge
- Validate tool calls
- Route models by task complexity
- Track token usage and cost
- Add safety checks and audit logs
Design Principle
The chat assistant is not just an LLM.
It is a product system around the LLM.
👉 Interview Answer
A production AI chat assistant is not just an LLM wrapper.
It requires conversation storage, context management, model routing, tool execution, RAG, streaming, safety, cost control, observability, and feedback loops.
🧠 Final Staff-Level Answer
👉 Interview Answer Full Version
To design an AI chat assistant like ChatGPT, I would treat it as a full product system around an LLM, not just a model API call.
The system needs to support user authentication, chat sessions, message storage, conversation history, streaming responses, context management, model routing, tool calling, RAG, safety, cost control, and observability.
The high-level architecture starts with a client and API gateway.
The chat service validates the request, authenticates the user, saves the user message, loads conversation state, and calls a context builder.
The context builder decides what to include in the model prompt: recent messages, summarized older history, system instructions, memory, retrieved documents, uploaded files, and tool results.
Since context windows are limited, the system should not keep sending the entire conversation as it grows.
It should keep recent messages, summarize older messages, and retrieve relevant context when needed.
The model router selects the right model based on task complexity, prompt length, risk, latency target, user tier, and cost.
The inference service calls the model and streams tokens back to the client.
The assistant response is saved after generation completes, while partial streaming improves perceived latency.
For advanced capabilities, the assistant can use tools.
The model may propose a tool call, but the backend must validate permissions, execute the tool, and return the result to the model.
RAG is used when the assistant needs private, uploaded, or updated knowledge.
The retrieval layer should enforce access control before adding context to the prompt.
Safety is required across the system: input moderation, output moderation, tool safety, prompt injection defense, PII protection, and abuse detection.
Cost control is also important because long conversations, large models, and tool calls can become expensive.
The system should track token usage, use model routing, cache stable results, enforce quotas, and limit unnecessary context.
Finally, observability is critical.
We need traces for prompts, model runs, tool calls, retrieval results, latency, token usage, safety decisions, and user feedback.
The core principle is: the chat assistant is not just an LLM.
It is a full product system around the LLM.
⭐ Final Insight
The core of designing a ChatGPT-like system is not:
“frontend input + backend LLM call”
The real system is:
Chat Service
+ Conversation Store
+ Context Builder
+ Model Router
+ LLM Inference
+ Streaming
+ Tool Calling
+ RAG
+ Safety
+ Cost Control
+ Observability
The most important takeaway:
The chat assistant is not just an LLM.
It is a product system around the LLM.
Chinese Version
🎯 Design an AI Chat Assistant System like ChatGPT
1️⃣ Core Framework
When designing an AI Chat Assistant System, I usually analyze it along these dimensions:
- Product requirements
- Chat session and message storage
- Prompt building and context management
- Model routing and inference
- Tool calling and retrieval
- Streaming response
- Safety and moderation
- Core trade-off: quality vs latency vs cost
2️⃣ Product Goal
An AI chat assistant lets users ask questions, continue conversations, upload context, use tools, and receive generated answers.
Basic Flow
User Message
→ Chat Backend
→ Context Builder
→ LLM
→ Streaming Response
→ Store Conversation
👉 Interview Answer
An AI chat assistant is a conversational system built around an LLM.
It receives user messages, retrieves conversation context, builds a prompt, calls the model, streams the response, stores the conversation, and applies safety, permissions, logging, and cost controls.
3️⃣ Functional Requirements
Core Features
The system should support:
- User login
- Create chat session
- Send message
- Stream assistant response
- Store conversation history
- Continue previous conversation
- Regenerate response
- Delete conversation
- Attach files or images
- Use tools when needed
Advanced Features
- RAG retrieval
- Tool calling
- Web search
- Code execution
- Memory
- Voice input
- Multimodal input
- Team workspace
👉 Interview Answer
The core requirements are chat sessions, message storage, context retrieval, model inference, streaming responses, and conversation history.
Advanced requirements include tool calling, RAG, file uploads, memory, multimodal input, and workspace features.
4️⃣ Non-functional Requirements
Important System Qualities
The system should optimize for:
- Low latency
- High availability
- Scalability
- Safety
- Privacy
- Cost control
- Reliability
- Observability
- Multi-tenancy
Key Trade-off
Better answers often require more context, more tools, and larger models.
But that increases latency and cost.
👉 Interview Answer
Non-functional requirements include low latency, scalability, high availability, safety, privacy, observability, and cost efficiency.
The core trade-off is balancing answer quality against latency and cost.
5️⃣ High-Level Architecture
Architecture
Client
→ API Gateway
→ Auth Service
→ Chat Service
→ Conversation Store
→ Context Builder
→ Tool / Retrieval Layer
→ Model Router
→ LLM Inference Service
→ Response Streamer
→ Safety / Logging
Core Components
Chat Service
Handles sessions and messages.
Context Builder
Builds prompt context from conversation history, memory, files, and tools.
Model Router
Chooses the right model.
Inference Service
Calls the LLM.
Response Streamer
Streams generated tokens back to the client.
👉 Interview Answer
A high-level AI assistant architecture includes a client, API gateway, authentication, chat service, conversation store, context builder, tool and retrieval layer, model router, LLM inference service, response streamer, safety layer, and observability pipeline.
6️⃣ Data Model
Main Entities
User
Workspace
Conversation
Message
Attachment
ToolCall
ModelRun
UsageRecord
Conversation Table
{
"conversation_id": "conv_123",
"user_id": "user_456",
"title": "System design discussion",
"created_at": "2026-05-24T10:00:00Z",
"updated_at": "2026-05-24T10:10:00Z"
}
Message Table
{
"message_id": "msg_123",
"conversation_id": "conv_123",
"role": "user",
"content": "Explain caching",
"created_at": "2026-05-24T10:01:00Z"
}
👉 Interview Answer
I would model users, conversations, messages, attachments, tool calls, model runs, and usage records.
Conversation and message storage is the source of truth for chat history and context reconstruction.
7️⃣ Message Flow
End-to-End Flow
User sends message
→ Backend validates request
→ Save user message
→ Load conversation history
→ Build prompt
→ Call model
→ Stream response
→ Save assistant message
→ Emit usage and logs
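Why Save User Message First?
If model generation fails, the user message still exists and can be retried.
👉 Interview Answer
The system should save the user message before calling the model.
Then it builds context, calls the model, streams the answer, saves the assistant response, and emits usage and observability events.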
8️⃣ Context Management
Why Context Management Matters
LLMs have context window limits.
A long conversation may not fit.
Context Builder Inputs
- Recent messages
- System instructions
- User profile
- Memory
- Retrieved documents
- Tool results
- Uploaded files
- Safety instructions
Context Strategy
Keep recent messages
+ summarize older messages
+ retrieve relevant history
+ add tool results
👉 Interview Answer
Context management decides what information is sent to the model.
The system should include recent conversation, relevant retrieved history, memory, tool results, and uploaded context while staying within the model context window.
9️⃣ Conversation Summarization
Why Summarize?
Long conversations can exceed context limits.
Strategy
Old messages
→ Summarizer
→ Conversation summary
→ Store summary
→ Use summary in future prompts
Trade-off
Summaries reduce tokens, but may lose details.
👉 Interview Answer
For long conversations, I would summarize older messages and keep recent messages verbatim.
This reduces token usage while preserving continuity, but summaries must be refreshed carefully to avoid losing important details.
🔟 Model Routing
Why Route Models?
Not every message needs the largest model.
Routing Examples
Simple question
→ Mini model
Complex reasoning
→ Large model
Code debugging
→ Strong coding model
Image input
→ Multimodal model
Routing Signals
- Task type
- Prompt length
- User tier
- Risk level
- Tool requirement
- Latency target
- Cost budget
👉 Interview Answer
Model routing chooses the right model based on task type, complexity, prompt length, risk, user tier, latency, and cost.
This reduces cost while preserving quality.
1️⃣1️⃣ Streaming Response
Why Streaming Is Needed
LLM generation can take seconds.
Streaming improves user experience.
Streaming Flow
LLM generates token
→ Backend sends event
→ Client renders partial response
→ Final message saved
Benefits
- Lower perceived latency
- Better UX
- Early cancellation
- Useful for long responses
👉 Interview Answer
The assistant should stream responses token by token.
Streaming reduces perceived latency and lets users see partial output immediately, while the backend stores the final completed response.
1️⃣2️⃣ Tool Calling
Why Tools Are Needed
The assistant may need external capabilities.
Examples:
- Search documents
- Browse web
- Run code
- Read files
- Query database
- Call calendar
- Send email
Tool Flow
Model proposes tool call
→ Backend validates tool call
→ Execute tool
→ Return tool result to model
→ Generate final answer
Important Rule
The model proposes.
The backend executes.
👉 Interview Answer
Tool calling allows the assistant to interact with external systems.
The model should only propose structured tool calls.
The backend validates permissions, executes tools, and returns results to the model.
1️⃣3️⃣ RAG Integration
Why RAG Is Useful
The assistant may need private or updated knowledge.
RAG Flow
User question
→ Retrieve relevant documents
→ Add chunks to prompt
→ LLM answers with sources
Use Cases
- Internal documentation
- Help center
- Company policies
- Uploaded files
- Knowledge base Q&A
👉 Interview Answer
RAG lets the assistant answer using external knowledge.
The system retrieves relevant chunks, applies permission filters, builds context, and asks the LLM to answer based on retrieved sources.
1️⃣4️⃣ Safety and Moderation
Safety Layer
The system should check:
- User input safety
- Model output safety
- Tool action safety
- Prompt injection
- PII exposure
- Abuse patterns
- Policy violations
Safety Flow
User message
→ Input safety check
→ Model call
→ Tool safety check (if the model proposes a tool call)
→ Output safety check
👉 Interview Answer
Safety should be applied before, during, and after model execution.
The system should moderate inputs, outputs, tool calls, prompt injection, PII leakage, and abuse behavior.
1️⃣5️⃣ Cost Control and Billing
Cost Drivers
- Model size
- Input tokens
- Output tokens
- Tool calls
- RAG retrieval
- File processing
- Long conversations
Controls
- Token limits
- Model routing
- Caching
- Quotas
- Rate limits
- Prompt compression
- Usage tracking
👉 Interview Answer
Cost control is essential because chat assistants can consume large numbers of tokens.
I would track input tokens, output tokens, model usage, tool calls, and use model routing, caching, quotas, and token limits to control cost.
1️⃣6️⃣ Observability
What to Monitor
- Request latency
- Time to first token
- Token usage
- Model errors
- Tool errors
- Safety blocks
- Retrieval quality
- User feedback
- Cost per request
- Conversation failure rate
Why Important
AI failures are hard to debug without traces.
👉 Interview Answer
Observability should capture request traces, prompt versions, model runs, tool calls, token usage, latency, errors, safety decisions, retrieval results, and user feedback.
1️⃣7️⃣ Common Failure Modes
Failure Modes
AI chat assistants can fail because of:
- Hallucination
- Wrong context selection
- Long conversation drift
- Tool call errors
- Prompt injection
- Stale memory
- High latency
- Cost spikes
- Safety false positives
- Bad model routing
Example
Assistant retrieves wrong document
→ Model answers confidently
→ User receives incorrect answer
👉 Interview Answer
Chat assistant failures often come from wrong context, hallucination, tool failures, unsafe outputs, stale memory, or bad routing.
The system needs validation, observability, safety checks, and user feedback loops.
1️⃣8️⃣ Best Practices
Practical Rules
- Store every message durably
- Save user message before model call
- Use streaming responses
- Keep context within token budget
- Summarize long conversations
- Use RAG for external knowledge
- Validate tool calls
- Route models by task complexity
- Track token usage and cost
- Add safety checks and audit logs
Design Principle
The chat assistant is not just an LLM.
It is a product system around the LLM.
👉 Interview Answer
A production AI chat assistant is not just an LLM wrapper.
It requires conversation storage, context management, model routing, tool execution, RAG, streaming, safety, cost control, observability, and feedback loops.
🧠 Final Staff-Level Answer
👉 Interview Answer Full Version
To design an AI chat assistant like ChatGPT, I would treat it as a full product system around an LLM, not just a model API call.
The system needs to support user authentication, chat sessions, message storage, conversation history, streaming responses, context management, model routing, tool calling, RAG, safety, cost control, and observability.
The high-level architecture starts with a client and API gateway.
The chat service validates the request, authenticates the user, saves the user message, loads conversation state, and calls a context builder.
The context builder decides what to include in the model prompt: recent messages, summarized older history, system instructions, memory, retrieved documents, uploaded files, and tool results.
Since context windows are limited, the system should not keep sending the entire conversation as it grows.
It should keep recent messages, summarize older messages, and retrieve relevant context when needed.
The model router selects the right model based on task complexity, prompt length, risk, latency target, user tier, and cost.
The inference service calls the model and streams tokens back to the client.
The assistant response is saved after generation completes, while partial streaming improves perceived latency.
For advanced capabilities, the assistant can use tools.
The model may propose a tool call, but the backend must validate permissions, execute the tool, and return the result to the model.
RAG is used when the assistant needs private, uploaded, or updated knowledge.
The retrieval layer should enforce access control before adding context to the prompt.
Safety is required across the system: input moderation, output moderation, tool safety, prompt injection defense, PII protection, and abuse detection.
Cost control is also important because long conversations, large models, and tool calls can become expensive.
The system should track token usage, use model routing, cache stable results, enforce quotas, and limit unnecessary context.
Finally, observability is critical.
We need traces for prompts, model runs, tool calls, retrieval results, latency, token usage, safety decisions, and user feedback.
The core principle is: the chat assistant is not just an LLM.
It is a full product system around the LLM.
⭐ Final Insight
The core of designing a ChatGPT-like system is not:
“frontend input + backend LLM call”
The real system is:
Chat Service
+ Conversation Store
+ Context Builder
+ Model Router
+ LLM Inference
+ Streaming
+ Tool Calling
+ RAG
+ Safety
+ Cost Control
+ Observability
The most important takeaway:
The chat assistant is not just an LLM.
It is a product system around the LLM.
Implement