🎯 Design an AI Chat Assistant System like ChatGPT

1️⃣ Core Framework

When designing an AI Chat Assistant System, I frame it as:

Product requirements
Chat session and message storage
Prompt building and context management
Model routing and inference
Tool calling and retrieval
Streaming response
Safety and moderation
Trade-offs: quality vs latency vs cost

2️⃣ Product Goal

An AI chat assistant allows users to ask questions, continue conversations, upload context, use tools, and receive generated answers.

Basic Flow

User Message
→ Chat Backend
→ Context Builder
→ LLM
→ Streaming Response
→ Store Conversation

👉 Interview Answer

An AI chat assistant is a conversational system built around an LLM.

It receives user messages, retrieves conversation context, builds a prompt, calls the model, streams the response, stores the conversation, and applies safety, permissions, logging, and cost controls.

3️⃣ Functional Requirements

Core Features

The system should support:

User login
Create chat session
Send message
Stream assistant response
Store conversation history
Continue previous conversation
Regenerate response
Delete conversation
Attach files or images
Use tools when needed

Advanced Features

RAG retrieval
Tool calling
Web search
Code execution
Memory
Voice input
Multimodal input
Team workspace

👉 Interview Answer

The core requirements are chat sessions, message storage, context retrieval, model inference, streaming responses, and conversation history.

Advanced requirements include tool calling, RAG, file uploads, memory, multimodal input, and workspace features.

4️⃣ Non-functional Requirements

Important System Qualities

The system should optimize for:

Low latency
High availability
Scalability
Safety
Privacy
Cost control
Reliability
Observability
Multi-tenancy

Key Trade-off

Better answers often require more context, more tools, and larger models.

But each of these increases latency and cost.

👉 Interview Answer

Non-functional requirements include low latency, scalability, high availability, safety, privacy, observability, and cost efficiency.

The main trade-off is balancing answer quality against latency and cost.

5️⃣ High-Level Architecture

Architecture

Client
→ API Gateway
→ Auth Service
→ Chat Service
→ Conversation Store
→ Context Builder
→ Tool / Retrieval Layer
→ Model Router
→ LLM Inference Service
→ Response Streamer
→ Safety / Logging

Core Components

Chat Service

Handles sessions and messages.

Context Builder

Builds prompt context from conversation history, memory, files, and tools.

Model Router

Chooses a model for each request based on task type, latency target, and cost.

Inference Service

Calls the selected LLM and returns the generated tokens.

Response Streamer

Streams generated tokens back to the client.

👉 Interview Answer

A high-level AI assistant architecture includes a client, API gateway, authentication, chat service, conversation store, context builder, tool and retrieval layer, model router, LLM inference service, response streamer, safety layer, and observability pipeline.

6️⃣ Data Model

Main Entities

User
Workspace
Conversation
Message
Attachment
ToolCall
ModelRun
UsageRecord

Conversation Table

{
  "conversation_id": "conv_123",
  "user_id": "user_456",
  "title": "System design discussion",
  "created_at": "2026-05-24T10:00:00Z",
  "updated_at": "2026-05-24T10:10:00Z"
}

Message Table

{
  "message_id": "msg_123",
  "conversation_id": "conv_123",
  "role": "user",
  "content": "Explain caching",
  "created_at": "2026-05-24T10:01:00Z"
}

👉 Interview Answer

I would model users, workspaces, conversations, messages, attachments, tool calls, model runs, and usage records.

Conversation and message storage is the source of truth for chat history and context reconstruction.
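
As a minimal sketch, the two records above could be mirrored as Python dataclasses. The field names come from the JSON examples; the `Role` values, the `new_message` helper, and the ID format are assumptions made for illustration, not a fixed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal
import uuid

Role = Literal["user", "assistant", "system", "tool"]  # assumed role set


@dataclass
class Message:
    message_id: str
    conversation_id: str
    role: Role
    content: str
    created_at: datetime


@dataclass
class Conversation:
    conversation_id: str
    user_id: str
    title: str
    created_at: datetime
    updated_at: datetime
    messages: list[Message] = field(default_factory=list)


def new_message(conv: Conversation, role: Role, content: str) -> Message:
    """Append a message and bump the conversation's updated_at (hypothetical helper)."""
    now = datetime.now(timezone.utc)
    msg = Message(f"msg_{uuid.uuid4().hex[:8]}", conv.conversation_id, role, content, now)
    conv.messages.append(msg)
    conv.updated_at = now
    return msg
```

In a real store the messages would live in their own table keyed by `conversation_id`; the in-memory list only keeps the sketch short.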

7️⃣ Message Flow

End-to-End Flow

User sends message
→ Backend validates request
→ Save user message
→ Load conversation history
→ Build prompt
→ Call model
→ Stream response
→ Save assistant message
→ Emit usage and logs

Why Save User Message First?

If model generation fails, the user message is already persisted, so generation can be retried without the user resending it.

👉 Interview Answer

The system should save the user message before calling the model.

Then it builds context, calls the model, streams the answer, saves the assistant response, and emits usage and observability events.
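
The ordering can be sketched as follows. This is an illustration under stated assumptions: `store` is an in-memory stand-in for the conversation store, and `handle_message` and `call_model` are hypothetical names.

```python
from typing import Callable, Optional

store: list[dict] = []  # stand-in for the conversation store


def handle_message(conversation_id: str, text: str,
                   call_model: Callable[[list[dict]], str]) -> Optional[str]:
    """Save the user message first, then call the model (hypothetical handler)."""
    store.append({"conversation_id": conversation_id, "role": "user", "content": text})
    history = [m for m in store if m["conversation_id"] == conversation_id]
    try:
        answer = call_model(history)
    except Exception:
        # The user message is already saved, so the client can retry generation.
        return None
    store.append({"conversation_id": conversation_id, "role": "assistant", "content": answer})
    return answer
```

If `call_model` raises, the store still holds the user turn, which is the property the ordering is meant to guarantee.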

8️⃣ Context Management

Why Context Management Matters

LLMs have context window limits.

A long conversation may not fit.

Context Builder Inputs

Recent messages
System instructions
User profile
Memory
Retrieved documents
Tool results
Uploaded files
Safety instructions

Context Strategy

Keep recent messages
+ summarize older messages
+ retrieve relevant history
+ add tool results

👉 Interview Answer

Context management decides what information is sent to the model.

The system should include recent conversation, relevant retrieved history, memory, tool results, and uploaded context while staying within the model context window.
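
A minimal sketch of a token-budgeted context builder, assuming a crude whitespace token count in place of the model's real tokenizer. `count_tokens`, `build_context`, and the `budget` parameter are hypothetical names.

```python
def count_tokens(text: str) -> int:
    """Crude whitespace estimate; a real system would use the model's tokenizer."""
    return len(text.split())


def build_context(system: str, summary: str, messages: list[str], budget: int) -> list[str]:
    """Keep system instructions and summary, then add the newest messages that fit."""
    fixed = [system] + ([summary] if summary else [])
    remaining = budget - sum(count_tokens(t) for t in fixed)
    kept: list[str] = []
    for msg in reversed(messages):           # walk from newest to oldest
        cost = count_tokens(msg)
        if cost > remaining:
            break                            # stop at the first message that does not fit
        kept.append(msg)
        remaining -= cost
    return fixed + list(reversed(kept))      # restore chronological order
```

Retrieved documents and tool results could be added the same way, each with its own share of the budget.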

9️⃣ Conversation Summarization

Why Summarize?

Long conversations can exceed the model's context limit.

Strategy

Old messages
→ Summarizer
→ Conversation summary
→ Store summary
→ Use summary in future prompts

Trade-off

Summaries reduce tokens, but may lose details.

👉 Interview Answer

For long conversations, I would summarize older messages and keep recent messages verbatim.

This reduces token usage while preserving continuity, but summaries must be refreshed carefully to avoid losing important details.
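
One way to sketch the rolling summary, assuming the summarizer is passed in as a function. `compact_history` and `naive_summarize` are hypothetical names, and the placeholder summarizer only truncates text where a real one would call an LLM.

```python
from typing import Callable


def compact_history(messages: list[str], summary: str, keep_recent: int,
                    summarize: Callable[[str, list[str]], str]) -> tuple[str, list[str]]:
    """Fold messages older than the last `keep_recent` into the running summary."""
    if len(messages) <= keep_recent:
        return summary, messages             # nothing old enough to summarize
    cut = len(messages) - keep_recent
    old, recent = messages[:cut], messages[cut:]
    return summarize(summary, old), recent


def naive_summarize(previous: str, old: list[str]) -> str:
    """Placeholder summarizer: truncates each message instead of calling an LLM."""
    clipped = [m[:20] for m in old]
    return " | ".join(([previous] if previous else []) + clipped)
```

Because each new summary is built from the previous one, details lost in an earlier pass cannot be recovered later, which is the trade-off noted above.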

🔟 Model Routing

Why Route Models?

Not every message needs the largest model.

Routing Examples

Simple question
→ Mini model

Complex reasoning
→ Large model

Code debugging
→ Strong coding model

Image input
→ Multimodal model

Routing Signals

Task type
Prompt length
User tier
Risk level
Tool requirement
Latency target
Cost budget

👉 Interview Answer

Model routing chooses the right model based on task type, complexity, prompt length, risk, user tier, latency, and cost.

This reduces cost while preserving quality.
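
A rule-based router is one simple way to act on these signals. In the sketch below the model names, the 4000-token threshold, and the free-tier fallback are all illustrative assumptions; a production router might use a classifier instead of hand-written rules.

```python
from dataclasses import dataclass


@dataclass
class Request:
    task_type: str            # e.g. "chat", "code", "reasoning"
    prompt_tokens: int
    has_image: bool = False
    user_tier: str = "free"


def route_model(req: Request) -> str:
    """Return a model tier name; names and thresholds are illustrative assumptions."""
    if req.has_image:
        return "multimodal-model"
    if req.task_type == "code":
        return "coding-model"
    if req.task_type == "reasoning" or req.prompt_tokens > 4000:
        # Assumed policy: free-tier users stay on the small model to cap cost.
        return "large-model" if req.user_tier != "free" else "mini-model"
    return "mini-model"
```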

1️⃣1️⃣ Streaming Response

Why Streaming Is Needed

Generating a full response can take several seconds.

Streaming improves user experience.

Streaming Flow

LLM generates token
→ Backend sends event
→ Client renders partial response
→ Final message saved

Benefits

Lower perceived latency
Better UX
Early cancellation
Useful for long responses

👉 Interview Answer

The assistant should stream responses token by token.

Streaming reduces perceived latency and lets users see partial output immediately, while the backend stores the final completed response.
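
The streaming flow can be sketched with a generator that emits Server-Sent Events frames. The `delta` field, the `[DONE]` sentinel, and the `on_complete` callback are assumptions for illustration; only the `data: ...` line followed by a blank line is part of the SSE format itself.

```python
import json
from typing import Callable, Iterable, Iterator


def sse_stream(tokens: Iterable[str], on_complete: Callable[[str], None]) -> Iterator[str]:
    """Yield one SSE frame per token, then persist the full text."""
    parts: list[str] = []
    for tok in tokens:
        parts.append(tok)
        payload = json.dumps({"delta": tok})
        yield f"data: {payload}\n\n"
    on_complete("".join(parts))      # save the final assistant message
    yield "data: [DONE]\n\n"         # end-of-stream sentinel (assumed convention)
```

If the client disconnects early and the generator is closed, `on_complete` is never called, so a real system has to decide whether to save the partial text.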

1️⃣2️⃣ Tool Calling

Why Tools Are Needed

The assistant may need external capabilities.

Examples:

Search documents
Browse web
Run code
Read files
Query database
Call calendar
Send email

Tool Flow

Model proposes tool call
→ Backend validates tool call
→ Execute tool
→ Return tool result to model
→ Generate final answer

Important Rule

The model proposes.

The backend executes.

👉 Interview Answer

Tool calling allows the assistant to interact with external systems.

The model should only propose structured tool calls.

The backend validates permissions, executes tools, and returns results to the model.
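
A minimal sketch of "the model proposes, the backend executes". The `TOOLS` registry, the permission strings, and the shape of the proposed call (`name` plus `arguments`) are assumptions, and the tool bodies are stubs.

```python
from typing import Any, Callable

# Hypothetical registry: tool name -> (required permission, implementation)
TOOLS: dict[str, tuple[str, Callable[..., Any]]] = {
    "search_docs": ("docs:read", lambda query: f"results for {query!r}"),
    "send_email": ("email:send", lambda to, body: f"sent to {to}"),
}


def execute_tool_call(call: dict, user_permissions: set[str]) -> dict:
    """Validate a model-proposed call, then run it on the backend."""
    name, args = call.get("name"), call.get("arguments", {})
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    permission, fn = TOOLS[name]
    if permission not in user_permissions:
        return {"ok": False, "error": f"missing permission: {permission}"}
    try:
        return {"ok": True, "result": fn(**args)}
    except TypeError as exc:         # wrong or missing arguments
        return {"ok": False, "error": str(exc)}
```

Returning errors as data rather than raising lets the backend hand the failure back to the model, which can then correct the call or explain the problem to the user.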

1️⃣3️⃣ RAG Integration

Why RAG Is Useful

The assistant may need private or up-to-date knowledge.

RAG Flow

User question
→ Retrieve relevant documents
→ Add chunks to prompt
→ LLM answers with sources

Use Cases

Internal documentation
Help center
Company policies
Uploaded files
Knowledge base Q&A

👉 Interview Answer

RAG lets the assistant answer using external knowledge.

The system retrieves relevant chunks, applies permission filters, builds context, and asks the LLM to answer based on retrieved sources.
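
The retrieve, filter, and prompt steps can be sketched as below. Word overlap stands in for a real vector search, and `Chunk`, `allowed_groups`, `retrieve`, and `build_rag_prompt` are hypothetical names. The permission filter runs before ranking so that restricted text never reaches the prompt.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str
    allowed_groups: frozenset[str]


def retrieve(query: str, chunks: list[Chunk], user_groups: set[str], k: int = 2) -> list[Chunk]:
    """Filter by permission first, then rank by naive word overlap with the query."""
    terms = set(query.lower().split())
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    scored = [(len(terms & set(c.text.lower().split())), c) for c in visible]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [c for _, c in ranked[:k]]


def build_rag_prompt(query: str, hits: list[Chunk]) -> str:
    """Number each chunk so the model can cite its sources."""
    context = "\n".join(f"[{i}] ({c.source}) {c.text}" for i, c in enumerate(hits, 1))
    return f"Answer using only the sources below and cite them.\n{context}\n\nQuestion: {query}"
```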

1️⃣4️⃣ Safety and Moderation

Safety Layer

The system should check:

User input safety
Model output safety
Tool action safety
Prompt injection
PII exposure
Abuse patterns
Policy violations

Safety Flow

User message
→ Input safety check
→ Model call
→ Tool safety check if a tool call is proposed
→ Output safety check

👉 Interview Answer

Safety should be applied before, during, and after model execution.

The system should moderate inputs, outputs, tool calls, prompt injection, PII leakage, and abuse behavior.
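
A toy sketch of wrapping the model call with an input check and an output check. The blocked phrase list and the email regex are deliberately simplistic assumptions; a real deployment would use dedicated moderation and PII detection services, and `safe_generate` is a hypothetical name.

```python
import re

BLOCKED_PHRASES = ("ignore previous instructions",)    # toy prompt-injection signal
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")        # rough email pattern


def check_input(text: str) -> bool:
    """Return True when the input passes the (toy) input check."""
    lowered = text.lower()
    return not any(p in lowered for p in BLOCKED_PHRASES)


def redact_output(text: str) -> str:
    """Mask email addresses in model output before it reaches the user."""
    return EMAIL.sub("[redacted email]", text)


def safe_generate(user_text: str, call_model) -> str:
    """Run the input check, call the model, then redact the output."""
    if not check_input(user_text):
        return "Request blocked by input safety check."
    return redact_output(call_model(user_text))
```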

1️⃣5️⃣ Cost Control and Billing

Cost Drivers

Model size
Input tokens
Output tokens
Tool calls
RAG retrieval
File processing
Long conversations

Controls

Token limits
Model routing
Caching
Quotas
Rate limits
Prompt compression
Usage tracking

👉 Interview Answer

Cost control is essential because chat assistants can consume a large number of tokens.

I would track input tokens, output tokens, model usage, tool calls, and use model routing, caching, quotas, and token limits to control cost.
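
Per-request cost and a simple quota can be sketched as below. The prices are made-up placeholders, not real vendor pricing, and `PRICES`, `request_cost`, and `Quota` are hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens: (input, output)
PRICES = {"mini-model": (0.0005, 0.0015), "large-model": (0.01, 0.03)}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost = input tokens at the input price plus output tokens at the output price."""
    price_in, price_out = PRICES[model]
    return input_tokens / 1000 * price_in + output_tokens / 1000 * price_out


@dataclass
class Quota:
    limit_usd: float
    spent_usd: float = 0.0

    def charge(self, cost: float) -> bool:
        """Record the cost if it fits in the remaining budget; otherwise refuse."""
        if self.spent_usd + cost > self.limit_usd:
            return False
        self.spent_usd += cost
        return True
```

Under these placeholder prices, the same 2,000-token prompt with a 500-token answer costs 0.035 USD on the large model and 0.00175 USD on the mini model, which is why routing is listed among the controls.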

1️⃣6️⃣ Observability

What to Monitor

Request latency
Time to first token
Token usage
Model errors
Tool errors
Safety blocks
Retrieval quality
User feedback
Cost per request
Conversation failure rate

Why Important

AI failures are hard to debug without traces.

👉 Interview Answer

Observability should capture request traces, prompt versions, model runs, tool calls, token usage, latency, errors, safety decisions, retrieval results, and user feedback.
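
As a small illustration, time to first token and total latency can be recorded by wrapping the token stream. The `Trace` fields and `traced_stream` are hypothetical; a real system would export these values to its tracing backend rather than keep them in memory.

```python
import time
from dataclasses import dataclass, field
from typing import Iterable, Iterator, Optional


@dataclass
class Trace:
    request_id: str
    model: str
    started_at: float = field(default_factory=time.monotonic)
    time_to_first_token: Optional[float] = None
    total_latency: Optional[float] = None
    output_tokens: int = 0


def traced_stream(tokens: Iterable[str], trace: Trace) -> Iterator[str]:
    """Pass tokens through unchanged while recording latency metrics on the trace."""
    for tok in tokens:
        if trace.time_to_first_token is None:
            trace.time_to_first_token = time.monotonic() - trace.started_at
        trace.output_tokens += 1
        yield tok
    trace.total_latency = time.monotonic() - trace.started_at
```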

1️⃣7️⃣ Common Failure Modes

Failure Modes

AI chat assistants can fail because of:

Hallucination
Wrong context selection
Long conversation drift
Tool call errors
Prompt injection
Stale memory
High latency
Cost spikes
Safety false positives
Bad model routing

Example

Assistant retrieves wrong document
→ Model answers confidently
→ User receives incorrect answer

👉 Interview Answer

Chat assistant failures often come from wrong context, hallucination, tool failures, unsafe outputs, stale memory, or bad routing.

The system needs validation, observability, safety checks, and user feedback loops.

1️⃣8️⃣ Best Practices

Practical Rules

Store every message durably
Save user message before model call
Use streaming responses
Keep context within token budget
Summarize long conversations
Use RAG for external knowledge
Validate tool calls
Route models by task complexity
Track token usage and cost
Add safety checks and audit logs

Design Principle

The chat assistant is not just an LLM.
It is a product system around the LLM.

👉 Interview Answer

A production AI chat assistant is not just an LLM wrapper.

It requires conversation storage, context management, model routing, tool execution, RAG, streaming, safety, cost control, observability, and feedback loops.

🧠 Final Staff-Level Answer

👉 Interview Answer (Full Version)

To design an AI chat assistant like ChatGPT, I would treat it as a full product system around an LLM, not just a model API call.

The system needs to support user authentication, chat sessions, message storage, conversation history, streaming responses, context management, model routing, tool calling, RAG, safety, cost control, and observability.

The high-level architecture starts with a client and API gateway.

The chat service validates the request, authenticates the user, saves the user message, loads conversation state, and calls a context builder.

The context builder decides what to include in the model prompt: recent messages, summarized older history, system instructions, memory, retrieved documents, uploaded files, and tool results.

Since context windows are limited, the system cannot keep sending the entire conversation as it grows.

It should keep recent messages, summarize older messages, and retrieve relevant context when needed.

The model router selects the right model based on task complexity, prompt length, risk, latency target, user tier, and cost.

The inference service calls the model and streams tokens back to the client.

The assistant response is saved after generation completes, while partial streaming improves perceived latency.

For advanced capabilities, the assistant can use tools.

The model may propose a tool call, but the backend must validate permissions, execute the tool, and return the result to the model.

RAG is used when the assistant needs private, uploaded, or up-to-date knowledge.

The retrieval layer should enforce access control before adding context to the prompt.

Safety is required across the system: input moderation, output moderation, tool safety, prompt injection defense, PII protection, and abuse detection.

Cost control is also important because long conversations, large models, and tool calls can become expensive.

The system should track token usage, use model routing, cache stable results, enforce quotas, and limit unnecessary context.

Finally, observability is critical.

We need traces for prompts, model runs, tool calls, retrieval results, latency, token usage, safety decisions, and user feedback.

The core principle is: the chat assistant is not just an LLM.

It is a full product system around the LLM.

⭐ Final Insight

The core of designing a ChatGPT-like system is not:

"Frontend input + backend call to the LLM"

The real system is:

Chat Service

Conversation Store

Context Builder

Model Router

LLM Inference

Streaming

Tool Calling

RAG

Safety

Cost Control

Observability

The most important takeaway:

The chat assistant is not just an LLM.

It is a product system around the LLM.
