🎯 Design Chat System
1️⃣ Core Framework
When discussing Chat System design, I frame it as:
- Core user flows: send message, receive message, read history
- Communication model: WebSocket vs polling vs push notification
- Data model: users, conversations, messages, participants
- Message delivery pipeline
- Online / offline handling
- Ordering, delivery status, and read receipts
- Scaling, caching, and storage strategy
- Trade-offs: latency vs consistency vs durability
- Failure handling and reliability
2️⃣ Core Requirements
Functional Requirements
- User can send 1:1 messages
- User can send group messages
- User can receive messages in real time
- User can load conversation history
- Support online/offline users
- Support message delivery status
- Support read receipts
- Support media messages
- Support push notifications
Non-functional Requirements
- Low-latency delivery
- High availability
- Durable message storage
- Scalable concurrent connections
- Eventual consistency is acceptable for delivery status
- Stronger durability is required for message persistence
👉 Interview Answer
A chat system has three core flows: sending messages, receiving messages in real time, and loading message history.
The system needs low-latency delivery, durable storage, and support for online and offline users.
The main challenge is balancing real-time communication, message durability, ordering, and large-scale connection management.
3️⃣ Main APIs
Send Message
POST /api/messages
Request:
{
"conversationId": "c123",
"senderId": "u1",
"content": "hello",
"messageType": "text",
"clientMessageId": "local-abc-123"
}
Response:
{
"messageId": "m789",
"serverTimestamp": "2026-05-02T10:00:00Z",
"status": "sent"
}
Get Conversation History
GET /api/conversations/{conversationId}/messages?cursor=xxx&limit=50
WebSocket Receive
ws://chat.example.com/connect?userId=u1
Server pushes:
{
"eventType": "message_created",
"conversationId": "c123",
"messageId": "m789",
"senderId": "u1",
"content": "hello",
"serverTimestamp": "2026-05-02T10:00:00Z"
}
Mark Message Read
POST /api/conversations/{conversationId}/read
Request:
{
"userId": "u2",
"lastReadMessageId": "m789"
}
👉 Interview Answer
I would use HTTP APIs for sending messages, loading history, and updating read status.
For real-time delivery, I would use WebSocket connections so the server can push new messages to online users immediately.
4️⃣ Data Model
User Table
user (
user_id VARCHAR PRIMARY KEY,
username VARCHAR,
created_at TIMESTAMP
)
Conversation Table
conversation (
conversation_id VARCHAR PRIMARY KEY,
type VARCHAR, -- one_to_one, group
created_at TIMESTAMP
)
Conversation Participant Table
conversation_participant (
conversation_id VARCHAR,
user_id VARCHAR,
role VARCHAR,
joined_at TIMESTAMP,
last_read_message_id VARCHAR,
PRIMARY KEY (conversation_id, user_id)
)
Message Table
message (
conversation_id VARCHAR,
message_id VARCHAR,
sender_id VARCHAR,
content TEXT,
message_type VARCHAR,
created_at TIMESTAMP,
client_message_id VARCHAR,
status VARCHAR,
PRIMARY KEY (conversation_id, created_at, message_id)
)
Delivery Status Table
message_delivery_status (
message_id VARCHAR,
user_id VARCHAR,
status VARCHAR, -- sent, delivered, read
updated_at TIMESTAMP,
PRIMARY KEY (message_id, user_id)
)
Why Partition Messages by Conversation?
- Conversation history is usually queried by conversation
- Keeps related messages localized
- Makes pagination easier
- Good fit for append-heavy workloads
👉 Interview Answer
I would store messages partitioned by conversation ID, because message history is usually read per conversation.
The message table is append-heavy, and conversation-based partitioning makes history pagination efficient.
I would store read status separately, because delivery and read receipts change more frequently than message content.
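The history pagination described above can be sketched as keyset (cursor) pagination over the message table's key order. This is a minimal in-memory illustration, not a storage implementation: the `page_history` function, the `Message` tuple, and the `(created_at, message_id)` cursor shape are assumptions made for the example.

```python
from typing import List, NamedTuple, Optional, Tuple

class Message(NamedTuple):
    conversation_id: str
    created_at: int      # epoch millis, assumed to be assigned by the server
    message_id: str
    content: str

# Hypothetical cursor: key of the oldest message already returned to the client.
Cursor = Tuple[int, str]

def page_history(messages: List[Message], conversation_id: str,
                 cursor: Optional[Cursor] = None, limit: int = 50):
    """Return up to `limit` messages, newest first, strictly older than `cursor`."""
    rows = sorted(
        (m for m in messages if m.conversation_id == conversation_id),
        key=lambda m: (m.created_at, m.message_id),
        reverse=True,
    )
    if cursor is not None:
        rows = [m for m in rows if (m.created_at, m.message_id) < cursor]
    page = rows[:limit]
    # Only hand back a cursor when older rows remain.
    more = len(rows) > limit
    next_cursor = (page[-1].created_at, page[-1].message_id) if more else None
    return page, next_cursor
```

Because the cursor is a position in key order rather than an offset, new messages arriving at the head of the conversation do not shift later pages.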
5️⃣ Communication Model
Option 1: Polling
Client repeatedly asks server for new messages.
client → server every few seconds
Pros
- Simple
- Easy to implement
Cons
- Higher latency
- Wasteful when no new messages
- Expensive at scale
Option 2: Long Polling
Client waits until server has updates.
Pros
- Better than polling
- Works when WebSocket is unavailable
Cons
- Still less efficient than WebSocket
- More connection overhead
Option 3: WebSocket
Persistent bidirectional connection.
client ↔ WebSocket Gateway ↔ Chat Service
Pros
- Low latency
- Efficient for real-time messages
- Supports typing indicators and presence
Cons
- Harder to scale
- Requires connection state management
Recommended
Use:
WebSocket for online delivery
Push notification for offline users
HTTP for history and state updates
👉 Interview Answer
I would use WebSocket for real-time delivery, because chat requires low-latency server-to-client updates.
For offline users, I would use push notifications.
HTTP APIs would still be used for message history, sending messages, and read status updates.
6️⃣ Message Send Flow
Basic Flow
- Client sends message with clientMessageId
- API service authenticates user
- Validate conversation membership
- Generate server messageId
- Store message durably
- Publish message-created event
- Delivery service pushes to online participants
- Push notification service notifies offline users
- Return acknowledgement to sender
Event Pipeline
Message API
→ Message Store
→ Message Event Queue
→ Delivery Service
→ WebSocket Gateway / Push Notification
Why Store Before Deliver?
Because message durability matters.
If the system delivers first but fails to store, users may see messages that disappear later.
👉 Interview Answer
I would store the message durably before delivering it.
After the message is persisted, I would publish a message-created event.
Delivery workers then push the message to online users through WebSocket and notify offline users through push notifications.
This ensures the message is not lost even if delivery fails temporarily.
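A minimal sketch of the store-before-deliver ordering, assuming an in-memory dictionary as the message store and a `deque` as the event queue. `MessageStore` and `SendPipeline` are hypothetical names; a real system would use a durable database and a message broker.

```python
from collections import deque

class MessageStore:
    """Stand-in for a durable store; save() is assumed to raise on failure."""
    def __init__(self):
        self.rows = {}

    def save(self, message_id, message):
        self.rows[message_id] = message

class SendPipeline:
    def __init__(self, store, queue):
        self.store = store
        self.queue = queue
        self._next = 0

    def send(self, conversation_id, sender_id, content):
        self._next += 1
        message_id = f"m{self._next}"
        message = {
            "messageId": message_id,
            "conversationId": conversation_id,
            "senderId": sender_id,
            "content": content,
        }
        # Step 1: persist. If this raises, nothing is published or delivered.
        self.store.save(message_id, message)
        # Step 2: publish the event only after the write succeeded.
        self.queue.append({"eventType": "message_created", **message})
        # Step 3: acknowledge to the sender.
        return {"messageId": message_id, "status": "sent"}
```

If the store write fails, the exception propagates before any event is queued, so recipients never see a message that does not exist in storage.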
7️⃣ Message Receive Flow
Online User
message event
→ delivery service
→ find user's active WebSocket connection
→ push message to client
→ client sends ack
→ update delivered status
Offline User
message event
→ user is offline
→ message is already stored (persisted before delivery)
→ send push notification
→ user later opens app
→ client syncs unread messages
Client Ack
Client should acknowledge received message:
{
"eventType": "message_ack",
"messageId": "m789",
"userId": "u2"
}
👉 Interview Answer
For online users, the delivery service pushes messages through active WebSocket connections.
For offline users, the message is already stored, so the system sends push notifications and the client later syncs unread messages from storage.
Client acknowledgements can be used to update delivery status.
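The online and offline branches, plus the ack handling, can be sketched as follows. This assumes in-memory stand-ins: a list per user plays the role of a WebSocket connection, and `push_outbox` plays the role of a push notification provider. `DeliveryService` and its method names are hypothetical.

```python
class DeliveryService:
    def __init__(self):
        self.connections = {}   # user_id -> list of pushed events (socket stand-in)
        self.push_outbox = []   # (user_id, message_id) pairs (push provider stand-in)
        self.status = {}        # (message_id, user_id) -> "sent" | "delivered"

    def connect(self, user_id):
        self.connections[user_id] = []

    def disconnect(self, user_id):
        self.connections.pop(user_id, None)

    def deliver(self, event, recipients):
        for user_id in recipients:
            self.status.setdefault((event["messageId"], user_id), "sent")
            socket = self.connections.get(user_id)
            if socket is not None:
                socket.append(event)                     # online: push in real time
            else:
                self.push_outbox.append((user_id, event["messageId"]))  # offline

    def on_ack(self, ack):
        key = (ack["messageId"], ack["userId"])
        # Only move forward from "sent"; duplicate acks are harmless.
        if self.status.get(key) == "sent":
            self.status[key] = "delivered"
```

The status stays at "sent" until the client acknowledges, so a push that never reaches the device is not reported as delivered.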
8️⃣ Ordering and Idempotency
Message Ordering
Within a conversation, users expect messages to appear in order.
Options:
- Server timestamp
- Monotonic sequence number per conversation
- Snowflake-style ID
Recommended
Use:
conversation_id + sequence_number
or:
conversation_id + server_timestamp + message_id
Why Sequence Number?
- Clear ordering within one conversation
- Easier pagination
- Avoids clock skew issues
Idempotency
Clients may retry send requests.
Use:
sender_id + client_message_id
to deduplicate retries.
👉 Interview Answer
I would assign ordering on the server side.
For each conversation, we can use a monotonically increasing sequence number or a server-generated timestamp plus message ID.
To handle client retries, I would use clientMessageId for idempotency, so the same message is not stored multiple times.
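A minimal sketch of server-side ordering combined with retry deduplication. It assumes a single process holding everything in memory; `ConversationLog` is a hypothetical name, and a real deployment would need an atomic per-conversation counter and a unique constraint on the idempotency key.

```python
class ConversationLog:
    def __init__(self):
        self.last_seq = {}   # conversation_id -> last assigned sequence number
        self.by_key = {}     # (sender_id, client_message_id) -> stored message
        self.messages = []

    def append(self, conversation_id, sender_id, client_message_id, content):
        key = (sender_id, client_message_id)
        existing = self.by_key.get(key)
        if existing is not None:
            # Retry of an already stored message: return the original result.
            return existing
        seq = self.last_seq.get(conversation_id, 0) + 1
        self.last_seq[conversation_id] = seq
        message = {
            "conversationId": conversation_id,
            "sequenceNumber": seq,
            "senderId": sender_id,
            "clientMessageId": client_message_id,
            "content": content,
        }
        self.by_key[key] = message
        self.messages.append(message)
        return message
```

A retried request gets back the same sequence number as the first attempt, so the client can treat the retry response exactly like the original one.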
9️⃣ Read Receipts and Delivery Status
Status Types
- Sent: stored by server
- Delivered: pushed to recipient device
- Read: recipient opened conversation
1:1 Chat
Read receipt can be simple:
last_read_message_id per user per conversation
Group Chat
Avoid storing one row per message per user if group is large.
Use:
conversation_participant.last_read_message_id
Then compute read state based on message order.
👉 Interview Answer
For read receipts, I would store each participant’s lastReadMessageId per conversation.
This is more scalable than storing read status for every message-user pair, especially in group chats.
Delivery status can be eventually consistent, because it is less critical than message durability.
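Computing read state from a last-read pointer can be sketched as below. One assumption is made loudly: the pointer is compared as a per-conversation sequence number rather than a raw message ID, since the comparison needs a total order. The function names are hypothetical.

```python
def mark_read(last_read, user_id, seq):
    """Advance a user's pointer; it never moves backwards on stale updates."""
    last_read[user_id] = max(last_read.get(user_id, 0), seq)

def readers_of(message_seq, last_read, sender_id):
    """Users (other than the sender) whose pointer is at or past this message."""
    return sorted(
        user for user, seq in last_read.items()
        if user != sender_id and seq >= message_seq
    )

def unread_count(latest_seq, last_read_seq):
    """Messages after the pointer, assuming sequence numbers have no gaps."""
    return max(0, latest_seq - last_read_seq)
```

Storage is one value per participant per conversation, regardless of how many messages the group exchanges.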
🔟 Online Presence and Typing Indicators
Presence
Presence tracks whether user is online.
user_id → active connection IDs
Stored in:
- Redis
- In-memory connection registry
- Distributed presence service
Typing Indicator
Typing events should be ephemeral.
typing_start
typing_stop
Do not store durably.
👉 Interview Answer
Presence and typing indicators are ephemeral state.
I would keep them in memory or Redis with short TTLs, rather than storing them durably.
If these signals are lost, it is acceptable, because they are not critical data.
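A minimal sketch of TTL-based presence, assuming an in-process dictionary instead of Redis. `PresenceRegistry` and the heartbeat-driven refresh are assumptions for illustration; the clock is injectable so expiry can be exercised without sleeping.

```python
import time

class PresenceRegistry:
    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires = {}   # (user_id, connection_id) -> expiry time

    def heartbeat(self, user_id, connection_id):
        # Each heartbeat extends the entry; a silent connection simply expires.
        self._expires[(user_id, connection_id)] = self._clock() + self._ttl

    def active_connections(self, user_id):
        now = self._clock()
        return sorted(
            conn for (user, conn), expiry in self._expires.items()
            if user == user_id and expiry > now
        )

    def is_online(self, user_id):
        return bool(self.active_connections(user_id))
```

No explicit disconnect handling is required for correctness here: if a gateway crashes, its entries age out after one TTL.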
1️⃣1️⃣ Caching Strategy
What to Cache?
- Recent messages per conversation
- Conversation metadata
- User metadata
- Participant list
- Presence state
- Unread counts
Cache Layers
- Local service cache
- Redis / Memcached
- CDN for media
Cache Challenges
- Message edits or deletes
- Read status updates
- Group membership changes
- Multi-device sync
👉 Interview Answer
I would cache recent messages, conversation metadata, participant lists, and unread counts.
Most users open recent conversations, so caching recent message history can significantly reduce latency.
Media files should be stored separately and served through a CDN.
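A recent-message cache can be sketched as a bounded buffer per conversation with least-recently-used eviction across conversations. This is an in-process illustration only; `RecentMessageCache` and its size limits are hypothetical, and a shared cache such as Redis would replace the dictionary in practice.

```python
from collections import OrderedDict, deque

class RecentMessageCache:
    def __init__(self, per_conversation=50, max_conversations=1000):
        self._per = per_conversation
        self._max = max_conversations
        self._convs = OrderedDict()   # conversation_id -> deque of messages

    def append(self, conversation_id, message):
        buf = self._convs.get(conversation_id)
        if buf is None:
            buf = deque(maxlen=self._per)   # oldest messages fall off automatically
            self._convs[conversation_id] = buf
        buf.append(message)
        self._convs.move_to_end(conversation_id)
        while len(self._convs) > self._max:
            self._convs.popitem(last=False)  # evict least recently used conversation

    def recent(self, conversation_id):
        buf = self._convs.get(conversation_id)
        if buf is None:
            return None                      # cache miss: caller reads from storage
        self._convs.move_to_end(conversation_id)
        return list(buf)

    def invalidate(self, conversation_id):
        # Simplest response to an edit or delete: drop the entry and refill later.
        self._convs.pop(conversation_id, None)
```

Invalidating the whole conversation on edits or deletes trades some extra reads for never serving stale content.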
1️⃣2️⃣ Scaling Patterns
Pattern 1: WebSocket Gateway Layer
Keep WebSocket connections separate from business logic.
Client
→ WebSocket Gateway
→ Delivery Service
→ Message Service
Pattern 2: Message Queue for Delivery
Use async delivery:
message-created event → queue → delivery workers
Pattern 3: Shard Messages by Conversation ID
Good for:
- Conversation history queries
- Append-heavy writes
- Pagination
Pattern 4: Shard Connections by User ID
WebSocket gateway needs to know:
user_id → connection_id → gateway_node
Pattern 5: Separate Hot and Cold Storage
- Recent messages in fast storage
- Old messages in cheaper storage
Pattern 6: Multi-device Sync
A user may have:
- Mobile app
- Web app
- Desktop app
Each device may need delivery and sync state.
👉 Interview Answer
To scale the chat system, I would separate WebSocket gateways from message services, use queues for async delivery, shard messages by conversation ID, and keep connection routing state in a presence service.
I would also separate recent hot messages from older cold history to optimize cost and performance.
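Two of the patterns above, sharding by conversation ID and connection routing, can be sketched together. The hash-modulo shard function and the `ConnectionRouter` class are assumptions for illustration; production systems often prefer consistent hashing so that changing the shard count moves fewer keys.

```python
import hashlib

def shard_for(conversation_id, num_shards):
    """Map a conversation to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

class ConnectionRouter:
    """user_id -> connection_id -> gateway_node, kept in memory for the sketch."""
    def __init__(self):
        self._routes = {}

    def register(self, user_id, connection_id, gateway_node):
        self._routes.setdefault(user_id, {})[connection_id] = gateway_node

    def unregister(self, user_id, connection_id):
        conns = self._routes.get(user_id, {})
        conns.pop(connection_id, None)
        if not conns:
            self._routes.pop(user_id, None)

    def gateways_for(self, user_id):
        # A user with several devices may be attached to several gateway nodes.
        return sorted(set(self._routes.get(user_id, {}).values()))
```

Python's built-in `hash()` is avoided on purpose: it is randomized per process for strings, so different nodes would disagree on the shard.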
1️⃣3️⃣ Trade-offs
WebSocket vs Polling
| Strategy | Pros | Cons |
|---|---|---|
| Polling | Simple | Higher latency and waste |
| Long polling | Better compatibility | More overhead |
| WebSocket | Low latency | Harder to scale |
Durability vs Latency
- Store before deliver → safer but slightly slower
- Deliver before store → faster but risky
Recommended:
store first, then deliver
Consistency vs Availability
- Message storage needs strong durability
- Delivery status can be eventually consistent
- Presence can be best-effort
Per-message Status vs Last-read Pointer
- Per-message status is detailed but expensive
- Last-read pointer is compact and scalable
👉 Interview Answer
The core trade-offs are latency, durability, and consistency.
For message content, I prioritize durability, so messages should be stored before delivery.
For delivery status, read receipts, typing indicators, and presence, eventual consistency or best-effort delivery is acceptable.
1️⃣4️⃣ Failure Handling
Common Failures
- WebSocket connection drops
- Message service failure
- Queue backlog
- Duplicate send requests
- Client offline
- Push notification failure
- Message delivered but ack lost
Strategies
- Client reconnect and resume from last seen message
- Use idempotency key for message send
- Retry delivery events
- Store messages durably
- Use dead-letter queue
- Periodic sync for missed messages
- Push notification best effort
👉 Interview Answer
The system should assume connections are unstable.
Clients should reconnect and sync from the last seen message.
The server should use idempotency keys to handle retries and store messages durably before delivery.
Even if real-time delivery fails, users can still recover messages from history.
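Retrying delivery events with a dead-letter queue can be sketched as below. `deliver_with_retry` is a hypothetical helper, `send` is any callable that raises `ConnectionError` on failure, and the backoff between attempts that a real worker would use is omitted to keep the example short.

```python
def deliver_with_retry(event, send, max_attempts, dead_letters):
    """Try to deliver an event; park it in dead_letters if every attempt fails.

    Returns the attempt number that succeeded, or None if the event was parked.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send(event)
            return attempt
        except ConnectionError:
            # Transient failure: fall through and try again.
            continue
    # Exhausted retries. The message itself is still safe in durable storage,
    # so the recipient can recover it through history sync.
    dead_letters.append(event)
    return None
```

Because the message was persisted before delivery was attempted, a dead-lettered event affects only real-time delivery, not the message itself.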
1️⃣5️⃣ Consistency Model
Stronger Consistency Needed For
- Message persistence
- Conversation membership validation
- Permission checks
- Message deletion rules
Eventual Consistency Acceptable For
- Delivery status
- Read receipts
- Typing indicators
- Online presence
- Unread counts
- Push notifications
👉 Interview Answer
I would require strong durability for message storage, because users should not lose sent messages.
But many chat features, such as delivery status, read receipts, typing indicators, presence, and unread counts, can be eventually consistent or best effort.
1️⃣6️⃣ End-to-End Flow
Send Message Flow
Client sends message
→ Message API validates user
→ Store message durably
→ Publish message-created event
→ Delivery service pushes to online users
→ Push notification sent to offline users
→ Sender receives ack
Receive Message Flow
Message event created
→ Delivery service checks recipient connection
→ WebSocket Gateway pushes message
→ Client sends ack
→ Delivery status updated
Offline Sync Flow
User opens app
→ Client sends last seen message
→ Server returns missed messages
→ Client updates local state
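The offline sync flow can be sketched as a server-side query for everything after the client's last seen position, plus a client-side merge that tolerates duplicates. The use of a per-conversation `sequenceNumber` as the "last seen" marker and both function names are assumptions for illustration.

```python
def sync_missed(messages, conversation_id, last_seen_seq, limit=100):
    """Server side: messages after the client's last seen sequence, oldest first."""
    missed = sorted(
        (m for m in messages
         if m["conversationId"] == conversation_id
         and m["sequenceNumber"] > last_seen_seq),
        key=lambda m: m["sequenceNumber"],
    )
    return missed[:limit]

def apply_sync(local_state, missed):
    """Client side: merge by messageId so redelivered messages are not duplicated."""
    for message in missed:
        local_state[message["messageId"]] = message
    # The new "last seen" position is the highest sequence the client holds.
    return max((m["sequenceNumber"] for m in local_state.values()), default=0)
```

The same merge works whether a message first arrived over WebSocket or through sync, which keeps the reconnect path simple.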
Key Insight
Chat systems are not only real-time systems — they are durable messaging systems with real-time delivery on top.
🧠 Staff-Level Answer (Final)
👉 Interview Answer (Full Version)
When designing a chat system, I think of it as a durable messaging system with real-time delivery on top.
The system has three core flows: sending messages, receiving messages in real time, and loading conversation history.
I would use WebSocket for online real-time delivery, push notifications for offline users, and HTTP APIs for sending messages, loading history, and updating read status.
Messages should be stored durably before delivery, because message loss is unacceptable.
After persistence, the message service publishes an event, and delivery workers push the message to online users through WebSocket gateways.
For offline users, the message remains in storage, and the client can sync missed messages later.
I would partition messages by conversation ID, because message history is usually queried per conversation.
For ordering, I would use server-side ordering, such as a per-conversation sequence number.
For idempotency, I would use clientMessageId to deduplicate retry requests.
Delivery status, read receipts, typing indicators, presence, and unread counts can be eventually consistent.
The main trade-offs are latency, durability, connection scalability, and consistency.
Ultimately, the goal is to provide reliable message storage and low-latency delivery across online, offline, and multi-device users.
⭐ Final Insight
A chat system is not just WebSocket — it is durable message storage plus real-time delivery and offline synchronization.