System Design Deep Dive - 04 Design Chat System

Posted by ailswan, April 28, 2026


🎯 Design Chat System


1️⃣ Core Framework

When discussing Chat System design, I frame it as:

  1. Core user flows: send message, receive message, read history
  2. Communication model: WebSocket vs polling vs push notification
  3. Data model: users, conversations, messages, participants
  4. Message delivery pipeline
  5. Online / offline handling
  6. Ordering, delivery status, and read receipts
  7. Scaling, caching, and storage strategy
  8. Trade-offs: latency vs consistency vs durability
  9. Failure handling and reliability

2️⃣ Core Requirements


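Functional Requirements

Send messages in a conversation
Receive messages in real time
Load conversation history
Support both online and offline users

Non-functional Requirements

Low-latency delivery
Durable message storage
Consistent message ordering within a conversation
Connection management at large scale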


👉 Interview Answer

A chat system has three core flows: sending messages, receiving messages in real time, and loading message history.

The system needs low-latency delivery, durable storage, and support for online and offline users.

The main challenge is balancing real-time communication, message durability, ordering, and large-scale connection management.


3️⃣ Main APIs


Send Message

POST /api/messages

Request:

{
  "conversationId": "c123",
  "senderId": "u1",
  "content": "hello",
  "messageType": "text",
  "clientMessageId": "local-abc-123"
}

Response:

{
  "messageId": "m789",
  "serverTimestamp": "2026-05-02T10:00:00Z",
  "status": "sent"
}

Get Conversation History

GET /api/conversations/{conversationId}/messages?cursor=xxx&limit=50
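
A minimal sketch of how the cursor in this endpoint could work, assuming each message carries a per-conversation sequence number. The names `Message`, `seq`, and `get_history` are hypothetical and only illustrate the idea; a real store would run this as a range query on the conversation partition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    conversation_id: str
    seq: int        # hypothetical per-conversation sequence number
    content: str


def get_history(messages, conversation_id, cursor=None, limit=50):
    """Return (page, next_cursor), newest first.

    The cursor is the seq of the oldest message on the previous page,
    so the next page contains only strictly older messages.
    """
    in_conv = [m for m in messages if m.conversation_id == conversation_id]
    in_conv.sort(key=lambda m: m.seq, reverse=True)
    if cursor is not None:
        in_conv = [m for m in in_conv if m.seq < cursor]
    page = in_conv[:limit]
    # Hand out a cursor only when older messages remain.
    next_cursor = page[-1].seq if len(in_conv) > limit else None
    return page, next_cursor
```

Because the cursor is a position in the conversation rather than an offset, new messages arriving while the user scrolls back do not shift or duplicate pages.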

WebSocket Receive

wss://chat.example.com/connect?userId=u1

Server pushes:

{
  "eventType": "message_created",
  "conversationId": "c123",
  "messageId": "m789",
  "senderId": "u1",
  "content": "hello",
  "serverTimestamp": "2026-05-02T10:00:00Z"
}

Mark Message Read

POST /api/conversations/{conversationId}/read

Request:

{
  "userId": "u2",
  "lastReadMessageId": "m789"
}

👉 Interview Answer

I would use HTTP APIs for sending messages, loading history, and updating read status.

For real-time delivery, I would use WebSocket connections so the server can push new messages to online users immediately.


4️⃣ Data Model


User Table

user (
  user_id VARCHAR PRIMARY KEY,
  username VARCHAR,
  created_at TIMESTAMP
)

Conversation Table

conversation (
  conversation_id VARCHAR PRIMARY KEY,
  type VARCHAR, -- one_to_one, group
  created_at TIMESTAMP
)

Conversation Participant Table

conversation_participant (
  conversation_id VARCHAR,
  user_id VARCHAR,
  role VARCHAR,
  joined_at TIMESTAMP,
  last_read_message_id VARCHAR,
  PRIMARY KEY (conversation_id, user_id)
)

Message Table

message (
  conversation_id VARCHAR,
  message_id VARCHAR,
  sender_id VARCHAR,
  content TEXT,
  message_type VARCHAR,
  created_at TIMESTAMP,
  client_message_id VARCHAR,
  status VARCHAR,
  PRIMARY KEY (conversation_id, created_at, message_id)
)

Delivery Status Table

message_delivery_status (
  message_id VARCHAR,
  user_id VARCHAR,
  status VARCHAR, -- sent, delivered, read
  updated_at TIMESTAMP,
  PRIMARY KEY (message_id, user_id)
)

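Why Partition Messages by Conversation?

Message history is usually read one conversation at a time
The message table is append-heavy
Conversation-based partitioning keeps history pagination efficient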


👉 Interview Answer

I would store messages partitioned by conversation ID, because message history is usually read per conversation.

The message table is append-heavy, and conversation-based partitioning makes history pagination efficient.

I would store read status separately, because delivery and read receipts change more frequently than message content.


5️⃣ Communication Model


Option 1: Polling

The client repeatedly asks the server for new messages.

client → server every few seconds

Pros

Simple to implement

Cons

Higher latency and wasted requests when nothing is new


Option 2: Long Polling

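The client sends a request and the server holds it open until it has updates.

Pros

Better compatibility

Cons

Connection overhead is still relatively high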


Option 3: WebSocket

Persistent bidirectional connection.

client ↔ WebSocket Gateway ↔ Chat Service

Pros

Low latency

Cons

Harder to scale


Recommended

Use:

WebSocket for online delivery
Push notification for offline users
HTTP for history and state updates

👉 Interview Answer

I would use WebSocket for real-time delivery, because chat requires low-latency server-to-client updates.

For offline users, I would use push notifications.

HTTP APIs would still be used for message history, sending messages, and read status updates.


6️⃣ Message Send Flow


Basic Flow

  1. Client sends message with clientMessageId
  2. API service authenticates user
  3. Validate conversation membership
  4. Generate server messageId
  5. Store message durably
  6. Publish message-created event
  7. Delivery service pushes to online participants
  8. Push notification service notifies offline users
  9. Return acknowledgement to sender (this can be sent as soon as step 5 succeeds; it does not need to wait for delivery)

Event Pipeline

Message API
→ Message Store
→ Message Event Queue
→ Delivery Service
→ WebSocket Gateway / Push Notification

Why Store Before Deliver?

Because message durability matters.

If the system delivers first but fails to store, users may see messages that disappear later.


👉 Interview Answer

I would store the message durably before delivering it.

After the message is persisted, I would publish a message-created event.

Delivery workers then push the message to online users through WebSocket and notify offline users through push notifications.

This ensures the message is not lost even if delivery fails temporarily.
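
The store-then-publish order can be sketched as follows. This is an in-memory illustration under stated assumptions, not a production design: `MessageService`, its dict store, and the `queue.Queue` are hypothetical stand-ins for the Message API, the durable message store, and the message event queue.

```python
import itertools
import queue
from datetime import datetime, timezone


class MessageService:
    """In-memory stand-in for the Message API, store, and event queue."""

    def __init__(self, members):
        self.members = members          # conversation_id -> set of user_ids
        self.store = {}                 # message_id -> message (stands in for durable storage)
        self.events = queue.Queue()     # stands in for the message event queue
        self._ids = itertools.count(1)

    def send(self, conversation_id, sender_id, content, client_message_id):
        # Validate conversation membership before accepting the message.
        if sender_id not in self.members.get(conversation_id, set()):
            raise PermissionError("sender is not a participant")

        message = {
            "messageId": f"m{next(self._ids)}",
            "conversationId": conversation_id,
            "senderId": sender_id,
            "content": content,
            "clientMessageId": client_message_id,
            "serverTimestamp": datetime.now(timezone.utc).isoformat(),
        }
        # Persist first, so a later delivery failure cannot lose the message.
        self.store[message["messageId"]] = message
        # Only then publish the event that delivery workers consume.
        self.events.put({"eventType": "message_created", **message})
        # Acknowledge the sender; delivery continues asynchronously.
        return {
            "messageId": message["messageId"],
            "serverTimestamp": message["serverTimestamp"],
            "status": "sent",
        }
```

If the process stops between the store write and the publish, the message is still in storage and the recipient can recover it through history sync, which is the property this ordering is meant to protect.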


7️⃣ Message Receive Flow


Online User

message event
→ delivery service
→ find user's active WebSocket connection
→ push message to client
→ client sends ack
→ update delivered status

Offline User

message event
→ user is offline
→ message is already in durable storage
→ send push notification
→ user later opens app
→ client syncs unread messages

Client Ack

The client should acknowledge each received message:

{
  "eventType": "message_ack",
  "messageId": "m789",
  "userId": "u2"
}

👉 Interview Answer

For online users, the delivery service pushes messages through active WebSocket connections.

For offline users, the message is already stored, so the system sends push notifications and the client later syncs unread messages from storage.

Client acknowledgements can be used to update delivery status.
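
A rough sketch of the online and offline branches, assuming an in-memory registry. `DeliveryService`, the list used as a per-connection outbox, and the `push_notifications` list are hypothetical stand-ins for real WebSocket connections and a push notification provider.

```python
class DeliveryService:
    """Routes message events to online connections or to push notifications."""

    def __init__(self):
        self.connections = {}         # user_id -> list of outboxes (one per active connection)
        self.push_notifications = []  # (user_id, message_id) pairs handed to the push provider
        self.delivered = set()        # (message_id, user_id) pairs confirmed by client acks

    def connect(self, user_id):
        # Each outbox list stands in for one WebSocket connection.
        outbox = []
        self.connections.setdefault(user_id, []).append(outbox)
        return outbox

    def deliver(self, event, recipients):
        for user_id in recipients:
            if user_id == event["senderId"]:
                continue  # assumption: the sender already has the message locally
            outboxes = self.connections.get(user_id)
            if outboxes:
                # Online: push to every active connection of this user.
                for outbox in outboxes:
                    outbox.append(event)
            else:
                # Offline: the message is already stored, so only notify.
                self.push_notifications.append((user_id, event["messageId"]))

    def handle_ack(self, ack):
        # A client ack moves the message to "delivered" for that user.
        if ack["eventType"] == "message_ack":
            self.delivered.add((ack["messageId"], ack["userId"]))
```

The delivered set is only updated on ack, so a push that never reaches the client leaves the message in the "sent" state and the client picks it up on its next sync.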


8️⃣ Ordering and Idempotency


Message Ordering

Within a conversation, users expect messages to appear in order.

Options:


Use:

conversation_id + sequence_number

or:

conversation_id + server_timestamp + message_id

Why Sequence Number?


Idempotency

Clients may retry send requests.

Use:

sender_id + client_message_id

to deduplicate retries.


👉 Interview Answer

I would assign ordering on the server side.

For each conversation, we can use a monotonically increasing sequence number or a server-generated timestamp plus message ID.

To handle client retries, I would use clientMessageId for idempotency, so the same message is not stored multiple times.
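
Both ideas fit in one small sketch: the server assigns a per-conversation sequence number, and a retry with the same sender_id + client_message_id returns the message that was already stored. `ConversationLog` and its fields are hypothetical names, and the in-memory dicts stand in for what would be a unique constraint and a counter in a real database.

```python
from collections import defaultdict


class ConversationLog:
    """Assigns per-conversation sequence numbers and deduplicates retries."""

    def __init__(self):
        self._last_seq = defaultdict(int)    # conversation_id -> last assigned sequence number
        self._by_client_key = {}             # (sender_id, client_message_id) -> stored message
        self.messages = defaultdict(list)    # conversation_id -> messages in sequence order

    def append(self, conversation_id, sender_id, client_message_id, content):
        key = (sender_id, client_message_id)
        existing = self._by_client_key.get(key)
        if existing is not None:
            # Retry of a message that was already stored: return it unchanged.
            return existing

        self._last_seq[conversation_id] += 1
        message = {
            "conversationId": conversation_id,
            "sequenceNumber": self._last_seq[conversation_id],
            "senderId": sender_id,
            "clientMessageId": client_message_id,
            "content": content,
        }
        self.messages[conversation_id].append(message)
        self._by_client_key[key] = message
        return message
```

The key includes sender_id because client_message_id is generated on the device, so two different senders can legitimately produce the same value.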


9️⃣ Read Receipts and Delivery Status


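Status Types

sent: stored by the server
delivered: acknowledged by the recipient's client
read: covered by the recipient's last-read pointer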


1:1 Chat

Read receipt can be simple:

last_read_message_id per user per conversation

Group Chat

Avoid storing one row per message per user if the group is large.

Use:

conversation_participant.last_read_message_id

Then compute read state from message order: a participant has read every message at or before their last_read_message_id.


👉 Interview Answer

For read receipts, I would store each participant’s lastReadMessageId per conversation.

This is more scalable than storing read status for every message-user pair, especially in group chats.

Delivery status can be eventually consistent, because it is less critical than message durability.
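
A small sketch of deriving read state from last-read pointers. It assumes each last_read_message_id can be resolved to a per-conversation sequence number; the functions `readers_of` and `unread_count` are hypothetical helpers, not part of any API described above.

```python
def readers_of(message_seq, last_read_by_user, sender_id):
    """Users (other than the sender) whose last-read pointer covers this message."""
    return sorted(
        user_id
        for user_id, last_read_seq in last_read_by_user.items()
        if user_id != sender_id and last_read_seq >= message_seq
    )


def unread_count(latest_seq, last_read_seq):
    """Messages after the user's pointer; never negative if the pointer is ahead."""
    return max(0, latest_seq - last_read_seq)
```

For example, with pointers `{"u1": 10, "u2": 7, "u3": 9}` and a message at sequence 8 sent by u1, only u3 has read it, and u2 has three unread messages when the latest sequence is 10. Storage grows with the number of participants rather than with messages times participants.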


🔟 Online Presence and Typing Indicators


Presence

Presence tracks whether a user is online.

user_id → active connection IDs

Stored in:

Memory or Redis, with a short TTL


Typing Indicator

Typing events should be ephemeral.

typing_start
typing_stop

Do not store durably.


👉 Interview Answer

Presence and typing indicators are ephemeral state.

I would keep them in memory or Redis with short TTLs, rather than storing them durably.

If these signals are lost, it is acceptable, because they are not critical data.
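
A minimal in-memory sketch of TTL-based presence, assuming clients send periodic heartbeats. `PresenceRegistry` and its injectable `clock` are hypothetical; with Redis the same effect would usually come from setting a key with an expiry on each heartbeat.

```python
import time


class PresenceRegistry:
    """Tracks active connections; an entry expires unless a heartbeat renews it."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._expires_at = {}   # (user_id, connection_id) -> expiry time

    def heartbeat(self, user_id, connection_id):
        # Each heartbeat pushes the expiry forward by one TTL.
        self._expires_at[(user_id, connection_id)] = self.clock() + self.ttl

    def active_connections(self, user_id):
        now = self.clock()
        return sorted(
            conn
            for (uid, conn), expires_at in self._expires_at.items()
            if uid == user_id and expires_at > now
        )

    def is_online(self, user_id):
        return bool(self.active_connections(user_id))
```

Nothing is written durably: if the registry is lost, users simply appear offline until their next heartbeat, which matches the best-effort nature of presence. Expired entries are only filtered on read here; a longer-lived process would also prune them.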


1️⃣1️⃣ Caching Strategy


What to Cache?

Recent messages per conversation
Conversation metadata
Participant lists
Unread counts


Cache Layers


Cache Challenges


👉 Interview Answer

I would cache recent messages, conversation metadata, participant lists, and unread counts.

Most users open recent conversations, so caching recent message history can significantly reduce latency.

Media files should be stored separately and served through a CDN.
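
One way to picture the recent-message cache is a bounded window per conversation with least-recently-used eviction across conversations. This is a sketch under assumptions: `RecentMessageCache` and its limits are hypothetical, and a shared cache such as Redis would normally replace the in-process dict.

```python
from collections import OrderedDict, deque


class RecentMessageCache:
    """Keeps the newest N messages for the most recently used conversations."""

    def __init__(self, per_conversation=50, max_conversations=1000):
        self.per_conversation = per_conversation
        self.max_conversations = max_conversations
        self._cache = OrderedDict()   # conversation_id -> deque of recent messages

    def put(self, conversation_id, messages):
        # Populate after a cache miss with messages loaded from storage (oldest first).
        self._cache[conversation_id] = deque(messages, maxlen=self.per_conversation)
        self._cache.move_to_end(conversation_id)
        while len(self._cache) > self.max_conversations:
            self._cache.popitem(last=False)   # evict the least recently used conversation

    def append(self, conversation_id, message):
        # Update only conversations that are already cached; otherwise the
        # window would hold one new message and look like complete history.
        window = self._cache.get(conversation_id)
        if window is not None:
            window.append(message)
            self._cache.move_to_end(conversation_id)

    def recent(self, conversation_id):
        window = self._cache.get(conversation_id)
        if window is None:
            return None   # miss: the caller reads from storage, then calls put()
        self._cache.move_to_end(conversation_id)
        return list(window)
```

The design choice worth noting is that a new message never creates a cache entry on its own; entries are created only from a storage read, so a hit always reflects a complete recent window.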


1️⃣2️⃣ Scaling Patterns


Pattern 1: WebSocket Gateway Layer

Keep WebSocket connections separate from business logic.

Client
→ WebSocket Gateway
→ Delivery Service
→ Message Service

Pattern 2: Message Queue for Delivery

Use async delivery:

message-created event → queue → delivery workers

Pattern 3: Shard Messages by Conversation ID

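Good for:

Loading and paginating history for one conversation
Keeping each conversation's messages together on one shard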


Pattern 4: Shard Connections by User ID

WebSocket gateway needs to know:

user_id → connection_id → gateway_node

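Pattern 5: Separate Hot and Cold Storage

Keep recent, frequently read messages in hot storage and move older history to cheaper cold storage.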


Pattern 6: Multi-device Sync

A user may be signed in on several devices at the same time.

Each device may need its own delivery and sync state.


👉 Interview Answer

To scale the chat system, I would separate WebSocket gateways from message services, use queues for async delivery, shard messages by conversation ID, and keep connection routing state in a presence service.
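
The connection routing state can be sketched as a two-level map, user_id → connection_id → gateway_node, which also covers a user with several devices. `ConnectionRouter` and `fanout_plan` are hypothetical names; in practice this state would live in the presence service rather than in one process.

```python
from collections import defaultdict


class ConnectionRouter:
    """Maps each user's connections to the gateway node that holds them."""

    def __init__(self):
        self._routes = defaultdict(dict)   # user_id -> {connection_id: gateway_node}

    def register(self, user_id, connection_id, gateway_node):
        self._routes[user_id][connection_id] = gateway_node

    def unregister(self, user_id, connection_id):
        connections = self._routes.get(user_id)
        if connections is None:
            return
        connections.pop(connection_id, None)
        if not connections:
            del self._routes[user_id]   # last device disconnected

    def fanout_plan(self, user_ids):
        """Group target connections by gateway node, one batch per node."""
        plan = defaultdict(list)
        for user_id in user_ids:
            for connection_id, node in self._routes.get(user_id, {}).items():
                plan[node].append(connection_id)
        return dict(plan)
```

Grouping by gateway node means the delivery service sends one request per node instead of one per connection, and users with no registered connection drop out of the plan so they can be handled by push notifications.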

I would also separate recent hot messages from older cold history to optimize cost and performance.


1️⃣3️⃣ Trade-offs


WebSocket vs Polling

Strategy     | Pros                 | Cons
Polling      | Simple               | Higher latency and waste
Long polling | Better compatibility | More overhead
WebSocket    | Low latency          | Harder to scale

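Durability vs Latency

Storing before delivering adds a durable write to the send path, but it means a delivered message can never disappear afterwards.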

Recommended:

store first, then deliver

Consistency vs Availability


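Per-message Status vs Last-read Pointer

A status row for every message-user pair is precise but grows with messages × participants. A last-read pointer per participant is much smaller, and read state is derived from message order.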


👉 Interview Answer

The core trade-offs are latency, durability, and consistency.

For message content, I prioritize durability, so messages should be stored before delivery.

For delivery status, read receipts, typing indicators, and presence, eventual consistency or best-effort delivery is acceptable.


1️⃣4️⃣ Failure Handling


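Common Failures

WebSocket connection drops
Client retries a send request
Real-time delivery fails temporarily

Strategies

Client reconnects and syncs from the last seen message
Idempotency keys deduplicate retries
Messages are stored durably before delivery, so they can be recovered from history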


👉 Interview Answer

The system should assume connections are unstable.

Clients should reconnect and sync from the last seen message.

The server should use idempotency keys to handle retries and store messages durably before delivery.

Even if real-time delivery fails, users can still recover messages from history.
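
A sketch of the client side of this recovery, assuming messages carry a per-conversation sequenceNumber. `backoff_delays` and `sync_missed` are hypothetical helpers; the backoff uses exponential growth with full jitter, which is one common choice rather than something the design above prescribes.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=random.random):
    """Reconnect delays: exponential growth, capped, scaled by a random factor."""
    return [rng() * min(cap, base * 2 ** attempt) for attempt in range(attempts)]


def sync_missed(server_log, last_seen_seq, local_messages):
    """Merge messages newer than last_seen_seq into local state.

    Running it twice gives the same result, so a retried sync is harmless.
    """
    known = {m["sequenceNumber"] for m in local_messages}
    missed = [
        m for m in server_log
        if m["sequenceNumber"] > last_seen_seq and m["sequenceNumber"] not in known
    ]
    return sorted(local_messages + missed, key=lambda m: m["sequenceNumber"])
```

The jitter spreads reconnects out after a gateway restart so that clients do not all return at the same instant, and the sync step is what makes a dropped real-time push recoverable.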


1️⃣5️⃣ Consistency Model


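Stronger Consistency Needed For

Message storage: a sent message must not be lost

Eventual Consistency Acceptable For

Delivery status
Read receipts
Typing indicators
Presence
Unread counts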


👉 Interview Answer

I would require strong durability for message storage, because users should not lose sent messages.

But many chat features, such as delivery status, read receipts, typing indicators, presence, and unread counts, can be eventually consistent or best effort.


1️⃣6️⃣ End-to-End Flow


Send Message Flow

Client sends message
→ Message API validates user
→ Store message durably
→ Publish message-created event
→ Delivery service pushes to online users
→ Push notification sent to offline users
→ Sender receives ack (sent once the message is stored, without waiting for delivery)

Receive Message Flow

Message event created
→ Delivery service checks recipient connection
→ WebSocket Gateway pushes message
→ Client sends ack
→ Delivery status updated

Offline Sync Flow

User opens app
→ Client sends last seen message
→ Server returns missed messages
→ Client updates local state

Key Insight

Chat systems are not only real-time systems — they are durable messaging systems with real-time delivery on top.


🧠 Staff-Level Answer (Final)


👉 Interview Answer (Full Version)

When designing a chat system, I think of it as a durable messaging system with real-time delivery on top.

The system has three core flows: sending messages, receiving messages in real time, and loading conversation history.

I would use WebSocket for online real-time delivery, push notifications for offline users, and HTTP APIs for sending messages, loading history, and updating read status.

Messages should be stored durably before delivery, because message loss is unacceptable.

After persistence, the message service publishes an event, and delivery workers push the message to online users through WebSocket gateways.

For offline users, the message remains in storage, and the client can sync missed messages later.

I would partition messages by conversation ID, because message history is usually queried per conversation.

For ordering, I would use server-side ordering, such as a per-conversation sequence number.

For idempotency, I would use clientMessageId to deduplicate retry requests.

Delivery status, read receipts, typing indicators, presence, and unread counts can be eventually consistent.

The main trade-offs are latency, durability, connection scalability, and consistency.

Ultimately, the goal is to provide reliable message storage and low-latency delivery across online, offline, and multi-device users.


⭐ Final Insight

A chat system is not just WebSocket — it is durable message storage plus real-time delivery and offline synchronization.


