
System Design Deep Dive - 09 How Slack Handles Real-time Messaging

Post by ailswan, May 26, 2026


🎯 How Slack Handles Real-time Messaging


1️⃣ Core Messaging Framework (Staff-Level)

When discussing a Slack-like realtime messaging system, I frame it as:

  1. WebSocket gateway layer
  2. Durable message write path
  3. Channel membership and authorization
  4. Fan-out to online clients
  5. Offline delivery and unread state
  6. Ordering and idempotency
  7. Reconnect and sync recovery
  8. Trade-offs: latency vs durability vs fan-out cost

2️⃣ Core Problem

Slack must make messages feel instant while preserving reliable chat history.

Challenges:


👉 Interview Answer

A Slack-like system combines low-latency socket delivery with a durable message log. The message should be persisted before broad fan-out so reconnects, history, and multi-device sync remain reliable.


3️⃣ High-Level Architecture

Client
  ↓
WebSocket Gateway
  ↓
Message API
  ↓
Auth + Channel Membership Check
  ↓
Message Store
  ↓
Fan-out Queue
  ↓
Presence / Connection Router
  ↓
Online Devices
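The Presence / Connection Router step above can be sketched as an in-memory registry that maps each user to their connected devices. This is a minimal illustration, not Slack's actual implementation: `ConnectionRouter`, its method names, and the `send` callback shape are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical callback type: how a gateway pushes one event down one socket.
Send = Callable[[dict], None]


class ConnectionRouter:
    """Ephemeral map of user -> device -> send callback (assumed shape)."""

    def __init__(self) -> None:
        self._conns: Dict[str, Dict[str, Send]] = defaultdict(dict)

    def register(self, user_id: str, device_id: str, send: Send) -> None:
        # Called when a device opens a socket on some gateway.
        self._conns[user_id][device_id] = send

    def unregister(self, user_id: str, device_id: str) -> None:
        # Called on disconnect; drop the user entry once no devices remain.
        devices = self._conns.get(user_id)
        if devices is None:
            return
        devices.pop(device_id, None)
        if not devices:
            del self._conns[user_id]

    def is_online(self, user_id: str) -> bool:
        return user_id in self._conns

    def deliver(self, member_ids: List[str], event: dict) -> List[str]:
        """Push the event to every connected device of each member.

        Returns the members with no live connection, so the caller can
        rely on unread state and later catch-up for them.
        """
        offline: List[str] = []
        for user_id in member_ids:
            devices = self._conns.get(user_id)
            if not devices:
                offline.append(user_id)
                continue
            for send in list(devices.values()):
                send(event)
        return offline
```

Nothing in this registry is durable. If the process holding it dies, clients reconnect and re-register, while the message store keeps the history.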

4️⃣ Message Write Path

Write flow:

Send message
  ↓
Validate channel membership
  ↓
Assign message ID and sequence
  ↓
Persist message
  ↓
Publish fan-out event
  ↓
Deliver to connected clients

👉 Interview Answer

I would place durability before fan-out. Once the message is committed, delivery can be retried safely, offline users can catch up later, and clients can recover by asking for messages after their last seen sequence.
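The write flow above can be sketched in a few lines. This is a simplified, single-process illustration under stated assumptions: the dictionaries stand in for a durable store and a fan-out queue, and `MessageService`, `NotAMember`, and the field names are hypothetical.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Message:
    message_id: str
    channel_id: str
    sequence: int
    sender_id: str
    text: str


class NotAMember(Exception):
    """Raised when the sender does not belong to the channel."""


class MessageService:
    def __init__(self, members: Dict[str, Set[str]]) -> None:
        self.members = members                    # channel_id -> member ids
        self.log: Dict[str, List[Message]] = {}   # stand-in for the durable store
        self.fanout_queue: List[Message] = []     # stand-in for the fan-out queue
        self._last_seq: Dict[str, int] = {}       # channel_id -> last sequence

    def send(self, channel_id: str, sender_id: str, text: str) -> Message:
        # 1. Validate channel membership before doing any work.
        if sender_id not in self.members.get(channel_id, set()):
            raise NotAMember(f"{sender_id} is not in {channel_id}")

        # 2. Assign a message ID and the next per-channel sequence number.
        seq = self._last_seq.get(channel_id, 0) + 1
        self._last_seq[channel_id] = seq
        msg = Message(str(uuid.uuid4()), channel_id, seq, sender_id, text)

        # 3. Persist first: after this line the message is part of history.
        self.log.setdefault(channel_id, []).append(msg)

        # 4. Only then publish the fan-out event. If this step fails or is
        #    retried, the committed log entry is still the source of truth.
        self.fanout_queue.append(msg)
        return msg
```

A real deployment would make steps 2 and 3 atomic in the store; here they are sequential only because the sketch is single-threaded.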


5️⃣ Fan-out Strategy

Small channels:

Large channels:


6️⃣ Ordering and Idempotency

Ordering is usually scoped to a channel.

Common approach:


👉 Interview Answer

I would guarantee a stable order within a channel, not a global order across the whole system. The server assigns channel sequence numbers, and clients use idempotency keys so retries do not create duplicate messages.
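To make the idempotency part concrete, here is a minimal sketch of an idempotent append to one channel's log. The `ChannelLog` class is hypothetical, and scoping the idempotency key per sender is an assumption chosen for the example, not something the design above prescribes.

```python
from typing import Dict, List, Tuple


class ChannelLog:
    """In-memory stand-in for one channel's durable, ordered log."""

    def __init__(self) -> None:
        self.messages: List[dict] = []
        # (sender_id, idempotency_key) -> already committed message.
        self._by_key: Dict[Tuple[str, str], dict] = {}

    def append(self, sender_id: str, idempotency_key: str, text: str) -> dict:
        key = (sender_id, idempotency_key)
        existing = self._by_key.get(key)
        if existing is not None:
            # A retry of a send that already committed: return the original
            # message with its original sequence instead of writing again.
            return existing

        msg = {
            "sequence": len(self.messages) + 1,  # server-assigned, per channel
            "sender_id": sender_id,
            "text": text,
        }
        self.messages.append(msg)
        self._by_key[key] = msg
        return msg
```

For example, if a client sends with key `k1`, times out, and retries with the same key, both calls return the same message and the channel still contains a single entry.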


7️⃣ Reconnect Recovery

Client stores:

last_seen_channel_sequence

On reconnect:

Client reconnects
  ↓
Gateway authenticates
  ↓
Client sends last seen sequence
  ↓
Server returns missed messages
  ↓
Realtime stream resumes
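The reconnect flow above can be sketched as a catch-up query plus a client that tracks `last_seen_channel_sequence`. This is an illustrative sketch: `messages_after`, `ChannelClient`, and the page size are assumptions, and the log is assumed to be sorted by sequence.

```python
from typing import List


def messages_after(log: List[dict], last_seen: int, limit: int = 100) -> List[dict]:
    """Server side: messages with sequence > last_seen, oldest first."""
    return [m for m in log if m["sequence"] > last_seen][:limit]


class ChannelClient:
    def __init__(self) -> None:
        self.last_seen_channel_sequence = 0
        self.rendered: List[str] = []

    def apply(self, msg: dict) -> bool:
        # Drop anything at or below the cursor, so a message that arrives
        # through both catch-up and the realtime stream is shown once.
        if msg["sequence"] <= self.last_seen_channel_sequence:
            return False
        self.rendered.append(msg["text"])
        self.last_seen_channel_sequence = msg["sequence"]
        return True

    def catch_up(self, log: List[dict], page_size: int = 100) -> None:
        # Page through missed messages until the server has nothing newer.
        while True:
            page = messages_after(log, self.last_seen_channel_sequence, page_size)
            if not page:
                return
            for msg in page:
                self.apply(msg)
```

After `catch_up` returns, the client can resume the realtime stream and keep calling `apply` on each pushed message; the sequence check makes the overlap between the two paths harmless.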

8️⃣ Staff-Level Trade-offs

Decision                     | Benefit               | Cost
WebSockets                   | Low latency           | Connection management cost
Durable write before fan-out | Reliable history      | Slightly higher send latency
Per-channel ordering         | Practical consistency | No global order
Push fan-out                 | Instant delivery      | Expensive for large channels
Pull catch-up                | Efficient recovery    | More client sync logic

9️⃣ Failure Handling

Failures:

Protections:


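Chinese Section

Quick Notes

In One Sentence

Slack messaging is a combination of real-time socket delivery and a durable message log: messages are written durably first, then fanned out.


Key Points to Memorize


Interview Answer

I would break the Slack messaging system into a WebSocket gateway, a message service, a message store, and fan-out workers. Clients hold a long-lived connection through the gateway. When a message is sent, the channel membership check runs first, and then the server assigns a message ID and a channel sequence number. The message must be written to the durable store before the fan-out event is published to online users.

The benefit is that message history stays reliable even if a gateway crashes or fan-out fails. Offline or reconnecting users can call a catch-up API with their last seen sequence to pull the messages they missed. When a client retries a send, it uses an idempotency key to avoid duplicate messages.

The staff-level point is that real-time connection state and durable message state should be kept separate. WebSockets make the experience feel real-time, but the system's reliability comes from the durable message log, idempotent writes, and recoverable fan-out.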


✅ Final Interview Answer

A Slack-like realtime messaging system uses WebSocket gateways for low-latency delivery, but it should persist messages durably before fan-out. The send path validates membership, assigns a message ID and channel sequence, writes the message, and then publishes a fan-out event to online devices. Offline users or reconnecting clients recover by fetching messages after their last seen sequence.

At staff level, the key design is separating durable chat state from ephemeral connection state. Gateways can fail and clients can reconnect, but the message log remains the source of truth. The main trade-off is balancing instant delivery with ordering, idempotency, fan-out cost, and reliable recovery.
