🎯 How Slack Handles Real-time Messaging
1️⃣ Core Messaging Framework (Staff-Level)
When discussing a Slack-like realtime messaging system, I frame it as:
- WebSocket gateway layer
- Durable message write path
- Channel membership and authorization
- Fan-out to online clients
- Offline delivery and unread state
- Ordering and idempotency
- Reconnect and sync recovery
- Trade-offs: latency vs durability vs fan-out cost
2️⃣ Core Problem
Slack must make messages feel instant while preserving reliable chat history.
Challenges:
- many persistent client connections
- multi-device delivery
- large public channels
- message ordering
- reconnect recovery
- offline users
- edits, deletes, reactions, threads
👉 Interview Answer
A Slack-like system combines low-latency socket delivery with a durable message log. The message should be persisted before broad fan-out so reconnects, history, and multi-device sync remain reliable.
3️⃣ High-Level Architecture
Client
↓
WebSocket Gateway
↓
Message API
↓
Auth + Channel Membership Check
↓
Message Store
↓
Fan-out Queue
↓
Presence / Connection Router
↓
Online Devices
4️⃣ Message Write Path
Write flow:
Send message
↓
Validate channel membership
↓
Assign message ID and sequence
↓
Persist message
↓
Publish fan-out event
↓
Deliver to connected clients
👉 Interview Answer
I would place durability before fan-out. Once the message is committed, delivery can be retried safely, offline users can catch up later, and clients can recover by asking for messages after their last seen sequence.
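The write flow above can be sketched in a few lines. This is a minimal in-memory sketch under stated assumptions, not Slack's actual implementation: `MessageService`, `Message`, the `members` map, the `publish` callback, and the `channel:sequence` ID format are all hypothetical names chosen for illustration, and a plain dict stands in for the durable store.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class Message:
    message_id: str
    channel_id: str
    sequence: int
    sender_id: str
    text: str


@dataclass
class MessageService:
    # Hypothetical in-memory stand-ins for the real membership service,
    # fan-out queue, and durable message store.
    members: Dict[str, Set[str]]            # channel_id -> member user IDs
    publish: Callable[[Message], None]      # enqueues a fan-out event
    log: Dict[str, List[Message]] = field(default_factory=dict)

    def send(self, channel_id: str, sender_id: str, text: str) -> Message:
        # 1. Validate channel membership before doing any other work.
        if sender_id not in self.members.get(channel_id, set()):
            raise PermissionError(f"{sender_id} is not a member of {channel_id}")

        # 2. Assign a per-channel sequence and a message ID.
        #    A real store would assign the sequence atomically.
        channel_log = self.log.setdefault(channel_id, [])
        sequence = len(channel_log) + 1
        message = Message(f"{channel_id}:{sequence}", channel_id, sequence, sender_id, text)

        # 3. Persist first: only a committed message is eligible for fan-out.
        channel_log.append(message)

        # 4. Publish the fan-out event. If this step fails, the message is
        #    already in the log, so clients can still recover it via catch-up.
        self.publish(message)
        return message
```

A caller would construct the service with something like `MessageService(members={"c1": {"alice"}}, publish=queue.append)`. The ordering of steps 3 and 4 is the point of the sketch: fan-out never sees a message that is not already in the log.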
5️⃣ Fan-out Strategy
Small channels:
- fan out to all connected members directly
Large channels:
- partition fan-out
- batch deliveries
- avoid per-user expensive work on the hot path
- use pull-based catch-up for clients that do not need an immediate push (for example, clients not currently viewing the channel)
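One way to picture the small-channel versus large-channel split is a planner that turns a recipient list into delivery jobs. This is a hedged sketch only: `plan_fanout`, the `gateway_of` map, and the `small_limit` and `batch_size` defaults are hypothetical, and real thresholds would come from measurement.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A delivery job: (gateway_id, user IDs to deliver to through that gateway).
Job = Tuple[str, List[str]]


def plan_fanout(
    online_users: List[str],
    gateway_of: Dict[str, str],
    small_limit: int = 100,   # hypothetical cutoff for "small channel"
    batch_size: int = 500,    # hypothetical maximum users per job
) -> List[Job]:
    if len(online_users) <= small_limit:
        # Small channel: one direct delivery per connected member.
        return [(gateway_of[user], [user]) for user in online_users]

    # Large channel: partition recipients by gateway, then cut each
    # partition into bounded batches so no single job grows unbounded.
    by_gateway: Dict[str, List[str]] = defaultdict(list)
    for user in online_users:
        by_gateway[gateway_of[user]].append(user)

    jobs: List[Job] = []
    for gateway_id in sorted(by_gateway):
        users = by_gateway[gateway_id]
        for start in range(0, len(users), batch_size):
            jobs.append((gateway_id, users[start:start + batch_size]))
    return jobs
```

Grouping by gateway means a large channel costs one job per gateway batch rather than one job per user, which keeps per-user work off the hot path.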
6️⃣ Ordering and Idempotency
Ordering is usually scoped to a channel.
Common approach:
- server-assigned message ID
- channel sequence number
- client-generated idempotency key
- retry-safe send API
👉 Interview Answer
I would guarantee a stable order within a channel, not a global order across the whole system. The server assigns channel sequence numbers, and clients use idempotency keys so retries do not create duplicate messages.
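The idempotency-key idea can be illustrated with a small per-channel log. This is a sketch under assumptions: `ChannelLog`, its field names, and the choice to scope keys by `(sender_id, idempotency_key)` are hypothetical, and a real system would likely also expire old keys.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ChannelLog:
    messages: List[dict] = field(default_factory=list)
    # (sender_id, idempotency_key) -> the message already stored for that key
    by_key: Dict[Tuple[str, str], dict] = field(default_factory=dict)

    def append(self, sender_id: str, idempotency_key: str, text: str) -> dict:
        key = (sender_id, idempotency_key)
        existing = self.by_key.get(key)
        if existing is not None:
            # Retry of an earlier send: return the original message and
            # do not consume a new sequence number.
            return existing

        message = {
            "sequence": len(self.messages) + 1,  # server-assigned, per channel
            "sender_id": sender_id,
            "text": text,
        }
        self.messages.append(message)
        self.by_key[key] = message
        return message
```

The client generates the key once per logical message and reuses it on every retry, so a timeout followed by a resend yields the same sequence number instead of a duplicate.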
7️⃣ Reconnect Recovery
Client stores:
last_seen_channel_sequence
On reconnect:
Client reconnects
↓
Gateway authenticates
↓
Client sends last seen sequence
↓
Server returns missed messages
↓
Realtime stream resumes
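The "server returns missed messages" step amounts to a range read on the channel log. The following is a minimal sketch, assuming the log is a list of `(sequence, text)` pairs sorted by sequence; `messages_after` and its `limit` parameter are hypothetical names, not a real Slack API.

```python
from bisect import bisect_right
from typing import List, Tuple

Entry = Tuple[int, str]  # (channel sequence, message text)


def messages_after(
    channel_log: List[Entry],
    last_seen_sequence: int,
    limit: int = 100,  # hypothetical page size
) -> Tuple[List[Entry], bool]:
    """Return one page of messages newer than last_seen_sequence,
    plus a flag telling the client whether to ask for another page."""
    sequences = [sequence for sequence, _ in channel_log]
    # First index whose sequence is strictly greater than last_seen_sequence.
    start = bisect_right(sequences, last_seen_sequence)
    page = channel_log[start:start + limit]
    has_more = start + limit < len(channel_log)
    return page, has_more
```

Because a message can arrive through both catch-up and the resumed realtime stream, the client would typically drop anything whose sequence is not greater than the last one it applied.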
8️⃣ Staff-Level Trade-offs
| Decision | Benefit | Cost |
|---|---|---|
| WebSockets | Low latency | Connection management cost |
| Durable write before fan-out | Reliable history | Slightly higher send latency |
| Per-channel ordering | Practical consistency | No global order |
| Push fan-out | Instant delivery | Expensive for large channels |
| Pull catch-up | Efficient recovery | More client sync logic |
9️⃣ Failure Handling
Failures:
- gateway node dies
- user reconnects on another device
- duplicate send retry
- fan-out worker lag
- notification service delay
Protections:
- stateless gateways with connection registry
- durable message store
- idempotent sends
- retryable fan-out jobs
- catch-up API after reconnect
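The "stateless gateways with connection registry" protection can be sketched as a small lookup table that lives outside any single gateway. Everything here is hypothetical (`ConnectionRegistry`, its method names, the user/device/gateway ID scheme), and a real registry would be a shared store with heartbeats or TTLs rather than an in-process dict.

```python
from collections import defaultdict
from typing import Dict


class ConnectionRegistry:
    """Tracks which gateway currently holds each device's connection."""

    def __init__(self) -> None:
        # user_id -> {device_id: gateway_id}
        self._by_user: Dict[str, Dict[str, str]] = defaultdict(dict)

    def register(self, user_id: str, device_id: str, gateway_id: str) -> None:
        # A reconnect through a different gateway overwrites the old route.
        self._by_user[user_id][device_id] = gateway_id

    def drop_gateway(self, gateway_id: str) -> int:
        """Remove every route through a dead gateway; return how many were removed."""
        removed = 0
        for user_id in list(self._by_user):
            devices = self._by_user[user_id]
            for device_id in [d for d, g in devices.items() if g == gateway_id]:
                del devices[device_id]
                removed += 1
            if not devices:
                del self._by_user[user_id]
        return removed

    def routes(self, user_id: str) -> Dict[str, str]:
        # Copy so callers cannot mutate registry state by accident.
        return dict(self._by_user.get(user_id, {}))
```

When a gateway dies, fan-out workers stop routing to it after `drop_gateway`, and affected devices reappear in the registry once they reconnect elsewhere and run catch-up. No message is lost in between because the log, not the gateway, holds the messages.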
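Recap
Quick Notes
In One Sentence
Slack messaging combines realtime socket delivery with a durable message log: messages are written reliably first, then fanned out.
Points to Memorize
- WebSockets provide low latency; they are not the source of truth
- the message store is the authoritative record of chat history
- failed fan-out can be retried, and offline users can catch up
- ordering is usually guaranteed within a channel, not globally
- on reconnect, the client uses its last seen sequence to pull missed messages
Extended Interview Answer
I would split the Slack messaging system into a WebSocket gateway, a message service, a message store, and fan-out workers. Clients hold a long-lived connection through the gateway. When a message is sent, the service first checks channel membership, then assigns a message ID and a channel sequence number. The message must be written to the durable store before the fan-out event is published to online users.
The benefit is that chat history stays reliable even if a gateway crashes or fan-out fails. Offline or reconnecting users call the catch-up API with their last seen sequence to pull the messages they missed. Clients retry sends with an idempotency key, which prevents duplicate messages.
The staff-level point is to separate realtime connection state from durable message state. WebSockets make the experience feel realtime, but reliability comes from the durable message log, idempotent writes, and recoverable fan-out.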
✅ Final Interview Answer
A Slack-like realtime messaging system uses WebSocket gateways for low-latency delivery, but it should persist messages durably before fan-out. The send path validates membership, assigns a message ID and channel sequence, writes the message, and then publishes a fan-out event to online devices. Offline users or reconnecting clients recover by fetching messages after their last seen sequence.
At staff level, the key design is separating durable chat state from ephemeral connection state. Gateways can fail and clients can reconnect, but the message log remains the source of truth. The main trade-off is between instant delivery on one side and durability, fan-out cost, and reliable recovery on the other, with per-channel ordering and idempotent sends keeping retries and catch-up safe.