Design Approach: Build MVP First, Then Scale

This guide is structured as:

  1. MVP (single-node, minimal working system)
  2. Scaling Steps (gradually add gateway split, Redis, Kafka, DB sharding, and HA)
  3. Entity / Data Model (core table design + key fields)
  4. End-to-End Flows (send/receive/offline/sync)
  5. Tradeoffs & Interview Q&A (memorization version)


1) MVP: The Smallest Working Chat System (Single Instance)

MVP Goals

  • 1:1 messaging
  • Basic message history
  • Basic online delivery (best effort)
  • No Kafka / no Redis / no multi-gateway

MVP Architecture

  • Client (Web + Mobile)
  • Single Chat Server (includes WebSocket + REST)
  • Single DB (Postgres / MySQL) for users + conversations + messages

MVP Endpoints

WebSocket

  • WS /connect (auth during handshake)
  • sendMessage (client → server)
  • message (server → client push)

HTTP (REST)

  • POST /login
  • GET /conversations
  • GET /messages?conversationId&after=cursor
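The `GET /messages` endpoint above is naturally cursor-paginated: the `after` cursor is the last `seq` the client already has. A minimal sketch over a hypothetical in-memory store (names are illustrative):

```python
# Cursor-based pagination for GET /messages (in-memory sketch).
# "after" is the seq of the last message the client already holds.

def get_messages(messages, conversation_id, after=0, limit=50):
    """Return up to `limit` messages with seq > after, oldest first."""
    page = [m for m in messages
            if m["conversation_id"] == conversation_id and m["seq"] > after]
    page.sort(key=lambda m: m["seq"])
    return page[:limit]

store = [
    {"conversation_id": "c1", "seq": 1, "content": "hi"},
    {"conversation_id": "c1", "seq": 2, "content": "hello"},
    {"conversation_id": "c2", "seq": 1, "content": "other"},
]
page = get_messages(store, "c1", after=1)
# page contains only the c1 message with seq 2
```

Using `seq` as the cursor (rather than an offset) keeps pagination stable while new messages arrive.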

MVP Flow (Online)

  1. Client connects via WebSocket and authenticates
  2. Client sends message over WS
  3. Server writes message into DB
  4. Server pushes message to recipient if recipient is connected to the same server
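The four steps above fit in a few lines. A sketch with both the DB and the WebSocket connections simulated in memory (class and field names are illustrative, not a real implementation):

```python
# Minimal sketch of the MVP online send path: persist first, then
# best-effort push if the recipient is connected to this server.

class MvpChatServer:
    def __init__(self):
        self.db = []           # stands in for the messages table
        self.connections = {}  # user_id -> callback simulating a WS push

    def connect(self, user_id, push_callback):
        self.connections[user_id] = push_callback   # step 1: auth omitted

    def send_message(self, sender_id, recipient_id, content):
        msg = {"from": sender_id, "to": recipient_id, "content": content}
        self.db.append(msg)                  # step 3: persist to DB first
        push = self.connections.get(recipient_id)
        if push is not None:                 # step 4: push only if online here
            push(msg)
        return msg

server = MvpChatServer()
inbox = []
server.connect("bob", inbox.append)
server.send_message("alice", "bob", "hi")
# the message is in the DB, and bob's connection received a push
```

Persisting before pushing is the key choice: if the push fails, the offline flow below still recovers the message from the DB.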

MVP Flow (Offline)

  1. Sender sends message
  2. Server persists message to DB
  3. Receiver later opens app and pulls history via REST

MVP takeaway: make it work correctly first; latency, HA, and extreme scaling come later.


2) Scale Step 1: Split REST vs WebSocket + Stateless Chat Service

Why

  • REST serves short requests at high QPS
  • WebSocket holds long-lived connections at high concurrent-connection counts
  • Splitting them lets you scale and tune timeouts for each independently

Add Components

  • API LB + API Gateway (REST)
  • WS LB + WS Gateway (WebSocket)
  • Chat Service (stateless: validate + enqueue only; or validate + write DB directly in the early stage)

At this stage, you can still keep direct DB write (no Kafka yet) but split responsibilities.


3) Scale Step 2: Add Presence Store (Redis) for Multi-Gateway Routing

Problem

Once you have multiple WS Gateways, you need to know:

  • whether a user is online
  • which gateway the user is connected to

Add Component

  • Presence / Session Store (Redis)
    • userId → gatewayId
    • lastHeartbeat
    • TTL-based cleanup

Result

  • Delivery component can route to the correct WS gateway.
  • Gateway crash → reconnect → mapping rebuilt.
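The presence mapping is easy to sketch. Here it is simulated with an in-memory dict; in production this would be Redis (e.g. `SET presence:{userId} {gatewayId} EX <ttl>`), and the class/parameter names below are assumptions:

```python
# Presence store sketch: userId -> gatewayId with heartbeat-based TTL.
# A crashed gateway stops refreshing heartbeats, so its entries expire
# and those users are treated as offline until they reconnect.

import time

class PresenceStore:
    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self.entries = {}  # user_id -> (gateway_id, expires_at)

    def heartbeat(self, user_id, gateway_id, now=None):
        now = time.time() if now is None else now
        self.entries[user_id] = (gateway_id, now + self.ttl)

    def lookup(self, user_id, now=None):
        now = time.time() if now is None else now
        entry = self.entries.get(user_id)
        if entry is None or entry[1] <= now:
            return None   # offline, or heartbeat expired (e.g. gateway crash)
        return entry[0]

store = PresenceStore(ttl_seconds=30)
store.heartbeat("u1", "gw-2", now=100.0)
# lookup at now=110 returns "gw-2"; at now=200 the TTL has expired -> None
```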

4) Scale Step 3: Add Kafka for Decoupling + Ordering + Backpressure

Why Kafka

  • Decouple ingestion (send path) from delivery (push path)
  • Smooth spikes (buffering)
  • Enable retry/replay
  • Ordering guarantee per conversation via partitioning
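The per-conversation ordering guarantee follows from deterministic partition assignment: every message of a conversation hashes to the same partition, so a single consumer sees them in produced order. A sketch of the partitioner (the hash choice is illustrative):

```python
# All messages of one conversation map to one Kafka partition,
# so per-conversation order is preserved end to end.

import hashlib

def partition_for(conversation_id: str, num_partitions: int) -> int:
    digest = hashlib.md5(conversation_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Same conversation always maps to the same partition:
p1 = partition_for("conv-42", 12)
p2 = partition_for("conv-42", 12)
# p1 == p2
```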

Add Components

  • Message Queue (Kafka): partition by conversationId
  • Delivery / Router (Kafka Consumer):
    • persist to DB (source of truth)
    • lookup Redis presence
    • push to WS gateway via internal RPC

Reliability Model

  • Delivery is at-least-once
  • Dedup via messageId idempotency
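At-least-once delivery means the consumer may see the same message twice (e.g. after a retry); messageId idempotency makes the redelivery harmless. A sketch, with the dedup table simulated as a set:

```python
# At-least-once + idempotency: a redelivered message is detected by its
# message_id and applied only once.

class IdempotentConsumer:
    def __init__(self):
        self.seen = set()   # in production: a dedup table keyed by message_id
        self.applied = []

    def handle(self, message):
        if message["message_id"] in self.seen:
            return False    # duplicate redelivery: skip
        self.seen.add(message["message_id"])
        self.applied.append(message)
        return True

consumer = IdempotentConsumer()
msg = {"message_id": "m-1", "content": "hi"}
consumer.handle(msg)   # first delivery: applied
consumer.handle(msg)   # retry/redelivery: deduplicated
# consumer.applied contains exactly one copy
```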

5) Scale Step 4 (Optional Advanced): HA, Sharding, Multi-Region

HA

  • WS Gateway multi-AZ behind LB
  • Redis cluster/replication
  • Kafka replication
  • DB replication + backups

Sharding

  • Cassandra/DynamoDB partition by conversationId
  • Avoid hot partitions for large groups via bucketing
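Bucketing splits one unbounded conversation partition into fixed-size chunks: the partition key becomes `(conversation_id, bucket)`, with the bucket derived from the message seq. A sketch (the bucket size is an assumption to tune per workload):

```python
# Partition bucketing for hot/large conversations: cap how much data
# lands in any single partition by folding seq into the key.

BUCKET_SIZE = 10_000  # messages per bucket (illustrative value)

def partition_key(conversation_id: str, seq: int):
    return (conversation_id, seq // BUCKET_SIZE)

# Messages 0..9999 land in bucket 0, 10000..19999 in bucket 1, etc.
k1 = partition_key("big-group", 42)
k2 = partition_key("big-group", 15_000)
# k1 == ("big-group", 0), k2 == ("big-group", 1)
```

Reads stay cheap because a history page for a seq range touches at most a couple of adjacent buckets.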

Multi-Region (if asked)

  • Active-active with per-region Kafka + DB replication
  • Read local, write with conflict strategy (complex)

    Usually mention this as an extension unless the interviewer pushes for it.


6) Entity Design (Hardcore, Interview-Ready)

Below is a practical data model that supports:

  • conversation list UI
  • message history
  • ordering
  • unread counts
  • multi-device sync

6.1 Core Entities

User

  • user_id (PK)
  • username
  • created_at
  • status (optional: active/blocked)

Conversation

  • conversation_id (PK)
  • type (DM | GROUP)
  • created_at
  • created_by
  • title (group only)

ConversationMember

  • conversation_id (PK part)
  • user_id (PK part)
  • role (member | admin)
  • joined_at
  • last_read_seq (for read/unread & sync)
  • muted_until (optional)

Message

  • conversation_id (partition key)
  • seq (clustering key / increasing per conversation)
  • message_id (global idempotency key)
  • sender_id
  • content
  • created_at
  • type (text, future: image/file)
  • edited_at (optional)
  • deleted (optional)

Key point: Ordering is by (conversation_id, seq), not by client timestamp.
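A two-message example shows why: client clocks can disagree, but the server-assigned seq is monotonic per conversation, so sorting by seq always reproduces the true send order.

```python
# Ordering by server seq vs. by client timestamp (sketch).
# The server received A first (seq 1), but A's client clock runs fast.

messages = [
    {"seq": 1, "client_ts": 1700000010, "content": "A"},  # fast clock
    {"seq": 2, "client_ts": 1700000005, "content": "B"},  # slow clock
]

by_seq = sorted(messages, key=lambda m: m["seq"])
by_ts = sorted(messages, key=lambda m: m["client_ts"])
# by_seq preserves the true order A, B; by_ts wrongly shows B before A
```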


6.2 Storage Tables (Cassandra-Style)

Table: MessagesByConversation (source of truth)

  • PK: (conversation_id)
  • CK: (seq) or (created_at, message_id)
  • Columns: message_id, sender_id, content, created_at, ...

Table: ConversationsByUser (inbox UI)

  • PK: (user_id)
  • CK: (last_activity_time DESC, conversation_id)
  • Columns: last_message_preview, unread_count, pinned, ...

Table: MembersByConversation

  • PK: (conversation_id)
  • CK: (user_id)
  • Columns: role, joined_at, last_read_seq, ...

Idempotency / Dedup (optional)

  • PK: (message_id)
  • Columns: conversation_id, seq, status, created_at
  • TTL optional

    Used to ensure retries don’t create duplicates.


6.3 Presence / Session Store (Redis)

Key Patterns

  • presence:{userId} -> gatewayId (TTL)
  • heartbeat:{userId} -> timestamp (TTL)
  • Optional:
    • inbox:{userId} -> [messageId...] (short-term cache)
    • unread:{conversationId}:{userId} -> count

Redis is acceleration only, not correctness.


7) End-to-End Flows (Aligned with Entities)

7.1 Send Message (Online)

  1. Client sends sendMessage(conversationId, messageId, content) over WS
  2. WS Gateway forwards to Chat Service (internal API)
  3. Chat Service validates membership (ConversationMember) + rate limit
  4. Chat Service enqueues to Kafka (partition by conversationId)
  5. Delivery Router consumes:
    • allocate seq (monotonic per conversation) and persist Message
    • update ConversationsByUser for sender/receiver (last_activity, preview)
    • lookup Redis presence for receiver
    • push to receiver’s WS gateway
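Step 5 above is the heart of the router. A sketch with all stores simulated in memory (class and field names are illustrative); note that per-conversation seq allocation is safe here because Kafka partitioning guarantees one consumer owns each conversation:

```python
# Delivery Router sketch: allocate seq, persist, then push via presence lookup.

class DeliveryRouter:
    def __init__(self, presence, gateways):
        self.next_seq = {}       # conversation_id -> last seq (per-partition state)
        self.messages = []       # stands in for MessagesByConversation
        self.presence = presence # user_id -> gateway_id (None/missing = offline)
        self.gateways = gateways # gateway_id -> list of pushed messages

    def consume(self, event):
        conv = event["conversation_id"]
        seq = self.next_seq.get(conv, 0) + 1   # monotonic per conversation
        self.next_seq[conv] = seq
        record = {**event, "seq": seq}
        self.messages.append(record)           # persist first (source of truth)
        gw = self.presence.get(event["to"])
        if gw is not None:                     # online: push via WS gateway RPC
            self.gateways[gw].append(record)
        return record

gateways = {"gw-1": []}
router = DeliveryRouter(presence={"bob": "gw-1"}, gateways=gateways)
router.consume({"conversation_id": "c1", "message_id": "m1",
                "from": "alice", "to": "bob", "content": "hi"})
# message persisted with seq 1 and pushed to gw-1
```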

7.2 Receive Message (Push)

  • Router → WS Gateway: internal RPC push(message)
  • WS Gateway → Client: WebSocket message event

7.3 Offline Delivery + Recovery

  • If receiver not online (no presence mapping):
    • still persist to DB
    • optionally update Redis inbox/unread cache
    • optionally trigger push notification
  • On reconnect:
    • client sends last_read_seq per conversation
    • server returns messages where seq > last_read_seq

7.4 Multi-Device Sync

  • Each device stores per conversation last_read_seq
  • On reconnect / app open:
    • pull deltas from MessagesByConversation(conversationId, seq > last_read_seq)
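The reconnect/delta-sync query in 7.3 and 7.4 reduces to one filter per conversation. A sketch over a hypothetical in-memory store:

```python
# Delta sync sketch: the device reports last_read_seq per conversation
# and receives only messages with a higher seq.

def sync_deltas(messages, last_read_seqs):
    """last_read_seqs: {conversation_id: last_read_seq} reported by the device."""
    deltas = {}
    for m in sorted(messages, key=lambda m: m["seq"]):
        conv = m["conversation_id"]
        if m["seq"] > last_read_seqs.get(conv, 0):
            deltas.setdefault(conv, []).append(m)
    return deltas

store = [
    {"conversation_id": "c1", "seq": 1, "content": "old"},
    {"conversation_id": "c1", "seq": 2, "content": "new"},
    {"conversation_id": "c2", "seq": 1, "content": "also new"},
]
deltas = sync_deltas(store, {"c1": 1})
# device already read c1 up to seq 1, so it gets c1 seq 2 and all of c2
```

Because each device tracks its own cursor, the same query serves any number of devices without server-side per-device state.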

8) Interview “Layered” Explanation (Memorize)

8.1 MVP pitch (10s)

Start with a single chat server + DB: authenticate, persist messages, push if online, otherwise clients fetch history.

8.2 Scale pitch (20s)

Then split REST and WS gateways, add Redis presence for routing, add Kafka to decouple send/deliver and guarantee per-conversation ordering, store messages in Cassandra as the source of truth.

8.3 Correctness pitch (10s)

Cassandra is durable source of truth; Kafka provides buffering/replay; Redis accelerates presence and unread state; delivery is at-least-once with messageId idempotency.


9) Quick Q&A (Hardcore)

When is it persisted?

When written to Cassandra/DB as source of truth.

Gateway crash: message loss?

No. Client reconnects; presence rebuilt; missing messages pulled from DB.

How does router find user?

Redis userId → gatewayId mapping with heartbeat TTL.

How do you guarantee ordering?

Kafka partition by conversationId + per-conversation seq in DB.

Why not exactly-once?

Too complex; at-least-once + idempotency is the practical standard.


Notes (How to draw on whiteboard)

  • Start with MVP boxes (Client → Chat Server → DB)
  • Then evolve to: API LB/GW and WS LB/GW
  • Add Redis presence mapping
  • Add Kafka + Delivery Router
  • Finally annotate: ordering (partition), idempotency (messageId), source of truth (DB)