Real-Time Chat System: Interview Q&A & Key Points
This document summarizes common system design interview questions about a large-scale chat system, with standard answers, trade-offs, and keywords.
1. System Scenario
Q1: Design a large-scale real-time chat system that supports one-to-one and group messaging.
Key Points:
- Millions of concurrent users
- Low-latency message delivery
- Message history persistence
- Scalability, fault tolerance, consistency trade-offs
Sample Answer:
The system should support both one-to-one and group chat. It must handle millions of concurrent users, deliver messages in real time with low latency, and store message history reliably. We also need to reason explicitly about trade-offs among scalability, fault tolerance, and consistency.
2. WebSocket vs HTTP / Protocols
Q2: What protocol would users use to connect? HTTP or WebSocket? Why?
Answer:
Users connect via WebSocket for real-time bidirectional communication. HTTP is used only for REST APIs like login, history fetch, or profile management.
Keywords: WebSocket, full-duplex, low-latency, HTTP, REST
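A minimal client-side sketch of this split (the endpoint paths, the query-string token, and the message shape are illustrative assumptions, not a fixed API):

```typescript
// REST over HTTP for request/response work such as fetching history.
async function fetchHistory(chatId: string): Promise<unknown[]> {
  const res = await fetch(`/api/chats/${chatId}/messages?limit=50`);
  return res.json();
}

// A single long-lived WebSocket for real-time, full-duplex messaging.
const ws = new WebSocket("wss://chat.example.com/ws?token=<jwt>");

ws.onopen = () => {
  ws.send(JSON.stringify({ type: "send", chatId: "c1", text: "hello" }));
};

ws.onmessage = (event) => {
  console.log("incoming", JSON.parse(event.data)); // server pushes without polling
};
```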
Q3: Is the WebSocket connection always WebSocket? What about the first handshake?
Answer:
- First connection starts as plain HTTP (a GET request with an Upgrade: websocket header)
- Auth and rate limiting happen during the handshake
- After handshake → long-lived WebSocket connection (sketch below)
Keywords: handshake, Upgrade header, authorization, rate limiting
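A hedged sketch of doing auth and connection-level rate limiting at the Upgrade step, assuming Node's built-in http server plus the `ws` package; verifyToken and allowConnection are hypothetical stubs:

```typescript
import http from "node:http";
import { WebSocketServer } from "ws";

const server = http.createServer();
const wss = new WebSocketServer({ noServer: true });

server.on("upgrade", (req, socket, head) => {
  const token = new URL(req.url ?? "/", "http://edge").searchParams.get("token");
  if (!verifyToken(token) || !allowConnection(req.socket.remoteAddress)) {
    // Reject during the handshake, before any WebSocket frames are exchanged.
    socket.write("HTTP/1.1 401 Unauthorized\r\n\r\n");
    socket.destroy();
    return;
  }
  // Handshake accepted → the socket becomes a long-lived WebSocket connection.
  wss.handleUpgrade(req, socket, head, (ws) => wss.emit("connection", ws, req));
});

server.listen(8080);

// Stubs for the sketch only.
function verifyToken(token: string | null): boolean { return token !== null; }
function allowConnection(ip: string | undefined): boolean { return ip !== undefined; }
```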
3. WebSocket Gateway / Edge Layer
Q4: What is a WebSocket Gateway? Does it include WebSocket servers?
Answer:
- WS Gateway = logical layer at the edge handling all WebSocket connections
- Consists of multiple WebSocket servers
- Responsibilities: connection management, sticky sessions, heartbeat, routing, fan-out
- Does not handle chat business logic
Keywords: WebSocket Gateway, WebSocket servers, sticky session, heartbeat, fan-out
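A sketch of three of these responsibilities (connection registry, heartbeat, local fan-out), again assuming the `ws` package; userIdFromRequest is a hypothetical helper:

```typescript
import type { IncomingMessage } from "node:http";
import { WebSocketServer, WebSocket } from "ws";

const wss = new WebSocketServer({ port: 8080 });
const connections = new Map<string, WebSocket>(); // userId → live socket on this node

wss.on("connection", (ws, req) => {
  const userId = userIdFromRequest(req); // hypothetical: derived from the auth step
  connections.set(userId, ws);

  // Heartbeat: ping every 30s; drop the connection if no pong came back.
  let alive = true;
  const timer = setInterval(() => {
    if (!alive) { ws.terminate(); return; }
    alive = false;
    ws.ping();
  }, 30_000);

  ws.on("pong", () => { alive = true; });
  ws.on("close", () => { connections.delete(userId); clearInterval(timer); });
});

// Fan-out: push a payload to every recipient connected to *this* gateway node.
function fanOut(recipientIds: string[], payload: string): void {
  for (const id of recipientIds) connections.get(id)?.send(payload);
}

function userIdFromRequest(req: IncomingMessage): string {
  return new URL(req.url ?? "/", "http://edge").searchParams.get("uid") ?? "anon";
}
```

Note the gateway only moves bytes and tracks connections; deciding who the recipients are stays in the Chat Service.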
Q5: Why call it Edge layer instead of just WS Gateway?
Answer:
- Edge = entry layer for all client traffic (HTTP, WS, gRPC), including TLS termination
- WS Gateway is protocol-specific implementation inside Edge
- Edge emphasizes extensibility and abstraction
Keywords: Edge layer, entry point, abstraction, protocol-specific, extensibility
Q6: Is client → WS Gateway → Chat Service all WebSocket?
Answer:
- Client → WS Gateway: ✅ WebSocket
- WS Gateway → Chat Service: ❌ usually gRPC / HTTP / Message Queue
- WS Gateway terminates WebSocket and forwards to stateless Chat Service
Keywords: stateful vs stateless, long-lived connection, protocol translation
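A sketch of the protocol translation step. Plain HTTP is shown for the gateway-to-service hop purely for brevity; the internal endpoint name and the global fetch (Node 18+) are assumptions:

```typescript
import { WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (ws) => {
  ws.on("message", async (data) => {
    // The WebSocket frame stops here; the payload moves on as a plain HTTP call.
    const res = await fetch("http://chat-service.internal/internal/messages", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: data.toString(),
    });
    // Ack (or error) travels back to the client over the same WebSocket.
    ws.send(JSON.stringify({ type: "ack", ok: res.ok }));
  });
});
```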
Q7: Why not put WebSocket server inside Chat Service?
Answer:
- Coupling connection management with business logic reduces scalability and fault isolation
- Separation allows independent scaling and easier failure recovery
Keywords: decoupling, scalability, fault isolation, stateful vs stateless
4. Session Affinity / Sticky Routing
Q8: Why do WebSocket servers require session affinity?
Answer:
- WebSocket connections are stateful
- Each connection lives on a single server
- Sticky routing ensures the client's requests and reconnects reach the WS server that holds its connection (see the registry sketch below)
Keywords: session affinity, sticky session, stateful connection
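A sketch of the routing side of session affinity: a userId → gateway-node registry that any component can consult to push a message to the right server. A plain Map stands in for the shared store (Redis, section 5); forwardToNode is a hypothetical RPC:

```typescript
// userId → gateway node id holding that user's live connection.
const connectionRegistry = new Map<string, string>();

function registerConnection(userId: string, nodeId: string): void {
  connectionRegistry.set(userId, nodeId); // called by a gateway node when the WS is accepted
}

function unregisterConnection(userId: string): void {
  connectionRegistry.delete(userId); // called when the connection closes
}

// Deliver a message: find the node holding the user's connection, then forward to it.
function routeToUser(userId: string, payload: string): void {
  const nodeId = connectionRegistry.get(userId);
  if (!nodeId) return; // user offline → offline storage / push notification path
  forwardToNode(nodeId, userId, payload);
}

// Hypothetical RPC to the owning gateway instance.
function forwardToNode(nodeId: string, userId: string, payload: string): void {
  console.log(`forward to ${nodeId} for ${userId}:`, payload);
}
```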
5. Message Storage / Persistence
Q9: Where is chat history stored? How to maintain ordering?
Answer:
- Use distributed DB (Cassandra, DynamoDB) partitioned by chat_id
- Each message has monotonically increasing sequence_id per chat
- Redis for ephemeral data: connection mapping, online presence, recent messages
- Message Queue (Kafka/Pulsar) for async fan-out and durability
Keywords: distributed DB, partitioning, sequence ID, Redis, ephemeral storage, message queue
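A sketch of per-chat ordering using an atomic counter: the sequence_id is taken from Redis before the message is written to the partitioned store. The ioredis client and the saveMessage call are illustrative assumptions:

```typescript
import Redis from "ioredis";

const redis = new Redis(); // also used for connection mapping and presence

interface ChatMessage {
  chatId: string;      // partition key in Cassandra/DynamoDB
  sequenceId: number;  // monotonically increasing per chat → total order within the chat
  senderId: string;
  text: string;
  sentAt: number;
}

async function persistMessage(chatId: string, senderId: string, text: string): Promise<ChatMessage> {
  // INCR is atomic, so concurrent senders in the same chat still get distinct, increasing ids.
  const sequenceId = await redis.incr(`chat:${chatId}:seq`);
  const msg: ChatMessage = { chatId, sequenceId, senderId, text, sentAt: Date.now() };
  await saveMessage(msg); // hypothetical write to the distributed DB, keyed by (chatId, sequenceId)
  return msg;
}

async function saveMessage(msg: ChatMessage): Promise<void> {
  console.log("write", msg); // placeholder for the Cassandra/DynamoDB write
}
```

Clients and other devices can then sort by sequenceId, which is what makes multi-device ordering tractable.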
6. High Availability / Fault Tolerance
Q10: How to make the system fault-tolerant / HA?
Answer:
- Multiple WS Gateway instances behind load balancer
- Auto-failover, health checks
- Stateless Chat Service → horizontal scaling
- Redis / DB clustered for persistence
- Clients reconnect on failure
Keywords: high availability, horizontal scaling, failover, stateless, clustered DB
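A sketch of the client half of failover: reconnect with exponential backoff plus jitter whenever the WebSocket drops, e.g. because a gateway instance died and the load balancer now points elsewhere (URL and timing constants are illustrative):

```typescript
function connectWithBackoff(url: string, attempt = 0): void {
  const ws = new WebSocket(url);

  ws.onopen = () => {
    attempt = 0; // healthy again; missed messages can be re-synced over REST here
  };

  ws.onclose = () => {
    // Exponential backoff capped at ~30s, with jitter so clients don't reconnect in lockstep.
    const delay = Math.min(30_000, 1000 * 2 ** attempt) * (0.5 + Math.random() / 2);
    setTimeout(() => connectWithBackoff(url, attempt + 1), delay);
  };
}

connectWithBackoff("wss://chat.example.com/ws?token=<jwt>");
```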
7. Rate Limiting / Authorization
Q11: Where to do auth and rate limiting?
Answer:
- Auth → during handshake at API Gateway or Edge
- Connection-level rate limiting → WS Gateway
- Message-level rate limiting → Chat Service
Keywords: auth, JWT, handshake, connection-level rate limit, message-level rate limit
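A sketch of message-level rate limiting as a per-user token bucket inside the Chat Service. The limits are illustrative, and a real multi-instance deployment would keep the buckets in a shared store such as Redis rather than in process memory:

```typescript
interface Bucket { tokens: number; lastRefill: number; }

const buckets = new Map<string, Bucket>();
const CAPACITY = 20;       // burst size (illustrative)
const REFILL_PER_SEC = 1;  // steady-state messages per second (illustrative)

function allowMessage(userId: string): boolean {
  const now = Date.now();
  const b = buckets.get(userId) ?? { tokens: CAPACITY, lastRefill: now };

  // Refill proportionally to elapsed time, capped at the bucket capacity.
  b.tokens = Math.min(CAPACITY, b.tokens + ((now - b.lastRefill) / 1000) * REFILL_PER_SEC);
  b.lastRefill = now;
  buckets.set(userId, b);

  if (b.tokens < 1) return false; // reject: tell the client to slow down
  b.tokens -= 1;
  return true;
}
```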
8. Trade-offs / System Design Points
Q12: What are the key trade-offs in a chat system?
| Trade-off | Consideration |
|---|---|
| WS Gateway co-located with Chat Service | Simple & low latency vs scalability & fault isolation |
| Consistency vs latency | Ordering guaranteed per chat vs eventual consistency for multi-device |
| Stateful vs stateless | WebSocket requires state → scaling harder; Chat service stateless → easier scale |
| Persistent vs ephemeral storage | Redis = fast but volatile, DB = durable but slower |
9. Estimation / Scaling Calculations
Q13: How to estimate storage / traffic?
Example:
- 10M users, 20 messages/day, 500 bytes/message
- 10M × 20 × 500B ≈ 100 GB/day
- 90 days retention → 9 TB storage
- Use distributed DB, partitioned by chat_id
Keywords: message size, user count, daily volume, retention, distributed DB, partitioning
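A quick back-of-the-envelope check of those numbers (the inputs are the stated assumptions, not measurements):

```typescript
const users = 10_000_000;
const messagesPerUserPerDay = 20;
const bytesPerMessage = 500;
const retentionDays = 90;

const bytesPerDay = users * messagesPerUserPerDay * bytesPerMessage; // 1e11 B ≈ 100 GB/day
const totalBytes = bytesPerDay * retentionDays;                      // 9e12 B ≈ 9 TB

console.log(`${bytesPerDay / 1e9} GB/day, ${totalBytes / 1e12} TB over ${retentionDays} days`);
```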
10. Excalidraw / Whiteboard Guidelines
Layered Layout (Top → Bottom):
[ Clients ]
|
v
----------------------------
| API Gateway | WS Gateway |
| - Auth | - WS server|
| - Rate Lim | - Sticky |
| - TLS | - Heartbeat|
----------------------------
| |
v v
[ REST Services ] [ Chat Service ]
|
-------------------------
| Message Queue / Kafka |
| Redis / ephemeral |
| DB / Cassandra |
-------------------------
Tips:
- Mark protocols on arrows (HTTP / WS / gRPC)
- Edge layer = API Gateway + WS Gateway
- Chat Service = stateless, business logic
- Redis = ephemeral state / session
- MQ = async delivery / fan-out
11. Common Follow-Up Questions & Answers
| Question | Standard Answer |
|---|---|
| WS Gateway crashes | Clients reconnect, state rebuilt from Redis |
| Chat Service crashes | WS Gateway buffers messages in MQ |
| New user connects | HTTP handshake → auth → WS upgrade → sticky WS server |
| Multi-device sync | Messages stored in DB, sequence IDs ensure ordering |
| High concurrency | Horizontal scaling WS Gateway + Chat Service; partition DB by chat_id |
12. Summary Statements (Interview Ready)
The client connects through an Edge layer where authentication, rate limiting, and protocol handling occur. The WebSocket Gateway manages long-lived connections and forwards messages to stateless Chat Services, which persist messages to the database and coordinate via message queues. This separation provides scalability, fault tolerance, and low-latency delivery.