Real-Time Chat System: Interview Q&A & Key Points

This document summarizes common system design interview questions about a large-scale chat system, with standard answers, trade-offs, and keywords.


1. System Scenario

Q1: Design a large-scale real-time chat system that supports one-to-one and group messaging.

Key Points:

  • Millions of concurrent users
  • Low-latency message delivery
  • Message history persistence
  • Scalability, fault tolerance, consistency trade-offs

Sample Answer:

The system should support both one-to-one and group chat. It must handle millions of concurrent users, deliver messages in real time with low latency, and store message history reliably. The design must balance scalability, fault tolerance, and consistency trade-offs.


2. WebSocket vs HTTP / Protocols

Q2: What protocol would users use to connect? HTTP or WebSocket? Why?

Answer:

Users connect via WebSocket for real-time bidirectional communication. HTTP is used only for REST APIs like login, history fetch, or profile management.

Keywords: WebSocket, full-duplex, low-latency, HTTP, REST


Q3: Is the WebSocket connection always WebSocket? What about the first handshake?

Answer:

  • First connection starts as HTTP (GET + Upgrade: websocket)
  • Auth and rate-limiting happen during handshake
  • After handshake → long-lived WebSocket connection

Keywords: handshake, Upgrade header, authorization, rate limiting
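
The accept key the server returns during the upgrade is derived mechanically from the client's `Sec-WebSocket-Key`, as defined in RFC 6455. A minimal sketch of that server-side step:

```python
import base64
import hashlib

# RFC 6455 magic GUID, appended to the client's Sec-WebSocket-Key.
WS_MAGIC_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept header value the server must
    return to complete the HTTP -> WebSocket upgrade handshake."""
    digest = hashlib.sha1(
        (sec_websocket_key + WS_MAGIC_GUID).encode("ascii")
    ).digest()
    return base64.b64encode(digest).decode("ascii")
```

The handshake is the natural place for auth and rate limiting because it is still plain HTTP: headers (cookies, JWTs) are available before the long-lived connection is established.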


3. WebSocket Gateway / Edge Layer

Q4: What is a WebSocket Gateway? Does it include WebSocket servers?

Answer:

  • WS Gateway = logical layer at the edge handling all WebSocket connections
  • Consists of multiple WebSocket servers
  • Responsibilities: connection management, sticky sessions, heartbeat, routing, fan-out
  • Does not handle chat business logic

Keywords: WebSocket Gateway, WebSocket servers, sticky session, heartbeat, fan-out
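
The gateway's connection-management duties can be sketched as a registry that tracks each live connection and its last heartbeat. This is a simplified in-memory sketch; the class and method names are illustrative, and a real gateway would shard this across instances:

```python
import time

class ConnectionRegistry:
    """Tracks live WebSocket connections on one gateway instance:
    user_id -> (connection handle, last heartbeat timestamp).
    No chat business logic lives here."""

    def __init__(self, heartbeat_timeout: float = 30.0):
        self.heartbeat_timeout = heartbeat_timeout
        self._conns = {}  # user_id -> (conn, last_heartbeat)

    def register(self, user_id, conn, now=None):
        self._conns[user_id] = (conn, time.time() if now is None else now)

    def heartbeat(self, user_id, now=None):
        """Refresh the liveness timestamp when a ping/pong arrives."""
        if user_id in self._conns:
            conn, _ = self._conns[user_id]
            self._conns[user_id] = (conn, time.time() if now is None else now)

    def reap_stale(self, now=None):
        """Drop connections whose heartbeat expired; return evicted user_ids."""
        now = time.time() if now is None else now
        stale = [uid for uid, (_, ts) in self._conns.items()
                 if now - ts > self.heartbeat_timeout]
        for uid in stale:
            del self._conns[uid]
        return stale
```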


Q5: Why call it Edge layer instead of just WS Gateway?

Answer:

  • Edge = entry layer for all client traffic (HTTP, WS, gRPC, TLS termination)
  • WS Gateway is protocol-specific implementation inside Edge
  • Edge emphasizes extensibility and abstraction

Keywords: Edge layer, entry point, abstraction, protocol-specific, extensibility


Q6: Is client → WS Gateway → Chat Service all WebSocket?

Answer:

  • Client → WS Gateway: ✅ WebSocket
  • WS Gateway → Chat Service: ❌ usually gRPC / HTTP / Message Queue
  • WS Gateway terminates WebSocket and forwards to stateless Chat Service

Keywords: stateful vs stateless, long-lived connection, protocol translation


Q7: Why not put WebSocket server inside Chat Service?

Answer:

  • Coupling connection management with business logic reduces scalability and fault isolation
  • Separation allows independent scaling and easier failure recovery

Keywords: decoupling, scalability, fault isolation, stateful vs stateless


4. Session Affinity / Sticky Routing

Q8: Why do WebSocket servers require session affinity?

Answer:

  • WebSocket connections are stateful
  • Each connection lives on a single server
  • Sticky routing ensures all messages from the client go to the correct WS server

Keywords: session affinity, sticky session, stateful connection
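
One way to implement sticky routing is rendezvous (highest-random-weight) hashing, so every load-balancer instance deterministically maps the same client to the same WS server, and removing a server only remaps the clients that were on it. This is one common technique, not the only one (cookie-based affinity at the LB also works):

```python
import hashlib

def sticky_server(client_id: str, servers: list[str]) -> str:
    """Rendezvous hashing: score each (client, server) pair and pick
    the highest. Deterministic across all routers, minimal remapping
    when the server set changes."""
    def score(server: str) -> int:
        h = hashlib.sha256(f"{client_id}:{server}".encode()).hexdigest()
        return int(h, 16)
    return max(servers, key=score)
```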


5. Message Storage / Persistence

Q9: Where is chat history stored? How to maintain ordering?

Answer:

  • Use distributed DB (Cassandra, DynamoDB) partitioned by chat_id
  • Each message has monotonically increasing sequence_id per chat
  • Redis for ephemeral data: connection mapping, online presence, recent messages
  • Message Queue (Kafka/Pulsar) for async fan-out and durability

Keywords: distributed DB, partitioning, sequence ID, Redis, ephemeral storage, message queue
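
The per-chat `sequence_id` is just an atomic counter keyed by `chat_id`; in production that would typically be Redis `INCR` or a DB-side counter. A minimal in-memory stand-in:

```python
class SequenceAllocator:
    """Allocates a monotonically increasing sequence_id per chat.
    In production this would be an atomic counter such as Redis INCR
    keyed by chat_id; a plain dict stands in here."""

    def __init__(self):
        self._counters: dict[str, int] = {}

    def next_seq(self, chat_id: str) -> int:
        self._counters[chat_id] = self._counters.get(chat_id, 0) + 1
        return self._counters[chat_id]
```

Because the counter is scoped per chat, ordering is guaranteed within a conversation without requiring any global ordering across chats.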


6. High Availability / Fault Tolerance

Q10: How to make the system fault-tolerant / HA?

Answer:

  • Multiple WS Gateway instances behind load balancer
  • Auto-failover, health checks
  • Stateless Chat Service → horizontal scaling
  • Redis / DB clustered for persistence
  • Clients reconnect on failure

Keywords: high availability, horizontal scaling, failover, stateless, clustered DB
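
Client reconnection is usually capped exponential backoff; real clients add random jitter to avoid a thundering herd when a gateway dies, which is omitted here to keep the schedule deterministic:

```python
def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Delay (seconds) before each reconnect attempt: doubles each time,
    capped so repeated failures don't push delays unbounded."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]
```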


7. Rate Limiting / Authorization

Q11: Where to do auth and rate limiting?

Answer:

  • Auth → during handshake at API Gateway or Edge
  • Connection-level rate limiting → WS Gateway
  • Message-level rate limiting → Chat Service

Keywords: auth, JWT, handshake, connection-level rate limit, message-level rate limit
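
Message-level rate limiting in the Chat Service is commonly a token bucket per user or per connection; a minimal sketch (the rate/capacity parameters and the explicit `now` argument are illustrative):

```python
class TokenBucket:
    """Message-level rate limiter: refills `rate` tokens per second up
    to `capacity`; each message spends one token or is rejected."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full
        self.last = 0.0         # real code would initialize to the current time

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, then spend one token if possible.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```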


8. Trade-offs / System Design Points

Q12: What are the key trade-offs in a chat system?

  • WS Gateway co-located with Chat Service — simple & low latency vs scalability & fault isolation
  • Consistency vs latency — ordering guaranteed per chat vs eventual consistency for multi-device
  • Stateful vs stateless — WebSocket requires state → harder to scale; stateless Chat Service → easy horizontal scaling
  • Persistent vs ephemeral storage — Redis is fast but volatile; DB is durable but slower

9. Estimation / Scaling Calculations

Q13: How to estimate storage / traffic?

Example:

  • 10M users, 20 messages/day, 500 bytes/message
  • 10M × 20 × 500B ≈ 100 GB/day
  • 90 days retention → 9 TB storage
  • Use distributed DB, partitioned by chat_id

Keywords: message size, user count, daily volume, retention, distributed DB, partitioning
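
The arithmetic above, spelled out (using decimal GB/TB):

```python
users = 10_000_000
msgs_per_user_per_day = 20
bytes_per_msg = 500

daily_bytes = users * msgs_per_user_per_day * bytes_per_msg  # 1e11 B = 100 GB/day

retention_days = 90
total_bytes = daily_bytes * retention_days                   # 9e12 B = 9 TB
```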


10. Excalidraw / Whiteboard Guidelines

Layered Layout (Left → Right):

[ Clients ]
     |
     v
------------------------------
| API Gateway | WS Gateway   |
| - Auth      | - WS server  |
| - Rate Lim  | - Sticky     |
| - TLS       | - Heartbeat  |
------------------------------
     |               |
     v               v
[ REST Services ]   [ Chat Service ]
                     | 
           -------------------------
           | Message Queue / Kafka |
           | Redis / ephemeral     |
           | DB / Cassandra        |
           -------------------------

Tips:

  • Mark protocols on arrows (HTTP / WS / gRPC)
  • Edge layer = API Gateway + WS Gateway
  • Chat Service = stateless, business logic
  • Redis = ephemeral state / session
  • MQ = async delivery / fan-out

11. Common Follow-Up Questions & Answers

  • WS Gateway crashes → clients reconnect; connection state rebuilt from Redis
  • Chat Service crashes → WS Gateway buffers messages in the MQ
  • New user connects → HTTP handshake → auth → WS upgrade → sticky WS server
  • Multi-device sync → messages stored in DB; sequence IDs ensure ordering
  • High concurrency → horizontally scale WS Gateway + Chat Service; partition DB by chat_id

12. Summary Statements (Interview Ready)

The client connects through an Edge layer where authentication, rate limiting, and protocol handling occur. The WebSocket Gateway manages long-lived connections and forwards messages to stateless Chat Services, which persist messages to the database and coordinate via message queues. This separation ensures scalability, fault tolerance, and low-latency delivery.