System Design Interview: Real-Time Chat System
Design a real-time chat application similar to Slack or WhatsApp. This question tests understanding of real-time communication, message delivery guarantees, presence systems, and data modeling at scale.
Interview Format (45 minutes)
Time Allocation:
- Requirements gathering: 5-8 minutes
- High-level design: 10-15 minutes
- Deep dive: 15-20 minutes
- Scale and edge cases: 5-10 minutes
Step 1: Requirements Gathering (5-8 min)
A strong candidate will clarify the scope before designing anything.
Functional Requirements
Good questions to ask:
- Is this 1:1 chat, group chat, or both? (both, groups up to 500 members)
- Do we need message history/persistence? (yes, searchable)
- What message types? (text, images, files)
- Do we need read receipts? (yes)
- Online/offline presence? (yes)
- Push notifications? (yes, for offline users)
- Message editing/deletion? (yes)
Agreed requirements:
- 1:1 and group messaging (up to 500 members)
- Real-time message delivery
- Persistent message history with search
- Read receipts and typing indicators
- Online/offline presence
- Push notifications for offline users
- Image and file sharing
Non-Functional Requirements
Good questions to ask:
- Expected user base? (50M DAU)
- Messages per day? (1B messages/day)
- Message size limit? (64KB text, 100MB files)
- Latency requirements? (<200ms delivery)
- Geographic distribution? (global)
- Message retention? (forever for paid, 90 days for free)
Agreed requirements:
- Low latency (<200ms for message delivery)
- High availability (99.99% uptime)
- Message ordering guaranteed within a conversation
- At-least-once delivery (with deduplication)
- End-to-end encryption (stretch goal)
Calculations
Messages:
50M DAU, average 20 messages/day = 1B messages/day
Average message size: 200 bytes
1B x 200 bytes = 200GB/day = 73TB/year
Connections:
50M concurrent WebSocket connections (peak)
Each connection: ~10KB memory
50M x 10KB = 500GB RAM for connections alone
QPS:
1B messages/day = ~12,000 messages/sec average
Peak (3x): ~36,000 messages/sec
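The envelope math above can be checked in a few lines of Python (the 3x peak factor and the ~10KB-per-connection figure are the assumptions stated above, not measured values):

```python
# Back-of-envelope figures taken from the agreed requirements.
DAU = 50_000_000
MSGS_PER_USER_PER_DAY = 20
AVG_MSG_BYTES = 200
CONN_MEM_BYTES = 10_000   # ~10KB per WebSocket connection (assumed)
PEAK_FACTOR = 3           # assumed peak-to-average ratio

msgs_per_day = DAU * MSGS_PER_USER_PER_DAY        # 1B messages/day
storage_per_day = msgs_per_day * AVG_MSG_BYTES    # 200GB/day
storage_per_year = storage_per_day * 365          # ~73TB/year
conn_memory = DAU * CONN_MEM_BYTES                # ~500GB RAM
avg_qps = msgs_per_day / 86_400                   # ~11.6K msg/sec
peak_qps = avg_qps * PEAK_FACTOR                  # ~35K msg/sec

print(f"{msgs_per_day:,} msgs/day, {storage_per_day / 1e9:.0f}GB/day, "
      f"{storage_per_year / 1e12:.1f}TB/year")
print(f"{conn_memory / 1e9:.0f}GB connection RAM, "
      f"avg {avg_qps:,.0f} msg/s, peak {peak_qps:,.0f} msg/s")
```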
Red flags if the candidate:
- Designs only for HTTP polling
- Doesn't consider message ordering
- Ignores offline scenarios
- Doesn't ask about group size limits
Step 2: High-Level Design (10-15 min)
API Design
WebSocket Connection:
wss://chat.example.com/ws?token=<auth_token>

// Client -> Server
{
  "type": "send_message",
  "conversationId": "conv_123",
  "content": "Hello!",
  "clientMessageId": "client_uuid_456"  // for deduplication
}

// Server -> Client
{
  "type": "new_message",
  "messageId": "msg_789",
  "conversationId": "conv_123",
  "senderId": "user_001",
  "content": "Hello!",
  "timestamp": "2026-02-14T10:30:00Z"
}
REST APIs (for non-real-time operations):
GET /api/conversations # List conversations
GET /api/conversations/:id/messages # Message history (paginated)
POST /api/conversations # Create conversation/group
POST /api/conversations/:id/messages # Send message (fallback)
PUT /api/messages/:id # Edit message
DELETE /api/messages/:id # Delete message
POST /api/upload # Upload file/image
Good candidate discusses:
- WebSocket vs SSE vs long polling trade-offs
- REST fallback for reliability
- Client-generated message IDs for deduplication
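Client-generated message IDs make retries safe: if an ack never arrives, the client resends the same frame with the same `clientMessageId`, and the server's deduplication check drops the copy. A minimal client-side sketch with the transport stubbed out (`build_send_frame`, `send_with_retry`, and the retry policy are illustrative, not a real client API):

```python
import json
import uuid

def build_send_frame(conversation_id, content, client_message_id):
    """Build a send_message frame; the clientMessageId stays fixed
    across retries so the server can deduplicate."""
    return json.dumps({
        "type": "send_message",
        "conversationId": conversation_id,
        "content": content,
        "clientMessageId": client_message_id,
    })

def send_with_retry(send_frame, conversation_id, content, max_attempts=3):
    """send_frame is the (stubbed) transport; it returns True once the
    server acks. The identical frame is re-sent on every attempt."""
    client_id = str(uuid.uuid4())  # generated once, reused on retry
    frame = build_send_frame(conversation_id, content, client_id)
    for _ in range(max_attempts):
        if send_frame(frame):
            return client_id       # acked; safe to mark as sent
    raise TimeoutError("no ack after retries")
```

A real client would wait for the ack with a timeout and back off between attempts; the key property is that the server sees one logical message no matter how many frames arrive.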
Core Components
        ┌───────────────┐
        │    Clients    │
        └───────┬───────┘
                │ WSS
┌───────────────▼───────────────────────────┐
│             WebSocket Gateway             │
│     (Connection management, routing)      │
└───────┬───────────────┬───────────────────┘
        │               │
┌───────▼───────┐ ┌─────▼───────────────┐
│  Chat Service │ │  Presence Service   │
│   (Messages)  │ │   (Online status)   │
└───────┬───────┘ └─────┬───────────────┘
        │               │
┌───────▼───────┐ ┌─────▼───────────────┐
│   Message DB  │ │    Redis Cluster    │
│  (Cassandra)  │ │ (Presence + Pub/Sub)│
└───────────────┘ └─────────────────────┘
Data Model
Messages (Cassandra / DynamoDB):
-- Partition by conversation, sorted by time
CREATE TABLE messages (
    conversation_id UUID,
    message_id      TIMEUUID,
    sender_id       UUID,
    content         TEXT,
    content_type    TEXT,      -- 'text', 'image', 'file'
    media_url       TEXT,
    created_at      TIMESTAMP,
    edited_at       TIMESTAMP,
    deleted         BOOLEAN,
    PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
Conversations (PostgreSQL):
CREATE TABLE conversations (
    id         UUID PRIMARY KEY,
    type       VARCHAR(10),    -- 'direct', 'group'
    name       VARCHAR(255),
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);

CREATE TABLE conversation_members (
    conversation_id      UUID REFERENCES conversations(id),
    user_id              UUID,
    role                 VARCHAR(20) DEFAULT 'member',
    joined_at            TIMESTAMP,
    last_read_message_id UUID,
    PRIMARY KEY (conversation_id, user_id)
);

CREATE INDEX idx_user_conversations
    ON conversation_members(user_id);
Step 3: Deep Dive (15-20 min)
Message Delivery Flow
Sender -> WebSocket Gateway -> Chat Service -> Message DB
                                    |
                                    v
                              Message Queue
                                    |
                    ┌───────────────┼───────────────┐
                    v               v               v
                WS Gateway      WS Gateway     Push Service
                 (User A)        (User B)    (Offline Users)
Implementation:
class ChatService:
    def handle_message(self, sender_id, conversation_id, content, client_msg_id):
        # 1. Deduplication check (clients may retry the same message)
        if self.message_store.exists_by_client_id(client_msg_id):
            return {"status": "duplicate"}  # already processed; ack idempotently

        # 2. Validate sender is a member of the conversation
        if not self.is_member(sender_id, conversation_id):
            raise PermissionError("Not a member")

        # 3. Store message
        message = self.message_store.create(
            conversation_id=conversation_id,
            sender_id=sender_id,
            content=content,
            client_message_id=client_msg_id
        )

        # 4. Get conversation members
        members = self.get_members(conversation_id)

        # 5. Fan out to online members via pub/sub
        for member_id in members:
            if member_id != sender_id:
                self.pubsub.publish(
                    channel=f"user:{member_id}",
                    message=message.to_dict()
                )

        # 6. Send push notifications to offline members
        offline_members = [m for m in members
                           if not self.presence.is_online(m)]
        self.push_service.notify(offline_members, message)

        # 7. Acknowledge to sender
        return {"status": "delivered", "messageId": message.id}
Presence System
Challenge: Tracking 50M online users in real time
class PresenceService:
    def __init__(self, redis, server_id):
        self.redis = redis
        self.server_id = server_id    # gateway instance holding the connection
        self.HEARTBEAT_INTERVAL = 30  # seconds
        self.TIMEOUT = 90             # seconds

    def user_connected(self, user_id):
        self.redis.hset(f"presence:{user_id}", mapping={
            "status": "online",
            "last_seen": time.time(),
            "server_id": self.server_id
        })
        self.redis.expire(f"presence:{user_id}", self.TIMEOUT)
        # Notify contacts
        self._broadcast_status(user_id, "online")

    def heartbeat(self, user_id):
        self.redis.hset(f"presence:{user_id}",
                        "last_seen", time.time())
        self.redis.expire(f"presence:{user_id}", self.TIMEOUT)

    def user_disconnected(self, user_id):
        # Don't immediately mark offline (might reconnect)
        self.redis.hset(f"presence:{user_id}",
                        "status", "away")
        # Schedule offline check after grace period
        self.scheduler.schedule(
            delay=30,
            task=self._check_still_offline,
            args=(user_id,)
        )

    def is_online(self, user_id):
        data = self.redis.hgetall(f"presence:{user_id}")
        if not data:
            return False
        return (time.time() - float(data["last_seen"])) < self.TIMEOUT

    def _broadcast_status(self, user_id, status):
        # Only broadcast to users who have this user in their contacts
        contacts = self.get_contacts(user_id)
        for contact_id in contacts:
            self.pubsub.publish(
                channel=f"user:{contact_id}",
                message={"type": "presence", "userId": user_id, "status": status}
            )
Strong candidate discusses:
- Heartbeat mechanism vs connection-based detection
- Grace period before marking offline
- Fan-out problem for popular users (hundreds of contacts)
- Lazy presence (only check when user opens a conversation)
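One way to implement lazy presence is a batch lookup done only when a user opens a conversation, rather than broadcasting every status change to every contact. A sketch with a plain dict standing in for the Redis presence hashes above (`presence_map`, its entry shape, and the treatment of "away" as offline are assumptions for illustration):

```python
import time

STALE_AFTER = 90  # seconds, matching the heartbeat TIMEOUT above

def batch_presence(presence_map, member_ids, now=None):
    """Resolve online/offline for a conversation's member list in one
    pass, instead of pushing every status change to every contact."""
    now = now if now is not None else time.time()
    result = {}
    for uid in member_ids:
        entry = presence_map.get(uid)
        # Fresh means a heartbeat landed within the timeout window.
        fresh = entry and (now - entry["last_seen"]) < STALE_AFTER
        online = bool(fresh) and entry["status"] == "online"
        result[uid] = "online" if online else "offline"
    return result
```

Against Redis this would be a pipelined batch of HGETALLs; the point is that the cost is paid per conversation open, not per status change per contact.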
Read Receipts and Typing Indicators
# Read receipts: persistent (stored in DB)
def mark_read(user_id, conversation_id, message_id):
    db.update("conversation_members",
              set={"last_read_message_id": message_id},
              where={"conversation_id": conversation_id,
                     "user_id": user_id})
    # Notify other members
    pubsub.publish(f"conv:{conversation_id}", {
        "type": "read_receipt",
        "userId": user_id,
        "lastReadMessageId": message_id
    })

# Typing indicators: ephemeral (never stored)
def typing_started(user_id, conversation_id):
    pubsub.publish(f"conv:{conversation_id}", {
        "type": "typing",
        "userId": user_id,
        "status": "started"
    })
    # Auto-expire after 5 seconds (in case the stop event is lost)
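The auto-expiry can be enforced on the receiving client: each typing event refreshes a per-user deadline, and the indicator is rendered only while that deadline lies in the future, so a lost "stopped" event self-heals. A minimal sketch (`TypingTracker` is a hypothetical client-side helper, not part of the design above):

```python
import time

TYPING_TTL = 5  # seconds; indicator expires even if "stopped" is lost

class TypingTracker:
    def __init__(self):
        self._deadlines = {}          # user_id -> expiry timestamp

    def on_event(self, user_id, status, now=None):
        now = now if now is not None else time.time()
        if status == "started":
            self._deadlines[user_id] = now + TYPING_TTL
        else:                         # "stopped"
            self._deadlines.pop(user_id, None)

    def currently_typing(self, now=None):
        """Users whose typing deadline has not yet passed."""
        now = now if now is not None else time.time()
        return [u for u, t in self._deadlines.items() if t > now]
```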
Message Ordering
Challenge: Ensuring messages appear in correct order across devices
Approach: Server-assigned timestamps + sequence numbers
class MessageOrderer:
    def assign_order(self, conversation_id, message):
        # Atomic increment per conversation
        seq = self.redis.incr(f"seq:{conversation_id}")
        message.sequence_number = seq
        message.server_timestamp = time.time_ns()
        return message

    def resolve_conflicts(self, messages):
        # Sort by sequence number (primary),
        # then by server timestamp (secondary)
        return sorted(messages,
                      key=lambda m: (m.sequence_number, m.server_timestamp))
Strong candidate discusses:
- Client-side vs server-side timestamps
- Causal ordering vs total ordering
- Handling out-of-order delivery on client
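On the client, out-of-order delivery is commonly handled with a small reorder buffer keyed on the server-assigned sequence number: messages are released to the UI strictly in order, and a persistent gap triggers a re-fetch from the history API. A sketch (`ReorderBuffer` is illustrative; the re-fetch itself is left to the caller):

```python
class ReorderBuffer:
    """Releases messages in sequence order; buffers anything that
    arrives ahead of a gap."""
    def __init__(self, last_seq=0):
        self.last_seq = last_seq      # highest seq delivered to the UI
        self.pending = {}             # seq -> message, waiting on a gap

    def receive(self, seq, message):
        """Return the messages now safe to display, in order."""
        if seq <= self.last_seq:
            return []                 # duplicate (at-least-once delivery)
        self.pending[seq] = message
        released = []
        while self.last_seq + 1 in self.pending:
            self.last_seq += 1
            released.append(self.pending.pop(self.last_seq))
        return released

    def missing_range(self):
        """Seqs to re-fetch from the history API if a gap persists."""
        if not self.pending:
            return []
        return list(range(self.last_seq + 1, min(self.pending)))
```

The same structure doubles as the client-side deduplicator, since anything at or below `last_seq` has already been displayed.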
Step 4: Scale and Edge Cases (5-10 min)
Scaling WebSocket Connections
Problem: A single server tops out around ~500K concurrent WebSocket connections
Solution: WebSocket Gateway Cluster
┌────────────────────────────────────────────────┐
│               Load Balancer (L4)               │
│          (Sticky sessions by user_id)          │
└──────┬──────────┬──────────┬──────────┬────────┘
       │          │          │          │
  ┌────▼───┐ ┌────▼───┐ ┌────▼───┐ ┌────▼───┐
  │ WS GW  │ │ WS GW  │ │ WS GW  │ │ WS GW  │
  │  500K  │ │  500K  │ │  500K  │ │  500K  │
  └────┬───┘ └────┬───┘ └────┬───┘ └────┬───┘
       │          │          │          │
       └──────────┴────┬─────┴──────────┘
                       │
             ┌─────────▼─────────┐
             │   Redis Pub/Sub   │
             │   (Message Bus)   │
             └───────────────────┘
Connection registry (which user is on which server):
class ConnectionRegistry:
    def register(self, user_id, server_id):
        self.redis.sadd(f"connections:{user_id}", server_id)

    def unregister(self, user_id, server_id):
        self.redis.srem(f"connections:{user_id}", server_id)

    def get_servers(self, user_id):
        return self.redis.smembers(f"connections:{user_id}")

    def route_message(self, user_id, message):
        servers = self.get_servers(user_id)
        for server_id in servers:
            self.pubsub.publish(f"server:{server_id}", {
                "target_user": user_id,
                "message": message
            })
Group Message Fan-Out
Problem: A message to a 500-person group means 499 deliveries
def fan_out_group_message(conversation_id, message):
    members = get_members(conversation_id)
    if len(members) <= 50:
        # Small group: fan-out on write (push to each member)
        for member_id in members:
            deliver_to_user(member_id, message)
    else:
        # Large group: fan-out on read (members pull when online)
        store_in_conversation_feed(conversation_id, message)
        # Only push notification to online + mentioned users
        online = [m for m in members if is_online(m)]
        mentioned = extract_mentions(message.content)
        notify_users = set(online + mentioned)
        for user_id in notify_users:
            deliver_to_user(user_id, message)
Offline Message Sync
def sync_messages(user_id, last_sync_timestamp):
    """Called when a user comes back online"""
    conversations = get_user_conversations(user_id)
    unread = {}
    for conv_id in conversations:
        last_read = get_last_read_message(user_id, conv_id)
        new_messages = get_messages_after(conv_id, last_read,
                                          limit=50)
        if new_messages:
            unread[conv_id] = {
                "messages": new_messages,
                "unread_count": count_unread(conv_id, last_read)
            }
    return unread
Edge Cases
Strong candidates identify:
- Network partitions (messages sent but not acknowledged)
- Device sync (user on phone and laptop simultaneously)
- Large media files (separate upload flow with CDN)
- Spam and abuse (rate limiting, content moderation)
- Message deletion propagation across all devices
- Clock skew between servers
Evaluation Rubric
Strong Performance (Hire)
- Chooses WebSockets with proper justification
- Designs for message ordering and delivery guarantees
- Handles presence efficiently at scale
- Considers fan-out strategies for groups
- Discusses offline sync and push notifications
- Clear separation of real-time vs persistent data
- Mentions security (encryption, auth)
Adequate Performance (Maybe)
- Functional design with WebSockets
- Basic message storage and retrieval
- Some scaling considerations
- Misses edge cases like offline sync or ordering
- Can be guided toward better solutions
Weak Performance (No Hire)
- Only considers HTTP polling
- No thought given to delivery guarantees
- Doesn't address group messaging challenges
- Can't reason about connection management at scale
- Poor data model choices
Follow-up Questions
For senior candidates:
- How would you implement end-to-end encryption?
- Design the notification system in detail
- How would you handle message search across billions of messages?
- How would you implement message reactions and threads?
For staff+ candidates:
- Design the infrastructure for global deployment with <100ms latency
- How would you handle compliance (message retention, legal holds)?
- Design the system for 500M DAU
- How would you implement real-time translation?
This question tests real-time systems design, pub/sub patterns, presence management, and data consistency under concurrent writes. A strong candidate will balance latency requirements with delivery guarantees while maintaining clear system boundaries.