How to Design a Chat System (Real-Time Messaging) - Complete System Design Guide
20 min · October 10, 2025 · System Designer · hard


Master chat system design for your next big tech interview. Learn WebSockets, message delivery guarantees, Cassandra schema design, fan-out for group chats, presence systems, and how to scale to billions of messages a day.

Design a Chat System (Real-Time Messaging)

Designing a real-time chat system is one of the most frequently asked system design questions at big tech companies — and for good reason. It's deceptively complex. The surface-level version is easy: user A sends a message, user B receives it. But the moment you start thinking about scale, delivery guarantees, group chats, offline users, and message ordering, you're deep in distributed systems territory.

WhatsApp processes over 100 billion messages every single day for 2+ billion users. Facebook Messenger handles billions more. These aren't just chat apps — they're some of the most demanding real-time distributed systems ever built.

This guide walks you through designing a production-grade chat system from the ground up: the right protocol choices, how to guarantee message delivery, how group messaging actually works at scale, and the database design decisions that separate a good answer from a great one.


Problem Statement

Design a real-time messaging system that can:

  • Support one-on-one and group conversations
  • Deliver messages with low latency to online users
  • Reliably deliver messages to users who are offline, when they come back online
  • Show message delivery status (sent, delivered, read)
  • Show user presence (online/offline/last seen)
  • Scale to hundreds of millions of daily active users

Requirements Gathering

The first thing an interviewer wants to see is that you can scope the problem intelligently. Don't start drawing boxes before you've asked questions.

Functional Requirements

  • 1:1 messaging: Two users can exchange text messages in real time
  • Group messaging: Users can create group conversations (up to 100 members is a reasonable starting scope)
  • Message persistence: Chat history is stored and retrievable when a user opens an old conversation
  • Delivery receipts: Messages show sent (✓), delivered (✓✓), and read (✓✓ in blue) status
  • Presence: Users can see if others are online or when they were last active
  • Media sharing: Images, videos, and files (treat as a separate service; don't let it dominate the interview)
  • Push notifications: Offline users receive notifications so they come back online

Non-Functional Requirements

  • Low latency: Messages should be delivered within 500ms for online users
  • High availability: 99.99% uptime; the system should never lose a sent message
  • Consistency over availability for messages: If a user can't see a message right now, that's okay — but a message must never be permanently lost
  • Scalability: Handle 500M daily active users and tens of billions of messages per day
  • Durability: Messages should be stored persistently (minimum 5 years is common)

Back-of-the-Envelope Calculations

Let's ground this design in real numbers.

plaintext
Assumptions:
  Total registered users: 1 billion
  Daily Active Users (DAU): 500 million
  Average messages sent per user per day: 20
  Average message size (text): 100 bytes
  Peak multiplier: 3x (morning/evening spikes)
 
Message Throughput:
  Average: 500M × 20 / 86,400s ≈ 115,700 messages/second
  Peak:    ~350,000 messages/second
 
Concurrent WebSocket Connections:
  Assume 10% of DAU are online at any moment
  = 50 million concurrent connections
 
Storage (messages only):
  Per day: 500M users × 20 messages × 100 bytes ≈ 1 TB/day
  Per year: ~365 TB
  5-year retention: ~1.8 PB
 
  With metadata (sender, timestamp, status, receipts): ~2–3x
  Realistic 5-year storage: ~4–5 PB

These numbers have two important implications:

  1. At 50 million concurrent WebSocket connections, you need a lot of chat servers. A modern server can hold roughly 50,000–100,000 WebSocket connections (each connection uses ~20–50KB of memory depending on the stack). At 50K connections per server, you need ~1,000 chat servers at peak.

  2. At 1TB of messages per day with an append-heavy, write-intensive pattern, a traditional relational database is not the right primary store for messages. This will drive the database selection.
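As a sanity check, the arithmetic above is easy to reproduce (same assumptions as the plaintext block; the variable names are just for illustration):

```python
# Back-of-the-envelope capacity check, using the assumptions stated above.
DAU = 500_000_000
MSGS_PER_USER_PER_DAY = 20
MSG_SIZE_BYTES = 100
PEAK_MULTIPLIER = 3
SECONDS_PER_DAY = 86_400

avg_msgs_per_sec = DAU * MSGS_PER_USER_PER_DAY / SECONDS_PER_DAY
peak_msgs_per_sec = avg_msgs_per_sec * PEAK_MULTIPLIER

concurrent_connections = DAU * 0.10                # assume 10% online at once
servers_needed = concurrent_connections / 50_000   # ~50K WS connections/server

storage_per_day_tb = DAU * MSGS_PER_USER_PER_DAY * MSG_SIZE_BYTES / 1e12

print(f"{avg_msgs_per_sec:,.0f} msg/s avg, {peak_msgs_per_sec:,.0f} msg/s peak")
print(f"{servers_needed:,.0f} chat servers, {storage_per_day_tb:.1f} TB/day")
```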


The Core Challenge: Pushing Messages in Real Time

This is the heart of the problem and the first thing worth nailing down before touching anything else. The fundamental challenge is: HTTP is request-driven. Clients initiate requests, servers respond. But in a chat app, the server needs to push a message to User B the moment User A sends it — without User B asking.

There are three approaches to this. Understanding all three (and why two of them fall short) is the kind of depth interviewers are looking for.

Option 1: Short Polling

The client sends a GET request to the server every N seconds asking "any new messages?"

plaintext
Client → Server: "Any new messages?" (t=0s)
Server → Client: "No."
Client → Server: "Any new messages?" (t=2s)
Server → Client: "No."
Client → Server: "Any new messages?" (t=4s)
Server → Client: "Yes, here's one from User A."

Why this fails at scale: If you're polling every 2 seconds with 500 million users, you're generating 250 million HTTP requests per second doing nothing but saying "nope." It's massively wasteful and adds up to 2 seconds of delivery delay. Eliminate this immediately in your interview.

Option 2: Long Polling

The client sends a request, and the server holds it open until a message arrives (or a timeout, usually 20–30 seconds). When a message arrives, the server responds, and the client immediately opens a new long-poll connection.

plaintext
Client → Server: "Any new messages?" (connection held open)
[20 seconds pass]
Server → Client: "Here's a message from User A." (responds when message arrives)
Client → Server: [immediately opens new long-poll connection]

Better than polling, but still has problems:

  • Still half-duplex (server can only respond, not push freely)
  • On timeout, the client has to re-establish the connection — costly at scale
  • Harder to handle multiple messages arriving simultaneously
  • Each "connection" is actually a new HTTP request under the hood

Option 3: WebSockets (The Right Answer)

WebSockets establish a persistent, full-duplex TCP connection between client and server. Either side can send data at any time. The connection is established once via an HTTP upgrade handshake, then stays open.

plaintext
Client → Server: HTTP Upgrade request (one-time handshake)
Server → Client: 101 Switching Protocols
 
[Connection is now a persistent, bidirectional channel]
 
User A sends message → Client A → Server → Client B (instant push)
Server sends typing indicator → Client B (instant push)
Server sends delivery receipt → Client A (instant push)

WebSockets are the standard for real-time chat. They're used by WhatsApp, Slack, Discord, and essentially every production chat system. The latency overhead is negligible once the connection is established.

One important nuance for your interview: WebSocket connections are stateful and sticky. A user's connection lives on a specific chat server, so you can't balance individual messages across servers the way stateless HTTP requests are routed. The client must stay connected to the same server for the duration of the session (an L4 load balancer, or an L7 balancer with session affinity, handles only the initial connection). Mention this, and explain how you handle message routing across servers (covered below).

The polling → long polling → WebSockets progression is one of those explanations that flows naturally once you've explained it a few times — and sounds halting the first time you try it under pressure. If you want to get the fluency before the real interview, Mockingly.ai has chat system design simulations where the interviewer will probe exactly this choice.


High-Level Architecture

plaintext
                         ┌───────────────────────┐
                         │     API Gateway /      │
                         │     Load Balancer       │
                         │  (L4 for WebSockets,   │
                         │   L7 for REST APIs)    │
                         └──────────┬────────────┘

         ┌──────────────────────────┼──────────────────────────┐
         │                          │                          │
┌────────▼─────────┐      ┌─────────▼────────┐      ┌─────────▼────────┐
│   Chat Server 1  │      │   Chat Server 2  │      │   Chat Server N  │
│  (WS connections │      │  (WS connections │      │  (WS connections │
│   for users A–M) │      │   for users N–Z) │      │   for overflow)  │
└────────┬─────────┘      └─────────┬────────┘      └─────────┬────────┘
         │                          │                          │
         └──────────────────────────┼──────────────────────────┘

              ┌─────────────────────┼─────────────────────┐
              │                     │                     │
    ┌─────────▼────────┐  ┌─────────▼────────┐  ┌────────▼──────────┐
    │  Redis Pub/Sub   │  │  Session Service  │  │   Message Queue   │
    │  (Cross-server   │  │  (Who's on which  │  │   (Kafka)         │
    │   message fanout)│  │   chat server?)   │  │                   │
    └──────────────────┘  └──────────────────┘  └────────┬──────────┘

              ┌───────────────────────────────────────────┼───────────────┐
              │                                           │               │
   ┌──────────▼──────────┐                   ┌───────────▼────────┐  ┌───▼──────────┐
   │   Message Storage   │                   │   Metadata DB      │  │  Notification│
   │   (Cassandra)       │                   │   (MySQL/Postgres) │  │  Service     │
   │   Chat history,     │                   │   Users, Groups,   │  │  (APNs/FCM)  │
   │   message content   │                   │   Relationships    │  └──────────────┘
   └─────────────────────┘                   └────────────────────┘

Core Components Deep Dive

Chat Servers

Chat servers are the heart of the system. Each one maintains persistent WebSocket connections to a subset of online users. When User A sends a message to User B:

  1. The message arrives at the chat server User A is connected to (let's call it Server 1)
  2. Server 1 persists the message asynchronously via Kafka
  3. Server 1 needs to deliver the message to User B

Step 3 is where it gets interesting. User B might be connected to a different chat server (Server 2). Server 1 can't directly push to User B's connection — it lives on Server 2. The solution is Redis Pub/Sub.

Each chat server subscribes to a Redis channel. When Server 1 receives a message for User B, it:

  1. Looks up which server User B is connected to (via the Session Service)
  2. Publishes the message to that server's Redis channel
  3. Server 2 receives the publish event and pushes the message to User B's WebSocket connection
python
# On Server 1 (sender's server)
def handle_incoming_message(sender_id, recipient_id, message):
    # Persist asynchronously
    kafka.publish("messages", message)
    
    # Find recipient's server
    recipient_server = session_service.get_server(recipient_id)
    
    if recipient_server:
        # Recipient is online — route via Redis Pub/Sub
        redis.publish(f"server:{recipient_server}", {
            "type": "new_message",
            "recipient_id": recipient_id,
            "message": message
        })
    else:
        # Recipient is offline — queue for later + trigger push notification
        offline_queue.enqueue(recipient_id, message)
        notification_service.send_push(recipient_id, message)

Session Service

The Session Service tracks which chat server each user is connected to. It's a simple key-value store in Redis:

plaintext
user:{user_id}:server → "chat-server-42"
user:{user_id}:last_seen → 1696118400

When a user connects to a chat server, the server registers the mapping. When they disconnect, it clears it. This is also where presence data lives (covered later).
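A minimal sketch of that mapping, with an in-memory dict standing in for Redis (the class and method names here are illustrative, not from any specific codebase):

```python
import time

class SessionService:
    """Tracks which chat server each user's WebSocket lives on.
    Production would use Redis (SET/GET/DEL); a dict stands in here."""

    def __init__(self):
        self._server = {}     # user_id -> chat server name
        self._last_seen = {}  # user_id -> unix timestamp

    def register(self, user_id, server_name):
        # Called by a chat server when the user's WebSocket connects.
        self._server[user_id] = server_name

    def disconnect(self, user_id):
        # Called on WebSocket close; record last_seen for presence.
        self._server.pop(user_id, None)
        self._last_seen[user_id] = int(time.time())

    def get_server(self, user_id):
        # Returns None if the user is offline.
        return self._server.get(user_id)

sessions = SessionService()
sessions.register("user_b", "chat-server-42")
assert sessions.get_server("user_b") == "chat-server-42"
sessions.disconnect("user_b")
assert sessions.get_server("user_b") is None
```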

Message Queue (Kafka)

Message persistence is decoupled from message delivery. When a message arrives at a chat server, it's published to a Kafka topic immediately and the sender gets an acknowledgement (the "sent" checkmark). A downstream consumer then writes the message to Cassandra.

This decoupling is critical for two reasons:

  1. Writing to Cassandra is slower than acknowledging via Kafka. Clients shouldn't wait for disk persistence before seeing their message delivered.
  2. If a chat server crashes mid-write, the Kafka message can be re-consumed. No messages are lost.

Database Design

Why Cassandra for Messages?

The message storage pattern is:

  • Write-heavy: Billions of messages written per day
  • Append-only: Messages are never updated (only delivery status fields change)
  • Time-ordered reads: "Get the last 50 messages in this conversation" is the primary query pattern
  • High cardinality: Billions of distinct conversation IDs

This is exactly what Cassandra was designed for. Its LSM-tree storage engine handles sequential writes at enormous throughput. And its partition model — where all messages in a conversation can be collocated on the same node — makes "fetch conversation history" queries blazingly fast.

A SQL database (PostgreSQL, MySQL) would work for small scale but would require significant sharding infrastructure to match Cassandra's write throughput. Mention this trade-off explicitly.

Cassandra Schema for Messages

sql
-- Core message table
CREATE TABLE messages (
    conversation_id  UUID,
    message_id       BIGINT,        -- Snowflake ID: sortable + globally unique
    sender_id        UUID,
    content          TEXT,
    message_type     TINYINT,       -- 0=text, 1=image, 2=video, 3=file
    media_url        TEXT,          -- null for text messages
    created_at       TIMESTAMP,
    
    PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
 
-- This schema means:
-- Partition key = conversation_id  → all messages in a conversation land on the same node
-- Clustering key = message_id DESC → newest messages first, no additional sort needed

Message IDs: Why Not Auto-Increment?

Auto-increment IDs are a single writer bottleneck and don't work across distributed nodes. The right answer here is a Snowflake ID — a 64-bit integer composed of:

plaintext
| 41 bits (timestamp ms) | 10 bits (machine ID) | 12 bits (sequence) |

This gives you:

  • ~69 years of unique timestamps before overflow
  • Up to 1,024 machines generating IDs independently
  • 4,096 unique IDs per millisecond per machine
  • IDs are naturally time-sortable — the ordering problem is solved for free

Twitter open-sourced Snowflake. Instagram built a Postgres-based variant. For any distributed system that needs globally unique, time-ordered IDs, Snowflake or a similar scheme is the right answer.
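A minimal Python sketch of the scheme, assuming Twitter's original epoch (any fixed epoch works) and the 41/10/12 bit split described above:

```python
import threading
import time

class Snowflake:
    """64-bit ID: 41-bit ms timestamp | 10-bit machine ID | 12-bit sequence."""

    EPOCH_MS = 1288834974657  # Twitter's original epoch; any fixed epoch works

    def __init__(self, machine_id):
        assert 0 <= machine_id < 1024  # must fit in 10 bits
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit wrap
                if self.sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now_ms <= self.last_ms:
                        now_ms = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now_ms
            return ((now_ms - self.EPOCH_MS) << 22) \
                   | (self.machine_id << 12) | self.sequence

gen = Snowflake(machine_id=7)
a, b = gen.next_id(), gen.next_id()
assert a < b  # IDs are naturally time-sortable
```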

Relational DB for Metadata

User profiles, group membership, and friend relationships are relational by nature and low-volume relative to messages. Keep these in MySQL or PostgreSQL:

sql
-- Users table
CREATE TABLE users (
    id           BIGINT PRIMARY KEY,
    phone_number VARCHAR(20) UNIQUE NOT NULL,
    display_name VARCHAR(100),
    avatar_url   TEXT,
    created_at   TIMESTAMP,
    last_seen    TIMESTAMP,
    INDEX idx_phone (phone_number)
);
 
-- Conversations table (1:1 and group)
CREATE TABLE conversations (
    id           UUID PRIMARY KEY,
    type         TINYINT NOT NULL,  -- 0=direct, 1=group
    name         VARCHAR(100),      -- null for 1:1 conversations
    created_at   TIMESTAMP,
    created_by   BIGINT
);
 
-- Conversation membership (who's in which conversation)
CREATE TABLE conversation_members (
    conversation_id  UUID,
    user_id          BIGINT,
    joined_at        TIMESTAMP,
    last_read_at     TIMESTAMP,     -- for unread count calculation
    PRIMARY KEY (conversation_id, user_id),
    INDEX idx_user_conversations (user_id)
);

The last_read_at field on conversation_members is an elegant way to compute unread counts: just count messages where message_id > snowflake(last_read_at).
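To make that comparison concrete: converting a timestamp into the smallest Snowflake ID that could exist at that millisecond gives the unread lower bound. A sketch (`snowflake_floor` and `unread_count` are illustrative helpers; the epoch must match your ID generator's):

```python
SNOWFLAKE_EPOCH_MS = 1288834974657  # must match the ID generator's custom epoch

def snowflake_floor(ts_ms):
    """Smallest Snowflake ID that could have been minted at timestamp ts_ms
    (machine-ID and sequence bits all zero)."""
    return (ts_ms - SNOWFLAKE_EPOCH_MS) << 22

def unread_count(message_ids, last_read_at_ms):
    """Unread = messages whose ID is newer than the last-read timestamp."""
    floor = snowflake_floor(last_read_at_ms)
    return sum(1 for mid in message_ids if mid > floor)
```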

The Cassandra schema and Snowflake ID combination is a detail that comes up in almost every senior chat system design interview — knowing why the schema is designed that way, not just what it looks like, is what earns the follow-up questions that let you demonstrate depth. Mockingly.ai has chat system design prompts specifically structured to get you to that depth.


Message Delivery Guarantees

This is a section most candidates skip, and it's where senior interviewers dig in. In any distributed system, messages can be lost, duplicated, or delivered out of order. Your system needs a clear strategy.

The Three Levels

  • At-most-once: Fire and forget. Message might not arrive. Never acceptable for a chat app.
  • At-least-once: Message is guaranteed to arrive, but might be delivered more than once (duplicates possible under retries/failures).
  • Exactly-once: Message arrives exactly once. Requires significant coordination overhead; true exactly-once is very hard to achieve end-to-end.

Practical recommendation: Design for at-least-once delivery + idempotent clients. This is what WhatsApp, iMessage, and Signal all do.

The mechanism: every message has a unique client-generated message ID. When the client sends a message, it stores it locally as "pending." The server persists it and sends an ACK with the server-assigned message ID. The client marks it "sent." If the ACK never arrives (network drop), the client retries — and because the message ID is unique, the server can detect and discard the duplicate.

plaintext
[Client] → sends {client_msg_id: "abc123", content: "Hey"} → [Server]
[Server] → persists message, replies {server_msg_id: "snowflake_id", client_msg_id: "abc123"} → [Client]
[Client] → marks message as "sent" ✓
 
[On retry after network failure:]
[Client] → resends {client_msg_id: "abc123", content: "Hey"} → [Server]
[Server] → detects duplicate via client_msg_id, discards second write, re-sends ACK → [Client]
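The server-side half of this flow is small. A sketch with an in-memory set standing in for the dedup store (in production this would be a Redis set with a TTL, or a unique-key constraint on the write path):

```python
seen_client_msg_ids = set()  # production: Redis set with TTL, or a DB unique key
persisted = []               # production: Kafka publish -> Cassandra write

def handle_send(client_msg_id, content):
    """At-least-once + idempotency: retries are safe because duplicates
    are detected by client_msg_id and only re-ACKed, never re-persisted."""
    if client_msg_id in seen_client_msg_ids:
        return {"status": "duplicate_ack", "client_msg_id": client_msg_id}
    seen_client_msg_ids.add(client_msg_id)
    persisted.append((client_msg_id, content))
    return {"status": "ack", "client_msg_id": client_msg_id}

handle_send("abc123", "Hey")  # first attempt: persisted + ACK
handle_send("abc123", "Hey")  # retry after lost ACK: no second write
assert len(persisted) == 1
```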

Delivery and Read Receipts

plaintext
Sent (✓):     Server has received and persisted the message
Delivered (✓✓): Recipient's device has received the message
Read (✓✓ blue): Recipient has opened the conversation and seen the message

Implementation:

  • Sent: Server sends ACK back to sender's WebSocket after persisting to Kafka
  • Delivered: When the message is pushed to the recipient's WebSocket, the recipient's client sends a DELIVERED event back to the server, which forwards it to the sender
  • Read: When the user opens the conversation, the client sends a READ event with the last-seen message_id

Store delivery status in Redis for recent messages (fast reads) and eventually persist to Cassandra as part of the message record.


Group Messaging: The Fan-Out Problem

Group messaging is where the architecture gets genuinely hard. When User A sends a message to a group with 100 members, that message needs to be delivered to all 99 other members — some online on different chat servers, some offline waiting for push notifications.

This is called the fan-out problem: one write needs to fan out to many readers.

Fan-Out on Write

When a message arrives for a group:

  1. The chat server receives the message and publishes it to Kafka
  2. A Group Message Consumer reads from Kafka
  3. It queries the Group Service for the member list
  4. For each member:
    • If online: publishes to their chat server's Redis channel for immediate WebSocket delivery
    • If offline: queues the message for delivery on reconnect, and triggers a push notification
python
def process_group_message(group_id, message):
    members = group_service.get_members(group_id)
    
    for member_id in members:
        if member_id == message.sender_id:
            continue
            
        server = session_service.get_server(member_id)
        
        if server:
            # Member is online
            redis.publish(f"server:{server}", {
                "recipient_id": member_id,
                "message": message
            })
        else:
            # Member is offline
            offline_queue.enqueue(member_id, message)
            notification_service.send_push(member_id, f"New message in {group_id}")

For groups of 100 members, this is manageable. But for very large groups (think: Slack channels with 10,000 members, or Telegram channels with millions), fan-out on write becomes very expensive.

For extremely large groups, consider fan-out on read: store only one copy of the message, and have clients pull it when they open the channel. This trades write scalability for read complexity and slightly higher latency. Mention this as a scaling discussion and note that most chat apps cap group size (WhatsApp: 1024, Messenger: 250) precisely to make fan-out tractable.
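A fan-out-on-read sketch, with dicts standing in for the message store and per-user read cursors (all names here are illustrative):

```python
# Fan-out on read: one stored copy per channel; each client pulls on open.
channel_messages = {}  # channel_id -> append-only list of (message_id, content)
read_cursor = {}       # (channel_id, user_id) -> last message_id pulled

def post(channel_id, message_id, content):
    # One write, regardless of member count; no per-member delivery work.
    channel_messages.setdefault(channel_id, []).append((message_id, content))

def pull(channel_id, user_id):
    """Client opens the channel and fetches everything past its cursor."""
    since = read_cursor.get((channel_id, user_id), 0)
    new = [(m, c) for m, c in channel_messages.get(channel_id, []) if m > since]
    if new:
        read_cursor[(channel_id, user_id)] = new[-1][0]
    return new

post("news", 1, "v1 released")
post("news", 2, "v1.1 hotfix")
assert pull("news", "carol") == [(1, "v1 released"), (2, "v1.1 hotfix")]
assert pull("news", "carol") == []  # cursor advanced, nothing new
```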

The fan-out problem is one of those topics where interviewers can clearly tell who has thought it through versus who memorised a diagram. Being asked "what if the group has 10,000 members?" mid-explanation and having a clean answer ready is the kind of thing that separates offer-level answers. Mockingly.ai puts you in exactly that position — follow-up questions mid-explanation, on the clock.


Presence System

The "last seen" and online indicator features seem simple from the outside. At scale, they're one of the trickiest parts of the system.

Heartbeat Approach

Clients send a heartbeat to the server every 5 seconds while the app is in the foreground. The server updates a Redis key:

plaintext
user:{user_id}:last_heartbeat → Unix timestamp

A user is considered "online" if their last heartbeat was less than 10–15 seconds ago. "Last seen" is the last heartbeat timestamp.

The Scale Problem

With 50 million concurrent users each sending a heartbeat every 5 seconds:

plaintext
50,000,000 users / 5 seconds = 10,000,000 heartbeats/second

That's 10 million Redis writes per second just for presence data, nearly 30x the peak message throughput.

Mitigations:

  • Dedicated presence servers that batch-write heartbeats instead of writing one at a time
  • Use Redis HyperLogLog for approximate online-count queries
  • Fan out presence changes selectively: only notify users who are in active conversations with the user whose status changed, not their entire contact list
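The core online/offline check is simple once the heartbeats land somewhere. A sketch with an in-memory dict in place of Redis, and an explicit clock parameter so the logic is testable:

```python
ONLINE_THRESHOLD_S = 15  # "online" = heartbeat within the last ~3 intervals

last_heartbeat = {}  # user_id -> unix ts; production: Redis, batch-written

def record_heartbeat(user_id, now):
    last_heartbeat[user_id] = now

def is_online(user_id, now):
    ts = last_heartbeat.get(user_id)
    return ts is not None and now - ts < ONLINE_THRESHOLD_S

def last_seen(user_id):
    return last_heartbeat.get(user_id)

record_heartbeat("alice", now=1000)
assert is_online("alice", now=1004)       # 4s ago: online
assert not is_online("alice", now=1020)   # 20s ago: show "last seen" instead
```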

A common interview follow-up is "how would you design the presence fan-out?" The answer: don't broadcast to everyone. When User A's status changes, query only the users who are currently viewing a conversation with A, and push only to them.


Handling Offline Users

When a user is offline, messages destined for them need to be held somewhere and delivered when they reconnect.

Storage

Store pending messages in a dedicated table, keyed by user_id:

sql
-- Pending deliveries table (in Redis or a fast KV store)
-- Key: user:{user_id}:pending_messages
-- Value: sorted set of message_ids, scored by timestamp
 
-- On reconnect:
ZRANGE user:{user_id}:pending_messages 0 -1  -- fetch all pending

Set a TTL on pending messages (30 days is standard). If a user doesn't reconnect within 30 days, they can pull history directly from Cassandra when they do come back.
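A sketch of the pending queue's semantics, with a sorted in-memory list standing in for the Redis sorted set (ZADD on enqueue, ZRANGE plus delete on reconnect):

```python
import bisect

pending = {}  # user_id -> list of (timestamp, message_id), kept sorted;
              # production: Redis sorted set with a 30-day TTL

def enqueue_pending(user_id, message_id, ts):
    # Insert in timestamp order (ZADD equivalent).
    bisect.insort(pending.setdefault(user_id, []), (ts, message_id))

def drain_pending(user_id):
    """On reconnect: deliver all pending messages oldest-first, then clear."""
    entries = pending.pop(user_id, [])
    return [mid for _, mid in entries]

enqueue_pending("bob", message_id=202, ts=1700000200)
enqueue_pending("bob", message_id=101, ts=1700000100)
assert drain_pending("bob") == [101, 202]  # oldest first
assert drain_pending("bob") == []          # queue cleared
```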

Push Notifications

For offline users, the system needs to wake up their device. This is handled via platform-native push notification services:

  • iOS: Apple Push Notification service (APNs)
  • Android: Firebase Cloud Messaging (FCM)

Important nuance: push notifications are not a reliable delivery mechanism for the actual message content — they're a signal to wake the device and fetch new messages. The content should be kept vague (e.g., "You have 3 new messages") to avoid exposing message content at the notification layer, which has weaker security guarantees than the app itself. This is especially important for end-to-end encrypted systems.


End-to-End Encryption (Mention This, Don't Overcomplicate It)

Interviewers often ask about security. Signal Protocol is the industry standard for E2E encryption (used by WhatsApp, Signal, and Google Messages). The key points:

  • Each device generates a public/private key pair
  • The server stores only public keys — it never sees plaintext messages
  • Messages are encrypted on the sender's device and decrypted only on the recipient's device
  • The server is a dumb pipe that routes ciphertext

For a system design interview, you don't need to go deep into the cryptographic primitives. What matters is showing you understand the architectural implication: the server cannot decrypt messages, which means you can't implement server-side search, content moderation on message bodies, or read-receipt logic based on content. Mention these trade-offs and move on.


Media Sharing

Don't spend more than 2–3 minutes on this in an interview. The pattern is straightforward:

  1. Client uploads media directly to object storage (S3 / GCS) via a pre-signed URL — the chat server is never in the media upload path
  2. The message contains a reference (URL) to the media, not the content itself
  3. A CDN sits in front of object storage for fast downloads
  4. Thumbnail generation runs asynchronously via a background job

The key insight: never funnel media through your chat servers. A 10MB video upload would saturate a chat server that should be spending all its capacity managing WebSocket connections.


Monitoring and Key Metrics

Real-time health metrics:

  • Message delivery latency (p50, p95, p99) — alert if p99 exceeds 500ms for online users
  • WebSocket connection count per chat server (for capacity planning)
  • Kafka consumer lag (indicates message processing is falling behind)
  • Redis memory utilization and hit rate
  • Failed delivery rate (messages that couldn't be delivered after retries)

Business metrics:

  • Daily/monthly active users
  • Messages sent per hour
  • Notification delivery success rate (APNs/FCM)
  • Average session duration (WebSocket connection lifetime)

Common Interview Follow-ups

"How do you handle message ordering in a distributed system?"

Messages within a single conversation are ordered by their Snowflake message ID. Since Snowflake IDs encode a millisecond timestamp, they're naturally sortable by time. Within the same millisecond, the sequence number resolves ordering. For conversations between users in different timezones or with clock-skewed devices, always use the server-assigned Snowflake ID as the canonical order — never the client's timestamp.

"How would you scale to 1 billion concurrent connections?"

Horizontally scale chat servers. Each server holds 50K–100K connections. At 1 billion connections, you'd need 10,000–20,000 chat servers. The routing layer (Redis Pub/Sub + Session Service) needs to scale too — shard Redis by user ID range. At this scale, a company like WhatsApp or Facebook would also likely build custom infrastructure at the networking layer (Erlang's actor model was famously suited to this, which is why WhatsApp ran on Erlang for years).

"How would you handle a chat server crashing with active connections?"

When a chat server dies, all its connected clients detect the connection drop via a heartbeat timeout (WebSocket keep-alive pings). Clients automatically attempt reconnection — the load balancer routes them to a healthy server. In-flight messages that were received by the dead server but not yet persisted to Kafka need to be handled: this is why message persistence should happen before the ACK is sent to the client. If the server dies before ACKing, the client retries from its "pending" local queue.

"What if two users send a message to each other at exactly the same time?"

This is a write conflict in the conversation timeline. With Cassandra's partition-key-based model, both messages land in the same partition (keyed by conversation_id). Cassandra handles concurrent writes with last-write-wins at the cell level — since each message is its own row (with a unique Snowflake ID as clustering key), there's no conflict. Both messages are written, and their Snowflake IDs determine the canonical ordering.

"How does read status work for group messages?"

It's a many-to-many tracking problem. For a group of 100 members, each message needs up to 99 delivery/read receipts. You can't store this as a single status flag on the message. Store receipts as separate rows in a receipt table, keyed by (message_id, user_id). For a group of 100, WhatsApp shows "delivered to N people" rather than individual receipts — this is a UX decision that also happens to simplify the data model.
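A sketch of that receipt model, with a dict standing in for the table keyed by (message_id, user_id):

```python
receipts = {}  # (message_id, user_id) -> status
               # production: a receipt table keyed by (message_id, user_id)

DELIVERED, READ = 1, 2

def record_receipt(message_id, user_id, status):
    # Receipts only move forward: delivered -> read, never back.
    key = (message_id, user_id)
    receipts[key] = max(receipts.get(key, 0), status)

def delivery_summary(message_id, member_ids):
    """'Delivered to N, read by M': the aggregate WhatsApp-style UX."""
    statuses = [receipts.get((message_id, u), 0) for u in member_ids]
    return {
        "delivered": sum(1 for s in statuses if s >= DELIVERED),
        "read": sum(1 for s in statuses if s >= READ),
    }

record_receipt(1001, "u1", DELIVERED)
record_receipt(1001, "u2", READ)
assert delivery_summary(1001, ["u1", "u2", "u3"]) == {"delivered": 2, "read": 1}
```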


Quick Interview Checklist

Before wrapping up your answer, make sure you've covered these:

  • ✅ Clarified scope: 1:1 only, or groups too? Group size limit? Media?
  • ✅ Explained why WebSockets beat polling (and briefly acknowledged long polling)
  • ✅ Described how messages are routed between users on different chat servers (Redis Pub/Sub)
  • ✅ Chosen Cassandra for message storage and explained why (write throughput, time-ordered reads)
  • ✅ Explained message ordering with Snowflake IDs
  • ✅ Covered message delivery guarantees (at-least-once + idempotent deduplication)
  • ✅ Addressed delivery receipts (sent, delivered, read)
  • ✅ Explained the fan-out problem for groups and how to handle it
  • ✅ Covered offline users: message queue + APNs/FCM
  • ✅ Briefly mentioned the presence system and its scaling challenges
  • ✅ Mentioned E2E encryption at a high level without going down a cryptography rabbit hole

Conclusion

Designing a chat system tests more distributed systems fundamentals per square inch than almost any other interview question. The reason companies like Meta, Google, Amazon, and Apple use it is that a strong answer requires you to simultaneously reason about real-time protocols, message ordering, delivery guarantees, database selection, fan-out patterns, and failure handling.

The key design pillars:

  1. WebSockets for real-time, bidirectional communication — not polling
  2. Redis Pub/Sub to route messages across chat servers when sender and recipient are on different nodes
  3. Cassandra for message storage because of its write throughput and natural fit for time-ordered, partition-keyed data
  4. Snowflake IDs for message ordering that's globally unique and distributed-friendly
  5. At-least-once delivery with client-side deduplication — the right pragmatic trade-off between reliability and complexity
  6. Kafka to decouple message receipt from message persistence, protecting against data loss on server failure

The best candidates don't just describe these choices — they explain the alternatives they rejected and why. That reasoning is what separates a senior engineer's answer from a mid-level one.



Frequently Asked Questions

Why use WebSockets instead of HTTP polling for a chat app?

WebSockets provide a persistent, full-duplex TCP connection where either side can send data at any time. HTTP polling makes repeated requests on a timer, wasting bandwidth and adding latency.

The three options compared:

plaintext
                   Short Polling                     Long Polling                          WebSockets
Connection         New HTTP request every N seconds  Request held open until data arrives  Single persistent connection
Latency            Up to N seconds delay             Lower, but still one round-trip       Near-instant push
Server load        Extremely high (empty responses)  Moderate                              Low (idle connections are cheap)
Duplex             Half (client initiates)           Half (client initiates)               Full (both sides push freely)
Battery (mobile)   Very poor                         Poor                                  Good

Why short polling fails at scale:

At 500 million users polling every 2 seconds, the system processes 250 million requests/second doing nothing but returning empty responses. Dismiss this immediately in your interview.

Why WebSockets are the right answer:

  1. Once established via a one-time HTTP upgrade handshake, the connection stays open indefinitely
  2. The server pushes messages, receipts, and typing indicators without the client asking
  3. Latency drops from seconds to milliseconds
  4. Used by Slack, Discord, and most browser-based chat systems; WhatsApp and iMessage achieve the same effect with their own persistent-socket protocols
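The one-time upgrade handshake in step 1 is defined by RFC 6455: the client sends a random `Sec-WebSocket-Key`, and the server proves it understood the upgrade by hashing that key with a fixed GUID. A minimal sketch of the server-side computation:

```python
# Sketch: how a server computes the Sec-WebSocket-Accept header during the
# one-time HTTP upgrade handshake (RFC 6455). The GUID is fixed by the spec.
import base64
import hashlib

WS_MAGIC_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(sec_websocket_key: str) -> str:
    """Return the Sec-WebSocket-Accept value for an upgrade response."""
    digest = hashlib.sha1((sec_websocket_key + WS_MAGIC_GUID).encode()).digest()
    return base64.b64encode(digest).decode()

# The example key from RFC 6455 section 1.3:
accept = websocket_accept("dGhlIHNhbXBsZSBub25jZQ==")
```

After this single request/response exchange, the connection switches from HTTP to the WebSocket framing protocol and stays open.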

How does a message get from User A's chat server to User B's chat server?

Redis Pub/Sub is the standard cross-server routing mechanism. Each chat server subscribes to its own Redis channel. When a message must be delivered to a user on a different server, it is published to that server's channel.

The flow step by step:

  1. User A sends a message — it arrives at Chat Server 1 (where A is connected)
  2. Server 1 looks up User B in the Session Service: GET user:B:server → "chat-server-7"
  3. Server 1 publishes the message to Redis channel server:chat-server-7
  4. Chat Server 7 is subscribed to that channel — it receives the event
  5. Server 7 pushes the message to User B's WebSocket connection
  6. User B's client receives the message and sends a DELIVERED ACK back
  7. The ACK travels the same path in reverse to reach User A

If User B is offline:

Server 1 finds no session entry for User B. Instead of publishing to Redis, it adds the message to User B's offline queue and triggers a push notification via APNs (iOS) or FCM (Android).
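The routing flow above, including the offline branch, can be sketched with in-memory stand-ins for the Session Service, the Redis channels, and the push gateway. All names here are illustrative, not from any real codebase:

```python
# Sketch (in-memory stand-in for Redis Pub/Sub): route a message to the
# recipient's chat server, or queue it offline with a push notification
# when no session entry exists.
from collections import defaultdict

session = {}                          # user_id -> server_id (Session Service)
server_channel = defaultdict(list)    # server_id -> inbox (Redis channel stand-in)
offline_queue = defaultdict(list)     # user_id -> undelivered messages
push_notifications = []               # APNs/FCM stand-in

def route_message(recipient: str, message: dict) -> str:
    server_id = session.get(recipient)          # step 2: GET user:B:server
    if server_id is None:                       # recipient has no live session
        offline_queue[recipient].append(message)
        push_notifications.append(recipient)    # trigger APNs/FCM
        return "queued_offline"
    server_channel[server_id].append(message)   # step 3: PUBLISH to server channel
    return f"published_to:{server_id}"

session["user_b"] = "chat-server-7"
result = route_message("user_b", {"from": "user_a", "text": "hi"})
offline_result = route_message("user_c", {"from": "user_a", "text": "hello"})
```

In production, `server_channel` is a Redis `PUBLISH`/`SUBSCRIBE` pair and the ACK travels back through the same lookup in reverse.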


Why is Cassandra used for chat message storage instead of MySQL or PostgreSQL?

Cassandra is purpose-built for the exact access pattern chat generates. Relational databases handle it at small scale but require significant sharding infrastructure to reach Cassandra's native throughput.

The chat message storage access pattern:

  1. Write-heavy — billions of inserts per day, never updates
  2. Append-only — new messages always arrive at the "now" end of a conversation
  3. Time-ordered reads — "get the last 50 messages in this conversation" is the dominant query
  4. High cardinality — billions of distinct conversation_id values

Why Cassandra fits:

  1. LSM-tree storage engine — writes are sequential appends to a log, not random-access B-tree updates. Sequential writes are orders of magnitude faster at high throughput
  2. Partition model — setting conversation_id as the partition key collocates all messages for a conversation on the same node. "Fetch last 50 messages" becomes a single-node range scan, not a distributed join
  3. Clustering order — CLUSTERING ORDER BY (message_id DESC) means the query result requires no additional sorting
  4. Horizontal scale — adding nodes redistributes load automatically via consistent hashing

What you give up with Cassandra:

  1. No flexible ad-hoc queries (no JOIN, limited WHERE options)
  2. Eventual consistency by default (tunable per-query)
  3. No native full-text search — needs Elasticsearch for that use case
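A plausible table definition and the dominant query can be sketched as follows — the CQL is an illustrative schema (column names are assumptions), and the Python models the partition/clustering behavior in memory rather than talking to a real cluster:

```python
# Illustrative CQL for the access pattern described above (not from any
# specific codebase): one partition per conversation, newest message first.
MESSAGES_TABLE_CQL = """
CREATE TABLE messages (
    conversation_id bigint,
    message_id      bigint,   -- Snowflake ID: time-ordered
    sender_id       bigint,
    body            text,
    PRIMARY KEY ((conversation_id), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
"""

# In-memory model of the same idea: "last 50 messages" is a slice of a
# single sorted partition, not a distributed query.
from collections import defaultdict
import bisect

partitions = defaultdict(list)  # conversation_id -> message_ids, ascending

def insert(conversation_id: int, message_id: int) -> None:
    bisect.insort(partitions[conversation_id], message_id)

def last_n(conversation_id: int, n: int = 50) -> list:
    # Single-partition range scan, returned in clustering order (DESC).
    return partitions[conversation_id][-n:][::-1]

for mid in [101, 102, 103, 104]:
    insert(42, mid)
recent = last_n(42, n=2)
```

Because the partition key is `conversation_id`, the slice in `last_n` corresponds to a single-node read in Cassandra.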

What is a Snowflake ID and why does chat use it for message ordering?

A Snowflake ID is a 64-bit integer that encodes a millisecond timestamp, a machine identifier, and a per-machine sequence number — making it globally unique and naturally sortable by creation time without any coordination between servers.

The 64-bit structure:

plaintext
| 1 bit (unused sign) | 41 bits (ms timestamp) | 10 bits (machine ID) | 12 bits (sequence) |

Why Snowflake IDs solve the message ordering problem:

  1. No central bottleneck — each chat server generates IDs independently with no coordination. Auto-increment requires a single writer; Snowflake does not
  2. Naturally time-ordered — newer messages always have larger IDs. Sorting by message_id DESC is equivalent to sorting by time
  3. Globally unique — the machine ID bits ensure two servers can never generate the same ID in the same millisecond
  4. Collision-free — 4,096 unique IDs per millisecond per machine supports very high per-server message throughput

Why not use the client's timestamp for ordering?

Client clocks are unreliable — they drift, can be set by the user, and differ across timezones. Always use the server-assigned Snowflake ID as the canonical message order, never the client-reported timestamp.
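A minimal Snowflake-style generator, following the bit layout above, can be sketched like this (the epoch constant is Twitter's original; any fixed epoch works, and clock-rollback handling is omitted for brevity):

```python
# Sketch of a Snowflake-style ID generator: 41-bit ms timestamp,
# 10-bit machine ID, 12-bit per-millisecond sequence.
import threading
import time

EPOCH_MS = 1288834974657  # arbitrary fixed epoch (Twitter's original)

class SnowflakeGenerator:
    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024      # must fit in 10 bits
        self.machine_id = machine_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit wrap
                if self.sequence == 0:                       # 4,096 used: wait 1 ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.machine_id << 12) | self.sequence

gen = SnowflakeGenerator(machine_id=7)
a, b = gen.next_id(), gen.next_id()  # b > a: IDs are time-ordered
```

Sorting by these IDs is equivalent to sorting by server-side creation time, which is exactly the property the clustering order relies on.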


What is at-least-once message delivery and why is it the right choice for chat?

At-least-once delivery guarantees a message will be delivered at least once but may be delivered more than once under failure scenarios. Exactly-once guarantees precisely one delivery but requires expensive distributed coordination.

The three delivery guarantees:

  1. At-most-once — fire and forget. Message may be lost. Never acceptable for chat
  2. At-least-once — guaranteed delivery, possible duplicates. Correct for chat when combined with deduplication
  3. Exactly-once — guaranteed single delivery. Very hard to achieve end-to-end; requires significant overhead

Why at-least-once + client deduplication is the right approach:

  1. Messages are never silently lost — the most important guarantee for a chat user
  2. Duplicates are rare (only on retry after failure) and cheap to handle client-side
  3. Every message carries a client-generated client_msg_id UUID. On retry, the server checks if it has already processed this ID — if yes, it returns the existing result without creating a duplicate
  4. The client deduplicates on display: if the same client_msg_id arrives twice, it is shown once

This is what WhatsApp, iMessage, and Signal all implement.
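The server-side half of this pattern — treating `client_msg_id` as an idempotency key — can be sketched in a few lines (class and method names are illustrative):

```python
# Sketch: at-least-once delivery made safe with a client-generated UUID.
# A retry of an already-processed send returns the original result
# instead of creating a duplicate message.
import uuid

class ChatServer:
    def __init__(self):
        self.processed = {}        # client_msg_id -> server-assigned message_id
        self.next_message_id = 1

    def handle_send(self, client_msg_id: str, text: str) -> int:
        if client_msg_id in self.processed:       # duplicate: retry after timeout
            return self.processed[client_msg_id]  # idempotent: same result
        message_id = self.next_message_id         # stand-in for a Snowflake ID
        self.next_message_id += 1
        self.processed[client_msg_id] = message_id
        return message_id

server = ChatServer()
msg_id = str(uuid.uuid4())
first = server.handle_send(msg_id, "hello")
retry = server.handle_send(msg_id, "hello")  # client timed out and resent
```

The client applies the same check on display: two pushes carrying the same `client_msg_id` render as one bubble.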


How does group chat fan-out work and what happens at large group sizes?

Fan-out is the process of delivering one message to many recipients. In a group chat, one write must reach every member — some online on different servers, some offline.

Fan-out on write (for groups under ~500 members):

  1. Message arrives at the sender's chat server and is published to Kafka
  2. A Group Message Consumer reads from Kafka and fetches the member list
  3. For each member: if online → publish to their server's Redis channel; if offline → add to offline queue and trigger push notification
  4. Each chat server delivers to its connected members via WebSocket

Fan-out on read (for very large groups / broadcast channels):

  1. Store one copy of the message
  2. Members fetch it when they open the channel
  3. No per-member write amplification — one write regardless of group size
  4. Trade-off: higher read latency when opening the channel; no real-time push for large groups

Why most chat apps cap group size:

  1. WhatsApp caps groups at 1,024 members; Messenger at 250
  2. Fan-out on write at 1,024 members = 1,024 Redis publishes + up to 1,024 push notifications per message
  3. Capping group size keeps fan-out tractable and delivery latency predictable
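The fan-out-on-write path above can be sketched as a loop over the member list, with in-memory stand-ins for the Redis publishes and push notifications (all names are illustrative):

```python
# Sketch: fan-out on write for one group message. Online members get a
# Redis publish to their server's channel; offline members get an offline
# queue entry plus a push notification.
from collections import defaultdict

session = {"alice": "server-1", "bob": "server-2"}  # online members only
members = ["alice", "bob", "carol", "dave"]         # carol and dave are offline

redis_publishes = defaultdict(list)
offline_queue = defaultdict(list)
pushes = []

def fan_out(sender: str, message: str) -> dict:
    stats = {"publishes": 0, "pushes": 0}
    for member in members:
        if member == sender:
            continue                                # don't echo to the sender
        server_id = session.get(member)
        if server_id:
            redis_publishes[server_id].append(message)
            stats["publishes"] += 1
        else:
            offline_queue[member].append(message)
            pushes.append(member)
            stats["pushes"] += 1
    return stats

stats = fan_out("alice", "meeting at 5")
```

The per-message cost grows linearly with group size, which is why the loop body is the thing group caps are protecting.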

How does the online presence system work at scale?

Online presence uses heartbeats from connected clients, stored as timestamps in Redis. A user is "online" if their last heartbeat arrived within a threshold window.

The mechanism:

  1. Connected clients send a heartbeat ping every 5 seconds via the WebSocket connection
  2. The server updates: SET user:{user_id}:last_heartbeat {unix_timestamp}
  3. A user is "online" if now - last_heartbeat < 15 seconds
  4. "Last seen" is the value of last_heartbeat at the moment the WebSocket disconnected

The scale challenge:

50 million concurrent users × 1 heartbeat every 5 seconds = 10 million Redis writes per second just for presence data. This requires:

  1. Dedicated presence servers that batch-write heartbeats (write one Redis call per 100 heartbeats rather than one per heartbeat)
  2. Selective fan-out — when User A's status changes, notify only users who are currently viewing a conversation with A. Broadcasting to all of A's contacts would generate enormous unnecessary traffic
  3. Redis TTL as fallback — set a 30-second TTL on the heartbeat key. If the client dies without sending a disconnect event, the key expires and A is automatically marked offline
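The core heartbeat check can be sketched with plain timestamps in place of Redis (threshold values match the text; in Redis the `SET` would carry a 30-second TTL as the fallback):

```python
# Sketch: heartbeat-based presence. A user is online if their last
# heartbeat arrived within the threshold window.
import time

ONLINE_THRESHOLD_S = 15
last_heartbeat = {}   # user_id -> unix timestamp of last ping

def record_heartbeat(user_id: str, now: float) -> None:
    # Redis equivalent: SET user:{id}:last_heartbeat {now} EX 30
    last_heartbeat[user_id] = now

def is_online(user_id: str, now: float) -> bool:
    ts = last_heartbeat.get(user_id)
    return ts is not None and (now - ts) < ONLINE_THRESHOLD_S

t0 = time.time()
record_heartbeat("alice", t0)
online_now = is_online("alice", t0 + 5)     # within threshold -> online
online_later = is_online("alice", t0 + 20)  # missed heartbeats -> offline
```

"Last seen" falls out of the same data: it is simply the stored timestamp once `is_online` starts returning false.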

How do sent, delivered, and read receipts work?

Message receipts track three distinct stages of delivery, each requiring a different trigger and storage path.

| Status    | Symbol      | Trigger                                  | Who updates it                        |
|-----------|-------------|------------------------------------------|---------------------------------------|
| Sent      | ✓           | Server has persisted the message         | Server ACK to sender                  |
| Delivered | ✓✓          | Recipient's device received the message  | Recipient's client → server → sender  |
| Read      | ✓✓ (blue)   | Recipient opened the conversation        | Recipient's client → server → sender  |

Implementation:

  1. Sent — after the chat server publishes the message to Kafka and receives confirmation, it sends an ACK to the sender's WebSocket. The client renders the single checkmark
  2. Delivered — when the message is pushed to the recipient's device, the recipient's client sends a DELIVERED event. The server updates the message status and pushes the double-checkmark to the sender
  3. Read — when the user opens the conversation, the client sends READ {last_message_id}. The server updates status for all messages up to that ID and pushes blue double-checkmarks to all senders in the conversation

Group read receipts:

A group message cannot have a single delivered or read flag — there are N recipients. Store receipts as separate rows in a message_receipts table keyed by (message_id, user_id). Show aggregate counts in the UI ("delivered to 47 of 100") rather than individual receipts, which dramatically simplifies both the data model and the UI.
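The per-member receipt model can be sketched as rows keyed by `(message_id, user_id)` with an aggregate query for the UI count (the row shape is an assumption based on the table described above):

```python
# Sketch: group receipts stored as one row per (message_id, user_id, status),
# aggregated into the "delivered to N of M" count the UI shows.
receipts = set()   # (message_id, user_id, status) rows; illustrative schema

def mark(message_id: int, user_id: str, status: str) -> None:
    receipts.add((message_id, user_id, status))

def delivered_count(message_id: int) -> int:
    return sum(1 for (mid, _, status) in receipts
               if mid == message_id and status == "delivered")

for user in ["bob", "carol"]:
    mark(1001, user, "delivered")
mark(1001, "bob", "read")          # a read receipt is a separate row
count = delivered_count(1001)
```

Aggregating server-side means the client never needs the full per-member list, which keeps the payload small for large groups.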


Which companies ask the chat system design question in interviews?

Meta, Google, Amazon, Microsoft, Apple, Uber, WhatsApp, and Slack all ask variants of this question for senior software engineer roles.

Why it is a consistently popular interview question:

  1. Covers breadth — it requires reasoning about real-time protocols, distributed databases, message ordering, delivery guarantees, fan-out, and failure handling in a single 45-minute session
  2. Scales to seniority — a junior answer describes "use a database and WebSockets"; a senior answer explains why Cassandra over MySQL, what at-least-once means and how to implement it, and how fan-out breaks at 10,000 members
  3. Directly maps to real products — every company on the list runs a real-time messaging product

What interviewers specifically listen for:

  1. Explicitly dismissing short polling — shows you understand the scale implications
  2. WebSocket stickiness — naming that chat servers hold stateful connections and explaining the routing consequence
  3. Cassandra over SQL — with the specific reasoning (LSM-tree writes, partition model, clustering order)
  4. Snowflake IDs — and why client timestamps can't be trusted for ordering
  5. Fan-out on write vs read — and the group size threshold where you'd switch strategies
  6. At-least-once + client dedup — rather than claiming exactly-once delivery

This is exactly the kind of question where practicing out loud makes a real difference. You can read about fan-out all day, but the first time you actually have to explain it under pressure, the gaps in your thinking become obvious fast. Mockingly.ai is built for that kind of deliberate practice — realistic system design interview simulations where you get feedback on your actual reasoning, not just the components you listed.


How to Answer This in a System Design Interview

When this question comes up in an interview, it's tempting to jump straight into technologies like message queues or distributed databases. A better approach is to walk the interviewer through your thinking step‑by‑step.

A typical structure could look like this:

  1. Clarify requirements

    • Is this 1:1 messaging or group chat?
    • Do we support media messages?
    • How large can groups be?
    • Are we optimizing for delivery speed or strict ordering?
  2. Start with a simple design

    Begin with the simplest version:

    Client → Chat Server → Database

    The server receives messages, stores them, and delivers them to the recipient.

  3. Introduce real‑time communication

    Polling works but is inefficient. Long polling improves things but still adds latency.
    For real chat systems, WebSockets are the typical solution because they allow persistent bidirectional connections.

  4. Add a message queue for scalability

    Once traffic grows, the chat server shouldn't directly deliver every message.
    A queue helps decouple message ingestion from message delivery.

    Examples include systems like Apache Kafka or other distributed queues.

  5. Scale storage

    Messages grow extremely fast. A horizontally scalable database is typically used.

    Examples:

    • Cassandra
    • DynamoDB
    • Bigtable

    The key requirement is horizontal scalability and efficient time‑based writes.

  6. Add presence and push notifications

    Presence can be implemented using heartbeats over WebSockets.

    For offline users, messages trigger push notifications through services such as:

    • Firebase Cloud Messaging (FCM)
    • Apple Push Notification Service (APNs)
  7. Discuss trade‑offs

    Finally, talk about:

    • message ordering
    • delivery guarantees
    • group fan‑out
    • scaling WebSocket connections

Interviewers usually care more about your reasoning process than the specific technologies you pick.


Message Ordering

One important detail that often comes up in interviews is ordering guarantees.

In practice, chat systems usually guarantee ordering per conversation, not globally.

For example:

  • Messages inside a single chat appear in order.
  • Messages across different chats do not require global ordering.

This keeps the system simpler and avoids unnecessary coordination between servers.


Multi‑Device Synchronization

Users often access chat applications from multiple devices at the same time (phone, laptop, tablet).

To support this, each device maintains its own WebSocket connection to the server.

The server tracks active sessions for a user:

User → Device A (WebSocket)
User → Device B (WebSocket)

When a message arrives, the server delivers it to all active sessions for that user.

If a device is offline, the message remains in storage and is delivered when the device reconnects.
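This delivery rule — persist first, then push to every active session — can be sketched with per-device inboxes standing in for WebSocket connections (all names are illustrative):

```python
# Sketch: multi-device delivery. Each device holds its own session; a
# message is persisted, then pushed to every connected device.
from collections import defaultdict

sessions = defaultdict(dict)   # user_id -> {device_id: inbox}
stored = defaultdict(list)     # user_id -> persisted messages for catch-up

def connect(user_id: str, device_id: str) -> None:
    sessions[user_id][device_id] = []   # stand-in for a WebSocket connection

def deliver(user_id: str, message: str) -> int:
    stored[user_id].append(message)     # persist first, so offline devices catch up
    for inbox in sessions[user_id].values():
        inbox.append(message)           # push to every active session
    return len(sessions[user_id])

connect("alice", "phone")
connect("alice", "laptop")
delivered_to = deliver("alice", "hi")
```

A device that reconnects later replays from `stored` starting at its last acknowledged message ID.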


Scaling WebSocket Connections

A large chat system may need to maintain millions of persistent connections.

This is usually handled using a layer of connection gateway servers.

Typical architecture:

Client → Load Balancer → WebSocket Gateway → Messaging Service

The gateway servers manage persistent connections while the messaging service focuses on processing messages.

Load balancers often use sticky sessions so a client stays connected to the same gateway.

This separation allows the system to scale connection handling independently from message processing.
