Chat System

A chat app performs different functions for different people, so it is extremely important to nail down the exact feature requirements. For example, you do not want to design a system that focuses on group chat when the interviewer has one-on-one chat in mind.

It is vital to agree on the type of chat app to design. In the marketplace, there are one-on-one chat apps like Facebook Messenger, WeChat, and WhatsApp; office chat apps like Slack that focus on group chat; and game chat apps like Discord that focus on large group interaction and low voice chat latency.

The first set of clarification questions should nail down what the interviewer has in mind exactly when she asks you to design a chat system. At the very least, figure out if you should focus on a one-on-one chat or group chat app.

Questions to ask for exact scope

What kind of chat app shall we design? 1 on 1 or group based? – It should support both 1 on 1 and group chat.

Is this a mobile app? Or a web app? Or both? – both

What is the scale of this app? A startup app or massive scale? – It should support 50 million daily active users (DAU).

For group chat, what is the group member limit? – A maximum of 100 people

What features are important for the chat app? Can it support attachments? – 1 on 1 chat, group chat, online indicator. The system only supports text messages.

Is there a message size limit? – Yes, text length should be less than 100,000 characters.

Is end-to-end encryption required? – Not required for now but we will discuss that if time allows.

How long shall we store the chat history? – forever

These are the requirements based on the questions above:

  • A one-on-one chat with low delivery latency
  • Small group chat (max of 100 people)
  • Online presence
  • Multiple device support. The same account can be logged in to multiple devices at the same time.
  • Push notifications

Clients do not communicate directly with each other. Instead, each client connects to a chat service, which supports all the features mentioned above. Let us focus on fundamental operations. The chat service must support the following functions:

  • Receive messages from other clients.
  • Find the right recipients for each message and relay the message to the recipients.
  • If a recipient is not online, hold the messages for that recipient on the server until she is online.
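
The three functions above can be sketched as a toy in-memory relay. This is only an illustration: the class name ChatService, its methods, and the use of plain lists as stand-ins for client connections are assumptions made for the example, not part of the design described here.

```python
from collections import defaultdict, deque


class ChatService:
    """Toy in-memory relay: receive, route, and hold messages for offline users."""

    def __init__(self):
        self.online = {}                    # user_id -> list standing in for a connection
        self.pending = defaultdict(deque)   # user_id -> messages held while offline

    def connect(self, user_id):
        """Mark a user online and flush anything that was held for them."""
        inbox = []
        self.online[user_id] = inbox
        while self.pending[user_id]:
            inbox.append(self.pending[user_id].popleft())
        return inbox

    def disconnect(self, user_id):
        self.online.pop(user_id, None)

    def send(self, sender, recipients, text):
        """Relay a message to each recipient, or hold it if they are offline."""
        message = {"from": sender, "text": text}
        for user_id in recipients:
            if user_id in self.online:
                self.online[user_id].append(message)
            else:
                self.pending[user_id].append(message)


service = ChatService()
alice_inbox = service.connect("alice")
service.send("alice", ["bob"], "hi bob")      # bob is offline, so the message is held
bob_inbox = service.connect("bob")            # bob comes online and receives it
service.send("bob", ["alice"], "hi alice")    # alice is online, delivered immediately
```

A real service would persist held messages rather than keep them in process memory, since a restart would otherwise lose them.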

When a client intends to start a chat, it connects to the chat service using one or more network protocols. For a chat service, the choice of network protocols is important.

Requests are initiated by the client for most client/server applications. This is also true for the sender side of a chat application. When the sender sends a message to the receiver via the chat service, it uses the time-tested HTTP protocol, which is the most common web protocol. In this scenario, the client opens an HTTP connection with the chat service and sends the message, informing the service to send the message to the receiver. However, the receiver side is a bit more complicated. Since HTTP is client-initiated, it is not trivial to send messages from the server. Over the years, many techniques have been used to simulate a server-initiated connection: polling, long polling, and WebSocket.

Polling – polling is a technique in which the client periodically asks the server if there are messages available. Depending on the polling frequency, polling could be costly. It could consume precious server resources to answer a question whose answer is “no” most of the time.
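
To make the cost concrete, here is a minimal simulation of a polling client. The MessageServer class, the tick-based loop, and the arrival schedule are all hypothetical and exist only for this example; no real network is involved.

```python
class MessageServer:
    """Stand-in for a chat server that answers poll requests."""

    def __init__(self):
        self.inbox = []
        self.requests_served = 0

    def poll(self):
        """Answer one poll: return and clear whatever is waiting."""
        self.requests_served += 1
        messages, self.inbox = self.inbox, []
        return messages


def run_polling_client(server, ticks, arrivals):
    """Poll once per tick; `arrivals` maps a tick number to a new message."""
    received, empty_polls = [], 0
    for tick in range(ticks):
        if tick in arrivals:
            server.inbox.append(arrivals[tick])
        messages = server.poll()
        if messages:
            received.extend(messages)
        else:
            empty_polls += 1
    return received, empty_polls


server = MessageServer()
received, empty_polls = run_polling_client(
    server, ticks=10, arrivals={3: "hello", 7: "bye"}
)
print(received, empty_polls, server.requests_served)
```

In this run the server handles ten requests to deliver two messages, so eight of the ten polls come back empty.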

Long Polling – in long polling, a client holds the connection open until there are actually new messages available or a timeout threshold has been reached. Once the client receives new messages, it immediately sends another request to the server, restarting the process. Long polling has a few drawbacks:

  • Sender and receiver may not connect to the same chat server. HTTP based servers are usually stateless. If you use round robin for load balancing, the server that receives the message might not have a long-polling connection with the client who receives the message.
  • A server has no good way to tell if a client is disconnected.
  • It is inefficient. If a user does not chat much, long polling still makes periodic connections after timeouts.
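
The long-polling loop described above can be sketched with a blocking queue standing in for the held HTTP request. LongPollServer, client_loop, and the 0.05 second timeout are assumptions chosen for the illustration.

```python
import queue
import threading


class LongPollServer:
    """Holds each request open until a message arrives or the timeout expires."""

    def __init__(self):
        self._messages = queue.Queue()

    def publish(self, message):
        self._messages.put(message)

    def long_poll(self, timeout):
        """Block until a message is available or `timeout` seconds pass."""
        try:
            return self._messages.get(timeout=timeout)
        except queue.Empty:
            return None   # timed out; the client is expected to ask again


def client_loop(server, wanted, timeout, received):
    """Re-issue the long poll immediately after each response or timeout."""
    while len(received) < wanted:
        message = server.long_poll(timeout)
        if message is not None:
            received.append(message)


server = LongPollServer()
received = []
client = threading.Thread(target=client_loop, args=(server, 2, 0.05, received))
client.start()
server.publish("hello")
server.publish("bye")
client.join(timeout=2)
```

The sketch also shows the inefficiency noted in the list: when nothing is published, the client keeps cycling through timeouts and new requests.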

WebSocket – WebSocket is the most common solution for sending asynchronous updates from server to client. A WebSocket connection is initiated by the client. It is bi-directional and persistent. It starts its life as an HTTP connection and can be “upgraded” via a well-defined handshake to a WebSocket connection. Through this persistent connection, a server can send updates to a client.

  • WebSocket connections generally work even if a firewall is in place. This is because they use port 80 or 443, which are also used by HTTP/HTTPS connections.
  • Earlier we said that on the sender side HTTP is a fine protocol to use, but since WebSocket is bidirectional, there is no strong technical reason not to use it also for sending.
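
As an illustration of the upgrade handshake, the sketch below computes the Sec-WebSocket-Accept value a server returns, following the rule in RFC 6455: append a fixed GUID to the client's Sec-WebSocket-Key, take the SHA-1 hash, and base64-encode it. The function names and the lower-cased header dictionary are assumptions made for this example; a production server would rely on a WebSocket library instead of hand-building responses.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def websocket_accept(sec_websocket_key: str) -> str:
    """Derive the Sec-WebSocket-Accept header value from the client's key."""
    digest = hashlib.sha1((sec_websocket_key + WEBSOCKET_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def upgrade_response(request_headers: dict) -> str:
    """Build the 101 response for an upgrade request (header names assumed lower-cased)."""
    if request_headers.get("upgrade", "").lower() != "websocket":
        raise ValueError("not a WebSocket upgrade request")
    accept = websocket_accept(request_headers["sec-websocket-key"])
    return (
        "HTTP/1.1 101 Switching Protocols\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Accept: {accept}\r\n\r\n"
    )


# Sample key taken from the example in RFC 6455.
response = upgrade_response(
    {"upgrade": "websocket", "sec-websocket-key": "dGhlIHNhbXBsZSBub25jZQ=="}
)
```

After the client receives the 101 response, the same TCP connection carries WebSocket frames in both directions.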


No technologist would design a system of this scale on a single server. A single-server design is a deal breaker for many reasons, and the single point of failure is the biggest among them. The work is therefore split across several kinds of servers, including a dedicated presence server.

Here the client maintains a persistent WebSocket connection to a chat server for real-time messaging.

  • Chat servers facilitate message sending/receiving.
  • Presence servers manage online/offline status.
  • API servers handle everything else, including user login, signup, profile changes, etc.
  • Notification servers send push notifications.
  • Finally, the key-value store is used to store chat history. When an offline user comes online, she will see all her previous chat history.


Selecting the correct storage system that supports all of our use cases is crucial. We recommend key-value stores for the following reasons:

  • Key-value stores allow easy horizontal scaling.
  • Key-value stores provide very low latency to access data.
  • Relational databases do not handle the long tail of data well. When the indexes grow large, random access is expensive.
  • Key-value stores are adopted by other proven, reliable chat applications. For example, both Facebook Messenger and Discord use key-value stores. Facebook Messenger uses HBase, and Discord uses Cassandra.


One-on-one chat flow

  1. User A sends a chat message to Chat server 1.
  2. Chat server 1 obtains a message ID from the ID generator.
  3. Chat server 1 sends the message to the message sync queue.
  4. The message is stored in a key-value store.
  5. If User B is online, the message is forwarded to Chat server 2, where User B is connected.
  6. If User B is offline, a push notification is sent from push notification (PN) servers.
  7. Chat server 2 forwards the message to User B. There is a persistent WebSocket connection between User B and Chat server 2.
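
The seven steps above can be traced in a small in-memory simulation. Everything here is a stand-in chosen for the example: the ChatServer and Backend classes, a simple counter as the ID generator, a deque as the message sync queue, a dict as the key-value store, and lists in place of WebSocket connections and push notification servers.

```python
import itertools
from collections import deque


class Backend:
    """Shared pieces behind the chat servers (all in-memory stand-ins)."""

    def __init__(self):
        self.id_generator = itertools.count(1)   # stand-in for the ID generator
        self.sync_queue = deque()                # message sync queue
        self.kv_store = {}                       # message_id -> message
        self.user_to_server = {}                 # online users only
        self.push_notifications = []             # stand-in for the PN servers

    def process_queue(self):
        while self.sync_queue:
            message = self.sync_queue.popleft()
            self.kv_store[message["id"]] = message              # step 4
            server = self.user_to_server.get(message["to"])
            if server is not None:
                server.forward(message)                         # steps 5 and 7
            else:
                self.push_notifications.append(message["to"])   # step 6


class ChatServer:
    """Holds the simulated WebSocket connections of the users attached to it."""

    def __init__(self, name, backend):
        self.name = name
        self.backend = backend
        self.connections = {}                    # user_id -> list standing in for a socket

    def connect(self, user_id):
        self.connections[user_id] = []
        self.backend.user_to_server[user_id] = self
        return self.connections[user_id]

    def receive(self, sender, recipient, text):
        # Steps 1-3: accept the message, obtain an ID, enqueue it.
        message_id = next(self.backend.id_generator)
        message = {"id": message_id, "from": sender, "to": recipient, "text": text}
        self.backend.sync_queue.append(message)
        return message_id

    def forward(self, message):
        # Step 7: push over the recipient's persistent connection.
        self.connections[message["to"]].append(message)


backend = Backend()
server1, server2 = ChatServer("chat-1", backend), ChatServer("chat-2", backend)
server1.connect("A")
b_socket = server2.connect("B")

server1.receive("A", "B", "hello B")   # B is online on chat-2
server1.receive("A", "C", "hello C")   # C is offline
backend.process_queue()
```

Both messages end up in the key-value store; only the one for User B is delivered over a connection, while User C's message triggers a push notification instead.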

Message synchronization across multiple devices

Each device maintains a variable called cur_max_message_id, which keeps track of the latest message ID on the device. Messages that satisfy the following two conditions are considered new messages:

  • The recipient ID is equal to the currently logged-in user ID.
  • Message ID in the key-value store is larger than cur_max_message_id.

Because each device keeps its own cur_max_message_id, message synchronization is easy: each device can fetch its new messages from the KV store independently.
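
A minimal sketch of this rule follows, assuming message IDs are positive integers that increase over time. The Device class, the dict used as the KV store, and the message field names ("id", "to", "text") are hypothetical.

```python
class Device:
    """One logged-in device with its own view of the latest message ID."""

    def __init__(self, user_id):
        self.user_id = user_id
        self.cur_max_message_id = 0   # assumes real message IDs are greater than 0
        self.messages = []

    def sync(self, kv_store):
        """Fetch messages addressed to this user that are newer than the local maximum."""
        new_messages = sorted(
            (
                m for m in kv_store.values()
                if m["to"] == self.user_id and m["id"] > self.cur_max_message_id
            ),
            key=lambda m: m["id"],
        )
        if new_messages:
            self.messages.extend(new_messages)
            self.cur_max_message_id = new_messages[-1]["id"]
        return [m["id"] for m in new_messages]


kv_store = {
    1: {"id": 1, "to": "bob", "text": "hi"},
    2: {"id": 2, "to": "carol", "text": "not for bob"},
    3: {"id": 3, "to": "bob", "text": "lunch?"},
}
phone, laptop = Device("bob"), Device("bob")

phone_first = phone.sync(kv_store)      # phone fetches messages 1 and 3
kv_store[4] = {"id": 4, "to": "bob", "text": "ok"}
phone_second = phone.sync(kv_store)     # phone only needs message 4
laptop_first = laptop.sync(kv_store)    # laptop was behind and fetches 1, 3, and 4
```

Scanning every value is only acceptable in a toy; a real key-value store would serve this as a range query on message ID.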

Group chat
