Why is building a scalable notification system complex?

Anand Sukumaran

Before starting Engagespot, I worked at a company that handled many application development projects, primarily helping startup founders turn their ideas into products. Most of these were in the B2B SaaS space. A multi-channel notification system was a common module that I had to work on in almost every product we built.

Notification is not a simple term, it's an experience

While we managed to build a basic module by connecting email and push notification APIs to deliver one-time passwords and critical alerts, we could never build the notification experience layer that our users wanted. End-users don’t care whether the product they use is built by a two-member team or a huge company; they expect the same seamless notification experience they get from a well-established app like LinkedIn or Instagram. This is the problem. When your engineering resources and time are limited and precious, the real question is whether you should try to reinvent the wheel or focus on the core features that no one else has built before.

Is building a notification system complex?

From my experience, building anything is not complex, provided we have the knowledge, experience, and capacity to build it. How complex it gets depends on the stage of the product. For an MVP with one notification channel (let’s say email) and only 2-3 email triggers, it’s quite simple to integrate an SMTP service or an email API like SendGrid. But once you get funded and start working seriously on your product for a large number of users, things feel different. I would classify the stages of your product development journey as follows:

1. MVP for early adopters and demos.
2. Production grade, with growing users and SLA commitments.
3. Scale, where you witness unexpected infrastructure challenges.

I’ve already covered the MVP stage in the above paragraph. Now let’s talk about the complexities in building the same for stages 2 and 3.

Building a notification system for production-grade and highly scalable applications

Building a production-grade application is very different from building an MVP. It needs attention to minute details, it needs to be fault-tolerant, and you need systems to manage any risk that might cause downtime. So the notification system that you built for your MVP becomes a bottleneck, not just in terms of performance but also in terms of features. It becomes outdated quickly.

Your app’s notification experience starts to degrade.

As you grow, users expect a good notification experience from your app. Apps like Instagram, Facebook, and LinkedIn have set your users’ expectations high. They can do that because they have several engineers working to improve the notification experience. Without that kind of investment, your application will probably lag behind popular apps in features such as the following.

Handling duplicated notifications across multiple channels.

Your application should never send a notification via all channels at once. If the user has already seen the notification and taken action via a channel, then duplicating it via other channels can be frustrating.
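One way to picture this is a channel fallback loop that stops as soon as the user has seen the notification. The following is a minimal sketch, not a definitive implementation: `Notification`, `deliver_with_fallback`, and the `send` and `was_seen` callbacks are hypothetical names, and a real system would wait for a configurable period between channels rather than checking immediately.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Notification:
    id: str
    seen: bool = False
    sent_via: List[str] = field(default_factory=list)


def deliver_with_fallback(
    notification: Notification,
    channels: List[str],
    send: Callable[[str, Notification], None],
    was_seen: Callable[[Notification], bool],
) -> List[str]:
    """Try channels in priority order and stop once the user has seen it."""
    for channel in channels:
        # In production there would be a delay here (for example a few
        # minutes) to give the user time to react on the previous channel.
        if was_seen(notification):
            break
        send(channel, notification)
        notification.sent_via.append(channel)
    return notification.sent_via
```

With channels ordered as in-app, push, email, a user who reads the push notification never receives the email.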

Allowing users to control which alerts they receive, and how.

If you don’t give users the ability to control the notifications they get from you, they will end up blocking all of your notifications at the operating system level. This means you will never get a chance to engage with the user. Applications like Facebook have set high expectations for how fine-grained a notification preference module should be. Your users will expect a similar experience from you too.
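At its simplest, a preference module is a lookup from notification category to the channels a user has enabled. The sketch below assumes a hypothetical schema (a dictionary of category to channel set) and a hypothetical default for categories the user has never configured; real preference stores are usually persisted per user and far more granular.

```python
from typing import Dict, List, Set

# Hypothetical schema: category -> channels the user has enabled.
Preferences = Dict[str, Set[str]]

# Assumed default for categories the user has not configured yet.
DEFAULT_CHANNELS: Set[str] = {"in_app", "email"}


def allowed_channels(prefs: Preferences, category: str, requested: List[str]) -> List[str]:
    """Return the requested channels that the user's preferences permit."""
    enabled = prefs.get(category, DEFAULT_CHANNELS)
    return [channel for channel in requested if channel in enabled]
```

An empty set for a category means the user has opted out of it entirely, which is a far better outcome than the user blocking the app at the operating system level.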

Notification burst control.

Let’s say you are building a social network and you send a notification whenever a user likes another user’s photo. If a photo goes viral, it could get hundreds of likes in a short period of time and generate hundreds of notifications, and irritated users may respond by blocking notifications from your application altogether.
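A common remedy is to collect events over a short window and send one summary per photo. The sketch below assumes the events for one window have already been gathered into a list; `batch_likes` and the message wording are illustrative, not taken from any particular product.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def batch_likes(events: List[Tuple[str, str]]) -> Dict[str, str]:
    """Collapse (photo_id, liker_name) events from one window into one message per photo."""
    likers: Dict[str, List[str]] = defaultdict(list)
    for photo_id, liker in events:
        likers[photo_id].append(liker)

    messages: Dict[str, str] = {}
    for photo_id, names in likers.items():
        if len(names) == 1:
            messages[photo_id] = f"{names[0]} liked your photo"
        else:
            others = len(names) - 1
            noun = "other" if others == 1 else "others"
            messages[photo_id] = f"{names[0]} and {others} {noun} liked your photo"
    return messages
```

Hundreds of likes in one window then produce a single notification instead of hundreds.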

Advanced notification management features

Your users will expect advanced features like snooze notifications, actionable alerts, marking notifications as read/unread, deleting all notifications, etc.

Multi-device sync

If your users use both your web and mobile app, your notification status should sync across all their devices. For example, if they have seen a notification from your web app, it should be instantly marked as read on all their other devices as well.

Optimizing the notification copy for a personalized experience.

Your product or marketing team will frequently ask to update the notification copy and experiment with it. If you hardcode the notification content, you have to go through a release cycle each time just to update a notification or email. As a result, your product team will accumulate a backlog of items that the engineering team needs to attend to. From my experience, no matter how large the engineering team is, they are always occupied with tasks related to their core product features, and areas like improving notifications get the least attention.
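The usual way out is to keep the copy in a template store that can be edited without a deploy. Here is a minimal sketch using Python's standard `string.Template`; the in-memory `TEMPLATES` dictionary and the `comment_added` identifier are hypothetical stand-ins for a database table or content management tool.

```python
from string import Template
from typing import Dict

# Hypothetical template store. In practice this would live outside the
# codebase so that product or marketing can edit it without a release.
TEMPLATES: Dict[str, Template] = {
    "comment_added": Template("$actor commented on $item"),
}


def render(template_id: str, **variables: str) -> str:
    """Fill the stored template with the variables supplied by the trigger."""
    return TEMPLATES[template_id].substitute(**variables)
```

The application code only passes a template identifier and variables, for example `render("comment_added", actor="Asha", item="your post")`, so changing the wording becomes a data change rather than a code change.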

You need a reliable infrastructure

In a production-grade app, everything becomes critical. Downtime can cost you money and reputation, and it undermines your reliability. Since the notification system is a critical module for any application, it becomes a ticking time bomb as you start to grow, because it is a potential single point of failure that might break at any time without you knowing. At this stage, your engineering team’s priority should be to cut down as many risks as possible by offloading functionality to reliable third-party services. Many things can go wrong in a production-grade app that you cannot afford if you want to stay reliable. A few of them are listed below.

A notification channel provider API breaks due to an unexpected response.

This happened to me several times. For some reason (billing-related or otherwise), a service such as SendGrid would stop accepting notifications from our server, and unless we had a proper logging mechanism built in, we would not know until one of our users complained. For mission-critical notifications such as one-time passwords, this could even put your business at stake. So it’s really important to have proper logging and monitoring mechanisms in place to track responses from your API providers. This overhead grows beyond control when you use multiple APIs like SendGrid, Twilio, FCM, etc.
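The basic idea is to wrap every provider call so that a rejected request is recorded instead of silently lost. The sketch below is an assumption-laden illustration: `send_and_record` is a hypothetical helper, the provider call is modeled as a function returning an HTTP status code, and the `failures` list stands in for whatever store or alerting system you would use.

```python
import logging
from typing import Callable, List, Optional, Tuple

logger = logging.getLogger("notifications")


def send_and_record(
    provider: str,
    call: Callable[[], int],
    failures: List[Tuple[str, Optional[int]]],
) -> bool:
    """Run a provider call that returns an HTTP status code and record any failure."""
    try:
        status = call()
    except Exception as exc:
        # Network errors and client exceptions are failures too.
        logger.error("provider=%s raised %r", provider, exc)
        failures.append((provider, None))
        return False

    if not 200 <= status < 300:
        logger.error("provider=%s returned status %s", provider, status)
        failures.append((provider, status))
        return False

    logger.info("provider=%s accepted notification", provider)
    return True
```

Once failures are recorded in one place, alerting on a spike becomes possible regardless of how many providers are involved.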

Each provider has its own API rate limits, and you need a system to stay within them.

The providers you use for each channel might have their own API rate limits. As your application grows, it will trigger a lot of notifications, but since your provider APIs cap the number of requests they accept, you need rate limiters in place to avoid getting 429 Too Many Requests errors.
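A token bucket is one widely used way to stay under such limits. The following is a single-process sketch with hypothetical names; the `rate` and `capacity` values would come from each provider's documented limits, and a multi-server deployment would need the bucket state in a shared store.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `capacity` calls in a burst, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int, clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; the caller should delay or requeue on False."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(float(self.capacity), self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A delivery worker would call `try_acquire()` before each provider request and put the notification back on the queue when it returns `False`.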

Retrying a failed notification.

If the provider API is down for some reason, you need to keep the notification in a queue and have a retry mechanism so that it is not lost. For a production system, this means setting up dedicated Redis instances or message brokers with proper persistence and backup mechanisms.
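The retry part can be sketched as exponential backoff around the send call. This is an in-process illustration only: `send_with_retry` and its default values are assumptions, and it does not replace the persistent queue described above, since anything held in memory is lost if the process crashes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def send_with_retry(
    send: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `send`, waiting base_delay, 2x, 4x, ... between failed attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: let the caller move it to a dead-letter store.
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")
```

In practice you would retry only on errors that are likely to be temporary (timeouts, 5xx, 429) and add random jitter to the delay so that many workers do not retry at the same instant.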

Decoupling notification triggers.

If you connect your application logic directly to a notification delivery API call, it can affect the performance of your application by increasing the latency of API calls. So, usually, you will need a message queue to process notifications. Again, in a production environment, this needs more setup to handle risks and failovers.
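The shape of that decoupling can be shown with Python's standard `queue` and `threading` modules. This is a sketch under stated assumptions: the in-memory queue stands in for a real message broker, `handle_request` is a hypothetical request handler, and appending to `delivered` stands in for the slow provider API call.

```python
import queue
import threading
from typing import Any, Dict, List

notification_queue: "queue.Queue[Any]" = queue.Queue()
delivered: List[Dict[str, str]] = []


def worker() -> None:
    """Consume notifications in the background until a None sentinel arrives."""
    while True:
        item = notification_queue.get()
        if item is None:
            notification_queue.task_done()
            break
        delivered.append(item)  # stand-in for the slow provider API call
        notification_queue.task_done()


def handle_request(user_id: str) -> str:
    """Request handler: enqueue and return immediately instead of calling the provider."""
    notification_queue.put({"user_id": user_id, "type": "welcome"})
    return "ok"
```

The handler's latency now depends only on the enqueue operation, while one or more workers (started with `threading.Thread(target=worker).start()` in this sketch) deliver at their own pace.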

Finally, with scale, everything seems to fall apart

Let’s say you passed the MVP and growth stages and your product became really successful. Now you start fighting problems that you never faced in the previous stages, because “scale” is a different dimension, and tackling it needs expert knowledge and a sophisticated distributed software architecture. Let’s talk about a few problems you’ll face.

Increasing data size

As more and more users start using the notification component, your database will grow severalfold, into millions of rows, and even a simple SELECT query can take seconds to finish. Your indexes will grow to the point where they no longer fit in memory as they did before. All of this degrades your “real-time delivery latency”.

Logging will not scale

In the early stages, a simple logs table might be enough to save the notification statuses and logs, but again when you send millions of notifications every month, the logging table will grow in the order of hundreds of millions of rows, and even partitioning becomes unmanageable.

Message throughput becomes a bottleneck

If your application has several notification triggers, the number of notifications generated per second will grow rapidly as the number of users increases. If you’re unable to process the queued notifications at an acceptable throughput, the notification experience will suffer and the performance of your message queues will degrade.

Fault tolerance and recovery

You might have several notification delivery workers running simultaneously and if a system crashes, you should have the proper implementation to make sure none of the notifications is dropped. Without implementing mechanisms such as graceful shutdown, automated health checking, and auto-scaling, you will never be able to satisfy the SLA commitments with your enterprise customers.
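Graceful shutdown, for instance, means a worker finishes the jobs it already has before exiting. The sketch below assumes a hypothetical `run_worker` loop driven by a `threading.Event`; in a real deployment the event would typically be set from a SIGTERM handler, and the queue would be a durable broker rather than an in-memory one.

```python
import queue
import threading
from typing import Any, Callable


def run_worker(
    jobs: "queue.Queue[Any]",
    stop: threading.Event,
    deliver: Callable[[Any], None],
) -> None:
    """Deliver jobs until shutdown is requested AND the queue has been drained."""
    while not (stop.is_set() and jobs.empty()):
        try:
            # Short timeout so the loop can notice the stop flag regularly.
            job = jobs.get(timeout=0.1)
        except queue.Empty:
            continue
        deliver(job)
        jobs.task_done()
```

Even when shutdown is requested while jobs are still queued, the loop keeps delivering until the queue is empty, so a routine deploy or scale-down does not drop notifications.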

Simply, it’s not worth the effort!

As you have seen, a problem that was trivial in the beginning became a deal breaker as your product scaled. We’re engineers, and we can build anything given time, money, and resources. The question is whether you should build it. You should weigh the value of building such a system in-house, spending several hundred thousand dollars, against paying a small subscription fee to a notification infrastructure provider and letting them handle the responsibility. Moreover, their team’s primary job is to tackle such problems even before you face them. Don’t waste time reinventing the wheel. There are a few developer-focused notification infrastructure players like Engagespot, Courier, Knock, etc. Choose Engagespot and continuously improve the notification experience of your application without breaking a sweat.