Specifying requirements for a live notification mechanism for systems integration
Recently I designed a mechanism to notify external systems (with which we cooperate) about changes in our system. This can obviously be done in multiple ways. Let’s look at some high-level considerations, some questions, and how they affect our requirements.
- we want to notify other, external systems, owned by someone else
- the allowed delay between a change in our system and the notification being made is around one minute
- a change can carry multiple pieces of information, which vary by the type of change
- we expose an API which is currently used by those external systems - they fetch the changes periodically
- the number of changes per second in our system is spiky in nature (assume 50-5000 notifications/second for now)
- external systems will subscribe themselves to notifications
- How to notify external systems?
- What information should we pass? When is the notification delivered?
- How long should we wait for the response?
- When should we retry?
- There are multiple external systems, built with multiple different technologies. The most popular and basic method of integration is simply making HTTP(S) calls. Should it be GET, POST or something else? Let’s consider the two most popular: GET and POST.
- We have to pass multiple values, depending on the notification type. A typical amount of information is: one string (300 chars), 5 dates and 5 integers, so both GET (which allows ~2k characters in the URL on nearly all browsers and servers) and POST are viable. However, GET is very straightforward and simple: no issues with encoding, accepting compression or even reading the body stream. What is more, GET puts less pressure on your servers, as you do not have to send a body stream. Unfortunately, the GET query string is also visible to (nearly) everyone, so only non-sensitive information can be passed. What about concurrent notifications? How could one achieve an “exactly-once” delivery model? Here we can make good use of one of our assumptions. Because we already expose an API, we can require external systems to fetch the full data through that API after we notify them. The notification itself can then be delivered in an “at-least-once” model and carry only non-sensitive, idempotent information about the change, which the receiver uses to fetch the full, sensitive data from our API. One can even imagine an optimization: keep notifications to send in a buffer and drop duplicates within a small time bucket.
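The buffer-and-dedup optimization above can be sketched roughly like this. This is a minimal illustration, not the actual implementation: the names (`NotificationBuffer`, `notification_url`), the bucket length and the query parameters are all assumptions. The GET URL carries only non-sensitive, idempotent identifiers, matching the design described above.

```python
import time
from urllib.parse import urlencode

class NotificationBuffer:
    """Hypothetical sketch: buffers outgoing notifications and drops
    duplicates that arrive within a small time bucket, before they are
    sent out in an at-least-once fashion."""

    def __init__(self, bucket_seconds=5, clock=time.monotonic):
        self.bucket_seconds = bucket_seconds
        self.clock = clock                # injectable for testing
        self._seen = {}                   # (change_type, change_id) -> last enqueue time

    def enqueue(self, change_type, change_id):
        """Return True if the notification should be sent,
        False if it is a duplicate within the current time bucket."""
        key = (change_type, change_id)
        now = self.clock()
        last = self._seen.get(key)
        if last is not None and now - last < self.bucket_seconds:
            return False                  # duplicate inside the bucket - skip it
        self._seen[key] = now
        return True

def notification_url(base_url, change_type, change_id):
    """Build a GET URL carrying only non-sensitive, idempotent data;
    the receiver then fetches the full payload from our API."""
    return f"{base_url}?{urlencode({'type': change_type, 'id': change_id})}"
```

Because the payload is idempotent, sending the same URL twice is harmless; the dedup bucket only trims obvious bursts of identical changes.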
- The obvious thing is that the longer we wait for responses, the more resources are used. However, there is one more important aspect: by specifying the request timeout, we can influence how the architecture of the external system will look. Saying “you have 30 seconds to process the notification” is like saying “you have plenty of time to receive our notification, process it, synchronously ask our API and then send us an HTTP 200 status code”. Compare that with “you have 3 seconds to store the notification for later, or process it asynchronously”. The implications are clear: short timeout = fewer required resources + better integration.
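A short timeout can be enforced on the sender side with a few lines. The sketch below is an assumption, not the real code: `transport` stands in for whatever HTTP client wrapper is used, and it is expected to honour the timeout and raise `TimeoutError` when it is exceeded.

```python
# Short on purpose: the subscriber should store-and-ack,
# not process the notification synchronously.
NOTIFY_TIMEOUT_SECONDS = 3

def notify(transport, url, timeout=NOTIFY_TIMEOUT_SECONDS):
    """Send one notification via the injected `transport` callable
    (hypothetical; returns an HTTP status code). Success means the
    subscriber answered HTTP 200 within the timeout."""
    try:
        status = transport(url, timeout=timeout)
    except TimeoutError:
        return False  # slow subscriber - a candidate for a later retry
    return status == 200
```

Injecting the transport keeps the timeout policy testable without a real network; in production it would wrap an actual HTTP call.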
- We want to be sure that the notification reaches the external system, and thanks to the design described in the second point we can use an “at-least-once” delivery model. I see two options now: a) hit the specified URL 3 times (for example), don’t wait for the answer and never send this notification again; b) hit the specified URL and retry in X minutes if the HTTP status code was different than 200. The first option is very simple to implement, but it assumes that external systems will develop a mechanism to avoid processing the same notification multiple times - which will likely end in them hitting our API three times for every single notification.
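Option b) can be sketched as a simple retry loop. Again this is an illustration under stated assumptions: `send` is a hypothetical injected callable returning an HTTP status code, and the attempt count and delay are placeholders for whatever policy is chosen.

```python
import time

def notify_with_retries(send, url, max_attempts=5,
                        retry_delay_minutes=5, sleep=time.sleep):
    """Option b) sketch: call the subscriber's URL and retry after a
    delay whenever the response is not HTTP 200 (at-least-once).
    Returns the attempt number on success, or None after giving up."""
    for attempt in range(1, max_attempts + 1):
        if send(url) == 200:
            return attempt            # delivered
        if attempt < max_attempts:
            sleep(retry_delay_minutes * 60)  # wait before the next try
    return None                       # undelivered after all attempts
```

In a real system the "sleep" would be a scheduled re-enqueue rather than a blocking wait, so a slow subscriber does not tie up a worker.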