Types of alarms - what's beyond min-max checks?
If you have ever maintained a live system, I’m sure you have used min-max alarms. However, are they always the best tool for the job?
Alarms should be the first line of defense against issues in production. Discovering that something isn’t working as expected shouldn’t be a manual process, and it shouldn’t depend on a developer digging through the logs. The alarm should notify you about a problem - page your mobile phone for serious issues, otherwise just leave a trace in your ticketing system for further investigation.
Simplifying, a monitoring system has two main components - metrics and alarms. Metrics are the data sent by producers (hosts, applications, infrastructure etc.), and alarms are the components that watch those values. Let’s focus on the alarms.
Min-max (range alarm)
Assume that you want to monitor CPU utilization on your hosts. The natural choice here is a min-max alarm. CPU utilization that is too high may indicate problems with scaling, bugs etc. CPU utilization that is too low usually isn’t good either - perhaps it’s a waste of resources, perhaps your traffic dropped too much.
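A range check like this can be sketched in a few lines of Python. This is only an illustration: the function name `range_alarm` and the 10%/90% thresholds are assumptions made for the example, not values taken from any particular monitoring system.

```python
def range_alarm(value, low, high):
    """Return True when the value falls outside the [low, high] band."""
    return value < low or value > high


# Hypothetical CPU utilization samples, in percent.
cpu_samples = [35.0, 42.5, 97.0, 3.0]

# Assumed thresholds: below 10% looks like wasted capacity or lost traffic,
# above 90% looks like a scaling problem or a bug.
cpu_flags = [range_alarm(v, low=10.0, high=90.0) for v in cpu_samples]
```

Both the very high and the very low sample end up in the alarm state, which is the behavior described above.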
Derivative alarm (rate of change alarm)
Now, imagine you are a website owner and you want to monitor the number of user registrations on your website. Usually, there is some kind of email validation in place - it can be dead simple like a regex, or more complex like an artificial intelligence checking if the emails are legit. Would you use a min-max alarm to monitor the number of invalid registrations on your website? What threshold would you set? Remember that there are usually fewer registrations at night than during the daytime. You could split the day and night periods and come up with a separate min-max alarm for each. However, the main disadvantage is that those alarms will become obsolete as, hopefully, your website grows and the number of registrations naturally increases.
In that case, you can monitor the rate of change - because the problem is when the metric changes rapidly. Mathematically, the rate of change is the derivative of a function, denoted as f’(x).
The alarm rule looks like this: alarm if value > 3*f’(x) + 5.
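For sampled metrics, the derivative is usually approximated by the difference between consecutive data points divided by the time between them. The sketch below is a simplified interpretation rather than the exact rule above: it alarms when the absolute rate of change exceeds a limit. The names `rate_of_change` and `derivative_alarm`, the sample data, and the `max_rate` value are all assumptions made for illustration.

```python
def rate_of_change(samples):
    """Discrete derivative for consecutive (timestamp, value) samples."""
    return [
        (v1 - v0) / (t1 - t0)
        for (t0, v0), (t1, v1) in zip(samples, samples[1:])
    ]


def derivative_alarm(samples, max_rate):
    """Flag every interval where the metric changes faster than max_rate."""
    return [abs(rate) > max_rate for rate in rate_of_change(samples)]


# Hypothetical data: (minute, invalid registrations seen in that minute).
invalid_registrations = [(0, 10), (1, 12), (2, 11), (3, 40), (4, 42)]

rates = rate_of_change(invalid_registrations)
derivative_flags = derivative_alarm(invalid_registrations, max_rate=10.0)
```

Only the jump from 11 to 40 is flagged. The later value of 42 is high in absolute terms but changes slowly, so a check like this stays quiet as the baseline grows.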
Another solution for the described use case would be to set up the alarm on a mathematical formula - the percentage of invalid registrations (invalidNumberOfRegistrations/totalNumberOfRegistrations).
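A minimal sketch of such a formula-based alarm follows. The helper names, the 20% limit, and the sample counts are assumptions for the example. The handling of a zero total is also a design choice made here, not something the formula itself dictates.

```python
def invalid_ratio(invalid, total):
    """Share of invalid registrations; 0.0 when there were no registrations."""
    if total == 0:
        return 0.0
    return invalid / total


def ratio_alarm(invalid, total, max_ratio):
    """Alarm when the share of invalid registrations exceeds max_ratio."""
    return invalid_ratio(invalid, total) > max_ratio


# Hypothetical periods: (invalid registrations, total registrations).
periods = {"night": (5, 50), "day": (50, 500), "incident": (30, 60)}

ratio_flags = {
    name: ratio_alarm(invalid, total, max_ratio=0.2)
    for name, (invalid, total) in periods.items()
}
```

Night and day traffic differ by a factor of ten, yet both have the same 10% ratio, so one threshold covers both. Only the period in which half of the registrations are invalid triggers the alarm.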
Standard deviation alarm (σ alarm)
Sometimes (usually…?) production runs on many servers. In that scenario the standard deviation alarm comes in handy. It works on an additional “dimension” - it doesn’t process one metric, instead it processes many instances of the same metric from many servers. It checks if the values are similar to each other - mathematically, it calculates the standard deviation. The most useful case for it is when you compare a 1-Box environment to the production environment. I wrote about that before: Rock solid pipeline - how to deploy to production comfortably? For example, you can monitor the number of inodes used on your hosts (because you can run out of space… without running out of space).
Obviously, the alarming state is when one of the hosts in your fleet uses more inodes than the rest of the fleet “normally” does, for example by more than 2*σ+5000 inodes.
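One way to read the 2*σ+5000 rule is "alarm on any host whose value exceeds the fleet mean by more than 2σ plus a fixed margin of 5000". The sketch below assumes that reading. The function name `sigma_alarm`, the host names, and the inode counts are hypothetical, and the population standard deviation (`statistics.pstdev`) is an assumed choice.

```python
from statistics import mean, pstdev


def sigma_alarm(values_by_host, k=2.0, margin=5000):
    """Return hosts whose value exceeds fleet mean + k * sigma + margin."""
    values = list(values_by_host.values())
    threshold = mean(values) + k * pstdev(values) + margin
    return [host for host, value in values_by_host.items() if value > threshold]


# Hypothetical fleet: nine hosts use about the same number of inodes,
# one host uses twice as many.
inodes = {f"host-{i}": 100_000 for i in range(9)}
inodes["host-9"] = 200_000

outliers = sigma_alarm(inodes)
```

Keep in mind that the outlier itself inflates both the mean and σ, so in a very small fleet a single bad host may never cross a 2σ threshold. The fixed margin helps avoid alarms when all hosts are nearly identical and σ is close to zero.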
Generally, whenever there is a need to compare an experiment (new deployment/new feature/new algorithm) to an existing one, σ alarms seem to be helpful.
So what metrics can you monitor?
…all of them! Stay tuned for the next post!
Operational Excellence series
- Intro: What is Software Operational Excellence?
- Deploying: Rock solid pipeline - how to deploy to production comfortably?
- Monitoring&Alarming: Types of alarms - what’s beyond min-max checks?
- Monitoring: What service metrics should be monitored?
- Scaling: (Auto) scaling services by CPU? You are doing it wrong
- Scaling: How do you know the right maximum connections?
- Scaling: How to estimate host fleet size? Why keeping CPU at 30% might NOT be waste of money?
Please note: the views I express are mine alone and they do not necessarily reflect the views of Amazon.com.