Monitoring services is crucial if you care about application uptime. There are hundreds, if not thousands, of parameters you can (and should) monitor, related to CPU, network, hosts, the application, and so on. What are they? Which choices are non-obvious?

Record as much as you can

First of all, a good rule of thumb is to record as many metrics as possible. It is usually cheap, and there is no reason not to do so if you care about service uptime. Do not set alarms on all of them; just have them available in case of an emergency. Having them is priceless when mitigating problems or finding their root cause. When you do set an alarm on some of them, consider the alarm type: Types of alarms - what’s beyond min-max checks?

Useful list of useful metrics

Here is my reference list. It’s not exhaustive; it’s rather a good starting point. Pick whatever you think is useful. If you use a cloud provider (AWS, Azure, etc.), you are probably already monitoring some of them by default.

Non-obvious choices

  • inodes - because you can run out of space… without running out of space: a filesystem with no free inodes cannot create new files even when free bytes remain
  • memory - are you sure you want to monitor MemFree? MemAvailable is probably a better choice. Is it enough? No, because it doesn’t tell you how much memory is available to the JVM/IIS.
  • number of running threads and the sizes of the different thread pools. When all threads in a thread pool are busy, the logs will only show that the application did nothing, with no information about why.
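
To make the first two items concrete, here is a minimal sketch for a Linux host. It reads inode usage with `os.statvfs` and parses `MemAvailable` out of `/proc/meminfo`-style text. The function names and the sample text are invented for illustration; most monitoring agents already collect both values, so treat this as a way to see where the numbers come from rather than as something to deploy.

```python
import os


def inode_usage_percent(path="/"):
    """Return the percentage of inodes in use on the filesystem holding `path`.

    Returns None when the filesystem does not report an inode total.
    """
    stats = os.statvfs(path)  # available on Unix-like systems only
    if stats.f_files == 0:
        return None
    used = stats.f_files - stats.f_ffree
    return 100.0 * used / stats.f_files


def parse_meminfo_kb(text):
    """Parse /proc/meminfo-style text into a {field: number} dict.

    Most fields are reported in kB; the unit suffix is ignored here.
    """
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


# Illustrative sample, not taken from a real host.
SAMPLE_MEMINFO = """MemTotal:       16000000 kB
MemFree:          500000 kB
MemAvailable:    9000000 kB
"""

if __name__ == "__main__":
    mem = parse_meminfo_kb(SAMPLE_MEMINFO)
    print("MemFree kB:", mem["MemFree"])
    print("MemAvailable kB:", mem["MemAvailable"])
    print("Inodes used %:", inode_usage_percent("/"))
```

In the sample, MemFree is far lower than MemAvailable, which is the reason an alarm on MemFree alone tends to be misleading. On a real host you would pass the contents of `/proc/meminfo` instead of the sample string.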

Load Balancer

  • maximum connections allowed to each underlying host
  • number of concurrent connections to each underlying host
  • number of active/inactive hosts behind load balancer
  • network bytes in, network bytes out
  • request queue size, number of dropped requests

Cache

  • cache hits, misses, hit ratio
  • cache time
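
A small sketch of how the hit ratio can be derived, assuming the cache exposes cumulative hit and miss counters (many do, but the counter names here are hypothetical). Computing the ratio over the interval between two snapshots, rather than over the whole lifetime of the process, makes a sudden drop visible on a dashboard.

```python
def hit_ratio(hits, misses):
    """Return hits / (hits + misses), or None when there was no traffic."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total


def interval_hit_ratio(previous, current):
    """Hit ratio for the period between two cumulative counter snapshots.

    Each snapshot is a dict with hypothetical keys "hits" and "misses".
    """
    hits = current["hits"] - previous["hits"]
    misses = current["misses"] - previous["misses"]
    return hit_ratio(hits, misses)


if __name__ == "__main__":
    earlier = {"hits": 9000, "misses": 1000}
    later = {"hits": 9050, "misses": 1050}
    print("lifetime ratio:", hit_ratio(later["hits"], later["misses"]))
    print("last interval ratio:", interval_hit_ratio(earlier, later))
```

In this made-up example the lifetime ratio still looks healthy while the last interval served only half of its requests from the cache.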

Database

  • active connections
  • throttling
  • transaction lengths, queues

Java, C#

  • garbage collection count/time
  • open file descriptors

Application

  • number of fatals/errors/warnings
  • client side latencies
  • dependency latencies, fatals, timeouts
  • different rates (messages, APIs, availability)
  • request/response sizes
  • business specific metrics
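
For latencies, one common way to summarize the recorded values is with percentiles, since an average can hide a slow tail. The sketch below is only an assumption about how you might aggregate them; the function name and the choice of p50/p90/p99 are mine, not a requirement of any particular monitoring system.

```python
import statistics


def latency_summary(latencies_ms):
    """Summarize a batch of latency samples (in milliseconds).

    Returns None when there are too few samples to compute percentiles.
    """
    if len(latencies_ms) < 2:
        return None
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {
        "p50": cuts[49],
        "p90": cuts[89],
        "p99": cuts[98],
        "max": max(latencies_ms),
    }


if __name__ == "__main__":
    # Made-up samples: mostly fast requests plus a few slow ones.
    samples = [12, 15, 11, 14, 13, 250, 16, 12, 900, 14]
    print(latency_summary(samples))
```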

Host

  • CPU utilization
  • heartbeats
  • free memory (see remark above)
  • number of processes running
  • context switches
  • free disk space

Serverless

  • time from scheduling to execution
  • number of concurrent executions
  • time/CPU/memory used

What the eyes don’t see, the heart doesn’t grieve over

What I find surprising is that I did well without metrics for a long time in my career. However, once I started using them (both custom and predefined ones), I couldn’t imagine going back. Start today. Create a simple dashboard. In the cloud it’s as easy as drag and drop.

Operational Excellence series

  1. Intro: What is Software Operational Excellence?
  2. Deploying: Rock solid pipeline - how to deploy to production comfortably?
  3. Monitoring&Alarming: Types of alarms - what’s beyond min-max checks?
  4. Monitoring: What service metrics should be monitored?
  5. Scaling: (Auto) scaling services by CPU? You are doing it wrong
  6. Scaling: How do you know the right maximum connections?
  7. Scaling: How to estimate host fleet size? Why keeping CPU at 30% might NOT be waste of money?

Please note: the views I express are mine alone and they do not necessarily reflect the views of Amazon.com.