Monitoring your services is crucial if you care about application uptime. There are hundreds, if not thousands, of parameters you can (and should) monitor, covering CPU, network, hosts, the application, and so on. Which ones matter? What are the non-obvious choices?

Record as much as you can

First of all, a good rule of thumb is to record as many metrics as possible. It is cheap, and if you care about service uptime there is no reason not to. Do not set alarms on all of them; just have them available in case of an emergency. They are priceless when mitigating problems or finding their root cause. When you do set an alarm on a metric, consider the alarm type: see "Types of alarms - what’s beyond min-max checks?".

Useful list of useful metrics

Here is my reference list. It is not complete, but it is a good starting point. Pick whatever you think is useful. If you use a cloud provider (AWS, Azure, etc.), you are probably already monitoring some of these by default.

Non-obvious choices

inodes - because you can run out of space… without running out of space: a filesystem with no free inodes cannot create new files, even when free bytes remain
memory - are you sure you want to monitor MemFree? MemAvailable is probably a better choice. Is it enough? No, because it doesn’t tell you how much of that memory is available to the JVM/IIS.
number of running threads and the sizes of the different thread pools - when all threads in a pool are busy, the logs will only show that the application did nothing, with no indication why.
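
To make the first two items concrete, here is a minimal sketch in Python, assuming a Linux host. The function names inode_usage_percent and parse_meminfo are hypothetical helpers made up for this illustration; os.statvfs and the /proc/meminfo file are standard on Linux. Treat it as a starting point, not a full monitoring agent.

```python
import os


def inode_usage_percent(path="/"):
    """Return the percentage of inodes in use on the filesystem holding path.

    Returns None when the filesystem does not report an inode count.
    Uses os.statvfs, which is available on Unix-like systems only.
    """
    st = os.statvfs(path)
    if st.f_files == 0:
        return None
    return 100.0 * (st.f_files - st.f_ffree) / st.f_files


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most entries are reported in kB; the unit suffix is dropped.
    """
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values
```

On a real host you would call inode_usage_percent("/") for each mounted filesystem, and feed the contents of /proc/meminfo to parse_meminfo to compare the MemFree and MemAvailable entries side by side.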

Load Balancer

maximum connections allowed to each underlying host
number of concurrent connections to each underlying host
number of active/inactive hosts behind load balancer
network bytes in, network bytes out
request queue size, number of dropped requests
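
One possible way to use the first two metrics together is to compare them, which shows how close each host is to its connection limit. The sketch below is only an illustration: connection_utilization and the host names are hypothetical, and how you obtain the numbers depends on your load balancer.

```python
def connection_utilization(concurrent, max_allowed):
    """Map each host to the fraction of its allowed connections in use.

    concurrent:  dict of host -> current number of connections
    max_allowed: dict of host -> configured connection limit
    Hosts with a non-positive limit are skipped.
    """
    return {
        host: concurrent.get(host, 0) / limit
        for host, limit in max_allowed.items()
        if limit > 0
    }


# Hypothetical numbers for two hosts behind the load balancer.
utilization = connection_utilization(
    {"host-a": 50, "host-b": 95},
    {"host-a": 100, "host-b": 100},
)
```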

Cache

cache hits, misses, hit ratio
cache time
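
Hit and miss counters are usually cumulative, so a hit ratio is most useful when computed over an interval rather than since process start. Here is a minimal sketch of that calculation; CacheCounters and hit_ratio are hypothetical names, and I am assuming your cache exposes cumulative hit and miss counts.

```python
from dataclasses import dataclass


@dataclass
class CacheCounters:
    """A snapshot of cumulative cache counters."""
    hits: int
    misses: int


def hit_ratio(previous, current):
    """Return the hit ratio for the interval between two snapshots.

    Returns None when there were no lookups in the interval
    (or when the counters were reset and went backwards).
    """
    hits = current.hits - previous.hits
    misses = current.misses - previous.misses
    total = hits + misses
    if hits < 0 or misses < 0 or total == 0:
        return None
    return hits / total
```

For example, going from 100 hits and 50 misses to 190 hits and 60 misses means 90 hits out of 100 lookups in that interval.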

Database

active connections
throttling
transaction lengths, queues

Java, C#

garbage collection count/time
open file descriptors

Application

number of fatals/errors/warnings
client side latencies
dependency latencies, fatals, timeouts
different rates (messages, APIs, availability)
request/response sizes
business specific metrics

Host

CPU utilization
heartbeats
free memory (see remark above)
number of processes running
context switches
free disk space

Serverless

time from scheduling to execution
number of concurrent executions
time/CPU/memory used

What the eyes don’t see, the heart doesn’t grieve over

What I find surprising is that I did well without metrics for a long time in my career. However, now that I have started using them (both custom and predefined ones), I can’t imagine going back. Start today. Create a simple dashboard. In the cloud it’s as easy as drag and drop.

Operational Excellence series

Intro: What is Software Operational Excellence?
Deploying: Rock solid pipeline - how to deploy to production comfortably?
Monitoring&Alarming: Types of alarms - what’s beyond min-max checks?
Monitoring: What service metrics should be monitored?
Scaling: (Auto) scaling services by CPU? You are doing it wrong
Scaling: How do you know the right maximum connections?
Scaling: How to estimate host fleet size? Why keeping CPU at 30% might NOT be waste of money?

Please note: the views I express are mine alone and they do not necessarily reflect the views of Amazon.com.
