Prometheus, not a Nagios replacement

From reading the documentation, it seems that Prometheus is not a Nagios replacement. I thought that maybe it could be.
It doesn’t seem to be a central location for performance data from multiple hosts either; rather, it seems to cover performance only for the host it is running on.

Am I interpreting this correctly?

What are you hoping for as “a Nagios replacement”?

Prometheus uses either directly instrumented applications or things called exporters to monitor systems & applications. They are accessed via HTTP, so you can run the Prometheus server on a different host to the things you are monitoring, although good practice is to keep the Prometheus server from being “too far” away from them (for example you might have a Prometheus per data centre).
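As a rough illustration of one Prometheus server collecting from several remote hosts, a scrape configuration might look like the fragment below. This is only a sketch: the hostnames and the job name are made up, and it assumes each host runs node_exporter on its default port 9100.

```yaml
# prometheus.yml (fragment)
# Hostnames and job name are hypothetical; 9100 is node_exporter's default port.
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - web01.example.com:9100
          - db01.example.com:9100
```

Each target is scraped over HTTP from the central server, so the metrics for all listed hosts end up in one place.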

Views on status
How would I get a view of just items that are in an “error state”?

Alerting when a remote location is offline.
How would I receive notifications from a remote data center if that data center is supposed to have its own Prometheus and it is offline or does not have internet access? It can’t alert me.

Grouping of hosts
There doesn’t appear to be grouping of hosts into a hierarchy. If a virtual machine host is down, it seems that I would get an alert for it and all the virtual machines rather than just the virtual machine host.

Alerting Overload
If a host is offline, I just want an alert that the host is offline, not an alert on all the services as well. I could end up with 20 alerts when I just need 1 alert.

Do these questions/concerns make sense?

  1. You can view alerts in a number of different ways. The Alertmanager UI will show currently firing alerts, while the Prometheus UI will also show alerts which are nearly firing (those in the “pending” state). There are also other UIs, such as Grafana or Karma.

  2. There are various options in this case. You could have local Alertmanagers in each DC. You can also have alerts to indicate that a DC is down, so you know not to expect more granular alerts from that location.

  3. You can use labels to group things, for example you could have a label that indicates the VM host a VM currently exists on (something similar is common within Kubernetes, with labels for namespace & node).

  4. “Inhibition rules” are what you are looking for here - alerts that do not get delivered because another alert is currently active, such as individual hosts being down because of a wider network or power failure.
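To make point 4 a bit more concrete, here is a minimal sketch of an Alertmanager inhibition rule. The alert names (`HostDown`, `Service...`) and the `host` label are assumptions for illustration, not something Prometheus defines for you; a `HostDown` alert could, for instance, be built on the `up` metric being 0. The rule only works if both the host alert and the service alerts carry the same `host` label value.

```yaml
# alertmanager.yml (fragment)
# Alert names and the "host" label are hypothetical.
inhibit_rules:
  - source_matchers:
      - alertname = "HostDown"       # while this alert is firing...
    target_matchers:
      - alertname =~ "Service.*"     # ...mute alerts matching this...
    equal:
      - host                         # ...but only for the same host
```

With something like this in place, a host going offline should produce the single `HostDown` notification rather than one per service. The same idea extends to the VM host case in point 3 by matching on a label that names the VM host.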


Awesome! Thank you very much for clearing things up.