Why choose Prometheus over Icinga2?

Can someone recommend using Prometheus over Icinga2? I see lots of enthusiasm for Prometheus on the web so I’m curious to understand what I’m missing.

For my monitoring needs, I currently use a software stack comprised of Icinga2, Icingaweb2, Graphite/Grafana and PagerDuty. We monitor multiple stacks of about 30 servers each, which include some bare metal servers but mostly AWS instances. Each stack of servers has its own monitoring server. Currently we do not use containerization, although that might change.

I still use NRPE as the agent but expect to switch soon to Icinga as the agent. Both agents can use the repository of Nagios checks, which constitute many of our remote checks, but I have also written many custom checks (mostly in Python).

I see references in Prometheus to html scraping so I assume that, generally, all metrics get formatted as special html and then Prometheus knows how to access and parse those values. How do things like processes running, or process counts, cpu-loads, volume sizes, mysql replication delays, or ntpd health get checked and ready for Prometheus’ html scraping?

One of the features I love about Icinga2 is in the definition of the checks for a given host. The apply Service rule provides assign where and ignore where clauses so we can use logic about a host’s attributes to determine if a check should run on a given host. It really simplifies how we apply checks to hosts. Additionally a similar logic can be used to create host and service groups so we can correlate and compare the received metrics.

In a Prometheus comparison document (Comparison to alternatives | Prometheus), I see no reference to Icinga2. It does compare Prometheus to Nagios, but although Icinga was originally a fork of Nagios, Icinga2 is way beyond Nagios.

Any advice appreciated. Thank you.

Prometheus is designed to replace both Icinga style checks and Graphite metrics with a single unified tool.

A lot of things are different, because Prometheus has no concept of hosts or checks.

What you have instead is service discovery and alerting rules.

Rather than have to configure a check for a host, you simply configure a rule based on the service indicator.

For example, if you use the mysqld_exporter to monitor replication delay, you write one alert that covers all instances of replication delay. As you add or remove MySQL servers, no monitoring system changes need to be made. Discovery is automatic (based on the method you choose) and the alerting rule matches automatically.
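
As a rough sketch of what that one rule could look like: the metric name mysql_slave_status_seconds_behind_master is the one I would expect the mysqld_exporter to expose, while the group name, the 300 second threshold, the 5 minute hold time and the severity label are placeholders made up for illustration.

```yaml
# Hypothetical rule file, e.g. mysql_rules.yml
groups:
  - name: mysql_replication        # placeholder group name
    rules:
      - alert: MySQLReplicationLag
        # Matches every discovered MySQL instance; no per-host definition.
        expr: mysql_slave_status_seconds_behind_master > 300   # assumed threshold
        for: 5m                    # must stay above the threshold this long
        labels:
          severity: page           # placeholder label
        annotations:
          summary: "Replication on {{ $labels.instance }} is {{ $value }}s behind"
```

A file like this gets loaded by listing it under rule_files in the main Prometheus config. New replicas are covered as soon as they are discovered and scraped.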

Prometheus also tends to favor monitoring based on event counters. For example, instead of checking for an error condition, you create a metric that is a counter of the number of errors. This way if an error happens there is no way for Prometheus to miss the error between scrape intervals. This is the fundamental flaw with many NRPE checks. They’re completely blind between the check intervals.
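
A small sketch of alerting on a counter instead of a point-in-time check. The metric app_errors_total and the 5 minute window are hypothetical, picked only to show the shape of the rule.

```yaml
groups:
  - name: error_counters           # placeholder group name
    rules:
      - alert: ErrorsObserved
        # increase() looks at how much the counter grew over the window,
        # so an error that happened between two scrapes still shows up.
        expr: increase(app_errors_total[5m]) > 0   # hypothetical metric
        labels:
          severity: ticket         # placeholder label
        annotations:
          summary: "{{ $labels.instance }} logged new errors in the last 5 minutes"
```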

Prometheus can also deduplicate alerts (this part is handled by Alertmanager). Rather than send 30 host-down alerts if a switch fails, it will send a single alert saying “30 hosts are down”.
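
The grouping lives in the Alertmanager configuration. A minimal sketch, assuming PagerDuty as the receiver (since that is what you already use); the timings, the receiver name and the routing key placeholder are illustrative, not recommendations.

```yaml
# Hypothetical alertmanager.yml fragment
route:
  receiver: pagerduty
  # Alerts sharing these label values are batched into one notification,
  # so 30 firing "host down" alerts arrive as a single page.
  group_by: ['alertname', 'job']
  group_wait: 30s        # wait briefly to collect related alerts
  group_interval: 5m     # how often to send updates for an existing group
  repeat_interval: 4h    # re-notify if the group is still firing
receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: '<your-integration-key>'   # placeholder, not a real key
```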

Prometheus can also poll data much faster. A typical scrape interval for metrics is 15 seconds, meaning you can get notified sooner in case of problems. You can also go even faster if you like, in order to see “micro bursts” of events.

Prometheus also acts in a more global way, because instead of executing simple remote checks, it pulls all the data into itself. It can then compare data across multiple hosts and decide if the overall system is healthy based on your rules. If you have 10 servers and only need 8 to run the service, why bother alerting if one or two are down?
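
The 8-out-of-10 case could be written roughly like this. The up metric is generated by Prometheus for every scrape target (1 if the scrape succeeded, 0 if not); the job name myservice and the 2 minute hold time are assumptions for the example.

```yaml
groups:
  - name: service_capacity         # placeholder group name
    rules:
      - alert: ServiceBelowCapacity
        # Fires only when fewer than 8 instances of the job are reachable,
        # so losing one or two of ten servers stays silent.
        expr: sum(up{job="myservice"}) < 8    # hypothetical job name
        for: 2m
        annotations:
          summary: "Only {{ $value }} myservice instances are up"
```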

I appreciate your insights, SuperQ.

Some notes/questions in no particular order, please…

UPDATE - I see now my questions are many and tedious. Apologies.

“single unified tool” - looking at documentation, I see a collection of tools; prometheus server with a simple UI, then Alertmanager, Pushgateway, node exporter, (other exporters), Service discovery, Grafana…

How does the “real world” (ironical quotes because, well, virtualization) association of hosts and service checks work if Prometheus has “no concept of hosts or checks”? We all deal with hosts and services, right? I would expect to see hosts and service checks in my monitoring tools.

node exporter - is this like an agent on a remote host to look at disk usage, cpu, hardware, etc?

Same question for mysqld_exporter which looks like a stand-alone application. I see releases so I assume no need to compile in go? Where does it run in relation to the hosts running mysql?

Counters - I don’t use them much because the ones I receive just go up, like total transactions, or total errors. And they go to zero when the process is restarted. I’d rather see the rate of transactions, or how many transactions per second/per minute. Obviously I can see that when they go flatlined it means no transactions.

Gauge - I think most of my metrics would be in this category; ping/icmp times, packet loss, cpu loads, total procs on a server… but I keep these to see trends using grafana.

I’ll need to research what a Histogram or a Summary can tell me.

“an endpoint you can scrape is called an instance, usually corresponding to a single process” - is this a perspective focused on containers (which I don’t use)? I keep track of running processes within a “server/host” (also called an instance in our aws environment). I track that all the application processes on a given host are, in fact, running. One or more, OK; Zero, Critical/broken.

Our application code, mostly java, provides snmp metrics and some use tomcat as the http service. For those that do not currently require tomcat, will it be necessary to include tomcat to provide http scraping capabilities? And for the ones already using tomcat do they need to be substantially modified to do the same?

“can also deduplicate alerts” - that’s good and Icinga does the same with “apply dependencies” to hosts and services. Host and service definitions can have their notifications (or checks) disabled based on hierarchical dependencies. It’s quite elegant when used with “assign where” and “ignore where” conditions. I define each service only once and the “assign where” and “ignore where” conditions determine the hosts to which they apply. IOW, no hard coding a service to a host.

“instead of executing simple remote checks, it pulls all the data into itself. It can then compare data across multiple hosts and decide if the overall system is healthy based on your rules” - that sounds exciting.

Last questions:
Are there any recommended tutorials for configuring a simple prometheus server which, for example, is monitoring a few linux servers (centos/ubuntu) each with ping, disk, and cpu load checks (or the equivalent to checks in prometheus)?

Are there any ansible roles that might be recommended for installing all the parts along with Grafana, Alertmanager, etc.?

Again, thank you. Any advice, appreciated.

I found Cloud Alchemy on GitHub for using ansible to install various components of Prometheus.

While almost all the tutorials I’ve found describe prometheus and node_exporter running on the prometheus server, few describe how to monitor other servers - which is the main thing to do. It looks like one configures the main prometheus config yaml on the central server using targets like this:

global:
  scrape_interval: 5s   # Reduced for demo sake
scrape_configs:
  - job_name: node
    static_configs: 
      - targets:
        - '192.168.1.1:9100'
        - '192.168.1.2:9100'
        - '192.168.1.3:9100'
        - '192.168.1.4:9100'
  - job_name: haproxy
    static_configs: 
      - targets:
        - '192.168.1.1:9101'
        - '192.168.1.2:9101'
        - '192.168.1.3:9101'
        - '192.168.1.4:9101'

So if I install only the node_exporter on each of my remote servers, I can then access the IP/port combinations I have configured on each of the remote servers.

BTW, my take on node_exporter is that it is an agent application. It might work differently than nrpe but it acts as an agent to calls (scrapes) from the central prometheus host. :wink:

“single unified tool”

Maybe I should have said “system”, not “tool”. There are several components, but one overall system.

Prometheus is an “agentless” system. Kinda like SNMP, it’s a defined protocol, only the protocol is HTTP-based. Also, rather than having one endpoint like SNMP typically has, it’s designed so that any service can expose metrics directly.

So you end up with a number of “micro agent” exporters. This is by design to avoid the bloated, buggy, mega agents of other systems. Take the node_exporter, for example. Since the Linux kernel doesn’t have an HTTP server built in, the node_exporter basically acts as the agent for the kernel metrics.
Same with MySQL. MySQL doesn’t have an HTTP server built-in, so we have a shim service that translates SQL metrics queries into HTTP.

Think of it like this: instead of check scripts, you have metrics scripts. They export raw data to Prometheus, and the “checks” are run on the Prometheus server side with the raw data.

Counters - I don’t use them much because the ones I receive just go up, like total transactions, or total errors. And they go to zero when the process is restarted. I’d rather see the rate of transactions, or how many transactions per second/per minute. Obviously I can see that when they go flatlined it means no transactions.

Prometheus has a query language to deal with this. You use rate(my_count_total[1m]) to turn the counter into a per-second transaction rate. rate() also accounts for the counter going back to zero when the process restarts.
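
If you want that rate precomputed for dashboards, a recording rule is one option. A sketch, reusing the my_count_total example name from above; the group name and the recorded series name job:my_count:rate1m are made up here, loosely following the usual level:metric:operation naming style.

```yaml
groups:
  - name: transaction_rates        # placeholder group name
    rules:
      # Stores the per-second rate, summed per job, as a new time series
      # that Grafana can graph directly.
      - record: job:my_count:rate1m
        expr: sum by (job) (rate(my_count_total[1m]))
```

For transactions per minute instead of per second, multiplying the expression by 60 at query time works.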

Gauge - I think most of my metrics would be in this category; ping/icmp times, packet loss, cpu loads, total procs on a server… but I keep these to see trends using grafana.

Funny enough, a lot of those can be counters, in a way. For example, the smokeping_prober takes ping times and records them into a histogram, which is a special type of counter.

CPU is also a counter. If you grep '^cpu' /proc/stat, you’ll see that the underlying data the kernel provides for CPU utilization is a counter. In Prometheus, we expose this data raw and store the raw counter values, rather than turning them into rates ahead of time. This is more robust against sample loss.
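
To get the familiar “percent busy” number back out of the raw counter, something along these lines should work. It assumes the node_exporter’s node_cpu_seconds_total metric with its mode label; the 5 minute window and the recorded series name are arbitrary choices for the example.

```yaml
groups:
  - name: cpu_usage                # placeholder group name
    rules:
      # rate() of the idle counter gives the fraction of each second spent
      # idle per CPU; averaging per instance and inverting gives percent busy.
      - record: instance:node_cpu_busy:percent      # hypothetical name
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
```

Because the raw counter is stored, the same data can be re-queried later over a 1 minute or 1 hour window without having decided on that up front.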

This is very similar to how SNMP works under the hood. ifHCInOctets is a counter. It’s just usually translated to a rate by the monitoring tools. In Prometheus we avoid pre-translating counters, because we can do more accurate math on the fly, depending on the time range you’re looking to view.

is this a perspective focused on containers

It’s not focused on containers, it just makes it easier to support both host-based and container-based infra at the same time.

Our application code, mostly java, provides snmp metrics and some use tomcat as the http service.

You replace SNMP with Prometheus. Prometheus is designed as a replacement for both Nagios/Icinga and SNMP in one system. For Java applications, you typically use the Prometheus Java client library. There’s also the jmx_exporter, which can expose data from Java apps as well as pull data from your Tomcat apps.

It’s quite elegant when used with “assign where” and “ignore where” conditions. I define each service only once and the “assign where” and “ignore where” conditions determine the hosts to which they apply. IOW, no hard coding a service to a host.

In Prometheus, you don’t have to assign/ignore anything. It’s automatic.

Sorry, I was posting/updating as you responded to my previous response. You’re faster than me. Thanks.

So if I install only the node_exporter on each of my remote servers, I can then access the IP/port combinations I have configured on each of the remote servers.

Yup, there are a dozen different service discovery methods depending on your environment.

For example, you mentioned AWS. Prometheus can discover all your servers directly from the EC2 API. You can use the tags on your servers to help Prometheus know which services are on which machines.
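
A sketch of what EC2 discovery could look like in the scrape config. The region, the monitoring=enabled tag and the use of the Name tag are assumptions for illustration; credentials are assumed to come from an instance role or the environment.

```yaml
scrape_configs:
  - job_name: node
    ec2_sd_configs:
      - region: us-east-1          # assumed region
        port: 9100                 # node_exporter port on each instance
    relabel_configs:
      # Keep only instances carrying the (hypothetical) tag monitoring=enabled.
      - source_labels: [__meta_ec2_tag_monitoring]
        regex: enabled
        action: keep
      # Use the EC2 Name tag as the instance label instead of IP:port.
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
```

With this in place there is no static list of IPs to maintain. Instances appear and disappear as the EC2 API reports them.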

Why configure manually when Prometheus can do it for you?

BTW, my take on node_exporter is that it is an agent application

It’s more subtle than that in the Prometheus ecosystem, as I described in my other post. It’s an agent, but just for the kernel/OS. Metrics for other things like java apps, databases, etc are each handled by their own micro-agents. This allows them to be small and lightweight, with minimal dependencies and light security footprint.

One “end game” for Prometheus is to completely eliminate the need for separate agent-like exporters. If everyone is using the Prometheus protocol for monitoring, there’s no need for separate agents. For example, if you’re using HAProxy for load balancing, there is built-in Prometheus support. This direct support is 10x easier to deal with than SNMP. There are no MIBs, it’s a human-readable format, it’s (somewhat) self-documenting, and it defaults to giving you all metrics rather than making you figure out which parts of the MIB are relevant.

To this end, we’ve written OpenMetrics and submitted it to the IETF as an Internet-Draft. It’s intended to be the long-term replacement for SNMP.

“end game” for Prometheus is to completely eliminate the need for separate agent-like exporters

Sounds like a grand scheme to take over the monitoring world… I’d be crazy to not sign on. :wink:

To that end, what is the fastest way for me to define all my servers, about 150, and fire up the prometheus server to see all those instances, let’s say for disk stats, processor loads, and packet loss? Can you recommend some tutorials or ansible playbooks? Thanks.

BTW, I have a couple dozen raspis that I use for home automation. Are Prometheus and/or the exporters lightweight enough to run in that environment?