Hi,
I’ve written a script that reports the state of the applications in my infrastructure and sends the results to my Prometheus.
My query for the visualization is:
sum(rate(httpstat_total_time{instance="nodeX", http_code=~"2.*"}[15m])) by (host) / sum(rate(httpstat_total_time{instance="nodeX", http_code=~".*"}[15m])) by (host) * 100
It simply compares the 2xx responses with all responses and gives me the uptime in %.
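To make the intent clearer, here is a minimal Python sketch of the ratio I expect the query to produce. It is only an illustration: the function name `uptime_percent` and the sample numbers are made up, and it assumes each HTTP code simply maps to a number of responses.

```python
def uptime_percent(responses_by_code):
    """Return the share of 2xx responses in percent, or None if there were no responses."""
    total = sum(responses_by_code.values())
    if total == 0:
        # No data at all: the ratio is undefined, so report nothing instead of 0 %.
        return None
    ok = sum(count for code, count in responses_by_code.items() if code.startswith("2"))
    return 100 * ok / total


# Hypothetical sample: 95 successful responses out of 100.
sample = {"200": 95, "404": 3, "503": 2}
print(uptime_percent(sample))  # prints 95.0
```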
It works fine, but I have two main problems:
- How can I implement a working maintenance window? I mean, how do I ignore downtime within a given time range (for example 2:00 - 4:00 every day)?
- At least once a week I remove or add an application, and that breaks my cumulative stats per server or per DC.
The query looks like this, for example:
(sum(rate(httpstat_total_time{http_code=~"2.*", region!="europe", server_type=~"client_server.*"}[1d])) by (region, instance) / sum(rate(httpstat_total_time{http_code=~".*", region!="europe", server_type=~"client_server.*"}[1d])) by (region, instance) * 100)
Thanks for your recommendations in advance…