The basic issue is that when we query rate(api_heartbeat_seconds_bucket{le="1.0",instance="SITE-A:1234"}[2m]) we see unexpected spikes in the rate:
Fig 1
But when we query api_heartbeat_seconds_bucket{le="1.0",instance="SITE-A:1234"} without rate(), the counter increases in smooth steps:
Fig 2
Background: We were scraping three different Prometheus nodes into one federated Prometheus, and we saw a very small difference between the nodes, a diff of 1 or 2 (let me know if you want to see the data from each node). So we decided to scrape only one node to test whether this made any difference.
We saw that since then we have not had any spikes. Fig 3
It is strange as we have had the multi-target configuration in place for months and it has only become an issue over the last couple of weeks.
Is this a coincidence, with something else actually at play?
Why would the rate extrapolation be so sensitive and unpredictable?
I assume the reason api_heartbeat_seconds_bucket{le="1.0",instance="SITE-A:1234"} shows smooth steps is that Prometheus applies some smoothing when it runs this query. Is this correct?
Is there something I could check to find the root cause of the spikes?
You likely have a small counter value drop, which is resulting in rate() thinking you have a counter reset. This single errant sample is the problem.
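To illustrate why a drop of just 1 can produce a large spike, here is a minimal Python sketch of the counter-reset handling. It is a simplified model, not Prometheus's actual code: it ignores the extrapolation to the window boundaries, and the sample values are made up. The function name is hypothetical.

```python
def increase_with_reset_handling(values):
    """Simplified model of how rate()/increase() treat a counter series.

    Any decrease between consecutive samples is assumed to be a counter
    reset to zero, so the full previous value is added back as a correction.
    Extrapolation to the window edges is deliberately left out.
    """
    correction = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur < prev:
            correction += prev
    return values[-1] - values[0] + correction


# Hypothetical samples inside one range window.
smooth = [1000, 1002, 1004, 1006, 1008]
with_drop = [1000, 1002, 1001, 1004, 1006]  # one sample is 1 lower than the previous one

print(increase_with_reset_handling(smooth))     # 8.0
print(increase_with_reset_handling(with_drop))  # 1008.0
```

In this model the drop from 1002 to 1001 is treated as a restart from zero, so the previous value of 1002 is added to the increase. On a graph of the raw counter, a dip of 1 on a value around 1000 is practically invisible, which would explain why the raw series looks smooth while the rate spikes.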
If you query api_heartbeat_seconds_bucket{le="1.0",instance="SITE-A:1234"}[10m] with an instant query (table view in the UI), it will return the actual samples and timestamps for debugging.
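Once you have the raw samples, spotting the errant one by eye can be tedious. A small Python sketch like the following could help. It assumes you have copied the samples into a list of (timestamp, value) pairs; the function name and the data are hypothetical. PromQL's resets() function over the same range counts these decreases on the server side.

```python
def find_drops(samples):
    """Return (previous, current) sample pairs where the counter value decreased."""
    return [
        (prev, cur)
        for prev, cur in zip(samples, samples[1:])
        if cur[1] < prev[1]
    ]


# Hypothetical (unix_timestamp, value) pairs copied from the table view.
samples = [
    (1700000000.0, 1000.0),
    (1700000015.0, 1002.0),
    (1700000030.0, 1001.0),  # lower than the sample before it
    (1700000045.0, 1004.0),
]

for (prev_t, prev_v), (cur_t, cur_v) in find_drops(samples):
    print(f"drop of {prev_v - cur_v} between t={prev_t} and t={cur_t}")
```

Any pair reported here is a point where rate() would assume a counter reset.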
If there is a way to confirm the behaviour I am seeing, that would be great. Not sure if @SuperQ would be able to help. Thanks in advance.