Like clockwork, one of my Prometheus servers sees rate calculations drop every 30 minutes. I noticed that `prometheus_tsdb_out_of_order_samples_total` was rising at exactly the same times, so I tried setting `storage.tsdb.out_of_order_time_window` to 1s. That fixed the periodic rise in `prometheus_tsdb_out_of_order_samples_total`, but it didn't fix the drops in my calculated data.
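For reference, that was set through the TSDB section of the config file (the window option needs Prometheus 2.39 or newer), roughly:

```
# prometheus.yml -- accept samples up to 1s older than the newest in-order sample.
# This only suppresses the out-of-order counter; it doesn't address why samples arrive late.
storage:
  tsdb:
    out_of_order_time_window: 1s
```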
I also noticed that there is a gap in `timestamp(up{job="$JOBNAME"})` at the same times. It looks like the data from scrapes at those times is arriving late. (We scrape every 15s.)
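The lateness shows up clearly as scrape-timestamp lag; a minimal query for it (using the same job label as above):

```
# age of the newest `up` sample; with a 15s scrape interval this should stay well under 30s,
# and spikes here line up with the gaps
time() - timestamp(up{job="$JOBNAME"})
```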
Other data we have drops out at 2-hour intervals.
I’ve checked for cron jobs and spikes in resource utilization, but I haven’t been able to find anything. We’ve got other servers with periodic issues, but with different periods, and others with no issues at all. They’re all using the same basic configs, with different scrape targets. There doesn’t seem to be any correlation between the number of targets and the dropout periods.
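For what it’s worth, the resource checks were self-monitoring queries along these lines (assuming the server scrapes itself under `job="prometheus"`):

```
# CPU and memory of the Prometheus process around the dropout times -- both stay flat
rate(process_cpu_seconds_total{job="prometheus"}[5m])
process_resident_memory_bytes{job="prometheus"}
```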
What else could it be?
I think this is what we’re experiencing (GitHub issue, opened 01 Mar 2022, closed 31 Jul 2023, labeled priority/P1, component/tsdb):
**What did you do?**
Upgraded Prometheus from v2.32.1 to v2.33.1
**What did you expect to see?**
Similar performance and smooth graphs.
**What did you see instead? Under which circumstances?**
It appears that starting with v2.33.0 Prometheus hits some scalability limits for us.
Something takes longer than it used to, OR it has started to block when it didn't before.
We've started noticing that ever since we upgraded Prometheus from v2.32.1 to v2.33.1 (same issue with v2.33.4), on our biggest instances every 30 minutes we have:
- some counter updates are delayed, which looks like scrapes are getting delayed (I only mention counters because the effect is more visible on counters than on gauges) - so either the actual HTTP scrape is delayed OR sample insertion into the TSDB is delayed (or wherever the sample timestamp is set on scrape)
- we see a massive spike in rule evaluation duration (see the query sketch after this list)
- we see rule evaluation timeouts `query timed out in expression evaluation`
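A minimal way to watch the evaluation-duration spikes, using the rule-group metrics Prometheus exposes about itself:

```
# fraction of each rule group's interval spent on its last evaluation
prometheus_rule_group_last_duration_seconds / prometheus_rule_group_interval_seconds

# evaluations skipped because the previous one overran
increase(prometheus_rule_group_iterations_missed_total[5m])
```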
Still trying to debug it; so far it doesn't seem to be related to:
- queries - we don't see any spike in query volume
- goroutines - they stay flat, so it's not like Go ends up with so many goroutines that some get left behind
- CPU or memory - we don't see any elevated resource usage when this happens
- chunk write queue - this is new code added in 2.33 with a default queue size of 1000, and our metrics show the rate of elements added to the queue spiking to around 250k/s, so I tested Prometheus with a much bigger queue size (up to 50M) with no effect on this issue
- query concurrency limit - that's 20 by default and we usually seem to issue more queries per second than that, so I suspected we might be queuing queries too much, but bumping it up to 96 (on a server with 128 cores) didn't seem to change anything (the flags I varied are shown after this list)
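For reference, and assuming I have the 2.33 flag names right, the two knobs varied in those last tests were:

```
# experimental chunk write queue size (new in 2.33) and the query concurrency limit
--storage.tsdb.head-chunks-write-queue-size=50000000
--query.max-concurrency=96
```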
Since this is happening every 30 minutes and only seems to affect our biggest instances (~15M time series), I've checked what else happens every 30 minutes. We do have `--storage.tsdb.min-block-duration=30m` & `--storage.tsdb.max-block-duration=30m`, mostly to reduce memory usage: we have a fair amount of metrics churn, so more frequent HEAD compaction helps keep memory usage lower than it would be otherwise.
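The half-hour cadence itself is easy to confirm from Prometheus's own TSDB metrics, e.g.:

```
# head/block compactions tick over every 30 minutes with the block-duration flags above
increase(prometheus_tsdb_compactions_total[5m])
```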
What I've also noticed is that TSDB HEAD active appenders spike around the time of this issue. They likely spike when there's HEAD/block compaction, so I'm not sure whether that's the effect or the cause. Looking at historical metrics I see they were always spiking around that time, but with 2.33 the spikes are bigger. See metrics below:
[metric screenshots: 19dm12 - v2.32.1, 19dm13 - v2.33.4]
So far I haven't found any useful logs that would point me in another direction, and I'm not sure what other metrics might be relevant here. Any tips on further debugging would be very helpful.
**Environment**
* System information:
`Linux 5.15.19 x86_64`
It was fixed in 2.47 and we’re on 2.45; I’ll update when I’ve had a chance to upgrade and see if that fixes our issues.
Upgrading to 2.52.0 fixed our problem.