Huge Discrepancy in TSDB Block Disk Usage

I’m running two large Prometheus servers in a datacenter, both scraping the same targets. However, the blocks on one of them take up significantly more disk space (~140GB vs 108GB). What could be causing such a large discrepancy?

It looks like compaction is not performing well on serverA, based on the data in this gist (where you can also see the size difference despite near-identical series, chunk, and sample counts).

Things to note:

  • We recently added a flag to account for scrape jitter (--scrape.timestamp-tolerance 49ms)
  • serverA ran out of disk recently and had compactions fail as a result

I just ran across your post and it seems I am running into the same problem. I recently moved our Prometheus HA setup (two nodes scraping the same targets) onto new systems, and one of them started to use significantly more storage space than the other. While every 2-hour block consumes up to 600MB on the first node, the block for the same time range occupies nearly 1.2GB on the second node. Both nodes share an identical configuration (apart from the replica label) and scrape the same number of targets, and the meta.json files of these blocks show almost identical numbers for chunks, series, and samples. Did you find out where this came from on your setup?
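
For anyone comparing two replicas like this, here is a minimal sketch of how the per-block numbers could be lined up. It is not an official Prometheus tool. It assumes the usual on-disk layout, where each block is a directory under the TSDB data directory containing a meta.json with "stats" (numSamples, numSeries, numChunks) and "compaction" (level) fields. The function names block_report and dir_size_bytes are made up for illustration. Bytes per sample is only a rough indicator, since the block size also includes the index and tombstones.

```python
import json
import os
import sys


def dir_size_bytes(path):
    """Sum the sizes of all regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def block_report(data_dir):
    """Return one dict per block directory found directly under data_dir."""
    rows = []
    for entry in sorted(os.listdir(data_dir)):
        block_dir = os.path.join(data_dir, entry)
        meta_path = os.path.join(block_dir, "meta.json")
        if not os.path.isfile(meta_path):
            continue  # not a block (for example the WAL directory or a lock file)
        with open(meta_path) as f:
            meta = json.load(f)
        stats = meta.get("stats", {})
        samples = stats.get("numSamples", 0)
        size = dir_size_bytes(block_dir)
        rows.append({
            "ulid": meta.get("ulid", entry),
            "min_time": meta.get("minTime"),
            "max_time": meta.get("maxTime"),
            "level": meta.get("compaction", {}).get("level"),
            "series": stats.get("numSeries", 0),
            "chunks": stats.get("numChunks", 0),
            "samples": samples,
            "size_bytes": size,
            # Assumption: on-disk bytes divided by samples is a fair rough
            # measure of how well the block compressed.
            "bytes_per_sample": size / samples if samples else None,
        })
    return rows


if __name__ == "__main__" and len(sys.argv) > 1:
    for row in block_report(sys.argv[1]):
        bps = row["bytes_per_sample"]
        bps_text = f"{bps:.2f}" if bps is not None else "n/a"
        print(
            f'{row["ulid"]} level={row["level"]} '
            f'series={row["series"]} chunks={row["chunks"]} '
            f'samples={row["samples"]} size={row["size_bytes"]} '
            f"bytes/sample={bps_text}"
        )
```

Running it against the data directory on each node (the path is passed as the first argument) and matching blocks by min_time and max_time should show whether the extra space is spread evenly across all blocks or concentrated in a few, and whether the two nodes reached the same compaction level for the same time range.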