Huge Discrepancy in TSDB Block Disk Usage

I’m running two large Prometheus servers in the same datacenter, both scraping the same targets. However, the blocks on one of them take up significantly more disk space (~140 GB vs 108 GB). What could be causing such a large discrepancy?

Based on the data in this gist, compaction does not appear to be working well on serverA. The gist also shows the size difference despite nearly identical series, chunk, and sample counts.
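For anyone who wants to reproduce the comparison, here is a minimal sketch of how per-block numbers like these can be collected. It assumes the usual TSDB layout, where each block is a directory containing a meta.json with "ulid", "stats" (numSamples, numSeries, numChunks) and "compaction" (level) fields. The function name block_stats and the output columns are my own choices for illustration, not part of any Prometheus tooling.

```python
import json
import os
import sys


def block_stats(data_dir):
    """Return one dict per TSDB block found directly under data_dir."""
    rows = []
    for name in sorted(os.listdir(data_dir)):
        block_dir = os.path.join(data_dir, name)
        meta_path = os.path.join(block_dir, "meta.json")
        if not os.path.isfile(meta_path):
            continue  # not a block: skips wal/, lock files, and so on

        with open(meta_path) as f:
            meta = json.load(f)

        # Sum the size of every file in the block (chunks, index, tombstones, meta.json).
        size = 0
        for root, _dirs, files in os.walk(block_dir):
            for fn in files:
                size += os.path.getsize(os.path.join(root, fn))

        stats = meta.get("stats", {})
        samples = stats.get("numSamples", 0)
        rows.append({
            "ulid": meta.get("ulid", name),
            "level": meta.get("compaction", {}).get("level"),
            "series": stats.get("numSeries", 0),
            "chunks": stats.get("numChunks", 0),
            "samples": samples,
            "bytes": size,
            # Bytes per sample makes blocks of different time ranges comparable.
            "bytes_per_sample": size / samples if samples else None,
        })
    return rows


if __name__ == "__main__" and len(sys.argv) > 1 and os.path.isdir(sys.argv[1]):
    # Pass the directory that --storage.tsdb.path points to on the server.
    for r in block_stats(sys.argv[1]):
        bps = r["bytes_per_sample"]
        bps_text = f"{bps:.3f}" if bps is not None else "n/a"
        print(r["ulid"], r["level"], r["series"], r["chunks"],
              r["samples"], r["bytes"], bps_text)
```

Running this against the data directory on both servers and comparing bytes per sample and compaction level for blocks covering the same time range should show whether the extra space comes from a few poorly compacted blocks or is spread evenly across all of them.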

Things to note:

  • We recently added a flag to account for scrape jitter (--scrape.timestamp-tolerance 49ms)
  • serverA ran out of disk recently and had compactions fail as a result