Hi,
I’ve noticed something unexpected since we recently deployed Thanos alongside our existing Prometheus environment (though my question is really about Prometheus, not Thanos).
It concerns the TSDB blocks that Prometheus creates: they seem to leave out the most recent hour, which stays in the WAL. We don’t notice this with the Prometheus datasource (I guess it combines block and WAL data), but it is noticeable with the Thanos datasource.
Incidentally, since deploying Thanos we have set the recommended --storage.tsdb.min-block-duration=2h and --storage.tsdb.max-block-duration=2h flags on Prometheus.
But here is what I see in the Prometheus file system (just after 10 AM BST this morning):
- A block was created at 10 AM BST (shown as 09:00 UTC in the file system). All OK so far.
- But its meta.json file shows this:
  - "minTime": 1776060000042,
  - "maxTime": 1776067200000,
  - which is from 6 AM GMT/UTC (7 AM BST) to 8 AM GMT/UTC (9 AM BST).
- And if I look in the wal folder just after the block is created, I see a further hour of files (covering, I guess, from 9 AM BST up to the current time, 10 AM BST).
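To double-check the time-zone arithmetic, here is a small Python sketch that converts the meta.json values. It assumes minTime/maxTime are Unix epoch milliseconds (the values above fit that) and treats BST as a fixed UTC+1 offset, which only holds between the UK spring and autumn clock changes.

```python
from datetime import datetime, timedelta, timezone

# Assumption: BST modelled as a fixed UTC+1 offset (valid only during UK summer time).
BST = timezone(timedelta(hours=1), "BST")
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_ms_to_utc(ms):
    """Convert a TSDB timestamp (Unix epoch milliseconds) to an aware UTC datetime.

    Integer timedelta arithmetic avoids float rounding on the millisecond part.
    """
    return EPOCH + timedelta(milliseconds=ms)


meta = {"minTime": 1776060000042, "maxTime": 1776067200000}  # values from meta.json

for key in ("minTime", "maxTime"):
    utc = epoch_ms_to_utc(meta[key])
    print(f"{key}: {utc:%Y-%m-%d %H:%M:%S} UTC = {utc.astimezone(BST):%H:%M:%S} BST")
# Prints:
# minTime: 2026-04-13 06:00:00 UTC = 07:00:00 BST
# maxTime: 2026-04-13 08:00:00 UTC = 09:00:00 BST
```

The timestamps themselves are plain UTC epoch values; BST only changes how they read locally.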
So what I’m curious about is: when a block is created at 10 AM BST, why does it only contain the 2 hours from 7 AM to 9 AM BST and leave an hour in the wal folder?
Our Prometheus version is 2.53.0, which admittedly is a little behind the latest.
I wondered whether this might be a daylight-saving issue, as I only noticed it after the UK clocks changed from GMT to BST.
Has anyone seen this before? Or is this simply how TSDB blocks are written?
Thanks,
G
Some further info…
Here’s what these two datasources look like right now (11:42 BST) in Grafana:
The wal folder (times in UTC) looks like this:
Oh, I should mention that we are running Thanos Sidecar here too, but I have temporarily deactivated it in these examples so I can show exactly what the Thanos Store is returning.
To follow on from the posts above: it is now 12:26 BST, and after a new Prometheus TSDB block was created I see this in Grafana:
(Incidentally, Thanos Sidecar also did its upload, with no issues there.)
I see a new block folder in the Prometheus TSDB file system (timed 11:00 UTC).
And the wal folder now shows:
Apr 13 10:03 00048231
Apr 13 10:13 00048232
Apr 13 10:22 00048233
Apr 13 10:31 00048234
Apr 13 10:40 00048235
Apr 13 10:49 00048236
Apr 13 10:58 00048237
Apr 13 11:00 00048238
Apr 13 11:09 00048239
Apr 13 11:18 00048240
Apr 13 11:25 00048241
Apr 13 11:00 checkpoint.00048230/
So what I’m curious about is: why didn’t the block created at 11:00 UTC include the 7 segment files written between 10:00 and 11:00 UTC?
This, I think, is why I see the 1-hour “gap” via the Thanos datasource. Or, with the Sidecar excluded as above, the Thanos data is up to 3 hours old just before a block upload and 1 hour old just after one.
Does anyone have any ideas on that?
Thanks
Today I’m looking at the prometheus_tsdb_head_min_time_seconds and prometheus_tsdb_head_max_time_seconds metrics, and I see this:
Every two hours a block is created (the logs look fine, with no errors or warnings), but it never flushes the entire head/WAL; it always leaves 1 hour behind.
The head spans 3 hours just before the block is created; 2 hours are written to the block, and the head then spans 1 hour.
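That 3h/2h/1h pattern would be explained by two rules that I believe govern head compaction (from a quick read of the Prometheus TSDB source, not the docs, so treat them as assumptions): the head is only compacted once it spans more than 1.5× the block range, i.e. more than 3 hours with 2h blocks, and the block that gets cut runs from the head's min time up to the next 2h-aligned boundary, with maxTime exclusive. Below is a minimal Python model of that; the helper names and the 15s scrape interval are hypothetical, and the starting point is the minTime from the meta.json above.

```python
from datetime import datetime, timedelta, timezone

HOUR_MS = 3_600_000
BLOCK_RANGE_MS = 2 * HOUR_MS  # matches --storage.tsdb.min-block-duration=2h
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def fmt(ms):
    """Render epoch milliseconds as HH:MM:SS in UTC."""
    return f"{EPOCH + timedelta(milliseconds=ms):%H:%M:%S}"


def is_compactable(head_min, head_max, block_range=BLOCK_RANGE_MS):
    # Assumed rule: compact only once the head spans MORE than 1.5x the block range.
    return head_max - head_min > block_range // 2 * 3


def next_block(head_min, block_range=BLOCK_RANGE_MS):
    # Assumed rule: block covers [head_min, next aligned boundary), upper bound exclusive.
    block_max = (head_min // block_range) * block_range + block_range
    return head_min, block_max


def simulate(head_min, until, scrape_ms=15_000):
    """Append one sample per (hypothetical) scrape interval; record every block cut."""
    blocks = []
    head_max = head_min
    while head_max < until:
        head_max += scrape_ms
        if is_compactable(head_min, head_max):
            mint, maxt = next_block(head_min)
            blocks.append((head_max, mint, maxt))
            # Simplification: the data left in the head starts exactly at the boundary.
            head_min = maxt
    return blocks


start = 1776060000042  # "minTime" from meta.json (06:00 UTC)
for cut_at, mint, maxt in simulate(start, until=start + 6 * HOUR_MS):
    print(f"cut at {fmt(cut_at)}: block [{fmt(mint)}, {fmt(maxt)}), "
          f"head left holding {(cut_at - maxt) / HOUR_MS:.2f} h")
# Prints:
# cut at 09:00:15: block [06:00:00, 08:00:00), head left holding 1.00 h
# cut at 11:00:00: block [08:00:00, 10:00:00), head left holding 1.00 h
```

In this model each cut leaves about 1 hour in the head, and the first block's upper bound lands on 1776067200000, the same value as maxTime in our meta.json.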
Is this normal behaviour?
I’m proceeding on the basis that this is normal behaviour. I now believe my initial assumption, that each 2-hourly block would include all metrics up to the current time, was wrong, and that the most recent hour actually stays in the head (probably because it is still active, being written to, etc.).
I haven’t found anything yet to confirm this definitively, but all our Prometheus instances show the same behaviour and none of them is logging any errors or warnings.