NODE_disk_write_latency

Hi, I have a problem: the disk write latency of my Prometheus deployment is periodically very high. Every 6 hours the latency climbs above 30ms and stays there for some time.

I guess this is a TSDB issue; Prometheus compacts and writes data to disk at regular intervals.
Reference documents: Storage | Prometheus

Can I reduce the disk write latency by modifying the following two parameters?
--storage.tsdb.min-block-duration
--storage.tsdb.max-block-duration

Or are there other ways to solve this problem?
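
For reference, this is how the flags would look if set (shown here as container args; the values are just placeholders, and I believe these are hidden flags, so I may have the defaults wrong):

```yaml
args:
  - --config.file=/etc/prometheus/prometheus.yml
  # Hidden flags (values below are placeholders, not recommendations):
  - --storage.tsdb.min-block-duration=2h    # default: 2h
  - --storage.tsdb.max-block-duration=6h    # default: 10% of the retention period
```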

What does reducing the disk latency actually solve? Seems like you’re trying to solve a non-problem. Prometheus needs to use the disk to compact the TSDB data. Modifying those flags is just going to make things worse.

Thanks for your help.
We now use NVMe SSDs, so we have a monitoring indicator that requires node_disk_write_latency to be less than 16ms.

If the disk write latency exceeds 16ms, and goes up to 30ms, it means the SSD is under high load, or even in an abnormal state. We want to reduce the pressure on the disk and keep it in a normal state (write latency below 16ms).

That sounds like some arbitrary measure you made up. Do you have a source for this “rule”?

node_disk_write_latency is not a metric exposed by the node_exporter. Where does this come from?

You’ve set an arbitrary measure for no determinable reason.

“Pressure on the disk” is a meaningless measure, especially if the device is dedicated to a specific task. What you’re saying makes no sense.

This is an alert rule for a TiDB cluster, and it is triggered frequently on my Prometheus node.
TiDB Cluster Alert Rules | PingCAP Docs
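
As far as I can tell, node_disk_write_latency is not scraped directly from node_exporter but is produced by a recording rule in the TiDB monitoring stack. My rough reconstruction looks like this (the exact expression in the TiDB rules may differ):

```yaml
groups:
  - name: node-disk-write-latency-sketch   # name made up for this example
    rules:
      # Average seconds per write over the last 5m, from node_exporter counters.
      - record: node_disk_write_latency
        expr: |
          rate(node_disk_write_time_seconds_total[5m])
            / rate(node_disk_writes_completed_total[5m])
      # Fire when the average write takes longer than 16ms.
      - alert: NODE_disk_write_latency
        expr: node_disk_write_latency > 0.016
        for: 5m
```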

Those alerts at most indicate you have a defective SSD/NVMe device, not a problem with Prometheus. The “Solutions” provided are not actually solutions.

Also, those alerts may be appropriate for TiDB, but they are not useful for Prometheus.

OK, I understand what you mean.
I just want to find a way to optimize the TSDB so that it writes or compacts data more frequently. That way it would have less data to process each time, reducing the pressure on the disk.

Thank you again. If there is no solution here, I’ll try another way.

I think the question still comes down to “why”? Are you seeing any actual issues? It is expected that Prometheus will periodically have times where there is more disk I/O (e.g. when blocks are being written or expired, or when certain queries are happening). In this sense the usage will always be “bursty”. In general the default settings are the ones to use; they should only be adjusted very carefully - you can very easily make things a lot worse - and only if they are causing specific problems.
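
If you want to confirm that the latency spikes line up with TSDB compaction rather than a faulty device, you could graph the disk write rate alongside the compaction counter, e.g. with something like the following (the rule names here are made up; running the two expressions ad hoc in the graph UI works just as well):

```yaml
groups:
  - name: prometheus-io-burst-sketch   # example name only
    rules:
      # Bytes written per second to the Prometheus data disk.
      - record: sketch:node_disk_written_bytes:rate5m
        expr: rate(node_disk_written_bytes_total[5m])
      # TSDB compactions per second; spikes here should line up with the bursts.
      - record: sketch:prometheus_tsdb_compactions:rate5m
        expr: rate(prometheus_tsdb_compactions_total[5m])
```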

Changing these settings makes things worse. The reason the TSDB does compaction at the frequency it does is to reduce the overall write rate of the system. The write pattern is designed to produce the most compact on-disk data. Compacting more frequently would leave the data less compressed, and that would have a much worse impact, since less compact data takes more memory to use.

Prometheus itself doesn’t actually need NVMe storage. Normal HDD storage is typically good enough. The read/write patterns are not super heavy (as you noticed, it’s just a burst every once in a while). This is because the actual metric processing happens in memory (page cache) rather than depending on heavy I/O.

Thanks for your helpful reply. I will refer to it and give feedback to my team.
I will change the alert rule instead of modifying these parameters!

Through your reply, I have come to understand the importance of these parameters.
I will adjust our alert rule.
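
For example, something along these lines, excluding the Prometheus node from the latency alert (just a sketch; the instance pattern is a placeholder for our real labels):

```yaml
groups:
  - name: node-disk-write-latency-adjusted   # sketch, not our final rule
    rules:
      - alert: NODE_disk_write_latency
        # Exclude the Prometheus node; the instance pattern is only an example.
        expr: node_disk_write_latency{instance!~"prometheus.*"} > 0.016
        for: 5m
```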