Hi, I’m using Prometheus (2.45) on a Raspberry Pi with NVMe storage, but for the past few days I’ve been having issues:
Prometheus stops running after about 1–2 hours (probably due to some WAL error). If I delete the wal folder and restart Prometheus it works again, but -obviously- I lose all the data collected between the crash and the moment I delete the wal folder.
Within a few hours my wal folder fills up with files; this is what the folder looked like 6–8 hours after Prometheus stopped working (I was sleeping).
And Prometheus stops without any errors; I simply found this:
root@Grafana:~# service prometheus status
● prometheus.service - prometheus
Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/prometheus.service.d
└─dietpi-process_tool.conf
Active: activating (auto-restart) (Result: exit-code) since Sun 2023-07-16 08:49:19 CEST; 8s ago
Process: 40471 ExecStart=/home/dietpi/prometheus/prometheus --config.file=/home/dietpi/prometheus/prometheus.yml --storage.tsdb.path=/home/dietpi/prometheus/data --storage.tsdb.retention.time=3y (code=exited, status=1/FAILURE)
Main PID: 40471 (code=exited, status=1/FAILURE)
CPU: 610ms
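In case more context helps, this is the kind of journalctl call I’m using to pull the output of the failed start (the unit name prometheus.service comes from the status above):

# show the most recent output from the prometheus unit, including the failed start
journalctl -u prometheus.service --no-pager -n 100

But even there I don’t see anything more useful than the exit code above.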
When I delete the wal folder and restart Prometheus, everything works again:
root@Grafana:~# service prometheus status
● prometheus.service - prometheus
Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/prometheus.service.d
└─dietpi-process_tool.conf
Active: active (running) since Sun 2023-07-16 09:33:53 CEST; 33s ago
Main PID: 41120 (prometheus)
Tasks: 10 (limit: 4531)
CPU: 3.399s
CGroup: /system.slice/prometheus.service
└─41120 /home/dietpi/prometheus/prometheus --config.file=/home/dietpi/prometheus/prometheus.yml --storage.tsdb.path=/home/dietpi/prometheus/data --storage.tsdb.retention.time=3y
Jul 16 09:33:53 Grafana prometheus[41120]: ts=2023-07-16T07:33:53.724Z caller=main.go:1224 level=info msg="Loading configuration file" filename=/home/dietpi/prometheus/prometheus.yml
Jul 16 09:33:53 Grafana prometheus[41120]: ts=2023-07-16T07:33:53.726Z caller=main.go:1261 level=info msg="Completed loading of configuration file" filename=/home/dietpi/prometheus/prometheus.yml totalDuration=2.074037ms db_storage=6.685µs remote_storage=7.907µs web_handler=3.778µs query_engine=4.388µs scrape=674.481µs scrape_sd=85.537µs notify=86.13µs notify_sd=60.722µs rules=6.426µs tracing=20.092µs
Jul 16 09:33:53 Grafana prometheus[41120]: ts=2023-07-16T07:33:53.727Z caller=main.go:1004 level=info msg="Server is ready to receive web requests."
Jul 16 09:33:53 Grafana prometheus[41120]: ts=2023-07-16T07:33:53.727Z caller=manager.go:995 level=info component="rule manager" msg="Starting rule manager..."
Jul 16 09:34:01 Grafana prometheus[41120]: ts=2023-07-16T07:34:01.159Z caller=compact.go:514 level=info component=tsdb msg="write block resulted in empty block" mint=1689451200000 maxt=1689458400000 duration=78.179759ms
Jul 16 09:34:01 Grafana prometheus[41120]: ts=2023-07-16T07:34:01.174Z caller=head.go:1293 level=info component=tsdb msg="Head GC completed" caller=truncateMemory duration=8.437611ms
Jul 16 09:34:02 Grafana prometheus[41120]: ts=2023-07-16T07:34:02.165Z caller=compact.go:464 level=info component=tsdb msg="compact blocks" count=3 mint=1689422400000 maxt=1689444000000 ulid=01H5ESXRWNSSRSV2BEYV2DCSQ7 sources="[01H5D11PBTCHTBP540V9AWV9Z8 01H5D7XDM5VA335SNYQBYRZ6AK 01H5DES4V60ZZKGAJA6SKGS6GE]" duration=800.055555ms
Jul 16 09:34:02 Grafana prometheus[41120]: ts=2023-07-16T07:34:02.177Z caller=db.go:1617 level=info component=tsdb msg="Deleting obsolete block" block=01H5D11PBTCHTBP540V9AWV9Z8
Jul 16 09:34:02 Grafana prometheus[41120]: ts=2023-07-16T07:34:02.180Z caller=db.go:1617 level=info component=tsdb msg="Deleting obsolete block" block=01H5DES4V60ZZKGAJA6SKGS6GE
Jul 16 09:34:02 Grafana prometheus[41120]: ts=2023-07-16T07:34:02.184Z caller=db.go:1617 level=info component=tsdb msg="Deleting obsolete block" block=01H5D7XDM5VA335SNYQBYRZ6AK
It looks like something is broken in the compaction of some blocks, or something similar… I also tried deleting some obsolete blocks, but the result is the same.
Is there a way or a tool to check the Prometheus data folder? Or to force compaction of the wal folder into a block? Should I maybe lower the default 2h block time? Or could it be my SSD/NVMe (it seems to work fine otherwise)?
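The closest thing to a checker I’ve found so far is promtool, which ships alongside the prometheus binary. Would something like this be the right way to inspect the data folder? (The block ULID is one of the blocks from my log above; I’m assuming Prometheus should be stopped while running these.)

# list the blocks in the TSDB data directory
./promtool tsdb list /home/dietpi/prometheus/data
# analyze one block (churn, label cardinality, etc.)
./promtool tsdb analyze /home/dietpi/prometheus/data 01H5ESXRWNSSRSV2BEYV2DCSQ7

As far as I can tell, though, neither of these inspects the wal folder itself.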
I don’t know what else to try to get back to a setup that doesn’t stop working after a few hours.
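To rule out the drive itself, I was planning to check it along these lines (assuming the NVMe shows up as /dev/nvme0n1 on my Pi and smartmontools is installed; these are just generic disk checks, nothing Prometheus-specific):

# SMART health report for the NVMe drive
smartctl -a /dev/nvme0n1
# any kernel-level I/O errors since boot?
dmesg | grep -iE 'nvme|i/o error'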
Here are my Prometheus runtime flags:
--alertmanager.notification-queue-capacity 10000
--alertmanager.timeout
--config.file /home/dietpi/prometheus/prometheus.yml
--enable-feature
--log.format logfmt
--log.level info
--query.lookback-delta 5m
--query.max-concurrency 20
--query.max-samples 50000000
--query.timeout 2m
--rules.alert.for-grace-period 10m
--rules.alert.for-outage-tolerance 1h
--rules.alert.resend-delay 1m
--scrape.adjust-timestamps true
--scrape.discovery-reload-interval 5s
--scrape.timestamp-tolerance 2ms
--storage.agent.no-lockfile false
--storage.agent.path data-agent/
--storage.agent.retention.max-time 0s
--storage.agent.retention.min-time 0s
--storage.agent.wal-compression true
--storage.agent.wal-segment-size 0B
--storage.agent.wal-truncate-frequency 0s
--storage.remote.flush-deadline 1m
--storage.remote.read-concurrent-limit 10
--storage.remote.read-max-bytes-in-frame 1048576
--storage.remote.read-sample-limit 50000000
--storage.tsdb.allow-overlapping-blocks true
--storage.tsdb.head-chunks-write-queue-size 0
--storage.tsdb.max-block-chunk-segment-size 0B
--storage.tsdb.max-block-duration 31d
--storage.tsdb.min-block-duration 2h
--storage.tsdb.no-lockfile false
--storage.tsdb.path /home/dietpi/prometheus/data
--storage.tsdb.retention 0s
--storage.tsdb.retention.size 0B
--storage.tsdb.retention.time 3y
--storage.tsdb.samples-per-chunk 120
--storage.tsdb.wal-compression true
--storage.tsdb.wal-segment-size 0B
--web.config.file
--web.console.libraries console_libraries
--web.console.templates consoles
--web.cors.origin .*
--web.enable-admin-api false
--web.enable-lifecycle false
--web.enable-remote-write-receiver false
--web.external-url
--web.listen-address 0.0.0.0:9090
--web.max-connections 512
--web.page-title Prometheus Time Series Collection and Processing Server
--web.read-timeout 5m
--web.route-prefix /
--web.user-assets
--write-documentation false
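And if lowering the 2h default turns out to be the right move, I guess I’d do it by adjusting the --storage.tsdb.min-block-duration flag (visible in the dump above) on the ExecStart line of the unit file. Just an untested sketch of what I’d try:

# /etc/systemd/system/prometheus.service (or a drop-in), ExecStart adjusted:
ExecStart=/home/dietpi/prometheus/prometheus \
  --config.file=/home/dietpi/prometheus/prometheus.yml \
  --storage.tsdb.path=/home/dietpi/prometheus/data \
  --storage.tsdb.retention.time=3y \
  --storage.tsdb.min-block-duration=1h
# then reload and restart:
systemctl daemon-reload
systemctl restart prometheus

No idea whether that would actually help or just mask the problem, though.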
Thanks a lot!