Hi, I’m using Prometheus (2.45) on a Raspberry Pi with NVMe storage, but for the past few days I’ve been having issues:
Prometheus stops running after about 1–2 hours (probably due to some WAL error). If I delete the wal folder and restart Prometheus it works again, but -obviously- I lose all the data collected between the crash and the moment I delete the wal folder.
Within a few hours my wal folder fills up with files; this is what the folder looked like 6–8 hours after Prometheus stopped working (I was sleeping).
And Prometheus stops without any errors; I simply found this:
root@Grafana:~# service prometheus status
● prometheus.service - prometheus
Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/prometheus.service.d
└─dietpi-process_tool.conf
Active: activating (auto-restart) (Result: exit-code) since Sun 2023-07-16 08:49:19 CEST; 8s ago
Process: 40471 ExecStart=/home/dietpi/prometheus/prometheus --config.file=/home/dietpi/prometheus/prometheus.yml --storage.tsdb.path=/home/dietpi/prometheus/data --storage.tsdb.retention.time=3y (code=exited, status=1/FAILURE)
Main PID: 40471 (code=exited, status=1/FAILURE)
CPU: 610ms
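In case more context helps, this is the kind of journalctl call I’m using to pull the output of the failed start (the unit name prometheus.service comes from the status above):

# show the most recent output from the prometheus unit, including the failed start
journalctl -u prometheus.service --no-pager -n 100

But even there I don’t see anything more useful than the exit code above.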
When I delete the wal folder and restart Prometheus, everything works again:
root@Grafana:~# service prometheus status
● prometheus.service - prometheus
Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/prometheus.service.d
└─dietpi-process_tool.conf
Active: active (running) since Sun 2023-07-16 09:33:53 CEST; 33s ago
Main PID: 41120 (prometheus)
Tasks: 10 (limit: 4531)
CPU: 3.399s
CGroup: /system.slice/prometheus.service
└─41120 /home/dietpi/prometheus/prometheus --config.file=/home/dietpi/prometheus/prometheus.yml --storage.tsdb.path=/home/dietpi/prometheus/data --storage.tsdb.retention.time=3y
Jul 16 09:33:53 Grafana prometheus[41120]: ts=2023-07-16T07:33:53.724Z caller=main.go:1224 level=info msg="Loading configuration file" filename=/home/dietpi/prometheus/prometheus.yml
Jul 16 09:33:53 Grafana prometheus[41120]: ts=2023-07-16T07:33:53.726Z caller=main.go:1261 level=info msg="Completed loading of configuration file" filename=/home/dietpi/prometheus/prometheus.yml totalDuration=2.074037ms db_storage=6.685µs remote_storage=7.907µs web_handler=3.778µs query_engine=4.388µs scrape=674.481µs scrape_sd=85.537µs notify=86.13µs notify_sd=60.722µs rules=6.426µs tracing=20.092µs
Jul 16 09:33:53 Grafana prometheus[41120]: ts=2023-07-16T07:33:53.727Z caller=main.go:1004 level=info msg="Server is ready to receive web requests."
Jul 16 09:33:53 Grafana prometheus[41120]: ts=2023-07-16T07:33:53.727Z caller=manager.go:995 level=info component="rule manager" msg="Starting rule manager..."
Jul 16 09:34:01 Grafana prometheus[41120]: ts=2023-07-16T07:34:01.159Z caller=compact.go:514 level=info component=tsdb msg="write block resulted in empty block" mint=1689451200000 maxt=1689458400000 duration=78.179759ms
Jul 16 09:34:01 Grafana prometheus[41120]: ts=2023-07-16T07:34:01.174Z caller=head.go:1293 level=info component=tsdb msg="Head GC completed" caller=truncateMemory duration=8.437611ms
Jul 16 09:34:02 Grafana prometheus[41120]: ts=2023-07-16T07:34:02.165Z caller=compact.go:464 level=info component=tsdb msg="compact blocks" count=3 mint=1689422400000 maxt=1689444000000 ulid=01H5ESXRWNSSRSV2BEYV2DCSQ7 sources="[01H5D11PBTCHTBP540V9AWV9Z8 01H5D7XDM5VA335SNYQBYRZ6AK 01H5DES4V60ZZKGAJA6SKGS6GE]" duration=800.055555ms
Jul 16 09:34:02 Grafana prometheus[41120]: ts=2023-07-16T07:34:02.177Z caller=db.go:1617 level=info component=tsdb msg="Deleting obsolete block" block=01H5D11PBTCHTBP540V9AWV9Z8
Jul 16 09:34:02 Grafana prometheus[41120]: ts=2023-07-16T07:34:02.180Z caller=db.go:1617 level=info component=tsdb msg="Deleting obsolete block" block=01H5DES4V60ZZKGAJA6SKGS6GE
Jul 16 09:34:02 Grafana prometheus[41120]: ts=2023-07-16T07:34:02.184Z caller=db.go:1617 level=info component=tsdb msg="Deleting obsolete block" block=01H5D7XDM5VA335SNYQBYRZ6AK
It looks like something is broken in the compaction of some blocks, or something similar… I also tried deleting some obsolete blocks, but the result is the same.
Is there a way or a tool to check the Prometheus data folder? Or to force compaction of the wal folder into a block? Should I maybe lower the default 2h block time? Or could it be my SSD/NVMe (it seems to work fine otherwise)?
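The closest thing to a checker I’ve found so far is promtool, which ships alongside the prometheus binary. Would something like this be the right way to inspect the data folder? (The block ULID is one of the blocks from my log above; I’m assuming Prometheus should be stopped while running these.)

# list the blocks in the TSDB data directory
./promtool tsdb list /home/dietpi/prometheus/data
# analyze one block (churn, label cardinality, etc.)
./promtool tsdb analyze /home/dietpi/prometheus/data 01H5ESXRWNSSRSV2BEYV2DCSQ7

As far as I can tell, though, neither of these inspects the wal folder itself.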
I don’t know what else to try to get back to a setup that doesn’t stop working after a few hours.
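To rule out the drive itself, I was planning to check it along these lines (assuming the NVMe shows up as /dev/nvme0n1 on my Pi and smartmontools is installed; these are just generic disk checks, nothing Prometheus-specific):

# SMART health report for the NVMe drive
smartctl -a /dev/nvme0n1
# any kernel-level I/O errors since boot?
dmesg | grep -iE 'nvme|i/o error'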
Here are my Prometheus runtime flags:
--alertmanager.notification-queue-capacity 10000
--alertmanager.timeout
--config.file /home/dietpi/prometheus/prometheus.yml
--enable-feature
--log.format logfmt
--log.level info
--query.lookback-delta 5m
--query.max-concurrency 20
--query.max-samples 50000000
--query.timeout 2m
--rules.alert.for-grace-period 10m
--rules.alert.for-outage-tolerance 1h
--rules.alert.resend-delay 1m
--scrape.adjust-timestamps true
--scrape.discovery-reload-interval 5s
--scrape.timestamp-tolerance 2ms
--storage.agent.no-lockfile false
--storage.agent.path data-agent/
--storage.agent.retention.max-time 0s
--storage.agent.retention.min-time 0s
--storage.agent.wal-compression true
--storage.agent.wal-segment-size 0B
--storage.agent.wal-truncate-frequency 0s
--storage.remote.flush-deadline 1m
--storage.remote.read-concurrent-limit 10
--storage.remote.read-max-bytes-in-frame 1048576
--storage.remote.read-sample-limit 50000000
--storage.tsdb.allow-overlapping-blocks true
--storage.tsdb.head-chunks-write-queue-size 0
--storage.tsdb.max-block-chunk-segment-size 0B
--storage.tsdb.max-block-duration 31d
--storage.tsdb.min-block-duration 2h
--storage.tsdb.no-lockfile false
--storage.tsdb.path /home/dietpi/prometheus/data
--storage.tsdb.retention 0s
--storage.tsdb.retention.size 0B
--storage.tsdb.retention.time 3y
--storage.tsdb.samples-per-chunk 120
--storage.tsdb.wal-compression true
--storage.tsdb.wal-segment-size 0B
--web.config.file
--web.console.libraries console_libraries
--web.console.templates consoles
--web.cors.origin .*
--web.enable-admin-api false
--web.enable-lifecycle false
--web.enable-remote-write-receiver false
--web.external-url
--web.listen-address 0.0.0.0:9090
--web.max-connections 512
--web.page-title Prometheus Time Series Collection and Processing Server
--web.read-timeout 5m
--web.route-prefix /
--web.user-assets
--write-documentation false
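And if lowering the 2h default turns out to be the right move, I guess I’d do it by adjusting the --storage.tsdb.min-block-duration flag (visible in the dump above) on the ExecStart line of the unit file. Just an untested sketch of what I’d try:

# /etc/systemd/system/prometheus.service (or a drop-in), ExecStart adjusted:
ExecStart=/home/dietpi/prometheus/prometheus \
  --config.file=/home/dietpi/prometheus/prometheus.yml \
  --storage.tsdb.path=/home/dietpi/prometheus/data \
  --storage.tsdb.retention.time=3y \
  --storage.tsdb.min-block-duration=1h
# then reload and restart:
systemctl daemon-reload
systemctl restart prometheus

No idea whether that would actually help or just mask the problem, though.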
Thanks a lot!