We are getting our feet wet with Prometheus and have already thrown too much at it, ending up with too high cardinality and too many metrics (node exporter on 3000+ physical servers, IPMI exporter, manually generated data for the batch scheduler, almost 2000 BACnet data sets, …). At the moment, TSDB status lists 4758293 series, 24803125 chunks and 107935 label pairs.
On top of that, the current server is quite busy ingesting data (the WAL directory holds about 150 files of 128 MByte each, and data going back to February 27th totals about 2.1 TByte). Querying anything across the 3000+ instances going back more than a few hours (even at reduced resolution) regularly takes minutes or runs into a timeout.
In other words, we should probably start from scratch and re-evaluate our actions.
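As a starting point for that re-evaluation, here is a minimal sketch of how we could rank metric names by series count from the JSON that Prometheus serves at `/api/v1/status/tsdb`. The sample numbers and metric names in `SAMPLE` are made up for illustration, and the server address in the comment is hypothetical; as far as I understand, the endpoint only returns the top entries per category, not the full list.

```python
import json

# Made-up excerpt in the shape of the /api/v1/status/tsdb response.
# The metric names and counts are illustrative, not our real data.
SAMPLE = json.loads("""
{
  "status": "success",
  "data": {
    "headStats": {"numSeries": 1000, "chunkCount": 5000},
    "seriesCountByMetricName": [
      {"name": "node_cpu_seconds_total", "value": 400},
      {"name": "ipmi_sensor_value", "value": 250},
      {"name": "node_filesystem_avail_bytes", "value": 50}
    ]
  }
}
""")


def top_series_by_metric(tsdb_status, limit=10):
    """Return (metric name, series count) pairs, largest first."""
    entries = tsdb_status["data"]["seriesCountByMetricName"]
    ranked = sorted(entries, key=lambda e: e["value"], reverse=True)
    return [(e["name"], e["value"]) for e in ranked[:limit]]


def share_of_head(tsdb_status, limit=10):
    """Return (metric name, fraction of all head series) pairs."""
    total = tsdb_status["data"]["headStats"]["numSeries"]
    return [(name, count / total)
            for name, count in top_series_by_metric(tsdb_status, limit)]


# To run against a live server (address is an assumption):
#   import urllib.request
#   with urllib.request.urlopen("http://localhost:9090/api/v1/status/tsdb") as r:
#       status = json.load(r)
for name, fraction in share_of_head(SAMPLE, limit=3):
    print(f"{name}: {fraction:.1%} of head series")
```

Metrics that dominate this list would be the first candidates for dropping at scrape time or for moving to a separate server.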
(1) First, we should probably see whether we should/must use a single server for everything or can use different servers for different aspects and combine those with Grafana at a later stage. Obviously, recording rules would not be able to join between these “domains” but that may not be too hard a problem.
(2) Then we need to tackle the dreaded topic of long-term storage. Some of the metrics I already know we will want to keep for years to come (say at least 5 years), but those could be downsampled. Some/most of the metrics we would probably not care about after, say, 4 weeks or so. But then, we may need quite a number of metrics for observability instead of just monitoring, so we may need to keep more metrics than we initially think.
Therefore, which long-term storage model/tool/project would be suitable for us? We are mostly a “bare metal shop” here, with no OpenStack, Docker Swarm, k8s or the like, and we are very, very limited on manpower. That being said, having been burned too often by $LargeCompany which promised too much, got paid too much but never really delivered, we are quite hesitant to offload this to some large/small company.
Thus, anything which is “easy” to set up and maintain and at which we could just throw enough hardware (>100 physical CPU cores, >100 TByte NVMe storage should be no problem) would be at the top of our list.
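To put the hardware into perspective, here is a back-of-the-envelope sizing sketch using the rule of thumb from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample). Only the series count is taken from our TSDB status; the 60 s scrape interval, the 2 bytes per sample, the 10% share of series kept long-term and the 5-minute downsampled resolution are assumptions for illustration, not measurements.

```python
SECONDS_PER_DAY = 86400


def disk_bytes(series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Estimate disk usage: retention * samples per second * bytes per sample."""
    samples_per_second = series / scrape_interval_s
    return samples_per_second * bytes_per_sample * retention_days * SECONDS_PER_DAY


TOTAL_SERIES = 4758293  # from our current TSDB status

# (label, series, sample interval in seconds, retention in days)
# Everything except TOTAL_SERIES is an assumed value.
tiers = [
    ("short-term, all series, 4 weeks", TOTAL_SERIES, 60, 28),
    ("long-term, 10% of series at 5 min, 5 years", TOTAL_SERIES // 10, 300, 5 * 365),
]

for label, series, interval_s, days in tiers:
    tbyte = disk_bytes(series, interval_s, days) / 1e12
    print(f"{label}: about {tbyte:.2f} TByte")
```

This ignores the WAL, indexes and replication, so it is only meant to show the order of magnitude for each retention tier.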
Is there anything out there fitting our bill?
Cheers and thanks a lot in advance for any pointers.