Long Term Storage choice for very large number of metrics


We are getting our feet wet with Prometheus and have already thrown too much stuff at it, ending up with too high cardinality and too many metrics (node exporter on 3000+ physical servers, IPMI exporter, manually generated data for the batch scheduler, almost 2000 BACnet data sets, …). At the moment the TSDB status lists 4,758,293 series, 24,803,125 chunks and 107,935 label pairs.

On top of that, the current server is quite busy ingesting data (the WAL directory holds about 150 files of 128 MByte each; data going back to February 27th totals about 2.1 TByte), and querying anything across the 3000+ instances going back more than a few hours (even at reduced resolution) literally takes minutes or regularly runs into a timeout.

In other words, we should probably start from scratch and re-evaluate our actions.

(1) First, we should probably figure out whether we must use a single server for everything, or can use different servers for different aspects and combine those in Grafana at a later stage. Obviously, recording rules would not be able to join across these “domains”, but that may not be too hard a problem.
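For what it's worth, combining separate servers in Grafana is mostly a matter of registering each one as its own data source, e.g. via Grafana's provisioning mechanism. A rough sketch (the server names, URLs and file path here are made up for illustration):

```yaml
# grafana/provisioning/datasources/prometheus.yaml (hypothetical path)
apiVersion: 1
datasources:
  - name: Prometheus-Nodes        # node exporter / IPMI "domain"
    type: prometheus
    url: http://prom-nodes.example.org:9090
    access: proxy
  - name: Prometheus-BACnet       # building-data "domain"
    type: prometheus
    url: http://prom-bacnet.example.org:9090
    access: proxy
```

Each dashboard panel then simply picks the data source matching its domain; cross-domain joins stay impossible, as noted above.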

(2) Then we need to tackle the dreaded topic of long-term storage. Some of the metrics, I already know, we will want to keep for years to come (say at least 5 yrs), but those could be downsampled. Most of the metrics we would probably not care about after, say, 4 weeks or so. But then, we may need quite a number of metrics for observability rather than just monitoring, so we may need to keep more metrics than we initially think we need.

Therefore, which long-term storage model/tool/project would be suitable for us? We are mostly a “bare metal shop” here, no OpenStack, Docker Swarm, k8s or the like, and we are very, very limited on manpower. That being said, having been burned too often by $LargeCompany which promised too much, got paid too much, but never really delivered, we are quite hesitant to offload this to some large/small company.

Thus, anything which is “easy” to set up and maintain, and at which we could just throw enough hardware (>100 physical CPU cores and >100 TByte of NVMe storage should be no problem), would be at the top of our list.

Is there anything out there fitting our bill?

Cheers and thanks a lot in advance for any pointers.

There is no single or easy answer. It very much depends on what you are willing/able to spend in terms of both time and money.

If you are that strapped for time/experience, then a third-party solution such as Grafana Cloud could be the best option: they handle all the pain around scaling, updates and management, while you focus on the data scraping (exporters, instrumenting custom apps, etc.), dashboards and alerts. But it will cost money (and still some level of time to keep the bits you look after in check).

If you want to run it all yourself, it adds a big chunk of extra work and responsibility. You then also need to handle all the scaling, backups, updates, etc., as well as the initial design.

You could go with a purely Prometheus solution. Or you could look at one of the solutions within the wider Prometheus ecosystem, such as Thanos, Cortex or Mimir.

If you went with the pure Prometheus option you’d need a selection of Prometheus servers linked via remote read/write and/or federation. Prometheus doesn’t currently have the ability to downsample (though it can be approximated via federation to some extent) or to set variable retentions per metric (though that can be achieved via multiple servers). The suggestion would be to shard things based on location, functional area, etc. and then run multiple servers; this does have an overhead in terms of things that need management, updates, etc.
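As a rough sketch of what that could look like: short-retention shards per functional area, plus a long-retention server that federates only pre-aggregated series from them at a coarser interval. The job name, selector, target and intervals below are illustrative only, not recommendations:

```yaml
# Config fragment for the long-term Prometheus, started e.g. with
#   --storage.tsdb.retention.time=5y
# while the per-domain shards run with a short retention (e.g. 4w).
scrape_configs:
  - job_name: federate-nodes
    honor_labels: true            # keep the original job/instance labels
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.*"}'  # pull only recording-rule aggregates
    scrape_interval: 5m           # coarser resolution for long-term data
    static_configs:
      - targets:
          - prom-nodes.example.org:9090   # hypothetical shard address
```

This gives a crude form of downsampling (aggregated series at 5m resolution kept for years, raw data discarded after weeks), at the cost of deciding up front which recording rules to maintain.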

Alternatively, using something like Thanos, which already has downsampling and variable retention available, could be simpler, but it does work fairly differently (and likely alongside some level of normal Prometheus infrastructure). These tools normally work with cloud object storage (although there are various non-cloud options available too).
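For reference, Thanos handles downsampling automatically in its compactor and lets you set a retention per resolution; a sketch of the relevant flags (retention values and paths are just examples, and `bucket.yml` would point at whatever object store you use):

```shell
# Thanos compactor: downsamples blocks and prunes each resolution
# after its configured retention (values here are illustrative).
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=bucket.yml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=180d \
  --retention.resolution-1h=5y \
  --wait
```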

3k servers is a fair amount (and I’m guessing they are in multiple different locations), although 4 million series isn’t huge. Depending on what you come up with, it could quite easily need one or more FTEs’ worth of effort to properly maintain, so it very much depends how extreme your “very, very limited on man-power” limitation is: you need to be able to spend sufficient time keeping things working well, as otherwise you might find that it doesn’t work exactly when you need it most.

Hi Stuart,

starting my reply backwards (thanks a bunch for yours!).

The 3k servers are all in one location, basically one large room: mostly compute nodes loosely joined together by a batch scheduler (HTCondor), i.e. quite like a web server farm where nodes don’t talk much with each other and, if a few fail from time to time, it’s not a big deal.

The team is 3 FTE (4 shoulders, well, pairs of shoulders, since there are right and left shoulders, right?) handling every(!) aspect of this scientific research endeavor, supporting computing for about 80 users. This means basically everything has to be fully automated, otherwise we run out of work force/time.

Cortex, Mimir and Thanos were the “obvious” candidates after looking through Integrations | Prometheus, but given our “bare metal” background we shied away from anything which sounded too complex to begin with. Assuming we need some scaling with those: Cortex requires Consul/Docker; Thanos seems to be quite a complex beast (based on https://docs.google.com/drawings/d/e/2PACX-1vSJd32gPh8-MC5Ko0-P-v1KQ0Xnxa0qmsVXowtkwVGlczGfVW-Vd415Y6F129zvh3y0vHLBZcJeZEoz/pub?w=960&h=720) and we currently would not even have a compatible storage backend; Mimir seems “better” suited for us, as monolithic mode should be easy to deploy across nodes, but then we again need an object store, which we currently don’t have. Well, at least Ceph with its potential S3 storage is on our immediate road map.
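From what I can tell, Mimir in monolithic mode can at least be tried against a plain filesystem backend while the Ceph/S3 piece is still on the road map, though that only really suits a single node. A minimal, untested sketch (paths are made up):

```yaml
# mimir.yaml, started with: mimir -config.file=mimir.yaml -target=all
# Filesystem block storage is fine for a single-node trial; the backend
# would be switched to s3 once Ceph RGW is available.
common:
  storage:
    backend: filesystem
    filesystem:
      dir: /srv/mimir/data
blocks_storage:
  storage_prefix: blocks
```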

So the dire truth could be that this becomes too complex to handle in such a small team, and we either need to use one of the paid “cloud” options or look at other tools, which will probably present the same issues for us.

Hi @carsten-AEI, I run Prom/Graf all by my lonesome in a high performance computing environment, and have run into similar Prometheus query performance and storage issues. Sounds like we’re in similar boats. Another product you might want to look at is VictoriaMetrics (VM). It has both a Community and an Enterprise edition; they look to be the same in features/functionality, but Enterprise adds different support features depending on the tier.

I’m looking at both VM and Grafana’s Mimir (community) on a lab node at the moment. We run upwards of 10 exporters on ~2,200 compute, scheduler, storage, Lustre, OOD and head nodes in a single datacenter (which exporters run depends on the role of each node, but all run node exporter). Prom runs on a dedicated node, Graf on another. We retain 2y of data. I’m shy of going with a cloud/managed solution due to the potential of skyrocketing costs.

Thank you, Greg, for the pointer. We will have a look at it and see if it fits our bill (and report our findings back here, unless I forget about this thread, which does have quite a high probability :frowning: )