Use case discussion: IT monitoring, bare metal and network devices

Hello Prometheus Community,

I am currently working on a project to monitor the infrastructure of 450 school sites. Our environment consists mainly of switches, NAS devices, and Windows servers running Hyper-V virtual machines. We are exploring the possibility of using Prometheus for this purpose and would greatly appreciate any insights or experiences you might have.

Here are some specific questions we have:

  1. Scalability: Can Prometheus effectively handle monitoring 450 sites in 5 regions with potentially thousands of devices and metrics? What are the best practices for scaling Prometheus in such a distributed environment? Should we use remote write, Prometheus in agent mode, or a mix of different solutions?
  2. Data Collection: What are the recommended exporters for collecting metrics from switches, NAS, and Hyper-V environments? Are there any particular challenges or considerations when using these exporters? I have looked at the SNMP exporter, the Windows exporter, and the remote IPMI exporter. Is there anything else I should look at?
  3. Alerting: How well does Prometheus handle alerting for such a large number of sites? Can it provide real-time alerts with minimal delay? What are some strategies to avoid alert fatigue in a large-scale deployment? Is Alertmanager the only component I need to look at? The infrastructure is not critical, but we are looking for purposeful alerting, ideally integrated with our ticketing platform.
  4. Storage: What storage solutions work best with Prometheus for retaining large volumes of time-series data? How do you manage long-term storage and retrieval of historical data? Is long-term storage even needed?
  5. Network Traffic: How does Prometheus manage network traffic in a wide-area network (WAN) setup? Are there any optimizations or configurations needed to minimize the impact on network performance?
  6. Redundancy and Failover: What are the best practices for ensuring high availability and failover in a Prometheus setup monitoring multiple remote sites?
  7. Integration: How well does Prometheus integrate with other monitoring tools and dashboards? Are there any complementary tools that you recommend for a comprehensive monitoring solution?
  8. Real-world Experiences: If you have experience monitoring similar environments with Prometheus, could you share your setup, challenges faced, and how you addressed them? Any success stories or lessons learned would be incredibly valuable.

I am eager to hear your thoughts and experiences to determine if Prometheus is the right fit for our needs. Thank you in advance for your assistance!

Best regards,