Is this alerting architecture crazy?

I’m looking for a way to get strong alert delivery guarantees, and to find out quickly whenever an alert cannot be delivered.

Unless I’m mistaken, AlertManager only offers best-effort delivery. What puzzles me, though, is that I’ve not found anyone else talking about this, so I worry I’m missing something obvious. Am I?

Assuming I’m not mistaken, I’ve been thinking of building a system with the architecture shown below.

Basically, rather than having AlertManager try to push to destinations, I’d have an AlertRouter that polls AlertManager. On each polling cycle the steps would be (ignoring any optimisations):

  • All active alerts are fetched from AlertManager.
  • The last known set of active alerts is read from the Alert Event Store.
  • The set of active alerts is compared with the last known state.
  • New alerts are added to an “active” partition in the Alert Event Store.
  • Resolved alerts are removed from the “active” partition and added to a “resolved” partition.
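
To make the comparison step concrete, here is a minimal sketch of one polling cycle. It assumes each fetched alert is a dict carrying a unique "fingerprint" key (the AlertManager v2 API includes a fingerprint on each alert, but treat the exact shape as an assumption), and it uses a hypothetical in-memory AlertEventStore in place of a real durable store.

```python
from dataclasses import dataclass, field


@dataclass
class AlertEventStore:
    """Hypothetical in-memory stand-in for the Alert Event Store."""
    active: dict = field(default_factory=dict)    # fingerprint -> record
    resolved: dict = field(default_factory=dict)  # fingerprint -> record


def poll_cycle(store, fetched_alerts):
    """Reconcile the store with the alerts fetched in this cycle.

    fetched_alerts is assumed to be a list of dicts, each with a
    "fingerprint" key that identifies the alert across cycles.
    Returns the sets of newly active and newly resolved fingerprints.
    """
    current = {alert["fingerprint"]: alert for alert in fetched_alerts}

    # Set difference on the key views gives the two deltas directly.
    new_keys = current.keys() - store.active.keys()
    resolved_keys = store.active.keys() - current.keys()

    for key in new_keys:
        store.active[key] = {"alert": current[key], "delivered": False}

    for key in resolved_keys:
        record = store.active.pop(key)
        # The resolution is a separate notification, so its flag starts unset.
        store.resolved[key] = {"alert": record["alert"], "delivered": False}

    return new_keys, resolved_keys
```

A real store would need to apply both changes atomically, and would have to decide what happens when an alert fires again while its earlier resolution is still undelivered; the sketch ignores both.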

A secondary process within AlertRouter would:

  • Check for alerts in the “active” partition which do not have a state of “delivered = true”.
  • Attempt to send each of these alerts and, on success, set the “delivered” flag.
  • Check for alerts in the “resolved” partition which do not have a state of “delivered = true”.
  • Attempt to send each of these resolved alerts and, on success, set the “delivered” flag.
  • Move all alerts in the “resolved” partition where “delivered = true” to a “completed” partition.
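
The secondary process could look roughly like the sketch below. The three partitions are plain dicts here, and send is a hypothetical callable that raises an exception when delivery fails; neither reflects a real library API.

```python
def delivery_pass(active, resolved, completed, send):
    """Run one pass of the delivery process.

    active, resolved, completed: dicts mapping a fingerprint to a record
    of the form {"alert": ..., "delivered": bool}.
    send: hypothetical callable send(alert, status) that raises on failure.
    """
    for status, partition in (("firing", active), ("resolved", resolved)):
        for record in partition.values():
            if record["delivered"]:
                continue
            try:
                send(record["alert"], status)
            except Exception:
                # Leave the flag unset so the next pass retries this alert.
                continue
            # Only mark as delivered once the send has succeeded.
            record["delivered"] = True

    # Collect keys first so the dict is not mutated while iterating over it.
    done = [key for key, record in resolved.items() if record["delivered"]]
    for key in done:
        completed[key] = resolved.pop(key)
```

Because the flag is written only after a successful send, a crash between the send and the write would cause the alert to be sent again on the next pass. In other words, this gives at-least-once rather than exactly-once delivery, so destinations would need to tolerate duplicates.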

Among other metrics, the AlertRouter would emit one called “undelivered_alert_lowest_timestamp_in_seconds”, which I could use to alert myself whenever any alert has not been delivered quickly enough. Since the alert is still held in the Alert Event Store, I should be able to fix whatever is blocking delivery without losing the alert.
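
As a rough illustration of how that metric could be derived, the sketch below scans the partitions for the oldest undelivered record. It assumes each record stores a hypothetical "first_seen" Unix timestamp set when the record entered its partition; the function names are also made up for this example.

```python
import time


def oldest_undelivered_timestamp(partitions):
    """Return the smallest "first_seen" among undelivered records, or None.

    partitions: iterable of dicts mapping a fingerprint to a record of the
    form {"first_seen": float, "delivered": bool} (hypothetical layout).
    """
    stamps = [
        record["first_seen"]
        for partition in partitions
        for record in partition.values()
        if not record["delivered"]
    ]
    return min(stamps) if stamps else None


def undelivered_age_seconds(partitions, now=None):
    """Return how long the oldest undelivered alert has been waiting."""
    oldest = oldest_undelivered_timestamp(partitions)
    if oldest is None:
        return 0.0  # nothing is pending
    if now is None:
        now = time.time()
    return now - oldest
```

Exporting the raw timestamp as a gauge would let an alerting rule compare it against the current time, but the exporter then has to pick a value for the case where nothing is pending. Exporting the age instead avoids that question, at the cost of the value only being as fresh as the last update.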

I think there are other benefits to this architecture too. For example, because the AlertRouter pulls alerts at its own pace, much as Prometheus does when scraping, natural back-pressure is a property of the system.

Anyway, as mentioned, I’ve not found anyone else doing something like this, which makes me wonder if there’s a very good reason not to. If anyone knows that this design is crazy, I’d love to hear!

Thanks