How to have different alert thresholds for different targets?

I have a few hosts that need different alert thresholds than the others. Take the following alert rule:

    - alert: HostUnusualDiskWriteRate
      expr: 'sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50'
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Host unusual disk write rate (instance {{ $labels.instance }})
        description: "Disk is probably writing too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

For a few hosts, I'd like to simply change the threshold in the expression from > 50 to > 500.

What's the best practice for accomplishing this? Group the targets somehow and write a separate alert with a different value? Set the value as a variable in the target definition and reference that variable in the alert expression? Keep a single alert rule with more complex logic, like a case statement for different groups? I'm looking for advice and examples.

Grouping the targets seems like it would quickly become unwieldy as you add more thresholds for different alerts on different hosts. I'm coming from another open-source monitoring solution where you'd apply a default template to targets and only use a different value if one was defined in the host config. I guess I'm still looking for the same sort of thing, unless there's a better way.

Thanks!

Did you ever figure this out?

You can use the `or` operator like this:

    sum by (instance) (rate(node_disk_written_bytes_total{instance=~"regex for few hosts"}[2m])) / 1024 / 1024 > 500
    or
    sum by (instance) (rate(node_disk_written_bytes_total{instance!~"regex for few hosts"}[2m])) / 1024 / 1024 > 50
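Dropped into the alert rule from the question, the `or` approach might look like the sketch below. The regex `host1:9100|host2:9100` is hypothetical; replace it with the `instance` values of your high-threshold hosts. PromQL regex matchers are fully anchored, so the pattern has to match the entire `instance` value. The description also no longer hard-codes 50 MB/s, since the threshold now depends on the host.

```yaml
- alert: HostUnusualDiskWriteRate
  # Hosts matching the (hypothetical) regex alert above 500 MB/s;
  # every other host alerts above the default of 50 MB/s.
  expr: |
    sum by (instance) (rate(node_disk_written_bytes_total{instance=~"host1:9100|host2:9100"}[2m])) / 1024 / 1024 > 500
    or
    sum by (instance) (rate(node_disk_written_bytes_total{instance!~"host1:9100|host2:9100"}[2m])) / 1024 / 1024 > 50
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host unusual disk write rate (instance {{ $labels.instance }})
    description: "Disk is probably writing too much data\n  VALUE = {{ $value }} MB/s\n  LABELS = {{ $labels }}"
```

The downside is that the regex has to be kept in sync in both halves of the expression, and each extra threshold tier adds another `or` branch.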

Alternatively, you can define custom thresholds as their own metrics using recording rules:

    groups:
    - name: Disk thresholds
      rules:
      - record: unusual_disk_threshold
        labels:
          instance: "host"
        expr: 500

Then use this PromQL expression:

    sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50 unless on (instance) unusual_disk_threshold
    or
    sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > on (instance) unusual_disk_threshold
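Wired into the alert rule from the question, the expression might look like the sketch below. It assumes the `unusual_disk_threshold` recording rule is loaded into the same Prometheus server. Because comparison operators bind more tightly than `unless`, and `unless` binds more tightly than `or`, the expression reads as "(default check, minus hosts that have an override) or (override check)".

```yaml
- alert: HostUnusualDiskWriteRate
  # Default branch: alert above 50 MB/s, but drop any instance that has
  # its own unusual_disk_threshold series.
  # Override branch: compare those instances against their own threshold.
  expr: |
    sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
      unless on (instance) unusual_disk_threshold
    or
    sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024
      > on (instance) unusual_disk_threshold
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host unusual disk write rate (instance {{ $labels.instance }})
    description: "Disk is probably writing too much data\n  VALUE = {{ $value }} MB/s\n  LABELS = {{ $labels }}"
```

The description drops the hard-coded "> 50 MB/s", because hosts with an override are alerted on a different value.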

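This way, any host with an `unusual_disk_threshold` series is compared against its own threshold, while every other host falls back to the default of 50 MB/s.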
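For example, giving two hosts their own thresholds just means one recording rule per host in the same group. This is a sketch: the `instance` values and the 200 MB/s figure are hypothetical, and each `instance` value must match the target's `instance` label exactly (often `host:port`), otherwise the `on (instance)` matching finds nothing and the host silently keeps the default.

```yaml
groups:
- name: Disk thresholds
  rules:
  # One rule per host that needs a non-default threshold, in MB/s.
  # Hypothetical instance values; use the exact instance label of each target.
  - record: unusual_disk_threshold
    labels:
      instance: "host1:9100"
    expr: 500
  - record: unusual_disk_threshold
    labels:
      instance: "host2:9100"
    expr: 200
```

Each host should have at most one `unusual_disk_threshold` series, since `> on (instance)` does one-to-one matching and fails if an instance appears twice on the right-hand side.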