What should I be Alerting on

When setting up Alerting for the first time, many people instinctively set alerts on Cause Based Monitoring metrics such as CPU utilization or free disk space, which point to a possible cause of an issue but don't imply that the issue exists in the first place. If my service's CPU is ramped to 100%, of course I want to be alerted!

If someone asks "why is that so?", the simple answer is that such high CPU utilization results in poor service performance. This argument becomes much less convincing once you realize that:

  • It's possible for your service to be slow while your CPU utilization is very low.
  • It's possible for your service to be fast while your CPU utilization is very high.

Looking at these two statements, we can reasonably conclude that users don't actually care about our CPU utilization; they care about how the service behaves for them.
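To make the two statements above concrete, here is a minimal sketch that evaluates a cause-based rule and a symptom-based rule against two made-up observations. The `Snapshot` type, both thresholds, and all the numbers are assumptions chosen for illustration, not values from any real system.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One hypothetical observation of a service (all values are made up)."""
    name: str
    cpu_utilization: float  # fraction of CPU in use, 0.0 to 1.0
    p95_latency_ms: float   # 95th percentile request latency


# Both thresholds are assumptions picked for the example.
CPU_THRESHOLD = 0.90
LATENCY_THRESHOLD_MS = 500.0


def cpu_alert(s: Snapshot) -> bool:
    """Cause-based rule: fires on high CPU, whatever users experience."""
    return s.cpu_utilization >= CPU_THRESHOLD


def latency_alert(s: Snapshot) -> bool:
    """Symptom-based rule: fires when users actually see slow responses."""
    return s.p95_latency_ms >= LATENCY_THRESHOLD_MS


snapshots = [
    Snapshot("slow, idle CPU", cpu_utilization=0.15, p95_latency_ms=2400.0),
    Snapshot("fast, busy CPU", cpu_utilization=0.98, p95_latency_ms=80.0),
]

for s in snapshots:
    print(f"{s.name}: cpu_alert={cpu_alert(s)}, latency_alert={latency_alert(s)}")
```

In the first snapshot the CPU rule stays silent while users suffer; in the second it fires even though nobody is affected. Only the latency rule tracks what users experience in both cases.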

As good SLIs revolve around user experience, it's much better to look at Symptom Based Monitoring when thinking about what metrics to set up Alerting on. SLIs (Service Level Indicators) are quantitative measures of the provided level of service, often aggregated into rates, averages, or percentiles; common ones include availability, error rate, and latency, throughput. A metric is Symptom based if it shows an actual symptom that is making our users happy or sad.
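As a rough illustration of alerting on symptoms, the sketch below derives two common SLIs (error rate and a latency percentile) from a window of request records and alerts when either crosses a limit. The `Request` type, the choice of "status 500 and above counts as an error", the nearest-rank percentile, and the default limits are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """A hypothetical request record; field names are assumptions."""
    status: int        # HTTP-style status code
    latency_ms: float  # time taken to serve the request


def error_rate(requests: list[Request]) -> float:
    """Fraction of requests that failed (status 500 and above)."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return errors / len(requests)


def latency_percentile(requests: list[Request], percentile: int) -> float:
    """Latency percentile using the nearest-rank method."""
    if not requests:
        return 0.0
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def should_alert(requests: list[Request],
                 max_error_rate: float = 0.01,
                 max_p95_ms: float = 300.0) -> bool:
    """Alert when either symptom SLI exceeds its (assumed) limit."""
    return (error_rate(requests) > max_error_rate
            or latency_percentile(requests, 95) > max_p95_ms)


healthy = [Request(200, 100.0) for _ in range(100)]
failing = [Request(200, 100.0) for _ in range(97)] + [Request(500, 100.0) for _ in range(3)]
slow = [Request(200, 100.0) for _ in range(90)] + [Request(200, 900.0) for _ in range(10)]

for name, window in [("healthy", healthy), ("failing", failing), ("slow", slow)]:
    print(name, should_alert(window))
```

Nothing here looks at CPU or disk: the alert depends only on what users would notice, namely failed or slow requests.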

Why not just measure the user experience directly, rather than try to look at all possible causes that can lead to issues?

In addition to this, there are also some Cause Based Monitoring metrics that are useful to get alerts on: for example, things that aren't an issue right now but are sure to become one within a short period of time. This allows us to act proactively and prevent the possible bad user experience. Make sure that the alerts you set up with Cause Based Monitoring don't overlap with your Symptom Based Monitoring alerts; if they do, the Cause Based alerts can be removed.
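One way to picture such a proactive cause-based alert is a disk that is filling up: nothing is broken yet, but a simple extrapolation says it soon will be. The sketch below is only an illustration under stated assumptions: the function names, the straight-line extrapolation from the first and last sample, and the 24-hour horizon are all hypothetical choices.

```python
from typing import Optional


def hours_until_full(samples: list[tuple[float, float]]) -> Optional[float]:
    """Estimate hours until free space reaches zero.

    `samples` is a time-ordered list of (hour, free_gb) pairs. The estimate
    is a straight line through the first and last sample, which is an
    assumption; real usage is rarely this regular.
    """
    (t0, free0), (t1, free1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive amount of time")
    consumed_per_hour = (free0 - free1) / (t1 - t0)
    if consumed_per_hour <= 0:
        return None  # free space is stable or growing, so no exhaustion is predicted
    return free1 / consumed_per_hour


def should_alert_on_disk(samples: list[tuple[float, float]],
                         horizon_hours: float = 24.0) -> bool:
    """Alert only if the disk is predicted to fill within the horizon."""
    remaining = hours_until_full(samples)
    return remaining is not None and remaining <= horizon_hours


slow_fill = [(0, 100.0), (10, 80.0)]  # 2 GB/hour, 80 GB left
fast_fill = [(0, 100.0), (10, 40.0)]  # 6 GB/hour, 40 GB left
stable = [(0, 50.0), (10, 50.0)]      # no consumption

for name, samples in [("slow_fill", slow_fill), ("fast_fill", fast_fill), ("stable", stable)]:
    print(name, hours_until_full(samples), should_alert_on_disk(samples))
```

The alert is tied to a prediction rather than to a raw threshold such as "less than 50% free", so it fires only when a bad user experience is actually approaching.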


Status: #💡

References:

  • Video - Practices for Creating Effective Customer SLOs (Source: InfoQ: Stop Talking & Listen; Practices for Creating Effective Customer SLOs)