Cause Based Monitoring

Monitoring is an integral part of running services in production. Without it, we are blind to what's going on, and thus unable to act in our best interest.

Providing visibility is in the...

Cause Based Monitoring points us to the cause of an existing issue, but does not imply that an issue exists in the first place. Some examples of Cause Metrics are:

  • CPU utilization
  • Free disk space

When users are seeing slow response times, I want to be able to easily tell that it's because our DB server is running close to 100% CPU utilization – this should be the primary use of Cause Based Monitoring.
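As a rough illustration of that diagnostic use, the sketch below takes a snapshot of cause metrics gathered after a symptom (slow responses) has already been noticed, and lists the ones past their thresholds. Everything here is hypothetical: the `CauseMetric` type, the host names, the metric names, and the threshold values are assumptions made for the example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CauseMetric:
    host: str        # hypothetical host label, e.g. "db-1"
    name: str        # hypothetical metric name
    value: float
    threshold: float  # assumed limit; choose values that fit your system
    higher_is_worse: bool = True  # False for metrics such as free disk space

    def saturation(self) -> float:
        """Ratio describing how close the metric is to its threshold (>= 1.0 means breached)."""
        if self.higher_is_worse:
            return self.value / self.threshold
        # For "lower is worse" metrics, a smaller value means higher saturation.
        return self.threshold / self.value if self.value > 0 else float("inf")


def likely_causes(metrics: list[CauseMetric]) -> list[CauseMetric]:
    """Return the breached cause metrics, most saturated first."""
    breached = [m for m in metrics if m.saturation() >= 1.0]
    return sorted(breached, key=lambda m: m.saturation(), reverse=True)


# Snapshot taken while users are reporting slow response times.
snapshot = [
    CauseMetric("db-1", "cpu_utilization_percent", value=98.0, threshold=90.0),
    CauseMetric("web-1", "cpu_utilization_percent", value=35.0, threshold=90.0),
    CauseMetric("db-1", "free_disk_gb", value=120.0, threshold=20.0, higher_is_worse=False),
]

for metric in likely_causes(snapshot):
    print(f"{metric.host}: {metric.name} = {metric.value}")
```

With this snapshot only the DB server's CPU is reported, which matches the scenario above: the symptom tells us something is wrong, and the cause metric tells us where to look.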

On the other hand, Cause Metrics are only rarely useful in Alerting. See [[What should i be Alerting on]]:
When setting up [[Alerting]] for the first time, many people instinctively set alerts on [[Cause Based Monitoring]] metrics – if my service's CPU is ramped to 100%, of course I want to be alerted!
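One possible way to express the alternative to that instinct is a routing policy in which only user-visible symptoms page someone, while breached cause metrics are merely recorded for later diagnosis. This is a minimal sketch under assumptions: the `Action` values, the `route` function, and the two metric kinds are hypothetical names chosen for illustration, and the policy itself is one reading of the note, not a rule it prescribes.

```python
from enum import Enum


class Action(Enum):
    PAGE = "page"          # notify a human right away
    ANNOTATE = "annotate"  # record on a dashboard for diagnosis
    IGNORE = "ignore"      # nothing breached, nothing to do


def route(kind: str, breached: bool) -> Action:
    """Decide what to do with the result of a metric check.

    kind is "symptom" (user-visible, such as slow responses)
    or "cause" (such as CPU at 100%).
    """
    if not breached:
        return Action.IGNORE
    if kind == "symptom":
        return Action.PAGE
    if kind == "cause":
        return Action.ANNOTATE
    raise ValueError(f"unknown metric kind: {kind!r}")


print(route("cause", breached=True).value)
print(route("symptom", breached=True).value)
```

Under this policy a service at 100% CPU with healthy response times wakes nobody up, but the breach is still visible when a symptom alert later needs explaining.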

Status: #🌱


  • Video - Practices for Creating Effective Customer SLOs

    Source: InfoQ: Stop Talking & Listen; Practices for Creating Effective Customer SLOs

    Status: #🛈/📹/✅

    SRE Workbook chapter 3 has case studies on implementing SLOs.
    [[Cause Based Monitor...