Error Budgets

It's difficult for product and ops teams to find middle ground between investing in reliability and taking risks. If you test your software too much before releasing, you move too slowly and the market will swallow you; if you don't test enough, you end up with a system that is not reliable enough for clients to use.

Error budgets give us a way to make data-driven decisions on this spectrum without guesswork.

Here is how error budgets work:

  • we define how much of the time the service should be available, in the form of an SLO
  • we use Measuring Service Availability to figure out how far we are from breaching our SLO
  • the remaining allowed downtime is our error budget
  • as long as some error budget remains, new releases can be pushed
  • if the SLO is breached, only changes that improve availability can be released
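
The steps above can be sketched as a small release gate. This is only an illustration: the function names, the 30-day window, and the 99.9% SLO are assumptions made for the example, not part of any particular tool or policy.

```python
from fractions import Fraction


def allowed_downtime(slo: Fraction, window_minutes: int) -> Fraction:
    """Total downtime the SLO tolerates over the whole window, in minutes."""
    return (1 - slo) * window_minutes


def remaining_budget(slo: Fraction, window_minutes: int, downtime_minutes: int) -> Fraction:
    """Error budget left: allowed downtime minus downtime already spent."""
    return allowed_downtime(slo, window_minutes) - downtime_minutes


def release_policy(slo: Fraction, window_minutes: int, downtime_minutes: int) -> str:
    """Hypothetical gate: any release while budget remains, else reliability work only."""
    if remaining_budget(slo, window_minutes, downtime_minutes) > 0:
        return "any release"
    return "reliability work only"


# Assumed example values: a 99.9% SLO measured over a 30-day window.
SLO = Fraction(999, 1000)
WINDOW_MINUTES = 30 * 24 * 60

print(release_policy(SLO, WINDOW_MINUTES, downtime_minutes=20))
print(release_policy(SLO, WINDOW_MINUTES, downtime_minutes=50))
```

Exact fractions are used instead of floats so that a target such as 99.9% does not pick up rounding error when the budget is close to zero.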

As a concrete example, a Service Availability Target of two nines (99%) allows 21.6 hours of downtime per quarter, which is 1% of a 90-day (2,160-hour) quarter. If we have already been down for 10 hours this quarter, we have 11.6 hours of unavailability left to spend. Knowing this, we can make risk tradeoffs accordingly.
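
To check the arithmetic and see how quickly the budget shrinks with each extra nine, here is a minimal sketch. It assumes a 90-day quarter (the length implied by the 21.6-hour figure); `quarterly_budget_hours` is a hypothetical helper name.

```python
from fractions import Fraction

# Assumption: a quarter is 90 days long.
HOURS_PER_QUARTER = 90 * 24


def quarterly_budget_hours(nines: int) -> Fraction:
    """Allowed downtime per quarter for a target of `nines` nines (2 -> 99%)."""
    unavailability = Fraction(1, 10 ** nines)
    return unavailability * HOURS_PER_QUARTER


# Budget for a few common targets.
for nines in (2, 3, 4):
    print(f"{nines} nines: {float(quarterly_budget_hours(nines)):.3f} hours per quarter")

# The example from the text: two nines, 10 hours already spent.
budget = quarterly_budget_hours(2)
remaining = budget - 10
print(float(budget), float(remaining))  # 21.6 11.6
```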


Status: #💡

References: