SRE

Site Reliability Engineering is what happens when you treat operations as a software engineering problem. The goal of SRE is to make software reliable and scalable. SRE can be seen as a specific implementation of DevOps.

Some notes on this topic to check out:

  • Service Reliability
    Reliability of services is crucial in most applications, and everyone should aim for their services to be reliable. However, reliability can be very expensive, so we should know how to manage it pr...
  • Service Availability Target
    When deciding the level of availability we want for our services, the target that we want to achieve is often described as a percentage of time the service is available.

    It's worth noting that 100...
  • Error Budgets
    It's difficult for product and ops teams to find middle ground between investing in reliability vs taking risks. If you test your software too much before releasing, you are going too slow and the ...
  • Toil
    Toil is work that is manual, repetitive, and brings no real long-term value to the project. In addition, toil scales linearly with service growth, and can be solved by auto...
  • SLA / SLO
    Service Level Objectives are values (or ranges of values) in which SLIs are allowed to be. For example, if the SLI is request latency, the SLO could be that request latency should be less than 100m...
  • SLI
    Service Level Indicators are quantitative measures of the provided level of service, often aggregated into rates, averages, percentiles.

    Common SLIs are availability, error rate, latency, throughput, ...
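Two of the ideas above lend themselves to a quick calculation: an availability target expressed as a percentage of time, and an SLO expressed as a bound on an SLI. The sketch below is only an illustration, not a standard API. The function names `allowed_downtime` and `meets_latency_slo`, the 30-day window, the 200 ms threshold, and the 75% target are all assumptions made up for this example.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Downtime permitted within `period` for an availability target given in percent."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # The unavailable share of the period is whatever the target leaves over.
    return period * (1 - availability_pct / 100)


def meets_latency_slo(
    latencies_ms: list[float], threshold_ms: float, target_fraction: float
) -> bool:
    """Check a latency SLO against measured request latencies.

    SLI (assumed here): fraction of requests faster than `threshold_ms`.
    SLO (assumed here): that fraction is at least `target_fraction`.
    """
    if not latencies_ms:
        raise ValueError("need at least one latency measurement")
    fast = sum(1 for value in latencies_ms if value < threshold_ms)
    return fast / len(latencies_ms) >= target_fraction


if __name__ == "__main__":
    # Hypothetical target: 99.9% availability over a 30-day window.
    print(allowed_downtime(99.9, timedelta(days=30)))

    # Hypothetical measurements checked against "75% of requests under 200 ms".
    print(meets_latency_slo([120.0, 180.0, 250.0, 90.0], 200.0, 0.75))
```

Tightening the availability target shrinks the allowed downtime quickly, which is one way to see why each extra "nine" gets more expensive.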

Status: #💡 Tags: #🗺️
