Monitoring

Monitoring is an integral part of running services in production. Without it, we are blind to what's going on, and thus unable to act according to our best interest.

Providing visibility is at the center of monitoring, but speaking more broadly, monitoring allows us to:

easily debug our systems
see how a change impacts our system
implement Alerting when something is broken or will break soon
perform long-term trend analysis

In essence, monitoring is the collection of events and their contexts (e.g., an HTTP request is an event, and its full context holds all details about it: URL, body, status code…). Ideally, we'd have all events together with their whole contexts. In reality, that's a lot of data, so we need to divide monitoring into four categories based on how we collect the data:

Profiling: a type of [[Monitoring]] where we collect a lot of data for a short period of time, which we can use to debug an issue.
Tracing: a type of [[Monitoring]] which sacrifices the number of events it looks at to give us a picture of how a system behaves.
Logging: a form of [[Monitoring]] which looks at a limited set of events (e.g. all HTTP requests, all DB queries) and records part of their contexts.
Metric Monitoring
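To make the trade-off between events and context more concrete, here is a minimal sketch of how the same events could be reduced by logging and by metric monitoring. The `HttpEvent` type, its fields, and the helper names are hypothetical and exist only for this illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class HttpEvent:
    """One event with its full context (hypothetical fields)."""
    url: str
    body: str
    status_code: int
    duration_ms: float


def to_log_line(event: HttpEvent) -> str:
    # Logging keeps every event of this kind but only part of its context:
    # the request body is dropped here.
    return f"{event.status_code} {event.url} {event.duration_ms:.0f}ms"


def to_metrics(events: list[HttpEvent]) -> Counter:
    # Metrics drop per-event context and keep only aggregate counts.
    return Counter(event.status_code for event in events)


events = [
    HttpEvent("/users", "{}", 200, 12.0),
    HttpEvent("/users", "{}", 500, 40.0),
    HttpEvent("/orders", "{}", 200, 8.0),
]
log_lines = [to_log_line(e) for e in events]
request_counts = to_metrics(events)
```

The log lines still describe individual requests, while the counter can only answer questions about the system as a whole.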

Metric Monitoring

Metric Monitoring (often referred to as just monitoring) is a type of Monitoring which gives up on event context to provide event information over time. It's a system that focuses on overall system health and behavior - not on individual events.

Although metrics rely on having little context to function (too much context == too big metrics), we sometimes want to add some context to our metrics (e.g. the path of the HTTP request). We need to be careful, because each path our application has would then count as another metric.

Tracking user emails would be a bad idea as it has unbounded cardinality: each email would create a new number for us to track (not to mention that emails are personally identifiable information, which has no place in metrics).

As a general rule of thumb, no process should ever track more than 10,000 distinct numbers.
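The cardinality concern can be sketched by counting the distinct label combinations a process would have to track. This is a minimal illustration that is not tied to any particular metrics library; the `series_count` helper and the sample label values are assumptions.

```python
MAX_SERIES_PER_PROCESS = 10_000  # rule of thumb from the note


def series_count(label_sets):
    """Count the distinct numbers (series) that one metric would produce."""
    return len({tuple(sorted(labels.items())) for labels in label_sets})


# Bounded: a few paths times a few status codes.
by_path = [
    {"path": path, "status": status}
    for path in ("/users", "/orders", "/health")
    for status in ("200", "404", "500")
]

# Unbounded: every new user adds another series (hypothetical addresses).
by_email = [{"email": f"user{i}@example.com"} for i in range(50_000)]

bounded = series_count(by_path)
unbounded = series_count(by_email)
over_budget = unbounded > MAX_SERIES_PER_PROCESS
```

The path label stays small because the set of paths is fixed by the application, whereas the email label grows with the number of users.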

When debugging an issue, Metrics can show you in which system the slowdown is, while Logging can help you pinpoint where in that system it's occurring.

Efficient metric monitoring systems are best achieved with heavy use of White Box Monitoring with a bit of Black Box Monitoring. White Box Monitoring is when we monitor the internal workings of our system (for example, users have no idea about our current CPU utilization, so that metric is a White Box metric), while Black Box Monitoring is when we look at our system from the perspective of our users, without knowing anything about its internal state.

It's very important for efficient monitoring systems to be able to tell what (Symptom Based Monitoring) from why (Cause Based Monitoring), as this will have a large impact on how we actually use the metrics we collect. A metric is symptom based if it shows an actual symptom that is making our users happy or sad. A cause based metric, such as CPU utilization or free disk space, points us to a cause of an existing issue, but doesn't imply that the issue exists in the first place.

The Four Golden Signals of Monitoring are a good place to start figuring out what to have on your service dashboards. The four signals are:

[[Measuring Request Latency]]
[[Measuring Traffic]]
[[Measuring Error Rate]]
[[Measuring Service Saturation]]
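As a rough illustration of what such a dashboard could be computed from, the sketch below derives the four signals from one window of request records. The `Request` type, the 60-second window, the nearest-rank 95th percentile, the choice of 5xx status codes as errors, and the use of in-flight requests over a capacity limit as the saturation measure are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status_code: int


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize one time window of requests into the four signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = max(1, (95 * len(durations) + 99) // 100)
    errors = sum(1 for r in requests if r.status_code >= 500)
    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": in_flight / capacity,
    }


# Hypothetical window: 20 requests, two of which failed.
requests = [Request(10.0 * i, 500 if i in (5, 15) else 200) for i in range(1, 21)]
signals = golden_signals(requests, window_s=60, in_flight=30, capacity=100)
```

In practice these numbers would come from a metrics system rather than from raw request records; the point is only to show what each signal measures.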

Don't shy away from recording "the same metric" in different places – see Where to Collect Metrics. Different layers of infrastructure and application expose the same metrics. For example, your database reports the query duration, and so does your application.

Status: #💡

References:

Book - Site Reliability Engineering (Source)
Video - Practices for Creating Effective Customer SLOs (Source: InfoQ, Stop Talking & Listen; Practices for Creating Effective Customer SLOs)
