Mads Hartman's "Tracing a path to observability" is a chronicle of Glitch's efforts to gain visibility into its production systems, and to make them more reliable.
The article is a really nice outline of how different observability tools answer different kinds of questions, with a clear description of the trade-offs among the available options.
Although the article doesn't spell it out, notice how much the engineers must already know about the underlying system just to make use of their metrics and logs.
> The problem is we only have access to aggregated values such as the mean, median, and 95th percentile. Since the slowdown isn’t affecting every project, these aren’t very useful. The slow-to-start projects haven’t made a dent in the aggregates, and we can’t use the metric to narrow down which projects are affected.
This reads like commentary on the limits of aggregated data, but notice the mismatch between the metrics they started with and the questions they were asking.
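To make that mismatch concrete, here's a small simulation with entirely synthetic numbers (not Glitch's data; the project counts and timings are illustrative assumptions): when only a small minority of projects start slowly, the global mean and 95th percentile barely move, while a simple per-project breakdown surfaces the affected projects immediately.

```python
import random
import statistics

random.seed(42)

# Synthetic start durations: 98 healthy projects (~1s starts)
# and 2 pathologically slow ones (~8s starts), 50 starts each.
samples = []
for project in range(100):
    slow = project >= 98
    for _ in range(50):
        base = 8.0 if slow else 1.0
        samples.append((f"project-{project}", base + random.uniform(0, 0.5)))

durations = sorted(d for _, d in samples)
p95 = durations[int(len(durations) * 0.95)]
mean = statistics.mean(durations)
# Only 2% of samples are slow, so both aggregates still look healthy.
print(f"mean: {mean:.2f}s, p95: {p95:.2f}s")

# Grouping by project (the question they actually had) finds the culprits.
by_project = {}
for name, d in samples:
    by_project.setdefault(name, []).append(d)
worst = max(by_project, key=lambda n: statistics.median(by_project[n]))
print(f"slowest project: {worst}")
```

The aggregate p95 stays under two seconds even though two projects take eight; the metric wasn't wrong, it just couldn't answer "which projects?"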
Turning to logs for granularity…

> Currently, 65 percent of our log lines are for timing operations. As traffic increases, this is becoming prohibitively expensive.

Even just seeing the system involves trade-off decisions and judgement calls.
They turned to Honeycomb and distributed tracing to narrow their search. Even here…

> The tool couldn't tell us why—for that **we'd have to rely on our expertise** and knowledge of our systems, or pull out the source code and start digging.

_(emphasis is mine)_
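The idea that lets tracing narrow a search can be sketched in a few lines. This is not Glitch's actual instrumentation; the `span` helper, the `SPANS` list, and the `project_id` attribute are all hypothetical names. The point is that each operation records its own duration tagged with arbitrary attributes, so you can filter and group after the fact instead of being stuck with pre-aggregated metrics.

```python
import time
from contextlib import contextmanager

SPANS = []  # in a real system these would be exported to a tracing backend

@contextmanager
def span(name, **attributes):
    """Record how long a block took, tagged with arbitrary attributes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append({
            "name": name,
            "duration_s": time.perf_counter() - start,
            **attributes,
        })

# Hypothetical usage: tag every operation with the project it serves.
with span("project_start", project_id="project-123"):
    time.sleep(0.01)  # stand-in for the real work

# Later, slice by any attribute you recorded.
slow = [s for s in SPANS if s["duration_s"] > 0.005]
print(slow[0]["project_id"])
```

Note what the tool can and can't do here: it will tell you *which* spans were slow and *for whom*, but not *why*; that last step still comes from the engineers' knowledge of the system.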
In their conclusion we learn that, in addition to the hard trade-off decisions described earlier, there are unexpected pay-offs too: getting better in one focused area can support widespread improvement.
> The observability practices we’ve adopted to help debug slow project starts have had positive side effects. With richer telemetry and more sophisticated tooling, we’re quicker to identify problems and disprove hypotheses. We’ve greatly reduced the duration of incidents and rarely have to make code changes to get answers to our questions. This is where observability really shines: It enhances your ability to ask questions of your systems, including ones you hadn’t originally thought to ask.
Humans can apply their skills in diverse ways. Skills developed in one area become a resource for other dimensions of work.