Approaching Overload

Approaching Overload: Diagnosis and Response to Anomalies in Complex and Automated Production Software Systems. Marisa Bigelow, in her master's thesis, analyzed four incidents, applying process tracing and the above-the-line/below-the-line framework to understand how modern systems development teams diagnose and respond to surprises. The work provides evidence for findings about difficulties with observability in automation and responses to saturation. (ResearchGate)

This page is a Forage.

The responders, at times, had to integrate and reason through multiple and sometimes ambiguous anomalous signals crossing the line to discover deeper layers of the network being affected. At the same time, they were corroborating and updating each other through the communication channels on emerging evidence and changing hypotheses. The narratives trace the evolving collective mindset and distributed actions of the engineers in resolving a variety of unexpected abnormalities, based in the language of their domain. (p.24)

Components, such as routers, servers, and load balancers, all have finite resources and responses to the threat of saturation. Although these boundaries are elastic and ambiguous, the units do change their behavior nearing and crossing these boundaries. The diversity of instantiations and responses requires a minimally equivalent diversity of actions above-the-line to handle changing dynamics. While the network and components are built with a variety of limiters and deflectors safeguarding functionality, changes in external demands and interconnections will inevitably test the capabilities beyond their competences. The complexity will be invisible to people above the line until the automation is unable to keep pace with the world. (p.48)

The chat logs capture the behaviors above the line in light of unexpected anomalies and patterns crossing from below the line. The engineers’ responses inform understanding of the underlying structures as the risk of saturation propagates throughout the network. (p.49)

One method of visualization was used to connect signals crossing the line of representation to the developing hypotheses of the responders. The collective hypothesis space above the line was created from the shared hypotheses in the chat logs and is relative to the line of certainty, an ambiguous zone separating tentative plans from ideas that were acted upon. Additional representations further support the case explanations, particularly for details below the line. The cases are re-represented to focus on the evolving mindsets of the people in response to the various anomalies and consequences of overload in the system. (pp.49-50)

The dual problem areas of the monitoring and memcache servers are not obviously linked; however, in this case, their interaction was a critical sign to responders of their incomplete models. (p.50)

Additionally, the structure of the code change distribution created a strange loop in that the time of a code deploy is usually marked on the charts, which are generated by the monitoring server. Since the server was overloaded, it could not display the timestamp for the change, making it invisible to the responders. (p.51)

The initial person was not directly involved in much of the discussion because he was working to revert the change. Another software development engineer was monitoring across several channels and connected the seemingly disparate issues. The other engineers added support to the hypothesis that the monitoring server and memcache were interacting in an unforeseen way. After the code was reverted, the servers slowly dealt with the massive number of requests and started closing connections, which allowed the engineers to access them and assess the situation better. The two groups reconciled their separate hypotheses into one unifying explanation. (pp.51-52)

Awesome visualizations. Might be more accessible with wiki graph and hypermedia support.

Figure 7 on p.54

Figure 8 on p.58

Figure 11 on p.64

Figure 12 on p.68

The discussion highlights the divide between the main application layer and the network layer. Most engineers operate on the application level dealing with the servers directly. A select few engineers have access to the load balancer structure, which is typically left to manage distributing network traffic on its own as a third party product. Not only is it an opaque operating unit, it also has a difficult language to use in directing its actions. Both facets hindered the engineers’ ability to understand where the problems originated. The monitoring tools were not displaying data in the usual logs at the application level and the initial responders did not have access to network or load balancer logs. The lack of expected signals hinted at a problem on a different level, forcing them to broaden the scope of their investigation and recruit other roles and expertise. (p.70)

"I am all over how to check the health of machines behind the vip, but I am not sure how to check the vip itself."

Table 3. Chat logs demonstrating observed effects at a distance (p.71)

Another consequence of complex, tangled systems is the strange loop phenomenon, in which a monitored process depends on its own operation. ... Given its function of gathering data from other components, the monitoring server cannot be isolated from the system. This interdependency became problematic when a change to it prevented another component from continuing its functioning. The strange loop appeared when responders were trying to determine the start of the anomalies. They were reliant upon the monitoring tools marking the time of a pushed code change. However, as seen in Table 4, the server creating the graphs could not accurately update the displays after the change finished implementing when it was under extreme load. (p.72)

"so now we can't rely on our graphs showing pushes?" "not if we break graphite in the same push, no"

Table 4. Chat logs demonstrating an observed strange loop (p.73)
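A toy sketch (mine, not from the thesis) of why the strange loop is so disorienting: the deploy marker the responders rely on is written by the same server the deploy overloaded, so under saturation the marker never appears on the graphs.

```python
class MonitoringServer:
    """Hypothetical monitoring server illustrating the strange loop:
    it records deploy markers for the charts, but drops them silently
    when it is itself saturated by the deploy being marked."""

    def __init__(self):
        self.overloaded = False
        self.deploy_markers = []  # what the responders see on the graphs

    def record_deploy(self, tag):
        if self.overloaded:
            return  # marker silently dropped: the change is invisible above the line
        self.deploy_markers.append(tag)

mon = MonitoringServer()
mon.record_deploy("release-1")  # a normal push: marker visible on the charts
mon.overloaded = True           # the push that breaks graphite...
mon.record_deploy("release-2")  # ...leaves no trace for the responders
```

The release tags are invented for illustration; the point is only that the evidence channel and the fault share a component.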

A key step to maintaining common ground and supporting coordination is to update involved parties and align mental models. New information can radically change a person’s mindset about the situation and highlight relevant gaps that must be filled. ... The first few people to respond in Case B had incomplete mental models of the underlying system, which required updating from other perspectives. The three individuals communicating in the logs in Table 5 were all conversing in the same Slack channel, but had a variety of skill sets and experiences. Different roles have access to different data and monitoring tools, which is useful for providing diverse perspectives to incidents with signals surfaced across a variety of sensors. ... The feedback loop explaining the full circumstances was an active learning cycle for all involved during the anomaly response process. Updating everyone’s mental models supported coordination and expanding their experiences so that they could use the information in recognizing and diagnosing future anomalies.(pp.74-75)

The ABL framework shifted the narrative of the corpus to focusing on the interactions of the engineers and the automation in responding to overload. The collective hypothesis charting provides a supporting frame of reference connecting the evolving theories of the responders above the line with the representations emerging across the line throughout each incident. The vignettes highlight moments in the chat logs, which demonstrate behaviors above the line with model updating; below the line with effects at a distance; and strange loops crossing the line of representation. The re-represented cases act as evidence for findings about difficulties with observability in automation and responses to saturation. (p.76)

Focusing on how the systems managed the risk of saturation or overload, Bigelow found challenges to observability to be a recurring obstacle that hindered the response.

Similar patterns of responses to overload happen above and below the line.

A series of compensations (e.g., shed load, reduce thoroughness, add resources, or time-shift workloads) is undertaken, proves to be inadequate, and results in a sudden collapse of capability.

Below-the-line components have active fault tolerance measures built in to mitigate the likelihood of overload. However, parts of the system will saturate and reach the limits of their performance. (p.78)
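The four compensations and the eventual collapse can be sketched in code. This is my own illustrative model, not the thesis's; the capacity numbers and thresholds are invented.

```python
class Component:
    """Hypothetical below-the-line component that cycles through the four
    compensations (shed load, reduce thoroughness, add resources,
    time-shift workloads) before suddenly collapsing."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.thorough = True   # flips to False when thoroughness is sacrificed
        self.deferred = []     # time-shifted backlog, waiting for a quiet period
        self.collapsed = False

    def handle(self, load):
        """Serve one tick of requests; return how many were actually served."""
        if self.collapsed:
            return 0
        if load <= self.capacity:
            return load
        served = self.capacity                      # 1. shed the excess load
        self.thorough = False                       # 2. reduce thoroughness
        if self.capacity < 150:                     # 3. add resources, to a hard limit
            self.capacity = min(150, int(self.capacity * 1.2))
        self.deferred.append(load - served)         # 4. time-shift the overflow
        if sum(self.deferred) > 2 * self.capacity:  # backlog overwhelms the unit:
            self.collapsed = True                   # sudden collapse of capability
        return served
```

Run it against sustained overload and the compensations each buy some time, then the deferred backlog tips the unit into collapse, which is the pattern the cases describe.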

Figure 13. Responses to overload in Case A (p.81)

Figure 14. Responses to overload in Case B (p.84)

Figure 15. Responses to overload in Case C (p.87)

Figure 16. Responses to overload in Case D (p.90)

A fundamental constraint with software systems is that every affordance and recorded measure must be purposefully built by a human. The designer may be displaced in time and space from the actual implementation and use of the technology, which can make the appropriateness of design decisions stale to current conditions. (p.96)

The engineers in the cases demonstrated a vital skill of interpreting data and providing context to ambiguous signals. Most warning signs were for singular state changes or threshold alarms, which do not provide much insight in isolation. The underlying automation is very opaque, especially when performing highly autonomous functions such as distributing network traffic in a load balancer. (p.97)

The anomaly response process is complicated by the interacting effects hidden below the observable line of representation. Limited measurable signals, masking, and strange loops restrict the human responders’ abilities to accurately calibrate and update their mental models of the system and to take appropriate corrective actions. (p.98)

Above the line, time also affected the scope of investigation. Recent changes were prioritized as likely contributors to the current issue, even when evidence supported other explanations. It was much harder for the people to trace changes disjointed in time, such as a week prior or long-term choices that left latent effects waiting to be activated by specific circumstances. (pp.98-99)

Table 8. Corpus patterns for observability (p.99)

Diverse perspectives expanded the hypotheses considered and beneficially broadened the scope of investigation. Explicit comments by engineers updating each other's mental models were frequent in the chat logs and support "Woods' Theorem" on the importance of finding and filling gaps in understanding. (p.100)

The unique experiences, skill sets, and roles of the individual responders contributed to resolving the complex challenges from beneath the line of representation. (p.101)

Core findings • Despite safeguards, overload occurs, propagates, and is hard to see. • Mental models have gaps and are updated during anomaly response. • Network complexity produces effects at a distance and weak representations hinder diagnostic search. • ABL was an effective framework for analyzing anomaly response. (p.101)

The distribution resource of the network pathways in Case B overutilized its capacity and did not have a single point of failure as in the other cases. The interdependent network was adapting to prevent full system breakdown or decompensation. The engineers played a vital role in this adaptation, even with degraded monitoring. (p.102)

Appendix A. Coding Schema (for messages in the timeline)

Anomaly (intermittent|persistent), (expected|unexpected)
Mental Model (info update|prior events)
Hypothesis (generation|rule out|revisit)
Actions (diagnostic|therapeutic|anticipated effects/risk)
Question (clarification|puzzlement)

(pp.113-114)
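The schema is simple enough to make machine-readable, which the plugin ideas below would need. A minimal sketch, assuming the categories above; the code identifiers are my own, not Bigelow's.

```python
from dataclasses import dataclass

# Appendix A coding schema as a lookup table: top-level code -> allowed subtypes.
SCHEMA = {
    "anomaly":      {"intermittent", "persistent", "expected", "unexpected"},
    "mental model": {"info update", "prior events"},
    "hypothesis":   {"generation", "rule out", "revisit"},
    "actions":      {"diagnostic", "therapeutic", "anticipated effects/risk"},
    "question":     {"clarification", "puzzlement"},
}

@dataclass
class CodedMessage:
    """One chat-log message tagged with a schema code and subtype."""
    timestamp: str
    text: str
    code: str     # top-level category, e.g. "hypothesis"
    subtype: str  # e.g. "generation"

    def __post_init__(self):
        # reject tags that are not in the schema
        if self.subtype not in SCHEMA.get(self.code, set()):
            raise ValueError(f"{self.subtype!r} is not a valid {self.code!r} subtype")

# hypothetical example message, invented for illustration
msg = CodedMessage("14:02", "maybe memcache is hammering graphite?",
                   "hypothesis", "generation")
```

Validating at construction time keeps a coded corpus consistent with the schema as annotators work.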


Collective Mindset <- what a great phrase

Distributed Actions <- what a great phrase

Observability below-the-line is a key weakness. This vacuum will draw a lot of investment in the next few years.

Correlating distant, weak, and ambiguous signals is especially difficult. We can already see growing investment to apply machine learning to mitigate this challenge. I am skeptical that machines will be more effective than humans at this, despite their ability to sift massive numbers of signals.

Ideas for wiki plugins or transporters

import message history from Slack into a story with paragraphs and other wiki items.
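A minimal sketch of such a transporter, assuming Slack's channel-export format (a JSON array of objects with "ts", "user", "text") and the standard federated-wiki story-item shape ({type, id, text}); the id scheme is invented.

```python
import json

def slack_to_story(export_text):
    """Turn a Slack channel export (JSON array of messages) into a list of
    federated-wiki story items, one paragraph per message, in time order."""
    messages = json.loads(export_text)
    story = []
    # Slack timestamps are decimal strings; sort numerically to get chat order
    for i, m in enumerate(sorted(messages, key=lambda m: float(m["ts"]))):
        story.append({
            "type": "paragraph",
            "id": f"slack{i:04d}",  # invented id scheme for illustration
            "text": f"{m.get('user', '?')}: {m['text']}",
        })
    return story
```

From there an annotator could fork the page and replace individual paragraphs with richer wiki items without losing the original timeline.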

create a page in a wiki whose owner is nrrdbot. Incident annalists can fork the page into an incident-specific wiki for annotation and elaboration.

enable annalists to mark paragraphs or other items according to the coding schema in Appendix A; the four responses to overload; moments that cross the line of representation and the direction of crossing; and groupings around effects at a distance, strange loops, and model updating.

responses to overload: shed load, reduce thoroughness, add resources, or time-shift workloads

flow of adaptations above and below the line including signals and actions that crossed the line

The overload analysis we practice for components below the line can also be applied above the line.