We consistently underestimate complexity. Yet somehow we also keep things running. Reliability is an emergent property of the entire system. It resides in the relationships between things. It remains profoundly difficult.
An insight from How Complex Systems Fail:
> Catastrophe requires multiple failures – single point failures are not enough.
>
> The array of defenses works. System operations are generally successful. Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure.
—How Complex Systems Fail #3
A tiny review of connected graphs. We've drawn a completely connected graph of seven nodes, each labeled Team. The same mathematics applies to Services, Objects, Classes, Functions, or any combination.
Nodes = 7
Edges = (N * (N - 1))/2 = 21
Connected graphs = it's complicated = 853
_See number of connected graphs with n nodes in the Online Encyclopedia of Integer Sequences (OEIS)._
```dot
graph {
  layout=circo
  node [label=Team]
  1 -- { 2 3 4 5 6 7 }
  2 -- { 3 4 5 6 7 }
  3 -- { 4 5 6 7 }
  4 -- { 5 6 7 }
  5 -- { 6 7 }
  6 -- { 7 }
}
```
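To make those numbers concrete, here is a minimal Python sketch (the complete_edges helper is mine, not part of the original page) that reproduces the edge count of the complete graph above and shows the raw space of graphs it is drawn from; the 853 quoted above is the far smaller OEIS count of distinct connected graphs on 7 nodes.

```python
from math import comb

# Edges in a completely connected graph: one edge for every pair of nodes.
def complete_edges(nodes: int) -> int:
    return comb(nodes, 2)  # same as N * (N - 1) / 2

print(complete_edges(7))       # 21

# Each of those 21 edges can be present or absent, so 2**21 = 2,097,152
# labeled graphs can be drawn over these 7 nodes. Collapsing them down to
# distinct connected shapes (the OEIS count above) leaves 853.
print(2 ** complete_edges(7))  # 2097152
```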
Let us turn the dial up to 50.
Nodes = 50
Edges = 1,225
Connected graphs ≈ 1.9e304
_N.B. There are 1e78 to 1e82 atoms in the known universe_
_See the OEIS table for N = 1 to N = 75._
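Scaling the same sketch to 50 nodes shows why the comparison to atoms in the universe is apt; 2**1225 will not even fit in a double-precision float, so the sketch just counts its digits.

```python
from math import comb

edges = comb(50, 2)        # 1225 possible connections among 50 nodes
possible = 2 ** edges      # every edge either present or absent

# 2**1225 overflows a float (max ~1.8e308), so report its size by digit count:
# a 369-digit number of labeled graphs. The connected-graph count quoted above
# is smaller because it ignores labeling and disconnected graphs.
print(edges)               # 1225
print(len(str(possible)))  # 369
```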
Two quick points explain how we cope with such complexity. First, our actual systems are not completely connected; the sparse graph of real connections collapses the numbers dramatically. Second, our human brains are more complex than the systems we create with them: we have around 1e11 neurons with 1e15 connections (clearly our brains are also not completely connected graphs of neurons).
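As a rough illustration of the first point, assume each of 50 teams works closely with only about four others (a made-up figure for this sketch, not a measurement): the relationship count collapses from 1,225 to around 100.

```python
teams = 50

# Completely connected: every team has a working relationship with every other.
complete_relationships = teams * (teams - 1) // 2       # 1225

# Sparse (assumed): each team works closely with about 4 other teams.
# Every relationship is shared by two teams, so divide the total by two.
neighbors_per_team = 4
sparse_relationships = teams * neighbors_per_team // 2  # 100

print(complete_relationships, sparse_relationships)     # 1225 100
```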
We leave behind the staggering numbers from graph theory to zoom in on a model of a couple of teams in a software company. We can use this diagram to describe many of the challenges of software design. Each team maintains a small graph of their own services or components, which they understand to some degree. But outside of their own domain there are many unknown or misunderstood connections.
In general terms, software companies staff themselves with many more software engineers than reliability engineers. The software engineers tend to know their own systems relatively well—the things inside the circles in this diagram. Reliability engineers tend to know more about how things are connected or they know ways to navigate these vast unknowns—the grey clouds in the diagram.
As a company grows, all of the numbers grow accordingly. The number of product teams grows. The number of customers grows. The number of services grows. The number of projects grows. The number of requests handled by the web site grows. The storage consumed by the data grows. The customer service organization grows. The sales and marketing teams grow.
The complexity between all these things grows much, much faster.
Another dimension of this challenge is that the software industry doesn't really know how to teach these relationship skills. Mostly, we throw people in the deep end with no preparation.
> Failure free operations require experience with failure.
>
> Recognizing hazard and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure. More robust system performance is likely to arise in systems where operators can discern the “edge of the envelope”. This is where system performance begins to deteriorate, becomes difficult to predict, or cannot be readily recovered.
—How Complex Systems Fail #18
The story continues with this example from Netflix Vizceral.
Netflix Vizceral Regional View (30s): https://www.youtube.com/watch?v=MYHf_BXWuOc
See also Group Communication