Twilight of the Experts

Peter Alvaro argues that observability and fault injection are the only tractable future for addressing the resilience of distributed systems. Love this talk for concisely capturing a number of fundamental truths about distributed systems. We disagree with the broad conclusion despite very much admiring his reasoning. Twenty-four minutes of gold. youtube

_(We admire Alvaro's reasoning and argument. We nevertheless disagree with the premise. Every new automation becomes the next entry in our vast collection of distributed systems problems. Rather than replacing genius with automation, we seek practices to develop a network of effective learners. Network effects spreading mundane expertise can outperform the super geniuses in Alvaro's argument. See Designing for Resilience)_

# Distributed Systems & Partial Failure

Distributed systems require the cooperation of some number of independent physical machines.

YOUTUBE A-frep1y80s Peter Alvaro, Twilight of the Experts: The Future of Observability

If a single site fails, like a laptop or phone, we detect the failure immediately and replace a component or reboot. In a distributed system, some subset of those communicating entities may fail to uphold their function but the computation may nevertheless finish and produce an outcome.

"Hmm. Is this outcome right? Is it correct?" "Is it complete despite the fact that partial failures have occurred?"

Because partial failures do occur, **all distributed software ever written has been written to be fault tolerant**. That is, it has software that attempts to anticipate, detect, and ultimately mask or mitigate the partial failures so they don't affect our computational outcomes.

"Will my fault tolerant code actually tolerate faults that occur in practice?" "Have I tested it exhaustively?" (ahem, **you haven't**)

# Reassuring Myths

We've told ourselves bedtime stories so we could sleep at night knowing our shallowly tested code is running far away from us in production.

# The Expert

The oldest and most comforting myth.

Can't someone else do it? How many fault-tolerance challenges are there? Can't we delegate responsibility for solving these really hard problems to, you know, the greatest minds of our time? Can't we let the Lampsons and Liskovs and (Jim) Grays and the Lamports and the Chandys solve fault tolerance and then encapsulate those solutions as reusable components, you know, libraries?

Then mere mortals like us could just build stacks on those libraries?

This isn't so unreasonable of a thing to hope for. It worked for redundant arrays of inexpensive disks... take a bunch of unreliable components and have a genius turn them into the illusion of a reliable component. Happy-go-lucky programmers think they're interacting with logical volumes that just don't fail. Underneath the covers disks partially fail, but those failures can be completely masked. The controller can detect the loss of a volume and fail over to a mirror while preserving the illusion of disks that don't go down.

The programmer can build a stack of pancakes and we can keep moving up. This is what keeps us from having to be experts in the bits and bytes and the signals. This works great because the RAID controller, a hardware component in this case, can be a perfect failure detector.
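To make the analogy concrete, here is a toy sketch of our own (not from the talk): a mirrored volume can hide a dead disk precisely because a failed disk reliably announces itself, which amounts to a perfect failure detector.

```python
# Toy sketch (our construction, not from the talk): RAID-style masking
# works because a dead disk reliably announces itself, so the controller
# has a perfect failure detector and can quietly fall back to the mirror.

class Disk:
    def __init__(self):
        self.data = {}
        self.failed = False

    def read(self, block):
        if self.failed:
            raise IOError("disk failed")   # failure is always detectable here
        return self.data.get(block)

    def write(self, block, value):
        if self.failed:
            raise IOError("disk failed")
        self.data[block] = value


class MirroredVolume:
    """The logical volume the happy-go-lucky programmer sees."""

    def __init__(self):
        self.disks = [Disk(), Disk()]

    def write(self, block, value):
        for disk in self.disks:
            if not disk.failed:
                disk.write(block, value)

    def read(self, block):
        for disk in self.disks:
            try:
                return disk.read(block)    # mask the failure, use the mirror
            except IOError:
                continue
        raise IOError("all mirrors lost")
```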

In a distributed system there's no such thing as a perfect failure detector

This is fundamental. If there's one thing you learn from this talk, remember that there's no difference to me between a computer on the other side of the world that crashes and is never coming back and a computer on the other side of the world that just went into a lengthy GC pause.

**Failure is just the absence of a message in a distributed system**
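A minimal sketch of why, assuming a hypothetical `ping` callable standing in for any request to a remote node: all a distributed failure detector really has is a timeout, and a timeout cannot tell a crash from a long pause.

```python
import time

# Minimal sketch: the only tool a distributed failure detector has is a
# timeout, and a timeout cannot tell "crashed forever" from "just slow".
# `ping` is a hypothetical stand-in for any request to the remote node.

def suspect_failed(ping, timeout_seconds=1.0):
    start = time.monotonic()
    while time.monotonic() - start < timeout_seconds:
        if ping():                 # a message arrived, so the node is alive
            return False
        time.sleep(0.05)
    return True                    # no message: crashed? GC pause? partition?

crashed_forever = lambda: False    # the process is gone for good
long_gc_pause   = lambda: False    # the process will answer in 30 seconds

print(suspect_failed(crashed_forever))   # True
print(suspect_failed(long_gc_pause))     # True: same verdict, different reality
```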

We can't hide the complexity. We can't turn this substrate into a reliable component and build up. Our abstractions are gonna leak.

They're gonna do more than leak. They're gonna burn down.

Because of the leaky abstractions in distributed systems, fault tolerance goes all the way up the stack.

So much for that myth.

# Formal Verification

Maybe mere mortals like me could build these systems if I had some really good tools to check my work. That's how the model checker works. It takes a program that some idiot like me wrote and it just enumerates all the states that the program could ever be in and checks them all against my invariants. It says all of the states are good, this program can never get into a bad state, so you're cool.
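As a rough illustration (our toy example, not one from the talk), this is what enumerating every state and checking it against an invariant amounts to:

```python
from itertools import product

# Minimal sketch of the model-checking idea on a made-up toy system:
# two replicas each hold a counter in 0..3, and the invariant says the
# replicas never drift apart by more than 1. Enumerate every state the
# system could ever be in and check each one.

all_states = list(product(range(4), range(4)))
violations = [(a, b) for (a, b) in all_states if abs(a - b) > 1]

print(f"{len(all_states)} states checked, {len(violations)} bad states found")
# With N components of M states each, the space is M**N. That exponent is
# the state explosion that makes this hopeless for something like Netflix.
```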

It turns out this isn't gonna work, for the same reason that the experts didn't work: the elusiveness of compositionality for fault tolerance.

Model checkers, for example, do not scale well to large state spaces. They work great on hardware, not that well on software, and really terribly on distributed, concurrent software. You couldn't, for example, model check Netflix. There are too many states. The sun will be burnt out before you check them all.

The best hope would be what's called modular verification. Verify the pieces, the services, and then make sure that your guarantees compose as you compose the services. But as I already indicated this doesn't seem to work.

So we're not going to be able to have experts do the hard stuff. And we're not going to be able to solve it ourselves and have a tool check our work.

# Trend in Industry and Academia

In industry as well as academia in the last 10 years there's been a race from exhaustive, but intractable, approaches to software quality to incremental, pay-as-you-go, but kind of unprincipled approaches based on extending the best practices in integration testing with fault injection. We have moved from understanding a system in some formal way to running a system and doing stuff and seeing what happens. That's not as bad as it sounds because it represents a fundamental shift that we were gonna have to make sooner or later.

# Complexity Leads this Trend

A complex artifact only gets bigger. At some point it will be too big even for a supergenius, even for a Lampson or a Liskov, to hold it in their head. It will be too difficult for a model checker to enumerate all the states. In the limit, as our systems scale, they're going to become black boxes. Maybe you're not at that inflection point yet. But you will be.

If what you've got is a black box you're gonna have to give up on any approach to making it work that involves understanding what goes on inside it. In the small, the experts understood what went on inside it. The model checker understood what went on inside it. The tester does not. The tester does understand the inputs and outputs to the black box though.

So we begin to build hypotheses about what a black box is doing based on the behavior it exhibits as it transduces inputs into outputs. You give it some inputs and see what outputs it produces. You give it some inputs, inject some failures or delay, and see what outputs it produces. You're sort of walking around the black box and formulating a hypothesis about its behavior. What it does rather than what it is. This is the future. Because in the limit everything is a black box.

# Combinatorial Search Space

When I was at Netflix a couple years ago we were studying this application that involved about a hundred communicating services. Pretty large number. If we wanted to combine integration testing and fault injection in some exhaustive way we would say "okay, what we'll do is we'll run an integration test for every possible combination of faults."

If we were only looking at one fault that's fine because, you know, 100 executions. Say an integration test takes a minute. Ninety minutes to run all your tests. If you want to be a little ambitious, "what if we wanted to see if the service stood up if combinations of up to four services were crashed?" That's actually three million executions, or six years. If you want to go up to seven, that's 16 billion executions, or over a million years of running integration tests. If we want to test them all, that's a 1 with 30 zeros after it. That's heat-death-of-the-universe time just to test this one application out of dozens at Netflix.

We're gonna have to be smart about how we select our experiments if we're gonna do this testing plus fault injection thing.
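The arithmetic is easy to reproduce. A quick sketch, assuming 100 services and one serial integration run per combination of crashed services (the exact counts differ a little from the talk's rounded figures, but the explosion is the point):

```python
from math import comb

# Reproducing the back-of-the-envelope arithmetic: 100 services, one
# integration run per combination of crashed services.

services = 100
for crashed in (1, 4, 7):
    runs = comb(services, crashed)
    print(f"crash {crashed} services at a time: {runs:,} runs")

print(f"every possible combination: {2 ** services:.1e} runs")  # a 1 with 30 zeros
```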

# Emerging State of the Art

Alvaro paraphrases the current state of the art at video 11:10

Industry is bringing back experts to use domain knowledge of the system and inject the faults that are most likely to expose failures. They take a genius-guided walk through the combinatorial space of potential failures.

There are two giants in this space and they do the same thing: Chaos Engineering by Nora Jones and Jepsen testing by Kyle Kingsbury.

Observation of a distributed system followed by targeted fault injection. The supergenius has formulated a hypothesis about what set of failures in the system is the most likely to drive the system into a bad place.

The problem is there's only one Nora and there's only one Kyle. And they are already in high demand. Good luck hiring them. Do you really wanna build your system around scarce genius?

# Twilight of the Experts

It's really worth watching how Alvaro presents his case for replacing genius with automation starting at video 12:23

If you ignore minor distinctions, Nora and Kyle are in the same game. Each:

* Observes: read docs, observe executions
* Thinks: genius magic happens here
* Acts: introduce targeted faults that drive the system into a bad space very quickly

I don't know how the genius transforms observability into fault injection schedules. But we can look at the genius as a black box, examining their inputs and outputs closely, just the way they look at the inputs and outputs of the system under test.

Logical steps to replace genius with automation:

1. Let's call the Genius Magic a Mental Model.
2. It must be a Model of Fault Tolerance, because the model lets us skip the faults we know the system can tolerate.
3. It must be a Model of System Redundancy, because all fault tolerance involves redundancy.
4. We can build models of redundancy with Lineage-driven Fault Injection.

# Lineage-driven Fault Injection

Also very worth watching Alvaro explain the automation starting at video 17:10

It turns out Dapper-style call graphs are good enough.

1. trace successful outcomes of a distributed system
2. graph the paths through the system
3. enumerate all the unique paths
4. enumerate the ways to break those paths
5. pass the results to a SAT solver; the solutions are the interesting cuts that break the graph
6. inject faults where the solver points

strict digraph {
  rankdir=BT
  Client1 [label=Client]
  Client2 [label=Client]
  Bcast1
  Bcast2
  RepA [label="Stored on\nRepA"]
  RepB [label="Stored on\nRepB"]
  Stable [label="The write\nis stable"]
  Client1 -> Bcast1 -> { RepA RepB } -> Stable
  Client2 -> Bcast2 -> { RepA RepB } -> Stable
}

In his example, there are four successful paths from the clients to a stable write in storage.

# What would have to go wrong to fail?

(RepA OR Bcast1) AND (RepA OR Bcast2) AND (RepB OR Bcast2) AND (RepB OR Bcast1)

# SAT solver reduces this to

Hypothesis: {Bcast1, Bcast2}

Those are good targets for fault injection to see how the system responds to specific failures.
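A minimal sketch of steps 3 through 6 on this example, brute-forcing the cuts rather than calling a real SAT solver since there are only four paths; it recovers the {Bcast1, Bcast2} hypothesis along with the other minimal cut, {RepA, RepB}.

```python
from itertools import combinations

# Minimal sketch of steps 3 through 6 on the broadcast example, brute-forcing
# the cuts instead of calling a real SAT solver (fine for four paths).
# Node names mirror the diagram above.

paths = [                                  # successful paths from the traces
    {"Bcast1", "RepA"}, {"Bcast1", "RepB"},
    {"Bcast2", "RepA"}, {"Bcast2", "RepB"},
]
nodes = sorted(set().union(*paths))

def kills_every_path(faults):
    """The write fails only if every supporting path loses a node."""
    return all(path & faults for path in paths)

hypotheses = []
for size in range(1, len(nodes) + 1):
    for faults in map(set, combinations(nodes, size)):
        # keep only minimal fault sets; skip supersets of known hypotheses
        if kills_every_path(faults) and not any(h <= faults for h in hypotheses):
            hypotheses.append(faults)

print(hypotheses)   # the minimal cuts: {Bcast1, Bcast2} and {RepA, RepB}
```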

**The key enabling technology that makes this all possible is that finally observability infrastructure is starting to come of age.**

Let the distributed traces tell me what went right!

.

Fantastic collection of truths littered throughout this talk.

All distributed systems ever written were written to be fault tolerant.

All fault tolerance involves redundancy.

Partial failure is what makes distributed systems hard to build and all but impossible to reason about.

Failure in a distributed system is just the absence of a message.

There is no perfect failure detector.

A complex artifact only gets bigger.

Distributed systems are leaky abstractions. They do more than leak, they burn down.

You haven't tested it exhaustively.

It is a combinatorial search space. You must be smart about your experiments.

Observability is finally giving us a tool we can use to wrestle with these truths.

.

Joe Armstrong, of Erlang fame, explained a similar combinatorial search space at Strange Loop 2014: video 13:03

With El Dorado, Ward is leading New Relic along another path to search the combinatorial space of our large distributed systems.

Our system graphs emerge from many data sources. Especially noteworthy, the nodes in El Dorado include the people and teams and source code repos, not just the computers. Because there is a graph database under the hood, the relationships between people and computers become tangible. The System is more broad than a network of computers.

Developers, Operators, Managers, Writers, Communicators are also part of The System. John Allspaw's talk at Velocity 2017 has another look at mental models of complex systems. The SNAFUcatchers identify the same symptoms as Alvaro. They suggest learning deeply from incidents as the most efficient way to select the next experiments. See Human Performance in Systems.

I didn't know exactly what a "SAT solver" is. Google and Wikipedia to the rescue: See Boolean Satisfiability Problem wikipedia