Velocity San Jose 2019

"Pay no attention to that man behind the curtain. The Great Oz has spoken." We are The Wizard of Oz: we show the world the magic of automation, yet there are people behind the curtain wiggling the levers of these complex systems. This Velocity conference shows we have internalized some of the lessons from Velocity NYC 2017 when David Woods and Richard Cook presented the STELLA report: the humans above the line of representation matter; we are starting to pay attention to relationships that cross that line.

Chen Goldberg (Google) keynote "Scaling Teams with Technology (or is it the other way around?)"

This talk picked up four topics typical of distributed systems (extensible, consistent, integrated, and open source) and elaborated on their impact on the human teams in the system. The topic names come from below the line; the content of the talk was all above it.

Jessica Kerr (Atomist) keynote "From Puzzles to Products". Our job is not to write software. It is to change software. Not just the code, change the system. @jessitron blog

The hardest obstacles to change are both above the line and below the line (e.g., when customer service reports reach into our database tables, suddenly even internal parts are frozen in fear of breaking systems more important than ours).

We are not designing static code. We are designing change. Where to next and how will the whole system get there?

Data migrations, deprecation, feature flags, documentation, versioning: how will we bring adjacent systems along and get them to use the new features, to upgrade authentication?
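One of those mechanisms made concrete: a minimal feature-flag sketch in Python. The flag name, rollout fraction, and auth functions here are hypothetical illustrations, not anything from the talk; the point is that the new path ships dark and rolls out gradually while adjacent systems catch up.

```python
import random

# Hypothetical in-memory flag store; a real system would back this with
# a flag service or a config database.
FLAGS = {"new_auth_flow": 0.10}  # serve the new path to ~10% of requests

def flag_enabled(name: str) -> bool:
    """True if this request falls inside the flag's rollout fraction."""
    return random.random() < FLAGS.get(name, 0.0)

def legacy_authenticate(user: str) -> bool:
    return True  # stand-in for the existing auth path

def new_authenticate(user: str) -> bool:
    return True  # stand-in for the upgraded auth path

def authenticate(user: str) -> bool:
    if flag_enabled("new_auth_flow"):
        return new_authenticate(user)   # the change we are designing
    return legacy_authenticate(user)    # safe fallback during the migration
```

Turning the rollout fraction up (or back down) is a one-line change, which is the point: the change itself is designed to be deployable and reversible.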

All this beautiful complexity is a system in movement.

Since change is our job, the deploy pipeline is crucial. Designing change starts with a code push.

Growing a product is hard like mud: dealing with people, ambiguity, and probabilities. It is messy, entangled, political, context-dependent, constantly changing, and resource-constrained.

K8S is the platform on which you can build your custom platform.

We have the opportunity to change complex systems at a faster pace than humanity has ever known.

Yaniv Aknin (Google Cloud) keynote "The SRE I Aspire To Be"

Measurably optimize the trade-off of reliability vs. innovation. SLOs seem promising, e.g. MTBF/MTTR. But what's a "failure" in a system that's always microscopically failing? How do you measure a mean when failures come in unique shapes and sizes? Nines of uptime seem great, but what if the 0.1% of failing queries are all your write requests?

Nines are just one axis of the problem. The other is alignment with customer needs and happiness.
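To make the blind spot concrete, here's a minimal sketch in Python over an invented toy request log (the data and the read/write split are my assumptions, not from the talk): overall availability clears three nines while every single write fails.

```python
# Toy request log of (method, succeeded) pairs: invented data in which
# 10 out of 10,000 requests fail, and all 10 failures are writes.
requests = [("GET", True)] * 9990 + [("POST", False)] * 10

def availability(reqs):
    """Fraction of requests that succeeded."""
    return sum(ok for _, ok in reqs) / len(reqs)

writes = [r for r in requests if r[0] == "POST"]
print(f"overall availability: {availability(requests):.3%}")  # 99.900% -- three nines
print(f"write availability:   {availability(writes):.3%}")    # 0.000% -- every write failed
```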

Obvious: monitoring, alerting, capacity planning, CI/CD & rollouts, load balancing.

Less obvious: system architecture, distributed algorithms, networking, OS.

Least obvious: PM, data science, business sense, a nose for UX.

Everett Harper (Truss) keynote "Infrastructure First: Solving Complexity Needs More than Tech"

There's an automated soap dispenser that recognizes pale skin and fails to recognize dark skin. In a soap dispenser it's annoying, or a funny commentary on bias. What if that behavior shows up in an autonomous truck, and a dark-skinned person caught in its path raises their hands to draw attention so the truck will stop? Include people at the table who look different.

Create organizational infrastructure to help people learn their blind spots. No one can see the whole picture. We need structures for people to take the vulnerable first steps of bravery: feeling small and at risk, and acting anyway. The speaker adopted salary transparency at their company. No one left. No one was outraged. In fact, someone advocated for a peer's promotion. Does that happen at your org?

When thinking about infrastructure, design space for vulnerability and bravery.

In "Your Team as a Distributed System", Andrew Harvey took the distributed system as a metaphor to talk about managing technical teams. He took up 4 properties and 8 fallacies of distributed systems, and even the CAP theorem. With all of these below-the-line ideas he drew convincing lessons about team relationships above the line.

Learning from Incidents

Move Fast and Learn From Incidents. Ryan Kitchens, Lorin Hochstein, and Nora Jones led a workshop that gave participants experience role-playing through investigations of a couple of incidents based on true stories. I participated as a teaching assistant to help attendees steer away from the traps of conventional incident analysis. Some wonderful things came out of this workshop. The most important general idea I took away from the experience is to really give up on quantitative analysis of incidents; qualitative research yields more insight.

In "How do things go right?" Ryan Kitchens drew our attention to many dimensions of learning. Five Why's and Root Cause have the effect of narrowing our view of an incident. We stop asking questions when we reach any answer that feels satisfying. There are so many more things to learn if we follow the other branches from the incident. When you chose this path, what were the trade-offs under consideration for the other paths at that time? What made this one look better? "How did you know to look there?"

Instead of "Root Cause" talk about "Contributing Factors": our failures are "perfect storms" not "one stupid (or careless, or reckless, or rookie) mistake". Look for mitigators: how did we keep the surprise from being much worse than it was? Identify the sources of resilience so we can make them even better. In the report, include a section of Open Questions that are unanswered by the investigation.

SLOs have blind spots too. How hard are people working when the system is healthy? If there's no violation, we probably aren't looking at the toil associated with keeping that thing running.

Beth Adele Long's talk "Having the Bubble" dug into what experts do. The biggest surprise for me in this talk was understanding that expertise is social. Some experts are well calibrated with parts of the system below the line. Others know broadly where the spicy parts are but, more importantly, know who to call to get the deep knowledge for specialized places below the line. To keep the system running, we need both kinds of experts, and we need good tools and healthy relationships in order to coordinate their different expertise. Also notable (though something I already understood well): expertise is embodied in multiple, overlapping neural networks and chemical and hormonal communication systems; we gotta pay attention to nutrition, sleep deprivation, ... to keep systems running.

Laura Maguire's talk "Lowering Costs of Coordination" took several deep dives into the challenges of having communications move across the line of representation. Human-to-human, human-to-machine-to-human, human-to-machine, and machine-to-machine coordination are all expensive in different ways. Human operators, in addition to all the other things they're juggling in an incident, need to anticipate the bots' capabilities and limitations when interacting with them. Complexity, tempo, and consequences grow continually through an incident. Compare the coordination visible in a picture from Apollo 13 with a typical after-hours incident where teams are distributed (even normally co-located teams), having been paged at home. How much harder is it to signal coordination?

"Everything is a Little Broken; Illusion of Control" notes from a talk given by Heidi Waterhouse at Velocity San Jose in June 2019.

The nature of complexity and the nature of humanity mean we will sometimes fail. Our job is to include safety nets. The talk gave many suggestions for building software systems with the expectation that things will fail.

One-cause disaster isn't a thing. A real disaster requires teamwork.

What measures can we take to reduce the scope of catastrophes when failure happens?

Circuit breakers, isolation, control points, rapid rollback, rapid deploy, layered access, authorized people making changes, abstractions managing change.
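Of those measures, the circuit breaker is the most algorithmic, so here is a minimal sketch in Python; the class shape, thresholds, and cool-off period are illustrative assumptions, not from the talk. After repeated failures it fails fast instead of hammering a broken dependency, then permits one trial call after a cool-off.

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency; allow a trial call after a cool-off."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None         # half-open: permit one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # open (or re-open) the circuit
            raise
        self.failures = 0                 # success closes the circuit
        return result
```

Wrapping calls to a flaky downstream service in breaker.call(service_fn) turns a cascading outage into fast, bounded failures, exactly the scope reduction the question above asks for.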