Zombie Monolith

A familiar conversation from five different jobs. The backstory involves a monolith at a fast-growing startup. Success! The business wildly outgrows the monolith. The drama builds as the costs of disentangling critical business processes from the monolith are deferred again and again. Everyone in the business emotionally divests from the monolith and the costs continue to be deferred. Some people leave and crucial context is lost. Never seen how that story actually ends.

We took this question to former coworkers from a job where there had been some success both in whittling services out of the monolith and in improving the parts that remained within.

The darkest moment was after the entire UI team left for different internal projects. Engineers can feel when an organization abandons a mess of code—whether or not there is a formal decision about it. There's a limit to how long people are willing to "take one for the team". This group departure did serve as a wake-up call.

One VP marshaled a volunteer rescue squad as an intervention. Decisions were made to staff new teams to migrate, decommission, and sustain parts of the monolith. About five different teams took on different parts of the problem: two newly assembled from existing staff, the existing database team, and others (an authentication team, for example) who had already moved some of their services out of the monolith but were still entangled with it.

What follows are lightly paraphrased and anonymized quotes from former colleagues.

One thing that causes huge problems is that once the decision to break apart the monolith happens, everyone seems to forget that the monolith should be maintained and treated as a first-class citizen of an SOA. It’s very hard to justify removing everything from a monolith and decommissioning it completely, so rather than seeing it as a replacement effort, it should be thought of as a rebalancing effort. The critical services that need more resources, a different architecture, etc. than the monolith can offer are the good first choices for extraction.

The entire UI team left for different internal and external jobs including engineers, managers, and product leadership. It was a hard time!

I created the Rescue Squad right before I left and didn't see it through to the end. It was clearly a band-aid for a bad situation.

Frontline tech support humans were told on multiple occasions that the UI bugs they filed wouldn't be fixed, and no one provided them with alternative solutions (other than one-off firefighting).

My understanding is that there have been dozens of iterations of effort over many years in the process of supporting & breaking apart the monolith:

—moving timeslice logic out

—introduction of an authentication gateway to begin splitting traffic among services

—early work to support React UIs

—introducing GraphQL as a way to incrementally transition API implementations away from the monolith.

—The Continuity Team were more SRE and infrastructure oriented: improved CI, dockerized in prod, & generally shepherded the codebase.

—2-3 teams doing bug, feature, and factoring work on their parts of the codebase

(I'm pretty sure the authentication team still owned significant amounts of code & were actively working to refactor their logic out of it)

—DB team doing non-trivial work to scale & maintain the DBs.

—Eventually the Rescue Squad as a short-term solution to close at least a few bugs after the previous team dissolved
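As an aside, the gateway item in the list above (splitting traffic among services while the monolith keeps serving everything else) can be sketched as a small routing table. This is only an illustration of the general idea, not the code that company ran: the path prefixes, the service names, and the `pick_backend` helper are all hypothetical.

```python
from dataclasses import dataclass

# Name of the default backend. Anything not explicitly extracted stays here.
MONOLITH = "monolith"


@dataclass(frozen=True)
class Route:
    prefix: str   # URL path prefix owned by an extracted service
    service: str  # hypothetical name of the service that now owns it


# Hypothetical routes; a real list would grow one entry per extraction.
EXTRACTED_ROUTES = [
    Route("/auth/", "auth-service"),
    Route("/timeslices/", "timeslice-service"),
]


def pick_backend(path, routes=EXTRACTED_ROUTES):
    """Return the service that should handle `path`.

    The longest matching prefix wins, and the monolith is the fallback,
    so it remains a first-class backend rather than an afterthought.
    """
    matches = [r for r in routes if path.startswith(r.prefix)]
    if not matches:
        return MONOLITH
    return max(matches, key=lambda r: len(r.prefix)).service


if __name__ == "__main__":
    for p in ("/auth/login", "/timeslices/2019-12", "/reports/42"):
        print(p, "->", pick_backend(p))
```

The point of the fallback is that each extraction is a one-line change to the table, and nothing forces the last route to ever leave the monolith.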

We ended up sunsetting the Rescue Squad after spinning up the new Core team. We got a lot of great stuff done, including EOLing crusty code and features, fixing performance issues, and shoring up the product. Great team I was proud to be a part of!

There were also two other teams at the end of 2019: one to eliminate unused code and a second focused on porting features to the new UI.