Our First Kubernetes Outage

This is a public postmortem for a complete shutdown of our internal Kubernetes cluster.

# Impact

* Internal Kubernetes services unavailable
* Tiller unavailable (helm commands fail, no helm installs)
* Unable to create new pods
* Existing pods in the Pending phase became Unknown
* Flaky delete behavior

# Observations

This sequence of events was quite concerning. The normal activity of deploying our application could collapse the cluster and leave things in disarray. Terminating worker nodes is an undesirable solution because pods are not drained to other nodes, which would likely create unavailability for our customers. It works, but it cannot be standard procedure for this kind of outage. Beyond that, there were other important observations about our current setup:

* Node unavailability was not picked up by any monitoring. `kubectl get nodes` reported that all three worker nodes were not ready, but nothing was reported.
* We had no pods/phase metrics (i.e. the number of pods in Unknown or Pending).
* Deploying a small number of containers in parallel completely overloaded the cluster.
* Tiller pods failed to reschedule because of CPU limits. This is curious because there were no CPU request issues at the node level. This warrants further investigation.
* `helm init` runs tiller with a single replica. This is not HA. It's also uncertain what the HA story is with tiller.
* We had no node (CPU/Memory/Disk) metrics.
* We did not understand what the Kubelet checks test under the covers (e.g. what is "Disk Pressure"?).
* We had no way to throttle the number of newly created pods. This would not solve the problem, but it would mitigate risk in future scenarios. There's no need to DoS our own cluster in case something goes wrong.
* We had no CloudWatch metrics in DataDog for this AWS account.
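The visibility gaps above can be probed directly from the command line. A minimal sketch of the checks we were missing (this assumes `kubectl` access to a live cluster; it is an illustration, not our tooling):

```shell
# List nodes and their readiness; a NotReady status here went unalerted.
kubectl get nodes

# Inspect the conditions the Kubelet actually reports (e.g. MemoryPressure,
# DiskPressure, Ready) -- this is what "Disk Pressure" refers to.
kubectl describe nodes | grep -A 6 "Conditions:"

# Count pods stuck in a given phase -- the pods/phase metric we lacked.
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
```

Running these by hand is no substitute for alerting, but they show what the missing metrics would have surfaced.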

# Outcomes

I also spent time investigating the other observations.

* The DataDog agent does not yet support the full suite of kube-state-metrics (notably the pod/phase metrics). This is planned for a future release.
* We need to configure different allocation limits for different layers in Kubernetes.

There are two things that keep me up at night right now:

* HA tiller. If tiller enters a crash loop, it blocks our ability to deploy fixes (via chart installs/upgrades). HA must be discussed with the Helm team.
* Pod throttling. Our product comprises ~35 different containers deployed 4 different times. We can scale the cluster to handle this, but it's an unending battle. It would be nice to have something in this area.
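Until real throttling exists, a coarse stopgap is to stagger our own deploys instead of creating every pod at once. A hypothetical sketch, assuming manifests live in a `manifests/` directory (the directory name, batch size, and pause are made up for illustration):

```shell
# Apply manifests in small batches with a pause between them, so we do not
# flood the scheduler with all ~35 containers at once.
BATCH=5
count=0
for manifest in manifests/*.yaml; do
  kubectl apply -f "$manifest"
  count=$((count + 1))
  if [ $((count % BATCH)) -eq 0 ]; then
    sleep 30  # let the scheduler place this batch before sending more
  fi
done
```

This is client-side rate limiting only; it would not have prevented the outage, but it shrinks the blast radius of a bad deploy.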


This incident includes AWS-specific and Kubernetes-specific details, but it nevertheless suggests ways we should monitor and instrument our systems.