Kubernetes on AWS Quick Start: High Availability and multi-AZ guidance


The AWS Quick Start launches a cluster in a single Availability Zone with a single master node, appropriate for non-critical use.

To build a more production-ready configuration, users should consider:

  • running Kubernetes in a High Availability mode (out of scope)
  • running across multiple Availability Zones

Before jumping into solutions, let’s look at what actually happens if the master node goes down, and why that (probably) isn’t as bad as it sounds.

Mean Time To Recovery (what happens if the master goes down)

For many users, a highly available Kubernetes control plane is more than they need. Heptio’s position is that the cost and complexity of making the master node highly available aren’t always justified.

During normal cluster operations, the control plane is needed mainly when something changes: a node fails, or you deploy new applications and services. So, if the master node goes down, the cluster should continue to “cruise” in its most recent state; that is, applications that are already running keep running.

In a cloud environment, features like EC2 Auto Recovery can bring a singleton master back automatically after a hardware failure. The control plane and Kubernetes API will be unavailable while the recovery happens, but the workload should continue to run during that time.
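
As an illustration of how EC2 Auto Recovery is typically wired up, the sketch below builds a CloudWatch alarm on the instance's system status check, with the EC2 "recover" action attached. It is a minimal sketch, not necessarily how the Quick Start stack configures recovery: the alarm name, the instance ID and region in the usage note, and the period and evaluation settings are assumptions chosen for the example, and the optional helper that submits the alarm requires the third-party boto3 package and AWS credentials.

```python
def recovery_alarm_params(instance_id, region):
    """Build keyword arguments for CloudWatch's PutMetricAlarm call.

    The alarm fires when the system status check fails for two consecutive
    one-minute periods, then triggers the EC2 recover action.
    """
    return {
        "AlarmName": f"auto-recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 60,             # seconds; assumed value for illustration
        "EvaluationPeriods": 2,   # assumed value for illustration
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


def create_recovery_alarm(instance_id, region):
    """Submit the alarm to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported here so the builder above runs without it

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**recovery_alarm_params(instance_id, region))
```

For example, `create_recovery_alarm("i-0123456789abcdef0", "us-west-2")` would register the alarm for a (made-up) master instance ID. Not every instance type or configuration supports the recover action, so check the EC2 documentation before relying on it.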

In this situation, the metric that matters is Mean Time To Recovery (MTTR). With EC2 Auto Recovery, the MTTR should be fast enough that running a single Kubernetes master is often acceptable.

(This would not apply to a non-cloud environment, where the MTTR would be based on human intervention and would be much longer.)
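
To make the MTTR argument concrete, the sketch below uses the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR), to compare control plane downtime under automatic and manual recovery. All of the numbers are assumptions chosen for illustration (one master failure every six months, 15 minutes to recover automatically, 8 hours to recover by hand); they are not measurements of EC2 Auto Recovery.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours, mttr_hours):
    """Fraction of time the control plane is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(mtbf_hours, mttr_hours):
    """Expected control plane downtime per year, in hours."""
    return (1 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


if __name__ == "__main__":
    mtbf = HOURS_PER_YEAR / 2  # assumed: one master failure every six months
    scenarios = {
        "automatic recovery (assumed 15 min MTTR)": 0.25,
        "manual recovery (assumed 8 h MTTR)": 8.0,
    }
    for label, mttr in scenarios.items():
        print(f"{label}: availability {availability(mtbf, mttr):.5f}, "
              f"about {downtime_hours_per_year(mtbf, mttr):.1f} h of downtime per year")
```

Keep in mind that this downtime applies to the control plane only; as described above, workloads that are already running should keep running while the master recovers.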

With that caveat, let’s look at some high availability solutions for Kubernetes.

High Availability mode

Kubernetes does support a High Availability mode. This is out of scope for this guide, and is not currently supported by the Kubernetes configuration tool (kubeadm) chosen for the AWS Quick Start stack.

To run Kubernetes in HA mode, both the backing database for Kubernetes (etcd) and the Kubernetes control plane components must be run in an HA mode. HA mode ensures that the control plane continues to run even in the case of instance failure. This can be more complicated to configure and increases resource costs.

Multiple Availability Zones

AWS’s best practice is to spread applications across multiple Availability Zones.

It is possible, but not our recommended solution, to run a single Kubernetes cluster that spans multiple AZs. This should be paired with HA and span an odd number of AZs (at least 3), so that etcd can still form a majority (quorum) when one zone is lost. If a single zone fails, the rest of the cluster should continue to run.
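
The reason for an odd number of zones comes from etcd's majority quorum: a cluster of n members needs floor(n/2) + 1 of them to stay available. The sketch below works through that arithmetic. The function names are made up for this example, and the layouts in the usage note assume etcd members are spread as evenly as possible across zones.

```python
def quorum(members):
    """Smallest number of etcd members that forms a majority."""
    return members // 2 + 1


def failures_tolerated(members):
    """How many members can be lost while a majority remains."""
    return members - quorum(members)


def survives_zone_loss(members_per_zone):
    """True if the cluster keeps quorum after losing any single zone.

    members_per_zone lists the number of etcd members placed in each zone.
    """
    total = sum(members_per_zone)
    return all(total - lost >= quorum(total) for lost in members_per_zone)


if __name__ == "__main__":
    for layout in ([1, 1], [2, 1], [1, 1, 1], [2, 2, 1]):
        print(layout, "survives a zone failure:", survives_zone_loss(layout))
```

With two zones, at least one zone always holds half or more of the members, so losing that zone loses quorum. Three zones with one member each can lose any single zone and still have two of three members. A fourth member adds no tolerance over three, since `failures_tolerated(4)` and `failures_tolerated(3)` are both 1.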

Heptio’s recommended solution is to run a separate cluster in each Availability Zone. Then, push a similar configuration to each cluster separately, using client-side tools. Deploy the same set of services to each cluster by parameterizing configs based on the cluster. Set up ELBs to point to NodePorts in each cluster.
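
One way to picture "parameterizing configs based on the cluster" is a small client-side script that renders the same Service manifest once per cluster and builds the matching kubectl command. This is a minimal sketch under stated assumptions, not a Heptio tool: the kubeconfig context names, zone names, service name, and NodePort value are hypothetical, and the script only prints the commands rather than running them.

```python
from string import Template

# One NodePort Service definition, shared by every cluster.
SERVICE_TEMPLATE = Template("""\
apiVersion: v1
kind: Service
metadata:
  name: $name
  labels:
    zone: $zone
spec:
  type: NodePort
  selector:
    app: $name
  ports:
  - port: 80
    nodePort: $node_port
""")

# Hypothetical clusters, one per Availability Zone, keyed by kubeconfig context.
CLUSTERS = [
    {"context": "cluster-us-west-2a", "zone": "us-west-2a"},
    {"context": "cluster-us-west-2b", "zone": "us-west-2b"},
    {"context": "cluster-us-west-2c", "zone": "us-west-2c"},
]


def render_manifest(name, zone, node_port):
    """Fill in the per-cluster values of the shared template."""
    return SERVICE_TEMPLATE.substitute(name=name, zone=zone, node_port=node_port)


def apply_command(context, manifest_path):
    """kubectl invocation that applies a manifest to one specific cluster."""
    return ["kubectl", "--context", context, "apply", "-f", manifest_path]


if __name__ == "__main__":
    for cluster in CLUSTERS:
        path = f"web-{cluster['zone']}.yaml"
        with open(path, "w") as manifest_file:
            manifest_file.write(render_manifest("web", cluster["zone"], 30080))
        print(" ".join(apply_command(cluster["context"], path)))
```

Using the same NodePort in every cluster keeps the ELB configuration uniform: each load balancer can forward to the same port on the nodes of its own cluster.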

In the future, Kubernetes cluster federation may be an appropriate approach to coordinate across clusters.

Other tools

To run a Kubernetes cluster for long-term production purposes, please consider the following resources:

  • For more mature clusters on AWS, check out kops. kops supports HA Kubernetes and clusters that span multiple zones. It has an active community and supports upgrades.
  • High availability clusters are documented here: Building High-Availability Clusters
  • Cluster federation is documented here: Federation User Guide