Engineering Playbook
Kubernetes

Cluster Operations

Managed K8s (EKS/GKE), Upgrades, and Node Pools.

Kubernetes Clusters

Kubernetes is the operating system of the cloud. Running it involves two parts: the Control Plane (the API server, etcd, scheduler, and controllers, historically called "the Master") and the Data Plane (the worker nodes that run your pods).

Managed vs. Self-Hosted

  • Hard Mode (Kops/Kubeadm): You manage the Control Plane (etcd, api-server) on EC2 instances. Don't do this unless you are a bank or a massive tech corp.
  • Easy Mode (EKS/GKE/AKS): AWS/Google/Microsoft manages the Control Plane. You just manage the Worker Nodes.

Node Pools (Data Plane)

Organize your worker nodes into groups based on hardware.

  1. General Purpose: M5/T3 instances. For web apps.
  2. Compute Optimized: C5. For batch processing.
  3. Spot Instances: Cheap instances that the provider can reclaim at short notice (AWS gives a two-minute warning). Great for stateless workers, but you must handle interruptions (Graceful Shutdown).
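
Graceful Shutdown for a spot worker can be sketched as below. This is a minimal illustration, not a definitive implementation: it assumes the interruption reaches the pod as SIGTERM, which is how Kubernetes stops a pod that is being evicted, and the names run_worker, request_shutdown, and the job list are hypothetical.

```python
import signal
import threading

# Set once the process has been asked to stop.
shutdown_requested = threading.Event()


def request_shutdown(signum, frame):
    """Signal handler: record the request and let the main loop exit cleanly."""
    shutdown_requested.set()


def run_worker(jobs, handle):
    """Process jobs one at a time, stopping between jobs once shutdown is requested.

    Returns the jobs that were never started so the caller can requeue them.
    """
    pending = list(jobs)
    while pending and not shutdown_requested.is_set():
        handle(pending.pop(0))
    return pending


if __name__ == "__main__":
    # Kubernetes sends SIGTERM first, then SIGKILL once
    # terminationGracePeriodSeconds has elapsed.
    signal.signal(signal.SIGTERM, request_shutdown)
    leftover = run_worker(["job-1", "job-2", "job-3"], handle=print)
    print("not started, should be requeued:", leftover)
```

The worker only checks the flag between jobs, so each job should be short enough to finish inside the pod's termination grace period.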

Upgrade Strategy

Upgrading Kubernetes is terrifying. The Control Plane upgrade itself leaves running pods alone, but upgrading the nodes means replacing them, which restarts every container in your cluster.

The Blue/Green Node Strategy:

  1. You are on Version 1.26 (Node Group A).
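  2. Upgrade the Control Plane to 1.27 first (nodes must never run a newer version than the API server), then spin up a new Node Group B on Version 1.27.
  3. Cordon (or taint) every node in Group A so no new pods land there.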
  4. Drain Group A. This evicts pods, forcing them to reschedule onto Group B.
  5. Once empty, delete Group A.
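
The cordon and drain steps can be scripted roughly as follows. This is a sketch under stated assumptions: plan_node_group_drain and the node names are hypothetical, and the function only builds the kubectl commands so they can be reviewed before anything runs. The flags used (--ignore-daemonsets, --delete-emptydir-data, --timeout) are standard kubectl drain options.

```python
import shlex


def plan_node_group_drain(old_nodes, drain_timeout="10m"):
    """Return kubectl commands (as argv lists) that empty the old node group.

    Every node is cordoned first so evicted pods cannot land on another
    old node, then the nodes are drained one after another.
    """
    commands = [["kubectl", "cordon", node] for node in old_nodes]
    for node in old_nodes:
        commands.append([
            "kubectl", "drain", node,
            "--ignore-daemonsets",     # DaemonSet pods are skipped, not evicted
            "--delete-emptydir-data",  # allow evicting pods that use emptyDir volumes
            f"--timeout={drain_timeout}",
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical node names standing in for Node Group A.
    for argv in plan_node_group_drain(["node-a-1", "node-a-2"]):
        print(shlex.join(argv))
```

Printing the plan rather than executing it is a deliberate choice here: each command can be run by hand, or passed to subprocess.run once the list looks right.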

Pod Disruption Budgets (PDB)

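A drain evicts every pod on a node without checking how many replicas of your API remain. If all the replicas sit on the nodes being drained, K8s might take them all down at once.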

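Define a PDB (minAvailable: 1) to tell K8s: "You can move me, but ensure at least 1 replica is always running." The drain then waits for a replacement pod to become Ready before it evicts the next one, so run at least 2 replicas or the drain will block.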
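
A minimal sketch of such a PDB, built as a plain dictionary and printed as JSON (kubectl apply -f accepts JSON as well as YAML). The helper pod_disruption_budget, the name "api-pdb", and the app=api label are assumptions for illustration; the selector must match the labels on your own pods.

```python
import json


def pod_disruption_budget(name, match_labels, min_available=1):
    """Build a policy/v1 PodDisruptionBudget manifest as a plain dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": min_available,
            # Must match the labels of the pods this budget protects.
            "selector": {"matchLabels": dict(match_labels)},
        },
    }


if __name__ == "__main__":
    # Hypothetical name and label; replace with your Deployment's values.
    manifest = pod_disruption_budget("api-pdb", {"app": "api"})
    print(json.dumps(manifest, indent=2))
```

Saving the printed output to a file and applying it with kubectl apply -f creates the budget in the current namespace.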