Removing a node from service


There may be times when you need to remove a specific node (or multiple nodes) from service, either temporarily or permanently. Common cases include troubleshooting nodes that are in a bad state, or retiring nodes after an AMI update so that all nodes run the new image.

This page describes how to temporarily prevent new workloads from being assigned to a node, as well as how to safely remove workloads from a node so that it can be permanently retired.

Temporarily removing a node from service

The kubectl cordon <node> command prevents any additional pods from being scheduled onto the node, without disrupting any of the pods currently running on it. For example, say a new node in your cluster has come up with some problems, and you want to cordon it before launching any new runs to ensure they will not land on that node. The procedure might look like this (node names are shown as placeholders):

$ kubectl get nodes
NAME        STATUS   ROLES    AGE   VERSION
<node-1>    Ready    <none>   12d   v1.14.7-eks-1861c5
<node-2>    Ready    <none>   12d   v1.14.7-eks-1861c5
<node-3>    Ready    <none>   51m   v1.14.7-eks-1861c5
<node-4>    Ready    <none>   12d   v1.14.7-eks-1861c5
$ kubectl cordon <node-3>
node/<node-3> cordoned
$ kubectl get no
NAME        STATUS                     ROLES    AGE   VERSION
<node-1>    Ready                      <none>   12d   v1.14.7-eks-1861c5
<node-2>    Ready                      <none>   12d   v1.14.7-eks-1861c5
<node-3>    Ready,SchedulingDisabled   <none>   53m   v1.14.7-eks-1861c5
<node-4>    Ready                      <none>   12d   v1.14.7-eks-1861c5

Notice the SchedulingDisabled status on the cordoned node.

You can undo this and return the node to service with the command kubectl uncordon <node>.
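
For example, using the placeholder node from above:

$ kubectl uncordon <node-3>
node/<node-3> uncordoned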

Permanently removing a node from service

Identifying user workloads

Before permanently removing a node from service, you should ensure there are no workloads still running on it that should not be disrupted. For example, you might see the following workloads running on a node (note the -n flag specifying the compute namespace, and -o wide output, which includes the node hosting each pod):

$ kubectl get po -n domino-compute -o wide | grep <node-name>
run-5e66acf26437fe0008ca1a88-f95mk               2/2     Running     0          23m   <pod-ip>   <node-name>   <none>   <none>
run-5e66ad066437fe0008ca1a8f-629p9               3/3     Running     0          24m   <pod-ip>   <node-name>   <none>   <none>
run-5e66b65e9c330f0008f70ab8-85f4f5f58c-m46j7    3/3     Running     0          51m   <pod-ip>   <node-name>   <none>   <none>
model-5e66ad4a9c330f0008f709e4-86bd9597b7-59fd9  2/2     Running     0          54m   <pod-ip>   <node-name>   <none>   <none>
domino-build-5e67c9299c330f0008f70ad1            1/1     Running     0          3s    <pod-ip>   <node-name>   <none>   <none>

Different types of workloads should be treated differently. You can see the details of a particular workload with kubectl describe po run-5e66acf26437fe0008ca1a88-f95mk -n domino-compute. The Labels section of the describe output is particularly useful for distinguishing the type of workload, as each of the workloads named run-... will have a label like dominodatalab.com/workload-type=<type of workload> (see the label-based listing after this list). The example above contains one of each of the major user workloads:

  • run-5e66acf26437fe0008ca1a88-f95mk is a Batch Job, with label dominodatalab.com/workload-type=Batch. It will stop running on its own once it is finished and disappear from the list of active workloads.
  • run-5e66ad066437fe0008ca1a8f-629p9 is a Workspace, with label dominodatalab.com/workload-type=Workspace. It will keep running until the user who launched it shuts it down. You have the option of contacting users to shut down their workspaces, waiting a day or two in the expectation that they will shut them down naturally, or removing the node with the workspaces still running. (The last option is not recommended unless you are certain there is no un-synced work in any of the workspaces and have communicated with the users about the interruption.)
  • run-5e66b65e9c330f0008f70ab8-85f4f5f58c-m46j7 is an App, with label dominodatalab.com/workload-type=App. It is a long-running process governed by a Kubernetes deployment. It will be recreated automatically if you destroy the node hosting it, but will experience whatever downtime is required for a new pod to be created and scheduled on another node. See below for methods to proactively move the pod and reduce downtime.
  • model-5e66ad4a9c330f0008f709e4-86bd9597b7-59fd9 is a Model API. It does not have a workload-type label, and is instead easily identifiable by the pod name. It is also a long-running process similar to an App, with similar concerns. See below for methods to proactively move the pod and reduce downtime.
  • domino-build-5e67c9299c330f0008f70ad1 is a Compute Environment build. It will finish on its own and go into a Completed state.
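
For a quick overview of workload types without describing each pod individually, kubectl can also print a label as its own column with the -L flag. A minimal sketch, assuming the dominodatalab.com/workload-type label key discussed above and the same placeholder node name:

$ kubectl get po -n domino-compute -o wide -L dominodatalab.com/workload-type | grep <node-name>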

Dealing with long-running workloads

For the long-running workloads governed by a Kubernetes deployment, you can proactively move the pods off of the cordoned node by running a command like this:

$ kubectl rollout restart deploy model-5e66ad4a9c330f0008f709e4 -n domino-compute

Notice that the deployment name matches the first part of the pod name from the section above. You can see a list of all deployments in the compute namespace by running kubectl get deploy -n domino-compute.
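
To confirm that a restart has finished and the replacement pod is serving, you can watch the rollout:

$ kubectl rollout status deploy model-5e66ad4a9c330f0008f709e4 -n domino-compute
deployment "model-5e66ad4a9c330f0008f709e4" successfully rolled out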

Whether the associated App or Model API experiences any downtime depends on the update strategy of the deployment. For the two long-running example workloads above, one App and one Model API, a test deployment shows the following (describe output filtered here for brevity):

$ kubectl describe deploy run-5e66b65e9c330f0008f70ab8 -n domino-compute | grep -i "strategy\|replicas:"
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
RollingUpdateStrategy:  1 max unavailable, 1 max surge

$ kubectl describe deploy model-5e66ad4a9c330f0008f709e4 -n domino-compute | grep -i "strategy\|replicas:"
Replicas:               2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType:           RollingUpdate
RollingUpdateStrategy:  0 max unavailable, 25% max surge

The App in this case would experience some downtime, since the old pod will be terminated immediately (1 max unavailable, with only 1 pod currently running). The Model API will not experience any downtime, since termination of the old pod must wait until a new pod is available (0 max unavailable). If desired, you can edit the deployments to change these settings and avoid downtime.
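
For example, a patch like the following (a sketch; adjust the values for your deployment) switches the App's deployment to create a replacement pod before terminating the old one:

$ kubectl patch deploy run-5e66b65e9c330f0008f70ab8 -n domino-compute -p '{"spec":{"strategy":{"rollingUpdate":{"maxUnavailable":0,"maxSurge":1}}}}'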

Dealing with older versions of Kubernetes

Earlier versions of Kubernetes (before kubectl 1.15) do not have the kubectl rollout restart command, but a similar effect can be achieved by “patching” the deployment with a throwaway annotation like this:

$ kubectl patch deploy run-5e66b65e9c330f0008f70ab8 -n domino-compute -p '{"spec":{"template":{"metadata":{"annotations":{"migration_date":"'$(date +%Y%m%d)'"}}}}}'

The patching process will respect the same update strategies as the above restart command.

Sample commands for iterating over many nodes and/or pods

In cases where you need to retire many nodes, it can be useful to loop over many nodes and/or workload pods in a single command. Customizing the output format of kubectl commands, appropriate filtering, and combining with xargs makes this possible.

For example, to cordon all nodes in the default node pool, you can run the following:

$ kubectl get nodes -l dominodatalab.com/node-pool=default -o custom-columns=":metadata.name" --no-headers | xargs kubectl cordon
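
If the nodes were cordoned only temporarily, the same pattern returns them all to service:

$ kubectl get nodes -l dominodatalab.com/node-pool=default -o custom-columns=":metadata.name" --no-headers | xargs kubectl uncordon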

To view only apps running on a particular node, you can filter using the labels discussed above:

$ kubectl get pods -n domino-compute -o wide -l dominodatalab.com/workload-type=App | grep <node-name>
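
The same label filter works across all nodes. For example, to list every running Workspace, which is useful when deciding which users to contact before retiring nodes:

$ kubectl get pods -n domino-compute -l dominodatalab.com/workload-type=Workspace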

To do a rolling restart of all model pods (over all nodes), you can run:

$ kubectl get deploy -n domino-compute -o custom-columns=":metadata.name" --no-headers | grep model | xargs kubectl rollout restart -n domino-compute deploy

When constructing such commands for larger maintenance, always run the first part of the command by itself to verify that the list of names being passed through xargs to the final kubectl command is what you expect!
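
For example, before running the model restart command above, check the list of deployment names on its own; in the example cluster from the earlier sections, this would return the single Model API deployment:

$ kubectl get deploy -n domino-compute -o custom-columns=":metadata.name" --no-headers | grep model
model-5e66ad4a9c330f0008f709e4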