Caching environment images in EKS

When a user launches a Domino Run, part of the start-up process is loading the user’s environment onto the node that will host the Run. For large images, transferring the image to a new node can take several minutes. After an image has been loaded onto a node, it is cached there, and future Runs on that node that use the same environment start faster.

When running Domino on EKS, you can pre-cache popular environments and base images on the Amazon Machine Image (AMI) used for new nodes. This can speed up the start time of Runs on new nodes significantly. This page describes the process of creating a new AMI with cached environments and configuring EKS to use it for new nodes.




AMI requirements

In addition to any dependencies required by Kubernetes itself, your AMI should contain the following:

  • Docker
  • Cache of Domino’s compute environments
  • NVIDIA-Docker 2 (GPU nodes only)
  • NVIDIA GPU driver 410+ (GPU nodes only)
  • Default Docker runtime changed to the NVIDIA runtime (GPU nodes only)

For simplicity, Domino recommends that you use the official EKS default AMIs, which come pre-configured with Docker and the GPU tools.

Alternatively, you can use Amazon’s build scripts to create your own AMI for use with EKS.




AMI operations

The following sections describe how to perform several important types of operations on an EC2 instance to set it up as the template for a new AMI suitable for Domino.


Pull environment images

To pre-cache environment images, run docker pull for either the base images those environments are built on or the built environments in the Domino internal registry.

To pull a Domino Standard Environment base image, run the following command, substituting the version string of the image you want to cache.

docker pull quay.io/domino/base:<desired version>

To pull a built image from the Domino internal registry, find its URI on the Revisions tab of the environment details page.

../_images/environment-image-url.png


For example, to cache revision #9 of the environment shown in the screenshot above, you would run:

docker pull 100.97.56.113:5000/domino-5d7abf2715f3690007f23081:9
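When several images need to be cached, the pulls can be wrapped in a small helper. The following is a minimal sketch, not part of Domino: the function name pull_images and the DOCKER override variable are hypothetical, introduced here only for illustration.

```shell
#!/bin/sh
# Pull every image reference passed as an argument so that it lands in the
# local Docker cache of the instance that will become the AMI template.
# DOCKER may be overridden (for example DOCKER=echo) for a dry run; that
# override is a convenience of this sketch, not a Docker feature.
pull_images() {
    failed=0
    for image in "$@"; do
        echo "Pulling ${image}" >&2
        if ! "${DOCKER:-docker}" pull "$image"; then
            echo "Failed to pull ${image}" >&2
            failed=1
        fi
    done
    # Non-zero if any pull failed, so a build script can stop early.
    return "$failed"
}

# Example call, using the internal registry URI from the example above:
#   pull_images 100.97.56.113:5000/domino-5d7abf2715f3690007f23081:9
```

The function keeps going after a failed pull and reports the failure through its exit status, so one bad image reference does not prevent the remaining images from being cached.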

Install GPU drivers (GPU AMIs only)

To use the GPU on a GPU node, you need to install the appropriate driver on the machine image. Domino does not require a specific driver version. However, if you want to use a Domino Standard Environment, the driver must be compatible with the version of CUDA currently shown in the standard environments.

See the compatibility matrix for supported combinations of driver and CUDA versions.

If you’d like to install the GPU drivers manually, you can follow these instructions.

To validate that your GPU machine is configured properly, reboot the machine and run the following:

docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

If the installation succeeded, this prints the driver version and the GPU devices.


Change the default Docker runtime (GPU AMIs only)

Read the official instructions from NVIDIA on using the container runtime.

Note that you must restart Docker for the change to take effect.
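As an illustration of what the change involves, the sketch below writes a Docker daemon configuration that sets nvidia as the default runtime, following the layout used in the NVIDIA container runtime documentation. The function name write_nvidia_daemon_config is hypothetical, and the sketch assumes nvidia-container-runtime is already installed and on the PATH. An EKS AMI may already ship a daemon.json with other settings, so treat the output as a fragment to merge rather than a file to overwrite blindly.

```shell
#!/bin/sh
# Write a Docker daemon configuration that makes "nvidia" the default
# runtime. The target path is an argument so the result can be reviewed
# and merged with any existing /etc/docker/daemon.json first.
write_nvidia_daemon_config() {
    target="$1"
    cat > "$target" <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
}

# Example (as root on the GPU instance; this overwrites the target file):
#   write_nvidia_daemon_config /etc/docker/daemon.json
#   systemctl restart docker
```

With the default runtime set this way, containers can reach the GPU without passing --runtime=nvidia on each docker run.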




Complete AMI caching procedure

  1. Determine which AMI you want to use as the base for the new AMI. If you’re performing this operation on an operational Domino node pool, you should use the AMI that’s currently used in the active launch configuration.

    ../_images/launch_config_name.png

    Once you’ve identified the name of the active launch configuration, view its details to see the AMI ID it uses.

    ../_images/ami_id.png
  2. Launch a new EC2 instance from the base AMI.

  3. Connect to the instance via SSH and perform any of the operations listed above that you want to apply to your new AMI, including pulling any environment images you want to cache.

  4. Create a new AMI from the EC2 instance.

  5. Create a copy of the launch configuration currently used by any ASGs you want to switch to using the new AMI.

  6. Edit the copied launch configuration so that its AMI is the ID of the new AMI you created.

  7. For any ASGs that you want to start using the new AMI, switch them over to the new launch configuration.

Once you complete the final step, any ASGs you switched to the new launch configuration will use the new AMI whenever they create new nodes. Those nodes will already have cached every environment image you pulled onto the AMI template, so Domino Runs that use those environments will start faster on them.
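Steps 1, 4, and 7 of the procedure can also be carried out with the AWS CLI instead of the console. The following is a rough sketch under stated assumptions: the function names and the AWS override variable are hypothetical, the AWS CLI is assumed to be installed and configured with suitable permissions, and copying the launch configuration (steps 5 and 6) is left to the console.

```shell
#!/bin/sh
# AWS may be overridden (for example AWS=echo) to print the commands
# instead of running them; that override is a convenience of this sketch.

# Step 1: print the AMI ID used by the named launch configuration.
launch_config_ami() {
    "${AWS:-aws}" autoscaling describe-launch-configurations \
        --launch-configuration-names "$1" \
        --query 'LaunchConfigurations[0].ImageId' --output text
}

# Step 4: create an AMI from the prepared instance and print its ID.
# By default, create-image reboots the instance to get a consistent image.
create_cached_ami() {
    "${AWS:-aws}" ec2 create-image \
        --instance-id "$1" --name "$2" \
        --query 'ImageId' --output text
}

# Step 7: point an Auto Scaling group at the new launch configuration.
switch_asg_launch_config() {
    "${AWS:-aws}" autoscaling update-auto-scaling-group \
        --auto-scaling-group-name "$1" \
        --launch-configuration-name "$2"
}

# Example (all names and IDs below are placeholders):
#   launch_config_ami my-launch-config
#   create_cached_ami i-0123456789abcdef0 domino-cached-envs
#   switch_asg_launch_config my-asg my-launch-config-cached
```

Existing nodes are not replaced by the switch. Only nodes that the ASG creates afterwards start from the new AMI.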