Caching environment images in EKS¶
When a user launches a Domino Run, part of the start-up process is loading the user’s environment image onto the node that will host the Run. For large images, transferring the image to a new node can take several minutes. After an image has been loaded onto a node, it is cached there, and future Runs on that node that use the same environment start faster.
When running Domino on EKS, you can pre-cache popular environments and base images on the Amazon Machine Image (AMI) used for new nodes. This can speed up the start time of Runs on new nodes significantly. This page describes the process of creating a new AMI with cached environments and configuring EKS to use it for new nodes.
In addition to any dependencies required by Kubernetes itself, your AMI should contain the following:
- Cache of Domino’s compute environments
- NVIDIA Docker 2 (GPU nodes only)
- NVIDIA GPU driver 410+ (GPU nodes only)
- Docker configured to use the NVIDIA runtime by default (GPU nodes only)
For simplicity, Domino recommends that you use the official EKS default AMIs, which come pre-configured with Docker and the GPU tools.
- Click to read about the official EKS AMI Domino recommends for default compute nodes
- Click to read about the official EKS AMI Domino recommends for GPU nodes
Alternatively, you can use Amazon’s build scripts to create your own AMI for use with EKS.
The following sections describe how to perform several important types of operations on an EC2 instance to set it up as the template for a new AMI suitable for Domino.
Pull environment images¶
Pre-caching environment images is a simple process of running docker pull for the base images those environments are built on, or for the built environments from the internal registry itself.
To pull a Domino Standard Environment base image, run the following command, substituting the version string of the image you want to cache:
docker pull quay.io/domino/base:<desired version>
To pull a built image from the Domino internal registry, you will need to find its URI from the Revisions tab in the environment details page.
For example, to cache revision #9 of the environment shown in the screenshot above, you would run:
docker pull 100.97.56.113:5000/domino-5d7abf2715f3690007f23081:9
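When several images need to be cached, the pulls can be wrapped in a small helper. The following is a minimal sketch, not part of Domino's tooling: the function name cache_images and the DOCKER override variable are hypothetical, introduced here only so that the commands can be previewed without a Docker daemon.

```shell
# Sketch: pull every image passed as an argument so that it lands in the
# local Docker cache of the instance that will become the AMI template.
# DOCKER is a hypothetical override; set DOCKER=echo to preview the commands.
DOCKER="${DOCKER:-docker}"

cache_images() {
  failed=0
  for image in "$@"; do
    # $DOCKER is intentionally unquoted so an override such as "sudo docker" works.
    if ! $DOCKER pull "$image"; then
      echo "failed to pull: $image" >&2
      failed=1
    fi
  done
  # Non-zero when at least one pull failed, so callers can stop before creating the AMI.
  return "$failed"
}
```

A call such as cache_images "quay.io/domino/base:<desired version>" "100.97.56.113:5000/domino-5d7abf2715f3690007f23081:9" would then pull both kinds of image described above. The helper keeps going after a failed pull and reports the failure through its exit status.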
Install NVIDIA Docker 2.0 (GPU AMIs only)¶
Install GPU drivers (GPU AMIs only)¶
To use the GPU on a GPU node, you need to install the appropriate driver on the machine image. Domino does not require a specific driver version. However, if you want to use a Domino Standard Environment, the driver should be compatible with the CUDA version shown in the standard environments.
If you’d like to install the GPU drivers manually, you can follow these instructions.
To validate that your GPU machine is configured properly, reboot the machine and run the following:
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
If the installation succeeded, this shows the driver version and the GPU devices.
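The driver version can also be checked in a script against the 410+ minimum listed at the top of this page. The sketch below is illustrative only: the function name driver_meets_minimum is hypothetical, and it compares just the major part of the version string.

```shell
# Sketch: succeed when the major part of a driver version string
# (for example "418.87.00") is at least the given minimum (default 410).
# Returns 2 when the version string does not start with a number.
driver_meets_minimum() {
  version="$1"
  minimum="${2:-410}"
  major="${version%%.*}"   # keep everything before the first dot
  case "$major" in
    ''|*[!0-9]*) return 2 ;;
  esac
  [ "$major" -ge "$minimum" ]
}
```

On a GPU instance the version string could be obtained with nvidia-smi --query-gpu=driver_version --format=csv,noheader, which prints one line per GPU, and the first line passed to driver_meets_minimum.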
Change the default Docker runtime (GPU AMIs only)¶
Note that you must restart Docker before this will work.
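This page does not show the configuration change itself. As a sketch, and assuming the usual NVIDIA Docker 2 setup in which the default runtime is set in /etc/docker/daemon.json, the file could be written as follows. The function name write_nvidia_daemon_json is hypothetical, and the JSON shown is the commonly documented form rather than something taken from this page, so check it against your installation.

```shell
# Sketch: write a Docker daemon config that makes the NVIDIA runtime the default.
# Assumption: nvidia-container-runtime is installed and on the daemon's PATH.
# Caution: this overwrites the target file; merge by hand if it already has settings.
write_nvidia_daemon_json() {
  target="${1:-/etc/docker/daemon.json}"
  cat > "$target" <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
}
```

Run as root, write_nvidia_daemon_json followed by systemctl restart docker would apply the change. After the restart, containers should get the NVIDIA runtime without passing --runtime=nvidia.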
Complete AMI caching procedure¶
Determine which AMI you want to use as the base for the new AMI. If you’re performing this operation on an operational Domino node pool, you should use the AMI that’s currently used in the active launch configuration.
Once you’ve identified the name of the active launch configuration, view its details to see the AMI ID it uses.
Launch a new EC2 instance from the base AMI.
Connect to the instance via SSH and perform any of the operations listed above that you want to apply to your new AMI, including pulling any environment images you want to cache.
Create a new AMI from the EC2 instance.
Create a copy of the launch configuration currently used by any ASGs you want to switch to using the new AMI.
Edit the copied launch configuration so that its AMI is the ID of the new AMI you created.
For any ASGs that you want to start using the new AMI, switch them over to the new launch configuration.
After you complete the final step, the ASGs you switched to the new launch configuration will use the new AMI whenever they create nodes. Those nodes will have the environment images you pulled onto the AMI template already cached, so new Domino Runs on them start faster.
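Parts of the procedure above can be scripted with the AWS CLI. The following is a rough sketch under stated assumptions: the function names are hypothetical, every resource name passed to them is a placeholder, and the ASG is assumed to use a launch configuration rather than a launch template. Copying the launch configuration is left as a manual step because it requires re-specifying the existing settings.

```shell
# Sketch: print the AMI ID used by the launch configuration attached to an ASG.
ami_for_asg() {
  asg_name="$1"
  lc_name="$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$asg_name" \
    --query 'AutoScalingGroups[0].LaunchConfigurationName' \
    --output text)" || return 1
  aws autoscaling describe-launch-configurations \
    --launch-configuration-names "$lc_name" \
    --query 'LaunchConfigurations[0].ImageId' \
    --output text
}

# Sketch: create an AMI from the prepared instance, wait until it is
# available, and print its ID. By default create-image reboots the instance.
create_ami_from_instance() {
  instance_id="$1"
  ami_name="$2"
  ami_id="$(aws ec2 create-image \
    --instance-id "$instance_id" \
    --name "$ami_name" \
    --query 'ImageId' \
    --output text)" || return 1
  aws ec2 wait image-available --image-ids "$ami_id" || return 1
  echo "$ami_id"
}

# Sketch: point an ASG at a launch configuration that already references the new AMI.
switch_asg_launch_config() {
  aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name "$1" \
    --launch-configuration-name "$2"
}
```

For example, ami_for_asg with the name of a Domino node pool's ASG would print the base AMI to launch the template instance from, and switch_asg_launch_config would be called once per ASG after the copied launch configuration exists.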