NVIDIA DGX in Domino
NVIDIA DGX systems can run Domino workloads if they are added to your Kubernetes cluster as compute (worker) nodes. Read below for how to set up DGX systems and add them to Domino.
The flow chart begins from the top left, with a Domino end user requesting a GPU tier.
If a DGX is already configured for use in Domino’s Compute Grid, the Domino platform administrator can define a GPU-enabled Hardware Tier from within the Admin console.
The middle lane of the flow chart outlines the steps required to integrate a provisioned DGX system as a node in the Kubernetes cluster that is hosting Domino, and subsequently configure that node as a GPU-enabled component of Domino’s compute grid.
The bottom swim lane shows that, to use an NVIDIA DGX system with Domino, the system must first be purchased and provisioned into the target infrastructure stack hosting Domino.
Prepare & Install DGX System(s)
NVIDIA DGX systems can be purchased through NVIDIA’s Partner Network. Install the DGX system in a hosting environment with network access to the additional host and storage infrastructure required to host Domino.
Configure DGX System for Domino
Option A: New Kubernetes Cluster & Domino Install
If this is a new (greenfield) deployment of Domino, you must first install and configure a Kubernetes cluster that meets Domino’s Cluster Requirements. This includes a valid configuration of the cluster’s network policies to support secure communication between the pods that will host Domino’s platform services and compute grid.
Option B: Existing Kubernetes Cluster and/or Domino Installation
Adding a DGX to an existing Domino deployment only requires joining the DGX to your Kubernetes cluster as a worker node, with a node label consistent with your chosen naming conventions. The default node label for GPU-based worker nodes is ‘default-gpu’.
Additionally, you must add the appropriate taints to your DGX node. Taints keep pods that do not tolerate them off the node, which reserves the DGX for GPU-based workloads running on Domino.
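As a rough illustration of the labeling and tainting step, the sketch below builds the two `kubectl` commands an administrator might run against the DGX node. The node name `dgx-01`, the label key `dominodatalab.com/node-pool`, and the taint `nvidia.com/gpu=true:NoSchedule` are assumptions made for this example and are not stated in this guide; confirm the exact key and taint against Domino’s Cluster Requirements before use.

```python
import shlex


def node_setup_commands(
    node_name,
    pool="default-gpu",
    label_key="dominodatalab.com/node-pool",  # assumed label key
    taint="nvidia.com/gpu=true:NoSchedule",   # assumed taint
):
    """Return kubectl commands that label and taint a GPU worker node."""
    node = shlex.quote(node_name)  # guard against shell metacharacters
    return [
        # Label the node so GPU hardware tiers can select it.
        f"kubectl label nodes {node} {label_key}={pool} --overwrite",
        # Taint the node so pods without a matching toleration stay off it.
        f"kubectl taint nodes {node} {taint} --overwrite",
    ]


if __name__ == "__main__":
    # "dgx-01" is a hypothetical node name.
    for command in node_setup_commands("dgx-01"):
        print(command)
```

The function only prints the commands rather than running them, so they can be reviewed before being applied to the cluster.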
Configuring a Domino Hardware Tier to leverage your configured DGX Compute Node
Domino provides governance features in the Hardware Tier interface of the Admin console, supporting LDAP/AD federation or SSO-based attributes for managed access control and user execution quotas. We have also published a series of best practices for managing hardware tiers in your compute grid.
CUDA / NVIDIA driver configuration
Configuration of the NVIDIA driver at the host level should be performed by your server administrator. The correct NVIDIA driver for your host can be identified by using the configuration guide found here. More information can be found in the DGX Systems Documentation.
The CUDA software version required for a given development framework, such as TensorFlow, is documented on the framework’s website. For example, TensorFlow 2.1 requires CUDA 10.1 and some additional software packages, e.g., cuDNN.
CUDA & Nvidia Driver Compatibility
Once the correct CUDA version is identified for your specific needs, one must consult the CUDA-Nvidia Driver Compatibility Table.
In the TensorFlow 2.1 example, the CUDA 10.1 requirement means the host must support CUDA >=10.1, which requires NVIDIA driver >=418.39 on Linux (410.48 is the minimum for CUDA 10.0). Table 1 in the link above will guide your choice of matching CUDA and NVIDIA driver versions.
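The compatibility lookup can be sketched as a small version comparison. The minimum Linux driver versions below are copied into the example by hand for three CUDA releases and should be treated as assumptions to verify against the compatibility table; the function names are hypothetical.

```python
# Minimum Linux driver version per CUDA release (assumed values;
# verify against NVIDIA's CUDA / driver compatibility table).
MIN_DRIVER_FOR_CUDA = {
    (10, 0): (410, 48),
    (10, 1): (418, 39),
    (10, 2): (440, 33),
}


def parse_version(text):
    """Turn a dotted version string such as '418.87.01' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def driver_supports_cuda(driver_version, cuda_version):
    """Return True if the host driver meets the minimum for the CUDA release."""
    cuda = parse_version(cuda_version)[:2]  # compare on major.minor only
    if cuda not in MIN_DRIVER_FOR_CUDA:
        raise KeyError(f"No minimum driver recorded for CUDA {cuda_version}")
    return parse_version(driver_version) >= MIN_DRIVER_FOR_CUDA[cuda]


if __name__ == "__main__":
    print(driver_supports_cuda("418.87.01", "10.1"))
    print(driver_supports_cuda("410.48", "10.1"))
```

On a real host, the installed driver version would typically be read from `nvidia-smi` instead of being passed in as a literal.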
Subsequently, the Domino Compute Environment must be configured to leverage the exact CUDA version that corresponds to the desired application.
This constraint is eased by the fact that NVIDIA drivers are backward compatible with earlier CUDA versions: the CUDA version supported on the host can be greater than or equal to the version specified in your Compute Environment.
Installing an exact CUDA version, including the patch version, often produces unexpected results. The fastest route to a functioning configuration is therefore typically to install the latest available minor release of your required major version of CUDA, and then to create a Docker environment variable (ENV) in your Compute Environment that constrains the compatible sets of CUDA versions, GPU generations, and NVIDIA drivers.
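One way to express such a constraint is the `NVIDIA_REQUIRE_CUDA` variable read by the NVIDIA container runtime. This guide does not name the variable, so its use here is an assumption, and the helper `nvidia_require_cuda_env` is hypothetical. The sketch only produces the simplest form, a minimum CUDA version, and leaves out driver and GPU constraints.

```python
import re


def nvidia_require_cuda_env(min_cuda):
    """Build a Dockerfile ENV line requiring at least the given CUDA version.

    min_cuda must be a 'major.minor' string such as '10.1'.
    """
    if not re.fullmatch(r"\d+\.\d+", min_cuda):
        raise ValueError(f"Expected 'major.minor', got {min_cuda!r}")
    # The container runtime checks this constraint against the host driver
    # when the container starts.
    return f'ENV NVIDIA_REQUIRE_CUDA="cuda>={min_cuda}"'


if __name__ == "__main__":
    # Line to paste into the Dockerfile instructions of a Compute Environment.
    print(nvidia_require_cuda_env("10.1"))
```

With this line in place, a container whose host does not satisfy the stated CUDA minimum should fail at startup rather than partway through a workload.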
Need Additional Assistance?
Please consult your Domino customer success engineer for guidance on your specific needs. Domino can provide sample configurations that will simplify your configuration process.
We recommend you do not use a DGX GPU as a build node for environments. Instead, opt for a CPU resource as part of your overall Domino architecture.
Splitting GPUs per Tier
We recommend providing several GPU hardware tiers with different numbers of GPUs in each tier, e.g., 1, 2, 4, and 8 GPUs. Different training jobs can make use of a single GPU or several GPUs in parallel, and consuming a whole DGX box for one workload may not be feasible in your environment.
After splitting up hardware tiers, access can be global or, alternatively, limited to specific organizations. We recommend granting or restricting GPU Hardware Tier access per organization, both to ensure availability for critical work and to prevent the unauthorized use of GPU tiers.
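To make the trade-off concrete, the sketch below counts how many workloads of each tier size fit on one DGX at the same time. It assumes a system with 8 GPUs; adjust `gpus_per_node` for your hardware. The function name `tier_capacity` is hypothetical.

```python
def tier_capacity(gpus_per_node=8, tier_sizes=(1, 2, 4, 8)):
    """Map each tier size (GPUs per workload) to the number of
    concurrent workloads that fit on a single node."""
    capacity = {}
    for size in tier_sizes:
        if size < 1 or size > gpus_per_node:
            raise ValueError(f"Tier size {size} does not fit on the node")
        # Whole workloads only: leftover GPUs cannot host a partial run.
        capacity[size] = gpus_per_node // size
    return capacity


if __name__ == "__main__":
    for size, count in tier_capacity().items():
        print(f"{size}-GPU tier: up to {count} concurrent workload(s) per DGX")
```

Under this assumption, a single 8-GPU workload occupies the whole system, while the 1-GPU tier lets eight users share it.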