NVIDIA DGX in Domino

NVIDIA DGX systems can run Domino workloads if they are added to your Kubernetes cluster as compute (worker) nodes. This topic covers how to set up DGX systems and add them to Domino.

DGX & Domino Integration Flow Diagram

The flow chart begins from the top left, with a Domino end user requesting a GPU tier.

If a DGX is already configured for use in Domino’s Compute Grid, the Domino platform administrator can define a GPU-enabled Hardware Tier from within the Admin console.

The middle swim lane outlines the steps required to integrate a provisioned DGX system as a node in the Kubernetes cluster that hosts Domino, and then to configure that node as a GPU-enabled component of Domino’s compute grid.

The bottom swim lane shows that, to use an NVIDIA DGX system with Domino, the system must first be purchased and provisioned into the infrastructure stack that hosts Domino.

Prepare & install DGX systems

NVIDIA DGX systems can be purchased through the NVIDIA Partner Network. Install the DGX system in a hosting environment with network access to the additional host and storage infrastructure required to run Domino.

Configure DGX System for Domino

Option A: New Kubernetes cluster & Domino installation

If this is a new (greenfield) deployment of Domino, you must first install and configure a Kubernetes cluster that meets Domino’s Cluster Requirements, including valid configuration of Kubernetes network policies to support secure communication between the pods that will host Domino’s platform services and compute grid.

Option B: Existing Kubernetes cluster and/or Domino installation

Adding a DGX to an existing Domino deployment is as simple as joining the DGX to your Kubernetes cluster as a worker node, with a node label consistent with your chosen naming conventions. The default node pool label value for GPU-based worker nodes is default-gpu.

Additionally, appropriate taints must be added to your DGX node. These ensure that only GPU-based workloads running on Domino are scheduled onto the DGX.
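A minimal sketch of both steps with kubectl, assuming a hypothetical node name dgx-node-01 and a taint key of nvidia.com/gpu (the dominodatalab.com/node-pool label key is Domino’s node pool selector; the taint key and effect may vary by deployment, so confirm both against your installer configuration):

```shell
# Label the DGX node so Domino's compute grid can target it by node pool.
kubectl label nodes dgx-node-01 dominodatalab.com/node-pool=default-gpu

# Taint the node so only workloads that tolerate the taint
# (GPU-based Domino executions) are scheduled onto the DGX.
kubectl taint nodes dgx-node-01 nvidia.com/gpu=true:NoSchedule

# Verify the label and taint took effect.
kubectl get node dgx-node-01 --show-labels
kubectl describe node dgx-node-01 | grep Taints
```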

Configure a Domino Hardware Tier to use your DGX compute node

Now that the DGX is joined to your cluster and labeled properly, you can configure Domino Hardware Tiers from within Domino’s Admin UI.

Domino provides governance features from within this interface, supporting LDAP/AD federation or SSO-based attributes for managed access control and user execution quotas. Domino has also published a series of best practices for managing hardware tiers in your compute grid.
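For illustration only, a hardware tier that targets the DGX node pool above might specify per-execution resources along these lines (the field names approximate the Admin UI, and all values are hypothetical; size them to your workloads):

```text
Name:            dgx-1-gpu
Node Pool:       default-gpu
Cores:           8
Memory (GiB):    64
Number of GPUs:  1
```

Smaller per-execution tiers like this allow several workloads to share a single DGX; see the best practices below.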

CUDA / NVIDIA driver configuration

NVIDIA Driver

Configuration of the NVIDIA driver at the host level must be performed by your server administrator. The correct NVIDIA driver for your host can be identified using NVIDIA’s driver configuration guide. More information can be found in the DGX Systems Documentation.

CUDA Version

The CUDA version required for a given development framework, such as TensorFlow, is documented on that framework’s website. For example, TensorFlow >=2.1 requires CUDA 10.1 plus some additional software packages, such as cuDNN.
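One hedged way to keep the framework and its CUDA runtime in lockstep is to base your Compute Environment on the framework’s official GPU image, which bundles the CUDA version the release was built against (the tag below is illustrative; verify it against the framework’s published images):

```dockerfile
# TensorFlow 2.1.0 GPU image; bundles the CUDA 10.1 runtime and cuDNN
# that this release was built against.
FROM tensorflow/tensorflow:2.1.0-gpu-py3
```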

CUDA & NVIDIA Driver Compatibility

After the correct CUDA version is identified for your specific needs, consult the CUDA and NVIDIA driver compatibility table.

In the TensorFlow 2.1 example, the CUDA 10.1 requirement means the host must be running CUDA >=10.1 and an NVIDIA driver >=418.39. Table 1 in the compatibility table will guide your choice of matching CUDA and NVIDIA driver versions.

Subsequently, the Domino Compute Environment must be configured with the CUDA version that corresponds to the desired application.

This constraint is simplified by the fact that NVIDIA drivers are backward compatible with older CUDA versions: the CUDA version supported by the host can be greater than or equal to the version specified in your Compute Environment.
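As a quick sanity check of this constraint, compare what the host driver supports against what the environment image provides (assuming nvidia-smi is available on the host and the CUDA toolkit’s nvcc is installed in the image):

```shell
# On the DGX host: reports the installed NVIDIA driver version and the
# highest CUDA version that driver supports.
nvidia-smi

# Inside a workspace or job using the Compute Environment: reports the
# CUDA toolkit version baked into the image. This should be less than
# or equal to the CUDA version reported by the host driver.
nvcc --version
```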

Because the CUDA installation process often returns unexpected results when attempting to install an exact CUDA version (including the patch version), the fastest route to a working configuration is typically to install the latest available minor release of your required major CUDA version, and then to create a Docker environment variable (ENV) within your Compute Environment that constrains the compatible set of CUDA versions, GPU generations, and NVIDIA drivers.
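A minimal sketch of such a constraint uses the NVIDIA_REQUIRE_CUDA variable honored by the NVIDIA container runtime. The exact constraint string below is an assumption for the TensorFlow 2.1 / CUDA 10.1 example; adapt it to your versions:

```dockerfile
# Refuse to start on hosts that cannot satisfy these constraints.
# Space-separated constraints are ORed; comma-separated terms are ANDed.
ENV NVIDIA_REQUIRE_CUDA="cuda>=10.1 brand=tesla,driver>=418,driver<419"
```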

Need Additional Assistance?

Consult your Domino customer success engineer for guidance on your specific needs. Domino can provide sample configurations that will simplify your configuration process.

Best practices

  1. Build Node

    Domino recommends that you do not use a DGX GPU node as a build node for environments. Instead, opt for a CPU resource as part of your overall Domino architecture (see the example after this list).

  2. Splitting GPUs per Tier

    Domino recommends providing several GPU tiers with a different number of GPUs in each tier (for example, 1-, 2-, 4-, and 8-GPU hardware tiers). Different training jobs can make use of a single GPU or multiple GPUs in parallel, and consuming a whole DGX box for one workload might not be feasible in your environment.

  3. Governance

    After splitting up hardware tiers, access to each tier can be global or limited to specific organizations. Domino recommends ensuring that the right organizations have access to GPU Hardware Tiers, and that others are restricted, both to ensure availability for critical work and to prevent unauthorized use of GPU tiers.
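For item 1, a hedged sketch of dedicating a CPU node to environment builds. The node name is hypothetical, and the dominodatalab.com/build-node label is an assumption about your deployment’s build-node selector; confirm it against your installer configuration:

```shell
# Keep environment image builds off the DGX by designating a CPU
# worker node as the build node.
kubectl label nodes cpu-node-01 dominodatalab.com/build-node=true
```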
