Persistent volume management

Overview

When not in use, Domino project files are stored and versioned in the Domino blob store. When a Domino run is started from a project, the project's files are copied to a Kubernetes persistent volume that is attached to the compute node and mounted in the run.




Definitions

  • Persistent Volume (PV)

    A storage volume in a Kubernetes cluster that can be mounted to pods. Domino dynamically creates persistent volumes to provide local storage for active runs.

  • Persistent Volume Claim (PVC)

    A request made in Kubernetes by a pod for storage. Domino uses these to correctly match a new run with either a new PV or an idle PV that has the project’s files cached.

  • Idle Persistent Volume

    A PV that was used by a previous run and is not currently in use. Idle PVs are either reused for a new run or garbage collected.

  • Storage Class

    A Kubernetes mechanism for defining the type, size, provisioning interface, and other properties of storage volumes. You can list the storage classes in your cluster as shown in the example after this list.
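
As a quick orientation, the standard kubectl commands below list the storage classes, persistent volumes, and persistent volume claims that Kubernetes knows about. This is only a sketch; the storage class name that Domino uses depends on how your deployment was installed.

kubectl get storageclass
kubectl get pv
kubectl get pvc --all-namespaces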




Storage workflow for Jobs

When a user starts a new job, Domino brokers the assignment of a new execution pod to the cluster. This pod has an associated PVC that defines for Kubernetes what type of storage it requires. If an idle PV exists that matches the PVC, Kubernetes mounts that PV on the node it assigns to host the pod, and the job starts. If no suitable idle PV exists, Kubernetes creates a new PV according to the Storage Class.
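
For example, while a job is running you can inspect its execution pod to see which claim and volume it was matched with. This is a sketch only; the compute namespace (shown here as domino-compute) and the claim name are assumptions that depend on your deployment.

# List execution pods and their persistent volume claims in the compute namespace
kubectl -n domino-compute get pods
kubectl -n domino-compute get pvc

# Show which PV a specific claim is bound to
kubectl -n domino-compute get pvc <claim-name> -o jsonpath='{.spec.volumeName}'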

When the job completes, the PV data is written to the Domino File System, and the PV is unmounted and sits idle until it is either reused for the user's next job or garbage collected. By reusing PVs, users who are actively working in a project do not need to repeatedly copy data from the blob store to a PV.

A job will only be matched with either a fresh PV or one previously used by the same project. PVs are not reused across projects.
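
To see which idle volumes are currently eligible for reuse, you can filter PVs on the dominodatalab.com/volume-state label described in the Salvaged volumes section below and sort them by the last-used annotation:

kubectl get pv -l dominodatalab.com/volume-state=available \
  --sort-by='.metadata.annotations.dominodatalab.com/last-used'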


Storage workflow for Workspaces

Workspace volumes are handled differently from volumes for jobs. Workspaces are potentially long-lived development environments that users stop and resume repeatedly without writing data back to the Domino File System each time. As a result, the PV for a workspace is a similarly long-lived resource that stores the user's working data.

These workspace PVs are durably associated with the resumable workspace they are initially created for. Each time that workspace is stopped, the PV is detached and preserved so that it’s available the next time the user starts the workspace. When the workspace starts again, it reattaches its PV and the user will see all of their working data saved during the last session.

Only when a user chooses to initiate a sync will the contents of their project files in the workspace PV be written back to the Domino File System. A resumable workspace PV will only be deleted if the user deletes the associated workspace.
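
As a quick check that a stopped workspace's volume has been preserved, you can list PVs by when they were last used and inspect which claim a particular volume belongs to. The volume name below is a placeholder:

kubectl get pv --sort-by='.metadata.annotations.dominodatalab.com/last-used'
kubectl get pv <workspace-pv-name> -o jsonpath='{.spec.claimRef.name}'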




Resumable Workspace volume backups on AWS

Since the data in resumable workspace volumes is not automatically written back to the Domino File System, there is a risk of lost work if the volume is lost or deleted. When Domino runs on AWS, it safeguards against this by taking EBS snapshots, stored in S3, of the EBS volume backing each workspace PV. If you have accidentally deleted or lost a resumable workspace volume that contains data you want to recover, contact Domino support for assistance in restoring from the snapshot.
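
If you want to locate the underlying EBS volume and its snapshots yourself before contacting support, the sketch below shows one way to do it. The exact field that holds the EBS volume ID depends on whether your cluster uses the in-tree AWS provisioner or the EBS CSI driver, so treat these paths as assumptions to verify against your own PV objects.

# EBS CSI driver: the volume ID is the CSI volume handle
kubectl get pv <pv-name> -o jsonpath='{.spec.csi.volumeHandle}'

# In-tree AWS provisioner: the volume ID is on the awsElasticBlockStore source
kubectl get pv <pv-name> -o jsonpath='{.spec.awsElasticBlockStore.volumeID}'

# List the EBS snapshots taken of that volume
aws ec2 describe-snapshots --filters Name=volume-id,Values=<vol-id>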




Garbage collection

Domino has configurable values to help you tune your cluster to balance performance with cost. The more idle volumes you allow, the more likely it is that users can reuse a volume and avoid copying project files from the blob store. However, this comes at the cost of keeping additional idle PVs provisioned.

By default, Domino will:

  • Limit the total number of idle PVs to 32; a command for counting the volumes currently idle is shown after this list. This can be adjusted by setting the following option in the central config:

    Namespace: common
    Key:       com.cerebro.domino.computegrid.kubernetes.volume.maxIdle
    
  • Terminate any idle PV that has not been used in a certain number of days. This can be adjusted by setting the following option in the central config:

    Namespace: common
    Key:       com.cerebro.domino.computegrid.kubernetes.volume.maxAge
    

    This value is expressed in days. The default is empty, which means idle PVs are kept indefinitely. A value of 7d terminates any idle PV that has been unused for seven days.
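
To see how close your cluster is to the idle-volume limit, you can count the PVs currently labeled as available (see the dominodatalab.com/volume-state label in the Salvaged volumes section below):

kubectl get pv -l dominodatalab.com/volume-state=available --no-headers | wc -l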




Salvaged volumes

When a user's job fails unexpectedly, Domino preserves the volume so that data can be recovered. After a workspace or job ends, claimed PVs are placed into one of the following states, indicated by the dominodatalab.com/volume-state label.

  • available

    If the run ends normally, the underlying PV will be available for future runs.

  • salvaged

    If the run fails, the underlying PV is not eligible for reuse and is held in this state so its data can be salvaged.

Salvaged PVs are not reused automatically by future workspaces or jobs, but they can be manually mounted to a workspace to recover work.

By default, Domino will:

  • Limit the total number of salvaged PVs to 64. This can be adjusted by setting the following option in the central config:

    Namespace: common
    Key:       com.cerebro.domino.computegrid.kubernetes.volume.maxSalvaged
    
  • Terminate any salvaged PV that has not been used in a certain number of days. This can be adjusted by setting the following option in the central config:

    Namespace: common
    Key:       com.cerebro.domino.computegrid.kubernetes.volume.maxSalvagedAge
    

    This value is expressed in days. The default is seven days. A value of 14d terminates any salvaged PV that has been unused for fourteen days.

To recover a salvaged volume:

  1. Find the PV that was attached to your job or workspace; its name appears in the Deployment logs for that job or workspace.
  2. Create a pod attached to the salvaged volume.
  3. Recover the files with your most convenient method (scp, AWS CLI, kubectl cp, etc.)

This script performs Step 2 and prints the appropriate commands in its output. Remember to delete the PVC and PV when you are done; otherwise these resources will continue to consume storage. A manual alternative is sketched below.
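
If you prefer to perform Steps 2 and 3 manually, the following sketch shows one way to do it with plain Kubernetes objects: create a PVC that binds explicitly to the salvaged PV through spec.volumeName, start a small pod that mounts it, and copy the files out with kubectl cp. The namespace, storage class, requested size, and placeholder names here are assumptions; match them to the salvaged PV you found in Step 1.

# If the salvaged PV shows status "Released", clear its old claim reference
# first so a new claim can bind to it (verify this is appropriate for your PV):
kubectl patch pv <salvaged-pv-name> -p '{"spec":{"claimRef":null}}'

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: salvage-recovery
  namespace: domino-compute          # assumption: use your compute namespace
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: <storage-class>  # must match the salvaged PV's storageClassName
  volumeName: <salvaged-pv-name>     # the PV found in Step 1
  resources:
    requests:
      storage: 15Gi                  # must not exceed the PV's capacity
---
apiVersion: v1
kind: Pod
metadata:
  name: salvage-recovery
  namespace: domino-compute
spec:
  containers:
    - name: shell
      image: busybox
      command: ["sleep", "3600"]     # keep the pod alive long enough to copy files
      volumeMounts:
        - name: data
          mountPath: /recovered
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: salvage-recovery
EOF

# Step 3: copy the files out of the mounted volume
kubectl -n domino-compute cp salvage-recovery:/recovered ./recovered

# Clean up so the salvaged resources are not left behind
kubectl -n domino-compute delete pod salvage-recovery
kubectl -n domino-compute delete pvc salvage-recovery
kubectl delete pv <salvaged-pv-name>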




FAQ

How do I see the current PVs in my cluster?

Run the following command to see all current PVs sorted by when they were last used:

kubectl get pv --sort-by='.metadata.annotations.dominodatalab.com/last-used'
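
You can also narrow the list to volumes in a particular state by filtering on the dominodatalab.com/volume-state label described above, for example:

kubectl get pv -l dominodatalab.com/volume-state=salvaged \
  --sort-by='.metadata.annotations.dominodatalab.com/last-used'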

How do I change the size of the storage volume for my jobs or workspaces?

You can set the volume size for new PVs by editing the following central config value:

Namespace: common
Key:       com.cerebro.domino.computegrid.kubernetes.volume.volumesSizeInGB
Value:     Volume size in GB (default 15)
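
This setting applies to newly created PVs; existing volumes keep the size they were provisioned with. To check the capacity of the volumes currently in your cluster, you can list them with their size and status:

kubectl get pv -o custom-columns=NAME:.metadata.name,CAPACITY:.spec.capacity.storage,STATUS:.status.phase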