Datasets Overview

Domino Datasets provide high-performance, versioned, structured filesystem storage in Domino. With Domino Datasets, you can build multiple curated data pipelines in one project and share them with collaborators across their projects.

A Domino Dataset is a series of Snapshots. Each Snapshot is a completely independent state of the Dataset, and represents the contents of a filesystem directory at the time the Snapshot was written. There are two key ways to interact with a Domino Dataset:

  1. Write a new Snapshot to one of your project’s local Datasets.

  2. Read from an available Snapshot of a shared Dataset you have mounted.

Write to a local Dataset

Domino Datasets belong to Domino projects. Permission to read from and write to a Dataset is granted to project contributors, just as with project files. A Dataset that belongs to a project is considered local to that project. To create a new Dataset in your project, click Datasets from the project menu, then click Create New Dataset.


Supply a name and an optional description, then click Upload Contents. The upload page provides four ways to write to your local Dataset.

  1. Browser Upload

    In Domino 3.5+ you can use the Upload Files section to queue up to 50GB or 50,000 individual files for upload through your browser. You can pause the upload and resume it within 24 hours. You can upload directories and subdirectories to preserve your filesystem structure.


  2. CLI Upload

    After installing and configuring the Domino CLI, you can copy and paste the displayed command to upload a directory of files from your local machine to the Dataset. Note that all contents of the directory you specify are written to the Dataset.


    For example, if the Dataset is named test and belongs to the project njablonski/datasets-demo, and the files you want to write to it are in /Users/myUser/data, you would run the following command:

    domino upload-dataset njablonski/datasets-demo/test /Users/myUser/data

    When finished, click Complete. You will then be taken to the Dataset overview, where you should see that a new Snapshot has been written. The new Snapshot will contain exactly those files that were in the folder you uploaded from your local machine.

  3. Upload by Running Script

    Before using this method, you need a script in your project files that is configured to write to the target Dataset. Supply the name of a Bash, Python, or R script and click Start to launch a Job. During the Job, an empty folder will be available at the path shown in Output Directory. At the conclusion of the Job, any files that your script has written to the output directory will be written to your Dataset as a new Snapshot.


    Suppose the Output Directory shown is /domino/datasets/main/output. For the simplest possible example, if there is a file named data.csv in your project files, you could run the following script:

    write-dataset.sh

    # Copy data.csv from the project files into the Dataset output directory.
    cp "$DOMINO_WORKING_DIR/data.csv" /domino/datasets/main/output/

    When the script runs, it copies the data file to the Dataset output directory. Then, when the Job is finished, Domino writes a new Snapshot to the Dataset. The new Snapshot will contain the exact contents of the output directory, which in this case is just the data.csv file.

  4. Upload by Launching a Workspace

    This method works similarly to uploading by running a script. You will have all of the usual options available from your Domino environment for launching a Workspace. When the Workspace is launched, an empty folder will be available at the path shown in Output Directory. When you stop and sync the Workspace, any files that you have written to the output directory will be written to your Dataset as a new Snapshot; see the sketch after this list.
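
Whether you write from a scripted Job or an interactive Workspace, the mechanics are the same: the exact contents of the output directory become the new Snapshot when the Job ends or when you stop and sync the Workspace. The sketch below is a minimal example, assuming a hypothetical local Dataset whose Output Directory is /domino/datasets/main/output and a results folder produced during the session; substitute the path shown on your own Datasets page.

    stage-snapshot.sh

    #!/usr/bin/env bash
    # Hypothetical path: use the Output Directory shown for your Dataset.
    OUTPUT_DIR=/domino/datasets/main/output

    # Copy everything that belongs in the new Snapshot into the output directory.
    cp -r results/. "$OUTPUT_DIR/"

    # The new Snapshot will contain exactly what is listed here, nothing else.
    ls -R "$OUTPUT_DIR"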

There is a configurable limit to the number of Snapshots a Dataset may contain. This limit defaults to 20 Snapshots.

If your Dataset is at this limit, attempting to start an upload with any of the above methods will result in an error message. Before you can write additional Snapshots, you will need to delete old Snapshots or have the limit increased. Contact your Domino administrator for more information.

You can create a maximum of 5 local Datasets per project. If you are setting up a pipeline that requires more than 5 Datasets, create a separate project for each logical task, and have each project import Datasets from the project that precedes it in the pipeline.

Read from a shared Dataset

To access the contents of an existing Dataset Snapshot, you must mount the target Dataset in your project. To mount a Dataset, click Datasets from the project menu, then click Mount Shared Dataset.

Click the Dataset to Mount field to see an autocomplete dropdown of Datasets you have access to. To access a Dataset, you must be an Owner, Contributor, Project Importer, or Results Consumer on the project that contains the Dataset.


There are three different settings under Update Behavior that control which Snapshot of the target Dataset your project will mount. You can mount the latest Snapshot, a tagged Snapshot, or a fixed Snapshot number. When finished, click Mount.

Now, on the Datasets page for your project you will see the Dataset you mounted listed under Shared Datasets.


The Path shown for the Dataset points to the directory in your project’s Runs and Workspaces where you will find the file contents of the mounted Snapshot. When mounted this way, the Dataset is read-only.
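
In a Run or Workspace, you can read the mounted files like any other directory. For example, from a terminal (the mount path and file name below are hypothetical; copy the Path shown on your project’s Datasets page):

    # Hypothetical mount path taken from the Datasets page.
    DATASET_PATH=/domino/datasets/njablonski/datasets-demo/test

    ls "$DATASET_PATH"               # list the files in the mounted Snapshot
    head "$DATASET_PATH/data.csv"    # read a file; the mount is read-only, so writes fail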

Manage Datasets

From the Datasets page of your project, click the name of a local or shared Dataset to open its overview page. At the top of the overview page you will see the Dataset name and description, plus buttons to upload to, rename, or archive the Dataset.

Below the description is a panel with Dataset details. Use the dropdown menu at the top of the panel to select a Snapshot; the panel then displays a list of the files the Snapshot contains, plus some metadata about it.


There are two important actions you can take on a Dataset Snapshot:

  1. Add Tag

    Above the Snapshot selection dropdown you will find a list of tags applied to the Snapshot, followed by a + Add Tag button. Tags can be used to identify a Snapshot when mounting a shared Dataset for input. This allows the Dataset owner to tag a Snapshot for production use, and move the tag to whichever Snapshot is in the desired state as the Dataset changes over time.

  2. Mark for Deletion

    Clicking the Mark for Deletion button in the lower right of the panel marks the currently selected Snapshot for deletion and changes its status. Marked Snapshots can no longer be mounted in Runs by consuming projects. The Snapshot is flagged to a Domino administrator as ready for deletion, but it is not fully deleted until the administrator takes an additional action to delete it. At that point the data is erased and cannot be recovered.
