
Work with Big Data in Domino

When you start a Run, Domino copies your project files to the executor that hosts the Run. By default, after every Run, Domino tries to write all files in the working directory back to the project as a new revision. When you work with large volumes of data, this presents two potential problems:

  1. The number of files written back to the project may exceed the configurable limit. By default, the limit for Domino project files is 10,000 files (see below for a quick way to check your file count).

  2. The time required for the write process is proportional to the size of the data, so it can take a long time when the data is very large.
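
To see how close you are to that limit, you can count the files in your working directory before they are synced. This is a generic shell one-liner, not a Domino-specific command:

find . -type f | wc -l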

There are two solutions to consider for these problems, discussed in detail below. The following table shows the recommended solution for each case.

Case                         | Data size   | # of files | Static / Dynamic | Solution
-----------------------------|-------------|------------|------------------|--------------------------
Large volume of static data  | Unlimited   | Unlimited  | Static           | Domino Datasets
Large volume of dynamic data | Up to 300GB | Unlimited  | Dynamic          | Project Data Compression

Domino datasets

When working on image recognition or image classification deep learning projects, it is common to require a training dataset of thousands of images, with a total size that can easily reach tens of GB. For these projects, the initial training uses a static dataset: the data is not constantly being changed or appended to. Furthermore, the data that is actually used is normally processed into a single large tensor.

You should store your processed data in a Domino Dataset. Datasets can be mounted by other Domino projects, where they are attached as a read-only network filesystem to that project’s Runs and Workspaces.
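
Once a Dataset is mounted, your code can read from it in place like any other directory. As a minimal sketch, assuming a Dataset named image-training-data mounted at /domino/datasets/local/image-training-data (the exact mount path depends on your Domino version and Dataset configuration, and the file and script names here are hypothetical):

# Confirm the mount point and list the Dataset contents
ls /domino/datasets/local/image-training-data

# Read the processed tensor in place; copying it into the working directory
# would cause it to be written back to the project on sync
python train.py --data /domino/datasets/local/image-training-data/training_tensor.npy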

For more information on Datasets, see the Datasets overview.

Project data compression

There might be times when you want to work with logs as raw text files. In that case, new log files are typically being added to the dataset constantly, so your dataset is dynamic. Here, you can encounter both of the potential problems described in the introduction at the same time:

  1. The number of files may exceed the 10,000-file limit.

  2. Preparing and syncing times may be long.

We currently recommend storing your large number of files in a compressed format. If you need the files to be uncompressed during your Run, you can use Domino Compute Environments to prepare them. In the pre-run script, uncompress your files:

tar -xvzf many_files_compressed.tar.gz

Then in the post-run script, you can re-compress the directory:

tar -cvzf many_files_compressed.tar.gz /path/to/directory-or-file
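
To make these two steps concrete, here is a minimal sketch of how the pre- and post-run scripts might fit together, assuming the archive lives in the project root and extracts into a directory named raw_logs (both names are placeholders; adapt them to your project):

# Pre-run script: extract the archive, if present, so the Run sees the raw files
if [ -f many_files_compressed.tar.gz ]; then
    tar -xzf many_files_compressed.tar.gz
fi

# Post-run script: re-compress the directory and remove the raw files,
# so only the single archive is written back to the project
tar -czf many_files_compressed.tar.gz raw_logs
rm -rf raw_logs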

If your compressed file is still very large, the preparing and syncing times may still be long. Consider storing the compressed file in a Domino Dataset to minimize copying times.
