Access data with Dask

When using a Domino on-demand Dask cluster, any data that will be created or modified as part of the interaction needs to go into an external data store.

Warning

On-demand clusters in Domino are ephemeral. Any data stored only on cluster-local storage, and not in an external data store, is lost when the workload and the cluster terminate.

Using Domino Datasets

When you create a Dask cluster attached to a Domino workspace or job, any Domino dataset accessible from the workspace or job is also accessible from all components of the cluster under the same dataset mount path. Your code can therefore use the same file path whether it runs in the workspace or job container or in a Dask task on the cluster.

For example, to read a file you would use the following.

import dask.dataframe as dd

df = dd.read_parquet("/mnt/data/my_dataset/large_dataset.parquet")

Using S3

To access data in Amazon S3 (or an S3-compatible object store) with Dask, you can use any of the libraries you already use (for example, boto3 or s3fs) to download files from S3.

For structured data, you can also read it directly into Dask dataframes or bags. To do so, specify s3:// as the protocol in the path. The following is a basic example.

import dask.dataframe as dd

df = dd.read_parquet("s3://bucket/path/data-*.parquet")
df = dd.read_csv("s3://bucket/path/data-*.csv")

Note

Dask uses `s3fs` and the underlying `boto3` for S3 access, so make sure that the optional `s3fs` package is installed in both your base Dask environment and your execution environment when you use Dask to access S3 data.

Additional parameters (for example, authentication keys) can be passed through storage_options. For full documentation of the S3-specific options (including loading data from S3-compatible services), refer to the relevant section of the Dask documentation.
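As a minimal sketch of passing credentials through storage_options, the helper below builds the dictionary from the standard AWS environment variables. The helper names and the example endpoint are hypothetical; `key`, `secret`, and `client_kwargs` are option names accepted by s3fs, but verify them against the version installed in your environment.

```python
import os


def s3_storage_options(endpoint_url=None):
    """Build a storage_options dict for s3fs from environment variables.

    Assumes AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set in the
    environment (hypothetical setup; adapt to how you store secrets).
    """
    options = {
        "key": os.environ["AWS_ACCESS_KEY_ID"],
        "secret": os.environ["AWS_SECRET_ACCESS_KEY"],
    }
    if endpoint_url is not None:
        # Point s3fs at an S3-compatible service instead of AWS.
        options["client_kwargs"] = {"endpoint_url": endpoint_url}
    return options


def read_bucket_parquet(path, endpoint_url=None):
    """Read Parquet data from S3 into a Dask dataframe (hypothetical helper)."""
    import dask.dataframe as dd  # imported lazily so the helper above works without Dask

    return dd.read_parquet(path, storage_options=s3_storage_options(endpoint_url))
```

A call might then look like read_bucket_parquet("s3://bucket/path/data-*.parquet"). Reading the keys from the environment keeps them out of your code and notebooks.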

AWS credential propagation

When AWS credential propagation is enabled for your deployment, temporary AWS credentials corresponding to the roles enabled for you in your company identity provider are automatically available on all Dask workers and in your execution.

The credentials are refreshed automatically and are available under a profile name corresponding to each role in an AWS credential file. The location of the file is stored in the AWS_SHARED_CREDENTIALS_FILE environment variable, so s3fs and boto3 find it automatically.

To authenticate with a specific role, specify the name of the profile that corresponds to that role. You can do the following:

import dask.dataframe as dd

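df = dd.read_parquet(
    "s3://bucket/path/data-*.parquet",
    storage_options={
        # Name of the profile that corresponds to the role you want to use.
        # Some s3fs releases may expect the key "profile" instead; check the
        # documentation for your installed version.
        "profile_name": "my-role-profile",
    },
)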
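To find out which profile names are available, one option is to parse the credential file yourself. The following is a sketch that assumes the file uses the standard INI layout of AWS credential files; the function name list_aws_profiles is hypothetical.

```python
import configparser
import os


def list_aws_profiles(credentials_path=None):
    """Return the profile names found in an AWS shared credentials file.

    If no path is given, the location is taken from the
    AWS_SHARED_CREDENTIALS_FILE environment variable.
    """
    path = credentials_path or os.environ.get("AWS_SHARED_CREDENTIALS_FILE")
    if not path:
        return []
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is ignored and yields no sections
    # Each [section] header in the file is one profile.
    return parser.sections()
```

Running print(list_aws_profiles()) in your workspace lists the profiles you can pass in storage_options.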

Using other data stores

Similar to S3, Dask can load data from Microsoft Azure Storage, Google Cloud Storage, HDFS, HTTP, NFS, and your local file system.

Detailed documentation describing the protocol to use, the required packages, and the available storage_options can be found in the Remote data section of the Dask documentation.
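As a rough illustration of how the protocol in a path selects the backend, the sketch below maps common protocol prefixes to the package that typically provides them. The mapping and the helper name backend_package are a summary written for this example, not an official list, so confirm the details in the Dask documentation before relying on them.

```python
from urllib.parse import urlsplit

# Protocol prefix -> package that commonly provides the backend.
# This table is an assumption-based summary; verify it for your Dask version.
BACKEND_PACKAGES = {
    "s3": "s3fs",
    "gcs": "gcsfs",
    "gs": "gcsfs",
    "abfs": "adlfs",
    "az": "adlfs",
    "adl": "adlfs",
    "hdfs": "pyarrow",
    "http": "aiohttp",
    "https": "aiohttp",
}


def backend_package(url):
    """Return the package expected for a data URL, or None for local paths."""
    scheme = urlsplit(url).scheme
    return BACKEND_PACKAGES.get(scheme)
```

For instance, backend_package("gcs://bucket/path/data-*.csv") points to gcsfs, which would need to be installed in both your base Dask environment and your execution environment, as with s3fs for S3.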