Use PySpark in Jupyter Workspaces

You can configure a Domino workspace to launch a Jupyter notebook with a connection to your Spark cluster.

This allows you to operate the cluster interactively from Jupyter with PySpark.

The instructions for configuring a PySpark workspace are below. To use them, you must have a Domino environment that meets the following prerequisites:

The environment must use one of the Domino Standard Environments as its base image.
The necessary binaries and configurations for connecting to your Spark cluster must be installed in the environment. See the provider-specific guides for setting up the environment.

Add a PySpark Workspace option to your Environment

From the Domino main menu, click Environments.
Click the name of an environment that meets the prerequisites listed previously. It must use a Domino standard base image and already have the necessary binaries and configuration files installed for connecting to your spark cluster.
On the environment overview page, click Edit Definition.
In the Pluggable Workspace Tools field, paste the following YAML configuration.
pyspark: title: "PySpark" start: [ "/var/opt/workspaces/pyspark/start" ] iconUrl: "https://raw.githubusercontent.com/dominodatalab/workspace-configs/develop/workspace-logos/PySpark.png" httpProxy: port: 8888 rewrite: false internalPath: "/{{#if pathToOpen}}tree/{{pathToOpen}}{{/if}}" supportedFileExtensions: [ ".ipynb" ]
pyspark: title: "PySpark" start: [ "/var/opt/workspaces/pyspark/start" ] iconUrl: "https://raw.githubusercontent.com/dominodatalab/workspace-configs/develop/workspace-logos/PySpark.png" httpProxy: port: 8888 rewrite: false internalPath: "/{{#if pathToOpen}}tree/{{pathToOpen}}{{/if}}" supportedFileExtensions: [ ".ipynb" ]
When finished, the field should look like this:
Click Build to apply the changes and build a new version of the environment. Upon a successful build, the environment is ready for use.

Launching PySpark Workspaces

Open the project you want to use a PySpark workspace in.
Open the project settings, then follow the provider-specific instructions from the Hadoop and Spark overview on setting up a project to work with an existing Spark connection environment. This will involve enabling YARN integration in the project settings.
On the Hardware & Environment tab of the project settings, choose the environment you added a PySpark configuration to in the previous section.
After the previous settings are applied, you can launch a PySpark workspace from the workspaces dashboard.