Use PySpark in Jupyter Workspaces

You can configure a Domino Workspace to launch a Jupyter notebook with a connection to your Spark cluster.

This lets you work with the cluster interactively from Jupyter using PySpark.
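For example, once such a Workspace is running, a notebook cell can create a SparkSession and query the cluster directly. The following is a minimal sketch, assuming a YARN-managed cluster; the master setting, app name, and sample data are illustrative, not part of Domino's configuration:

    # Minimal sketch of interactive PySpark use from a Jupyter cell.
    # Assumes the Spark binaries and cluster configuration described
    # below are already in place; "yarn" as the master is illustrative.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("domino-pyspark-example")  # hypothetical app name
        .master("yarn")                     # assumes a YARN-managed cluster
        .getOrCreate()
    )

    # Run a small DataFrame query to confirm the session is live.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    df.groupBy("label").count().show()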

The instructions for configuring a PySpark Workspace are below. To use them, you must have a Domino environment that meets the following prerequisites:

  • The environment must use one of the Domino Standard Environments as its base image.

  • The necessary binaries and configuration files for connecting to your Spark cluster must be installed in the environment. See the provider-specific guides for setting up the environment; a quick way to check them from a running Workspace is sketched after this list.
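You can sanity-check these prerequisites from a Workspace terminal or notebook. This is a minimal sketch, assuming your provider-specific setup uses the conventional SPARK_HOME and HADOOP_CONF_DIR variables for a YARN connection; yours may differ:

    # Sanity-check the environment for the pieces PySpark needs.
    # SPARK_HOME and HADOOP_CONF_DIR are the conventional variables for
    # a YARN connection; your provider-specific setup may use others.
    import os
    import shutil

    for var in ("SPARK_HOME", "HADOOP_CONF_DIR"):
        print(f"{var} = {os.environ.get(var, '<not set>')}")

    # spark-submit should be on the PATH if the binaries are installed.
    print("spark-submit:", shutil.which("spark-submit") or "<not found>")

    # The pyspark package should import cleanly.
    import pyspark
    print("pyspark version:", pyspark.__version__)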

Add a PySpark Workspace option to your environment

  1. From the Domino main menu, click Environments.

  2. Click the name of an environment that meets the prerequisites listed previously. It must use a Domino Standard Environment as its base image and already have the necessary binaries and configuration files installed for connecting to your Spark cluster.

  3. On the environment overview page, click Edit Definition.

  4. In the Pluggable Workspace Tools field, paste the following YAML configuration.

    pyspark:
       title: "PySpark"
       start: [ "/var/opt/workspaces/pyspark/start" ]
       iconUrl: "https://raw.githubusercontent.com/dominodatalab/workspace-configs/develop/workspace-logos/PySpark.png"
       httpProxy:
          port: 8888
          internalPath: "/{{ownerUsername}}/{{projectName}}/{{sessionPathComponent}}/{{runId}}/{{#if pathToOpen}}tree/{{pathToOpen}}{{/if}}"
          rewrite: false
          requireSubdomains: false
       supportedFileExtensions: [ ".ipynb" ]

    When finished, the field should look like this:

    (Screenshot: the Pluggable Workspace Tools field populated with the PySpark configuration.)

  5. Click Build to apply the changes and build a new version of the environment. Upon a successful build, the environment is ready for use.

Launch a PySpark Workspace

  1. Open the project in which you want to use a PySpark Workspace.

  2. Open the project settings, then follow the provider-specific instructions in the Hadoop and Spark overview for setting up a project to work with your existing Spark connection environment. This includes enabling YARN integration in the project settings.

  3. On the Hardware & Environment tab of the project settings, choose the environment you added a PySpark configuration to in the previous section.

  4. After the previous settings are applied, you can launch a PySpark Workspace from the Workspaces dashboard. A quick connectivity check to run in the new Workspace is sketched after this list.

    (Screenshot: launching a PySpark Workspace from the Workspaces dashboard.)
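Once the Workspace opens, you can confirm that the notebook is talking to the cluster by running a trivial distributed job. This is a minimal smoke-test sketch; whether the Workspace start script pre-creates a SparkSession is an assumption here, so the sketch creates one if needed:

    # Smoke test for a freshly launched PySpark Workspace. getOrCreate()
    # reuses a session if the start script already made one (an
    # assumption, not a documented guarantee).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("workspace-smoke-test").getOrCreate()
    print("master:", spark.sparkContext.master)

    # Distribute a trivial job to confirm executors respond.
    total = spark.sparkContext.parallelize(range(1000)).sum()
    print("sum of 0..999 =", total)  # expect 499500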
