
Hadoop and Spark Overview

Apache Hadoop is a collection of open source cluster computing tools that supports popular applications for data science at scale, such as Spark.

To interact with Hadoop from your Domino executors, configure your Domino environment with the necessary software dependencies and credentials. Domino supports most providers of Hadoop solutions, including MapR, Cloudera, and Amazon EMR. After a Domino environment is set up to connect to your cluster, Domino projects can use the environment to work with Hadoop applications.

Use a Hadoop-enabled environment in your Domino project

If your Domino administrators have already created an environment for connecting to a Hadoop cluster, follow the provider-specific subsections of the setup instructions to use that environment in your Domino project:

  • Configure a Domino project for use with a Cloudera CDH5 cluster

  • Configure a Domino project for use with an Amazon EMR cluster

  • Configure a Domino project for use with a MapR cluster

  • Configure a Domino project for use with a Hortonworks cluster

After your project is set up to use the environment, you can execute code in your Domino Runs that connects to the cluster for Spark, HDFS, or Hive functionality.
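As a minimal sketch of what such code might look like, the snippet below assembles the Spark settings a cluster-backed session would typically need. It assumes the Domino environment already contains Spark binaries and the cluster's client configuration, and that the cluster uses YARN; the queue name and file path are illustrative placeholders.

```python
# A minimal sketch, assuming the Domino environment already contains Spark
# binaries and the cluster's client configuration (e.g. HADOOP_CONF_DIR),
# and that the cluster uses YARN. Queue name and settings are illustrative.

def hadoop_session_conf(app_name, queue="default"):
    """Assemble the Spark settings a cluster-backed session would need."""
    return {
        "spark.app.name": app_name,
        "spark.master": "yarn",                      # submit to the cluster's YARN
        "spark.submit.deployMode": "client",         # driver runs on the Domino executor
        "spark.yarn.queue": queue,
        "spark.sql.catalogImplementation": "hive",   # read Hive tables via the metastore
    }

conf = hadoop_session_conf("domino-hadoop-example")

# With PySpark installed in the environment, the settings would be applied as:
# from pyspark.sql import SparkSession
# builder = SparkSession.builder
# for key, value in conf.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
# df = spark.read.text("hdfs:///path/to/file.txt")   # read from cluster HDFS
```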

Set up Domino to connect to a new Hadoop cluster

To connect to your existing Hadoop cluster from Domino, you must create a Domino environment with the necessary dependencies installed. Some of these dependencies, including binaries and configuration files, will come directly from the cluster itself. Others will be external software dependencies like Java and Spark, and you will need to match the version you install in the environment to the version running on the cluster.
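Because a mismatch between, say, the Spark version installed in the environment and the one running on the cluster is a common failure mode, a small sanity check along these lines can be useful. This is only a sketch; the version strings shown are placeholders.

```python
def versions_match(cluster_version, env_version):
    """Compare the major.minor components of two version strings (e.g. '2.4.3')."""
    def major_minor(version):
        # Take only the first two numeric components.
        parts = version.split(".")
        return tuple(int(p) for p in parts[:2])
    return major_minor(cluster_version) == major_minor(env_version)

# Versions should agree at least on their major.minor line:
versions_match("2.4.3", "2.4.0")   # True: both on the 2.4 line
versions_match("2.4.3", "3.0.1")   # False: different major version
```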

The basic steps for setting up an environment to connect to your cluster are:

  1. Gather binaries and configuration files from your cluster

  2. Gather dependencies from external sources, like Java JDKs and Spark binaries

  3. Upload all dependencies to a Domino project, to make them accessible to the Domino environment builder

  4. Author a new Domino environment that pulls from the Domino project, then installs and configures all required dependencies

For Domino admins setting up a Domino environment to connect to a new cluster, read the full provider-specific setup guides:

  • Connect to a Cloudera CDH5 Cluster from Domino

  • Connect to an Amazon EMR cluster from Domino

  • Connect to a MapR cluster from Domino

  • Connect to a Hortonworks cluster from Domino

Additional capabilities

Domino also supports running Spark on a Domino executor in local mode, querying Hive tables with JDBC, and authenticating to clusters with Kerberos. See the following guides for more information.

  • Kerberos Authentication

  • Run Local Spark on a Domino Executor

  • Interactive PySpark notebooks
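To illustrate the Hive-over-JDBC and Kerberos options together, the helper below builds a HiveServer2 JDBC URL, optionally carrying a Kerberos service principal. The host, database, and principal values are placeholders; an actual connection would additionally require a JDBC driver or a client library, as covered in the guides above.

```python
def hive_jdbc_url(host, port=10000, database="default", principal=None):
    """Build a HiveServer2 JDBC connection URL.

    On a Kerberized cluster, HiveServer2 expects the service principal
    as a ';principal=...' parameter appended to the URL.
    """
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if principal is not None:
        url += f";principal={principal}"
    return url

# Unsecured cluster (host is a placeholder):
hive_jdbc_url("hive.example.com")
# -> 'jdbc:hive2://hive.example.com:10000/default'

# Kerberized cluster (principal is a placeholder):
hive_jdbc_url("hive.example.com", principal="hive/_HOST@EXAMPLE.COM")
# -> 'jdbc:hive2://hive.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM'
```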

Copyright © 2022 Domino Data Lab. All rights reserved.