Connect to a Cloudera CDH5 cluster from Domino

Domino supports connecting to a Cloudera CDH5 cluster through the addition of cluster-specific binaries and configuration files to your Domino environment.

At a high level, the process is as follows:

  1. Connect to your CDH5 edge or gateway node and gather the required binaries and configuration files, then download them to your local machine.

  2. Upload the gathered files into a Domino project to allow access by the Domino environment builder.

  3. Create a new Domino environment that uses the uploaded files to enable connections to your cluster.

  4. Enable YARN integration for the Domino projects that you want to use with the CDH5 cluster.

Domino supports the following types of connections to a CDH5 cluster:

  • FS shell

  • spark2-shell

  • spark2-submit

  • pyspark

  • YARN shell

Gather the required binaries and configuration files

You will find most of the files for setting up your Domino environment on your CDH5 edge or gateway node. To get started, connect to the edge node via SSH, then follow the steps below.

  1. Create a directory named hadoop-binaries-configs at /tmp.

    mkdir /tmp/hadoop-binaries-configs
  2. Create the following subdirectories inside /tmp/hadoop-binaries-configs/.

    mkdir /tmp/hadoop-binaries-configs/configs
    
    mkdir /tmp/hadoop-binaries-configs/parcels
  3. Optional: If your cluster uses Kerberos authentication, create the following subdirectory in /tmp/hadoop-binaries-configs/.

    mkdir /tmp/hadoop-binaries-configs/kerberos

    Then, copy the krb5.conf Kerberos configuration file from /etc/ to /tmp/hadoop-binaries-configs/kerberos.

    cp /etc/krb5.conf /tmp/hadoop-binaries-configs/kerberos/
  4. Copy the CDH and SPARK2 directories from /opt/cloudera/parcels/ to /tmp/hadoop-binaries-configs/parcels/. These directory names end in a version number, so replace <version> in the commands below with the values from your cluster.

    cp -R /opt/cloudera/parcels/CDH-<version>/ /tmp/hadoop-binaries-configs/parcels/
    cp -R /opt/cloudera/parcels/SPARK2-<version>/ /tmp/hadoop-binaries-configs/parcels/
  5. Copy the hadoop, hive, spark, and spark2 directories from /etc/ to /tmp/hadoop-binaries-configs/configs/.

    cp -R /etc/hadoop /tmp/hadoop-binaries-configs/configs/
    cp -R /etc/hive /tmp/hadoop-binaries-configs/configs/
    cp -R /etc/spark2 /tmp/hadoop-binaries-configs/configs/
    cp -R /etc/spark /tmp/hadoop-binaries-configs/configs/
  6. On the edge node, run the following command to identify the version of Java running on the cluster.

    java -version

    Then download the JDK .tar.gz archive that matches that version from the Oracle downloads page. The filename will have a pattern like the following.

    jdk-8u211-linux-x64.tar.gz

    Keep this JDK handy on your local machine for use in a future step.

  7. Compress the /tmp/hadoop-binaries-configs/ directory to a gzip archive.

    cd /tmp
    
    tar -zcf hadoop-binaries-configs.tar.gz hadoop-binaries-configs

    When finished, use SCP to download the archive to your local machine.

  8. On your local machine, extract the archive, add a java subdirectory, then copy the JDK archive you downloaded earlier into the java subdirectory.

    tar xzf hadoop-binaries-configs.tar.gz
    
    mkdir hadoop-binaries-configs/java
    
    cp jdk-8u211-linux-x64.tar.gz hadoop-binaries-configs/java/
  9. When finished, your hadoop-binaries-configs directory should have the following structure.

    hadoop-binaries-configs/
      ├── configs/
      │     ├── hadoop/
      │     ├── hive/
      │     ├── spark/
      │     └── spark2/
      ├── java/
      │     └── jdk-8u211-linux-x64.tar.gz
      ├── parcels/
      │     ├── CDH-<version>/
      │     └── SPARK2-<version>/
      └── kerberos/  # optional
            └── krb5.conf
  10. If your directory contains all the required files, you can now compress it to a gzip archive again in preparation for uploading to Domino in the next step.

    tar -zcf hadoop-binaries-configs.tar.gz hadoop-binaries-configs
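Before compressing, it can help to confirm the layout programmatically. The following is a minimal sketch, not part of Domino: check_layout is a hypothetical helper name, and the list of expected paths simply mirrors the structure shown above. It treats kerberos/ as optional.

```shell
# Hypothetical helper (not provided by Domino): report anything missing
# from a hadoop-binaries-configs directory laid out as described above.
check_layout() (
    root="$1"
    status=0
    for sub in configs/hadoop configs/hive configs/spark configs/spark2 java parcels; do
        if [ ! -d "$root/$sub" ]; then
            echo "missing: $sub"
            status=1
        fi
    done
    # Version-dependent names are matched with globs; $pattern is left
    # unquoted on purpose so the shell expands it.
    for pattern in "java/jdk-*.tar.gz" "parcels/CDH-*" "parcels/SPARK2-*"; do
        if ! ls -d "$root"/$pattern >/dev/null 2>&1; then
            echo "missing: $pattern"
            status=1
        fi
    done
    # kerberos/ is optional, so krb5.conf is only required when it exists.
    if [ -d "$root/kerberos" ] && [ ! -f "$root/kerberos/krb5.conf" ]; then
        echo "missing: kerberos/krb5.conf"
        status=1
    fi
    # The function body is a subshell, so exit only leaves the function.
    exit "$status"
)

# Example: check_layout hadoop-binaries-configs && echo "layout looks complete"
```

The helper prints one line per missing item and returns non-zero if anything is absent, so it can gate the tar command in a script.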

Upload the binaries and configuration files to Domino

Use the following procedure to upload the archive you created in the previous step to a public Domino project. This will make the file available to the Domino environment builder.

  1. Log in to Domino, then create a new public project.

  2. Open the Files page for the new project, then click to browse for files and select the archive you created in the previous section. Then click Upload.

  3. After the archive has been uploaded, click the gear menu next to it on the Files page, then right-click Download and click Copy Link Address. Save the copied URL in your notes, as you will need it in the next step.

    After you have recorded the download URL of the archive, you’re ready to build a Domino environment for connecting to your CDH5 cluster.
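The environment build in the next section extracts the archive and expects everything to sit under a top-level hadoop-binaries-configs/ directory. As a rough sketch, the check below lists an archive's entries without extracting them and fails if any entry falls outside that directory. archive_has_only_top_dir is a hypothetical helper name, and the download command in the trailing comment is only an illustration using the placeholder URL.

```shell
# Hypothetical helper: succeed only if every entry in a gzip archive
# sits under the expected top-level directory.
archive_has_only_top_dir() (
    archive="$1"
    top="$2"
    # tar -tzf lists entries without extracting them.
    listing=$(tar -tzf "$archive") || exit 1
    [ -n "$listing" ] || exit 1
    # grep -v prints entries that do NOT start with "$top/";
    # if it prints anything, the archive has stray entries.
    if printf '%s\n' "$listing" | grep -v "^$top/" >/dev/null; then
        exit 1
    fi
    exit 0
)

# Example (substitute the URL you recorded):
#   wget --no-check-certificate <paste-your-domino-download-url-here> -O /tmp/check.tar.gz
#   archive_has_only_top_dir /tmp/check.tar.gz hadoop-binaries-configs && echo "archive OK"
```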

Create a Domino environment for connecting to CDH5

  1. Click Environments from the Domino main menu, then click Create Environment.

  2. Give the environment an informative name, then choose a base environment that includes the version of Python that is installed on the nodes of your CDH5 cluster. Most Linux distributions ship with Python 2.7 by default, so you will see the Domino Analytics Distribution for Python 2.7 used as the base image in the following examples. Click Create when finished.

  3. After creating the environment, click Edit Definition. Copy the following example into your Dockerfile Instructions, then be sure to edit it wherever necessary with values specific to your deployment and cluster.

    In this Dockerfile, wherever you see a hyphenated placeholder enclosed in angle brackets like <paste-your-domino-download-url-here>, be sure to replace it with the corresponding value you recorded in previous steps.

    You may also need to edit commands that follow to match downloaded filenames.

USER root

# Give user ubuntu ability to sudo as any user including root
RUN echo "ubuntu ALL=(ALL:ALL) NOPASSWD: ALL" >> /etc/sudoers

# Set up directories
RUN mkdir -p /opt/cloudera/parcels && \
    mkdir /tmp/domino-hadoop-downloads && \
    mkdir /usr/java

# Download the binaries and configs archive you uploaded to Domino.
# The archive should contain the following:
# - CDH and Spark2 parcel directories in a 'parcels' sub-directory
# - Java installation tar file in a 'java' sub-directory
# - krb5.conf in a 'kerberos' sub-directory (Kerberos clusters only)
# - hadoop, hive, spark2 and spark config directories in a 'configs' sub-directory
RUN wget --no-check-certificate <paste-your-domino-download-url-here> -O /tmp/domino-hadoop-downloads/hadoop-binaries-configs.tar.gz && \
    tar xzf /tmp/domino-hadoop-downloads/hadoop-binaries-configs.tar.gz -C /tmp/domino-hadoop-downloads/

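# Install the Kerberos client and update the Kerberos configuration file.
# Omit the cp command if your cluster does not use Kerberos and you did not create the 'kerberos' sub-directory.
RUN apt-get update && \
    apt-get -y install krb5-user telnet && \
    cp /tmp/domino-hadoop-downloads/hadoop-binaries-configs/kerberos/krb5.conf /etc/krb5.conf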

# Install the version of Java that matches the Hadoop cluster and update environment variables.
# Your JDK may have a different filename and directory name depending on your cluster's version of Java.
RUN tar xzf /tmp/domino-hadoop-downloads/hadoop-binaries-configs/java/jdk-8u211-linux-x64.tar.gz -C /usr/java
ENV JAVA_HOME=/usr/java/jdk1.8.0_211
RUN echo "export JAVA_HOME=/usr/java/jdk1.8.0_211" >> /home/ubuntu/.domino-defaults && \
    echo "export PATH=$JAVA_HOME/bin:$PATH" >> /home/ubuntu/.domino-defaults

# Install CDH hadoop-client binaries from cloudera ubuntu trusty repository.
# This example installs client binaries for CDH version 5.15.
# Update these commands with the CDH version that matches your cluster.
RUN echo "deb [arch=amd64] http://archive.cloudera.com/cdh5/ubuntu/trusty/amd64/cdh trusty-cdh5.15.0 contrib" >> /etc/apt/sources.list.d/cloudera.list && \
    echo "deb-src http://archive.cloudera.com/cdh5/ubuntu/trusty/amd64/cdh trusty-cdh5.15.0 contrib" >> /etc/apt/sources.list.d/cloudera.list && \
    wget http://archive.cloudera.com/cdh5/ubuntu/trusty/amd64/cdh/archive.key -O /tmp/domino-hadoop-downloads/archive.key && \
    apt-key add /tmp/domino-hadoop-downloads/archive.key && \
    apt-get update && \
    apt-get -y -t trusty-cdh5.15.0 install zookeeper && \
    apt-get -y -t trusty-cdh5.15.0 install hadoop-client

# Move the CDH and Spark2 parcels to the correct directories and create symlinks.
# Note that the version strings in your directory names may differ from the examples below.
RUN mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/parcels/CDH-5.15.0-1.cdh5.15.0.p0.21 /opt/cloudera/parcels/ && \
    mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809 /opt/cloudera/parcels/ && \
    ln -s /opt/cloudera/parcels/CDH-5.15.0-1.cdh5.15.0.p0.21 /opt/cloudera/parcels/CDH && \
    ln -s /opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809 /opt/cloudera/parcels/SPARK2

# Back up the image's existing /etc/hadoop, then move the hadoop, hive, spark2 and spark configurations into /etc
RUN mv /etc/hadoop /tmp/domino-hadoop-downloads/hadoop-binaries-configs/configs/hadoop-etc-local.backup && \
    mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/configs/hadoop /etc/hadoop && \
    mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/configs/hive /etc/hive && \
    mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/configs/spark2 /etc/spark2 && \
    mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/configs/spark /etc/spark

# Create alternatives for the hadoop configurations. The directory names must match those on your edge node.
# Example: in the command 'update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.cloudera.yarn 55',
# make sure that /etc/hadoop/conf.cloudera.yarn has the same name as the corresponding directory on your edge node.
# On some CDH5 edge nodes, it is named something like /etc/hadoop/conf.cloudera.yarn_
RUN update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.cloudera.yarn 55 && \
    update-alternatives --install /etc/hive/conf hive-conf /etc/hive/conf.cloudera.hive 55 && \
    update-alternatives --install /etc/spark2/conf spark2-conf /etc/spark2/conf.cloudera.spark2_on_yarn 55 && \
    update-alternatives --install /etc/spark/conf spark-conf /etc/spark/conf.cloudera.spark_on_yarn 55

# These instructions are for Spark2.
# Create alternatives for the Spark2 binaries, and symlink pyspark to pyspark2.
RUN update-alternatives --install /usr/bin/spark2-shell spark2-shell /opt/cloudera/parcels/SPARK2/bin/spark2-shell 55 && \
    update-alternatives --install /usr/bin/spark2-submit spark2-submit /opt/cloudera/parcels/SPARK2/bin/spark2-submit 55 && \
    update-alternatives --install /usr/bin/pyspark2 pyspark2 /opt/cloudera/parcels/SPARK2/bin/pyspark2 55 && \
    ln -s /usr/bin/pyspark2 /usr/bin/pyspark

# Update SPARK and HADOOP environment variables. Make sure py4j file name is correct per your edgenode
ENV SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2
RUN echo "export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop" >> /home/ubuntu/.domino-defaults && \
    echo "export HADOOP_CONF_DIR=/etc/hadoop/conf" >> /home/ubuntu/.domino-defaults && \
    echo "export YARN_CONF_DIR=/etc/hadoop/conf" >> /home/ubuntu/.domino-defaults && \
    echo "export SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2" >> /home/ubuntu/.domino-defaults && \
    echo "export SPARK_CONF_DIR=/etc/spark2/conf" >> /home/ubuntu/.domino-defaults && \
    echo "export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip" >> /home/ubuntu/.domino-defaults

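# Move spark-defaults.conf out of the conf directory (the Pre Run Script appends it back)
# and make the conf directory writable so that script can modify it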
RUN mv /etc/spark2/conf/spark-defaults.conf /etc/spark2/ && \
    chmod 777 /etc/spark2/conf.cloudera.spark2_on_yarn

# Copy hive-site.xml to /etc/spark2/conf to access hive tables from Spark2.
RUN cp /etc/spark2/conf/yarn-conf/hive-site.xml /etc/spark2/conf/

  4. Scroll down to the Pre Run Script field and add the following lines.

    cat /etc/spark2/spark-defaults.conf >> /etc/spark2/conf/spark-defaults.conf
    sed -i.bak '/spark.ui.port\=0/d' /etc/spark2/conf/spark-defaults.conf
  5. Scroll down and click Advanced to expand additional fields. Add the following line to the Post Setup Script field.

    echo "export YARN_CONF_DIR=/etc/hadoop/conf" >> /home/ubuntu/.bashrc
  6. Click Build when finished editing the Dockerfile instructions. If the build completes successfully, you are ready to try using the environment.
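
One place where the Dockerfile commonly needs editing is the JDK step, because the archive filename and the directory it unpacks to both depend on the Java version. As a small sketch, the function below derives the unpacked directory name from an Oracle JDK 8 archive name. jdk8_home_dir is a hypothetical helper, and it assumes the jdk-8u<update>-... naming pattern shown earlier. Other JDK release lines use different names and are rejected.

```shell
# Hypothetical helper: map an Oracle JDK 8 archive name such as
# jdk-8u211-linux-x64.tar.gz to the directory it unpacks to (jdk1.8.0_211).
# Assumes the jdk-8u<update>-... naming pattern; fails for anything else.
jdk8_home_dir() (
    file=$(basename "$1")
    # Capture the digits between "jdk-8u" and the next hyphen.
    update=$(printf '%s\n' "$file" | sed -n 's/^jdk-8u\([0-9][0-9]*\)-.*$/\1/p')
    [ -n "$update" ] || exit 1
    printf 'jdk1.8.0_%s\n' "$update"
)

# Example: the JAVA_HOME value for the Dockerfile would be
#   /usr/java/$(jdk8_home_dir jdk-8u211-linux-x64.tar.gz)
```

Running the helper on your own archive name gives the value to use in the tar, ENV JAVA_HOME, and export JAVA_HOME lines.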

Configure a Domino project for use with a CDH5 cluster

This procedure assumes that an environment with the necessary client software has been created according to the instructions above. Ask your Domino admin for access to such an environment.

  1. Open the Domino project you want to use with your CDH5 cluster, then click Settings from the project menu.

  2. On the Integrations tab, click to select YARN integration from the Apache Spark panel, then click Save. You do not need to edit any of the fields in this section.

  3. If your cluster uses Kerberos authentication, configure credentials at the user level or project level before attempting to use the environment. If you followed the instructions above to create your environment, your Kerberos configuration file has already been added to it.

  4. On the Hardware & Environment tab, change the project default environment to the one with the cluster’s binaries and configuration files installed.

You are now ready to start Runs from this project that interact with your CDH5 cluster.
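
As a quick first check from a workspace or Run, you might confirm that the client commands are on the PATH before submitting real work. This is a sketch under assumptions: missing_commands is a hypothetical helper, and the command names in the example comment assume the hadoop-client package and the Spark2 alternatives from the Dockerfile above were installed as shown.

```shell
# Hypothetical helper: print each named command that is not on PATH
# and return non-zero if any is missing.
missing_commands() (
    status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "not found: $cmd"
            status=1
        fi
    done
    # The function body is a subshell, so exit only leaves the function.
    exit "$status"
)

# Example, assuming the environment was built as described above:
#   missing_commands hadoop yarn spark2-shell spark2-submit pyspark
```

A clean result only shows that the binaries are installed. Connectivity to the cluster, and Kerberos credentials if applicable, still need to be verified with a real command against the cluster.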

Copyright © 2022 Domino Data Lab. All rights reserved.