
Use Apache Airflow with Domino

Data science projects often require multiple steps to go from raw data to useful data products. These steps tend to be sequential, and involve things like:

  • sourcing data

  • cleaning data

  • processing data

  • training models

After you understand the steps necessary to deliver results from your work, it’s useful to automate them as a repeatable pipeline. Domino has the ability to schedule Jobs, but for more complex pipelines you can pair Domino with an external scheduling system like Apache Airflow.

This topic describes how to integrate Airflow with Domino using the python-domino package.

Get started with Airflow

Airflow is an open source platform to author, schedule, and monitor pipelines of programmatic tasks. As a user, you can define pipelines with code and configure the Airflow scheduler to execute the underlying tasks. The Airflow UI can be used to visualize, monitor, and troubleshoot pipelines.

If you are new to Airflow, read the Airflow QuickStart to set up your own Airflow server.

There are many options for configuring your Airflow server. For pipelines with tasks that can run in parallel, you will need Airflow’s LocalExecutor mode, which executes multiple tasks and dependencies at the same time. Airflow keeps a record of every task it schedules and executes in a database, so you must install and configure a SQL database to use LocalExecutor mode.

Read A Guide On How To Build An Airflow Server/Cluster to learn more about setting up LocalExecutor mode.
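
Enabling LocalExecutor mode typically comes down to two settings in airflow.cfg, sketched below. The PostgreSQL connection string is only an example placeholder, so substitute your own database, and note that in newer Airflow versions the connection setting lives in a [database] section instead of [core].

[core]
# Run tasks in parallel on the local machine instead of one at a time
executor = LocalExecutor
# LocalExecutor requires a real SQL database; this connection string is an example placeholder
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow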

For more information about scheduling and triggers, notifications, and pipeline monitoring, read the Airflow documentation.

Install python-domino on your Airflow workers

To create Airflow tasks that work with Domino, you must install python-domino on your Airflow workers. This library enables you to write pipeline tasks that call the Domino API to start Jobs.

Connect to your Airflow workers and follow these steps to install and configure python-domino (a short verification sketch follows the steps):

  1. Install from pip

    pip install git+https://github.com/dominodatalab/python-domino.git

  2. Set up an Airflow variable to point to the Domino host. This is the URL where you load the Domino application in your browser.

    Key: DOMINO_API_HOST
    Value: <your-domino-url>
  3. Set up an Airflow variable to store the user API key you want to use with Airflow. This is the user that Airflow will authenticate to Domino as when it starts Jobs.

    Key: DOMINO_API_KEY
    Value: <your-api-key>
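
To verify the configuration, you can run a short sketch like the following from a Python shell on a worker. The project slug is a placeholder, and it assumes your python-domino version provides runs_list().

from airflow.models import Variable
from domino import Domino

# Placeholder project slug: substitute your own Domino project (owner/project-name)
domino = Domino(
    "your-username/your-project",
    api_key=Variable.get("DOMINO_API_KEY"),
    host=Variable.get("DOMINO_API_HOST"),
)

# Listing recent runs confirms that the host and API key are valid
print(domino.runs_list())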

How Airflow tasks map to Domino Jobs

Airflow pipelines are defined with Python code. This fits in well with Domino’s code-first philosophy. You can use python-domino in your pipeline definitions to create tasks that start Jobs in Domino.
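
As a minimal sketch of that mapping, the one-task DAG below starts a single Domino Job and blocks until it finishes. The project slug, credentials, and script path are placeholders; in practice, read the API key and host from the Airflow Variables configured earlier.

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from domino import Domino

# Placeholders: substitute your own project, API key, and Domino URL
domino = Domino("your-username/your-project",
                api_key="<your-api-key>",
                host="<your-domino-url>")

dag = DAG('domino_single_job', start_date=datetime(2019, 2, 7), schedule_interval=None)

# One Airflow task maps to one Domino Job: the task starts the Job and waits for it to finish
train_model = PythonOperator(task_id='train_model',
                             python_callable=domino.runs_start_blocking,
                             op_kwargs={"command": ["src/models/train.py"]},
                             dag=dag)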

Architecturally, Airflow has its own server and worker nodes and operates as an independent service outside of your Domino deployment. Airflow needs network connectivity to Domino so that its workers can call the Domino API to start Jobs in your Domino project. All the code that performs the actual work in each step of the pipeline (fetching data, cleaning data, and training models) is maintained and versioned in your Domino project. This way, Domino’s Reproducibility engine works together with Airflow’s scheduler.

[Diagram: Airflow scheduler and workers connecting to the Domino API]

Example pipeline

The following example assumes you have an Airflow server where you want to set up a pipeline of tasks that fetches data, cleans and processes data, performs an analysis, then generates a report. It also assumes you have all the code required to complete those tasks stored as scripts in a Domino project.

[Figure: example Airflow DAG of Domino Jobs]

The pipeline shown in the graph above is written with Airflow and python-domino; the Airflow scheduler executes its dependent tasks in order, and each task runs as a Job in Domino. The pipeline trains a model from multiple datasets and generates a final report.

See the commented script below for an example of how to configure an Airflow DAG to execute such a pipeline with Domino Jobs.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator
from domino import Domino

# Initialize the Domino API client with the API key and host stored as Airflow Variables
api_key = Variable.get("DOMINO_API_KEY")
host = Variable.get("DOMINO_API_HOST")
domino = Domino("sujaym/airflow-pipeline", api_key=api_key, host=host)

# Default arguments applied to each task in the DAG
default_args = {
    'owner': 'domino',
    'depends_on_past': False,
    'start_date': datetime(2019, 2, 7),
    'end_date': datetime(2019, 2, 10),
    'retries': 1,
    'retry_delay': timedelta(seconds=30),
}

# Instantiate a DAG that runs once per day
dag = DAG(
    'domino_pipeline',
    description='Execute Airflow DAG in Domino',
    default_args=default_args,
    schedule_interval=timedelta(days=1),
)

# Define Task instances in Airflow; each one starts a blocking Job in Domino
t1 = PythonOperator(task_id='get_dataset_1', python_callable=domino.runs_start_blocking,
                    op_kwargs={"command": ["src/data/get_dataset_1.py"]}, dag=dag)

t2 = PythonOperator(task_id='get_dataset_2', python_callable=domino.runs_start_blocking,
                    op_kwargs={"command": ["src/data/get_dataset_2.py"]}, dag=dag)

t3 = PythonOperator(task_id='get_dataset_3', python_callable=domino.runs_start_blocking,
                    op_kwargs={"command": ["src/models/get_dataset_3.sh"]}, dag=dag)

t4 = PythonOperator(task_id='clean_data', python_callable=domino.runs_start_blocking,
                    op_kwargs={"command": ["src/data/cleaning_data.py"]}, dag=dag)

t5 = PythonOperator(task_id='generate_features_1', python_callable=domino.runs_start_blocking,
                    op_kwargs={"command": ["src/features/word2vec_features.py"]}, dag=dag)

t6 = PythonOperator(task_id='run_model_1', python_callable=domino.runs_start_blocking,
                    op_kwargs={"command": ["src/models/run_model_1.py"]}, dag=dag)

t7 = PythonOperator(task_id='do_feature_engg', python_callable=domino.runs_start_blocking,
                    op_kwargs={"command": ["src/features/feature_eng.py"]}, dag=dag)

t8 = PythonOperator(task_id='run_model_2', python_callable=domino.runs_start_blocking,
                    op_kwargs={"command": ["src/models/run_model_2.py"]}, dag=dag)

t9 = PythonOperator(task_id='run_model_3', python_callable=domino.runs_start_blocking,
                    op_kwargs={"command": ["src/models/run_model_3.py"]}, dag=dag)

t10 = PythonOperator(task_id='run_final_report', python_callable=domino.runs_start_blocking,
                     op_kwargs={"command": ["src/report/report.sh"]}, dag=dag)

# Define your dependencies
t2.set_upstream(t1)
t3.set_upstream(t1)
t4.set_upstream(t2)
t5.set_upstream(t3)
t6.set_upstream([t4, t5])
t7.set_upstream(t4)
t8.set_upstream(t7)
t9.set_upstream(t7)
t10.set_upstream([t6, t8, t9])
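
The same dependency graph can also be declared with Airflow’s bitshift composition syntax, which is equivalent to the set_upstream calls above:

# Equivalent dependency declarations using >>
t1 >> [t2, t3]
t2 >> t4
t3 >> t5
[t4, t5] >> t6
t4 >> t7
t7 >> [t8, t9]
[t6, t8, t9] >> t10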