How to Integrate DataRobot and Apache Airflow for Orchestration and MLOps Workflows

June 16, 2022
by Andrius Senulis

We’re excited to announce DataRobot’s integration with Apache Airflow, a popular open source orchestration tool and workflow scheduler used by more than 12,000 organizations* across industries like financial services, healthcare, retail, and manufacturing. Airflow is well suited to orchestrating the stages of the DataRobot machine learning (ML) pipeline: it provides a simple but powerful way to integrate DataRobot capabilities into larger pipelines, combine them with other services, clean your data, and store or publish the results.

To lay a strong foundation for machine learning operations (MLOps) in your organization, it is critical that you establish a repeatable, reproducible, maintainable, and reliable ML workflow for training and deploying models and scoring predictions. Managing your ML pipeline with an orchestration solution like Apache Airflow makes it easy to monitor, debug, and rerun the individual stages of your workflow.

Integrate DataRobot and Apache Airflow for Retraining and Redeploying Models

When combined, the capabilities of DataRobot MLOps and Apache Airflow provide a reliable solution for retraining and redeploying your models. You can, for example, retrain and redeploy your models on a schedule, when model performance degrades, or using a sensor that triggers the pipeline in the presence of new data. This type of ML orchestration can provide the best-informed predictions from your organization’s models, regularly trained on the most recent data.

This tutorial will explain the DataRobot Airflow Provider setup and configuration process, helping you implement an Apache Airflow DAG (Directed Acyclic Graph) to orchestrate an end-to-end DataRobot ML pipeline. This pipeline includes creating a project, training models, deploying a model and scoring predictions.

If you are new to working with DataRobot, you’re welcome to check out our documentation, where you can find the UI docs, API docs, and tutorials.

DataRobot Provider Modules

There are multiple DataRobot operators and sensors that automate the DataRobot ML pipeline steps. Each operator initiates a specific job, and each sensor waits for a predetermined action to complete:

Operator or sensor: Job
CreateProjectOperator: Creates a DataRobot project and returns its ID
TrainModelsOperator: Triggers DataRobot Autopilot to train models
DeployModelOperator: Deploys a specified model and returns the deployment ID
DeployRecommendedModelOperator: Deploys a recommended model and returns the deployment ID
ScorePredictionsOperator: Scores predictions against the deployment and returns a batch prediction job ID
AutopilotCompleteSensor: Checks whether the Autopilot has completed
ScoringCompleteSensor: Checks whether batch scoring has completed
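
To illustrate what "waiting for an action to complete" means, here is a plain-Python sketch of the polling idea behind a sensor. It is not the provider's implementation: the function name wait_until and its parameters are hypothetical, and the list of statuses stands in for real "is Autopilot done?" checks against DataRobot.

```python
import time


def wait_until(check, poke_interval=1.0, timeout=60.0):
    """Call check() every poke_interval seconds until it returns True.

    Returns True as soon as the check succeeds, or False once another
    wait would exceed the timeout. (Hypothetical helper for illustration.)
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() + poke_interval > deadline:
            return False
        time.sleep(poke_interval)


# Stand-in for repeated status checks: not done, not done, done.
statuses = iter([False, False, True])
finished = wait_until(lambda: next(statuses), poke_interval=0.01, timeout=1.0)
print(finished)
```

In an Airflow DAG, a sensor task placed after an operator plays this role, so downstream tasks start only once the checked condition holds.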

Getting Started with DataRobot and Airflow

The DataRobot provider for Apache Airflow is a Python package built from source code available in a public GitHub repository and published on PyPI (the Python Package Index). It is also listed in the Astronomer Registry.

The integration uses the DataRobot Python API Client, which communicates with DataRobot instances via REST API. You’re welcome to check out the documentation to learn more about it.

For more information on using and developing provider packages, please check the Apache Airflow documentation.

Pre-requisites

The DataRobot provider for Apache Airflow requires an environment with the following dependencies installed:

In this tutorial, we’ll use the Astro CLI tool to manage the local Airflow instance. Make sure to install this command line tool before following the steps below:

curl -sSL https://install.astronomer.io | sudo bash

Install the provider

First, we’ll create a new Airflow project. For that, create a new directory and run the following command inside it to initialize the project files:

astro dev init

Add the following to your requirements.txt file:

airflow-provider-datarobot

The example DataRobot pipeline DAG doesn’t appear on the DAGs page by default. To make it available, download the DAG file from the repository to the dags/ directory in your project (browse the GitHub tags to download the version of the source code that matches your installed DataRobot provider) and refresh the page.

Run the following command to start the Airflow instance:

astro dev start

Once the installation is complete, you should be able to access Airflow (wait approximately 30 seconds for the web server to start) and log in with the default credentials (username: admin, password: admin):

http://localhost:8080/

The datarobot_pipeline DAG should appear on the DAGs page.

Create a connection from Airflow to DataRobot 

The next step is to create a connection from Airflow to DataRobot. Click Admin > Connections, and then, on the List Connection page, click Add a new record. Fill in the following fields:

The connection setup page looks like this:

[Image: Create a connection from Airflow to DataRobot]

You can test the connection by clicking the Test button. Then, click the Save button.

Configuration

Each operator in the DataRobot pipeline requires specific parameters. We define these parameters in a configuration JSON file and provide it when running the DAG.

The general structure of the configuration is this:

{
    "training_data": "",
    "project_name": "Project created from Airflow - Diabetes10k",
    "unsupervised_mode": false,
    "use_feature_discovery": false,
    "autopilot_settings": {
        "target": "readmitted",
        "mode": "quick",
        "max_wait": 3600
    },
    "deployment_label": "Deployment from Airflow based on diabetes10k",
    "deployment_description": "",
    "score_settings": {}
}

The values of training_data and score_settings depend on the intake/output type. We explain how to construct these settings in the sections below.

As all the parameters from autopilot_settings are passed directly into the Project.set_target() method, any parameter available in this method may be set in the config.
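
As a minimal sketch, the same configuration can be assembled in Python and serialized to JSON before you provide it to the DAG. The helper name build_config and its arguments are assumptions made for illustration; they are not part of the DataRobot provider.

```python
import json


def build_config(training_data, target, score_settings, mode="quick", max_wait=3600):
    """Return a dict with the same structure as the configuration above.

    Hypothetical helper: it only arranges values, it does not call DataRobot.
    """
    if not training_data:
        raise ValueError("training_data must be a file path or a URL")
    return {
        "training_data": training_data,
        "project_name": "Project created from Airflow - Diabetes10k",
        "unsupervised_mode": False,
        "use_feature_discovery": False,
        # Everything in autopilot_settings is passed on to Project.set_target().
        "autopilot_settings": {"target": target, "mode": mode, "max_wait": max_wait},
        "deployment_label": "Deployment from Airflow based on diabetes10k",
        "deployment_description": "",
        "score_settings": score_settings,
    }


config = build_config("include/Diabetes10k.csv", "readmitted", score_settings={})
config_json = json.dumps(config, indent=4)
print(config_json)
```

Building the JSON from a dict avoids hand-editing mistakes such as Python's False being typed where JSON expects false.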

Local file intake/output

For local file intake, provide the path to the training data, e.g.:

    "training_data": "include/Diabetes10k.csv"

And for the scoring settings, provide the following:

    "score_settings": {
        "intake_settings": {
            "type": "localFile",
            "file": "include/Diabetes_scoring_data.csv"
        },
        "output_settings": {
            "type": "localFile",
            "path": "include/Diabetes_predictions.csv"
        }
    }

As the parameters from score_settings are passed directly into the BatchPredictionJob.score() method, any parameter available in this method may be set in the config.

If you’re using the Astro CLI tool to run Airflow, make sure to place the input files in the include/ directory, which is accessible to the Airflow application inside the Docker container.
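
The local file settings can be generated with a small helper that also checks the include/ constraint described above. This is a sketch under assumptions: the function name local_file_score_settings and the shared_dir parameter are hypothetical and exist only for this example.

```python
import json
from pathlib import PurePosixPath


def local_file_score_settings(intake_file, output_file, shared_dir="include"):
    """Build localFile score_settings, requiring both paths to sit under shared_dir."""
    for path in (intake_file, output_file):
        # The first path component must be the directory shared with the container.
        if PurePosixPath(path).parts[:1] != (shared_dir,):
            raise ValueError(f"{path!r} is outside {shared_dir}/")
    return {
        "intake_settings": {"type": "localFile", "file": intake_file},
        "output_settings": {"type": "localFile", "path": output_file},
    }


local_settings = local_file_score_settings(
    "include/Diabetes_scoring_data.csv",
    "include/Diabetes_predictions.csv",
)
print(json.dumps({"score_settings": local_settings}, indent=4))
```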

Amazon S3 intake/output

For S3 intake, you need to generate a pre-signed URL for the training data file on S3:

  1. In the S3 bucket, click the CSV file.
  2. Click Object Actions at the top right corner of the screen and click Share with a pre-signed URL.
  3. Set the expiration time interval and click Create presigned URL. The URL is saved to your clipboard.
  4. Paste the URL in the config file as the training_data value.

For scoring data that is on AWS S3, your AWS credentials must be added to DataRobot to connect with your AWS account. You may add your credentials using the DataRobot Python API client:

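import datarobot as dr

# Authenticate with your DataRobot API token.
dr.Client(token="<your-api-token>", endpoint="https://app.datarobot.com/api/v2")

# Store the AWS credentials in DataRobot so that scoring jobs can access S3.
credential = dr.Credential.create_s3(
    name="My AWS credentials",
    aws_access_key_id="<aws_access_key_id>",
    aws_secret_access_key="<aws_secret_access_key>",
    aws_session_token="<aws_session_token>",
)

# Use this ID as the credential_id value in the scoring settings.
print(credential.credential_id)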

This credential entry appears in the DataRobot Credentials Management page (read more about credential management in our platform docs).

Afterwards, you need to add the credential.credential_id value to both intake_settings and output_settings in the config. In addition, you should provide the URLs for the intake and output locations on S3:

    "score_settings": {
        "intake_settings": {
            "type": "s3",
            "url": "s3://my-bucket-name/Diabetes1k.csv",
            "credential_id": "<value from credential.credential_id>"
        },
        "output_settings": {
            "type": "s3",
            "url": "s3://my-bucket-name/Diabetes1k_predictions.csv",
            "credential_id": "<value from credential.credential_id>"
        }
    }
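
Because the same credential ID appears in both blocks, it can help to build the S3 settings from one place. The following is only a sketch: s3_score_settings is a hypothetical helper, and the bucket and file names are the placeholders used above.

```python
import json


def s3_score_settings(bucket, intake_key, output_key, credential_id):
    """Build s3 score_settings that reuse one stored credential for intake and output."""

    def location(key):
        return {
            "type": "s3",
            "url": f"s3://{bucket}/{key}",
            "credential_id": credential_id,
        }

    return {
        "intake_settings": location(intake_key),
        "output_settings": location(output_key),
    }


s3_settings = s3_score_settings(
    bucket="my-bucket-name",
    intake_key="Diabetes1k.csv",
    output_key="Diabetes1k_predictions.csv",
    credential_id="<value from credential.credential_id>",
)
print(json.dumps({"score_settings": s3_settings}, indent=4))
```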

Run DataRobot Pipeline

After completing the setup above, you should be able to run the DataRobot pipeline DAG by following these steps:

  1. Copy the config file content.
  2. On the Airflow DAGs page, click the run icon in the “datarobot_pipeline” entry and select Trigger DAG w/ config.
  3. Paste the config.
  4. Select Unpause DAG when triggered.
  5. Click Trigger.
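
The steps above use the Airflow UI. As an alternative sketch, a run with the same config could be started through the Airflow 2.x stable REST API. This rests on several assumptions that the tutorial does not confirm: that the API is reachable at the local address used earlier, that basic authentication is enabled for it, and that the DAG is already unpaused. The helper name build_trigger_request is hypothetical.

```python
import base64
import json
import urllib.request


def build_trigger_request(config, dag_id="datarobot_pipeline",
                          base_url="http://localhost:8080",
                          username="admin", password="admin"):
    """Build (but do not send) a POST request that starts a DAG run with config."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url}/api/v1/dags/{dag_id}/dagRuns",
        # The run configuration goes under the "conf" key of the request body.
        data=json.dumps({"conf": config}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


request = build_trigger_request({"training_data": "include/Diabetes10k.csv"})
print(request.get_method(), request.full_url)
# To actually send it (requires a running Airflow instance):
# urllib.request.urlopen(request)
```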

The graph view of the DAG looks like this:

[Image: Run DataRobot Pipeline]

Note: If you are running Airflow in a Docker container (e.g., using the Astro CLI tool), expect the predictions file to be created inside the container. To make the predictions available on the host machine, specify an output location in the include/ directory.

Getting Started with DataRobot and Airflow

DataRobot MLOps automates the deployment, management, monitoring, and governance of models in production environments. Orchestrating the ML workflow takes this a step further by letting you combine the ML stages with the rest of your data analytics pipeline, such as data ingestion, data validation, data preparation, and model evaluation.

DataRobot and Airflow are the perfect tandem to achieve this level of automation in your organization and have the best MLOps experience.

References

*HG Insights, May 2022

About the author
Andrius Senulis

Senior Software Engineer, DataRobot

Andrius Senulis is a senior software engineer with 9+ years of experience in the IT industry, having worked in cybersecurity and AI companies. Andrius joined DataRobot in 2019 and has worked on different teams and projects in the company, including the COVID Simulator, Partner Integrations and Predictions.
