

DataRobot provider for Apache Airflow

The combined capabilities of DataRobot MLOps and Apache Airflow provide a reliable solution for retraining and redeploying your models. For example, you can retrain and redeploy your models on a schedule, when model performance degrades, or by using a sensor that triggers the pipeline when new data arrives. This quickstart guide to the DataRobot provider for Apache Airflow illustrates the setup and configuration process by implementing a basic Apache Airflow DAG (Directed Acyclic Graph) to orchestrate an end-to-end DataRobot AI pipeline. This pipeline includes creating a project, training models, deploying a model, scoring predictions, and returning target and feature drift data. In addition, this guide shows you how to import example DAG files from the airflow-provider-datarobot repository so that you can quickly implement a variety of DataRobot pipelines.

The DataRobot provider for Apache Airflow is a Python package built from source code available in a public GitHub repository and published on PyPI (the Python Package Index). It is also listed in the Astronomer Registry. For more information on using and developing provider packages, see the Apache Airflow documentation. The integration uses the DataRobot Python API Client, which communicates with DataRobot instances via REST API. For more information, see the DataRobot Python package documentation.
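For reference, the sketch below shows how to connect to DataRobot directly with the Python client that the provider wraps. The endpoint and token values are placeholders; when the provider runs inside Airflow, these values come from the DataRobot connection you configure later in this guide.

# A minimal sketch of connecting with the DataRobot Python client.
# The endpoint and token are placeholders; in Airflow, the provider reads
# these values from the DataRobot connection configured later in this guide.
import datarobot as dr

dr.Client(
    endpoint="https://app.datarobot.com/api/v2",
    token="<your-api-token>",
)
print(dr.Project.list())  # verify the client can reach your DataRobot instance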

Install the prerequisites

The DataRobot provider for Apache Airflow requires an environment with the following dependencies installed:

To install the DataRobot provider, you can run the following command:

pip install airflow-provider-datarobot

Before you start the tutorial, install the Astronomer command line interface (CLI) tool to manage your local Airflow instance:

On macOS, install Docker Desktop for Mac, and then run the following command:

brew install astro

On Linux, install Docker Desktop for Linux, and then run the following command:

curl -sSL https://install.astronomer.io | sudo bash

On Windows, install Docker Desktop for Windows, and then follow the installation instructions in the Astro CLI README.

Next, install pyenv or another Python version manager.

Initialize a local Airflow project

After you complete the installation prerequisites, you can create a new directory and initialize a local Airflow project there with the Astro CLI:

  1. Create a new directory and navigate to it:

    mkdir airflow-provider-datarobot && cd airflow-provider-datarobot
    
  2. Run the following command within the new directory to initialize a new project with the required files:

    astro dev init
    
  3. Open the requirements.txt file and add the following line:

    airflow-provider-datarobot
    
  4. Run the following command to start a local Airflow instance in a Docker container:

    astro dev start
    
  5. Once the installation is complete and the web server starts (after approximately one minute), you should be able to access Airflow at http://localhost:8080/.

  6. Sign in to Airflow. The Airflow DAGs page appears.

Load example DAGs into Airflow

The example DAGs don't appear on the DAGs page by default. To make the DataRobot provider for Apache Airflow's example DAGs available:

  1. Download the DAG files from the airflow-provider-datarobot repository.

  2. Copy the datarobot_pipeline_dag.py Airflow DAG (or the entire datarobot_provider/example_dags directory) to the dags/ directory of your Airflow project.

  3. Wait a minute or two and refresh the page.

    The example DAGs appear on the DAGs page, including the datarobot_pipeline DAG.

Create a connection from Airflow to DataRobot

The next step is to create a connection from Airflow to DataRobot:

  1. Click Admin > Connections to add an Airflow connection.

  2. On the List Connection page, click + Add a new record.

  3. In the Add Connection dialog box, configure the following fields:

    Connection Id: datarobot_default (this name is used by default in all operators)
    Connection Type: DataRobot
    API Key: A DataRobot API token (locate or create an API key in Developer Tools)
    DataRobot endpoint URL: https://app.datarobot.com/api/v2 by default
  4. Click Test to establish a test connection between Airflow and DataRobot.

  5. When the connection test is successful, click Save.

Configure the DataRobot pipeline DAG

The datarobot_pipeline Airflow DAG contains operators and sensors that automate the DataRobot pipeline steps. Each operator initiates a specific job, and each sensor waits for a predetermined action to complete:

CreateProjectOperator: Creates a DataRobot project and returns its ID
TrainModelsOperator: Triggers DataRobot Autopilot to train models
DeployModelOperator: Deploys a specified model and returns the deployment ID
DeployRecommendedModelOperator: Deploys a recommended model and returns the deployment ID
ScorePredictionsOperator: Scores predictions against the deployment and returns a batch prediction job ID
AutopilotCompleteSensor: Senses if Autopilot completed
ScoringCompleteSensor: Senses if batch scoring completed
GetTargetDriftOperator: Returns the target drift from a deployment
GetFeatureDriftOperator: Returns the feature drift from a deployment

Note

This example pipeline doesn't use every available operator or sensor; for more information, see the Operators and Sensors documentation in the project README.
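To show how these operators and sensors chain together, the following condensed sketch mirrors the structure of the example datarobot_pipeline DAG. The import paths, task IDs, and keyword arguments reflect the example DAGs in the provider repository at the time of writing and may differ between provider versions; treat the bundled datarobot_pipeline_dag.py file as the authoritative reference.

# A condensed sketch of the datarobot_pipeline DAG. Import paths, task IDs,
# and keyword arguments follow the provider's example DAGs and may differ
# between versions; see datarobot_pipeline_dag.py for the authoritative code.
from datetime import datetime

from airflow.decorators import dag
from datarobot_provider.operators.datarobot import (
    CreateProjectOperator,
    TrainModelsOperator,
    DeployRecommendedModelOperator,
    ScorePredictionsOperator,
    GetTargetDriftOperator,
    GetFeatureDriftOperator,
)
from datarobot_provider.sensors.datarobot import (
    AutopilotCompleteSensor,
    ScoringCompleteSensor,
)


@dag(schedule=None, start_date=datetime(2023, 1, 1), tags=["example"])
def datarobot_pipeline():
    # Create a project from the training data referenced in the run config.
    create_project = CreateProjectOperator(task_id="create_project")
    project_id = create_project.output

    # Start Autopilot, then wait for it to finish.
    train_models = TrainModelsOperator(task_id="train_models", project_id=project_id)
    autopilot_complete = AutopilotCompleteSensor(
        task_id="autopilot_complete", project_id=project_id
    )

    # Deploy the recommended model, score against it, and wait for the batch job.
    deploy_model = DeployRecommendedModelOperator(
        task_id="deploy_recommended_model", project_id=project_id
    )
    deployment_id = deploy_model.output
    score_predictions = ScorePredictionsOperator(
        task_id="score_predictions", deployment_id=deployment_id
    )
    scoring_complete = ScoringCompleteSensor(
        task_id="scoring_complete", job_id=score_predictions.output
    )

    # Retrieve drift metrics from the deployment.
    target_drift = GetTargetDriftOperator(
        task_id="target_drift", deployment_id=deployment_id
    )
    feature_drift = GetFeatureDriftOperator(
        task_id="feature_drift", deployment_id=deployment_id
    )

    # Chain the steps into an end-to-end pipeline.
    (
        create_project
        >> train_models
        >> autopilot_complete
        >> deploy_model
        >> score_predictions
        >> scoring_complete
        >> [target_drift, feature_drift]
    )


datarobot_pipeline()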

Each operator in the DataRobot pipeline requires specific parameters. You define these parameters in a configuration JSON file and provide the JSON when running the DAG.

{
    "training_data": "local-path-to-training-data-or-s3-presigned-url-",
    "project_name": "Project created from Airflow",
    "autopilot_settings": {
        "target": "readmitted",
        "mode": "quick",
        "max_wait": 3600
    },
    "deployment_label": "Deployment created from Airflow",
    "score_settings": {}
}

The parameters from autopilot_settings are passed directly into the Project.set_target() method; you can set any parameter available in this method through the configuration JSON file.

The values of training_data and score_settings depend on the intake/output type. The parameters from score_settings are passed directly into the BatchPredictionJob.score() method; you can set any parameter available in this method through the configuration JSON file.

For example, see the local file intake/output and Amazon AWS S3 intake/output JSON configuration samples below:

Define training_data

For local file intake, you should provide the local path to the training_data:

{
    "training_data": "include/Diabetes10k.csv",
    "project_name": "Project created from Airflow",
    "autopilot_settings": {
        "target": "readmitted",
        "mode": "quick",
        "max_wait": 3600
    },
    "deployment_label": "Deployment created from Airflow",
    "score_settings": {}
}

Define score_settings

For the scoring intake_settings and output_settings, define the type and provide the local path to the intake and output data locations:

{
    "training_data": "include/Diabetes10k.csv",
    "project_name": "Project created from Airflow",
    "autopilot_settings": {
        "target": "readmitted",
        "mode": "quick",
        "max_wait": 3600
    },
    "deployment_label": "Deployment created from Airflow",
    "score_settings": {
        "intake_settings": {
            "type": "localFile",
            "file": "include/Diabetes_scoring_data.csv"
        },
        "output_settings": {
            "type": "localFile",
            "path": "include/Diabetes_predictions.csv"
        }
    }
}

Note

When using the Astro CLI tool to run Airflow, you can place local input files in the include/ directory. This location is accessible to the Airflow application inside the Docker container.
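For orientation, the local file configuration above maps onto the underlying DataRobot Python client calls roughly as follows. This is a sketch of the equivalent direct calls, not the provider's implementation, and the deployment ID placeholder stands in for the value produced by the deployment step.

# Roughly what the operators do with the configuration values above; a sketch
# for orientation only, not the provider's actual implementation.
import datarobot as dr

# CreateProjectOperator: create a project from training_data / project_name.
project = dr.Project.create(
    sourcedata="include/Diabetes10k.csv",
    project_name="Project created from Airflow",
)

# TrainModelsOperator: autopilot_settings is unpacked into Project.set_target().
project.set_target(target="readmitted", mode="quick", max_wait=3600)

# ScorePredictionsOperator: score_settings is unpacked into BatchPredictionJob.score().
deployment_id = "<deployment-id-from-the-deploy-step>"  # placeholder value
job = dr.BatchPredictionJob.score(
    deployment_id,
    intake_settings={"type": "localFile", "file": "include/Diabetes_scoring_data.csv"},
    output_settings={"type": "localFile", "path": "include/Diabetes_predictions.csv"},
)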

Define training_data

For Amazon AWS S3 intake, you can generate a pre-signed URL for the training data file on S3:

  1. In the S3 bucket, click the CSV file.

  2. Click Object Actions at the top-right corner of the screen and click Share with a pre-signed URL.

  3. Set the expiration time interval and click Create presigned URL. The URL is copied to your clipboard.

  4. Paste the URL in the JSON configuration file as the training_data value:

{
    "training_data": "s3-presigned-url",
    "project_name": "Project created from Airflow",
    "autopilot_settings": {
        "target": "readmitted",
        "mode": "quick",
        "max_wait": 3600
    },
    "deployment_label": "Deployment created from Airflow",
    "datarobot_aws_credentials": "connection-id",
    "score_settings": {}
}
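If you prefer to generate the pre-signed URL programmatically instead of through the S3 console, a boto3 call along the following lines produces an equivalent URL; the bucket and key names are placeholders, and credentials are resolved through your usual AWS configuration.

# A sketch of generating the pre-signed URL with boto3 instead of the S3
# console. The bucket and key names below are placeholders.
import boto3

s3 = boto3.client("s3")
presigned_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-training-data-bucket", "Key": "Diabetes10k.csv"},
    ExpiresIn=3600,  # seconds until the URL expires
)
print(presigned_url)  # use this value as training_data in the run configuration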

Define datarobot_aws_credentials and score_settings

For scoring data on Amazon AWS S3, you can add your DataRobot AWS credentials to Airflow:

  1. Click Admin > Connections to add an Airflow connection.

  2. On the List Connection page, click + Add a new record.

  3. In the Connection Type list, click DataRobot AWS Credentials.

  4. Define a Connection Id and enter your Amazon AWS S3 credentials.

  5. Click Test to establish a test connection between Airflow and Amazon AWS S3.

  6. When the connection test is successful, click Save.

    You return to the List Connections page, where you should copy the Conn Id.

You can now add the Connection Id / Conn Id value (represented by connection-id in this example) to the datarobot_aws_credentials field when you run the DAG.

For the scoring intake_settings and output_settings, define the type and provide the url for the AWS S3 intake and output data locations:

{
    "training_data": "s3-presigned-url",
    "project_name": "Project created from Airflow",
    "autopilot_settings": {
        "target": "readmitted",
        "mode": "quick",
        "max_wait": 3600
    },
    "deployment_label": "Deployment created from Airflow",
    "datarobot_aws_credentials": "connection-id",
    "score_settings": {
        "intake_settings": {
            "type": "s3",
            "url": "s3://path/to/scoring-data/Diabetes10k.csv",
        },
        "output_settings": {
            "type": "s3",
            "url": "s3://path/to/results-dir/Diabetes10k_predictions.csv",
        }
    }
}

Note

Because this pipeline creates a deployment, the output of the deployment creation step provides the deployment_id required for scoring.

Run the DataRobot pipeline DAG

After completing the setup steps above, you can run a DataRobot provider DAG in Airflow using the configuration JSON you assembled:

  1. On the Airflow DAGs page, locate the DAG pipeline you want to run.

  2. Click the run icon for that DAG and click Trigger DAG w/ config.

  3. On the DAG conf parameters page, enter the JSON configuration data required by the DAG; in this example, use the JSON configuration you assembled in the previous steps.

  4. Select Unpause DAG when triggered, and then click Trigger. The DAG starts running.

Note

When you run Airflow in a Docker container (for example, using the Astro CLI tool), the predictions file is created inside the container. To make the predictions available on the host machine, specify an output location in the include/ directory.


Updated July 28, 2023