DataRobot Pipelines

Availability information

The ability to use DataRobot Pipelines, available as a public preview feature, is off by default. Contact your DataRobot representative or administrator for information on enabling the feature. Note that public preview of DataRobot Pipelines is supported on the US cloud version of DataRobot only.

Feature flags: Enable Pipeline Workspaces in AI Catalog, Disable Workspace Batch Execution

DataRobot Pipelines enable data science and engineering teams to manage machine learning data throughout the stages of model development and deployment. Teams start by collecting data from various sources, then cleaning and combining it. They standardize values and perform other data preparation operations to build a dataset at the unit of analysis.

These data extraction and preparation activities are repeated throughout the lifecycle of a model. To make this process easier, teams often build a data pipeline—a set of connected data processing steps—so that they can train models with new data as needed.

With DataRobot Pipelines, you connect to data sources of varied formats and implement data transformations to build and orchestrate your machine learning data flows.

This section describes how to work with workspaces and pipelines:

Topic Describes how to ...
Pipeline workspaces Add and edit workspaces.
Compose a pipeline Add and connect modules to build a pipeline.
Run a pipeline Run successfully compiled modules. You can run a module alone or as part of a path.
Import data Bring external data into the pipeline.
Transform data Use Spark SQL to create data transformations.
Export data Export data to configured data sources, for example, the AI Catalog and S3.

Pipeline workspaces

A data flow pipeline is a collection of connected modules—it contains the module specifications, connections, and configurations needed to implement your machine learning data flows. The pipelines you build and execute are contained in workspaces.

Note

Currently, a workspace contains a single pipeline. In the future, a workspace may support additional pipelines and related data assets.

Workspaces page

The Workspaces page lets you manage the workspaces where you will build your pipelines.

Access the Workspaces page from the Workspaces tab in the AI Catalog.

You can perform the following actions on the Workspaces page:

Element Description
Add new workspace Click to add a workspace where you will create a pipeline.
Search field Enter strings to search for workspaces.
Tags and Owner filters You can filter by:
  • Tags: Filter by tags that you create. Create and add tags on a workspace's Info tab.
  • Owner: Filter by the owner of a workspace.
Workspace list Click a workspace to view or edit it.
Sort list Sort by Creation Date, Name, or Description.

Add a workspace

To create a new workspace:

  1. In the AI Catalog, click the Workspaces tab and click Add new workspace.

  2. Name the workspace by clicking the pencil icon next to the default workspace name. Then select the type of module to add.

    You can add a module that imports, transforms, or exports data.

    Once selected, the module appears in the workspace editor.

  3. Click the new module to select it (in this case, the CSV Reader module).

    You can perform the following actions in the workspace editor.

    Element Description
    Pencil icon Rename your workspace.
    Open/Close tabs Select checkboxes to show or hide tabs in the workspace editor. Select Reset layout to default to display the default layout.
    Delete Delete the current workspace.
    Close Close the workspace editor and return to the Workspace Info page.
    Workspace module tabs
    • The Connections tab lets you add or remove ports through which the module accepts input, produces output, and connects to other modules.
    • The Config tab lets you specify the module's configuration, for example, by defining a file path, setting credentials, or indicating a delimiter.
    • The Editor tab lets you modify your pipeline and its associated code.
    Edit pipeline file Displays the pipeline.yml tab where you can edit the YAML file that specifies the pipeline's modules and connections. Click the Graph tab to return to the workspace editor.
    + Add new module Select a module type to add. The new module appears in the workspace. Use the workspace module tabs on the right to edit it and update its connections and configuration.
    Module tile The top left of a module tile shows the module type (S3, SQL, etc.). The name appears to the right of the type (in this case, "CSV Reader"). The number of rows successfully processed during the module's run displays beneath the module name. After selecting a module, you can:
    • Use the Connections, Config, and Editor tabs to configure and edit it.
    • Use the actions menu on the upper right of a module tile to force a run, clone the module, or remove the module from the workspace.
    • Click Run to selection above the pipeline to run the pipeline to the selected module.

Use tags to filter workspaces

After adding a workspace, you can add tags to help you find it on the Workspaces page:

  1. In the Info tab of the workspace, click the pencil icon next to the Tags field.

  2. Enter one or more tag names, then press Return or click outside the field.

    The following special characters are not allowed in tag names: -$.{}"'#, and spaces are not allowed either; you can use an underscore (_) for a multi-word tag name.

  3. Once you've created a tag, you can use it to filter workspaces on the Workspaces page.

Compose a pipeline

You build pipelines by adding one or more modules and connecting them.

Ports and channels

In a pipeline, each module produces and consumes data through ports. The following example shows how ports are connected via channels.

Element Description
Input port Modules accept input data through their input ports. Depending on the type of module, there can be multiple input ports. Use the + button to add a channel to an existing or new upstream module. If you add a new module on an input port that already has a channel, the newly added module acts as an intermediary between the two existing modules of the channel.
Output port Modules produce data on their output ports. Use the + button on the output port to channel the data to an existing or new module. Depending on the type of module, there can be multiple output ports.
Channel The connection between an output port of one module and the input port of another is called a channel. Data flows from one module to another through channels.
Channel to multiple modules While an input port can accept data from only one channel, the data from an output port can be channeled to the input ports of multiple modules.

See Build a pipeline and connect ports to learn how to connect ports with channels.

Port allowances by module type

Each module type has a specific set of ports you can add and configure. Following are some examples:

Module type Input ports Output ports
CSV Reader module 0 1
Spark SQL module 0 or more 1
AI Catalog Export module 1 0 or 1
CSV Writer module 1 0

Build a pipeline and connect ports

Select the tabs below to learn how to add and connect modules, and also how to disconnect them.

  1. On the Graph tab of the workspace editor, click Add new module and select a module type.

  2. Hover over the module in the workspace editor. A + button appears on the module's ports.

  3. Click the + button on a port.

  4. On the Add new module page, select a module type.

    The new module is added and the two modules are connected.

This procedure assumes the module editor contains at least two modules that you want to connect.

  1. On the Graph tab of the workspace editor, press and hold your mouse on the port of the first module.

  2. Drag your mouse to the port of the second module you want to connect to.

    When you release your mouse, the modules rearrange themselves so that the output port of the first module aligns with the input port of the second.

To disconnect modules, delete the channel that connects them by clicking the channel, then clicking the delete channel icon that appears.

Edit pipeline modules

For each module that you add to a pipeline, you might need to update its connections to other modules, configure it, and, if necessary, add and edit SQL transformations. You do so using the module tabs on the right side of the module editor. DataRobot compiles the module automatically based on the settings and reports the module's status.

Note

The following module tab descriptions apply to all module types. However, each module type has different connection, configuration, and editing requirements. See the following sections to learn about requirements for specific module types.

  • To learn about CSV Reader modules, see Import data.
  • To learn about Spark SQL modules, see Transform data.
  • To learn about AI Catalog Export modules and CSV Writer modules, see Export data.

Select a tab below to learn how you set up your modules.

This example shows the connection settings for a Spark SQL module.

Element Description
Module Name and Description fields Customize the module name and add a description.
Inputs Customize the input port names.
Outputs Customize the output port names.
+ Add Click the + Add button to the right of the Inputs and Outputs lists to add ports. Note that the + Add button is greyed out once the maximum number of ports has been added for the module type. The example above shows a Spark SQL module, which can have multiple inputs. See Port allowances by module type.
Source dropdown menu Select the name of the port you're connecting to from the Source dropdown menu. Configure channels by selecting from the available ports in the list. For Inputs, the list contains the available output ports of the existing modules. For Outputs, the list contains the available input ports of the existing modules. If there are no available ports to connect to, no Source dropdown list displays. You can use these settings to rewire existing channels between ports.

This example shows the configuration settings for a CSV Reader module. The module has been renamed "JHU Hospitalization Data."

Note

The Config tab is used to configure import and export modules. The Config tab doesn't apply to Spark SQL modules.

Element Description
File path field Enter the path to the S3 bucket.
S3 Credentials Enter all required credentials for your bucket.

This example shows the Editor tab for a Spark SQL module.

Note

The Editor tab is applicable to Spark SQL modules and not to import modules like CSV Reader modules or export modules like AI Catalog Export and CSV Writer modules.

Element Description
SQL edit window Enter SQL queries in the edit window. The editor saves your updates automatically. Errors and warnings display in the Console tab beneath the workspace editor. See the sample query after this table.
actions menu Select SQL-related operations from the menu on the bottom right of the Editor tab.
Format SQL Displays the SQL code in a readable format.
Generate Schema Builds a schema that allows you to autocomplete dataset column names as you type SQL queries. Generate Schema runs the upstream inputs in order to build the schema.
Spark Docs Displays documentation for the Spark SQL built-in functions in a new browser tab.
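
As an illustration of the kind of query you might enter in the SQL edit window, the following sketch assumes a Spark SQL module with a single input port named Hospitalizations and hypothetical columns state, report_date, and patient_count; it aggregates the incoming rows by state and date:

-- Hypothetical example: total patients by state and date
-- (Hospitalizations is this module's input port name)
SELECT
    state,
    report_date,
    SUM(patient_count) AS total_patients
FROM
    Hospitalizations
GROUP BY
    state, report_date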

Continue adding modules until your pipeline is complete. To learn about configuring import, transform, and export modules, see Import data, Transform data, and Export data.

Edit the pipeline.yml configuration

Any changes you make in the module tabs are updated in the workspace's pipeline.yml tab. It shows the results of your updates to a module's Connections, Config, and Editor tabs. Another way to add or update modules in your pipeline is to edit the pipeline.yml code directly in the pipeline.yml tab.

The editor saves your updates automatically.

Tip

You can clone a workspace by copying the contents of your pipeline.yml tab, creating a new workspace, and pasting the clipboard contents into the pipeline.yml tab.

Run a pipeline

As you add a module to a pipeline and configure it, DataRobot compiles the module. If the module compiles successfully, you can run it. Otherwise, a compilation error displays in the Console tab.

Compilation errors can be simple issues, such as too few ports or unconnected ports. In this example, the error is an unconnected port.

Note

To learn about errors and warnings generated during compilations and runs, see Module status.

You can run a module explicitly using the Run to selection command. You can also run a module as part of a pipeline path, for example, by applying Run to selection on a downstream module.

Select a tab below to learn how to run pipelines.

  1. In the workspace editor, select the module the pipeline will run to and click the Run to selection button above the pipeline.

  2. View the output of the module run in the Results tab.

    Following are the results of a successful run.

    If a run is unsuccessful, an error displays in the Console tab.

If you run a pipeline to a selected module and an error occurs, you can view the details in the Console tab under the workspace editor. Runtime errors can be caused by issues like invalid SQL statements and invalid file paths. See Module status for a description of the status types.

To run an entire pipeline, close the workspace editor and click Run on the top right of the workspace.

The Run button displays whether you're viewing the Info, Pipelines, Run Schedule and History, or Comments tab.

To see the results, select the Run Schedule and History tab in the workspace.

Tip

You can also schedule pipelines to run at regular intervals.

Download a preview

When you run to a selected module, you see a preview of the results at that point in the pipeline. The preview displays in the Results tab:

Click Download Preview to download the results preview.

Module status

A status icon displays beside each module to indicate its current state, for example, compiled, running, or run successfully.

In this example, the CSV Reader module compiles but the Spark SQL module fails to compile. The Console tab displays the errors.

Each status is indicated by its own icon:

  • Module has been added to the pipeline, but has not compiled yet.
  • Module has been compiled successfully.
  • Module compilation failed. Check the Console tab for error details.
  • Module is running or waiting for a dependency to finish running. The status icon spins to indicate the module is running.
  • Module has run successfully. Check the Results tab for details. If you run this module again, it will intelligently use the cached results and will not re-run. To force a re-run of the module, use Force Run from the module actions menu.
  • Module run failed. Select the failed module and check the Console tab for error details. You can make changes to the module configuration or edit the specification and rerun the module.

Schedule a pipeline run

A powerful feature of DataRobot Pipelines is the ability to configure your pipeline and run it at regular intervals. This allows you to automate data updates. To schedule a pipeline run:

  1. Navigate to a workspace, select the Run Schedule and History tab, and click Set schedule.

  2. Set the frequency and time you want the workspace pipeline to run.

    Tip

    To set more granular times, click Use advanced scheduler. The advanced scheduler lets you schedule runs for specific minutes, hours, days of the month, months, and days of the week.

  3. Select the Activate schedule after it is saved checkbox to turn the schedule on now. You can uncheck it if you want to wait and enable the schedule later.

  4. Click Save.

    The schedule is set.

    To deactivate the schedule, click pause. You can later enable the schedule.

Enable a paused schedule

If you created a schedule but chose not to activate it right away or if you deactivated a schedule, it appears as "Paused."

To enable the schedule later, click play.

Edit a saved schedule

To edit a saved schedule, do one of the following:

  • Click pause to deactivate the schedule temporarily. You can later enable the schedule.
  • Click the pencil icon to update your schedule settings.
  • Click delete to delete the schedule.

View run history

Once your pipeline runs, you can view the run history:

Click a run to view a log file and the pipeline.yml code.

Import data

A typical data pipeline starts with a data read operation. Import modules bring external data into the pipeline and make it available for other modules to consume.

CSV Reader module

The CSV Reader module is an import module that reads delimited text files from the AWS S3 storage service. The following are options used to configure the CSV Reader in the Config tab.

Option Description
File path Specify the path to the delimited text file, including the bucket name.
S3 Credentials Use any existing credentials from your profile’s “Credential Management” section or create a new set of credentials by providing the Access Key, Secret Key, and AWS Session Token details.
AWS region Enter the region where the S3 bucket is located. The default value is us-east-1.
Treat first row as column header Uncheck this option if the file has no header row.
Delimiter Specify the field delimiter. Comma is the default.
Encoding Specify the type of encoding for the data. UTF-8 is the default.
Force Column Types to String Treat all imported columns as strings instead of performing type inference to detect other types, such as numeric. Useful for larger datasets where some of the column types may be wrongly inferred.
Parallel Streams Select the number of parallel processing streams to add. This option lets you trade off between speed of ingestion and amount of memory used. You can increase this value for smaller datasets to speed up runs. Keep this value low for larger datasets to avoid "Out of Memory" errors.
Size of blocks in bytes Select the size, in bytes, of the data blocks that are read at a time. Increasing the block size can speed up the module and downstream modules up to a point, but may result in "Out of Memory" errors for larger datasets. Decreasing the block size can help avoid "Out of Memory" errors for larger datasets, but setting it too small will slow processing.

Note

Only files of less than 120GB in size can be imported from S3.

Transform data

Once data is read into the pipeline, it typically goes through a series of transformations. Transform modules let you create data transformations like combining multiple datasets, removing duplicates, and cleaning erroneous values.

Spark SQL module

The Spark SQL transform module lets you write SQL queries on the incoming data. This module accepts one or more input datasets. You can compose SQL queries on these datasets to generate the desired output. In the SQL queries, you address the datasets using the input port names. For example, if the module has two input ports, Orders and Customers, the SQL query must refer to the incoming data using the port names Orders and Customers, as shown here:

SELECT
    Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM
    Orders INNER JOIN Customers
    ON Orders.CustomerID = Customers.CustomerID
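
The join above illustrates combining datasets. As a further illustration of the transformations mentioned earlier, such as removing duplicates and cleaning erroneous values, the following sketch assumes a single input port named Patients with hypothetical columns:

-- Hypothetical example: drop duplicate rows, normalize a text column,
-- and replace missing counts with zero (Patients is the input port name)
SELECT DISTINCT
    patient_id,
    UPPER(TRIM(state)) AS state,
    COALESCE(patient_count, 0) AS patient_count
FROM
    Patients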

Tip

While most SQL transformations require at least one input dataset, some SQL statements can be written without any input data. In these cases, you can remove the inputs of the Spark SQL module and keep just the output.
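
For example, the following sketch uses only Spark SQL built-in functions to generate a calendar of dates without reading from any input port (the date range shown is arbitrary):

-- Hypothetical example: produce one row per day for 2021 with no input data
SELECT
    explode(sequence(to_date('2021-01-01'), to_date('2021-12-31'), interval 1 day)) AS calendar_date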

Export data

Once processing is complete, the goal is often to export the data to a storage location. Export modules output data from the pipeline to an external data source in a format of your choice.

AI Catalog Export module

The AI Catalog Export module exports the data from the pipeline as a dataset in the AI Catalog. This lets you combine data from multiple sources, clean it, and create a dataset that can be used for training models.

Use the following options to configure an AI Catalog Export module in the Config tab:

Option Description
Dataset name Specify the name for the dataset you're creating in the AI Catalog.
Description Provide a description of the dataset.
Request timeout (seconds) Specify the number of seconds before timing out.
Tags Enter tags to apply to the dataset. You can then filter on these tags in the AI Catalog to find your dataset.

Note

Only datasets of less than 10GB in size can be exported to the AI Catalog.

CSV Writer module

The CSV Writer module exports the data from the pipeline to an AWS S3 location of your choice in a character-delimited format.

Use the following options to configure a CSV Writer module in the Config tab:

Option Description
File path Enter the path to the delimited text file, including the bucket name.
S3 Credentials Use existing credentials from your profile’s “Credential Management” section or create a new set of credentials by providing the Access Key, Secret Key, and AWS Session Token details.
AWS Region Enter the region where the S3 bucket is located. The default value is us-east-1.
Include header Select to include a header row.
Overwrite existing file Select to overwrite an existing file.
Double quote all strings Select to enclose all string values in double quotes.
Encoding Specify the type of encoding for the data. UTF-8 is the default.
Delimiter Specify the field delimiter. Comma is the default.

Considerations

The following are considerations to be aware of when working with DataRobot Pipelines.

  • Only files of less than 120GB in size can be imported from S3.
  • Currently, every export to the AI Catalog creates a new dataset. The capability of exporting a new version of an existing dataset will be available in an upcoming release.
  • The AI Catalog supports datasets of up to 10GB, so exports to the AI Catalog have the same limit.

Definitions

Workspace

A container used to build and execute data flow pipelines. A workspace contains a pipeline, which includes module specifications, their connectivity, and configuration, as well as data assets and credentials. A workspace is a catalog entity and is searchable by workspace name, description, and tags. Workspaces are backed by a file system such as S3.

Pipeline

A declarative directed acyclic graph (DAG) with a series of instructions that act upon data. This is represented as modules connected by channels that reflect the flow of data between them. A pipeline lives in a workspace and is rendered in the Graph tab. Pipelines are not exposed outside the workspace, so they are not directly searchable in the AI Catalog.

Module

Self-contained code that represents one step in the pipeline flow. Each module has its own runtime specification and can have one or more inputs.

Channel

The connection between an output port of one module and an input port of another module. Data flows from one module's output port to another module's input port via a channel, represented visually by a line connecting the two.


Updated September 10, 2021