Compose a pipeline¶
Build pipelines by adding one or more modules, then configuring and connecting them.
Ports and channels¶
In a pipeline, each module produces and consumes data through ports. The following example shows how ports are connected via channels.
|Input port||Modules accept input data through their input ports. Depending on the type of module, there can be multiple input ports. Use the + button to add a channel to an existing or new upstream module. If you add a new module on an input port that already has a channel, the newly added module acts as an intermediary between the two existing modules of the channel.|
|Output port||Modules produce data on their output ports. Use the + button on the output port to channel the data to an existing or new module. Depending on the type of module, there can be multiple output ports.|
|Channel||The connection between an output port of one module and the input port of another is called a channel. Data flows from one module to another through channels.|
|Channel to multiple modules||While an input port can accept data from only one channel, the data from an output port can be channeled to the input ports of multiple modules.|
To learn how to connect ports with channels, see Build a pipeline and connect ports.
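The port and channel rules described above can be sketched as a small data model. This is only an illustration of the rules, not DataRobot's implementation or API: the `Pipeline` class, its methods, and the `(module, port)` tuples are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    """Toy model of channels (hypothetical, for illustration only)."""

    # Maps an input port to the single output port that feeds it.
    # Both keys and values are (module_name, port_name) tuples.
    channels: dict = field(default_factory=dict)

    def connect(self, source, target):
        """Channel data from output port `source` to input port `target`."""
        # An input port can accept data from only one channel.
        if target in self.channels:
            raise ValueError(f"input port {target} already has a channel")
        self.channels[target] = source

    def disconnect(self, target):
        """Delete the channel that ends at input port `target`."""
        del self.channels[target]

    def insert_between(self, target, new_in, new_out):
        """Add a new module on an input port that already has a channel.

        The new module becomes an intermediary between the two modules
        of the existing channel.
        """
        upstream = self.channels.pop(target)
        self.channels[new_in] = upstream  # old upstream now feeds the new module
        self.channels[target] = new_out   # new module feeds the original target

    def consumers(self, source):
        """Return every input port fed by output port `source`."""
        # One output port may be channeled to many input ports.
        return sorted(t for t, s in self.channels.items() if s == source)


pipeline = Pipeline()
pipeline.connect(("reader", "out"), ("sql_a", "in1"))
pipeline.connect(("reader", "out"), ("sql_b", "in1"))  # fan-out is allowed
```

Keying the mapping by input port makes the "one channel per input port" rule a simple membership check, while fan-out from an output port needs no special handling.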
Port allowances by module type¶
Each module type has a specific set of ports you can add and configure. Following are some examples:
|Module type||Number of input ports||Number of output ports|
|CSV Reader module (AWS S3)||0||1|
|AI Catalog Import module||0||1|
|Spark SQL module||0 or more||1|
|AI Catalog Export module||1||0 or 1|
|CSV Writer module (AWS S3)||1||0|
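As a rough sketch, the allowances in the table above can be expressed as a lookup that decides whether another port may be added, similar in spirit to the + Add button being greyed out at the maximum. The `PORT_ALLOWANCES` dictionary and `can_add_port` function are hypothetical names for this example, and treating "0 or more" as having no upper limit is an assumption based only on the table.

```python
# (minimum, maximum) port counts copied from the table above.
# None as a maximum means the table states no upper limit ("0 or more").
PORT_ALLOWANCES = {
    "CSV Reader module (AWS S3)": {"inputs": (0, 0), "outputs": (1, 1)},
    "AI Catalog Import module": {"inputs": (0, 0), "outputs": (1, 1)},
    "Spark SQL module": {"inputs": (0, None), "outputs": (1, 1)},
    "AI Catalog Export module": {"inputs": (1, 1), "outputs": (0, 1)},
    "CSV Writer module (AWS S3)": {"inputs": (1, 1), "outputs": (0, 0)},
}


def can_add_port(module_type, kind, current_count):
    """Return True if one more port of `kind` ("inputs" or "outputs") fits."""
    _minimum, maximum = PORT_ALLOWANCES[module_type][kind]
    return maximum is None or current_count < maximum
```

For example, `can_add_port("Spark SQL module", "inputs", 3)` is true under these assumptions, while a CSV Writer module can never gain an output port.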
Unlike other data ports in the system, the output port for AI Catalog Export modules is a metadata port (i.e., the output is the
Build a pipeline and connect ports¶
Select the tabs below to learn how to add and connect modules, and also how to disconnect them.
On the Graph tab of the workspace editor, click Add new module and select a module type.
Hover over the module in the workspace editor. A + button appears on the module's ports.
Click the + button on a port.
On the Add new module page, select a module type—only compatible module types are displayed.
The new module is added and the two modules are connected.
This procedure assumes the module editor contains at least two modules that you want to connect.
On the Graph tab of the workspace editor, press and hold your mouse on the port of the first module.
Drag your mouse to the port of the second module you want to connect to.
When you release your mouse, a port is automatically added to the second module, and the modules rearrange themselves so that the output port of the first module aligns with the input port of the second.
To disconnect modules, delete the channel that connects them by clicking the channel, then clicking the delete channel icon that appears.
Edit pipeline modules¶
For each module that you add to a pipeline, you might need to update its connections to other modules, configure it, and, if necessary, add and edit SQL transformations. You do so using the module tabs on the right side of the module editor. DataRobot compiles the module automatically based on the settings and reports the module's status.
The following module tab descriptions apply to all module types. However, each module type has different connection, configuration, and editing requirements. See the following sections to learn about requirements for specific module types.
Select a tab below to learn how you set up your modules.
This example shows the connection settings for a Spark SQL module.
|Module Name and Description fields||Customize the module name and add a description.|
|Inputs||Customize the input port names.|
|Outputs||Customize the output port names.|
|+ Add||Click the + Add button to the right of the Inputs and Outputs lists to add ports. Note that the + Add button is greyed out once the maximum number of ports has been added for the module type. The example above contains a Spark SQL module, which can have multiple inputs. See Port allowances by module type.|
|Source dropdown menu||Select the name of the port you're connecting to from the Source dropdown menu. Configure channels by selecting from the available ports in the list. For Inputs, the list contains the available output ports of the existing modules. For Outputs, the list contains the available input ports of the existing modules. If there are no available ports to connect to, no Source dropdown list displays. You can use these settings to rewire existing channels between ports.|
|Delete||Click the trash icon to delete a port.|
This example shows the configuration settings for a CSV Reader module.
The Details tab is used to configure import and export modules. For Spark SQL modules, the Details tab displays an editor.
|File path field||Enter the path to the S3 bucket.|
|S3 Credentials||Select S3 Credentials from the dropdown, click Add, and enter all required credentials for your bucket.|
This example shows the Details tab for a Spark SQL module.
This Details tab display is specific to Spark SQL modules. It does not apply to import modules, like CSV Reader modules, or to export modules, like AI Catalog Export and CSV Writer modules.
|SQL edit window||Enter SQL queries in the edit window. The editor saves your updates automatically. Errors and warnings display in the Console tab beneath the workspace editor.|
|Actions menu||Select SQL-related operations from the menu on the bottom right of the Details tab.|
|Format SQL||Displays the SQL code in a readable format.|
|Generate Schema||Builds a schema that allows you to autocomplete dataset column names as you type SQL queries. Generate Schema runs the upstream inputs in order to build the schema.|
|Spark Docs||Displays documentation for the Spark SQL built-in functions in a new browser tab.|
Continue adding modules until your pipeline is complete. To learn about configuring import, transform, and export modules, see:
Edit the pipeline.yml configuration¶
Any changes you make in the module tabs are reflected in the workspace's pipeline.yml tab, which shows the results of your updates to a module's Connections and Details tabs. Another way to add or update modules in your pipeline is to edit the pipeline.yml code directly in the pipeline.yml tab.
The editor saves your updates automatically.
You can clone a workspace by copying the contents of your pipeline.yml tab, creating a new workspace, and pasting the clipboard contents into the pipeline.yml tab.