DataRobot Pipelines

DataRobot Pipelines enable data science and engineering teams to build and run machine learning data flows. Teams start by collecting data from various sources, then cleaning and combining it. They standardize values and perform other data preparation operations to build a dataset at the unit of analysis.

To make repeatable data extraction and preparation easier, teams often build a data pipeline—a set of connected data processing steps—so that they can prepare data for training models, making predictions, and applying to other relevant use cases.
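The idea of a pipeline as connected processing steps can be sketched as a chain of small functions, each consuming the previous step's output. This is a minimal, hypothetical illustration; the step and field names are not part of DataRobot's API:

```python
# Minimal sketch of a data pipeline as connected steps.
# Step and field names are hypothetical, not DataRobot APIs.

def extract(rows):
    """Collect raw records from a source."""
    return list(rows)

def clean(rows):
    """Drop records with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def standardize(rows):
    """Normalize a text field to a consistent form."""
    return [{**r, "city": r["city"].strip().title()} for r in rows]

def run_pipeline(rows):
    """Connect the steps: each consumes the previous step's output."""
    return standardize(clean(extract(rows)))

raw = [
    {"city": "  boston ", "sales": 10},
    {"city": None, "sales": 5},        # dropped by clean()
    {"city": "NEW YORK", "sales": 7},
]
prepared = run_pipeline(raw)
```

Because each step only depends on the shape of its input, the same prepared dataset can feed model training, predictions, or other downstream use cases without repeating the extraction work.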

With DataRobot Pipelines, you connect to data sources of varied formats and transform data to build and orchestrate your machine learning data flows.

This section describes how to work with workspaces and pipelines:

Topic | Describes...
Pipeline workspaces | Add and edit workspaces.
Compose a pipeline | Add and connect modules to build a pipeline.
Run a pipeline | Run successfully compiled modules. You can run a module alone or as part of a path.
Import data | Bring external data into the pipeline.
Transform data | Use Spark SQL to create data transformations.
Export data | Export data to configured data sources, for example, the AI Catalog and S3.
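The Transform data step centers on SQL. As a runnable illustration of the kind of SQL transformation a Spark SQL module might apply, here is a sketch using Python's standard-library sqlite3 so it is self-contained; the table and column names are hypothetical:

```python
import sqlite3

# Hypothetical input table; in a pipeline this data would arrive from
# an upstream import module rather than being created inline.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 75.0)],
)

# A transformation of the kind you might express in a Spark SQL module:
# aggregate raw orders down to one row per region (the unit of analysis).
totals = conn.execute(
    "SELECT region, SUM(amount) AS total FROM orders "
    "GROUP BY region ORDER BY region"
).fetchall()
```

The same GROUP BY query, written in a Spark SQL module, would scale the aggregation across a distributed dataset instead of an in-memory table.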

Data processing limits by module type

The following table lists the data processing limits for each module type:

Module type | Data limit
CSV Reader module | 100 GB
AI Catalog Import module | 10 GB
Spark SQL module | 100 GB
AI Catalog Export module | 10 GB
CSV Writer module | 100 GB
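These limits can be checked before running a module. A minimal sketch of such a pre-run check, where the lookup table mirrors the limits above and the function name is hypothetical:

```python
# Data processing limits per module type, in gigabytes (from the table above).
DATA_LIMITS_GB = {
    "CSV Reader": 100,
    "AI Catalog Import": 10,
    "Spark SQL": 100,
    "AI Catalog Export": 10,
    "CSV Writer": 100,
}

def within_limit(module_type, size_gb):
    """Return True if an input of size_gb fits the module's data limit."""
    return size_gb <= DATA_LIMITS_GB[module_type]
```

For example, a 12 GB dataset fits a CSV Reader module but exceeds the AI Catalog Import limit.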

Updated December 12, 2021