DataRobot Pipelines enable data science and engineering teams to build and run machine learning data flows. Teams start by collecting data from various sources, cleaning and combining it, standardizing values, and performing other data preparation operations to build a dataset at the unit of analysis.
To make repeatable data extraction and preparation easier, teams often build a data pipeline—a set of connected data processing steps—so that they can prepare data for training models, making predictions, and other relevant use cases.
With DataRobot Pipelines, you connect to data sources of varied formats and transform data to build and orchestrate your machine learning data flows.
This section describes how to work with workspaces and pipelines:
| Topic | Description |
|-------|-------------|
| Pipeline workspaces | Add and edit workspaces. |
| Compose a pipeline | Add and connect modules to build a pipeline. |
| Run a pipeline | Run successfully compiled modules. You can run a module alone or as part of a path. |
| Import data | Bring external data into the pipeline. |
| Transform data | Use Spark SQL to create data transformations (see the sketch after this table). |
| Export data | Export data to configured data sources, for example, the AI Catalog and S3. |
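As a rough illustration of the kind of transformation a Spark SQL module can apply, the sketch below assumes a hypothetical imported dataset named `transactions` with `customer_id`, `amount`, and `purchase_date` columns; the actual table and column names depend on your pipeline's inputs.

```sql
-- Hypothetical example: aggregate raw transactions to one row per customer
-- (the unit of analysis), standardizing values along the way.
SELECT
  customer_id,
  COUNT(*)                                  AS transaction_count,
  SUM(amount)                               AS total_spend,
  AVG(amount)                               AS avg_spend,
  MAX(to_date(purchase_date, 'yyyy-MM-dd')) AS last_purchase_date
FROM transactions
WHERE amount IS NOT NULL
  AND amount >= 0                           -- drop malformed or negative rows
GROUP BY customer_id
```

The output of a transformation like this can then be passed to an export module (for example, AI Catalog Export or CSV Writer) or used as the input to a downstream Spark SQL module.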
Data processing limits by module type
The following table lists the data processing limits for each module type:
| Module type | Data limit |
|-------------|------------|
| CSV Reader module | 100 GB |
| AI Catalog Import module | 10 GB |
| Spark SQL module | 100 GB |
| AI Catalog Export module | 10 GB |
| CSV Writer module | 100 GB |