Modify a blueprint¶
This section describes the blueprint editor. A blueprint represents the high-level end-to-end procedure for fitting the model, including any preprocessing steps, modeling, and post-processing steps. The description of the Describe > Blueprints tab provides a detailed explanation of blueprint elements.
When you create your own blueprints, DataRobot validates modifications to ensure that changes are intentional, not to enforce requirements. As such, blueprints with validation warnings are saved and can be trained, despite the warnings. While this flexibility prevents erroneously constraining you, be aware that a blueprint with warnings is not likely to successfully build a model.
How a blueprint works¶
Before working with the editor, make sure you understand the kind of data processing a blueprint can handle, the components for building a pipeline, and how tasks within a pipeline work.
Blueprint data processing abilities¶
A blueprint is designed to implement training pipelines, including modeling, calibration, and model-specific preprocessing steps. Other types of data preparation are best addressed using other tools. When deciding where to implement data processing steps, consider that the following aspects apply to all blueprints:
- Input data is limited to a single post-EDA2 dataset. No joins can be defined inside a blueprint. All joins should be accomplished prior to EDA2 (using, for example, Spark SQL, Feature Discovery, code, or Data Prep).
- Output data is limited to predictions for the project’s target, as well as information about those predictions (Prediction Explanations).
- Post-processing that produces output in a different format should be defined outside of the blueprint.
- When scoring new data, a single prediction can only depend on a single row of input data.
- The number of input and output rows must match.
Blueprint task types¶
DataRobot supports two types of tasks—estimator and transform.
Estimator tasks predict new value(s) (y) by using the input data (x). The final task in any blueprint must be an estimator. During scoring, the estimator's output must always align with the target format. For example, for multiclass blueprints, an estimator must return a probability of each class for each row.
Examples of estimator tasks include LightGBM regressor, among others.
Transform tasks transform the input data (x) in some way. A transform's output is always a dataframe, but unlike an estimator's output, it can contain any number of columns and any data types.
Examples of transforms include Matrix n-gram, and more.
Both estimator and transform tasks have a fit() method that is used for training and learning data characteristics. For example, a binning task requires fit() to define the bins based on training data, and then applies those bins to all future incoming data. While both task types use the fit() method, estimators use a score() hook while transform tasks use a transform() hook. See the descriptions of these hooks when creating custom tasks for more information.
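The split between fit() and the task-specific hook can be illustrated with a minimal sketch. The class names, the equal-width binning rule, and the method signatures below are assumptions made for illustration only; they are not DataRobot's built-in tasks or its custom task hook signatures.

```python
from bisect import bisect_right


class BinningTransform:
    """Hypothetical transform task: learns bin edges in fit(), applies them in transform()."""

    def __init__(self, n_bins=3):
        self.n_bins = n_bins
        self.edges = []

    def fit(self, x):
        # Learn equal-width bin edges from the training data only.
        lo, hi = min(x), max(x)
        width = (hi - lo) / self.n_bins
        self.edges = [lo + width * i for i in range(1, self.n_bins)]
        return self

    def transform(self, x):
        # Apply the learned edges to any data, including rows never seen in fit().
        return [bisect_right(self.edges, value) for value in x]


class MeanEstimator:
    """Hypothetical estimator task: learns the target mean in fit(), returns it in score()."""

    def __init__(self):
        self.mean = 0.0

    def fit(self, x, y):
        self.mean = sum(y) / len(y)
        return self

    def score(self, x):
        # One prediction per input row.
        return [self.mean for _ in x]


# Bins are defined once from training data, then reused for new data.
binner = BinningTransform(n_bins=3).fit([0, 3, 6, 9])
new_bins = binner.transform([1, 4, 8, 20])
```

Here the value 20 lies outside the training range and still falls into the last learned bin, which is the behavior the binning example describes: bins come from training data and are applied unchanged to later data.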
Transform and estimator tasks can each be used as intermediate steps inside a blueprint. For example, Auto-Tuned N-Gram is an estimator, providing the next task with predictions as input.
How data passes through a blueprint¶
Data is passed through a blueprint sequentially, task by task, left to right. When data is passed to a transform, DataRobot:
- Fits it on the received data.
- Uses the trained transform to transform the same data.
- Passes the result to the next task.
When data is passed to an estimator, DataRobot:
- Fits it on the received data.
- Uses the trained estimator to predict on the same data.
- Passes the predictions to the next task. To reduce overfitting, DataRobot passes stacked predictions when the estimator is not the final step in a blueprint.
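Stacked predictions are commonly understood as out-of-fold predictions: each row is predicted by a model that did not see that row during fitting. The sketch below illustrates that idea with a toy mean estimator. The fold assignment (row index modulo the number of folds) and the function names are assumptions for illustration; the source does not describe how DataRobot partitions data for this purpose.

```python
def fit_mean(y):
    """Toy estimator: 'training' just memorizes the mean of the targets."""
    return sum(y) / len(y)


def stacked_predictions(y, n_folds):
    """Return one out-of-fold prediction per row."""
    n = len(y)
    preds = [None] * n
    for fold in range(n_folds):
        # Rows in this fold are held out; the model is fit on all other rows.
        holdout = [i for i in range(n) if i % n_folds == fold]
        train_y = [y[i] for i in range(n) if i % n_folds != fold]
        model = fit_mean(train_y)
        for i in holdout:
            preds[i] = model
    return preds


y = [1.0, 2.0, 3.0, 4.0]
oof = stacked_predictions(y, n_folds=2)
```

Because no row contributes to its own prediction, the next task receives inputs that resemble what the estimator would produce on unseen data, which is what reduces overfitting downstream.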
When the trained blueprint is used to make predictions, data is passed through the same set of steps (with the difference that the fit() method is skipped).
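The training and prediction passes described in this section can be sketched as two small loops over a list of tasks. Everything here (the class names, the `train_blueprint` and `predict_blueprint` helpers, and the toy math) is hypothetical and serves only to show the order of operations; for brevity the sketch does not apply stacked predictions to intermediate estimators.

```python
class MaxAbsScaler:
    """Hypothetical transform task: fit() learns a scale, transform() applies it."""

    def fit(self, x, y=None):
        self.scale = max(abs(v) for v in x) or 1.0
        return self

    def transform(self, x):
        return [v / self.scale for v in x]


class RatioEstimator:
    """Hypothetical estimator task: fit() learns k in y ~ k * x, score() predicts."""

    def fit(self, x, y):
        self.k = sum(y) / sum(x)
        return self

    def score(self, x):
        return [self.k * v for v in x]


def _apply(task, data):
    # Transforms expose transform(); estimators expose score().
    return task.transform(data) if hasattr(task, "transform") else task.score(data)


def train_blueprint(tasks, x, y):
    """Fit each task on the data it receives, then pass its output along."""
    data = x
    for task in tasks:
        task.fit(data, y)
        data = _apply(task, data)
    return data


def predict_blueprint(tasks, x):
    """Same steps as training, but fit() is skipped."""
    data = x
    for task in tasks:
        data = _apply(task, data)
    return data


tasks = [MaxAbsScaler(), RatioEstimator()]
train_preds = train_blueprint(tasks, [2.0, 4.0], [10.0, 20.0])
new_preds = predict_blueprint(tasks, [8.0])
```

At prediction time the scaler reuses the scale learned during training, so new rows are processed exactly as the training rows were.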
Access the blueprint editor¶
You can access the blueprint editor from the Leaderboard, the Repository, and the AI Catalog.
From the Leaderboard, select a model to use as the basis for further exploration and click to expand (which opens the Describe > Blueprint tab). From the Repository, select and expand a model from the library of modeling blueprints available for a selected project. From the AI Catalog, select the Blueprints tab to list an inventory of user blueprints.
With any of these methods, once the blueprint diagram is open, choose Copy and Edit to open the blueprint editor, which makes a copy of the blueprint.
When you then make modifications, they are made to a copy and the original is left intact (either on the Leaderboard or the Repository, depending on where you opened it from). Click and drag to move the blueprint around on the canvas.
Why is the editable blueprint different from the original?
When a blueprint is generated, it can contain branches for data types that are not present in the current project. Unused branches are pruned (ignored) in that case. These branches are included in the copied blueprint, as they were part of the original (before the pruning) and they may be needed for future projects. For that reason, they are available and visible inside the blueprint editor.
When you have finished editing the blueprint:
- Click +Add to AI Catalog if you want to save it to the AI Catalog for further editing, use in other projects, and sharing.
- Click Train to run the blueprint and add the resulting model to the Leaderboard.
Use the blueprint editor¶
A blueprint is composed of nodes and connectors. A node is the pipeline step—it takes in data, performs an operation, and outputs the data in its new form. Tasks are the elements that complete those actions. A connector is a representation of the flow of data. From the editor you can add, remove, and modify tasks, task hyperparameters, and/or task connections.
Work with nodes¶
The following table describes the actions to take on a node.
| Action | Description | How to |
|---|---|---|
| Modify a node | Change characteristics of the task contained in the node. | Click a node and then the associated pencil icon. Edit the task or parameters as needed. |
| Add a node | Add a node to the blueprint. | Click the node that will serve as the new node's input or output. This generates plus signs on the connectors, which you then click to create an empty node. The accompanying Select a task window is the foundation for task configuration. |
| Remove a node | Remove a node and its associated task from the blueprint. | Click a node and then the associated trash can icon. |
Work with connectors¶
The following table describes the actions to take on a connector.
| Action | Description | How to |
|---|---|---|
| Add a connector | Add a connection between tasks, directing the data flow. | Click the starting point node, then drag the blue knob to the output point. |
| Remove a connector | Disable a connection between two nodes. | Select a connector at its starting point and click the resulting trash can icon. If the icon does not appear, the connector cannot be deleted because its removal would make the blueprint invalid. |
Modify a node¶
Use these steps to change an existing node or to add hyperparameters to a node newly added to the blueprint.
On the node to be changed:
- Hover to display task requirements.
- Click the pencil icon to activate editing, the plus sign for adding a connected node, or the trash can for deleting.
Click the pencil icon to open the task window. DataRobot presents a list of all parameters that define the task.
The following table describes the actions available from the task window:
| Element | Click to... |
|---|---|
| Open documentation link | Open the model documentation to read a description of the task and its parameters. |
| Task selector | Choose an alternative task. Click through the options to select or search for a specific task. To learn about a task, use the Open documentation link. |
| Recommended values | Reset all parameter values to the safe defaults recommended by DataRobot. |
| Value entry | Change a parameter value. When you select a parameter, a dropdown displays acceptable values. Click outside the box to set the new value. Click Recommended values to restore the default. |
Use the task selector¶
Click the task name to expand the task finder. Either enter text into the search field or expand the task types to see options listed. If you previously created custom tasks, they also are available in the list. You can also create a task from this modal before proceeding.
When you click to select a new task, the blueprint editor loads that task's parameters for editing (if desired). When you are finished, click Update. DataRobot replaces the task in the blueprint.
Launch custom task creation workflow¶
You can access the custom task creation workflow by clicking the add a custom task link at the top of the task selector modal. The Add Custom Task modal opens in a new browser tab, initiating the task creation workflow.
Once the environment is set and the code is uploaded, close the tab. From the Select a task modal, click Refresh to make the new task available. You can find it either by expanding the Custom dropdown or by searching.
Add a data type¶
You can change the input data types available in a blueprint. Click the Data node; the editor highlights the node and the included data types. Click the pencil icon to select or remove data types.
Pass selected columns into a task¶
To pass a single column or a group of columns into a task, use the Task Multiple Column Selector. This task selects specific features in a dataset so that downstream transformations are applied only to a subset of columns. To use this task, add it directly after a data type (for example, directly after “Categorical Variables”), then use the task’s parameters to specify which features should or should not be passed to the next task.
To configure the task, use the column_names parameter to list the columns, and use the method parameter to specify whether those columns should be included in or excluded from the input to the next task. Note that if you need to pass all columns of a certain type to a task, you don't need this task (MCPICK); just connect the downstream task to the data type node.
Click Add to see the new task referencing the chosen column(s).
Note that referencing specific columns in a blueprint requires that those columns be present to train the blueprint. DataRobot provides a warning reminder when editing or training a blueprint that the named columns may not be present in the current project.
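The include/exclude behavior controlled by column_names and method can be sketched in plain Python. The function name `select_columns`, the representation of rows as dictionaries, and the literal values "include" and "exclude" are assumptions for illustration; the source names the two parameters but not their accepted values.

```python
def select_columns(rows, column_names, method="include"):
    """Keep or drop the named columns from each row (a dict of column -> value)."""
    if method not in ("include", "exclude"):
        raise ValueError(f"unknown method: {method!r}")
    names = set(column_names)
    keep_named = method == "include"
    selected = []
    for row in rows:
        # Named columns must exist in the data, mirroring the warning above.
        missing = names - set(row)
        if missing:
            raise KeyError(f"columns not present: {sorted(missing)}")
        selected.append({k: v for k, v in row.items() if (k in names) == keep_named})
    return selected


rows = [{"age": 41, "city": "Oslo", "plan": "pro"}]
included = select_columns(rows, ["city", "plan"], method="include")
excluded = select_columns(rows, ["city"], method="exclude")
```

The KeyError branch mirrors the point made above: a blueprint that names specific columns can only be trained on a dataset that contains them.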
DataRobot validates each node based on the incoming and outgoing edges, checking to ensure that data type, sparse vs. dense data, and shape (number of columns) requirements are met. If you have made changes that cause validation warnings, the affected nodes are displayed in yellow on the blueprint.
Hover over a node to see specifics.
In addition to checking a task's input and output, DataRobot validates that a blueprint doesn't form cycles. If a cycle is introduced, DataRobot provides a warning, indicating which nodes are causing the issue.
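Cycle detection of this kind is typically a depth-first search over the directed graph of nodes and connectors. The sketch below shows one standard way to find the nodes involved in a cycle; it is not DataRobot's implementation, and the `find_cycle` function and the node names are hypothetical.

```python
def find_cycle(edges):
    """Return a list of nodes forming a cycle, or [] if the graph is acyclic."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                # Reached a node already on the current path: a cycle.
                return path[path.index(nxt):]
            if color[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return []

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return []


valid = find_cycle([("Data", "Impute"), ("Impute", "Model")])
broken = find_cycle([("Data", "Impute"), ("Impute", "Model"), ("Model", "Impute")])
```

Returning the nodes on the cycle, rather than a plain true/false, matches the described behavior of indicating which nodes cause the issue.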
Train new models¶
After changes have been made and saved for a blueprint, the option to train a model using that blueprint becomes available. Click Train to open the window and then select a feature list, sample size, and the number of folds used in cross validation. Then, click Train model.
The model becomes available to the project on the model Leaderboard. If errors were encountered during model building, DataRobot provides several indicators.
You can view the errored node from the Describe > Blueprint tab. Click on the problematic task to see the error message or validation warning.
For additional information about a custom task that failed, you can find the full error traceback in the Describe > Log tab.
The following sections provide details to help ensure successful blueprint creation.
Boosting is a technique that can improve accuracy by training a model using predictions of another model. It uses multiple estimators together, which in turn either use data in multiple forms or help calibrate predictions.
A boosting pipeline has two key components:
- Booster task: A node that boosts the predictions (Text fit on Residuals (L2/Binomial Deviance) in the example above). The list of built-in booster tasks is available in the task selector under Models > Boosting.
- Boosting input: A node that supplies the prediction to boost (eXtreme Gradient Boosted Trees Classifier with Early stopping in the example) and other tasks that pass extra variables to the booster (Matrix of word-grams occurrences). The boosting input must meet the following criteria:
  - There must be only one task that provides predictions to boost.
  - There must be at least one task providing extra explanatory variables to the booster, other than the predictions (Matrix of word-grams occurrences in the example).
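One common form of boosting, fitting a second model on the residuals of the first, can be sketched as follows. This is a toy illustration under stated assumptions: the base model is a plain mean predictor and the booster is a one-variable least-squares line, standing in for the much richer tasks named in the example. The `fit_line` helper and all variable names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for one variable; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((v - mean_x) ** 2 for v in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / var_x
    return slope, mean_y - slope * mean_x


y = [3.0, 5.0, 7.0, 9.0]        # target
extra = [1.0, 2.0, 3.0, 4.0]    # extra explanatory variable for the booster

# Boosting input: exactly one task supplies the predictions to boost.
base_value = sum(y) / len(y)
base_preds = [base_value for _ in y]

# Booster task: fit on what the base predictions failed to explain.
residuals = [target - pred for target, pred in zip(y, base_preds)]
slope, intercept = fit_line(extra, residuals)

# Final prediction: base prediction plus the booster's correction.
boosted = [pred + slope * v + intercept for pred, v in zip(base_preds, extra)]
```

The sketch has the same two ingredients as the criteria above: a single source of predictions to boost (`base_preds`) and at least one extra explanatory variable (`extra`) from which the booster learns its correction.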