Build a recipe
Building a recipe is the first step in preparing your data. When you start a Wrangle session, DataRobot connects to your data source, pulls a live random sample, and performs exploratory data analysis on that sample. When you add operations to your recipe, the transformation is applied to the sample and the exploratory data insights are recalculated, allowing you to quickly iterate on and profile your data before publishing.
See the associated considerations for important additional information.
To wrangle data, you must add a dataset using a configured data connection.
When a wrangling recipe is pushed down to Snowflake, the operations are executed in the Snowflake environment. To understand how operations behave in Snowflake, refer to the Snowflake documentation.
To view which queries were executed by Snowflake during pushdown, open the AI Catalog and select the new output dataset. The queries are listed on the Info tab.
Configure the live sample
By default, DataRobot retrieves 10000 rows for the live sample; however, you can modify this number in the wrangling settings. Note that the more rows you retrieve, the longer it takes to render the live sample.
To configure the live sample:
Click Settings in the right panel and open Interactive sample.
Enter the number of rows (under 10000) you want to include in the live sample and click Resample. The live sample updates to display the specified number of rows.
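As a rough mental model of what resampling does, the sketch below draws a fixed number of rows at random from a larger source. It is only an illustration: DataRobot samples at the data source, and this page does not describe the mechanism. The function name draw_live_sample and the in-memory list of rows are hypothetical.

```python
import random


def draw_live_sample(rows, n_rows, seed=None):
    """Return up to n_rows rows chosen uniformly at random, without replacement.

    Hypothetical helper: the real live sample is retrieved from the data
    source, not from an in-memory list.
    """
    rng = random.Random(seed)
    k = min(n_rows, len(rows))  # never ask for more rows than exist
    return rng.sample(rows, k)


# Stand-in for a source table with 50,000 rows.
source = [{"id": i} for i in range(50_000)]
sample = draw_live_sample(source, 2_000, seed=0)
print(len(sample))
```

A smaller sample means less data to transfer and profile, which is why fewer rows render faster.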
Analyze the live sample
During data wrangling, DataRobot performs exploratory data analysis on the live sample, generating table- and column-level summary statistics and visualizations that help you profile the dataset and recognize data quality issues as you apply operations. For more information on interacting with the live sample, see the section on Exploratory Data Insights.
Speed up live sample
To reduce the time it takes to retrieve and render the live sample, use the toggle next to Show Insights to hide the feature distribution charts.
Live sample vs. Exploratory Data Insights on the Datasets tab
Although both pages provide similar insights, the live sample differs in two ways: you can specify the number of rows it displays, and it updates each time a transformation is added to your recipe.
A recipe is composed of operations—transformations that will be applied to the source data to prepare it for modeling. Note that all operations are processed with the
The table below describes the wrangling operations currently available in Workbench:
|Compute new feature||Create a new feature using Snowflake scalar subqueries, scalar functions, or window functions.|
|Filter row||Filter the rows in your dataset according to specified value(s) and conditions.|
|De-duplicate rows||Automatically remove all duplicate rows from your dataset.|
|Find and replace||Replace specific feature values in a dataset.|
|Rename features||Change the name of one or more features in your dataset.|
|Remove features||Remove one or more features from your dataset.|
To add an operation to your recipe:
With Recipe selected, click Add Operation in the right panel.
Select and configure an operation. Then, click Add to recipe.
The live sample updates after DataRobot retrieves a new sample from the data source and applies the operation, allowing you to review the transformation in real time.
Continue adding operations while analyzing their effect on the live sample; when you're done, the recipe is ready to be published.
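One way to picture a recipe is as an ordered list of operations, each applied to the output of the previous one. The sketch below models that idea in plain Python. It is not DataRobot's implementation: apply_recipe and the two example operations are hypothetical names.

```python
def apply_recipe(sample, recipe):
    """Apply each operation in order; every operation takes and returns a list of row dicts."""
    rows = sample
    for operation in recipe:
        rows = operation(rows)
    return rows


def drop_negative_amounts(rows):
    # Hypothetical filter-style operation.
    return [r for r in rows if r["amount"] >= 0]


def add_amount_cents(rows):
    # Hypothetical compute-style operation that adds a derived feature.
    return [{**r, "amount_cents": round(r["amount"] * 100)} for r in rows]


sample = [{"amount": 1.5}, {"amount": -2.0}, {"amount": 3.25}]
recipe = [drop_negative_amounts, add_amount_cents]
result = apply_recipe(sample, recipe)
print(result)
```

Because each operation returns new rows rather than modifying its input, the original sample stays untouched and the recipe can be re-applied whenever a fresh sample is drawn.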
Use the Compute new feature operation to create a new output feature from existing features in your dataset. By applying domain knowledge, you can create features that do a better job of representing your business problem to the model than those in the original dataset.
To compute a new feature:
Click Compute new feature in the right panel.
Enter a name for the new feature, and under Expression, use Snowflake scalar subqueries, scalar functions, or window functions to define the feature.
This example uses REGEXP_SUBSTR to extract the first number from values of the form [<age_range_start> - <age_range_end>) in the source feature, and to_number to convert the output from a string to a number.
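To check the logic of such an expression before entering it, you can prototype it outside Snowflake. The sketch below is a Python analogue of the extract-then-convert step described above. It does not reproduce the Snowflake expression, which is not shown on this page. The pattern \d+ and the function name age_range_start are assumptions.

```python
import re


def age_range_start(age_range):
    """Return the first run of digits in a string such as "[40 - 50)" as an int.

    Returns None when no digits are found, loosely mirroring a NULL result.
    """
    match = re.search(r"\d+", age_range)  # assumed pattern: first run of digits
    return int(match.group()) if match else None


print(age_range_start("[40 - 50)"))
```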
Use the Filter row operation to filter the rows in your dataset according to specified value(s) and conditions.
To filter rows:
Click Filter row in the right panel.
Decide if you want to keep the rows that match the defined conditions or exclude them.
Define the filter conditions by choosing the feature you want to filter, the condition type, and the value you want to filter by. DataRobot highlights the selected column.
(Optional) Click Add condition to define additional filtering criteria.
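The keep-or-exclude choice and the feature, condition type, and value triple can be sketched as follows. This is an illustration under stated assumptions: the condition names in CONDITIONS are hypothetical, and multiple conditions are combined with AND, which this page does not confirm.

```python
import operator

# Hypothetical condition types; the actual list in the UI may differ.
CONDITIONS = {
    "equals": operator.eq,
    "greater_than": operator.gt,
    "less_than": operator.lt,
}


def filter_rows(rows, conditions, keep_matching=True):
    """Keep (or exclude) rows that satisfy every (feature, condition, value) triple."""

    def matches(row):
        # Assumption: all conditions must hold (logical AND).
        return all(
            CONDITIONS[cond](row[feature], value)
            for feature, cond, value in conditions
        )

    return [r for r in rows if matches(r) == keep_matching]


rows = [
    {"age": 25, "state": "CA"},
    {"age": 40, "state": "CA"},
    {"age": 52, "state": "NY"},
]
conditions = [("state", "equals", "CA"), ("age", "greater_than", 30)]
kept = filter_rows(rows, conditions, keep_matching=True)
excluded = filter_rows(rows, conditions, keep_matching=False)
print(kept)
```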
Use the De-duplicate rows operation to automatically remove all rows with duplicate information from the dataset.
To de-duplicate rows, click De-duplicate rows in the right panel. This operation is immediately added to your recipe and applied to the live sample.
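Conceptually, de-duplication keeps one copy of each row whose values match across all features. The sketch below shows that behavior in plain Python. Keeping the first occurrence and preserving row order are assumptions, and deduplicate_rows is a hypothetical name.

```python
def deduplicate_rows(rows):
    """Remove rows that are identical across all features, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        # Sort by feature name so the key does not depend on dict ordering.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"id": 1, "city": "Austin"},
    {"id": 2, "city": "Boston"},
    {"city": "Austin", "id": 1},  # same values as the first row
]
print(deduplicate_rows(rows))
```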
Use the Find and replace operation to quickly replace specific feature values in a dataset. This is helpful, for example, for fixing typos in a dataset.
To find and replace a feature value:
Click Find and replace in the right panel.
Under Select feature, click the dropdown and choose the feature that contains the value you want to replace. DataRobot highlights the selected column.
Under Find, choose the match criteria—Exact, Partial, or Regular Expression—and enter the feature value you want to replace. Then, under Replace, enter the new value.
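The three match criteria can be compared with a small sketch. The semantics here are assumptions rather than documented behavior: Exact replaces a value only when the whole value equals the search text, Partial replaces just the matching substring, and Regular Expression substitutes every pattern match. The function name and mode strings are hypothetical.

```python
import re


def find_and_replace(values, find, replace, mode="exact"):
    """Replace values in one feature according to the chosen match mode."""
    if mode == "exact":
        # Whole value must equal the search text.
        return [replace if v == find else v for v in values]
    if mode == "partial":
        # Assumption: only the matching substring is replaced.
        return [v.replace(find, replace) for v in values]
    if mode == "regex":
        pattern = re.compile(find)
        return [pattern.sub(replace, v) for v in values]
    raise ValueError(f"unknown mode: {mode}")


exact = find_and_replace(["Calfornia", "Texas"], "Calfornia", "California")
partial = find_and_replace(
    ["Calfornia North", "Texas"], "Calfornia", "California", mode="partial"
)
by_regex = find_and_replace(["CA-001", "TX-42"], r"-\d+", "", mode="regex")
print(exact, partial, by_regex)
```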
Use the Rename features operation to rename one or more features in the dataset.
To rename features:
Click Rename features in the right panel.
Rename specific features from the live sample
Alternatively, you can click the More options icon next to the feature you want to rename. This opens the operation parameters in the right panel with the feature field already filled in.
Under Feature name, click the dropdown and choose the feature you want to rename. Then, enter the new feature name in the second field.
(Optional) Click Add feature to rename additional features.
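Renaming several features at once amounts to applying a mapping from old names to new names. A minimal sketch, assuming rows are represented as dictionaries and using the hypothetical helper rename_features:

```python
def rename_features(rows, renames):
    """Return rows with feature names replaced according to the renames mapping."""
    return [
        {renames.get(name, name): value for name, value in row.items()}
        for row in rows
    ]


rows = [{"cust_id": 7, "amt": 19.99}]
renamed = rename_features(rows, {"cust_id": "customer_id", "amt": "amount"})
print(renamed)
```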
Use the Remove features operation to remove features from the dataset.
To remove features:
Click Remove features in the right panel.
Remove specific features from the live sample
Alternatively, you can click the More options icon next to the feature you want to remove. This opens the operation parameters in the right panel with the feature field already filled in.
Under Feature name, click the dropdown and either start typing the feature name or scroll through the list to select the feature(s) you want to remove. Click outside of the dropdown when you're done selecting features.
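The effect of removing features can be pictured as dropping the selected columns from every row. A minimal sketch, assuming rows are represented as dictionaries and using the hypothetical helper remove_features:

```python
def remove_features(rows, features):
    """Return rows without the named features; unknown names are ignored."""
    drop = set(features)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


rows = [{"id": 1, "notes": "n/a", "score": 0.8}]
trimmed = remove_features(rows, ["notes"])
print(trimmed)
```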
At any point, you can click Quit Wrangling to end your wrangling session; however, any operations applied to the dataset will be removed.
To learn more about the topics discussed on this page, see:
- Snowflake scalar functions for computing features.
- Snowflake window functions for computing features.
- Description of summary statistics and histograms in DataRobot Classic.