Feature Discovery

Preview

Feature Discovery in Workbench is available for preview and on by default.

Feature flag(s): Enable Feature Discovery in Workbench

To deploy AI across the enterprise and make the best use of predictive models, you must be able to access relevant features. Often, your starting data does not contain the right set of features. Feature Discovery discovers and generates new features from multiple datasets, so you no longer need to perform manual feature engineering to consolidate various datasets into one.

See the Feature Discovery file requirements for information about dataset sizes, and the associated considerations for important additional information.

Open Feature Discovery

To perform Feature Discovery in Workbench, in the Data tab, click the more options icon > Feature Discovery to the right of the dataset that will serve as the primary dataset. When you add and configure secondary datasets in the Feature Discovery recipe, you will define their relationship to the dataset selected here.

DataRobot opens Feature Discovery and adds the primary dataset to the canvas.

Data sources

Unlike data wrangling, Feature Discovery can support data from the Data Registry, a data connection, or a local file.

Why were two Feature Discovery recipes added to the Data tab?

When you launch Feature Discovery, DataRobot adds two recipe entries to the Data tab of your current Use Case. One recipe is active, and you can open it to continue modifying it in Workbench. The other recipe appears grayed out and cannot be opened. It is a placeholder that lets you optionally resume Feature Discovery in DataRobot Classic.

Configure primary dataset settings

With the primary dataset selected, enter the Target (what you want to predict) and Prediction point (time of the prediction). Prediction point is only available if a date feature is detected in the dataset.

Then, click Save. The message "Primary data settings saved" is displayed at the bottom of the page.

Add secondary datasets

Feature Discovery requires at least one secondary dataset. If you have only one dataset, you do not need Feature Discovery and can set up an experiment directly from that dataset. To add secondary datasets:

  1. Click + Add Datasets in the left panel. The Add Data modal opens.

  2. You can add data from a data connection, the Data Registry, or your current Use Case, as well as preview a dataset by clicking it. Select the box to the left of each secondary dataset you want to add, then click Add Datasets.

    All secondary datasets are displayed in the left panel.

Add relationships

Adding a relationship between datasets tells DataRobot that the two datasets are connected. There are two ways to establish a relationship between a primary and secondary dataset:

  • Select the secondary dataset, and click the + that appears below a dataset node on the canvas.

  • Select a dataset node on the canvas, click the more options icon, and select Add relation. In the left panel, select the dataset you want to join.

Note

After defining a relationship between a primary and secondary dataset, you must configure the join conditions for that relationship before adding another dataset.

Set join conditions

While adding a relationship establishes that there's a connection between two datasets, the join conditions specify how they're related.

If the tables in your datasets are well-formed, DataRobot automatically detects compatible features and populates the Join condition field with the most appropriate feature, typically one that is included in both datasets.
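As a rough illustration of what a join condition makes possible, the sketch below joins a primary dataset to a secondary dataset on a shared key and summarizes the matching secondary rows into new features. The datasets, the `customer_id` key, and the `join_and_aggregate` helper are hypothetical and are not part of DataRobot; the features Feature Discovery actually generates may differ.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical primary dataset: one row per customer to score.
primary = [
    {"customer_id": 1, "target": 0},
    {"customer_id": 2, "target": 1},
]

# Hypothetical secondary dataset: many transactions per customer.
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 5.0},
]


def join_and_aggregate(primary_rows, secondary_rows, key, value):
    """Join on `key`, then summarize `value` from the secondary rows per primary row."""
    # Group the secondary values by the join key.
    grouped = defaultdict(list)
    for row in secondary_rows:
        grouped[row[key]].append(row[value])

    # Attach simple aggregates to each primary row.
    enriched = []
    for row in primary_rows:
        values = grouped.get(row[key], [])
        enriched.append({
            **row,
            f"{value}_count": len(values),
            f"{value}_sum": sum(values),
            f"{value}_mean": mean(values) if values else None,
        })
    return enriched


features = join_and_aggregate(primary, transactions, "customer_id", "amount")
for row in features:
    print(row)
```

The join condition (`customer_id` here) is what lets each primary row collect its related secondary rows; without it, there is nothing to aggregate over.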

# | Element | Description
1 | Join | A visual representation indicating a relationship, or join, between two dataset nodes. Click this to edit a relationship and its join conditions.
2 | Nodes to join | The two dataset nodes that are joined.
3 | Join condition | The features, one from each dataset, that tell DataRobot how the two datasets are related.
4 | + Add join condition | Click to include an additional join condition.
5 | Save / Save and configure time-aware | Save: For non-time-aware configurations, saves the relationship and join conditions. Save and configure time-aware: For time-aware configurations, saves the relationship and join conditions, and opens the Time-awareness tab for further secondary dataset configuration.

Join feature type compatibility and restrictions

See the table below for compatible join types when creating or modifying joins:

Feature type | Compatible join types
Numeric | Numeric, Categorical
Categorical | Categorical, Numeric, Text
Text | Text, Categorical
Date | Date

The following feature types cannot be used as join keys:

  • Summarized categorical
  • Length
  • Currency
  • Percentage
  • Audio
  • Image
  • Document

For more information, see Set join conditions in the DataRobot Classic section.
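The compatibility table and the restricted feature types above can be encoded as a small lookup, which may be handy for checking candidate join keys before building a recipe. This is a minimal sketch: the `can_join` helper and the constant names are hypothetical, and only the type names and pairings come from this page.

```python
# Compatible join types, copied from the compatibility table on this page.
COMPATIBLE_JOIN_TYPES = {
    "Numeric": {"Numeric", "Categorical"},
    "Categorical": {"Categorical", "Numeric", "Text"},
    "Text": {"Text", "Categorical"},
    "Date": {"Date"},
}

# Feature types that cannot be used as join keys, copied from the list on this page.
UNSUPPORTED_JOIN_KEY_TYPES = {
    "Summarized categorical",
    "Length",
    "Currency",
    "Percentage",
    "Audio",
    "Image",
    "Document",
}


def can_join(left_type, right_type):
    """Return True if the two feature types form a compatible join pair."""
    if left_type in UNSUPPORTED_JOIN_KEY_TYPES or right_type in UNSUPPORTED_JOIN_KEY_TYPES:
        return False
    return right_type in COMPATIBLE_JOIN_TYPES.get(left_type, set())


print(can_join("Numeric", "Categorical"))
print(can_join("Numeric", "Text"))
print(can_join("Currency", "Numeric"))
```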

Configure secondary dataset settings

Select a secondary dataset node on the canvas to configure its settings, including its name, feature list, and time-awareness (if applicable).

Node Settings

To edit the settings for a secondary dataset node, click on a secondary dataset node and open the Node Settings tab, which includes the following options:

# | Element | Description
1 | Node alias | Modify the name displayed at the top of the node. By default, the string displayed on the canvas is the name of the secondary dataset. Entering a node alias is helpful if the dataset name is too long to display in full.
2 | Snapshot policy | Select a snapshot policy to associate with the dataset node.
3 | Feature list | Select a feature list to apply to the dataset in this node.
4 | + Create new feature list | Create a new feature list to apply to the dataset node using the features listed below.
5 | Features | View the features included in the dataset.

Time-awareness

If DataRobot detects a date feature in the primary dataset, you can select a prediction point to configure time-awareness. To edit these settings for a secondary node, open the Time-awareness tab, which includes the following options:

# | Element | Description
1 | Time index | Determines the time window when DataRobot performs joins and aggregations during Feature Discovery.
2 | Feature derivation window (FDW) | Set the rolling window used to create features, which increases the model's ability to learn from data trends and results in more accurate forecasts.
3 | + Add feature derivation window | Define additional FDWs to fine-tune time-aware Feature Discovery.
4 | Prediction point: {date_feature} rounded down to nearest | Control how DataRobot rounds down the prediction point when running Feature Discovery. Rounding makes the Feature Discovery process faster, but at the cost of potentially losing fresh secondary dataset records.

Prediction point vs. Time index

Prediction point applies to the primary dataset and is used as the reference date for when you can make predictions. Time index applies to secondary datasets and is used to determine the time window when DataRobot can perform joins and aggregations as part of Feature Discovery.

For more information, see Time-aware feature engineering.
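To make the interaction between the prediction point, the time index, and the feature derivation window more concrete, here is a minimal sketch. It assumes a half-open window ending at the prediction point and rounding down to the hour; the records, the window size, and the helper names are hypothetical, and DataRobot's exact boundary handling may differ.

```python
from datetime import datetime, timedelta


def round_down_to_hour(ts):
    """Round a timestamp down to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def records_in_window(records, prediction_point, window_start, window_end):
    """Keep records whose time index falls in [pp + window_start, pp + window_end)."""
    lower = prediction_point + window_start
    upper = prediction_point + window_end
    return [r for r in records if lower <= r["time_index"] < upper]


# Hypothetical secondary dataset with a time index column.
secondary = [
    {"time_index": datetime(2024, 4, 1, 9, 0), "amount": 10.0},
    {"time_index": datetime(2024, 4, 7, 12, 0), "amount": 25.0},
    {"time_index": datetime(2024, 4, 8, 10, 15), "amount": 40.0},
    {"time_index": datetime(2024, 4, 8, 10, 45), "amount": 99.0},
]

# Prediction point taken from a date feature in the primary dataset.
prediction_point = datetime(2024, 4, 8, 10, 30)
rounded = round_down_to_hour(prediction_point)

# Feature derivation window: the 7 days before the prediction point.
fdw_start, fdw_end = timedelta(days=-7), timedelta(0)

exact = records_in_window(secondary, prediction_point, fdw_start, fdw_end)
with_rounding = records_in_window(secondary, rounded, fdw_start, fdw_end)

print(sum(r["amount"] for r in exact))
print(sum(r["amount"] for r in with_rounding))
```

In this sketch, the record at 10:15 falls inside the window for the exact prediction point but outside it once the prediction point is rounded down to 10:00, which is the kind of "fresh" record that rounding can exclude. The record at 10:45 is excluded in both cases because it occurs after the prediction point.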

Review relationship configurations

After configuring at least one secondary dataset, you can test the quality of those relationship configurations to identify and resolve potential problems early in the creation process. The Relationship Quality Assessment tool verifies join keys, dataset selection, and time-aware settings.

Click Review configuration to test the relationships on the Feature Discovery canvas.

Each node displays the results of the assessment. If the quality of a relationship passes the assessment, a green check mark is displayed in the node.

If the assessment detects quality issues, a yellow exclamation point is displayed in the affected node.

For more information, see Test relationship quality.
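This page does not describe how the assessment evaluates join keys. As a loose illustration of one kind of problem such a check could surface, the sketch below computes the share of primary rows whose join key has at least one match in the secondary dataset. The `join_key_match_rate` helper and the sample keys are hypothetical and do not represent DataRobot's assessment logic.

```python
def join_key_match_rate(primary_keys, secondary_keys):
    """Return the fraction of primary keys that appear in the secondary dataset."""
    if not primary_keys:
        return 0.0
    secondary_set = set(secondary_keys)
    matched = sum(1 for key in primary_keys if key in secondary_set)
    return matched / len(primary_keys)


# Hypothetical join key values from each dataset.
primary_keys = [1, 2, 3, 4]
secondary_keys = [1, 2, 2, 5]

rate = join_key_match_rate(primary_keys, secondary_keys)
print(f"{rate:.0%} of primary rows have a matching secondary record")
```

A low match rate would suggest that the wrong join key or the wrong secondary dataset was selected.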

Publish a Feature Discovery recipe

When you've finished configuring relationships and they've passed the relationship configuration assessment, click Publish in the upper-right corner to access additional settings, including Feature Discovery controls, partitioning, and the ability to rename the output dataset. The tabs below describe the options available:

In the Feature Discovery tab, you can set:

Setting | Description | Read more in DataRobot Classic
Feature discovery controls | Set which feature types DataRobot evaluates during Feature Discovery. | See Feature Discovery settings.
Feature reduction | When enabled, during Feature Discovery, DataRobot generates new features, and then removes features that have low impact or are redundant. | See Feature reduction.

In the Partitioning tab, you can set:

Setting | Description | Read more in DataRobot Classic
Partitioning method | Set how DataRobot partitions data during Feature Discovery. Available options are dependent on the target feature and/or partition column. | See Partitioning details.
Validation type | Choose the validation types you want DataRobot to use during Feature Discovery. | See Validation types and Understand validation types.
Cross-validation folds | Available if the Validation type is set to Cross-validation. Set the number of cross-validation folds used to train models. | See Data partitioning and validation.
Validation percentage | Available if the Validation type is set to Training-validation-holdout. Set the subset of data to use for validation. | See Data partitioning and validation.
Holdout percentage | Set the subset of data that is unavailable during training and validation. | See Configure model validation.
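To illustrate how a holdout percentage and a number of cross-validation folds divide the data, the sketch below reserves a random holdout set and then assigns the remaining rows to folds. It assumes simple random partitioning with a fixed seed; the helper names are hypothetical, and DataRobot's partitioning methods may select rows differently.

```python
import random


def split_holdout(row_ids, holdout_pct, seed=0):
    """Shuffle the rows and reserve `holdout_pct` percent of them as holdout."""
    ids = list(row_ids)
    random.Random(seed).shuffle(ids)
    n_holdout = round(len(ids) * holdout_pct / 100)
    return ids[n_holdout:], ids[:n_holdout]


def assign_folds(row_ids, n_folds):
    """Distribute rows across `n_folds` cross-validation folds, round-robin."""
    folds = [[] for _ in range(n_folds)]
    for i, row_id in enumerate(row_ids):
        folds[i % n_folds].append(row_id)
    return folds


# Hypothetical dataset of 100 rows, 20% holdout, 5 cross-validation folds.
training, holdout = split_holdout(range(100), holdout_pct=20)
folds = assign_folds(training, n_folds=5)

print(len(holdout), [len(fold) for fold in folds])
```

Holdout rows never appear in any fold, which mirrors the description of holdout as data that is unavailable during training and validation.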

When you're done, click Publish. DataRobot then performs joins and aggregations as part of Feature Discovery, generating a new output dataset that is then registered in the Data Registry and added to your current Use Case.

Next steps

From here, you can:


Updated April 8, 2024