Feature Discovery

To deploy AI across the enterprise and make the best use of predictive models, you must be able to access relevant features. Often, the dataset you start with does not contain the right set of features. Feature Discovery discovers and generates new features from multiple datasets so that you no longer need to perform manual feature engineering to consolidate various datasets into one.

See the Feature Discovery file requirements for information about dataset sizes, and the associated considerations for important additional information.

Self-managed: Allocate resources for large datasets

If you're working with large datasets, an admin can allocate additional compute resources by navigating to User settings > System configuration, enabling XLARGE_MM_WORKER_SAFER_AIM_CONTAINER_MEM_MB, and specifying the number of resources in the field.

Open Feature Discovery

To perform Feature Discovery in Workbench, in the Data tab, click the Actions menu > Feature Discovery to the right of the dataset that will serve as the primary dataset. When you add and configure secondary datasets in the Feature Discovery recipe, you will define their relationship to the dataset selected here.

DataRobot opens Feature Discovery and adds the primary dataset to the canvas.

Configure primary dataset settings

With the primary dataset selected, enter the Prediction point (time of the prediction). Prediction point is only available if a date feature is detected in the dataset.

Set the target and prediction point.

Then, click Save. The message Primary data settings saved is displayed at the bottom of the page.

Add secondary datasets

Feature Discovery requires at least one secondary dataset. If you have only one dataset, you do not need Feature Discovery and can use that dataset directly to set up an experiment. To add secondary datasets:

  1. Click + Add Datasets in the left panel. The Add Data modal opens.

  2. You can add data from a data connection, the Data Registry, or your current Use Case, as well as preview a dataset by clicking it. Select the box to the left of each secondary dataset you want to add, then click Add Datasets.

    All secondary datasets are displayed in the left panel.

Add relationships

Adding a relationship between datasets tells DataRobot that the two datasets are connected. There are two ways to establish a relationship between a primary and secondary dataset:

  • Select the secondary dataset, and click the + that appears below a dataset node on the canvas.

  • Select a dataset node on the canvas and, from the actions menu, select Add relation. In the left panel, select the dataset you want to join.

Note

After defining a relationship between a primary and secondary dataset, you must configure the join conditions for that relationship before adding another dataset.

Set join conditions

While adding a relationship establishes that there's a connection between two datasets, the join conditions specify how they're related.

If the tables in your datasets are well-formed, DataRobot automatically detects compatible features and populates the Join condition field with the most appropriate feature, typically, a feature that's included in both datasets.
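As a rough illustration of why a feature that appears in both datasets makes a natural join condition, the sketch below lists the feature names shared by two datasets. It is not DataRobot's detection logic; the helper `candidate_join_keys` and the column names are hypothetical.

```python
# Hypothetical helper: an illustration only, not DataRobot's detection logic.
def candidate_join_keys(primary_columns, secondary_columns):
    """Return feature names present in both datasets, in primary-dataset order."""
    secondary = set(secondary_columns)
    return [name for name in primary_columns if name in secondary]


# Made-up column lists for a primary and a secondary dataset.
primary_columns = ["customer_id", "signup_date", "churned"]
secondary_columns = ["transaction_id", "customer_id", "amount", "date"]

print(candidate_join_keys(primary_columns, secondary_columns))
```

Here the only shared feature is `customer_id`, so it is the obvious candidate for the Join condition field.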

# | Element | Description
1 | Join | A visual representation of a relationship, or join, between two dataset nodes. Click this to edit a relationship and its join conditions.
2 | Nodes to join | The two dataset nodes that are joined.
3 | Join condition | The features, one from each dataset, that tell DataRobot how the two datasets are related.
4 | + Add join condition | Click to include an additional join condition.
5 | Save / Save and configure time-aware | Save: For non-time-aware configurations, saves the relationship and join conditions. Save and configure time-aware: For time-aware configurations, saves the relationship and join conditions, and opens the Time-awareness tab for further secondary dataset configuration.

Join feature type compatibility and restrictions

See the table below for compatible join types when creating or modifying joins:

Feature type | Compatible join types
Numeric | Numeric, Categorical
Categorical | Categorical, Numeric, Text
Text | Text, Categorical
Date | Date

The following feature types cannot be used as join keys:

  • Summarized categorical
  • Length
  • Currency
  • Percentage
  • Audio
  • Image
  • Document
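The compatibility table and the list of excluded feature types can be expressed as a small lookup. This is a minimal sketch that only encodes the rules stated above; the function name `can_join` is hypothetical and is not part of any DataRobot API.

```python
# Compatible join types, copied from the table above.
COMPATIBLE_JOIN_TYPES = {
    "Numeric": {"Numeric", "Categorical"},
    "Categorical": {"Categorical", "Numeric", "Text"},
    "Text": {"Text", "Categorical"},
    "Date": {"Date"},
}

# Feature types that cannot be used as join keys, copied from the list above.
NOT_ALLOWED_AS_JOIN_KEY = {
    "Summarized categorical",
    "Length",
    "Currency",
    "Percentage",
    "Audio",
    "Image",
    "Document",
}


def can_join(left_type, right_type):
    """Return True if the two feature types may be paired in a join condition."""
    if left_type in NOT_ALLOWED_AS_JOIN_KEY or right_type in NOT_ALLOWED_AS_JOIN_KEY:
        return False
    return right_type in COMPATIBLE_JOIN_TYPES.get(left_type, set())


print(can_join("Numeric", "Categorical"))
print(can_join("Numeric", "Text"))
print(can_join("Date", "Currency"))
```

Under these rules, a Numeric feature can be joined to a Categorical one, but not to a Text feature, and a Currency feature cannot be used as a join key at all.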

For more information, see Set join conditions in the DataRobot Classic section.

Configure secondary dataset settings

Select a secondary dataset node on the canvas to configure its settings, including its name, feature list, and time-awareness (if applicable).

Node Settings

To edit the settings for a secondary dataset node, click on a secondary dataset node and open the Node Settings tab, which includes the following options:

# | Element | Description
1 | Node alias | Modify the name displayed at the top of the node. By default, the string displayed on the canvas is the name of the secondary dataset. Entering a node alias is helpful if the dataset name is too long to display in full.
2 | Snapshot policy | Select a snapshot policy to associate with the dataset node.
3 | Feature list | Select a feature list to apply to the dataset in this node.
4 | + Create new feature list | Create a new feature list to apply to the dataset node using the features listed below.
5 | Features | View the features included in the dataset.

Time-awareness

If DataRobot detects a date feature in the primary dataset, you can select a prediction point to configure time-awareness. To edit these settings for a secondary node, open the Time-awareness tab, which includes the following options:

# | Element | Description
1 | Time index | Determines the time window when DataRobot performs joins and aggregations during Feature Discovery.
2 | Feature derivation window (FDW) | Set the rolling window used to create features, which increases the model’s ability to learn from data trends and results in more accurate forecasts.
3 | + Add feature derivation window | Define additional FDWs to fine-tune time-aware Feature Discovery.
4 | Prediction point: {date_feature} rounded down to nearest | Control how DataRobot rounds down the prediction point when running Feature Discovery. Rounding makes the Feature Discovery process faster, but can exclude the most recent secondary dataset records.

Prediction point vs. Time index

Prediction point applies to the primary dataset and is used as the reference date for when you can make predictions. Time index applies to secondary datasets and is used to determine the time window when DataRobot can perform joins and aggregations as part of Feature Discovery.
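To make the interaction between the prediction point, the time index, and a feature derivation window concrete, the following sketch filters secondary records to a window that ends at the prediction point, with and without rounding. It is an illustration under stated assumptions, not DataRobot's implementation: the helper names, the sample records, and the half-open window boundaries are all hypothetical.

```python
from datetime import datetime, timedelta


def round_down(prediction_point, unit):
    """Round a prediction point down to the start of its minute, hour, or day."""
    if unit == "minute":
        return prediction_point.replace(second=0, microsecond=0)
    if unit == "hour":
        return prediction_point.replace(minute=0, second=0, microsecond=0)
    if unit == "day":
        return prediction_point.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {unit}")


def records_in_window(records, prediction_point, window_start, window_end):
    """Keep records whose time index falls in [point - start, point - end).

    The half-open interval is an assumption made for this sketch.
    """
    lower = prediction_point - window_start
    upper = prediction_point - window_end
    return [r for r in records if lower <= r["time_index"] < upper]


# Made-up secondary records; "time_index" plays the role of the Time index.
transactions = [
    {"time_index": datetime(2024, 3, 10, 14, 5), "amount": 20.0},
    {"time_index": datetime(2024, 3, 9, 9, 0), "amount": 5.0},
    {"time_index": datetime(2024, 2, 1, 12, 0), "amount": 7.0},
]

prediction_point = datetime(2024, 3, 10, 14, 37)
seven_days, zero_days = timedelta(days=7), timedelta(0)

exact = records_in_window(transactions, prediction_point, seven_days, zero_days)
rounded = records_in_window(
    transactions, round_down(prediction_point, "day"), seven_days, zero_days
)

print(len(exact))    # 2
print(len(rounded))  # 1
```

With the unrounded prediction point, both recent transactions fall inside the 7-day window. After rounding down to the nearest day, the transaction from earlier on the prediction day is excluded, which is the kind of record loss that rounding can cause.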

For more information, see Time-aware feature engineering.

Automatically generate relationships

Automatic relationship detection (ARD) analyzes the primary dataset and all secondary datasets in a Feature Discovery recipe to detect and generate relationships between features, allowing you to quickly explore potential relationships when you're unsure of how the datasets connect.

Note

Note the following before automatically generating relationships:

  • All secondary datasets must be added to the Feature Discovery recipe prior to running ARD.
  • ARD does not run on dynamic datasets.

To automatically generate relationships in a Feature Discovery recipe:

  1. Make sure all secondary datasets are added.

  2. Then, click Generate Relationships at the top of the canvas.

    Once ARD is complete, DataRobot automatically adds secondary datasets to the canvas and configures relationships between the datasets.

Review relationship configurations

After configuring at least one secondary dataset, you can test the quality of those relationship configurations to identify and resolve potential problems early in the creation process. The Relationship Quality Assessment tool verifies join keys, dataset selection, and time-aware settings.

Click Review configuration to test the relationships on the Feature Discovery canvas.

Each node displays the results of the assessment. If the quality of a relationship passes the assessment, a green check mark is displayed in the node.

If the assessment detects quality issues, a yellow exclamation point is displayed in the affected node.

For more information, see Test relationship quality.

Configure Feature Discovery controls

To influence how DataRobot conducts feature engineering, open Settings, which includes feature engineering controls and feature reduction.

Setting | Description | Read more in DataRobot Classic
Feature discovery controls | Set which feature types DataRobot evaluates during Feature Discovery. | See Feature Discovery settings.
Feature reduction | When enabled, during Feature Discovery, DataRobot generates new features, and then removes features that have low impact or are redundant. | See Feature reduction.
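To give a feel for what removing low-impact or redundant features can mean, the sketch below drops features that are constant or that exactly duplicate an earlier feature. This is a deliberately simple stand-in and not the method DataRobot uses, which this page does not describe; the function `reduce_features` and the feature names are hypothetical.

```python
# Simplified stand-in for feature reduction; not DataRobot's algorithm.
def reduce_features(columns):
    """Drop constant features and exact duplicates of earlier features.

    `columns` maps each feature name to its list of values.
    """
    kept = {}
    seen = set()
    for name, values in columns.items():
        if len(set(values)) <= 1:
            continue  # constant feature: carries no signal
        signature = tuple(values)
        if signature in seen:
            continue  # identical to a feature that was already kept
        seen.add(signature)
        kept[name] = values
    return kept


# Made-up generated features.
generated = {
    "txn_count_7d": [1, 2, 3],
    "txn_count_7d_copy": [1, 2, 3],
    "txn_currency": ["USD", "USD", "USD"],
    "txn_sum_7d": [10.0, 25.5, 7.0],
}

print(list(reduce_features(generated)))
```

In this example only `txn_count_7d` and `txn_sum_7d` survive: the copy is redundant and the constant currency feature has no impact.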

Start modeling

When you've finished configuring relationships and they've passed the relationship configuration assessment, you can proceed directly to experiment setup to start modeling.

To set up an experiment using the Feature Discovery recipe:

  1. Click Recipe actions > Start modeling.

  2. Set up the experiment for either predictive or time-aware modeling.

After you click Start modeling in the experiment, DataRobot performs joins and aggregations as part of Feature Discovery, generating an enriched output dataset that is then registered in the Data Registry and added to your current Use Case.
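Conceptually, joining and aggregating means that each row of the primary dataset gains new columns summarizing its matching secondary rows. The sketch below shows that idea with plain Python data structures. It is only an illustration: the `enrich` helper, the sample rows, and the generated feature names are hypothetical and do not reflect how DataRobot names or computes features.

```python
from collections import defaultdict

# Made-up primary dataset (one row per customer) and secondary dataset.
primary = [
    {"customer_id": 1, "churned": 0},
    {"customer_id": 2, "churned": 1},
    {"customer_id": 3, "churned": 0},
]
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 5.0},
    {"customer_id": 2, "amount": 12.5},
]


def enrich(primary_rows, secondary_rows, key, value):
    """Join on `key` and add count and sum aggregates of `value` to each primary row."""
    grouped = defaultdict(list)
    for row in secondary_rows:
        grouped[row[key]].append(row[value])

    enriched = []
    for row in primary_rows:
        values = grouped.get(row[key], [])  # no matches yields an empty list
        enriched.append(
            {**row, f"{value}_count": len(values), f"{value}_sum": sum(values)}
        )
    return enriched


for row in enrich(primary, transactions, key="customer_id", value="amount"):
    print(row)
```

The output keeps one row per customer, as in the primary dataset, and customers without transactions receive a count of zero rather than being dropped.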

Download recipe SQL

Once the enriched dataset is registered and added to the Use Case, which only happens after you start modeling, you can access the Spark SQL that DataRobot used to execute the actions specified in your Feature Discovery recipe.

To access the recipe SQL:

  1. Open the enriched dataset in the Use Case.
  2. On the Info tab for the dataset, click Recipe SQL.

  3. View the SQL to understand how DataRobot performed the joins and aggregations as part of Feature Discovery, or copy it to run in a new Spark cluster.


Updated October 24, 2024