Create a Feature Discovery project¶
Feature Discovery is based on relationships—between datasets and the features within those datasets. DataRobot provides an intuitive relationship editor that allows you to build and visualize these relationships. The end product is a multitude of additional features that result from these linkages. These derived features can then train more accurate models and generate better predictions. DataRobot’s Feature Discovery engine analyzes the graphs and the included datasets to determine a feature engineering “recipe,” and from that recipe generates secondary features for training and predictions.
See the Feature Discovery file requirements for dataset size information.
Review the next section to get started with Feature Discovery. Or, skip to the step-by-step instructions that describe:
- Loading datasets into a project
- Defining relationships
- Setting join conditions
- Assessing the quality of relationship configurations
- Starting the project
Get started with Feature Discovery¶
In most cases, all you need to start a Feature Discovery project is a simple primary dataset that includes:
- The target (column that you want to predict).
- An identifier (for example, customer_id or transaction_id) to link the dataset to additional related datasets. This key serves as the basis of dataset joins.
- An optional time index—a date feature in the primary dataset—to support time-aware Feature Discovery. This date feature is used as the prediction point for generating new features.
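The checklist above can be pictured as a quick pre-flight check on the primary dataset's columns. This is a minimal sketch, not part of DataRobot: the function `check_primary_dataset` and the column names `defaulted` and `application_date` are hypothetical, chosen only for illustration.

```python
def check_primary_dataset(columns, target, identifier, time_index=None):
    """Return a list of problems; an empty list means the basic requirements are met.

    Hypothetical helper for illustration only, not a DataRobot API.
    """
    problems = []
    if target not in columns:
        problems.append(f"missing target column: {target}")
    if identifier not in columns:
        problems.append(f"missing identifier column: {identifier}")
    # The time index is optional, so it is only checked when one is named.
    if time_index is not None and time_index not in columns:
        problems.append(f"missing time index column: {time_index}")
    return problems


# Hypothetical column names for a primary dataset.
columns = ["customer_id", "application_date", "defaulted"]

ok = check_primary_dataset(
    columns, target="defaulted", identifier="customer_id", time_index="application_date"
)
bad = check_primary_dataset(columns, target="defaulted", identifier="transaction_id")

print(ok)
print(bad)
```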
Each record of the primary dataset represents the desired unit of analysis. From this primary dataset, DataRobot guides you through creating relationships to additional datasets, called secondary datasets.
Secondary datasets have features that can potentially enrich the primary dataset. Primary and secondary datasets can have one-to-one relationships, but this is not required. In most cases, DataRobot aggregates and summarizes features in the secondary datasets and then uses the results to enrich the primary dataset.
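To make the aggregate-then-enrich idea concrete, here is a minimal sketch in plain Python. It assumes a one-to-many relationship keyed on a hypothetical `customer_id` column and uses count, sum, and mean as example summaries. The actual features DataRobot derives are chosen by its Feature Discovery engine and are not reproduced here.

```python
from collections import defaultdict

# Hypothetical primary rows: one row per unit of analysis.
primary = [
    {"customer_id": 1, "target": 0},
    {"customer_id": 2, "target": 1},
    {"customer_id": 3, "target": 0},
]

# Hypothetical secondary rows: zero, one, or many rows per customer_id.
secondary = [
    {"customer_id": 1, "amount": 100.0},
    {"customer_id": 1, "amount": 300.0},
    {"customer_id": 2, "amount": 50.0},
]


def enrich(primary_rows, secondary_rows, key, value):
    """Aggregate secondary rows per key, then attach the summaries to each primary row."""
    grouped = defaultdict(list)
    for row in secondary_rows:
        grouped[row[key]].append(row[value])

    enriched_rows = []
    for row in primary_rows:
        values = grouped.get(row[key], [])
        new_row = dict(row)  # copy so the primary rows are left unchanged
        new_row[f"{value}_count"] = len(values)
        new_row[f"{value}_sum"] = sum(values)
        new_row[f"{value}_mean"] = sum(values) / len(values) if values else None
        enriched_rows.append(new_row)
    return enriched_rows


enriched = enrich(primary, secondary, key="customer_id", value="amount")
for row in enriched:
    print(row)
```

The primary dataset keeps one row per unit of analysis; only its columns grow. Customers with no secondary rows still appear, with a count of zero and no mean.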
Sample use case¶
The following sections use an example to illustrate how DataRobot automatically discovers new features from multiple datasets to predict whether a loan will default. In the primary dataset, CreditRisk - Loan Applications, the is-bad column is the project target. The CustID column relates the datasets to one another.
Two additional relational datasets, CreditRisk - Credit Inquiries and CreditRisk - Tradeline Accounts, are the secondary datasets used for Feature Discovery.
Once model building begins, DataRobot runs through EDA2, adding newly created features to the Data page. The Data page provides a variety of information about all the resulting project data, both new and old.
The following steps describe how to load datasets from the AI Catalog into the project.
To begin, make sure all the assets are in the catalog. Alternatively, you can use the drag-and-drop method to upload datasets. If you do so, all datasets that you upload are automatically registered to the AI Catalog.
From the AI Catalog, select the primary dataset and click Create project. Enter the target.
Click Add datasets to add one or more additional datasets to the project. A valid Feature Discovery project requires at least one secondary dataset.
On the Specify prediction point page of the Relationship editor, optionally Select a date feature to use as a prediction point. This date/time feature from the primary dataset serves as a reference date for feature derivation windows.
The step to specify a prediction point does not display if you have already specified a prediction point for the project.
For an in-app explanation of prediction points, expand Show Example.
Click Set up as prediction point for a time-aware Feature Discovery project or Continue without prediction point for a non time-aware project.
In the Add datasets page of the Relationship editor, select a data import method under Add Data From.
This example shows how to add a dataset from the AI Catalog.
From the AI Catalog, select the datasets you want to include by clicking Select. Use the search functionality to easily locate datasets for selection. When finished, click Add.
Click Continue to finalize your selection.
The secondary datasets you select on this page are immediately added to the configuration, so if you reload the page without clicking Continue, the data is not lost.
The Define Relationships page displays the datasets.
Hover over a dataset to view associated metadata.
Click the menu to view details or delete the dataset.
If you select Details, you can also access the profile, feature lists, relationships, version history, and comments associated with the dataset, as you do in the AI Catalog.
Best practice suggests continuing within this editor to define relationships. You can, however, click Continue to project to return to the Start screen.
The datasets display and you can see the number of relationships that have been defined.
At any time, you can click Define relationships to return to the Define Relationships page.
Once all datasets are loaded, the next step is to define relationships on the Define Relationships page. The primary dataset is on the canvas while any secondary sets are listed in the left pane.
To define relationships:
Click a secondary dataset to highlight it; notice the addition of a plus sign on the primary set.
Click the plus sign. DataRobot adds the selected secondary dataset to the canvas and opens a configuration editor for setting join conditions.
The following table describes the elements of the "Create new relationship" page:
|Element||Description|
|Secondary dataset for join||Sets the secondary dataset used in the join. Change via the dropdown to any added dataset. Changes are reflected in the canvas below.|
|Primary dataset for join||Sets the primary dataset used in the join.|
|Suggested join condition||Sets the join condition (feature) for the corresponding dataset (listed above the condition). DataRobot suggests up to five conditions, each of which is editable. Use the dropdown to select a new feature; use the trash icon to delete the join.|
|Add join condition||Provides a manual join configuration option.|
|Save or Save and configure time-aware||Saves the relationship configuration. Save is the option if there is no date feature or you did not set a prediction point. If you did set a prediction point from the primary dataset, the Save and configure time-aware button displays.|
|Canvas display controls||Zooms in or out, or resets the default display size.|
|Dataset menu options||Provides access to a variety of actions that can be enacted on a primary or secondary dataset.|
|Join edit launch||Opens the relationship editor, allowing you to define or modify the relationship between the datasets joined by the line you clicked.|
|Primary icon||Indicates, with a bullseye icon, that this is the primary dataset.|
|Tour launch||Opens a short tour that provides an overview of configuring Feature Discovery.|
|Continue to project||Returns to the Start screen where you can revise your time-aware settings, set advanced options, set a modeling mode, and start the modeling process.|
Once you've added all of your secondary datasets and selected your relationship configuration settings, click Save and configure time-aware (or Save for a non time-aware project).
If the project is not time-aware, the Start page displays. If the project is time-aware, the Time-aware feature engineering page displays.
If your project is time-aware, click Select time feature to choose a time index feature for the secondary dataset.
Configure Feature Derivation Windows (FDW). The FDW is a rolling window of past values used to generate features before the prediction point. It constrains the time history by defining how far back to look and how recent the values can be, for example, no further back than 30 days and no more recent than 2 days.
You can configure up to three FDWs for each dataset; each window must be unique. To add an FDW, click Add window.
Once set, the FDW is reflected in the dataset's tile on the canvas.
These time-aware settings ensure that the generated features are based only on records that occur before the prediction point. For more details, see Time-aware feature engineering.
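The window logic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DataRobot's implementation: the helper names are hypothetical, the window is treated as inclusive at its start and exclusive at its end, and the check that a window must start further back than it ends is an added sanity check rather than a documented rule.

```python
from datetime import datetime, timedelta


def in_derivation_window(record_time, prediction_point, start_days, end_days):
    """True if record_time is within [prediction_point - start_days, prediction_point - end_days).

    Boundary handling (inclusive start, exclusive end) is an assumption.
    """
    window_start = prediction_point - timedelta(days=start_days)
    window_end = prediction_point - timedelta(days=end_days)
    return window_start <= record_time < window_end


def validate_windows(windows, max_windows=3):
    """Check the stated limits: at most three windows per dataset, each unique."""
    if len(windows) > max_windows:
        raise ValueError(f"at most {max_windows} windows are allowed")
    if len(set(windows)) != len(windows):
        raise ValueError("each window must be unique")
    for start_days, end_days in windows:
        # Assumed sanity check: the window must start further back than it ends.
        if start_days <= end_days:
            raise ValueError("window start must be further back than window end")


# A window of "no further back than 30 days, no more recent than 2 days".
validate_windows([(30, 2)])

prediction_point = datetime(2024, 3, 31)
records = [
    datetime(2024, 2, 15),  # too old: before the window start (March 1)
    datetime(2024, 3, 10),  # inside the window
    datetime(2024, 3, 30),  # too recent: after the window end (March 29)
    datetime(2024, 4, 2),   # after the prediction point
]
kept = [r for r in records if in_derivation_window(r, prediction_point, 30, 2)]
print(kept)
```

Because the window ends before the prediction point, records dated on or after the prediction point can never contribute to a derived feature.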
An integration between DataRobot and Snowflake allows joint users to both execute data science projects in DataRobot and perform computations in Snowflake as a way to optimize workload performance. Feature Discovery training and prediction workflows will push down relational inner-joins, projection, and filter operations to the Snowflake platform (via SQL). By natively conducting joins in the Snowflake database, data is filtered into smaller datasets for transfer across the network before loading into DataRobot. The smaller datasets reduce project runtimes.
To enable integration with Snowflake, the following requirements must be met:
- A Snowflake data connection is set up.
- All secondary datasets are stored in Snowflake.
- All Snowflake sources are stored in the same warehouse.
- All datasets are configured as dynamic datasets in the AI Catalog.
- You have write permissions to one of the schemas in use or one PUBLIC schema of the database in use.
If the above requirements are met, the integration is established automatically, and the Snowflake icon and the label Snowflake mode enabled display in blue at the top of the Define Relationships page.
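The benefit of pushing joins, projection, and filters down to the database can be illustrated with a small, self-contained sketch. Here Python's built-in `sqlite3` module stands in for Snowflake, and the table names, column names, and 30-day filter are hypothetical. The SQL that DataRobot actually generates is not shown in this documentation, so treat the query below as an illustration of the idea only.

```python
import sqlite3

# An in-memory SQLite database stands in for the Snowflake warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE applications (CustID INTEGER, application_date TEXT);
    CREATE TABLE inquiries (CustID INTEGER, inquiry_date TEXT, amount REAL, notes TEXT);
    INSERT INTO applications VALUES (1, '2024-03-31'), (2, '2024-03-31');
    INSERT INTO inquiries VALUES
        (1, '2024-03-10', 100.0, 'a'),
        (1, '2023-01-01', 300.0, 'b'),
        (2, '2024-03-20', 50.0, 'c'),
        (9, '2024-03-15', 75.0, 'd');
""")

total_rows = conn.execute("SELECT COUNT(*) FROM inquiries").fetchone()[0]

# The join, projection, and filter all run inside the database,
# so only the matching rows and needed columns cross the network.
pushed_down = conn.execute("""
    SELECT i.CustID, i.inquiry_date, i.amount            -- projection: drop unused columns
    FROM inquiries AS i
    INNER JOIN applications AS a ON a.CustID = i.CustID  -- inner join: drop unmatched rows
    WHERE i.inquiry_date < a.application_date            -- filter: only records in range
      AND i.inquiry_date >= date(a.application_date, '-30 days')
    ORDER BY i.CustID, i.inquiry_date
""").fetchall()

print(f"rows in secondary table: {total_rows}")
print(f"rows transferred after pushdown: {len(pushed_down)}")
conn.close()
```

In this toy data, half of the secondary rows are discarded before transfer: one has no matching customer and one is too old.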
Set join conditions¶
When creating or modifying joins, feature types must match—supported types are numeric, categorical, and date.
If tables in your datasets are well-formed, DataRobot automatically detects compatible features and creates up to five "suggested" joins. You can modify the suggested join using the dropdowns associated with each join key.
You can also manually create join keys by clicking Add join condition. In the resulting dialog, select a join feature from each dataset from the feature dropdown.
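The type-matching rule and the cap of five suggestions can be sketched as follows. How DataRobot actually detects compatible features is not described here, so this sketch assumes the simplest heuristic: a suggestion is made when both datasets have a feature with the same name and the same supported type. The function names and the sample schemas are hypothetical.

```python
SUPPORTED_TYPES = {"numeric", "categorical", "date"}
MAX_SUGGESTIONS = 5


def suggest_joins(primary_schema, secondary_schema):
    """Suggest (primary_feature, secondary_feature) pairs.

    Assumed heuristic: same feature name and same supported type in both datasets.
    """
    suggestions = []
    for name, feature_type in primary_schema.items():
        if feature_type in SUPPORTED_TYPES and secondary_schema.get(name) == feature_type:
            suggestions.append((name, name))
    return suggestions[:MAX_SUGGESTIONS]


def validate_join(primary_schema, secondary_schema, primary_key, secondary_key):
    """Raise ValueError if a manually chosen join pair breaks the type rules."""
    primary_type = primary_schema[primary_key]
    secondary_type = secondary_schema[secondary_key]
    if primary_type not in SUPPORTED_TYPES or secondary_type not in SUPPORTED_TYPES:
        raise ValueError("join features must be numeric, categorical, or date")
    if primary_type != secondary_type:
        raise ValueError(f"type mismatch: {primary_type} vs {secondary_type}")


# Hypothetical schemas mapping feature name to feature type.
primary_schema = {"CustID": "numeric", "is-bad": "numeric", "date": "date", "notes": "text"}
secondary_schema = {"CustID": "numeric", "inquiry_date": "date", "notes": "text"}

suggestions = suggest_joins(primary_schema, secondary_schema)
print(suggestions)

# A manual join on differently named date features passes the type check.
validate_join(primary_schema, secondary_schema, "date", "inquiry_date")
```

The `notes` feature exists in both schemas but is not suggested, because its type is not one of the supported join types.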
Add more datasets¶
If your project has more than one secondary dataset, you can add more datasets after saving. From the Define Relationships page:
Select another dataset. Notice that now both the primary and any secondary datasets have a plus icon. Click one to define a relationship between the dataset in the left pane and a dataset on the canvas. DataRobot adds the dataset to the canvas.
Follow the same steps as with adding the previous secondary dataset:
- Select a method and add a join.
- Set up time-aware modeling, if applicable.
- Save the configuration.
Work with primary datasets¶
Be sure to save your configuration before using the menu options. Unsaved changes are lost when you leave a page.
Working from the canvas, you can select the menu option on the dataset tile. The primary dataset allows you to add a relationship or edit the prediction point:
- Add relation: Opens the Create new relationship page. Choose this option when you have no relationships configured yet. It is the equivalent of selecting the dataset from the list on the left and clicking the plus sign on the primary's canvas tile. Once the page opens, select a secondary dataset from the dropdown and it is added to the canvas.
- Edit prediction point: Lets you choose a different date feature to use as your prediction point.
Work with secondary datasets¶
When a secondary dataset has been selected and moved to the canvas, a menu option becomes available on its tile. The table below describes the options available from the menu:
|Option||Description|
|Add relation||Opens the relationship editor and allows you to select a dataset (from any available in the left pane) to join with.|
|Edit alias||Allows you to set an alias for the dataset. The string displays on the canvas as the secondary dataset name. The alias does not change the display in the left-pane dataset list or the relationship editor pages.|
|Configure dataset||Opens the dataset configuration editor, where you can set dataset details.|
|Configure time-awareness||Opens the time-aware feature engineering configuration dialog, where you can select a time index for the secondary dataset or confirm that the correct date/time feature is selected.|
|Details||Click to open the Info window for the dataset in the AI Catalog.|
|Delete||Deletes the dataset, and all its relationships, from the current relationship configuration. The dataset is still available to the configuration and listed in the left panel.|
Configure secondary datasets¶
Selecting Configure dataset from a secondary dataset menu opens the Dataset Editor.
From here you can:
Change the dataset alias. If not manually set, DataRobot auto-generates an alias based on the file name. Click in the box to modify the alias; the alias for the primary dataset cannot be modified.
Choose a snapshot policy, either Latest, Fixed, or Dynamic, to use for this project. By default, the selected snapshot policy will apply at prediction time.
Choose a feature list to apply against the corresponding dataset. Use this option to limit the size of the table by selecting relevant features. You can create new feature lists from the AI Catalog.
Relationship quality assessment¶
After configuring at least one secondary dataset, you can test the quality of those relationship configurations to learn of potential problems early in the creation process. The relationship quality assessment tool verifies join keys, dataset selection, and time-aware settings before EDA2 begins.
Click the Review configuration button to trigger the relationship quality assessment.
A progress indicator (loading spinner) displays on each dataset and on the Review configuration button, which is disabled while the assessment runs.
Once the assessment is complete, DataRobot marks all tested datasets. Those with identified issues display a yellow warning icon and those with no identified issues display a green tick.
Depending on the project type, DataRobot assesses the relationship's enrichment rate, window settings, and most recent data—each of which is described in the table below:
|Assessment||Description||Suggested fix||Project type|
|Enrichment rate||Quickly determines, as a percentage, how many rows in the secondary dataset map to rows in the primary table.||Review the dataset and relationship.||All|
|Window settings||Determines how many rows in the secondary dataset map to the primary dataset within the specified FDWs.||Expand the window settings to find more rows.||Time-aware|
|Most recent data||Compares the minimum and maximum time index of the secondary and primary datasets to determine if the secondary dataset is outdated.||Review the selected feature list and snapshot policy.||Time-aware|
Assessments are always updated for JDBC sources with dynamic snapshot policy.
DataRobot calculates enrichment rate using the following formula:
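The exact formula is not reproduced in this text, so the following sketch is an assumption rather than DataRobot's definition. It computes two plausible readings of "enrichment rate" on join keys: the share of secondary rows whose key appears in the primary dataset (the literal reading of the table above), and the share of primary rows that have at least one matching secondary row. The function names and sample keys are hypothetical.

```python
def secondary_match_rate(primary_keys, secondary_keys):
    """Percentage of secondary rows whose join key appears in the primary dataset."""
    if not secondary_keys:
        return 0.0
    primary_set = set(primary_keys)
    matched = sum(1 for key in secondary_keys if key in primary_set)
    return 100.0 * matched / len(secondary_keys)


def primary_coverage_rate(primary_keys, secondary_keys):
    """Percentage of primary rows with at least one matching secondary row."""
    if not primary_keys:
        return 0.0
    secondary_set = set(secondary_keys)
    covered = sum(1 for key in primary_keys if key in secondary_set)
    return 100.0 * covered / len(primary_keys)


# Hypothetical join key values (for example, CustID) from each dataset.
primary_keys = [1, 2, 3, 4]
secondary_keys = [1, 1, 2, 9]

match_rate = secondary_match_rate(primary_keys, secondary_keys)
coverage_rate = primary_coverage_rate(primary_keys, secondary_keys)
print(f"secondary rows that map to the primary dataset: {match_rate:.1f}%")
print(f"primary rows with at least one match: {coverage_rate:.1f}%")
```

Under either reading, a low percentage suggests that the join key or the dataset selection deserves a second look, which matches the suggested fix of reviewing the dataset and relationship.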
Select the warning icon to view the issues identified during the assessment, along with suggested fixes. The summary is displayed at the top of the window.
To open the detailed report, click the orange arrow on the right. DataRobot breaks down the assessment by category, providing additional information to diagnose the issue. If a secondary dataset has multiple FDWs, a detailed report is created for each one.
To resolve warnings, click the orange link displayed below each warning (Review dataset, Review relationship, or Review window settings). A pane appears at the top of the relationship editor, allowing you to modify relationship configurations.
After model building begins and EDA2 is done, you can view the most recent relationship quality assessment in the Data > Feature Discovery tab.
View dataset details¶
You can access dataset details directly from the relationship editor using one of the following methods:
On the dataset tile, hover over the line beneath the dataset name to display metadata for the dataset.
Click the menu icon on the top right of the dataset tile and select Details to open the Info page in the AI Catalog.
Set feature engineering controls¶
You have the option of configuring feature engineering controls prior to starting your project:
Click the settings gear on the Define Relationships page.
On the Feature Engineering tab, select the transformations you want DataRobot to try when deriving new features.
Hover over the options to learn about the transformations they represent.
Click Save changes.
Start the project¶
Once you are happy with the definition of the relationship(s), click Continue to project to return to the Start screen.
The Secondary Datasets section provides visual cues that give details about the secondary datasets.
|Visual cue||Indicates|
|Datasets with blue text||The dataset is in use and part of the project.|
|Datasets with white text||The dataset is loaded but not part of the relationship definition.|
|Linked datasets||The number of datasets linked with this dataset.|
|Number of datasets and relationships||The number of secondary datasets and how many have relationships defined.|
DataRobot conducts feature engineering as part of EDA2 and begins generating model blueprints.
As with any DataRobot project, you can share Feature Discovery projects (depending on your permissions). The assignable roles provide different levels of permission for the recipient. Unique to Feature Discovery projects, however, is the ability to share engineering graphs and datasets as well.
To share a project, click the share icon. For the recipient to interact with the project, they must have access to the additional assets. By default, assets are not shared. Check the box to enable sharing of relationships and datasets; otherwise, DataRobot displays a warning.
Note that in addition to the assigned role, the listing of project users also indicates whether project assets have been shared.