
Feature Discovery

To deploy AI across the enterprise and make the best use of predictive models, you must be able to access relevant features. Often, your starting dataset does not contain the right set of features. Feature Discovery finds and generates new features from multiple datasets, so you no longer need to perform manual feature engineering to consolidate those datasets into one.

See the associated considerations for important additional information.

Select topics from the following table to learn about the feature engineering workflow:

Topic	Description
End-to-end Feature Discovery	An end-to-end example that shows you how to enrich data using Feature Discovery.
Feature Discovery projects	Create and configure projects with secondary datasets, including a simple use-case-based workflow overview.
Snowflake integration	Set up an integration that allows joint users to both execute data science projects in DataRobot and perform computations in Snowflake.
Feature Discovery settings	Configure advanced options for Feature Discovery projects, including feature engineering controls and feature reduction.
Time-aware feature engineering	Configure time-aware feature engineering.
Derived features	Introduction to the list of aggregations and the feature reduction process.
Predictions	Score data with models created using secondary datasets.

Feature considerations

When using Feature Discovery, consider the following:

JDBC drivers must be compatible with Java 1.8 or later.
For secondary datasets, only uploaded files and JDBC sources registered in the AI Catalog are supported.
The following features are not supported in Feature Discovery projects:
- Scoring Code
- Time series
- Challenger models
- V1.0 prediction API
- Portable Prediction Server (PPS)
- Automated Retraining
- Sliced insights
- Clustering
Maximum supported values:
- 30 datasets per project. DataRobot counts each feature derivation window and secondary dataset as a "dataset."
- The combined size of a project's primary and secondary datasets cannot exceed 100 GB. Individual dataset size limits are based on AI Catalog limits.
If the primary dataset is larger than 40 MB, cross-validation (CV) partitioning is disabled by default.
Column names in Feature Discovery datasets cannot contain the following:
- A trailing or leading single quote (e.g., feature1' or 'feature1)
- A trailing or leading space (e.g., feature1<space> or <space>feature1)
If an error occurs during project start, you cannot return to defining relationships. You must restart the configuration.
There can be issues with the colors used in the visualization of linkages in the Feature Engineering relationship editor.
You must allow the following IP addresses to connect to the DataRobot JDBC connector:

Host: https://app.datarobot.com	Host: https://app.eu.datarobot.com	Host: https://app.jp.datarobot.com
100.26.66.209	18.200.151.211	52.199.145.51
54.204.171.181	18.200.151.56	52.198.240.166
54.145.89.18	18.200.151.43	52.197.6.249
54.147.212.247	54.78.199.18
18.235.157.68	54.78.189.139
3.211.11.187	54.78.199.173
52.1.228.155	18.200.127.104
3.224.51.250	34.247.41.18
44.208.234.185	99.80.243.135
3.214.131.132	63.34.68.62
3.89.169.252	34.246.241.45
3.220.7.239	52.48.20.136
52.44.188.255
3.217.246.191

Note

These IP addresses are reserved for DataRobot use only.
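The size limits and column-name rules listed in the considerations above can be checked before you upload data. The following is a minimal sketch, not part of any DataRobot API: the function names, the byte interpretation of "GB" and "MB" (decimal units), and the `counted_datasets` argument (the count you arrive at after applying DataRobot's counting rule for derivation windows and secondary datasets) are all assumptions made for illustration.

```python
# Hypothetical pre-flight check based on the limits listed above.
# Assumption: 1 GB = 10^9 bytes and 1 MB = 10^6 bytes; the source does not say.
MAX_COUNTED_DATASETS = 30
MAX_COMBINED_BYTES = 100 * 1000**3
CV_DEFAULT_OFF_ABOVE_BYTES = 40 * 1000**2


def invalid_column_names(columns):
    """Return names that start or end with a single quote or a space."""
    bad = []
    for name in columns:
        if name.startswith(("'", " ")) or name.endswith(("'", " ")):
            bad.append(name)
    return bad


def preflight(primary_bytes, secondary_bytes, counted_datasets, columns):
    """Return (problems, cv_disabled_by_default) for a planned project.

    counted_datasets is supplied by the caller, who applies the
    "each feature derivation window and secondary dataset" counting rule.
    """
    problems = []
    if counted_datasets > MAX_COUNTED_DATASETS:
        problems.append(f"too many datasets: {counted_datasets} > {MAX_COUNTED_DATASETS}")
    total = primary_bytes + sum(secondary_bytes)
    if total > MAX_COMBINED_BYTES:
        problems.append(f"combined size {total} bytes exceeds {MAX_COMBINED_BYTES}")
    for name in invalid_column_names(columns):
        problems.append(f"invalid column name: {name!r}")
    cv_disabled_by_default = primary_bytes > CV_DEFAULT_OFF_ABOVE_BYTES
    return problems, cv_disabled_by_default
```

An empty `problems` list means none of the checked limits is violated. The check does not cover the individual dataset size limits, which depend on AI Catalog limits.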

Batch prediction considerations

Only DataRobot models are supported. External and custom models are not supported.
Model package export is not supported for Feature Discovery models.
You cannot replace a Feature Discovery model with a non-Feature Discovery model or vice versa.
When a Feature Discovery model is replaced with another Feature Discovery model, the configuration used by the new model becomes the default configuration.
Feature Discovery predictions are slower than predictions from other DataRobot models because feature engineering is applied.
When Feature Discovery generates features using secondary datasets, a hash of all the feature values in a row (ROW_HASH) is used to break any ties (when applicable). The hash value changes when it is computed on different datasets, so if you make predictions with another secondary dataset configuration, you may receive different predictions.
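To illustrate why hash-based tie-breaking can change results across secondary configurations, here is a minimal sketch. It is not DataRobot's implementation: the source does not describe the hash function or the aggregation involved, so `row_hash` (SHA-256 over all values) and `latest_record` (a "most recent row" lookup) are hypothetical stand-ins.

```python
import hashlib


def row_hash(row):
    """Hash every value in the row (hypothetical stand-in for ROW_HASH)."""
    joined = "\x1f".join(str(row[key]) for key in sorted(row))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def latest_record(rows, time_key="ts"):
    """Pick the most recent row; rows tied on time_key are ordered by hash."""
    return max(rows, key=lambda r: (r[time_key], row_hash(r)))


# Two rows share a timestamp, so the hash decides which one wins.
a = {"id": 1, "ts": "2024-01-01", "amount": 10}
b = {"id": 1, "ts": "2024-01-01", "amount": 20}
winner = latest_record([a, b])

# Changing any value changes the row's hash, so the same tie may
# resolve differently against another dataset.
changed = {**a, "amount": 11}
print(row_hash(a) != row_hash(changed))
```

The tie-break is deterministic for a fixed dataset, which keeps results reproducible. Because the hash covers every value in the row, a dataset with different values can order tied rows differently, and features derived from the winning row may then differ.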