Feature Discovery
To deploy AI across the enterprise and make the best use of predictive models, you must be able to access relevant features. Often, your starting dataset does not contain the right set of features. Feature Discovery discovers and generates new features from multiple datasets, so you no longer need to perform manual feature engineering to consolidate those datasets into one.
See the associated considerations for important additional information.
Select topics from the following table to learn about the feature engineering workflow:
Topic | Description |
---|---|
End-to-end Feature Discovery | An end-to-end example that shows you how to enrich data using Feature Discovery. |
Feature Discovery projects | Create and configure projects with secondary datasets, including a simple use-case-based workflow overview. |
Snowflake integration | Set up an integration that allows joint users to both execute data science projects in DataRobot and perform computations in Snowflake. |
Feature Discovery settings | Configure advanced options for Feature Discovery projects, including feature engineering controls and feature reduction. |
Time-aware feature engineering | Configure time-aware feature engineering. |
Derived features | Introduction to the list of aggregations and the feature reduction process. |
Predictions | Score data with models created using secondary datasets. |
Feature considerations
When using Feature Discovery, consider the following:
- JDBC drivers must be compatible with Java 1.8 and later.
- For secondary datasets, only uploaded files and JDBC sources registered in the AI Catalog are supported.
- The following features are not supported in Feature Discovery projects:
  - Scoring Code
  - Time series
  - Challenger models
  - V1.0 prediction API
  - Portable Prediction Server (PPS)
  - Automated Retraining
  - Sliced insights
  - Clustering
- Maximum supported values:
  - 30 datasets per project. DataRobot counts each feature derivation window and secondary dataset as a "dataset."
  - The combined size of a project's primary and secondary datasets cannot exceed 100 GB. Individual dataset size limits are based on AI Catalog limits.
- If the primary dataset is larger than 40 MB, CV partitioning is disabled by default.
- Column names in Feature Discovery datasets cannot contain the following:
  - A trailing or leading single quote (e.g., `feature1'` or `'feature1`)
  - A trailing or leading space (e.g., `feature1<space>` or `<space>feature1`)
- When there is an error during project start, you cannot return to defining relationships. You must restart the configuration.
- There can be issues with the colors used in the visualization of linkages in the Feature Engineering relationship editor.
- You must allow the following IP addresses to connect to the DataRobot JDBC connector:
Host: https://app.datarobot.com | Host: https://app.eu.datarobot.com | Host: https://app.jp.datarobot.com |
---|---|---|
100.26.66.209 | 18.200.151.211 | 52.199.145.51 |
54.204.171.181 | 18.200.151.56 | 52.198.240.166 |
54.145.89.18 | 18.200.151.43 | 52.197.6.249 |
54.147.212.247 | 54.78.199.18 | |
18.235.157.68 | 54.78.189.139 | |
3.211.11.187 | 54.78.199.173 | |
52.1.228.155 | 18.200.127.104 | |
3.224.51.250 | 34.247.41.18 | |
44.208.234.185 | 99.80.243.135 | |
3.214.131.132 | 63.34.68.62 | |
3.89.169.252 | 34.246.241.45 | |
3.220.7.239 | 52.48.20.136 | |
52.44.188.255 | | |
3.217.246.191 | | |
Note
These IP addresses are reserved for DataRobot use only.
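The column-name restriction listed above can be checked before a dataset is uploaded. The following is a minimal sketch, not part of DataRobot's API: the function name `invalid_column_names` is hypothetical, and it only tests for a leading or trailing single quote or space, as described in the considerations.

```python
from typing import Iterable, List

# Characters that may not appear at the start or end of a column name,
# per the considerations above.
FORBIDDEN_EDGE_CHARS = ("'", " ")


def invalid_column_names(columns: Iterable[str]) -> List[str]:
    """Return the column names that start or end with a single quote or a space."""
    return [
        name
        for name in columns
        if name
        and (name[0] in FORBIDDEN_EDGE_CHARS or name[-1] in FORBIDDEN_EDGE_CHARS)
    ]


if __name__ == "__main__":
    columns = ["customer_id", "'amount", "region ", "it's_fine"]
    # A quote in the middle of a name is not flagged; only the edges are checked.
    print(invalid_column_names(columns))
```

Running the check on the column list of each primary and secondary dataset before registering it in the AI Catalog may help avoid a failed project start, after which the relationship configuration must be restarted.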
Batch prediction considerations
- Only DataRobot models are supported; no external or custom model support.
- Model package export is not supported for Feature Discovery models.
- You cannot replace a Feature Discovery model with a non-Feature Discovery model or vice versa.
- When a Feature Discovery model is replaced with another Feature Discovery model, the configuration used by the new model becomes the default configuration.
- Feature Discovery predictions will be slower than other DataRobot models because feature engineering is applied.
- When Feature Discovery generates features using secondary datasets, the hash values of all the feature values (`ROW_HASH`) are used to break any ties (when applicable). The hash value changes when applied to different datasets, so if you make predictions with another secondary configuration, you may receive different predictions.
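The effect of hash-based tie-breaking can be illustrated with a toy example. This is only a sketch under stated assumptions and does not reproduce DataRobot's implementation: `row_hash` and `latest_row` are hypothetical helpers, SHA-256 stands in for whatever hash `ROW_HASH` actually uses, and the "latest value" lookup is an assumed scenario in which two secondary rows share the same timestamp.

```python
import hashlib
from typing import Dict, List


def row_hash(row: Dict[str, object]) -> str:
    """Hash all feature values of a row (an illustrative stand-in for ROW_HASH)."""
    payload = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def latest_row(rows: List[Dict[str, object]], time_key: str) -> Dict[str, object]:
    """Pick the row with the largest timestamp; ties go to the smallest hash."""
    newest = max(row[time_key] for row in rows)
    tied = [row for row in rows if row[time_key] == newest]
    return min(tied, key=row_hash)


if __name__ == "__main__":
    rows = [
        {"ts": "2024-01-01", "amount": 10, "store": "A"},
        {"ts": "2024-01-02", "amount": 20, "store": "B"},
        {"ts": "2024-01-02", "amount": 30, "store": "C"},  # ties with the row above
    ]
    print(latest_row(rows, "ts"))
```

Because the hash depends on every value in the row, a secondary dataset with different values or columns produces different hashes, so a different tied row may win. That is consistent with the note above that predictions can differ between secondary configurations.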
Feature Discovery compatibility
The following table indicates which features are supported for Feature Discovery and describes any limitations.
Feature | Supported? | Limitations |
---|---|---|
Monotonicity | Yes | Limited to features from the primary dataset used to start the project. Note: Users can start the project without specifying constraints. They can then manually constrain models from the Leaderboard and the Repository on eligible blueprints using discovered/generated features. |
Pairwise interaction in GA2M models | Yes | Limited to features from the primary dataset used to start the project. |
Positive class assignment | Yes | |
Smart downsampling | Yes | |
Supervised feature reduction | Yes | Only applies if secondary datasets are provided. |
Search for interactions | Yes | Automatically enabled. Cannot be disabled if secondary datasets are provided. |
Only blueprints with Scoring Code support | No | |
Create blenders from top models | Yes | |
Include only SHAP-supported blueprints | Yes | |
Recommend and prepare a model for deployment | Yes | |
Challenger models in MLOps | No | |
Include blenders when recommending a model | Yes | |
Use accuracy-optimized metablueprint | Yes | These models are extremely slow. |
Upperbound running time | Yes | |
Weight | Yes | Weight feature must be in the primary dataset used to start the project. |
Offset | Yes | Offset feature must be in the primary dataset used to start the project. |
Exposure | Yes | Exposure feature must be in the primary dataset used to start the project. |
Random seed | Yes | |
Count of events | Yes | Count of events feature must be in the primary dataset used to start the project. |