

Feature Discovery

To deploy AI across the enterprise, predictive models need access to the most relevant features. Often, your starting dataset does not contain the right set of features. Feature Discovery discovers and generates new features from multiple datasets, eliminating the manual feature engineering otherwise required to consolidate those datasets into one.

See the associated considerations for important additional information.

Select topics from the following table to learn about the feature engineering workflow:

Topic Describes...
End-to-end Feature Discovery An end-to-end example that shows you how to enrich data using Feature Discovery.
Feature Discovery projects Create and configure projects with secondary datasets, including a simple use-case-based workflow overview.
Snowflake integration Set up an integration that allows joint users to both execute data science projects in DataRobot and perform computations in Snowflake.
Feature Discovery settings Configure advanced options for Feature Discovery projects, including feature engineering controls and feature reduction.
Time-aware feature engineering Configure time-aware feature engineering.
Derived features Introduction to the list of aggregations and the feature reduction process.
Predictions Score data with models created using secondary datasets.

Feature considerations

When using Feature Discovery, consider the following:

  • JDBC drivers must be compatible with Java 1.8 and later.

  • For secondary datasets, only uploaded files and JDBC sources registered in the AI Catalog are supported.

  • The following features are not supported in Feature Discovery projects:

    • Scoring Code
    • Time series
    • Challenger models
    • V1.0 prediction API
    • Portable prediction server (PPS)
    • Automated Retraining
    • Sliced insights
    • Clustering
  • Maximum supported values:

    • 30 datasets per project. DataRobot counts each feature derivation window and secondary dataset as a "dataset."
    • The combined size of a project's primary and secondary datasets cannot exceed 100 GB. Individual dataset size limits follow AI Catalog limits.
  • If the primary dataset is larger than 40 MB, CV partitioning is disabled by default.

  • Column names in Feature Discovery datasets cannot contain the following:

    • A trailing or leading single quote (e.g., feature1' or 'feature1)
    • A trailing or leading space (e.g., feature1<space> or <space>feature1)
  • If an error occurs during project start, you cannot return to defining relationships; you must restart the configuration.

  • Display issues can occur with the colors used to visualize linkages in the Feature Engineering relationship editor.

  • You must allow the following IP addresses to connect to the DataRobot JDBC connector:

Host: https://app.datarobot.com

100.26.66.209
54.204.171.181
54.145.89.18
54.147.212.247
18.235.157.68
3.211.11.187
52.1.228.155
3.224.51.250
44.208.234.185
3.214.131.132
3.89.169.252
3.220.7.239
52.44.188.255
3.217.246.191

Host: https://app.eu.datarobot.com

18.200.151.211
18.200.151.56
18.200.151.43
54.78.199.18
54.78.189.139
54.78.199.173
18.200.127.104
34.247.41.18
99.80.243.135
63.34.68.62
34.246.241.45
52.48.20.136

Note

These IP addresses are reserved for DataRobot use only.
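The column-name restrictions above (no leading or trailing single quotes or spaces) can be checked locally before registering a dataset. The helper below is an illustrative sketch for a pre-upload sanity check, not part of the DataRobot API:

```python
# Illustrative local check for the Feature Discovery column-name rules:
# names must not begin or end with a single quote or a space.
# This is not a DataRobot API call; internal spaces are allowed.

def invalid_column_names(columns):
    """Return the column names that violate the naming rules."""
    bad = []
    for name in columns:
        if name.startswith(("'", " ")) or name.endswith(("'", " ")):
            bad.append(name)
    return bad
```

For example, `invalid_column_names(["feature1", "'feature1", "feature1 "])` flags the last two names, while a name with only internal spaces passes.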

Batch prediction considerations

  • Only DataRobot models are supported; no external or custom model support.

  • Model package export is not supported for Feature Discovery models.

  • You cannot replace a Feature Discovery model with a non-Feature Discovery model or vice versa.

  • When a Feature Discovery model is replaced with another Feature Discovery model, the configuration used by the new model becomes the default configuration.

  • Feature Discovery predictions are slower than predictions from other DataRobot models because feature engineering is applied at prediction time.

  • When Feature Discovery generates features from secondary datasets, a hash of each row's feature values (ROW_HASH) is used to break ties where applicable. Hash values change when computed over different data, so making predictions with a different secondary dataset configuration may produce different predictions.
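DataRobot's internal ROW_HASH implementation is not documented here, but the effect can be sketched in principle: a deterministic hash of a row's values changes whenever any underlying value changes, so a tie broken by hash order can resolve differently under a different secondary dataset configuration. The helper below is purely illustrative; the function name and hashing scheme are assumptions, not DataRobot internals.

```python
import hashlib

def row_hash(values):
    """Deterministic hash of a row's feature values (illustrative only;
    DataRobot's internal ROW_HASH is not documented to work this way)."""
    joined = "\x1f".join(str(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# The same primary row joined against two different secondary
# configurations carries different feature values, so its hash differs
# and any hash-based tie-break can resolve differently.
config_a = row_hash(["cust_42", "2024-01-01", 10.0])
config_b = row_hash(["cust_42", "2024-01-01", 12.5])
assert config_a != config_b
```

This is why the same primary dataset scored against two secondary configurations is not guaranteed to produce identical predictions whenever tie-breaking is involved.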

Feature Discovery compatibility

The following table indicates which features are supported for Feature Discovery and describes any limitations.

Feature Supported? Limitations
Monotonicity Yes Limited to features from the primary dataset used to start the project. Note: Users can start the project without specifying constraints. They can then manually constrain models from the Leaderboard and the Repository on eligible blueprints using discovered/generated features.
Pairwise interaction in GA2M models Yes Limited to features from the primary dataset used to start the project.
Positive class assignment Yes
Smart downsampling Yes
Supervised feature reduction Yes Only applies if secondary datasets are provided.
Search for interactions Yes Automatically enabled. Cannot be disabled if secondary datasets are provided.
Only blueprints with Scoring Code support No
Create blenders from top models Yes
Include only SHAP-supported blueprints Yes
Recommend and prepare a model for deployment Yes
Challenger models in MLOps No
Include blenders when recommending a model Yes
Use accuracy-optimized metablueprint Yes These models are extremely slow.
Upperbound running time Yes
Weight Yes Weight feature must be in the primary dataset used to start the project.
Offset Yes Offset feature must be in the primary dataset used to start the project.
Exposure Yes Exposure feature must be in the primary dataset used to start the project.
Random seed Yes
Count of events Yes Count of events feature must be in the primary dataset used to start the project.

Updated January 26, 2024