December 13, 2021
Release v7.3 provides updated UI string translations for the following languages:
In the spotlight...¶
The following features are some of the highlights of Release 7.3:
- Composable ML adds project linking and bulk training
User interface enhancements¶
New XEMP Prediction Explanation interface¶
With this release, the XEMP Prediction Explanations visualization has been redesigned to provide cleaner, clearer at-a-glance information about why a model has made a particular prediction. The functionality offers the same insights with an easier, more intuitive interface.
DataRobot Pipelines now GA¶
DataRobot Pipelines enable data science and engineering teams to build and run machine learning data flows. Teams start by collecting data from various sources, then clean and combine it. They also standardize values and perform other data preparation operations to build a dataset at the unit of analysis.
To make repeatable data extraction and preparation easier, teams often build a data pipeline (a set of connected data processing steps) so that they can prepare data for training models, making predictions, and other relevant use cases.
With DataRobot Pipelines, you connect to data sources of varied formats and transform data to build and orchestrate your machine learning data flows. Currently, Pipelines contain input (AI Catalog Import and CSV Reader), transformation (Spark SQL), and output (AI Catalog Export and CSV Writer) modules. Once built, you can run your pipelines interactively or you can schedule batch runs.
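As a rough illustration of the input, transformation, and output stages described above, the following sketch reads CSV text, aggregates it with a SQL query, and writes CSV text back out. It is not the DataRobot Pipelines API: SQLite stands in for Spark SQL, and the `run_pipeline` function, the `purchases` table, and the sample data are all hypothetical names chosen for the example.

```python
import csv
import io
import sqlite3

# Hypothetical input data; a real pipeline would receive this from an input module.
RAW_CSV = """customer_id,amount
1,10.5
2,4.0
1,2.5
"""


def run_pipeline(raw_csv: str) -> str:
    """Read CSV text, aggregate it with SQL, and return the result as CSV text."""
    # Input step: parse rows from the CSV text.
    rows = list(csv.DictReader(io.StringIO(raw_csv)))

    # Transformation step: SQL over an in-memory table.
    # Assumption: SQLite is used here only as a stand-in for Spark SQL.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO purchases VALUES (?, ?)",
        [(int(r["customer_id"]), float(r["amount"])) for r in rows],
    )
    result = conn.execute(
        "SELECT customer_id, SUM(amount) AS total "
        "FROM purchases GROUP BY customer_id ORDER BY customer_id"
    ).fetchall()
    conn.close()

    # Output step: write the aggregated rows back out as CSV text.
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["customer_id", "total"])
    writer.writerows(result)
    return out.getvalue()


if __name__ == "__main__":
    print(run_pipeline(RAW_CSV))
```

Keeping each stage separate mirrors the module structure of a pipeline, so a single stage can be swapped out without touching the others.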
For details, see the DataRobot Pipelines documentation.
Feature Discovery features¶
Feature Discovery supports multiple feature derivation windows¶
In Automated Feature Discovery, you can now configure up to three feature derivation windows (FDW) per dataset. To define additional windows, open the Time-aware feature engineering editor and click Add window. Note that each FDW must be unique.
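The two constraints above (at most three windows per dataset, each unique) can be sketched as a small validation check. This is only an illustration and not DataRobot's implementation; the `Window` tuple layout and the `validate_windows` function are hypothetical.

```python
from typing import List, Tuple

# Hypothetical representation of a feature derivation window:
# (start_offset, end_offset, time_unit), for example (-30, 0, "DAY").
Window = Tuple[int, int, str]

MAX_WINDOWS_PER_DATASET = 3  # limit stated in the release note


def validate_windows(windows: List[Window]) -> List[str]:
    """Return a list of problems found in a dataset's FDW configuration."""
    problems = []
    if len(windows) > MAX_WINDOWS_PER_DATASET:
        problems.append(
            f"too many windows: {len(windows)} > {MAX_WINDOWS_PER_DATASET}"
        )
    if len(set(windows)) != len(windows):
        problems.append("duplicate windows: each FDW must be unique")
    return problems


if __name__ == "__main__":
    print(validate_windows([(-30, 0, "DAY"), (-7, 0, "DAY"), (-30, 0, "DAY")]))
```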
For details, see Define Relationships.
Feature Discovery relationship quality assessment¶
Feature Discovery introduces a tool to automatically assess the quality of a relationship configuration—warning the user of potential problems—early in the creation process. The relationship quality assessment tool verifies join keys, dataset selection, and time-aware settings before EDA2 begins.
Click the Review configuration button to trigger the relationship quality assessment. A progress indicator (loading spinner) displays on each dataset and on the Review configuration button, which is disabled while the assessment is running.
Once the assessment is complete, DataRobot marks all tested datasets. Those with identified issues display a yellow warning icon and those with no identified issues display a green tick. Select the dataset to view a summary of the issues with suggested potential fixes.
To resolve warnings, click the orange link displayed below each warning—Review dataset, Review relationship, or Review window settings—and a pane appears at the top of the relationship editor allowing you to modify relationship configurations. After addressing the warnings, click Review configuration to reassess the relationships.
For details, see the Relationship quality assessment documentation.
Feature Discovery improvements¶
Release 7.3 brings the following improvements to the Feature Discovery UI:
- In the Relationship Editor, if the primary dataset is also used as a secondary dataset, the target no longer appears as a suggested join key.
- Changing a secondary dataset configuration no longer causes all dataset names to reload.
- Individual dataset import sizes cannot exceed 11GB.
- The default snapshot policy for all snapshotted datasets, including JDBC datasets, is Latest.
- You can now click an FDW displayed on your dataset to open the FDW editor.
Composable ML adds project linking and bulk training, general improvements¶
Composable ML provides a full-flexibility approach to model building, allowing you to direct your data science and subject matter expertise to the models you build. With Composable ML, you build blueprints that best suit your needs using built-in tasks and custom Python/R code. Then, use your custom blueprint together with other DataRobot capabilities (MLOps, for example) to boost productivity.
In addition to the feature preview capabilities available earlier, release 7.3 adds these important improvements:
Project linking: Because some blueprints are meant to be used only with a specific project (perhaps they incorporate a step that calls for specific features, for example), DataRobot applies automated project linking. If you then attempt to apply the blueprint to a different project, DataRobot provides a warning that the required columns do not exist in the dataset.
Bulk training: You can now train user blueprints in bulk for a specific project, filtered based on compatibility with the selected blueprints. (Note that if selected blueprints don't have at least one common target type, DataRobot prevents bulk training.) From the AI Catalog Blueprints tab, you can sort blueprints by target type (binary, regression, multiclass, and unsupervised) for easier selection.
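The compatibility rule in the note above (bulk training is prevented unless the selected blueprints share at least one target type) amounts to a set intersection. The sketch below is illustrative only; the function names and the dictionary of blueprint names to target types are hypothetical and do not reflect the DataRobot API.

```python
from typing import Dict, Set


def common_target_types(selected: Dict[str, Set[str]]) -> Set[str]:
    """Return the target types shared by every selected blueprint."""
    if not selected:
        return set()
    return set.intersection(*selected.values())


def can_bulk_train(selected: Dict[str, Set[str]]) -> bool:
    """Bulk training needs at least one target type common to all blueprints."""
    return bool(common_target_types(selected))


if __name__ == "__main__":
    # Hypothetical blueprint names mapped to the target types they support.
    selection = {
        "bp_text": {"binary", "multiclass"},
        "bp_numeric": {"binary", "regression"},
    }
    print(common_target_types(selection), can_bulk_train(selection))
```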
The feature is generally available for Managed AI Cloud users and private preview for on-premise users (contact your DataRobot representative for enablement information).
More information for Managed AI Cloud users.
Word Cloud support for all linear models¶
Previously only available for a single model and mode type, Word Cloud now supports a variety of binary classification, multiclass, and regression models. Additionally, Word Cloud is now available for multimodal datasets (i.e., datasets that mix images, text, categorical, etc.), displaying a word cloud for all text from the data.
For details, see the Word Cloud documentation.
Clustering, an application of unsupervised learning, lets you explore your data by grouping and identifying natural segments. Use clustering to explore clusters generated from many types of data—numeric, categorical, text, image, and geospatial data—independently or combined. In clustering mode, DataRobot captures a latent behavior that's not explicitly captured by a column in the dataset.
To generate clusters, run in unsupervised Clusters mode.
To investigate the clusters generated during modeling, use the Cluster Insights visualization to understand, name, and explain each cluster in a dataset.
For details, see the Clustering documentation.
External predictions now GA¶
Released as a public preview feature in v7.2, the External Predictions capability allows you to bring external model(s) into the DataRobot AutoML environment for comparison against DataRobot models. Simply add external model predictions as a new column in your training dataset and identify the predictions and partition column. When modeling completes, the external model is available on the Leaderboard. From there you can compare it against DataRobot models, investigate further using select DataRobot visualizations, and (for binary classification projects) explore bias testing.
Additionally, a new public preview enhancement is available for the feature, providing support for multiple (up to 25) prediction columns, with each mapping into a separate "external model."
Feature Effects for multiclass projects¶
With this release, the Feature Effects visualization is now available for multiclass projects. In addition, using the Select Class dropdown, you can view partial dependence, predicted, and actual values for each class of the target value. By default, DataRobot calculates effects for the top 10 impact-ranked features, but the new feature provides an option to calculate, individually, for all features.
For details, see Feature Effects.
Configurable sample size for SHAP Feature Impact¶
With this release, you can now configure the sample size used for computing Feature Impact in SHAP-based projects. Previously this capability was only available for permutation Feature Impact. Changing the sample size can help, for example, to compute SHAP Feature Impact quickly with nearly the same accuracy.
For details, see the Feature Impact documentation.
Unlimited multiclass builds multiclass classifiers for targets with any number of classes¶
Availability of unlimited classes in multiclass projects is dependent on your DataRobot package. If it is not enabled for your organization, the class limit is set to 100. Contact your DataRobot representative to increase this limit.
This release extends the multiclass project types, adding an unlimited multiclass option. For multiclass projects with more than 1000 classes, DataRobot, by default, will keep the top 999 most frequent classes and aggregate the remainder into a single "other" bucket. Or, you can configure the aggregation parameters to ensure all classes necessary to your project are represented. Additionally, multiclass visualizations are adjusted to suit the larger class display. With unlimited multiclass, you no longer need to prepare data to suit a class limit and maintain several models. You can now deploy a single model to serve predictions against any number of classes.
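The default aggregation described above can be sketched in a few lines: when the number of distinct classes exceeds a threshold, keep the most frequent ones and relabel the rest as "other". This is an illustration of the idea, not DataRobot's implementation; the function name and parameters are hypothetical, with defaults taken from the numbers in the release note.

```python
from collections import Counter
from typing import List


def aggregate_rare_classes(
    labels: List[str],
    threshold: int = 1000,  # aggregate only when there are more classes than this
    keep: int = 999,        # number of most frequent classes to keep
    other_label: str = "other",
) -> List[str]:
    """Relabel all but the `keep` most frequent classes as `other_label`."""
    counts = Counter(labels)
    if len(counts) <= threshold:
        return list(labels)
    kept = {cls for cls, _ in counts.most_common(keep)}
    return [label if label in kept else other_label for label in labels]


if __name__ == "__main__":
    # Small thresholds so the effect is visible on a toy target column.
    target = ["a", "a", "a", "b", "b", "c", "d"]
    print(aggregate_rare_classes(target, threshold=3, keep=2))
```

Classes with tied counts are kept in first-seen order here, which is a property of `Counter.most_common` and not something the release note specifies.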
For details, see the multiclass documentation.
Multilabel modeling adds pairwise matrix management¶
Availability of multilabel modeling is dependent on your DataRobot package. If it is not enabled for your organization, contact your DataRobot representative for more information.
Multilabel modeling (modeling when each row is associated with one, several, or zero labels) is now generally available. In addition, capabilities have been added that allow you to more easily control the pairwise matrix. The matrix, which shows pairwise statistics for pairs of labels and the occurrence percentage of each label in the dataset, now uses a thumbnail matrix to more easily set the display of the main matrix. You can select an area from the thumbnail or manually set rows and/or columns, ensuring the main matrix focuses on labels of interest.
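To make the matrix contents more concrete, the sketch below computes the occurrence percentage of each label and, as one example of a pairwise statistic, how often each pair of labels appears on the same row. The release note does not say which pairwise statistics DataRobot reports, so the co-occurrence count and the `label_stats` function are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def label_stats(
    rows: List[List[str]],
) -> Tuple[Dict[str, float], Dict[Tuple[str, str], int]]:
    """Return per-label occurrence percentages and pairwise co-occurrence counts."""
    occurrence: Counter = Counter()
    pairs: Counter = Counter()
    for labels in rows:
        unique = sorted(set(labels))  # a row may carry zero, one, or several labels
        occurrence.update(unique)
        pairs.update(combinations(unique, 2))
    total = len(rows)
    percent = {
        label: 100.0 * count / total for label, count in occurrence.items()
    } if total else {}
    return percent, dict(pairs)


if __name__ == "__main__":
    data = [["cat", "dog"], ["cat"], [], ["dog", "cat", "bird"]]
    percent, pairs = label_stats(data)
    print(percent)
    print(pairs)
```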
To ensure that data is valid, the data quality assessment now checks data against the requirements for multicategorical features. A log provides more detailed error information.
For details, see the multilabel documentation.
The following is a summary of API new features and enhancements. Reference the API Documentation home for more information on each client.
DataRobot highly recommends updating to the latest API client for Python and R.
The following new functionality has been added for API release v2.27.0.
- Retrieve and restore discarded features for time series projects.
Compute and retrieve Feature Effects for multiclass models¶
- For non-datetime partitioned models:
- For datetime partitioned models:
Custom models conversion functionality¶
featureFit retrieve routes now support returning individual_conditional_expectation (ICE) Plots; a new query parameter, include_ice_plots, controls this functionality. To access this feature, enable the feature flag Enable ICE Plots on Feature Fit/Feature Effects.
This includes the following routes:
There are new routes to initialize compliance documentation pre-processing, which is required to generate compliance documentation for custom models:
Create compliance documentation pre-processing initialization:
Check compliance documentation pre-processing initialization:
There are now new routes that support multilabel classification project types:
Retrieve multilabel pairwise statistics:
Retrieve multilabel histograms:
Retrieve multilabel labelwise ROC:
Retrieve multilabel labelwise Lift charts:
Retrieve manual label selections for multilabel pairwise statistics:
Save manual label selections for multilabel pairwise statistics:
Update a manual label selection for multilabel pairwise statistics:
Delete a manual label selection for multilabel pairwise statistics:
Public preview features¶
Connect to Snowflake using external OAuth¶
Snowflake users can now set up a Snowflake data connection in DataRobot using an external identity provider (IdP)—either Okta or Azure Active Directory—for user authentication through OAuth single sign-on (SSO).
For details, see External OAuth for Snowflake.
Fast registration in the AI Catalog¶
You can now quickly register large datasets in the AI Catalog by specifying the first N rows to be used for registration instead of the full dataset—giving you faster access to data to use for testing and Feature Discovery.
In the AI Catalog, click Add to catalog and select your data source. Fast registration is only available when adding a dataset from a new data connection, an existing data connection, or a URL. Enter information for the data source and select a snapshot policy:
- For a snapshot dataset, DataRobot ingests the first N rows, where N is the number of rows you specify. Subsequent consumption of the data, such as creating a project with it, uses this N-row dataset.
- For a dynamic dataset, DataRobot uses the first N rows to compute EDA1. Subsequent consumption of the data, however, always uses the full dataset.
For fast registration, select the partial data upload option and specify the number of rows to ingest.
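The "first N rows" idea can be illustrated with a short sketch that reads only the header and the first N data rows of a CSV source and leaves the rest untouched. This is a generic example, not how the AI Catalog ingests data; the `first_n_rows` function is a hypothetical name.

```python
import csv
import io
from itertools import islice
from typing import List, Optional, Tuple


def first_n_rows(
    csv_text: str, n: int
) -> Tuple[Optional[List[str]], List[List[str]]]:
    """Return the header and at most the first `n` data rows of CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)     # None when the source is empty
    rows = list(islice(reader, n))  # stops reading after n rows
    return header, rows


if __name__ == "__main__":
    sample = "id,value\n1,a\n2,b\n3,c\n"
    print(first_n_rows(sample, 2))
```

Because `islice` stops consuming the reader after N rows, the cost of the partial read does not grow with the size of the full dataset.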
For details, see AI Catalog fast registration.
Note the following to better plan for later migration to new releases.
Local folder option for custom models will be deprecated¶
As of release v8.0 (March 14, 2022 for Cloud users), the ability to use the “Local Folder” option when adding a model via the Deployment inventory will be deprecated. For this release, while it is still available, the preferred method is to use the Custom Model Workshop. With v8.0, only the workshop option will be available (and will be linked to from the inventory page).
API deprecation notices¶
Note the following to better plan for later migration to new releases.
Get discarded features information:
Restore a list of discarded features:
Customer-reported fixed issues¶
The following issues have been fixed since release 7.2.6.
- SAFER-4115: Fixes an issue where BigQuery OAuth credentials would not work with Feature Discovery projects.