March 15, 2021
See these important deprecation announcements for information about changes to DataRobot's support for older, expiring functionality.
Release v7.0.0 also provides updated UI string translations for supported languages.
In the spotlight...¶
The following features are some of the highlights of Release 7.0:
Now GA: Bias detection and analysis tools¶
Bias and Fairness testing, now publicly available, provides methods to calculate fairness for a binary classification model and to identify any biases in the model’s predictive behavior.
Before model building, use Advanced Options > Bias and Fairness to define protected features and choose the fairness metric appropriate for your use case. A Help me choose questionnaire prompts DataRobot to recommend a metric. Once models are built, Bias and Fairness insights help identify bias in a model and visualize the results of root-cause analysis, showing why the model is learning bias from the training data and where that bias originates.
Per-Class Bias uses the fairness threshold and fairness score of each class to determine if certain classes are experiencing bias in the model’s predictive behavior.
Cross-Class Data Disparity performs root-cause analysis of the model’s bias for the selected classes. The Data Disparity vs Feature Importance chart identifies which features impact bias most; the Feature details chart reports where bias exists within the feature.
Cross-Class Accuracy helps to understand how the model is performing and its behavior on a given protected feature/class segment.
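The per-class check can be sketched in a few lines. This is an illustrative calculation only, not DataRobot's implementation; the proportional-parity style metric, the 0.8 threshold, and the function name are all assumptions made for the example:

```python
# Illustrative per-class fairness sketch (not DataRobot's implementation).
# Each class's rate of favorable predictions is divided by the most favored
# class's rate; classes whose fairness score falls below the threshold are
# flagged as potentially experiencing bias.

def per_class_fairness(predictions, threshold=0.8):
    """predictions: dict mapping class -> list of 0/1 favorable outcomes."""
    rates = {cls: sum(outcomes) / len(outcomes)
             for cls, outcomes in predictions.items()}
    best = max(rates.values())
    scores = {cls: rate / best for cls, rate in rates.items()}
    flagged = [cls for cls, score in scores.items() if score < threshold]
    return scores, flagged

scores, flagged = per_class_fairness({
    "group_a": [1, 1, 1, 0],   # 75% favorable outcomes
    "group_b": [1, 0, 0, 0],   # 25% favorable outcomes
})
```

Here `group_b` receives favorable predictions at one third the rate of `group_a`, so it falls below the threshold and is flagged.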
Now GA: Accuracy-boosting train-time image augmentation¶
Train-time image augmentation, a feature available for Visual AI projects, boosts accuracy on image datasets, especially those with few rows. More data usually means better accuracy and better generalization, but you often lack the resources (time, money, image availability, labeling expertise, etc.) to easily obtain it. With image augmentation, you can create new image data from existing images by applying transformations.
You can create image transformations prior to model-building via Advanced options. Or, after model building completes, you can continue to tune the image dataset from the Leaderboard's Evaluate > Advanced Tuning tab. A new "Image Augmentation" task will appear in image blueprints. Improvements to augmentation, based on Beta feedback, include support for multimodal projects, an increase in the size of augmentation that DataRobot can perform, and an improved UI for previewing augmentation strategies. Also, post-modeling tuning and new augmentation list creation has moved to Advanced Tuning.
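The core idea, deriving new training examples by transforming existing images, can be sketched in miniature. This is a conceptual illustration only: a real implementation operates on pixel tensors and offers many more transformations, while here an "image" is just a 2D list of pixel values:

```python
# Minimal augmentation sketch: create new training "images" from an existing
# one by applying simple geometric transformations.

def flip_horizontal(img):
    # Mirror each row left-to-right.
    return [row[::-1] for row in img]

def rotate_90(img):
    # Rotate clockwise: reverse the rows, then transpose.
    return [list(row) for row in zip(*img[::-1])]

original = [[1, 2],
            [3, 4]]
augmented = [original, flip_horizontal(original), rotate_90(original)]
```

One source image yields three training examples; on small datasets this kind of multiplication is what drives the accuracy gains described above.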
New features and enhancements¶
Feature Discovery enhancements¶
See details of Feature Discovery enhancements below:
- Increased blueprint support of summarized categorical features
- Summarized categorical insights now filter stop words
- Beta: Feature Discovery now available for unsupervised projects
- Beta: Feature Discovery deployments support governance workflow to manage secondary datasets
- Beta: Support for Spark SQL queries in dynamic datasets now available in Feature Discovery
Other new features¶
See details of other new features below:
- Prediction threshold gets a UX upgrade
- Access additional Scoring Code models
- Developer Tools page now provides access to R and Python clients
- Multiclass Feature Impact now supports custom sample sizes
- Beta: Multilabel classification capabilities expand classification options
- Beta: New Tiny BERT pretrained featurizer implementation extends NLP
- Beta: Scoring Code support for Keras models
New Feature Discovery features¶
Increased blueprint support of summarized categorical features increases accuracy and Leaderboard diversity¶
The summarized categorical variable type is for features that host a collection of categories (for example, the count of a product by category or department). If your original dataset does not have features of this type, DataRobot creates them (from secondary datasets) as part of the feature discovery process. With this release, DataRobot adds support for this feature type to a wider selection of blueprints, resulting in a greater number of models being run during Autopilot. This addition will be particularly impactful in Feature Discovery projects with secondary datasets.
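How such a feature is derived can be sketched as a simple aggregation. This is a hypothetical illustration of the concept (the function and column names are invented for the example), not DataRobot's feature discovery code:

```python
# Sketch: derive a summarized categorical feature from a secondary dataset by
# counting, for each primary-table key, the occurrences of each category.
from collections import Counter

def summarize_categories(secondary_rows, key, category):
    """secondary_rows: list of dicts; returns {key_value: {category: count}}."""
    summary = {}
    for row in secondary_rows:
        counts = summary.setdefault(row[key], Counter())
        counts[row[category]] += 1
    return {k: dict(c) for k, c in summary.items()}

purchases = [
    {"customer_id": 1, "department": "grocery"},
    {"customer_id": 1, "department": "grocery"},
    {"customer_id": 1, "department": "electronics"},
    {"customer_id": 2, "department": "toys"},
]
feature = summarize_categories(purchases, "customer_id", "department")
```

Each primary row ends up with a collection of category counts, which is exactly the shape the summarized categorical type describes.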
Summarized categorical insights now filter stop words¶
With this release, insights for summarized categorical features now filter out stop words on demand (Category Cloud) and by default (Histogram) for single-token text. Removing stop words (commonly used terms that carry little information) improves interpretability when those words are not informative to the model, letting users focus on the meaningful terms to better understand their data.
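The filtering step itself is straightforward. This simplified sketch uses a stand-in stop-word list; the actual list and mechanics used by the insights are not specified here:

```python
# Simplified stop-word filtering sketch: drop common, uninformative terms
# from a term-frequency view before displaying it.

STOP_WORDS = {"the", "a", "an", "and", "of", "to"}  # stand-in list

def filter_stop_words(term_counts):
    return {term: n for term, n in term_counts.items()
            if term.lower() not in STOP_WORDS}

counts = {"the": 120, "refund": 14, "and": 98, "shipping": 9}
filtered = filter_stop_words(counts)
```

High-frequency but uninformative tokens such as "the" and "and" disappear, so the remaining terms reflect what actually distinguishes the data.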
Beta: Feature Discovery now available for unsupervised projects¶
Previously, Feature Discovery did not support unsupervised learning projects. While the option was visible at project start when "No Target" was chosen, the UI returned an error message if you tried to configure Feature Discovery settings. Now available as a beta feature, you can set unsupervised mode, add secondary datasets, define relationships, and start a project. DataRobot will generate secondary features as in a supervised project, while eliminating supervised feature reduction (which requires a target).
Beta: Feature Discovery deployments support governance workflow to manage secondary datasets (MLOps required)¶
With this release, you can manage updates to secondary datasets in Feature Discovery deployments using the governance workflow. After an admin sets up the “Secondary dataset configuration changed” approval policy trigger in User Settings > Approval Policies, any changes to a secondary dataset will prompt a change request that must go through an approval process. The creator of the change request can view its status under History in Deployments > Overview, and reviewers will see a notification requesting that they review pending changes.
Beta: Support for Spark SQL queries in dynamic datasets now available in Feature Discovery secondary datasets¶
DataRobot offers the ability to enrich, transform, shape, and blend together snapshotted (static) datasets using Spark SQL queries from within the AI Catalog. This new functionality adds support for dynamic Spark SQL in secondary datasets for Feature Discovery projects. When enabled as a beta feature ("Enable Feature Discovery Support of Dynamic Spark SQL"), this new functionality increases flexibility in performing basic data prep. Authentication requirements remain the same.
Other new features¶
Prediction threshold gets a UX upgrade¶
With this release, DataRobot has upgraded the user experience for setting prediction thresholds on the Leaderboard. First, upgrades to the components on the ROC Curve, Profit Curve, Make Predictions, and Deploy tabs make assigning or selecting a suggested prediction threshold easier. Next, there is now a convenient one-click copy between the display threshold and the prediction threshold on the ROC Curve and Profit Curve tabs. Finally, the selected prediction threshold is now synced across all tabs in a model and for model downloads (such as a model package (.mlpkg) file).
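To see why syncing a single threshold matters, recall what the threshold does: it is the one cutoff that converts predicted probabilities into class labels everywhere the model is evaluated or used. A minimal sketch (not DataRobot code):

```python
# The prediction threshold is the cutoff that turns predicted probabilities
# into binary class labels; changing it changes which rows are positive.

def apply_threshold(probabilities, threshold):
    return [1 if p >= threshold else 0 for p in probabilities]

probs = [0.91, 0.48, 0.52, 0.10]
labels_default = apply_threshold(probs, 0.5)
labels_tuned = apply_threshold(probs, 0.6)
```

With the default 0.5 cutoff the third row is positive; with a tuned 0.6 cutoff it is not. If evaluation tabs and downloads used different thresholds, they would disagree on exactly such borderline rows, which is what the synced threshold prevents.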
Access additional Scoring Code models¶
In 7.0, Scoring Code coverage has increased. The following models have been rewritten to include Scoring Code:
Developer Tools page now provides access to R and Python clients¶
New in this release, Developer Tools now provides quick links to developer documentation. These include links to:
- Current REST API, Python client API, and R client API documentation.
- The developer portal.
- The Github community repositories.
Multiclass Feature Impact now supports custom sample sizes¶
Multiclass projects can now compute Feature Impact using a custom sample size. This addresses inconsistencies in Feature Impact results and makes those results reproducible, reducing friction during the model validation process.
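The reproducibility point can be illustrated with a toy sketch. This is not DataRobot's Feature Impact algorithm; it only shows why fixing the sample (size and seed) makes repeated computations agree:

```python
# Toy sketch: a permutation-style importance computation samples rows first.
# Fixing both the sample size and the seed means every run scores the same
# rows, so the resulting impact numbers are reproducible.
import random

def impact_sample(rows, sample_size, seed=42):
    rng = random.Random(seed)
    return rng.sample(rows, min(sample_size, len(rows)))

rows = list(range(1000))
first = impact_sample(rows, 100)
second = impact_sample(rows, 100)
reproducible = first == second  # same seed and size -> same sample
```

With an uncontrolled sample, two Feature Impact runs can score different rows and return slightly different rankings, which is the inconsistency the custom sample size helps eliminate.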
Beta: Multilabel classification capabilities expand classification options¶
Multilabel modeling, now available as a public beta feature, is a kind of classification task in which each data instance (row in a dataset) is associated with none, one, or several labels. Common uses are text features with a list of topics (food, Boston, Italian) or images with a list of objects in them (a cat, two dogs, a bear). All the labels for a row build a label set for that row. Multilabel classification then predicts label sets for new observations. While similar to multiclass modeling, multilabel modeling provides more flexibility.
| Data type | Description | Allowed as target? | Project type |
|-----------|-------------|--------------------|--------------|
| Categorical | Single category per row, mutually exclusive | Yes | Multiclass |
| Multicategorical | Multiple categories per row, non-exclusive | Yes | Multilabel |
| Summarized Categorical | Multiple categories per row, multiple instances of each category allowed | No | Multiregression (not yet available) |
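In practice, a label set per row is typically encoded as a binary indicator vector per label, one column per distinct label. The sketch below shows that encoding (the approach is similar to scikit-learn's MultiLabelBinarizer; the function name is invented for the example):

```python
# Multilabel encoding sketch: each row carries a label *set*, encoded as a
# binary indicator vector with one column per distinct label.

def binarize_label_sets(label_sets):
    labels = sorted({label for s in label_sets for label in s})
    matrix = [[1 if label in s else 0 for label in labels]
              for s in label_sets]
    return labels, matrix

rows = [{"food", "italian"},  # row with two labels
        {"boston"},           # row with one label
        set()]                # row with no labels at all
labels, matrix = binarize_label_sets(rows)
```

Note the empty set is valid, which is the "none, one, or several labels" property that distinguishes multilabel from multiclass.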
Beta: New Tiny BERT pretrained featurizer implementation extends NLP with no fine-tuning needed¶
BERT (Bidirectional Encoder Representations from Transformers) is Google's transformer-based de facto standard for natural language processing (NLP) transfer learning. Tiny BERT (or any distilled, smaller version of BERT) is now available with certain blueprints in the DataRobot Repository. These blueprints provide pretrained feature extraction in the NLP field, similar to Visual AI featurizers. However, for maximum flexibility, DataRobot's implementation offers two additional tunable pooling parameters: Max Pooling and Average Pooling. Tiny BERT blueprints are available for both UI and API users.
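The two pooling strategies differ in how they collapse a variable-length sequence of per-token embeddings into one fixed-size document vector. A minimal sketch of the general technique (not DataRobot's featurizer code):

```python
# Pooling sketch: max pooling keeps the largest value per embedding
# dimension across tokens; average pooling takes the per-dimension mean.
# Either collapses N token vectors into a single fixed-size vector.

def max_pool(token_embeddings):
    return [max(dim) for dim in zip(*token_embeddings)]

def avg_pool(token_embeddings):
    return [sum(dim) / len(dim) for dim in zip(*token_embeddings)]

tokens = [[0.1, 0.8],   # embedding of token 1
          [0.5, 0.2],   # embedding of token 2
          [0.3, 0.5]]   # embedding of token 3
pooled_max = max_pool(tokens)   # [0.5, 0.8]
pooled_avg = avg_pool(tokens)
```

Max pooling tends to emphasize the strongest single signal per dimension, while average pooling smooths over all tokens, which is why exposing both as tunable parameters is useful.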
Beta: Scoring Code support for Keras models¶
Now publicly available, Keras models have been rewritten to include Scoring Code.
The lists of allowed and forbidden operations on DataStores and DataSources are now provided by new routes.
A new field, `canDelete`, has been added to the response of the `GET /api/v2/externalDataSources/` route, which lists all viewable data sources.
Models can be retrained with custom monotonic constraints.
Models can be retrained with cross-validation.
Creating a datetime model using POST /api/v2/projects/(projectId)/datetimeModels/ without specifying a featurelist will result in using the recommended featurelist for the specified blueprint. If there is no recommended featurelist, the project’s default featurelist will be used instead.
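The featurelist fallback described above amounts to a simple preference chain, sketched here with hypothetical names (this is illustrative logic, not the DataRobot server's code):

```python
# Fallback sketch for datetime model creation: prefer an explicitly supplied
# featurelist, then the blueprint's recommended featurelist, then the
# project's default featurelist.

def resolve_featurelist(requested, recommended, project_default):
    if requested is not None:
        return requested
    if recommended is not None:
        return recommended
    return project_default

# No featurelist supplied and no recommendation -> project default is used.
chosen = resolve_featurelist(None, None, "Informative Features")
```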
The new string field parameter `unsupervisedType` has been added to two endpoints to set the type of unsupervised project (anomaly or clustering) when a project is run in `unsupervisedMode`.
A new field, `canUseDatasetData`, indicates whether a user can use dataset data for download, project creation, custom model training, or providing predictions.
DataRobot highly recommends updating to the latest API client for Python and R.
Scaleout models deprecated¶
Scaleout models will be deprecated in a future release and should not be used to train new models.
Customer-reported issues fixed in v7.0.0¶
The following issues have been fixed since release 6.3.4.
- DM-4525: Data Connections are now properly listed in the Credentials Management page when the UI language is set to non-English.
- DM-4637: Adds a new config setting, `KERBEROS_PEM_ENABLE`, which when set to `True` allows the `kinit` command to obtain a service ticket using `PKINIT` preauth instead of using a keytab.
- DM-4696: The following variables have changed: the `AZURE_BLOB_STORAGE_CHUNK_SIZE` env variable is configurable (99MB default), and the `AZURE_BLOB_STORAGE_TIMEOUT` env variable is configurable (20-second default).
- EP-506: Fixes an issue with database timeout during index create/update.
- EP-750: Fixes an issue with systems using external directory services, where some DataRobot containers were unable to resolve the `datarobot_user` user. This change introduces the `os_configuration.remote_user_credentials` parameter, which, when set, maps the external directory service credentials into DataRobot containers.
- EP-795: For third-party tools, the admin interface for RabbitMQ can now include additional headers.
- PLT-3052: Fixed LDAP group mapping for groups with special symbols in the name.
- MODEL-5033: Modified certain Keras Repository blueprints that make use of One Hot Encoding numerics so that they perform NDC before One Hot Encoding. This fix ensures prediction consistency between the Modeling API and Batch API.