Work with AI Catalog assets¶
Data assets within the AI Catalog can be one of the following:
- Materialized snapshots of tables/views, meaning DataRobot has pulled from the data asset and is currently keeping a copy of it in the catalog.
- Dynamic connections, meaning that the whole dataset is ingested from your data source when you create a modeling project from it, thus allowing you to work with the most up-to-date data.
If the data is snapshotted, those snapshots can be automatically refreshed periodically, and are also automatically versioned to preserve dataset lineage and enhance the overall governance capabilities of DataRobot.
Additionally, when Composable ML is enabled, you can save blueprints to the AI Catalog. From the catalog, a blueprint can be edited, used to train models in compatible projects, or shared.
The following sections describe tools for working with the catalog assets:
- Understand asset states
- Update asset details (name, tags, and descriptions)
- Manage feature lists
- View configured relationships
- View version history and add comments
- Share and delete assets
- Perform bulk actions on assets
To add assets, see Import and create projects in the AI Catalog.
Find existing assets¶
Once in the AI Catalog, there are a variety of tools to help quickly locate the data assets you want to work with. You can:
- Search for a specific asset using the search query box.
- Use the dropdown to modify the order of all existing assets. The default sort option is Creation date, except after searching for a specific asset, in which case the default is Relevance.
- Under the search query box, filter assets by Source, Tags, and/or Owner. For example, you can filter by any tags manually added to an asset.
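As a rough illustration of how the filters and the default sort order described above behave, the following sketch models them in plain Python. The `Asset` record, its field names, and the sample values are hypothetical and not part of any DataRobot API; it also assumes that an asset must carry every selected tag to match, which the text does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    # Hypothetical record for illustration; not a DataRobot API object.
    name: str
    source: str
    owner: str
    tags: set = field(default_factory=set)


def default_sort(search_query):
    """Relevance after a search; otherwise Creation date (as described above)."""
    return "Relevance" if search_query else "Creation date"


def filter_assets(assets, source=None, owner=None, tags=None):
    """Return the assets that satisfy every filter that was supplied."""
    wanted_tags = set(tags or [])
    matches = []
    for asset in assets:
        if source is not None and asset.source != source:
            continue
        if owner is not None and asset.owner != owner:
            continue
        # Assumption: an asset must carry all requested tags to match.
        if not wanted_tags <= asset.tags:
            continue
        matches.append(asset)
    return matches


catalog = [
    Asset("loans.csv", source="Local file", owner="ana", tags={"finance", "2024"}),
    Asset("claims", source="Snowflake", owner="ben", tags={"insurance"}),
    Asset("rates.csv", source="Local file", owner="ben", tags={"finance"}),
]

print([a.name for a in filter_assets(catalog, tags=["finance"])])
print(default_sort(search_query=""))
```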
Asset states¶
DataRobot adds badges to catalog entries to indicate the state of the dataset, which is one of the following:
State | Description |
---|---|
Dynamic | A dataset that has no snapshot. |
Spark | A dataset built from a Spark query. |
Snapshot | A dataset that has a snapshot. |
Static | A static file or URL-based dataset with a snapshot. Datasets uploaded using data stages also display the STATIC badge; however, the FROM field displays stage://{stageId}/{filename}. |
Versioning static assets
Static assets can only be versioned by uploads of the same type; datasets created by local files are versioned from local file uploads, and datasets created from a data stage are versioned from data stage uploads.
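The same-type rule can be sketched with the `stage://{stageId}/{filename}` pattern that the FROM field shows for data stage uploads. This is only an illustration: the function names and sample values are hypothetical, and it assumes that any FROM value not matching the `stage://` pattern belongs to a non-stage upload such as a local file.

```python
def parse_stage_uri(from_field):
    """Split 'stage://{stageId}/{filename}' into (stage_id, filename).

    Returns None when the value does not follow that pattern.
    """
    prefix = "stage://"
    if not from_field.startswith(prefix):
        return None
    stage_id, sep, filename = from_field[len(prefix):].partition("/")
    if not sep or not stage_id or not filename:
        return None
    return stage_id, filename


def same_upload_type(existing_from, new_from):
    """True when both uploads are data stage uploads, or neither is.

    Assumption: a FROM value without the stage:// pattern is a non-stage upload.
    """
    existing_is_stage = parse_stage_uri(existing_from) is not None
    new_is_stage = parse_stage_uri(new_from) is not None
    return existing_is_stage == new_is_stage


# Hypothetical values for illustration only.
print(parse_stage_uri("stage://abc123/train.csv"))
print(same_upload_type("train.csv", "stage://abc123/train.csv"))
```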
What happens if I create a snapshot from a dynamic dataset?
In the AI Catalog, the dataset will be marked as SNAPSHOT; as with all SNAPSHOT datasets, you can still create new snapshots from it. Note that for such a dataset, only the snapshots are used to create projects.
Asset details¶
When you add a dataset, DataRobot ingests the source data and runs EDA1 to register the asset and make it available from the catalog:
Once registered, you can also view additional information and manage asset details using the tabs described below:
The Info tab displays an overview of the asset's details as well as metadata.
Element | Description |
---|---|
Name | Name the asset. By default, this is the file name uploaded. |
Description | Enter a helpful description of the asset. |
Tags | Add tags to help when filtering assets in the AI Catalog. DataRobot offers any predefined tags that match the characters you entered. Select one by clicking or continue typing to add a new tag of alphanumeric characters (special characters and symbols are invalid). Either click outside the entry box or in the dropdown to add the tag. |
Overview | An overview of the asset, including the full row count, feature count, and feature types. |
Metadata | Additional metadata, including size, owner, and dataset ID. |
Click the pencil icons to change the asset name, add a description, or add tags to aid in filtering, and then click anywhere outside of the box to save the change.
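The tag rules in the table above (alphanumeric characters only, with suggestions drawn from existing tags) can be approximated as follows. This is a minimal sketch, not DataRobot's validation logic: it assumes "alphanumeric" means ASCII letters and digits, and that suggestions are matched by case-insensitive prefix. Both function names are hypothetical.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits only.
_TAG_PATTERN = re.compile(r"[A-Za-z0-9]+")


def is_valid_tag(tag):
    """True when the tag is non-empty and contains only letters and digits."""
    return _TAG_PATTERN.fullmatch(tag) is not None


def suggest_tags(typed, existing_tags):
    """Existing tags that start with the typed characters.

    Assumption: matching is by case-insensitive prefix.
    """
    typed_lower = typed.lower()
    return [tag for tag in existing_tags if tag.lower().startswith(typed_lower)]


print(is_valid_tag("finance2024"))  # True
print(is_valid_tag("q1-data"))      # False: "-" is a special character
print(suggest_tags("fin", ["finance", "Final", "claims"]))
```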
The Profile tab allows you to preview dataset column names and row data. It can be useful for finding or verifying column names when writing Spark SQL statements for blended datasets.
Info tab vs. Profile tab
The Info tab displays the data's total row count, feature count, and size.
The Profile tab only displays a preview of the data based on a 1MB raw sample, and the feature types and details are based on a 500MB sample. As a result, the row count observed on the Profile tab may not match that displayed in the Info tab.
Note that the preview is a random sample of up to 1MB of the data and may be ordered differently from the original data. To see the complete, original data, use the Download Dataset option.
To preview a dataset, select it in the main catalog and click the pencil icon to access dataset information (if available).
- Click the Profile tab to preview the contents of the dataset.
- Use the Columns dropdown to select the number of columns to display on the page and the scroll bars to scroll through those columns. Additionally, you can use the Rows dropdown to cycle through available data, 20 rows at a time.
The Profile tab also displays details for all features in the dataset. To view details for a particular feature, scroll to it in the display and click it. The Feature Details listed in the right panel update to reflect statistics for the feature. (These are the same statistics as those displayed on the Data page for EDA1.)
You can create new feature lists and feature transformations for features of any dataset in the catalog. To work with the tools, select the dataset in the main catalog, then select Feature Lists in the left panel.
When you create feature lists, they are copied to a project when that project is created. You can then set the list to use for the project from the Feature List dropdown at the top of the Project Data list. See the section on working with Feature Lists for complete details on creating, modifying, and understanding these lists.
The Feature Lists tab also provides access to a tool for creating variable type feature transformations. While DataRobot bases variable type assignments on the values seen during EDA, there are times when you may need to change the type. Refer to the feature transformations documentation for complete details.
To create a feature list:
- Use the checkboxes to the left of feature names to select a set of features.
- Click the Create new feature list from selection link, which becomes active after you select the first feature.
- In the resulting dialog, provide a name for the new list and click Submit. The new list becomes available through the dropdown.
You can delete or rename any feature list you created. You cannot make any changes to the DataRobot default feature lists.
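The rules above (a list needs at least one selected feature, user-created lists can be renamed or deleted, default lists cannot be changed) can be modeled with a small in-memory class. This is a toy sketch for illustration, not a DataRobot API; the class, its methods, the feature names, and the rejection of duplicate list names are all assumptions, and "Informative Features" is used only as an example name for a default list.

```python
class FeatureListStore:
    """Toy in-memory model of a dataset's feature lists (hypothetical)."""

    def __init__(self, default_lists):
        self._lists = {name: list(feats) for name, feats in default_lists.items()}
        # Default lists are treated as read-only.
        self._defaults = set(default_lists)

    def names(self):
        return list(self._lists)

    def create(self, name, selected_features):
        if not selected_features:
            raise ValueError("select at least one feature")
        # Assumption: list names must be unique within a dataset.
        if name in self._lists:
            raise ValueError(f"a feature list named {name!r} already exists")
        self._lists[name] = list(selected_features)

    def rename(self, old_name, new_name):
        self._require_user_list(old_name)
        if new_name in self._lists:
            raise ValueError(f"a feature list named {new_name!r} already exists")
        self._lists[new_name] = self._lists.pop(old_name)

    def delete(self, name):
        self._require_user_list(name)
        del self._lists[name]

    def _require_user_list(self, name):
        if name in self._defaults:
            raise PermissionError(f"{name!r} is a default list and cannot be changed")
        if name not in self._lists:
            raise KeyError(name)


store = FeatureListStore({"Informative Features": ["age", "income", "zip"]})
store.create("Top two", ["age", "income"])
store.rename("Top two", "Core")
print(store.names())
```

Attempting `store.delete("Informative Features")` in this sketch raises `PermissionError`, mirroring the restriction on default lists.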
DataRobot’s Feature Discovery capability guides you through creating relationships, which define both the included datasets and how they are related to one another. The end product is a set of additional features derived from these relationships. The Feature Discovery engine analyzes the included datasets to determine a feature engineering “recipe” and, from that recipe, generates secondary features for training and predictions. Once these relationships are established, you can view them from the catalog.
To view relationships, select the dataset in the main catalog and click the Relationships tab to view, modify, or delete existing relationships:
See complete details of working with relationships before modifying relationship details.
The Version History tab lists all versions of a selected asset. The Status column indicates the snapshot status—green if successful, red if failed, gray if the original version did not have a snapshot.
Click a version to select it. Once selected, you can create a project from the version and download or delete the contents.
The Comments tab allows you to add comments to, and even host a discussion around, any item in the catalog that you have access to. Comment functionality is available in the AI Catalog, as a model tab from the Leaderboard, and in use case tracking. With comments, you can:
- Tag other users in a comment; DataRobot will then send them an email notification.
- Edit or delete any comment you have added (you cannot edit or delete other users' comments).
Asset actions¶
In the AI Catalog, there are a number of ways you can interact with data assets, including downloading, sharing, and deleting datasets.
Download datasets¶
To download a dataset, select it from the catalog list. From the dropdown menu in the upper right, select Download Dataset and, in the resulting dialog, browse to a download location and click Save.
Note
Only snapshotted datasets can be downloaded. Additionally, there is a 10GB file size limit; attempting to download a dataset larger than 10GB will fail.
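The two download conditions in the note can be expressed as a simple pre-check. This is a sketch under stated assumptions rather than DataRobot's actual logic: the function name is hypothetical, and "10GB" is interpreted here as 10 * 1024**3 bytes because the text does not give an exact byte count.

```python
# Assumption: "10GB" is taken as 10 * 1024**3 bytes; the exact limit is not stated.
MAX_DOWNLOAD_BYTES = 10 * 1024**3


def download_blocker(has_snapshot, size_bytes):
    """Return the reason a download would fail, or None if it can proceed."""
    if not has_snapshot:
        return "only snapshotted datasets can be downloaded"
    if size_bytes > MAX_DOWNLOAD_BYTES:
        return "dataset exceeds the 10GB download limit"
    return None


print(download_blocker(has_snapshot=True, size_bytes=2 * 1024**3))   # None
print(download_blocker(has_snapshot=False, size_bytes=2 * 1024**3))
print(download_blocker(has_snapshot=True, size_bytes=11 * 1024**3))
```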
Share assets¶
Assets in the AI Catalog can be shared with users, groups, and organizations.
Element | Description |
---|---|
Allow sharing | The user you're sharing with can share the asset with other users. |
Can use data | The user you're sharing with can use the data. How they use the data depends on their role. |
User list | Enter the user(s) with whom you are sharing the asset. |
Access level | Users is selected by default. If your instance has Groups and Organizations configured, you can also select from these categories. |
Role | Select a role for the users, groups, or organizations that are sharing the asset. |
Share | Select Share to perform the operation. |
Shared with | Shows the users, groups, and organizations that the asset is shared with, along with their permission settings. |
Sharing with multiple users
When sharing a catalog asset with multiple users, DataRobot suggests creating a user group first, and then sharing with that group instead of individual users.
The catalog uses the same sharing window as other places in the application, with some fields specific to the data assets.
Delete assets¶
To delete a dataset, select the dataset from the catalog list. From the dropdown menu in the upper right, select Delete Dataset. When prompted for confirmation, click Delete.
Bulk actions on datasets¶
You can share, tag, or delete multiple datasets at once using the bulk action functionality in the AI Catalog. Start by selecting the box next to the asset(s) you want to manage; select at least one asset to enable the bulk actions at the top. A counter also displays how many assets are actively selected.
Once you're done selecting assets, choose the appropriate action from the following options: Delete, Tag, or Share.