Manage data with the AI Catalog
DataRobot’s AI Catalog provides three key functions:
- Ingest: Data is imported into DataRobot and sanitized for use throughout the platform.
- Storage: Reusable data assets are stored, accessed, and shared.
- Data Preparation: Clean, blend, transform, and enrich your data to maximize the effectiveness of your application.
You can access the AI Catalog from anywhere within DataRobot by clicking the AI Catalog tab at the top of the browser.
This tutorial shows you how to:
- Add data to the AI Catalog.
- View information about a dataset.
- Blend a dataset with another dataset using Spark SQL.
- Create a project.
To add data to the AI Catalog:
Click AI Catalog at the top of the DataRobot window.
Click Add to catalog and select an import method.
The following table describes the methods:
| Method | Description |
|---|---|
| New Data Connection | Configure a JDBC connection to import from an external database or data lake. |
| Existing Data Connection | Select a configured data source to import data. Select the account and the data you want to add. |
| Local File | Browse to upload a local dataset or drag and drop a dataset. |
| URL | Import by specifying a URL. |
| Spark SQL | Use Spark SQL queries to select and prepare the data you want to store. |
DataRobot registers the data after performing an initial exploratory data analysis (EDA1). Once registered, you can do the following:
- View information about a dataset, including its history.
- Blend the dataset with another dataset.
- Create an AutoML project.
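To give a rough sense of what an initial exploratory pass over a dataset involves, the sketch below counts rows, features, and missing values in a small CSV using only the Python standard library. It is an illustration, not DataRobot's EDA1 implementation: the `summarize` helper, the `SAMPLE` data, and the column names are all hypothetical.

```python
import csv
import io


def summarize(csv_text: str) -> dict:
    """Return row count, feature count, and per-feature missing-value counts.

    Hypothetical helper for illustration only; not part of DataRobot.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    features = reader.fieldnames or []
    missing = {name: 0 for name in features}
    rows = 0
    for row in reader:
        rows += 1
        for name in features:
            value = row.get(name)
            # Treat absent or blank cells as missing.
            if value is None or value.strip() == "":
                missing[name] += 1
    return {"rows": rows, "features": len(features), "missing": missing}


# Made-up sample data: three rows, three features, two blank cells.
SAMPLE = "customer_id,region,amount\n1,EMEA,120.5\n2,,80\n3,APAC,\n"

summary = summarize(SAMPLE)
print(summary)
```

The row and feature counts computed here correspond loosely to the figures shown for a registered dataset in the catalog.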
View information about a dataset
Click a dataset in the catalog to view information about it.
| Element | Description |
|---|---|
| Asset tabs | Select a tab to work with the asset (dataset). |
| Dataset Info | Update the name and description, and add tags to use for searches. The number of rows and features display on the right, along with other details. |
| State badges | Displayed badges indicate the state of the asset: whether it's in the process of being registered, whether it's static or dynamic, generated from a Spark SQL query, or snapshotted. |
| Create project | Create a machine learning project from the dataset. |
| Share | Share assets with other users, groups, and organizations. |
| Actions menu | Download, delete, or create a snapshot of the dataset. |
| Renew Snapshot | Add a scheduled snapshot. |
Blend a dataset using Spark SQL
You can blend two or more datasets and use Spark SQL to select and transform features.
In the catalog, click Add to catalog and select Spark SQL.
Click Add data.
Select the tables you want to blend and click Add selected data.
For each dataset, click the actions menu and click Select Features.
Choose the features and click Add selected features to SQL. You can click the right arrows to add features one at a time.
Once you have added features from the datasets, add SQL commands to the editing window to generate a query. For Spark SQL documentation, click Spark Docs on the upper right. Try out the query by clicking Run.
Click Save when you have the results you want. DataRobot registers the new dataset.
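The blending steps above end in a SQL query that joins the selected datasets and picks or derives features. As a minimal sketch of what such a query might look like, the example below joins two tables and aggregates one of them. It runs against Python's built-in `sqlite3` module purely as a local stand-in for Spark SQL, since this particular query uses only standard SQL that both accept. The table names (`customers`, `orders`), columns, and values are hypothetical; in the AI Catalog you would type only the query into the editing window, using the names of the datasets you added.

```python
import sqlite3

# In-memory database standing in for two catalog datasets (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO orders VALUES (10, 1, 120.5), (11, 1, 30.0), (12, 2, 80.0);
    """
)

# Blend: join the two tables and derive per-customer aggregate features.
BLEND_QUERY = """
    SELECT c.customer_id,
           c.region,
           COUNT(o.order_id) AS order_count,
           SUM(o.amount)     AS total_amount
    FROM customers AS c
    JOIN orders AS o
      ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.region
    ORDER BY c.customer_id
"""

blended = conn.execute(BLEND_QUERY).fetchall()
conn.close()

for row in blended:
    print(row)
```

Each output row is one customer with the region from the first table and two features derived from the second, which is the general shape of a blended dataset.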
Create a project
Click a registered dataset in the catalog and click Create project. DataRobot uploads the data, conducts exploratory data analysis, and creates the machine learning project. You can then start building models.