Import and create projects in the AI Catalog¶
The AI Catalog lets you find, share, tag, and reuse data, which speeds time to production and increases collaboration. The catalog provides easy access to the data needed to answer a business problem while ensuring security, compliance, and consistency. With the AI Catalog, you can:
- Execute simple data preparation, leveraging SQL scripts for pinpointed results.
- Create datasets without the full commitment of creating projects.
- Find, access, delete, and reuse the assets you need.
- Share data without sharing projects, decreasing risks and costs around data duplication.
- Support data security and governance through selective addition to the catalog, role-based sharing, and an audit trail, which reduces friction and speeds up model adoption.
For on-premises users, DataRobot recommends enabling Elasticsearch for significantly improved search matches, relevancy, and rankings. Contact your DataRobot representative for help configuring and deploying Elasticsearch.
The AI Catalog is a centralized collaboration hub for working with data and related assets. The DataRobot landing page provides the option to start a project via the legacy method or by using the AI Catalog.
The following sections describe importing data and creating projects from the AI Catalog:
- Add new data
- Create a snapshot from a connected data source
- Create a project for a listed asset
Once in the catalog, use the additional tools to view, modify, and share assets.
The starting point for adding assets to the catalog is either the application home page or the AI Catalog home page.
Import methods are the same for both legacy and catalog entry: local file, HDFS, URL, or JDBC data source. From the catalog, however, you can also add data by blending datasets with Spark. When you upload through the catalog, DataRobot completes EDA1 for materialized assets and saves the results for later reuse. For unmaterialized assets, DataRobot uploads and samples the data but does not save the results. Additionally, you can upload calendars for use in time series projects.
To upload assets to the catalog:
Select the AI Catalog tab.
Click Add to catalog and select a source for the data:
- The file browser for selecting a local file.
- A URL (HTTP, HTTPS, local, S3, Google Cloud Storage). Note that the types of URLs supported depend on how your installation is configured.
- A file stored in a Hadoop Distributed File System node (HDFS). DataRobot Hadoop environments only.
- A blend of two or more datasets using Spark SQL.
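As a rough illustration of the URL source types listed above, the following sketch classifies a URL by its scheme before you attempt an upload. The scheme-to-source mapping (`file` for local, `s3`, and `gs` for Google Cloud Storage) and the function name `classify_source_url` are assumptions made for illustration, not DataRobot's own logic. Which schemes are actually accepted depends on how your installation is configured.

```python
from urllib.parse import urlparse

# Hypothetical mapping of URL schemes to the source types named in the text.
# The scheme names are assumptions; check what your installation supports.
SCHEME_TO_SOURCE = {
    "http": "HTTP",
    "https": "HTTPS",
    "file": "local",
    "s3": "S3",
    "gs": "Google Cloud Storage",
}


def classify_source_url(url: str) -> str:
    """Return the source type for a URL, or raise ValueError if unrecognized."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in SCHEME_TO_SOURCE:
        raise ValueError(f"Unsupported or missing URL scheme: {scheme!r}")
    return SCHEME_TO_SOURCE[scheme]
```

A check like this only catches obviously malformed input early; it does not confirm that the URL is reachable or that the installation permits that source type.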
Add data from external connections¶
Using JDBC, you can read data from external databases and add the data as assets to the AI Catalog for model building and predictions. See Data connections for more information.
If you haven't already, create the connections and add data sources.
Select the AI Catalog tab, click Add to catalog, and select Existing Data Connection.
Click the connection that holds the data you would like to add.
Select an account. Enter or use stored credentials for the connection to authenticate.
Once validated, select a source for data.
- Schemas: Select Schemas to list all schemas associated with the database connection. Select a schema from the displayed list. DataRobot then displays all tables that are part of that schema. Click Select for each table you want to add as a data source.
- Tables: Select Tables to list all tables across all schemas. Click Select for each table you want to add as a data source.
- SQL Query: Select data for your project with a SQL query.
- Search: After you select how to filter the data sources (by schema, table, or SQL query), enter a text string to search.
- Data source list: Click Select for data sources you want to add. Selected tables (datasets) display on the right. Click the x to remove a single dataset or Clear all to remove all entries.
- Policies: Select a policy:
  - Create snapshot: DataRobot takes a snapshot of the data.
  - Create dynamic: DataRobot refreshes the data for future modeling and prediction activities.
Once the content is selected, click Proceed with registration.
DataRobot registers the new tables (datasets) and you can then create projects from them or perform other operations, like sharing and querying with SQL.
Use a SQL Query¶
You can use a SQL query to select specific elements of the named database and use them as your data source. DataRobot provides a web-based code editor with SQL syntax highlighting to help in query construction. Note that DataRobot’s SQL query option only supports SELECT-based queries. Also, SQL validation is only run on initial project creation. If you edit the query from the summary pane, DataRobot does not re-run the validation.
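Because only SELECT-based queries are supported, it can help to pre-check a query locally before pasting it into the editor. The following is a minimal heuristic sketch, not DataRobot's validation: the function name `looks_like_select` is hypothetical, and the check only inspects the first keyword after any leading whitespace and comments. Queries that start with anything else, including a parenthesis or `WITH`, are rejected by this sketch even if the editor might treat them differently.

```python
import re

# Matches leading whitespace, `-- line` comments, and `/* block */` comments.
_LEADING_NOISE = re.compile(
    r"\A(?:\s+|--[^\n]*(?:\n|\Z)|/\*.*?\*/)*", re.DOTALL
)


def looks_like_select(query: str) -> bool:
    """Heuristic: True if the first keyword of the query is SELECT."""
    body = _LEADING_NOISE.sub("", query, count=1)
    first_word = re.match(r"[A-Za-z]+", body)
    return first_word is not None and first_word.group(0).upper() == "SELECT"
```

This does not check that the query is well-formed SQL; that remains the job of the Validate SQL Query option described below.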
To use the query editor:
Once you have added data from an external connection, click the SQL query tab. By default, the Settings tab is selected.
Enter your query in the SQL query box.
To validate that your entry is well-formed, make sure that the Validate SQL Query box below the entry box is checked.
Validation can take a long time to complete for some complex queries, so in some scenarios it is useful to disable it. If you disable validation, no results display. You can skip running the query and proceed to registration.
Select whether to create a snapshot.
Click Run to create a results preview.
Select the Results tab after computing completes.
Use the window-shade scroll to display more rows in the preview; if necessary, use the horizontal scroll bar to scroll through all columns of a row.
When you are satisfied with your results, click Proceed with registration. DataRobot validates the query and begins data ingestion. When complete, the dataset is published to the catalog. From here you can interact with the dataset as with any other asset type.
For more examples of working with the SQL editor, see Prepare data in AI Catalog with Spark SQL.
Calendars for time series projects can be uploaded either:
- Directly to the catalog with the Add to catalog button, using any of the upload methods. Calendars uploaded as a local file are automatically added to the AI Catalog, where they can then be shared and downloaded.
- From within the project using the Advanced options > Time Series tab.
When adding from Advanced options, use the choose file dropdown and choose AI Catalog.
A modal appears listing the available calendars, which are determined based on the content of the dataset. Use the dropdown to sort the listing by type.
DataRobot determines whether the calendar is single series or multiseries based on the number of columns. A calendar with two columns, only one of which is a date, is single series; a calendar with three columns, only one of which is a date, is multiseries.
Click on any calendar dataset to see the associated details and select the calendar for use with the project.
The calendar file becomes part of the standard AI Catalog inventory and can be reused like any dataset. Calendars generated from Advanced options are saved to the catalog where you can then download them, apply further customization, and re-upload them.
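The column-count rule for calendars described above can be sketched as a small check you might run on a calendar file before uploading it. This is an illustration under stated assumptions, not DataRobot's detection logic: the function names are hypothetical, the input is CSV text with a header row, and a column counts as a date only if every value parses as an ISO date (`YYYY-MM-DD`).

```python
import csv
import io
from datetime import date


def _is_date_column(values) -> bool:
    """True if every value parses as an ISO date (an assumed date format)."""
    if not values:
        return False
    try:
        for value in values:
            date.fromisoformat(value)
    except ValueError:
        return False
    return True


def classify_calendar(csv_text: str) -> str:
    """Return 'single series' or 'multiseries' using the column-count rule."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows) < 2:
        raise ValueError("Calendar needs a header row and at least one event")
    header, data = rows[0], rows[1:]
    columns = list(zip(*data))  # transpose rows into columns
    date_columns = sum(_is_date_column(col) for col in columns)
    if date_columns == 1 and len(header) == 2:
        return "single series"
    if date_columns == 1 and len(header) == 3:
        return "multiseries"
    raise ValueError(
        f"Expected 2 or 3 columns with exactly one date column; "
        f"got {len(header)} columns and {date_columns} date columns"
    )
```

For example, a file with hypothetical columns `date,event` would be classified as single series, while `date,event,series_id` would be classified as multiseries.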
Create a snapshot¶
When adding data from an external connection, you can uncheck Create Snapshot, for example, to meet certain security requirements. DataRobot then adds the database table to the catalog as an unmaterialized data entry: it pulls the data once, runs EDA to learn the data structure, and then deletes the data. When the data is later requested for modeling or predictions, DataRobot pulls it again from the source. Snapshotted (materialized) data is stored on disk; unmaterialized data remains in the remote source and is downloaded only when needed.
You can schedule automated snapshot refreshes to sync your dataset with your data source regularly.
To determine whether an asset has been snapshotted, click on its catalog entry and check the details on the right. If it has been snapshotted, the last snapshot date displays; if not, a notification appears.
To create a snapshot for unmaterialized data:
Select the asset from the main catalog listing.
Expand the menu in the upper right and select Create Snapshot.
You cannot update the snapshot parameters that were defined when the catalog entry was added; snapshots are based on the original SQL.
DataRobot prompts for any credentials needed to access the data source. Click Yes, take snapshot to proceed.
DataRobot runs EDA. New snapshots are available from the version history, with the newest ("latest") snapshot becoming the one used by default for the dataset.
Once EDA completes, the displayed status updates to "SNAPSHOT" and a message appears indicating that publishing is complete. If you want the asset to no longer be snapshotted, remove the asset and add it again, making sure to uncheck Create Snapshot.
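The difference between snapshotted and unmaterialized assets can be pictured with a toy model. The sketch below is purely conceptual and is not DataRobot's implementation: the `CatalogEntry` class, its methods, and the `"UNMATERIALIZED"` status label are hypothetical names chosen for illustration (only the `"SNAPSHOT"` status appears in the text). It shows the two behaviors described in this section: an unmaterialized entry pulls from the source on each request, and each new snapshot is added to the version history, with the latest one used by default.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Rows = List[dict]


@dataclass
class CatalogEntry:
    """Toy model of a catalog asset; not DataRobot's implementation."""

    fetch: Callable[[], Rows]  # pulls rows from the remote data source
    snapshots: List[Rows] = field(default_factory=list)  # version history

    @property
    def status(self) -> str:
        # "UNMATERIALIZED" is an assumed label for entries with no snapshot.
        return "SNAPSHOT" if self.snapshots else "UNMATERIALIZED"

    def create_snapshot(self) -> None:
        """Materialize the current source data as the newest version."""
        self.snapshots.append(list(self.fetch()))

    def read(self) -> Rows:
        """Use the latest snapshot if one exists; otherwise pull on demand."""
        if self.snapshots:
            return self.snapshots[-1]
        return self.fetch()
```

In this model, data read from a snapshotted entry stays fixed until another snapshot is taken, while an unmaterialized entry always reflects the current state of the source.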
Create a project¶
You can create new projects directly from the AI Catalog; you can also use listed datasets as a source for predictions.
To create a project, from the catalog main listing, click on an asset to select it. In the upper right, click Create project.