Data preparation reference¶
Connection capabilities¶
The following table lists the connections currently available in Workbench as well as supported capabilities for each one:
Connection | GA or preview | Dynamic datasets | Live preview | Wrangling | In-source materialization |
---|---|---|---|---|---|
Snowflake | GA | ✔ | ✔ | ✔ | ✔ |
Google BigQuery | GA | ✔ | ✔ | ✔ | ✔ |
Databricks | GA | ✔ | ✔ | ✔ | ✔ |
AWS S3 | GA | ||||
ADLS Gen2 | Preview | ||||
SAP Datasphere | Preview |
FAQ¶
Add data¶
What is the Data Registry and why does it show my AI Catalog datasets?
The Data Registry is a catalog for your assets in Workbench that displays all static and snapshot datasets you have access to in the AI Catalog. Adding data via the Data Registry creates a link from the source of the data (the AI Catalog) to your Use Case.
Can one dataset be added to multiple Use Cases?
Yes. When you add a dataset to a Use Case via the Data Registry, you are establishing a link between the dataset and the Use Case.
How do I delete a dataset?
To remove a dataset from a Use Case, click Actions menu > Remove from Use Case. Note that this only removes the link from the data source to the Use Case, meaning team members in that specific Use Case will no longer see the dataset, however, if they have access to the same dataset in a different Use Case, they will still be able to access the dataset. You can control access to the source data from the AI Catalog in DataRobot Classic.
How can I browse and manage data connections not currently supported in Workbench?
You must use DataRobot Classic to manage any data connections not listed above. Additional connections will be added to Workbench in future releases.
How can I delete a data connection in Workbench?
You cannot delete data connections from within Workbench; to remove existing data connections, go to User Settings > Data Connections in DataRobot Classic.
How can I manage saved credentials?
You can manage saved credentials for your data connections in DataRobot Classic.
Wrangle data¶
What permissions are required to be able to push down operations to Snowflake?
Your account in Snowflake must have read
access to the selected database.
Are there situations where data is moved from source?
Yes, data is moved from the source:
- During an interactive wrangling session: 10,000 randomly sampled rows from the original table or view in Snowflake are brought into DataRobot for preview and profiling purposes.
- After publishing a wrangling recipe: When you publish a recipe, the transformations are pushed down to the source and applied to the entire input table or view in Snowflake. The resulting output is materialized in DataRobot as a snapshot dataset.
How do the wrangling insights differ from the exploratory data insights generated when registering a dataset in DataRobot?
The insights generated during data wrangling are based on the same live random sample of the raw dataset retrieved from your data source used during an interactive wrangling session. Whenever you adjust the row count or add operations, DataRobot updates the sample and performs exploratory data analysis again.
Why do I need to downsample my data?
If the size of the raw data in Snowflake does not meet DataRobot's file size requirements, you can configure automatic downsampling to reduce the size of the output dataset.
Considerations¶
Add data¶
Consider the following when adding data:
- Data connections must be removed in DataRobot Classic.
- URL import is not supported.
- There is currently no image support in previews.
Wrangle data¶
Consider the following when wrangling data:
- Profile cannot be customized and is limited to sample-based profiles.
- Unstructured data types are not supported.
- The live sample only supports random sampling.