Prepare data

DataRobot's wrangling capabilities provide a seamless, scalable, and secure way to access and transform data for modeling. In Workbench, wrangling is a visual interface for cleaning data at the source, whether that source is the Data Registry in DataRobot or an external data source whose compute environment and distributed architecture you want to leverage.

Why wrangle data in DataRobot?

  • It's fully integrated in Workbench: find the right datasets, apply transformations, and see the effects of those transformations on your dataset in real time, all in one place.
  • It's pushed down: when using a data connection, you leverage the scale of your cloud data warehouse or lake.
  • It's secure: limiting data movement means faster results, better performance, and enhanced security.

You can launch the data wrangler from the following areas in a Use Case:

When you wrangle a dataset, DataRobot pulls a uniform random sample of 10,000 rows and calculates exploratory data insights on that sample, all while connected to your data source. You then build a recipe of operations to apply to the entire dataset. Each transformation is first applied to the live sample so that you can verify its effect. When the recipe is ready, you publish it: the recipe is pushed down to the data source, where it's executed to materialize an output dataset.
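The sample-then-publish flow described above can be sketched in plain Python. This is a conceptual illustration only, not DataRobot's API: the function names, the two example operations, and the in-memory `source_table` are all hypothetical stand-ins.

```python
import random

SAMPLE_SIZE = 10_000  # sample size stated in the text above


def draw_sample(rows, k=SAMPLE_SIZE, seed=0):
    """Return a uniform random sample of at most k rows."""
    if len(rows) <= k:
        return list(rows)
    return random.Random(seed).sample(rows, k)


def apply_recipe(rows, recipe):
    """Apply each operation in order; an operation maps a list of rows to a new list."""
    for operation in recipe:
        rows = operation(rows)
    return rows


# Hypothetical operations, for illustration only.
def drop_missing_amount(rows):
    return [r for r in rows if r["amount"] is not None]


def add_amount_usd(rows):
    return [{**r, "amount_usd": r["amount"] / 100} for r in rows]


recipe = [drop_missing_amount, add_amount_usd]

# Stand-in for a table at the data source: 50,000 rows, every 10th amount missing.
source_table = [
    {"id": i, "amount": None if i % 10 == 0 else i * 5}
    for i in range(50_000)
]

sample = draw_sample(source_table)
preview = apply_recipe(sample, recipe)        # interactive preview on the sample
output = apply_recipe(source_table, recipe)   # "publish": same recipe, full dataset
```

The point of the design is that the recipe is defined once and is independent of the data it runs on, so the operations checked on the sample are the same ones executed against the full dataset at publish time.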

DataRobot provides two different tools for wrangling data:

  • Wrangler: A GUI-based tool that allows you to build a recipe using operations—each operation applying a specific transformation to the dataset.
  • SQL Editor: A tool that allows you to build a recipe using SQL queries.
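As a rough illustration of a SQL-based recipe, the sketch below uses Python's built-in `sqlite3` module as a stand-in for an external data source. The `orders` table, the query, and the `orders_prepared` output name are hypothetical; the actual SQL dialect and publishing mechanics depend on your data source and on DataRobot, and are not shown here.

```python
import sqlite3

# In-memory SQLite database standing in for an external data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "east", 10.0), (2, "west", None), (3, "east", 5.5), (4, "west", 7.0)],
)

# Hypothetical recipe: drop rows with a missing amount, then aggregate by region.
RECIPE_SQL = """
SELECT region, COUNT(*) AS n_orders, SUM(amount) AS total_amount
FROM orders
WHERE amount IS NOT NULL
GROUP BY region
"""

# Preview: run the recipe where the data lives and fetch only a few result rows.
preview = conn.execute(
    f"SELECT * FROM ({RECIPE_SQL}) ORDER BY region LIMIT 10"
).fetchall()

# "Publish": execute the same query at the source and materialize an output table.
conn.execute(f"CREATE TABLE orders_prepared AS {RECIPE_SQL}")
output = conn.execute(
    "SELECT region, n_orders, total_amount FROM orders_prepared ORDER BY region"
).fetchall()

print(output)
```

In both steps the query runs inside the database engine, and only the results leave it, which is the idea behind pushdown.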

This section covers the following topics:

Topic | Description
Wrangler | Use Wrangler to build a recipe of one or more operations that allow you to interactively prepare data for modeling without moving it from your data source.
SQL Editor | Use the SQL Editor to create a recipe made up of SQL queries, which you can then publish to your data source to generate an output dataset.
Publish a recipe | Publish a recipe to push down transformations to your data source and generate an output dataset.
Reference
Supported data stores | A complete list of supported data stores.
Wrangling large Snowflake datasets | Tips for improving the performance of wrangling in Snowflake.

Feature considerations

Consider the following when wrangling data:

  • The data profile cannot be customized and is always calculated on a sample.
  • Wrangling does not support query type datasets (i.e., a dataset built from a query).
  • Self-managed: You can wrangle Data Registry datasets of up to 20GB.
  • Managed SaaS (multi-tenant SaaS and AWS single-tenant SaaS deployments): You can wrangle Data Registry datasets of up to 100GB.
  • You can add dynamic datasets using a JDBC driver; however, you cannot preview or wrangle that data until you create a snapshot of the dataset.

FAQs

What permissions are required to be able to push down operations to an external data source?

You must have read access to the selected database.

Are there situations where data is moved from source?

Yes, data is moved from the source:

  • During an interactive wrangling session: 10,000 randomly sampled rows from the original table or view in the data source are brought into DataRobot for preview and profiling purposes.
  • After publishing a wrangling recipe: When you publish a recipe, the transformations are pushed down and applied to the entire input table or view in the data source. The resulting output is materialized in DataRobot as a snapshot dataset.

How do the wrangling insights differ from the exploratory data insights generated when registering a dataset in DataRobot?

Wrangling insights are calculated on the live random sample of the raw dataset that DataRobot retrieves from your data source for the interactive wrangling session. Whenever you adjust the row count or add operations, DataRobot updates the sample and performs exploratory data analysis again.

Why do I need to downsample my data?

If the raw data in Snowflake exceeds DataRobot's file size requirements, you can configure automatic downsampling to reduce the size of the output dataset.
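To make the idea of downsampling concrete, here is a minimal sketch of random downsampling to fit a size limit. It is not how DataRobot implements automatic downsampling, which this page does not describe; the function names, the byte figures, and the assumption that all rows are roughly the same size are hypothetical.

```python
import random


def max_rows_for_limit(total_rows, total_bytes, limit_bytes):
    """Estimate how many rows fit under limit_bytes, assuming rows are roughly equal in size."""
    if total_bytes <= limit_bytes:
        return total_rows
    return int(total_rows * limit_bytes / total_bytes)


def downsample(rows, max_rows, seed=0):
    """Return all rows if they fit, otherwise a uniform random subset of max_rows rows."""
    if len(rows) <= max_rows:
        return list(rows)
    return random.Random(seed).sample(rows, max_rows)


# Hypothetical numbers: a 4,000-byte dataset of 1,000 rows and a 1,000-byte limit.
rows = list(range(1_000))
row_budget = max_rows_for_limit(len(rows), total_bytes=4_000, limit_bytes=1_000)
reduced = downsample(rows, row_budget)
print(row_budget, len(reduced))
```

A uniform random subset keeps every row equally likely to be retained, so summary statistics on the reduced dataset remain approximately representative of the original.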