Recipes

To clean, prepare, and wrangle your data into your desired shape, DataRobot provides reusable recipes for data preparation. Each recipe acts like a blueprint, taking one or more datasets or data sources as input and applying a series of operations to filter, modify, join, or transform your data. You can then use the recipe to create a dataset ready for consumption. Recipes allow for quick iteration on data prep workflows and enable reuse through a simple operations API.

Recipe terminology

Recipes use the following terminology:

  • Recipe: A reusable blueprint for how to create a new dataset by applying operations to transform one or more data inputs.
  • Recipe dialect: The dialect that data wrangling operations use when working with recipe inputs. For example, use the Snowflake dialect when working with data assets from Snowflake.
  • Input: A dataset or data source providing data to a recipe. A recipe can have multiple inputs. A recipe's inputs must either be all datasets, or all tables from data sources pointing to the same data store.
  • Primary input: The input used as the base for the recipe. If no operations are applied by the recipe, a dataset identical to the primary input will be output by the recipe. A recipe will only have a single primary input.
  • Secondary input: An additional input to a recipe. A recipe can have multiple secondary inputs. Data from secondary inputs must be introduced into a recipe via a join or similar operation.
  • Recipe preview: A sample view of the recipe's data computed by applying the operations in a recipe on its inputs. The data featured in a recipe's preview is generally a sample of the recipe's fully transformed data.
  • Sampling: A setting through which the number of rows read from a recipe's primary input is modified when computing the recipe's preview.
  • Downsampling: A setting through which the number of rows written to the dataset published by a recipe is modified.
  • Operation: A way to modify how a recipe works with data from its inputs.
  • Wrangling operation: A transformation to apply to a recipe's data. Recipes can stack multiple wrangling operations on top of each other to transform data from their inputs.
  • Downsampling operation: A modification to the number of rows a recipe writes to a dataset when publishing. Recipes can optionally use a single downsampling operation.
  • Sampling operation: A modification to the number of rows read from a recipe's primary data input. Recipes can optionally set a single sampling operation on their primary input.
  • Publishing: The action of creating a new dataset containing the result of applying the recipe's operations to its inputs.

Review the recommended workflow below to create, iterate on, and publish with recipes; a minimal end-to-end sketch follows the steps.

  1. Create a datarobot.Recipe to work on a datarobot.Dataset or data from a datarobot.DataSource. The recipe will belong to a datarobot.UseCase.
  2. Modify the recipe by updating its metadata, settings, inputs, operations, or downsampling.
  3. Verify the recipe's data by requesting a recipe preview. If you are unhappy with the result, go back to step 2.
  4. Publish the recipe to create a new datarobot.Dataset constructed according to the transformations in the recipe.
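
The sketch below strings these four steps together using only calls documented in the sections that follow (IDs and names are placeholders):

>>> import datarobot as dr
>>> from datarobot.enums import DataWranglingDialect, RecipeType
>>> from datarobot.models.recipe_operation import DropColumnsOperation
>>>
>>> # Step 1: Create a recipe from an existing dataset in a use case
>>> my_use_case = dr.UseCase.list(search_params={"search": "My Use Case"})[0]
>>> dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f123')
>>> recipe = dr.Recipe.from_dataset(
...     use_case=my_use_case,
...     dataset=dataset,
...     dialect=DataWranglingDialect.SPARK,
...     recipe_type=RecipeType.WRANGLING
... )
>>> # Step 2: Modify the recipe, here by dropping a column
>>> recipe.update(operations=[DropColumnsOperation(columns=['legacy_id'])])
>>> # Step 3: Verify the transformed data via the recipe preview
>>> recipe.get_preview().df
>>> # Step 4: Publish the recipe to a new dataset
>>> published_dataset = recipe.publish_to_dataset(name="My Wrangled Data")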

Create a recipe

There are two ways to create a recipe: from a dataset or from a table in a JDBC data source. Either one becomes the primary input for the recipe. You will also need a datarobot.UseCase, as each recipe belongs to a use case. Choose the DataWranglingDialect that best matches the source of the dataset or data source.

Create a recipe from a dataset

Use the Recipe.from_dataset method to create a recipe from an existing datarobot.Dataset:

>>> import datarobot as dr
>>> from datarobot.enums import DataWranglingDialect, RecipeType
>>> from datarobot.models.recipe_operation import RandomSamplingOperation
>>>
>>> # Get your use case and dataset
>>> my_use_case = dr.UseCase.list(search_params={"search": "My Use Case"})[0]
>>> dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f123')
>>>
>>> # Create a recipe from the dataset
>>> recipe = dr.Recipe.from_dataset(
...     use_case=my_use_case,
...     dataset=dataset,
...     dialect=DataWranglingDialect.SPARK,
...     recipe_type=RecipeType.WRANGLING,
...     sampling=RandomSamplingOperation(rows=500)
... )

Create a recipe from a JDBC table

Use the Recipe.from_data_store method to create a recipe directly from tables in a connected data source:

>>> import datarobot as dr
>>> from datarobot.enums import DataWranglingDataSourceTypes, DataWranglingDialect, RecipeType
>>> from datarobot.models.recipe import DataSourceInput
>>> from datarobot.models.recipe_operation import LimitSamplingOperation
>>>
>>> # Configure your data source input
>>> data_source_input = DataSourceInput(
...     canonical_name='Sales_Data_Connection', # data connection name
...     table='sales_transactions',
...     schema='PUBLIC',
...     sampling=LimitSamplingOperation(rows=1000)
... )
>>>
>>> # Get your use case and data store
>>> my_use_case = dr.UseCase.list(search_params={"search": "Sales Analysis"})[0]
>>> data_store = dr.DataStore.get('2b33a1b2c9e88f0001e6f657')
>>>
>>> # Create recipe from data source
>>> recipe = dr.Recipe.from_data_store(
...     use_case=my_use_case,
...     data_store=data_store,
...     data_source_type=DataWranglingDataSourceTypes.JDBC,
...     dialect=DataWranglingDialect.POSTGRES,
...     data_source_inputs=[data_source_input],
...     recipe_type=RecipeType.WRANGLING
... )

Retrieve recipes

You can retrieve a specific recipe by ID, or a list of all recipes, filtering the list as required.

>>> import datarobot as dr
>>>
>>> # Get a specific recipe by ID
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # List all recipes
>>> all_recipes = dr.Recipe.list()
>>>
>>> # Filter recipes. Use any number of params to filter.
>>> filtered_recipes = dr.Recipe.list(
...     search="My Recipe Name",
...     dialect=dr.enums.DataWranglingDialect.SPARK,
...     status="draft",
...     recipe_type=dr.enums.RecipeType.WRANGLING,
...     order_by="-updatedAt",  # Most recently updated first
...     created_by_username="data_scientist_user"
... )

Retrieve information about a recipe

The recipe object contains basic information about the recipe that you can query, as shown below.

>>> import datarobot as dr
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> recipe.id
'690bbf77aa31530d8287ae5f'
>>> recipe.name
'Customer Segmentation Dataset Recipe'

You can also retrieve the list of inputs and operations, as well as the settings for downsampling and general recipe settings.

>>> import datarobot as dr
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> # Access inputs, operations, downsampling and settings
>>> inputs = recipe.inputs
>>> primary_input = inputs[0] # First input is the primary input
>>> secondary_inputs = inputs[1:] # All others in the list are secondary inputs
>>>
>>> operations = recipe.operations
>>> downsampling_operation = recipe.downsampling
>>> settings = recipe.settings

Update recipe metadata fields

You can update the recipe's metadata fields (name, description, etc.) with Recipe.update as follows:

>>> import datarobot as dr
>>>
>>> # Retrieve an existing recipe
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Update metadata fields
>>> recipe.update(
...     name="Customer Segmentation Dataset Recipe",
...     description="Recipe to create customer segmentation dataset."
... )

Update recipe inputs

You can update the list of inputs for a recipe with the Recipe.update method, as shown below. Updating the list of inputs changes the data fed into the recipe for transformation. The first input in the list becomes the recipe's primary input; the rest are secondary inputs.

Recipe input considerations

The Recipe.update method will replace all existing inputs. If adding inputs, always include the existing primary input to avoid breaking the recipe.

Data from secondary inputs will not appear in the recipe preview unless it is joined or otherwise combined with data from the primary input.

Recipe inputs must either be all datasets, or all tables from data sources pointing to the same data store.

>>> import datarobot as dr
>>> from datarobot.enums import RecipeInputType
>>> from datarobot.models.recipe import RecipeDatasetInput, JDBCTableDataSourceInput
>>> from datarobot.models.recipe_operation import LimitSamplingOperation
>>>
>>> # Get the recipe and additional datasets
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> secondary_dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f456')
>>>
>>> # Add a secondary dataset input if the primary input is also a dataset
>>> recipe.update(
...     inputs=[
...         recipe.inputs[0],  # Keep the original primary input
...         RecipeDatasetInput.from_dataset(
...             dataset=secondary_dataset,
...             alias='customers_data'
...         )
...     ]
... )
>>>
>>> # You can also add data from a table in a data store
>>> data_store = dr.DataStore.get('5e1b4f8f2a3c4d5e6f708192')
>>> data_source = dr.DataSource.create(
...     data_source_type="jdbc",
...     canonical_name="My Snowflake connection",
...     params=dr.DataSourceParameters(
...         data_store_id=data_store.id,
...         schema="PUBLIC",
...         table="stock_prices"
...     )
... )
>>> table = data_source.create_dataset()
>>> # Add data from a table in a data store if the primary input is also a table from the same data store
>>> recipe.update(
...     inputs=[
...         recipe.inputs[0],  # Primary input
...         JDBCTableDataSourceInput(
...             input_type=RecipeInputType.DATASOURCE,
...             data_source_id=data_source.id,
...             data_store_id=data_store.id,
...             dataset_id=table.id,
...             sampling=LimitSamplingOperation(rows=250),
...             alias='my_table_alias'
...         )
...     ]
... )

Update primary input sampling

You can choose to limit the number of rows to work with when iterating on your recipe operations. By specifying a sampling operation on the primary input of your recipe, you enable faster computation of the recipe preview. Sampling operations will not modify the number of rows when publishing to a dataset. You should only specify a sampling operation on the primary input. Since secondary inputs are always joined or combined with the primary input, the primary input is the only input that determines the number of rows to show in the recipe preview.

>>> import datarobot as dr
>>> from datarobot.models.recipe import RecipeDatasetInput
>>> from datarobot.models.recipe_operation import LimitSamplingOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Configure sampling for the primary input
>>> my_dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f456')
>>> dataset_input = RecipeDatasetInput.from_dataset(
...     dataset=my_dataset,
...     alias='sampled_data',
...     sampling=LimitSamplingOperation(rows=100)
... )
>>> # Update recipe with sampled input
>>> recipe.update(inputs=[dataset_input])

Update recipe wrangling operations

Wrangling operations are the building blocks of your recipe and define the transformations applied to your data. Operations are processed sequentially; the output of one operation becomes the input for the next operation, creating a transformation pipeline.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import (
...     FilterOperation,
...     FilterCondition,
...     ComputeNewOperation,
...     AggregationOperation,
...     AggregateFeature,
... )
>>> from datarobot.enums import FilterOperationFunctions, AggregationFunctions
>>>
>>> # Get your recipe
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Create a series of operations
>>> operations = [
...     # Filter rows where age > 18
...     FilterOperation(
...         conditions=[
...             FilterCondition(
...                 column="age",
...                 function=FilterOperationFunctions.GREATER_THAN,
...                 function_arguments=[18]
...             )
...         ],
...         keep_rows=True
...     ),
...     # Then create new column with full name
...     ComputeNewOperation(
...         expression="CONCAT(first_name, " ", last_name)",
...         new_feature_name="full_name"
...     ),
...     # Then group by department and calculate average salary
...     AggregationOperation(
...         aggregations=[
...             AggregateFeature(
...                 feature="salary",
...                 functions=[AggregationFunctions.AVERAGE]
...             )
...         ],
...         group_by_columns=["department"]
...     ),
... ]
>>>
>>> # Update the recipe with new list of wrangling operations
>>> recipe.update(operations=operations)

The available data wrangling operations are listed below, each with an example of its transformation.

Lags operation

The LagsOperation creates lagged versions of a column based on datetime ordering. The operation creates new columns for each specified lag.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import LagsOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Create lags for 1 and 2 days for stock price analysis
>>> lags_op = LagsOperation(
...     column="stock_price",
...     orders=[1, 2],
...     datetime_partition_column="trade_date",
...     multiseries_id_column="ticker_symbol"  # For multiseries data (multiple stocks in this example)
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[lags_op])

Primary input dataset:

ticker_symbol  trade_date  stock_price
AAPL           2024-01-01  150.00
AAPL           2024-01-02  152.50
AAPL           2024-01-03  149.75
AAPL           2024-01-04  153.20
MSFT           2024-01-01  380.00
MSFT           2024-01-02  385.75
MSFT           2024-01-03  382.30
MSFT           2024-01-04  388.90

Recipe preview:

ticker_symbol  trade_date  stock_price  stock_price (1st lag)  stock_price (2nd lag)
AAPL           2024-01-01  150.00
AAPL           2024-01-02  152.50       150.00
AAPL           2024-01-03  149.75       152.50                 150.00
AAPL           2024-01-04  153.20       149.75                 152.50
MSFT           2024-01-01  380.00
MSFT           2024-01-02  385.75       380.00
MSFT           2024-01-03  382.30       385.75                 380.00
MSFT           2024-01-04  388.90       382.30                 385.75

Window categorical statistics operation

The WindowCategoricalStatsOperation calculates categorical statistics for a rolling window, creating new columns for each statistical method. This can be used to track trends in categorical data over time.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import WindowCategoricalStatsOperation
>>> from datarobot.enums import CategoricalStatsMethods
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Compute most frequent purchase in last 3 purchases
>>> window_cat_op = WindowCategoricalStatsOperation(
...     column="product_category",
...     window_size=3,  # Last 3 purchases
...     methods=[CategoricalStatsMethods.MOST_FREQUENT],
...     datetime_partition_column="purchase_date",
...     multiseries_id_column="customer_id"
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[window_cat_op])

Primary input dataset:

customer_id  purchase_date  product_category
CUST001      2024-01-01     Electronics
CUST001      2024-01-02     Clothing
CUST001      2024-01-03     Electronics
CUST001      2024-01-04     Electronics
CUST002      2024-01-01     Books
CUST002      2024-01-02     Books
CUST002      2024-01-03     Electronics
CUST002      2024-01-04     Books

Recipe preview:

customer_id  purchase_date  product_category  product_category (3 rows most frequent)
CUST001      2024-01-01     Electronics       Electronics
CUST001      2024-01-02     Clothing          Electronics
CUST001      2024-01-03     Electronics       Electronics
CUST001      2024-01-04     Electronics       Electronics
CUST002      2024-01-01     Books             Books
CUST002      2024-01-02     Books             Books
CUST002      2024-01-03     Electronics       Books
CUST002      2024-01-04     Books             Books

Window numerical statistics operation

The WindowNumericStatsOperation calculates numeric statistics for a rolling window, creating new columns for each statistical method. This operation is useful for computing moving averages, maximums, minimums, and other statistics over time.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import WindowNumericStatsOperation
>>> from datarobot.enums import NumericStatsMethods
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Track max and average of last 3 transactions
>>> window_num_op = WindowNumericStatsOperation(
...     column="sales_amount",
...     window_size=3,  # Last 3 transactions
...     methods=[NumericStatsMethods.AVG, NumericStatsMethods.MAX],
...     datetime_partition_column="transaction_date",
...     multiseries_id_column="store_id"
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[window_num_op])

Primary input dataset:

store_id  transaction_date  sales_amount
STORE01   2024-01-01        100.00
STORE01   2024-01-02        150.00
STORE01   2024-01-03        120.00
STORE01   2024-01-04        200.00
STORE02   2024-01-01        80.00
STORE02   2024-01-02        90.00
STORE02   2024-01-03        110.00
STORE02   2024-01-04        95.00

Recipe preview:

store_id  transaction_date  sales_amount  sales_amount (3 rows avg)  sales_amount (3 rows max)
STORE01   2024-01-01        100.00        100.00                     100.00
STORE01   2024-01-02        150.00        125.00                     150.00
STORE01   2024-01-03        120.00        123.33                     150.00
STORE01   2024-01-04        200.00        156.67                     200.00
STORE02   2024-01-01        80.00         80.00                      80.00
STORE02   2024-01-02        90.00         85.00                      90.00
STORE02   2024-01-03        110.00        93.33                      110.00
STORE02   2024-01-04        95.00         98.33                      110.00

Time series operation

The TimeSeriesOperation generates a dataset ready for time series modeling by creating forecast points, forecast distances, and various time-aware features. By defining a task plan, you can execute multiple time series transformations, such as lags and rolling statistics, and add their results as features to the recipe data.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import TimeSeriesOperation, TaskPlanElement, Lags
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Define task plan for feature engineering
>>> task_plan = [
...     TaskPlanElement(
...         column="sales_amount",
...         task_list=[Lags(orders=[1])]
...     )
... ]
>>>
>>> # Create time series operation
>>> time_series_op = TimeSeriesOperation(
...     target_column="sales_amount",
...     datetime_partition_column="sale_date",
...     forecast_distances=[1],  # Predict 1 period ahead
...     task_plan=task_plan
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[time_series_op])

Primary input dataset:

store_id  sale_date   sales_amount
STORE01   2024-01-01  1000
STORE01   2024-01-02  1200
STORE01   2024-01-03  1100
STORE01   2024-01-04  1300

Recipe preview:

store_id (actual)  sale_date (actual)  sales_amount (actual)  Forecast Point  Forecast Distance  sales_amount (1st lag)  sales_amount (naive 1 row seasonal value)
STORE01            2024-01-02          1200                   2024-01-01      1                  1000                    1000
STORE01            2024-01-03          1100                   2024-01-02      1                  1200                    1200
STORE01            2024-01-04          1300                   2024-01-03      1                  1100                    1100

Compute new operation

The ComputeNewOperation creates a new feature using a SQL expression, allowing you to derive calculated fields from existing columns. This operation can be useful for creating custom business logic, mathematical transformations, and feature combinations.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import ComputeNewOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Create compute new operation to compute total cost, factoring in a discount %
>>> compute_op = ComputeNewOperation(
...     expression="ROUND(quantity * unit_price * (1 - discount), 2)",
...     new_feature_name="total_cost"
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[compute_op])

Primary input dataset:

order_id  quantity  unit_price  discount
ORD001    3         25.50       0.10
ORD002    1         15.00       0.00
ORD003    2         40.00       0.15
ORD004    5         12.25       0.05

Recipe preview:

order_id  quantity  unit_price  discount  total_cost
ORD001    3         25.50       0.10      68.85
ORD002    1         15.00       0.00      15.00
ORD003    2         40.00       0.15      68.00
ORD004    5         12.25       0.05      58.19

Rename column operation

The RenameColumnsOperation renames one or more columns. This operation is often useful for standardizing column names, making them more descriptive, or ensuring consistent column naming for specific downstream processes.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import RenameColumnsOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Rename customer id, product name and quantity columns
>>> rename_op = RenameColumnsOperation(
...     column_mappings={
...         'cust_id': 'customer_id',
...         'prod_name': 'product_name',
...         'qty': 'quantity'
...     }
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[rename_op])

Primary input dataset:

cust_id  prod_name  qty  price
C001     Widget A   3    25.99
C002     Gadget B   1    15.50
C001     Tool C     2    45.00
C003     Widget A   5    25.99

Recipe preview:

customer_id  product_name  quantity  price
C001         Widget A      3         25.99
C002         Gadget B      1         15.50
C001         Tool C        2         45.00
C003         Widget A      5         25.99

Filter operation

The FilterOperation removes or keeps rows based on one or more filter conditions. Apply multiple conditions with AND/OR logic to create complex filtering rules.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import FilterOperation, FilterCondition
>>> from datarobot.enums import FilterOperationFunctions
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Create filter conditions to keep customers over 18 with active status
>>> conditions = [
...     FilterCondition(
...         column="age",
...         function=FilterOperationFunctions.GREATER_THAN_OR_EQUALS,
...         function_arguments=[18]
...     ),
...     FilterCondition(
...         column="status",
...         function=FilterOperationFunctions.EQUALS,
...         function_arguments=["active"]
...     )
... ]
>>>
>>> # Create filter operation
>>> filter_op = FilterOperation(
...     conditions=conditions,
...     keep_rows=True,  # Keep matching rows
...     operator="and"   # Both conditions must be true
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[filter_op])

Primary input dataset:

customer_id  age  status    purchase_amount
C001         25   active    150.00
C002         17   active    75.00
C003         30   inactive  200.00
C004         22   active    95.00

Recipe preview:

customer_id  age  status  purchase_amount
C001         25   active  150.00
C004         22   active  95.00

Drop columns operation

The DropColumnsOperation removes one or more columns. This operation is useful for eliminating unnecessary fields, sensitive information, or columns that won't be used in downstream processes.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import DropColumnsOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Create operation to drop 2 extra columns
>>> drop_op = DropColumnsOperation(
...     columns=['internal_notes', 'legacy_id']
... )
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[drop_op])

Primary input dataset:

customer_id  name      email           internal_notes  legacy_id
C001         John Doe  john@email.com  VIP customer    L001
C002         Jane Doe  jane@email.com  New customer    L002
C003         Bob Lee   bob@email.com   Frequent buyer  L003

Recipe preview:

customer_id  name      email
C001         John Doe  john@email.com
C002         Jane Doe  jane@email.com
C003         Bob Lee   bob@email.com

Dedupe rows operation

The DedupeRowsOperation removes duplicate rows, keeping only unique combinations of values. Duplicates are identified by comparing values across all columns. This operation helps clean data by eliminating redundant records.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import DedupeRowsOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Create dedupe rows operation
>>> dedupe_op = DedupeRowsOperation()
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[dedupe_op])

Primary input dataset:

customer_id  product   quantity  price
C001         Widget A  2         25.99
C002         Gadget B  1         15.50
C001         Widget A  2         25.99
C003         Tool C    3         45.00
C002         Gadget B  1         15.50

Recipe preview:

customer_id  product   quantity  price
C001         Widget A  2         25.99
C002         Gadget B  1         15.50
C003         Tool C    3         45.00

Find-and-replace operation

The FindAndReplaceOperation searches for specific strings or patterns in a column and replaces them with new values. The operation supports exact matches, partial matches, or regular expressions for flexible text manipulation.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import FindAndReplaceOperation
>>> from datarobot.enums import FindAndReplaceMatchMode
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Replace instances of 'In Progress' (case insensitive) with 'Active'
>>> replace_op = FindAndReplaceOperation(
...     column="status",
...     find="In Progress",
...     replace_with="Active",
...     match_mode=FindAndReplaceMatchMode.EXACT,
...     is_case_sensitive=False
... )
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[replace_op])

Primary input dataset:

order_id  status       customer_name
ORD001    In Progress  John Smith
ORD002    Completed    Jane Doe
ORD003    in progress  Bob Johnson
ORD004    Cancelled    Alice Brown

Recipe preview:

order_id  status     customer_name
ORD001    Active     John Smith
ORD002    Completed  Jane Doe
ORD003    Active     Bob Johnson
ORD004    Cancelled  Alice Brown

Aggregation operation

The AggregationOperation groups data by the specified columns and calculates summary features such as sum, average, and count. This operation is useful for creating analytical summaries and computing derived features. A new column is created for each aggregation function applied to each feature chosen for aggregation.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import AggregationOperation, AggregateFeature
>>> from datarobot.enums import AggregationFunctions
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Group by customer id and product category
>>> # Compute the sum of orders and customer's average order amount
>>> agg_op = AggregationOperation(
...     group_by_columns=['customer_id', 'product_category'],
...     aggregations=[
...         AggregateFeature(
...             feature="order_amount",
...             functions=[AggregationFunctions.SUM, AggregationFunctions.AVERAGE]
...         )
...     ]
... )
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[agg_op])

Primary input dataset:

customer_id  product_category  order_id  order_amount
C001         Electronics       ORD001    150.00
C001         Electronics       ORD002    200.00
C001         Clothing          ORD003    75.00
C002         Electronics       ORD004    300.00
C002         Electronics       ORD005    125.00

Recipe preview:

customer_id  product_category  order_amount_sum  order_amount_avg
C001         Electronics       350.00            175.00
C001         Clothing          75.00             75.00
C002         Electronics       425.00            212.50

Join operation

The JoinOperation joins an additional data input to the recipe's current data, letting you enrich your primary dataset with information from secondary datasets. The join condition supports only one or more equality predicates.

Note

The additional data input is treated as the right side of the join.

>>> import datarobot as dr
>>> from datarobot.models.recipe import RecipeDatasetInput
>>> from datarobot.models.recipe_operation import JoinOperation
>>> from datarobot.enums import JoinType
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Get the secondary dataset and add it as an input
>>> dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f123')
>>> recipe.update(
...     inputs=[
...         recipe.inputs[0],  # Keep the original primary input
...         RecipeDatasetInput.from_dataset(
...             dataset=dataset,
...             alias='customers'
...         )
...     ]
... )
>>>
>>> # Join secondary dataset on customer id
>>> # Right dataset in join will always be the new dataset
>>> join_op = JoinOperation.join_dataset(
...     dataset=dataset,
...     join_type=JoinType.INNER,
...     right_prefix='cust_',
...     left_keys=['customer_id'],
...     right_keys=['id']
... )
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[join_op])

Primary input dataset (orders)

order_id  customer_id  amount
ORD001    C001         150.00
ORD002    C002         200.00
ORD003    C001         75.00

Secondary input dataset (customers)

id    name        city
C001  John Smith  New York
C002  Jane Doe    Los Angeles
C003  Bob Lee     Chicago

Recipe preview:

order_id  customer_id  amount  cust_id  cust_name   cust_city
ORD001    C001         150.00  C001     John Smith  New York
ORD002    C002         200.00  C002     Jane Doe    Los Angeles
ORD003    C001         75.00   C001     John Smith  New York

Set recipe SQL transformation directly

For advanced use cases, you can set the recipe's transformation using a SQL expression. This provides maximum flexibility for complex operations that may not be available through standard wrangling operations.

Important: Setting SQL directly changes the recipe type to SQL and bypasses any existing wrangling operations.

>>> import datarobot as dr
>>>
>>> # Get your recipe
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Define your SQL transformation
>>> sql_query = "MY SQL EXPRESSION HERE"
>>> # Update the recipe with SQL
>>> recipe.update(sql=sql_query)
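
For example, a hypothetical aggregation query. How input tables are referenced in the SQL depends on your recipe's dialect and input aliases, so treat the table name orders here as a placeholder:

>>> # Sketch only: 'orders' stands in for your input table or alias
>>> sql_query = """
... SELECT customer_id, SUM(order_amount) AS total_spend
... FROM orders
... GROUP BY customer_id
... """
>>> recipe.update(sql=sql_query)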

Preview recipe data

Before publishing your recipe, you can preview the transformed data with Recipe.get_preview to validate your transformations and ensure they produce the expected results.

>>> import datarobot as dr
>>>
>>> # Get your recipe
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Generate a preview of the transformed data
>>> preview = recipe.get_preview()
>>> # View preview data as a DataFrame
>>> preview.df
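
Because preview.df is a DataFrame, you can inspect it with standard tools to sanity-check your transformations before publishing (a minimal sketch, assuming pandas):

>>> preview.df.head()           # Spot-check the first rows
>>> list(preview.df.columns)    # Confirm the expected columns exist
>>> preview.df.shape            # Check the preview's row and column counts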

Update recipe downsampling

Downsampling modifies the size of the dataset published by the recipe, which can improve performance for large datasets and speed up development and testing. This is particularly useful when working with millions of rows and a representative sample is sufficient for the published dataset. Downsampling does not affect the number of rows in the recipe preview. Set a recipe's downsampling with a downsampling operation.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import RandomDownsamplingOperation
>>>
>>> # Get your recipe
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Configure random downsampling to 50,000 rows
>>> downsampling = RandomDownsamplingOperation(max_rows=50_000)
>>> # Apply downsampling to the recipe
>>> recipe.update(downsampling=downsampling)
>>> # Disable downsampling
>>> recipe.update(downsampling=None)

Publish recipe to dataset

Once your recipe is complete, you can publish it with Recipe.publish_to_dataset to create a dataset with your transformed data.

>>> import datarobot as dr
>>>
>>> # Get your recipe
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Publish recipe to create a new dataset
>>> dataset = recipe.publish_to_dataset(
...     name="Customer Segmentation Data",
...     do_snapshot=True
... )
>>>
>>> # Publish and attach to an existing use case
>>> use_case = dr.UseCase.get('5e1b4f8f2a3c4d5e6f708192')
>>> dataset_with_use_case = recipe.publish_to_dataset(
...     name="Advanced Customer Analytics",
...     use_cases=use_case
... )