Recipes

To clean, prepare, and wrangle your data into your desired shape, DataRobot provides reusable recipes for data preparation. Each recipe acts like a blueprint, taking one or more datasets or data sources as input and applying a series of operations to filter, modify, join, or transform your data. You can then use the recipe to create a dataset ready for consumption. Recipes allow for quick iteration on data prep workflows and enable reuse through a simple operations API.

Recipe terminology

Recipes use the following terminology:

  • Recipe: A reusable blueprint for how to create a new dataset by applying operations to transform one or more data inputs.
  • Recipe dialect: The dialect that data wrangling operations use when working with recipe inputs. For example, use the Snowflake dialect when working with data assets from Snowflake.
  • Input: A dataset or data source providing data to a recipe. A recipe can have multiple inputs. A recipe's inputs must either be all datasets, or all tables from data sources pointing to the same data store.
  • Primary input: The input used as the base for the recipe. If no operations are applied by the recipe, a dataset identical to the primary input will be output by the recipe. A recipe will only have a single primary input.
  • Secondary input: An additional input to a recipe. A recipe can have multiple secondary inputs. Data from secondary inputs must be introduced into a recipe via a join or similar operation.
  • Recipe preview: A sample view of the recipe's data computed by applying the operations in a recipe on its inputs. The data featured in a recipe's preview is generally a sample of the recipe's fully transformed data.
  • Sampling: A setting through which the number of rows read from a recipe's primary input is modified when computing the recipe's preview.
  • Downsampling: A setting through which the number of rows written to the dataset published by a recipe is modified.
  • Operation: A way to modify how a recipe works with data from its inputs.
  • Wrangling operation: A transformation to apply to a recipe's data. Recipes can stack multiple wrangling operations on top of each other to transform data from their inputs.
  • Downsampling operation: A modification to the number of rows a recipe writes to a dataset when publishing. Recipes can optionally use a single downsampling operation.
  • Sampling operation: A modification to the number of rows read from a recipe's primary data input. Recipes can optionally set a single sampling operation on their primary input.
  • Publishing: The action of creating a new dataset containing the result of applying the recipe's operations to its inputs.

Review the recommended workflow below to create, iterate on, and publish with recipes; a minimal end-to-end sketch follows the steps.

  1. Create a datarobot.Recipe to work on a datarobot.Dataset or data from a datarobot.DataSource. The recipe will belong to a datarobot.UseCase.
  2. Modify the recipe by updating its metadata, settings, inputs, operations, or downsampling.
  3. Verify the recipe's data by requesting a recipe preview. If you are unhappy with the result, go back to step 2.
  4. Publish the recipe to create a new datarobot.Dataset constructed according to the transformations in the recipe.
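
The sketch below strings these four steps together using only calls documented in the sections that follow (IDs and names are placeholders):

>>> import datarobot as dr
>>> from datarobot.enums import DataWranglingDialect, RecipeType
>>> from datarobot.models.recipe_operation import DropColumnsOperation
>>>
>>> # Step 1: Create a recipe from an existing dataset in a use case
>>> my_use_case = dr.UseCase.list(search_params={"search": "My Use Case"})[0]
>>> dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f123')
>>> recipe = dr.Recipe.from_dataset(
...     use_case=my_use_case,
...     dataset=dataset,
...     dialect=DataWranglingDialect.SPARK,
...     recipe_type=RecipeType.WRANGLING
... )
>>> # Step 2: Modify the recipe, here by dropping a column
>>> recipe.update(operations=[DropColumnsOperation(columns=['legacy_id'])])
>>> # Step 3: Verify the transformed data via the recipe preview
>>> recipe.get_preview().df
>>> # Step 4: Publish the recipe to a new dataset
>>> published_dataset = recipe.publish_to_dataset(name="My Wrangled Data")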

Create a recipe

There are two ways to create a recipe: from a dataset or from a table in a JDBC data source. Either one becomes the primary input for the recipe. You will also need a datarobot.UseCase, as each recipe belongs to a use case. Choose the DataWranglingDialect that best matches the source of the dataset or data source.

Create a recipe from a dataset

Use the Recipe.from_dataset method to create a recipe from an existing datarobot.Dataset:

>>> import datarobot as dr
>>> from datarobot.enums import DataWranglingDialect, RecipeType
>>> from datarobot.models.recipe_operation import RandomSamplingOperation
>>>
>>> # Get your use case and dataset
>>> my_use_case = dr.UseCase.list(search_params={"search": "My Use Case"})[0]
>>> dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f123')
>>>
>>> # Create a recipe from the dataset
>>> recipe = dr.Recipe.from_dataset(
...     use_case=my_use_case,
...     dataset=dataset,
...     dialect=DataWranglingDialect.SPARK,
...     recipe_type=RecipeType.WRANGLING,
...     sampling=RandomSamplingOperation(rows=500)
... )

Create a recipe from a JDBC table

Use the Recipe.from_data_store method to create a recipe directly from tables in a connected data source:

>>> import datarobot as dr
>>> from datarobot.enums import DataWranglingDataSourceTypes, DataWranglingDialect, RecipeType
>>> from datarobot.models.recipe import DataSourceInput
>>> from datarobot.models.recipe_operation import LimitSamplingOperation
>>>
>>> # Configure your data source input
>>> data_source_input = DataSourceInput(
...     canonical_name='Sales_Data_Connection', # data connection name
...     table='sales_transactions',
...     schema='PUBLIC',
...     sampling=LimitSamplingOperation(rows=1000)
... )
>>>
>>> # Get your use case and data store
>>> my_use_case = dr.UseCase.list(search_params={"search": "Sales Analysis"})[0]
>>> data_store = dr.DataStore.get('2b33a1b2c9e88f0001e6f657')
>>>
>>> # Create recipe from data source
>>> recipe = dr.Recipe.from_data_store(
...     use_case=my_use_case,
...     data_store=data_store,
...     data_source_type=DataWranglingDataSourceTypes.JDBC,
...     dialect=DataWranglingDialect.POSTGRES,
...     data_source_inputs=[data_source_input],
...     recipe_type=RecipeType.WRANGLING
... )

Retrieve recipes

You can retrieve a specific recipe by ID, or a list of all recipes, filtering the list as required.

>>> import datarobot as dr
>>>
>>> # Get a specific recipe by ID
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # List all recipes
>>> all_recipes = dr.Recipe.list()
>>>
>>> # Filter recipes. Use any number of params to filter.
>>> filtered_recipes = dr.Recipe.list(
...     search="My Recipe Name",
...     dialect=dr.enums.DataWranglingDialect.SPARK,
...     status="draft",
...     recipe_type=dr.enums.RecipeType.WRANGLING,
...     order_by="-updatedAt",  # Most recently updated first
...     created_by_username="data_scientist_user"
... )

Retrieve information about a recipe

The recipe object contains basic information about the recipe that you can query, as shown below.

>>> import datarobot as dr
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> recipe.id
'690bbf77aa31530d8287ae5f'
>>> recipe.name
'Customer Segmentation Dataset Recipe'

You can also retrieve the list of inputs and operations, as well as the settings for downsampling and general recipe settings.

>>> import datarobot as dr
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> # Access inputs, operations, downsampling and settings
>>> inputs = recipe.inputs
>>> primary_input = inputs[0] # First input is the primary input
>>> secondary_inputs = inputs[1:] # All others in the list are secondary inputs
>>>
>>> operations = recipe.operations
>>> downsampling_operation = recipe.downsampling
>>> settings = recipe.settings

Update recipe metadata fields

You can update the recipe's metadata fields (name, description, etc.) with Recipe.update as follows:

>>> import datarobot as dr
>>>
>>> # Retrieve an existing recipe
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Update metadata fields
>>> recipe.update(
...     name="Customer Segmentation Dataset Recipe",
...     description="Recipe to create customer segmentation dataset."
... )

Update recipe inputs

You can update the list of inputs for a recipe with the Recipe.update method, as shown below. Updating the list of inputs changes the data fed into the recipe for transformation. The first input in the list becomes the recipe's primary input; the rest are secondary inputs.

Recipe input considerations

The Recipe.update method will replace all existing inputs. If adding inputs, always include the existing primary input to avoid breaking the recipe.

Data from secondary inputs will not appear in the recipe preview unless it is joined or otherwise combined with data from the primary input.

Recipe inputs must either be all datasets, or all tables from data sources pointing to the same data store.

>>> import datarobot as dr
>>> from datarobot.enums import RecipeInputType
>>> from datarobot.models.recipe import RecipeDatasetInput, JDBCTableDataSourceInput
>>> from datarobot.models.recipe_operation import LimitSamplingOperation
>>>
>>> # Get the recipe and additional datasets
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> secondary_dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f456')
>>>
>>> # Add a secondary dataset input if the primary input is also a dataset
>>> recipe.update(
...     inputs=[
...         recipe.inputs[0],  # Keep the original primary input
...         RecipeDatasetInput.from_dataset(
...             dataset=secondary_dataset,
...             alias='customers_data'
...         )
...     ]
... )
>>>
>>> # You can also add data from a table in a data store
>>> data_store = dr.DataStore.get('5e1b4f8f2a3c4d5e6f708192')
>>> data_source = dr.DataSource.create(
...     data_source_type="jdbc",
...     canonical_name="My Snowflake connection",
...     params=dr.DataSourceParameters(
...         data_store_id=data_store.id,
...         schema="PUBLIC",
...         table="stock_prices"
...     )
... )
>>> table = data_source.create_dataset()
>>> # Add data from a table in a data store if the primary input is also a table from the same data store
>>> recipe.update(
...     inputs=[
...         recipe.inputs[0],  # Primary input
...         JDBCTableDataSourceInput(
...             input_type=RecipeInputType.DATASOURCE,
...             data_source_id=data_source.id,
...             data_store_id=data_store.id,
...             dataset_id=table.id,
...             sampling=LimitSamplingOperation(rows=250),
...             alias='my_table_alias'
...         )
...     ]
... )

Update primary input sampling

You can choose to limit the number of rows to work with when iterating on your recipe operations. By specifying a sampling operation on the primary input of your recipe, you enable faster computation of the recipe preview. Sampling operations will not modify the number of rows when publishing to a dataset. You should only specify a sampling operation on the primary input. Since secondary inputs are always joined or combined with the primary input, the primary input is the only input that determines the number of rows to show in the recipe preview.

>>> import datarobot as dr
>>> from datarobot.models.recipe import RecipeDatasetInput
>>> from datarobot.models.recipe_operation import LimitSamplingOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Configure sampling for the primary input
>>> my_dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f456')
>>> dataset_input = RecipeDatasetInput.from_dataset(
...     dataset=my_dataset,
...     alias='sampled_data',
...     sampling=LimitSamplingOperation(rows=100)
... )
>>> # Update recipe with sampled input
>>> recipe.update(inputs=[dataset_input])

Update recipe wrangling operations

Wrangling operations are the building blocks of your recipe and define the transformations applied to your data. Operations are processed sequentially; the output of one operation becomes the input for the next operation, creating a transformation pipeline.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import (
...     FilterOperation,
...     FilterCondition,
...     ComputeNewOperation,
...     AggregationOperation,
...     AggregateFeature,
... )
>>> from datarobot.enums import FilterOperationFunctions, AggregationFunctions
>>>
>>> # Get your recipe
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Create a series of operations
>>> operations = [
...     # Filter rows where age > 18
...     FilterOperation(
...         conditions=[
...             FilterCondition(
...                 column="age",
...                 function=FilterOperationFunctions.GREATER_THAN,
...                 function_arguments=[18]
...             )
...         ],
...         keep_rows=True
...     ),
...     # Then create new column with full name
...     ComputeNewOperation(
...         expression="CONCAT(first_name, " ", last_name)",
...         new_feature_name="full_name"
...     ),
...     # Then group by department and calculate average salary
...     AggregationOperation(
...         aggregations=[
...             AggregateFeature(
...                 feature="salary",
...                 functions=[AggregationFunctions.AVERAGE]
...             )
...         ],
...         group_by_columns=["department"]
...     ),
... ]
>>>
>>> # Update the recipe with new list of wrangling operations
>>> recipe.update(operations=operations)

The available data wrangling operations are listed below, each with an example of its transformation.

Lags operation

The LagsOperation creates lagged versions of a column based on datetime ordering. The operation creates new columns for each specified lag.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import LagsOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Create lags for 1 and 2 days for stock price analysis
>>> lags_op = LagsOperation(
...     column="stock_price",
...     orders=[1, 2],
...     datetime_partition_column="trade_date",
...     multiseries_id_column="ticker_symbol"  # For multiseries data (multiple stocks in this example)
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[lags_op])

Primary input dataset:

ticker_symbol  trade_date  stock_price
AAPL           2024-01-01  150.00
AAPL           2024-01-02  152.50
AAPL           2024-01-03  149.75
AAPL           2024-01-04  153.20
MSFT           2024-01-01  380.00
MSFT           2024-01-02  385.75
MSFT           2024-01-03  382.30
MSFT           2024-01-04  388.90

Recipe preview:

ticker_symbol  trade_date  stock_price  stock_price (1st lag)  stock_price (2nd lag)
AAPL           2024-01-01  150.00
AAPL           2024-01-02  152.50       150.00
AAPL           2024-01-03  149.75       152.50                 150.00
AAPL           2024-01-04  153.20       149.75                 152.50
MSFT           2024-01-01  380.00
MSFT           2024-01-02  385.75       380.00
MSFT           2024-01-03  382.30       385.75                 380.00
MSFT           2024-01-04  388.90       382.30                 385.75

Window categorical statistics operation

The WindowCategoricalStatsOperation calculates categorical statistics for a rolling window, creating new columns for each statistical method. This can be used to track trends in categorical data over time.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import WindowCategoricalStatsOperation
>>> from datarobot.enums import CategoricalStatsMethods
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Compute most frequent purchase in last 3 purchases
>>> window_cat_op = WindowCategoricalStatsOperation(
...     column="product_category",
...     window_size=3,  # Last 3 purchases
...     methods=[CategoricalStatsMethods.MOST_FREQUENT],
...     datetime_partition_column="purchase_date",
...     multiseries_id_column="customer_id"
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[window_cat_op])

Primary input dataset:

customer_id  purchase_date  product_category
CUST001      2024-01-01     Electronics
CUST001      2024-01-02     Clothing
CUST001      2024-01-03     Electronics
CUST001      2024-01-04     Electronics
CUST002      2024-01-01     Books
CUST002      2024-01-02     Books
CUST002      2024-01-03     Electronics
CUST002      2024-01-04     Books

Recipe preview:

customer_id  purchase_date  product_category  product_category (3 rows most frequent)
CUST001      2024-01-01     Electronics       Electronics
CUST001      2024-01-02     Clothing          Electronics
CUST001      2024-01-03     Electronics       Electronics
CUST001      2024-01-04     Electronics       Electronics
CUST002      2024-01-01     Books             Books
CUST002      2024-01-02     Books             Books
CUST002      2024-01-03     Electronics       Books
CUST002      2024-01-04     Books             Books

Window numerical statistics operation

The WindowNumericStatsOperation calculates numeric statistics for a rolling window, creating new columns for each statistical method. This operation is useful for computing moving averages, maximums, minimums, and other statistics over time.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import WindowNumericStatsOperation
>>> from datarobot.enums import NumericStatsMethods
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Track max and average of last 3 transactions
>>> window_num_op = WindowNumericStatsOperation(
...     column="sales_amount",
...     window_size=3,  # Last 3 transactions
...     methods=[NumericStatsMethods.AVG, NumericStatsMethods.MAX],
...     datetime_partition_column="transaction_date",
...     multiseries_id_column="store_id"
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[window_num_op])

Primary input dataset:

store_id  transaction_date  sales_amount
STORE01   2024-01-01        100.00
STORE01   2024-01-02        150.00
STORE01   2024-01-03        120.00
STORE01   2024-01-04        200.00
STORE02   2024-01-01        80.00
STORE02   2024-01-02        90.00
STORE02   2024-01-03        110.00
STORE02   2024-01-04        95.00

Recipe preview:

store_id  transaction_date  sales_amount  sales_amount (3 rows avg)  sales_amount (3 rows max)
STORE01   2024-01-01        100.00        100.00                     100.00
STORE01   2024-01-02        150.00        125.00                     150.00
STORE01   2024-01-03        120.00        123.33                     150.00
STORE01   2024-01-04        200.00        156.67                     200.00
STORE02   2024-01-01        80.00         80.00                      80.00
STORE02   2024-01-02        90.00         85.00                      90.00
STORE02   2024-01-03        110.00        93.33                      110.00
STORE02   2024-01-04        95.00         98.33                      110.00

Time series operation

The TimeSeriesOperation generates a dataset ready for time series modeling by creating forecast points, forecast distances, and various time-aware features. By defining a task plan, you can execute multiple time series transformations, such as lags and rolling statistics, and add their results as features to the recipe data.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import TimeSeriesOperation, TaskPlanElement, Lags
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Define task plan for feature engineering
>>> task_plan = [
...     TaskPlanElement(
...         column="sales_amount",
...         task_list=[Lags(orders=[1])]
...     )
... ]
>>>
>>> # Create time series operation
>>> time_series_op = TimeSeriesOperation(
...     target_column="sales_amount",
...     datetime_partition_column="sale_date",
...     forecast_distances=[1],  # Predict 1 period ahead
...     task_plan=task_plan
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[time_series_op])

Primary input dataset:

store_id  sale_date   sales_amount
STORE01   2024-01-01  1000
STORE01   2024-01-02  1200
STORE01   2024-01-03  1100
STORE01   2024-01-04  1300

Recipe preview:

store_id (actual)  sale_date (actual)  sales_amount (actual)  Forecast Point  Forecast Distance  sales_amount (1st lag)  sales_amount (naive 1 row seasonal value)
STORE01            2024-01-02          1200                   2024-01-01      1                  1000                    1000
STORE01            2024-01-03          1100                   2024-01-02      1                  1200                    1200
STORE01            2024-01-04          1300                   2024-01-03      1                  1100                    1100

Compute new operation

The ComputeNewOperation creates a new feature using a SQL expression, allowing you to derive calculated fields from existing columns. This operation can be useful for creating custom business logic, mathematical transformations, and feature combinations.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import ComputeNewOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Create compute new operation to compute total cost, factoring in a discount %
>>> compute_op = ComputeNewOperation(
...     expression="ROUND(quantity * unit_price * (1 - discount), 2)",
...     new_feature_name="total_cost"
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[compute_op])

Primary input dataset:

order_id  quantity  unit_price  discount
ORD001    3         25.50       0.10
ORD002    1         15.00       0.00
ORD003    2         40.00       0.15
ORD004    5         12.25       0.05

Recipe preview:

order_id  quantity  unit_price  discount  total_cost
ORD001    3         25.50       0.10      68.85
ORD002    1         15.00       0.00      15.00
ORD003    2         40.00       0.15      68.00
ORD004    5         12.25       0.05      58.19

Rename column operation

The RenameColumnsOperation renames one or more columns. This operation is often useful for standardizing column names, making them more descriptive, or ensuring consistent column naming for specific downstream processes.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import RenameColumnsOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Rename customer id, product name and quantity columns
>>> rename_op = RenameColumnsOperation(
...     column_mappings={
...         'cust_id': 'customer_id',
...         'prod_name': 'product_name',
...         'qty': 'quantity'
...     }
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[rename_op])

Primary input dataset:

cust_id  prod_name  qty  price
C001     Widget A   3    25.99
C002     Gadget B   1    15.50
C001     Tool C     2    45.00
C003     Widget A   5    25.99

Recipe preview:

customer_id  product_name  quantity  price
C001         Widget A      3         25.99
C002         Gadget B      1         15.50
C001         Tool C        2         45.00
C003         Widget A      5         25.99

Filter operation

The FilterOperation removes or keeps rows based on one or more filter conditions. Apply multiple conditions with AND/OR logic to create complex filtering rules.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import FilterOperation, FilterCondition
>>> from datarobot.enums import FilterOperationFunctions
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Create filter conditions to keep customers over 18 with active status
>>> conditions = [
...     FilterCondition(
...         column="age",
...         function=FilterOperationFunctions.GREATER_THAN_OR_EQUALS,
...         function_arguments=[18]
...     ),
...     FilterCondition(
...         column="status",
...         function=FilterOperationFunctions.EQUALS,
...         function_arguments=["active"]
...     )
... ]
>>>
>>> # Create filter operation
>>> filter_op = FilterOperation(
...     conditions=conditions,
...     keep_rows=True,  # Keep matching rows
...     operator="and"   # Both conditions must be true
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[filter_op])

Primary input dataset:

customer_id  age  status    purchase_amount
C001         25   active    150.00
C002         17   active    75.00
C003         30   inactive  200.00
C004         22   active    95.00

Recipe preview:

customer_id  age  status  purchase_amount
C001         25   active  150.00
C004         22   active  95.00

Drop columns operation

The DropColumnsOperation removes one or more columns. This operation is useful for eliminating unnecessary fields, sensitive information, or columns that won't be used in downstream processes.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import DropColumnsOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Create operation to drop 2 extra columns
>>> drop_op = DropColumnsOperation(
...     columns=['internal_notes', 'legacy_id']
... )
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[drop_op])

Primary input dataset:

customer_id  name      email           internal_notes  legacy_id
C001         John Doe  john@email.com  VIP customer    L001
C002         Jane Doe  jane@email.com  New customer    L002
C003         Bob Lee   bob@email.com   Frequent buyer  L003

Recipe preview:

customer_id  name      email
C001         John Doe  john@email.com
C002         Jane Doe  jane@email.com
C003         Bob Lee   bob@email.com

Dedupe rows operation

The DedupeRowsOperation removes duplicate rows, keeping only unique combinations of values. Duplicates are identified by comparing values across all columns. This operation helps clean data by eliminating redundant records.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import DedupeRowsOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Create dedupe rows operation
>>> dedupe_op = DedupeRowsOperation()
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[dedupe_op])

Primary input dataset:

customer_id  product   quantity  price
C001         Widget A  2         25.99
C002         Gadget B  1         15.50
C001         Widget A  2         25.99
C003         Tool C    3         45.00
C002         Gadget B  1         15.50

Recipe preview:

customer_id  product   quantity  price
C001         Widget A  2         25.99
C002         Gadget B  1         15.50
C003         Tool C    3         45.00

Find-and-replace operation

The FindAndReplaceOperation searches for specific strings or patterns in a column and replaces them with new values. The operation supports exact matches, partial matches, or regular expressions for flexible text manipulation.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import FindAndReplaceOperation
>>> from datarobot.enums import FindAndReplaceMatchMode
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Replace instances of 'In Progress' (case insensitive) with 'Active'
>>> replace_op = FindAndReplaceOperation(
...     column="status",
...     find="In Progress",
...     replace_with="Active",
...     match_mode=FindAndReplaceMatchMode.EXACT,
...     is_case_sensitive=False
... )
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[replace_op])

Primary input dataset:

order_id  status       customer_name
ORD001    In Progress  John Smith
ORD002    Completed    Jane Doe
ORD003    in progress  Bob Johnson
ORD004    Cancelled    Alice Brown

Recipe preview:

order_id  status     customer_name
ORD001    Active     John Smith
ORD002    Completed  Jane Doe
ORD003    Active     Bob Johnson
ORD004    Cancelled  Alice Brown

Aggregation operation

The AggregationOperation groups data by the specified columns and calculates summary features such as sum, average, and count. This operation is useful for creating analytical summaries and computing derived features. A new column is created for each aggregation function applied to each feature chosen for aggregation.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import AggregationOperation, AggregateFeature
>>> from datarobot.enums import AggregationFunctions
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Group by customer id and product category
>>> # Compute the sum of orders and customer's average order amount
>>> agg_op = AggregationOperation(
...     group_by_columns=['customer_id', 'product_category'],
...     aggregations=[
...         AggregateFeature(
...             feature="order_amount",
...             functions=[AggregationFunctions.SUM, AggregationFunctions.AVERAGE]
...         )
...     ]
... )
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[agg_op])

Primary input dataset:

customer_id  product_category  order_id  order_amount
C001         Electronics       ORD001    150.00
C001         Electronics       ORD002    200.00
C001         Clothing          ORD003    75.00
C002         Electronics       ORD004    300.00
C002         Electronics       ORD005    125.00

Recipe preview:

customer_id  product_category  order_amount_sum  order_amount_avg
C001         Electronics       350.00            175.00
C001         Clothing          75.00             75.00
C002         Electronics       425.00            212.50

Join operation

The JoinOperation joins an additional data input to the recipe's current data, letting you enrich your primary dataset with information from secondary datasets. The join condition supports only one or more equality predicates.

Note

The additional data input is treated as the right side of the join.

>>> import datarobot as dr
>>> from datarobot.models.recipe import RecipeDatasetInput
>>> from datarobot.models.recipe_operation import JoinOperation
>>> from datarobot.enums import JoinType
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Get the secondary dataset and add it as an input
>>> dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f123')
>>> recipe.update(
...     inputs=[
...         recipe.inputs[0],  # Keep the original primary input
...         RecipeDatasetInput.from_dataset(
...             dataset=dataset,
...             alias='customers'
...         )
...     ]
... )
>>>
>>> # Join secondary dataset on customer id
>>> # Right dataset in join will always be the new dataset
>>> join_op = JoinOperation.join_dataset(
...     dataset=dataset,
...     join_type=JoinType.INNER,
...     right_prefix='cust_',
...     left_keys=['customer_id'],
...     right_keys=['id']
... )
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[join_op])

Primary input dataset (orders)

order_id  customer_id  amount
ORD001    C001         150.00
ORD002    C002         200.00
ORD003    C001         75.00

Secondary input dataset (customers)

id    name        city
C001  John Smith  New York
C002  Jane Doe    Los Angeles
C003  Bob Lee     Chicago

Recipe preview:

order_id  customer_id  amount  cust_id  cust_name   cust_city
ORD001    C001         150.00  C001     John Smith  New York
ORD002    C002         200.00  C002     Jane Doe    Los Angeles
ORD003    C001         75.00   C001     John Smith  New York

Set recipe SQL transformation directly

For advanced use cases, you can set the recipe's transformation using a SQL expression. This provides maximum flexibility for complex operations that may not be available through standard wrangling operations.

Important: Setting SQL directly changes the recipe type to SQL and bypasses any existing wrangling operations.

>>> import datarobot as dr
>>>
>>> # Get your recipe
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Define your SQL transformation
>>> sql_query = "MY SQL EXPRESSION HERE"
>>> # Update the recipe with SQL
>>> recipe.update(sql=sql_query)
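
For example, a hypothetical aggregation query. How input tables are referenced in the SQL depends on your recipe's dialect and input aliases, so treat the table name orders here as a placeholder:

>>> # Sketch only: 'orders' stands in for your input table or alias
>>> sql_query = """
... SELECT customer_id, SUM(order_amount) AS total_spend
... FROM orders
... GROUP BY customer_id
... """
>>> recipe.update(sql=sql_query)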

Preview recipe data

Before publishing your recipe, you can preview the transformed data with Recipe.get_preview to validate your transformations and ensure they produce the expected results.

>>> import datarobot as dr
>>>
>>> # Get your recipe
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Generate a preview of the transformed data
>>> preview = recipe.get_preview()
>>> # View preview data as a DataFrame
>>> preview.df
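
Because preview.df is a DataFrame, you can inspect it with standard tools to sanity-check your transformations before publishing (a minimal sketch, assuming pandas):

>>> preview.df.head()           # Spot-check the first rows
>>> list(preview.df.columns)    # Confirm the expected columns exist
>>> preview.df.shape            # Check the preview's row and column counts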

Update recipe downsampling

Downsampling modifies the size of the dataset published by the recipe, which can improve performance for large datasets and speed up development and testing. This is particularly useful when working with millions of rows and a representative sample is sufficient for the published dataset. Downsampling does not affect the number of rows in the recipe preview. Set a recipe's downsampling with a downsampling operation.

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import RandomDownsamplingOperation
>>>
>>> # Get your recipe
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Configure random downsampling to 50,000 rows
>>> downsampling = RandomDownsamplingOperation(max_rows=50_000)
>>> # Apply downsampling to the recipe
>>> recipe.update(downsampling=downsampling)
>>> # Disable downsampling
>>> recipe.update(downsampling=None)

Publish recipe to dataset

Once your recipe is complete, you can publish it with Recipe.publish_to_dataset to create a dataset with your transformed data.

>>> import datarobot as dr
>>>
>>> # Get your recipe
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Publish recipe to create a new dataset
>>> dataset = recipe.publish_to_dataset(
...     name="Customer Segmentation Data",
...     do_snapshot=True
... )
>>>
>>> # Publish and attach to an existing use case
>>> use_case = dr.UseCase.get('5e1b4f8f2a3c4d5e6f708192')
>>> dataset_with_use_case = recipe.publish_to_dataset(
...     name="Advanced Customer Analytics",
...     use_cases=use_case
... )