Recipes¶
To clean, prepare, and wrangle your data into your desired shape, DataRobot provides reusable recipes for data preparation. Each recipe acts like a blueprint, taking one or more datasets or data sources as input and applying a series of operations to filter, modify, join, or transform your data. You can then use the recipe to create a dataset ready for consumption. Recipes allow for quick iteration on data preparation workflows and enable reuse through a simple operations API.
Recipe terminology¶
Recipes use the following terminology:
- Recipe: A reusable blueprint for how to create a new dataset by applying operations to transform one or more data inputs.
- Recipe dialect: The dialect data wrangling operations should use when working with recipe inputs. For example, use the Snowflake dialect when working with data assets from Snowflake.
- Input: A dataset or data source providing data to a recipe. A recipe can have multiple inputs. A recipe's inputs must either be all datasets, or all tables from data sources pointing to the same data store.
- Primary input: The input used as the base for the recipe. If no operations are applied by the recipe, a dataset identical to the primary input will be output by the recipe. A recipe will only have a single primary input.
- Secondary input: An additional input to a recipe. A recipe can have multiple secondary inputs. Data from secondary inputs must be introduced into a recipe via a join or similar operation.
- Recipe preview: A sample view of the recipe's data, computed by applying the operations in a recipe to its inputs. The data featured in a recipe's preview is generally a sample of the recipe's fully transformed data.
- Sampling: A setting that modifies the number of rows read from a recipe's primary input when computing the recipe's preview.
- Downsampling: A setting that modifies the number of rows written to the dataset published by a recipe.
- Operation: A way to modify how a recipe works with data from its inputs.
- Wrangling operation: A transformation to apply to a recipe's data. Recipes can stack multiple wrangling operations on top of each other to transform data from their inputs.
- Downsampling operation: A modification to the number of rows the recipe writes to a dataset when publishing. Recipes can optionally use a single downsampling operation.
- Sampling operation: A modification to the number of rows read from a recipe's primary data input. Recipes can optionally set a single sampling operation on their primary input.
- Publishing: The action of creating a new dataset containing the result of applying the recipe's operations to its inputs.
Review the recommended workflow below to create, iterate on, and publish recipes.
1. Create a datarobot.Recipe to work on a datarobot.Dataset or data from a datarobot.DataSource. The recipe will belong to a datarobot.UseCase.
2. Modify the recipe by updating its metadata, settings, inputs, operations, or downsampling.
3. Verify the recipe's data by requesting a recipe preview. If you are unhappy with the result, go back to step 2.
4. Publish the recipe to create a new datarobot.Dataset constructed according to the transformations in the recipe.
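The following minimal sketch strings these steps together, using methods that are covered in detail in the sections below (IDs and names are placeholders):
>>> import datarobot as dr
>>> from datarobot.enums import DataWranglingDialect, RecipeType
>>> from datarobot.models.recipe_operation import DropColumnsOperation
>>>
>>> # 1. Create a recipe from a dataset in a use case
>>> use_case = dr.UseCase.list(search_params={"search": "My Use Case"})[0]
>>> dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f123')
>>> recipe = dr.Recipe.from_dataset(
...     use_case=use_case,
...     dataset=dataset,
...     dialect=DataWranglingDialect.SPARK,
...     recipe_type=RecipeType.WRANGLING
... )
>>> # 2. Modify the recipe with a wrangling operation
>>> recipe.update(operations=[DropColumnsOperation(columns=['internal_notes'])])
>>> # 3. Verify the recipe's data via a preview
>>> preview = recipe.get_preview()
>>> # 4. Publish the recipe to a new dataset
>>> new_dataset = recipe.publish_to_dataset(name="My Wrangled Data")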
Create a recipe¶
There are two ways to create a recipe: from a dataset or from a table in a JDBC data source. That dataset or table becomes the primary input for the recipe. You will also need a datarobot.UseCase, as each recipe belongs to a use case. Choose the DataWranglingDialect that best matches the source of the dataset or data source.
Create a recipe from a dataset¶
Use the Recipe.from_dataset method to create a recipe from an existing datarobot.Dataset:
>>> import datarobot as dr
>>> from datarobot.enums import DataWranglingDialect, RecipeType
>>> from datarobot.models.recipe_operation import RandomSamplingOperation
>>>
>>> # Get your use case and dataset
>>> my_use_case = dr.UseCase.list(search_params={"search": "My Use Case"})[0]
>>> dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f123')
>>>
>>> # Create a recipe from the dataset
>>> recipe = dr.Recipe.from_dataset(
... use_case=my_use_case,
... dataset=dataset,
... dialect=DataWranglingDialect.SPARK,
... recipe_type=RecipeType.WRANGLING,
... sampling=RandomSamplingOperation(rows=500)
... )
Create a recipe from a JDBC table¶
Use the Recipe.from_data_store method to create a recipe directly from tables in a connected data source:
>>> import datarobot as dr
>>> from datarobot.enums import DataWranglingDataSourceTypes, DataWranglingDialect, RecipeType
>>> from datarobot.models.recipe import DataSourceInput
>>> from datarobot.models.recipe_operation import LimitSamplingOperation
>>>
>>> # Configure your data source input
>>> data_source_input = DataSourceInput(
... canonical_name='Sales_Data_Connection', # data connection name
... table='sales_transactions',
... schema='PUBLIC',
... sampling=LimitSamplingOperation(rows=1000)
... )
>>>
>>> # Get your use case and data store
>>> my_use_case = dr.UseCase.list(search_params={"search": "Sales Analysis"})[0]
>>> data_store = dr.DataStore.get('2g33a1b2c9e88f0001e6f657')
>>>
>>> # Create recipe from data source
>>> recipe = dr.Recipe.from_data_store(
... use_case=my_use_case,
... data_store=data_store,
... data_source_type=DataWranglingDataSourceTypes.JDBC,
... dialect=DataWranglingDialect.POSTGRES,
... data_source_inputs=[data_source_input],
... recipe_type=RecipeType.WRANGLING
... )
Retrieve recipes¶
You can retrieve a specific recipe by ID, or a list of all recipes, filtering the list as required.
>>> import datarobot as dr
>>>
>>> # Get a specific recipe by ID
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # List all recipes
>>> all_recipes = dr.Recipe.list()
>>>
>>> # Filter recipes. Use any number of params to filter.
>>> filtered_recipes = dr.Recipe.list(
... search="My Recipe Name",
... dialect=dr.enums.DataWranglingDialect.SPARK,
... status="draft",
... recipe_type=dr.enums.RecipeType.WRANGLING,
... order_by="-updatedAt", # Most recently updated first
... created_by_username="data_scientist_user"
... )
Retrieve information about a recipe¶
The recipe object contains basic information about the recipe that you can query, as shown below.
>>> import datarobot as dr
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> recipe.id
'690bbf77aa31530d8287ae5f'
>>> recipe.name
'Customer Segmentation Dataset Recipe'
You can also retrieve the list of inputs and operations, as well as the downsampling configuration and general recipe settings.
>>> import datarobot as dr
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> # Access inputs, operations, downsampling and settings
>>> inputs = recipe.inputs
>>> primary_input = inputs[0] # First input is the primary input
>>> secondary_inputs = inputs[1:] # All others in the list are secondary inputs
>>>
>>> operations = recipe.operations
>>> downsampling_operation = recipe.downsampling
>>> settings = recipe.settings
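For example, you can iterate over the recipe's operations to list the transformations it applies, in order (a minimal sketch; the output depends on your recipe):
>>> for op in recipe.operations:
...     print(type(op).__name__)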
Update recipe metadata fields¶
You can update the recipe's metadata fields (name, description, etc.) with Recipe.update as follows:
>>> import datarobot as dr
>>>
>>> # Retrieve an existing recipe
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Update metadata fields
>>> recipe.update(
... name="Customer Segmentation Dataset Recipe",
... description="Recipe to create customer segmentation dataset."
... )
Update recipe inputs¶
You can update the list of inputs for a recipe with the Recipe.update method, as shown below. Updating the list of inputs changes the data the recipe transforms. The first input in the list is the recipe's primary input; the rest are secondary inputs.
Recipe input considerations
The Recipe.update method replaces all existing inputs. When adding inputs, always include the existing primary input to avoid breaking the recipe.
Data from secondary inputs will not appear in the recipe preview unless joined or otherwise combined with data from the primary input.
Recipe inputs must either be all datasets, or all tables from data sources pointing to the same data store.
>>> import datarobot as dr
>>> from datarobot.enums import RecipeInputType
>>> from datarobot.models.recipe import RecipeDatasetInput, JDBCTableDataSourceInput
>>> from datarobot.models.recipe_operation import LimitSamplingOperation
>>>
>>> # Get the recipe and additional datasets
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> secondary_dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f456')
>>>
>>> # Add a secondary dataset input if the primary input is also a dataset
>>> recipe.update(
... inputs=[
... recipe.inputs[0], # Keep the original primary input
... RecipeDatasetInput.from_dataset(
... dataset=secondary_dataset,
... alias='customers_data'
... )
... ]
... )
>>>
>>> # You can also add data from a table in a data store
>>> data_store = dr.DataStore.get('5e1b4f8f2a3c4d5e6f7g8h9i')
>>> data_source = dr.DataSource.create(
... data_source_type="jdbc",
... canonical_name="My Snowflake connection",
... params=dr.DataSourceParameters(
... data_store_id=data_store.id,
... schema="PUBLIC",
... table="stock_prices"
... )
... )
>>> table = data_source.create_dataset()
>>> # Add data from a table in a data store if the primary input is also a table from the same data store
>>> recipe.update(
... inputs=[
... recipe.inputs[0], # Primary input
... JDBCTableDataSourceInput(
... input_type=RecipeInputType.DATASOURCE,
... data_source_id=data_source.id,
... data_store_id=data_store.id,
... dataset_id=table.id,
... sampling=LimitSamplingOperation(rows=250),
... alias='my_table_alias'
... )
... ]
... )
Update primary input sampling¶
You can choose to limit the number of rows to work with when iterating on your recipe operations. Specifying a sampling operation on the primary input of your recipe enables faster computation of the recipe preview. Sampling operations do not modify the number of rows written when publishing to a dataset. Specify a sampling operation only on the primary input: since secondary inputs are always joined or combined with the primary input, the primary input alone determines the number of rows shown in the recipe preview.
>>> import datarobot as dr
>>> from datarobot.models.recipe import RecipeDatasetInput
>>> from datarobot.models.recipe_operation import LimitSamplingOperation
>>>
>>> # Get your recipe and configure sampling for an input
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> my_dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f456')
>>> dataset_input = RecipeDatasetInput.from_dataset(
... dataset=my_dataset,
... alias='sampled_data',
... sampling=LimitSamplingOperation(rows=100)
... )
>>> # Update recipe with sampled input
>>> recipe.update(inputs=[dataset_input])
Update recipe wrangling operations¶
Wrangling operations are the building blocks of your recipe and define the transformations applied to your data. Operations are processed sequentially; the output of one operation becomes the input for the next operation, creating a transformation pipeline.
>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import (
...     FilterOperation, FilterCondition, ComputeNewOperation,
...     AggregationOperation, AggregateFeature
... )
>>> from datarobot.enums import FilterOperationFunctions, AggregationFunctions
>>>
>>> # Get your recipe
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Create a series of operations
>>> operations = [
... # Filter rows where age > 18
... FilterOperation(
... conditions=[
... FilterCondition(
... column="age",
... function=FilterOperationFunctions.GREATER_THAN,
... function_arguments=[18]
... )
... ],
... keep_rows=True
... ),
... # Then create new column with full name
... ComputeNewOperation(
... expression="CONCAT(first_name, " ", last_name)",
... new_feature_name="full_name"
... ),
... # Then group by department and calculate average salary
... AggregationOperation(
... aggregations=[
... AggregateFeature(
... feature="salary",
... functions=[AggregationFunctions.AVERAGE]
... )
... ],
... group_by_columns=["department"]
... ),
... ]
>>>
>>> # Update the recipe with new list of wrangling operations
>>> recipe.update(operations=operations)
The following sections list the available data wrangling operations, each with an example of the transformation it applies.
Lags operation¶
The LagsOperation creates lagged versions of a column based on datetime ordering. The operation creates new columns for each specified lag.
>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import LagsOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Create lags for 1 and 2 days for stock price analysis
>>> lags_op = LagsOperation(
... column="stock_price",
... orders=[1, 2],
... datetime_partition_column="trade_date",
... multiseries_id_column="ticker_symbol" # For multiseries data (multiple stocks in this example)
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[lags_op])
Primary input dataset:
| ticker_symbol | trade_date | stock_price |
|---|---|---|
| AAPL | 2024-01-01 | 150.00 |
| AAPL | 2024-01-02 | 152.50 |
| AAPL | 2024-01-03 | 149.75 |
| AAPL | 2024-01-04 | 153.20 |
| MSFT | 2024-01-01 | 380.00 |
| MSFT | 2024-01-02 | 385.75 |
| MSFT | 2024-01-03 | 382.30 |
| MSFT | 2024-01-04 | 388.90 |
Recipe preview:
| ticker_symbol | trade_date | stock_price | stock_price (1st lag) | stock_price (2nd lag) |
|---|---|---|---|---|
| AAPL | 2024-01-01 | 150.00 | ||
| AAPL | 2024-01-02 | 152.50 | 150.00 | |
| AAPL | 2024-01-03 | 149.75 | 152.50 | 150.00 |
| AAPL | 2024-01-04 | 153.20 | 149.75 | 152.50 |
| MSFT | 2024-01-01 | 380.00 | ||
| MSFT | 2024-01-02 | 385.75 | 380.00 | |
| MSFT | 2024-01-03 | 382.30 | 385.75 | 380.00 |
| MSFT | 2024-01-04 | 388.90 | 382.30 | 385.75 |
Window categorical statistics operation¶
The WindowCategoricalStatsOperation calculates categorical statistics for a rolling window, creating new columns for each statistical method. This can be used to track trends in categorical data over time.
>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import WindowCategoricalStatsOperation
>>> from datarobot.enums import CategoricalStatsMethods
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Compute most frequent purchase in last 3 purchases
>>> window_cat_op = WindowCategoricalStatsOperation(
... column="product_category",
... window_size=3, # Last 3 purchases
... methods=[CategoricalStatsMethods.MOST_FREQUENT],
... datetime_partition_column="purchase_date",
... multiseries_id_column="customer_id"
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[window_cat_op])
Primary input dataset:
| customer_id | purchase_date | product_category |
|---|---|---|
| CUST001 | 2024-01-01 | Electronics |
| CUST001 | 2024-01-02 | Clothing |
| CUST001 | 2024-01-03 | Electronics |
| CUST001 | 2024-01-04 | Electronics |
| CUST002 | 2024-01-01 | Books |
| CUST002 | 2024-01-02 | Books |
| CUST002 | 2024-01-03 | Electronics |
| CUST002 | 2024-01-04 | Books |
Recipe preview:
| customer_id | purchase_date | product_category | product_category (3 rows most frequent) |
|---|---|---|---|
| CUST001 | 2024-01-01 | Electronics | Electronics |
| CUST001 | 2024-01-02 | Clothing | Electronics |
| CUST001 | 2024-01-03 | Electronics | Electronics |
| CUST001 | 2024-01-04 | Electronics | Electronics |
| CUST002 | 2024-01-01 | Books | Books |
| CUST002 | 2024-01-02 | Books | Books |
| CUST002 | 2024-01-03 | Electronics | Books |
| CUST002 | 2024-01-04 | Books | Books |
Window numerical statistics operation¶
The WindowNumericStatsOperation calculates numeric statistics for a rolling window, creating new columns for each statistical method. This operation is useful for computing moving averages, maximums, minimums, and other statistics over time.
>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import WindowNumericStatsOperation
>>> from datarobot.enums import NumericStatsMethods
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Track max and average of last 3 transactions
>>> window_num_op = WindowNumericStatsOperation(
... column="sales_amount",
... window_size=3, # Last 3 transactions
... methods=[NumericStatsMethods.AVG, NumericStatsMethods.MAX],
... datetime_partition_column="transaction_date",
... multiseries_id_column="store_id"
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[window_num_op])
Primary input dataset:
| store_id | transaction_date | sales_amount |
|---|---|---|
| STORE01 | 2024-01-01 | 100.00 |
| STORE01 | 2024-01-02 | 150.00 |
| STORE01 | 2024-01-03 | 120.00 |
| STORE01 | 2024-01-04 | 200.00 |
| STORE02 | 2024-01-01 | 80.00 |
| STORE02 | 2024-01-02 | 90.00 |
| STORE02 | 2024-01-03 | 110.00 |
| STORE02 | 2024-01-04 | 95.00 |
Recipe preview:
| store_id | transaction_date | sales_amount | sales_amount (3 rows avg) | sales_amount (3 rows max) |
|---|---|---|---|---|
| STORE01 | 2024-01-01 | 100.00 | 100.00 | 100.00 |
| STORE01 | 2024-01-02 | 150.00 | 125.00 | 150.00 |
| STORE01 | 2024-01-03 | 120.00 | 123.33 | 150.00 |
| STORE01 | 2024-01-04 | 200.00 | 156.67 | 200.00 |
| STORE02 | 2024-01-01 | 80.00 | 80.00 | 80.00 |
| STORE02 | 2024-01-02 | 90.00 | 85.00 | 90.00 |
| STORE02 | 2024-01-03 | 110.00 | 93.33 | 110.00 |
| STORE02 | 2024-01-04 | 95.00 | 98.33 | 110.00 |
Time series operation¶
The TimeSeriesOperation generates a dataset ready for time series modeling by creating forecast points, distances, and various time-aware features. By defining a task plan, you can execute multiple time series transformations, such as lags and rolling statistics, and add their results as features to the recipe data.
>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import TimeSeriesOperation, TaskPlanElement, Lags
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Define task plan for feature engineering
>>> task_plan = [
... TaskPlanElement(
... column="sales_amount",
... task_list=[Lags(orders=[1])]
... )
... ]
>>>
>>> # Create time series operation
>>> time_series_op = TimeSeriesOperation(
... target_column="sales_amount",
... datetime_partition_column="sale_date",
... forecast_distances=[1], # Predict 1 period ahead
... task_plan=task_plan
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[time_series_op])
Primary input dataset:
| store_id | sale_date | sales_amount |
|---|---|---|
| STORE01 | 2024-01-01 | 1000 |
| STORE01 | 2024-01-02 | 1200 |
| STORE01 | 2024-01-03 | 1100 |
| STORE01 | 2024-01-04 | 1300 |
Recipe preview:
| store_id (actual) | sale_date (actual) | sales_amount (actual) | Forecast Point | Forecast Distance | sales_amount (1st lag) | sales_amount (naive 1 row seasonal value) |
|---|---|---|---|---|---|---|
| STORE01 | 2024-01-02 | 1200 | 2024-01-01 | 1 | 1000 | 1000 |
| STORE01 | 2024-01-03 | 1100 | 2024-01-02 | 1 | 1200 | 1200 |
| STORE01 | 2024-01-04 | 1300 | 2024-01-03 | 1 | 1100 | 1100 |
Compute new operation¶
The ComputeNewOperation creates a new feature using a SQL expression, allowing you to derive calculated fields from existing columns. This operation can be useful for creating custom business logic, mathematical transformations, and feature combinations.
>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import ComputeNewOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Create compute new operation to compute total cost, factoring in a discount %
>>> compute_op = ComputeNewOperation(
... expression="ROUND(quantity * unit_price * (1 - discount), 2)",
... new_feature_name="total_cost"
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[compute_op])
Primary input dataset:
| order_id | quantity | unit_price | discount |
|---|---|---|---|
| ORD001 | 3 | 25.50 | 0.10 |
| ORD002 | 1 | 15.00 | 0.00 |
| ORD003 | 2 | 40.00 | 0.15 |
| ORD004 | 5 | 12.25 | 0.05 |
Recipe preview:
| order_id | quantity | unit_price | discount | total_cost |
|---|---|---|---|---|
| ORD001 | 3 | 25.50 | 0.10 | 68.85 |
| ORD002 | 1 | 15.00 | 0.00 | 15.00 |
| ORD003 | 2 | 40.00 | 0.15 | 68.00 |
| ORD004 | 5 | 12.25 | 0.05 | 58.19 |
Rename column operation¶
The RenameColumnsOperation renames one or more columns. This operation is often useful for standardizing column names, making them more descriptive, or ensuring consistent column naming for specific downstream processes.
>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import RenameColumnsOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Rename customer id, product name and quantity columns
>>> rename_op = RenameColumnsOperation(
... column_mappings={
... 'cust_id': 'customer_id',
... 'prod_name': 'product_name',
... 'qty': 'quantity'
... }
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[rename_op])
Primary input dataset:
| cust_id | prod_name | qty | price |
|---|---|---|---|
| C001 | Widget A | 3 | 25.99 |
| C002 | Gadget B | 1 | 15.50 |
| C001 | Tool C | 2 | 45.00 |
| C003 | Widget A | 5 | 25.99 |
Recipe preview:
| customer_id | product_name | quantity | price |
|---|---|---|---|
| C001 | Widget A | 3 | 25.99 |
| C002 | Gadget B | 1 | 15.50 |
| C001 | Tool C | 2 | 45.00 |
| C003 | Widget A | 5 | 25.99 |
Filter operation¶
The FilterOperation removes or keeps rows based on one or more filter conditions. Apply multiple conditions with AND/OR logic to create complex filtering rules.
>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import FilterOperation, FilterCondition
>>> from datarobot.enums import FilterOperationFunctions
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Create filter conditions to keep customers over 18 with active status
>>> conditions = [
... FilterCondition(
... column="age",
... function=FilterOperationFunctions.GREATER_THAN_OR_EQUALS,
... function_arguments=[18]
... ),
... FilterCondition(
... column="status",
... function=FilterOperationFunctions.EQUALS,
... function_arguments=["active"]
... )
... ]
>>>
>>> # Create filter operation
>>> filter_op = FilterOperation(
... conditions=conditions,
... keep_rows=True, # Keep matching rows
... operator="and" # Both conditions must be true
... )
>>>
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[filter_op])
Primary input dataset:
| customer_id | age | status | purchase_amount |
|---|---|---|---|
| C001 | 25 | active | 150.00 |
| C002 | 17 | active | 75.00 |
| C003 | 30 | inactive | 200.00 |
| C004 | 22 | active | 95.00 |
Recipe preview:
| customer_id | age | status | purchase_amount |
|---|---|---|---|
| C001 | 25 | active | 150.00 |
| C004 | 22 | active | 95.00 |
Drop columns operation¶
The DropColumnsOperation removes one or more columns. This operation is useful for eliminating unnecessary fields, sensitive information, or columns that won't be used in downstream processes.
>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import DropColumnsOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Create operation to drop 2 extra columns
>>> drop_op = DropColumnsOperation(
... columns=['internal_notes', 'legacy_id']
... )
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[drop_op])
Primary input dataset:
| customer_id | name | email | internal_notes | legacy_id |
|---|---|---|---|---|
| C001 | John Doe | john@email.com | VIP customer | L001 |
| C002 | Jane Doe | jane@email.com | New customer | L002 |
| C003 | Bob Lee | bob@email.com | Frequent buyer | L003 |
Recipe preview:
| customer_id | name | email |
|---|---|---|
| C001 | John Doe | john@email.com |
| C002 | Jane Doe | jane@email.com |
| C003 | Bob Lee | bob@email.com |
Dedupe rows operation¶
The DedupeRowsOperation removes duplicate rows, keeping only unique combinations of values. Duplicates are identified by comparing values across all columns. This operation helps clean data by eliminating redundant records.
>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import DedupeRowsOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Create dedupe rows operation
>>> dedupe_op = DedupeRowsOperation()
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[dedupe_op])
Primary input dataset:
| customer_id | product | quantity | price |
|---|---|---|---|
| C001 | Widget A | 2 | 25.99 |
| C002 | Gadget B | 1 | 15.50 |
| C001 | Widget A | 2 | 25.99 |
| C003 | Tool C | 3 | 45.00 |
| C002 | Gadget B | 1 | 15.50 |
Recipe preview:
| customer_id | product | quantity | price |
|---|---|---|---|
| C001 | Widget A | 2 | 25.99 |
| C002 | Gadget B | 1 | 15.50 |
| C003 | Tool C | 3 | 45.00 |
Find-and-replace operation¶
The FindAndReplaceOperation searches for specific strings or patterns in a column and replaces them with new values. The operation supports exact matches, partial matches, or regular expressions for flexible text manipulation.
>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import FindAndReplaceOperation
>>> from datarobot.enums import FindAndReplaceMatchMode
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Replace instances of 'In Progress' (case insensitive) with 'Active'
>>> replace_op = FindAndReplaceOperation(
... column="status",
... find="In Progress",
... replace_with="Active",
... match_mode=FindAndReplaceMatchMode.EXACT,
... is_case_sensitive=False
... )
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[replace_op])
Primary input dataset:
| order_id | status | customer_name |
|---|---|---|
| ORD001 | In Progress | John Smith |
| ORD002 | Completed | Jane Doe |
| ORD003 | in progress | Bob Johnson |
| ORD004 | Cancelled | Alice Brown |
Recipe preview:
| order_id | status | customer_name |
|---|---|---|
| ORD001 | Active | John Smith |
| ORD002 | Completed | Jane Doe |
| ORD003 | Active | Bob Johnson |
| ORD004 | Cancelled | Alice Brown |
Aggregation operation¶
The AggregationOperation groups data by the specified columns and calculates summary features such as sum, average, and count. This operation is useful for creating analytical summaries and computing derived features. A new column is created for each aggregation function applied to each feature chosen for aggregation.
>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import AggregationOperation, AggregateFeature
>>> from datarobot.enums import AggregationFunctions
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Group by customer id and product category
>>> # Compute the sum of orders and customer's average order amount
>>> agg_op = AggregationOperation(
... group_by_columns=['customer_id', 'product_category'],
... aggregations=[
... AggregateFeature(
... feature="order_amount",
... functions=[AggregationFunctions.SUM, AggregationFunctions.AVERAGE]
... )
... ]
... )
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[agg_op])
Primary input dataset:
| customer_id | product_category | order_id | order_amount |
|---|---|---|---|
| C001 | Electronics | ORD001 | 150.00 |
| C001 | Electronics | ORD002 | 200.00 |
| C001 | Clothing | ORD003 | 75.00 |
| C002 | Electronics | ORD004 | 300.00 |
| C002 | Electronics | ORD005 | 125.00 |
Recipe preview:
| customer_id | product_category | order_amount_sum | order_amount_avg |
|---|---|---|---|
| C001 | Electronics | 350.00 | 175.00 |
| C001 | Clothing | 75.00 | 75.00 |
| C002 | Electronics | 425.00 | 212.50 |
Join operation¶
The JoinOperation joins an additional data input to the recipe's current data, letting you enrich your primary dataset with information from secondary datasets. The join operation supports only equality predicates (one or more) as the join condition.
Note
The additional data input is treated as the right side of the join.
>>> import datarobot as dr
>>> from datarobot.models.recipe import RecipeDatasetInput
>>> from datarobot.models.recipe_operation import JoinOperation
>>> from datarobot.enums import JoinType
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Get the secondary dataset and add it as an input
>>> dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f123')
>>> recipe.update(
... inputs=[
... recipe.inputs[0], # Keep the original primary input
... RecipeDatasetInput.from_dataset(
... dataset=dataset,
... alias='customers'
... )
... ]
... )
>>>
>>> # Join secondary dataset on customer id
>>> # Right dataset in join will always be the new dataset
>>> join_op = JoinOperation.join_dataset(
... dataset=dataset,
... join_type=JoinType.INNER,
... right_prefix='cust_',
... left_keys=['customer_id'],
... right_keys=['id']
... )
>>> # Apply the operation to the recipe
>>> recipe.update(operations=[join_op])
Primary input dataset (orders)
| order_id | customer_id | amount |
|---|---|---|
| ORD001 | C001 | 150.00 |
| ORD002 | C002 | 200.00 |
| ORD003 | C001 | 75.00 |
Secondary input dataset (customers)
| id | name | city |
|---|---|---|
| C001 | John Smith | New York |
| C002 | Jane Doe | Los Angeles |
| C003 | Bob Lee | Chicago |
Recipe preview:
| order_id | customer_id | amount | cust_id | cust_name | cust_city |
|---|---|---|---|---|---|
| ORD001 | C001 | 150.00 | C001 | John Smith | New York |
| ORD002 | C002 | 200.00 | C002 | Jane Doe | Los Angeles |
| ORD003 | C001 | 75.00 | C001 | John Smith | New York |
Set recipe SQL transformation directly¶
For advanced use cases, you can set the recipe's transformation using a SQL expression. This provides maximum flexibility for complex operations that may not be available through standard wrangling operations.
Important: Setting SQL directly changes the recipe type to SQL and bypasses any existing wrangling operations.
>>> import datarobot as dr
>>>
>>> # Get your recipe
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Define your SQL transformation
>>> sql_query = "MY SQL EXPRESSION HERE"
>>> # Update the recipe with SQL
>>> recipe.update(sql=sql_query)
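As an illustration only, a query might aggregate the recipe's data per customer. How you reference input tables depends on your recipe's inputs and dialect; the alias below is hypothetical:
>>> # 'orders' is a hypothetical alias for the recipe's primary input
>>> sql_query = "SELECT customer_id, SUM(amount) AS total_amount FROM orders GROUP BY customer_id"
>>> recipe.update(sql=sql_query)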
Preview recipe data¶
Before publishing your recipe, you can preview the transformed data with Recipe.get_preview to validate your transformations and ensure they produce the expected results.
>>> import datarobot as dr
>>>
>>> # Get your recipe
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Generate a preview of the transformed data
>>> preview = recipe.get_preview()
>>> # View preview data as a DataFrame
>>> preview.df
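Assuming preview.df returns a pandas DataFrame (as the attribute name suggests), you can inspect the sample with standard pandas tooling:
>>> preview.df.head()  # First few rows of the transformed sample
>>> list(preview.df.columns)  # Column names after transformation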
Update recipe downsampling¶
Downsampling modifies the size of the dataset published by the recipe, which can improve performance for large datasets and speed up development and testing. It is particularly useful when working with millions of rows and a representative sample is sufficient for the published dataset. Downsampling does not affect the number of rows in the recipe preview. Set a recipe's downsampling with a downsampling operation.
>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import RandomDownsamplingOperation
>>>
>>> # Get your recipe
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Configure random downsampling to 50,000 rows
>>> downsampling = RandomDownsamplingOperation(max_rows=50_000)
>>> # Apply downsampling to the recipe
>>> recipe.update(downsampling=downsampling)
>>> # Disable downsampling
>>> recipe.update(downsampling=None)
Publish recipe to dataset¶
Once your recipe is complete, you can publish it with Recipe.publish_to_dataset to create a dataset with your transformed data.
>>> import datarobot as dr
>>>
>>> # Get your recipe
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>>
>>> # Publish recipe to create a new dataset
>>> dataset = recipe.publish_to_dataset(
... name="Customer Segmentation Data",
... do_snapshot=True
... )
>>>
>>> # Publish and attach to an existing use case
>>> use_case = dr.UseCase.get('5e1b4f8f2a3c4d5e6f7g8h9i')
>>> dataset_with_use_case = recipe.publish_to_dataset(
... name="Advanced Customer Analytics",
... use_cases=use_case
... )
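The returned object is an ordinary datarobot.Dataset, so you can use it anywhere a dataset is expected, for example:
>>> dataset.name
'Customer Segmentation Data'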