Recipes

class datarobot.models.recipe.Recipe

Data wrangling entity containing information required to transform one or more datasets and generate SQL.

A recipe acts like a blueprint for creating a dataset by applying a series of operations (filters, aggregations, etc.) to one or more input datasets or data sources.

  • Variables:
    • id (str) – The unique identifier of the recipe.
    • name (str) – The name of the recipe. Not unique.
    • status (str) – The status of the recipe.
    • dialect (DataWranglingDialect) – The dialect of the recipe.
    • recipe_type (RecipeType) – The type of the recipe.
    • inputs (List[Union[JDBCTableDataSourceInput, RecipeDatasetInput]]) – The list of inputs for the recipe. Each input can be either a JDBCTableDataSourceInput or a RecipeDatasetInput. The first input is the primary input. All other secondary inputs must be joined or otherwise combined with the primary input to appear in the recipe data preview and published dataset.
    • operations (Optional[List[WranglingOperation]]) – The list of operations for the recipe.
    • downsampling (Optional[DownsamplingOperation]) – The downsampling operation applied to the recipe. Used when publishing the recipe to a dataset.
    • settings (Optional[RecipeSettings]) – The settings for the recipe.

Examples

Create a recipe from a dataset or data source,

>>> import datarobot as dr
>>> from datarobot.enums import DataWranglingDialect, RecipeType
>>> from datarobot.models.recipe_operation import RandomSamplingOperation
>>> my_use_case = dr.UseCase.list(search_params={"search": "My Use Case"})[0]
>>> dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f123')
>>> recipe = dr.Recipe.from_dataset(
...     use_case=my_use_case,
...     dataset=dataset,
...     dialect=DataWranglingDialect.SPARK,
...     recipe_type=RecipeType.WRANGLING,
...     sampling=RandomSamplingOperation(rows=500)
... )

or use an existing recipe.

>>> recipe = dr.Recipe.list(search="My Recipe")[0]
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')

Adjust the recipe’s name, description or other metadata fields.

>>> recipe.update(
...     name='My updated recipe name',
...     description='Updated description for my recipe'
... )

Then add additional datasets or data sources as inputs to the recipe.

>>> from datarobot.models.recipe import RecipeDatasetInput
>>> my_other_dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f456')
>>> recipe.update(
...     inputs=[
...         recipe.inputs[0],
...         RecipeDatasetInput.from_dataset(
...             dataset=my_other_dataset,
...             alias='dataset_B'
...         )
...     ]
... )

Apply wrangling operations to the recipe to join, filter, aggregate, or transform the recipe’s data,

>>> from datarobot.models.recipe_operation import JoinOperation, FilterOperation, FilterCondition
>>> from datarobot.enums import JoinType, FilterOperationFunctions
>>> join_op = JoinOperation.join_dataset(
...     dataset=my_other_dataset,
...     join_type=JoinType.INNER,
...     right_prefix='B_',
...     left_keys=['id'],
...     right_keys=['id']
... )
>>> filter_op = FilterOperation(
...     conditions=[
...         FilterCondition(
...             column='B_value',
...             function=FilterOperationFunctions.GREATER_THAN,
...             function_arguments=[100]
...         )
...     ]
... )
>>> recipe.update(operations=[join_op, filter_op])

or manually set the SQL for the recipe if you prefer to write your own SQL transformations:

>>> recipe.update(sql=(
...     "SELECT A.*, B.value AS B_value FROM dataset_A AS A "
...     "INNER JOIN dataset_B AS B ON A.id = B.id "
...     "WHERE B.value > 100"
... ))

Then review the data preview generated for the recipe.

>>> preview = recipe.get_preview()
>>> preview.df

Finally, publish the recipe to create a new dataset.

>>> published_dataset = recipe.publish_to_dataset(
...     name='My new Dataset built from recipe',
...     do_snapshot=True,
...     use_cases=my_use_case,
...     max_wait=600
... )

update(name=None, description=None, sql=None, recipe_type=None, inputs=None, operations=None, settings=None, **kwargs)

Update the recipe.

  • Parameters:
    • name (Optional[str]) – The new recipe name.
    • description (Optional[str]) – The new recipe description.
    • sql (Optional[str]) – The new wrangling SQL. Only applicable for the SQL recipe_type.
    • recipe_type (Optional[RecipeType]) – The new type of the recipe. Only switching between SQL and WRANGLING is applicable.
    • inputs (Optional[List[JDBCTableDataSourceInput | RecipeDatasetInput]]) – The new list of recipe inputs. You can update sampling and/or aliases using this parameter. Only specify sampling on the primary input (the first input in the list).
    • operations (Optional[List[WranglingOperation]]) – The new list of operations. Only applicable for the WRANGLING recipe_type.
    • settings (Optional[RecipeSettings]) – The new recipe settings.
    • downsampling (Optional[DownsamplingOperation]) – The new downsampling operation, or None to apply no downsampling when publishing.
  • Return type: None

Examples

Update recipe metadata fields name and description:

>>> import datarobot as dr
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> recipe.update(
...     name='My updated recipe name',
...     description='Updated description for my recipe'
... )

Update recipe inputs to include 2 datasets to allow for joining data:

>>> import datarobot as dr
>>> from datarobot.models.recipe import RecipeDatasetInput
>>> from datarobot.models.recipe_operation import RandomSamplingOperation
>>> primary_dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f123')
>>> secondary_dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f456')
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> recipe.update(
...     inputs=[
...         RecipeDatasetInput.from_dataset(
...             dataset=primary_dataset,
...             sampling=RandomSamplingOperation(rows=500),
...             alias='dataset_A'
...         ),
...         RecipeDatasetInput.from_dataset(
...             dataset=secondary_dataset,
...             alias='dataset_B'
...         )
...     ]
... )

Update recipe operations to filter out users younger than 18 years old:

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import FilterOperation, FilterCondition
>>> from datarobot.enums import FilterOperationFunctions
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> filter_op = FilterOperation(
...     conditions=[
...         FilterCondition(
...             column='age',
...             function=FilterOperationFunctions.GREATER_THAN_OR_EQUAL,
...             function_arguments=[18]
...         )
...     ]
... )
>>> recipe.update(operations=[filter_op])

Update recipe settings to change the column used for feature weights:

>>> import datarobot as dr
>>> from datarobot.models.recipe import RecipeSettings
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> recipe.update(
...     settings=RecipeSettings(
...         weights_feature='observation_weights'
...     )
... )

Update downsampling to only keep 500 random rows when publishing:

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import RandomDownsamplingOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> recipe.update(
...     downsampling=RandomDownsamplingOperation(max_rows=500)
... )

Notes

Setting the sql metadata field on a non-SQL type recipe will convert the recipe to a SQL type recipe.
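
For example, a minimal sketch (the table name is assumed to match the recipe's input alias):

>>> recipe.update(sql='SELECT * FROM dataset_A')  # converts a WRANGLING recipe to a SQL recipe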

get_preview(max_wait=600, number_of_operations_to_use=None)

Retrieve a preview of the sample data. The preview is computed if absent.

  • Parameters:
    • max_wait (int) – Maximum number of seconds to wait when retrieving the preview.
    • number_of_operations_to_use (Optional[int]) – Number of operations to use when computing the preview. If provided, the first N operations will be used. If not provided, all operations will be used.
  • Returns: preview – The preview of the application of the recipe.
  • Return type: RecipePreview

Examples

>>> import datarobot as dr
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> preview = recipe.get_preview()
>>> preview
RecipePreview(
    columns=['feature_1', 'feature_2', 'feature_3'],
    count=4,
    data=[['5', 'true', 'James'], ['-7', 'false', 'Bryan'], ['2', 'false', 'Jamie'], ['4', 'true', 'Lyra']],
    total_count=4,
    byte_size=46,
    result_schema=[
        {'data_type': 'INT_TYPE', 'name': 'feature_1'},
        {'data_type': 'BOOLEAN_TYPE', 'name': 'feature_2'},
        {'data_type': 'STRING_TYPE', 'name': 'feature_3'}
    ],
    stored_count=4,
    estimated_size_exceeds_limit=False,
)
>>> preview.df
  feature_1 feature_2 feature_3
0         5      true     James
1        -7     false     Bryan
2         2     false     Jamie
3         4      true      Lyra
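
Pass number_of_operations_to_use to preview an intermediate result, e.g. after only the first operation:

>>> partial_preview = recipe.get_preview(number_of_operations_to_use=1)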

publish_to_dataset(name=None, do_snapshot=None, persist_data_after_ingestion=None, categories=None, credential=None, credential_id=None, use_kerberos=None, materialization_destination=None, max_wait=600, use_cases=None)

A blocking call to publish the recipe to a new Dataset.

  • Parameters:
    • name (Optional[str]) – The name for the new dataset.
    • do_snapshot (Optional[bool]) – If true, create a snapshot dataset.
    • persist_data_after_ingestion (Optional[bool]) – If true, enforce saving all data for download and sampling.
    • categories (Optional[List[str]]) – A list of strings describing the intended use of the dataset.
    • credential (Optional[Credential]) – The credential to use to authenticate with the database, if required.
    • credential_id (Optional[str]) – The ID of the set of credentials to use to authenticate with the database, if required.
    • use_kerberos (Optional[bool]) – If true, use Kerberos authentication when connecting to the database.
    • materialization_destination (Optional[MaterializationDestination]) – Destination table information to create and materialize the recipe to. If None, the recipe will be materialized in DataRobot.
    • max_wait (int) – Number of seconds to wait for the dataset to be created.
    • use_cases (Union[List[UseCase], UseCase, List[str], str, None]) – One or more use cases to which the published dataset will be added. Can pass UseCase instances or use case IDs.
  • Returns: dataset – The newly created dataset.
  • Return type: Dataset

Examples

>>> import datarobot as dr
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> dataset = recipe.publish_to_dataset(
...     name='Published Dataset from Recipe',
...     do_snapshot=True,
...     max_wait=600
... )
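
When the recipe reads from a JDBC data source, stored credentials can be supplied; the credential ID below is illustrative:

>>> dataset = recipe.publish_to_dataset(
...     name='Published Dataset from Recipe',
...     credential_id='5f43a1b2c9e77f0001e6f789'
... )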

classmethod update_downsampling(recipe_id, downsampling)

Set the downsampling operation for the recipe. Downsampling is applied during publishing. Consider using update() instead to update a Recipe instance.

  • Parameters:
    • recipe_id (str) – Recipe ID.
    • downsampling (Optional[DownsamplingOperation]) – Downsampling operation to be applied during publishing. If None, no downsampling will be applied.
  • Returns: recipe – Recipe with updated downsampling.
  • Return type: Recipe

Examples

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import RandomDownsamplingOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> recipe = dr.Recipe.update_downsampling(
...     recipe_id=recipe.id,
...     downsampling=RandomDownsamplingOperation(max_rows=1000)
... )

SEE ALSO

Recipe.update

retrieve_preview(max_wait=600, number_of_operations_to_use=None)

Retrieve a preview of the sample data. The preview is computed if absent.

Deprecated since version 3.10: This method is deprecated and will be removed in 3.12. Use Recipe.get_preview instead.

  • Parameters:
    • max_wait (int) – Maximum number of seconds to wait when retrieving the preview.
    • number_of_operations_to_use (Optional[int]) – Number of operations to use when computing the preview. If provided, the first N operations will be used. If not provided, all operations will be used.
  • Returns: preview – Preview data computed.
  • Return type: Dict[str, Any]

SEE ALSO

Recipe.get_preview

retrieve_insights(max_wait=600, number_of_operations_to_use=None)

Retrieve insights for the recipe sample data. Requires a preview of sample data to be computed first with .get_preview(). Computing the preview automatically starts the insights job in the background if it is not already running. This call blocks until the insights are ready or max_wait is exceeded.

  • Parameters:
    • max_wait (int) – Maximum number of seconds to wait when retrieving the insights.
    • number_of_operations_to_use (Optional[int]) – Number of operations to use when computing insights. A preview must be computed first for the same number of operations. If provided, the first N operations will be used. If not provided, all operations will be used.
  • Returns: insights – The insights for the recipe sample data.
  • Return type: Dict[str, Any]
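
Examples

Compute a preview first, then retrieve the insights:

>>> import datarobot as dr
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> recipe.get_preview()  # ensures a preview exists and starts the insights job
>>> insights = recipe.retrieve_insights(max_wait=600)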

classmethod set_inputs(recipe_id, inputs)

Set the inputs for the recipe. Inputs can be a dataset or a JDBC data source table. Consider using update() instead to update a Recipe instance.

  • Parameters:
    • recipe_id (str) – Recipe ID.
    • inputs (List[Union[JDBCTableDataSourceInput, RecipeDatasetInput]]) – The list of inputs to set on the recipe.
  • Returns: recipe – Recipe with updated inputs.
  • Return type: Recipe

Examples

>>> import datarobot as dr
>>> from datarobot.models.recipe import RecipeDatasetInput
>>> from datarobot.models.recipe_operation import RandomSamplingOperation
>>> primary_dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f123')
>>> secondary_dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f456')
>>> recipe = dr.Recipe.set_inputs(
...     recipe_id='690bbf77aa31530d8287ae5f',
...     inputs=[
...         RecipeDatasetInput.from_dataset(
...             dataset=primary_dataset,
...             sampling=RandomSamplingOperation(rows=500),
...             alias='dataset_A'
...         ),
...         RecipeDatasetInput.from_dataset(
...             dataset=secondary_dataset,
...             alias='dataset_B'
...         )
...     ]
... )

SEE ALSO

Recipe.update

classmethod set_operations(recipe_id, operations)

Set the list of operations to use in the recipe. Operations are applied in order on the input(s). Consider using update() instead to update a Recipe instance.

  • Parameters:
    • recipe_id (str) – Recipe ID.
    • operations (List[WranglingOperation]) – List of operations to set in the recipe.
  • Returns: recipe – Recipe with updated list of operations.
  • Return type: Recipe

Examples

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import FilterOperation, FilterCondition
>>> from datarobot.enums import FilterOperationFunctions
>>> recipe = dr.Recipe.get("690bbf77aa31530d8287ae5f")
>>> new_operations = [
...    FilterOperation(
...        conditions=[
...            FilterCondition(
...                column="column_A",
...                function=FilterOperationFunctions.GREATER_THAN,
...                function_arguments=[100]
...            )
...        ]
...    )
... ]
>>> recipe = dr.Recipe.set_operations(recipe.id, operations=new_operations)

SEE ALSO

Recipe.update

classmethod set_recipe_metadata(recipe_id, metadata)

Update metadata for a recipe. Consider using update() instead to update a Recipe instance.

  • Parameters:
    • recipe_id (str) – Recipe ID.
    • metadata (Dict[str, str]) – Dictionary of metadata fields to update.
  • Returns: recipe – New recipe with updated metadata.
  • Return type: Recipe

Examples

>>> import datarobot as dr
>>> recipe = dr.Recipe.get("690bbf77aa31530d8287ae5f")
>>> new_metadata = {
...     "name": "Updated Recipe Name",
...     "description": "This is an updated description for the recipe."
... }
>>> recipe = dr.Recipe.set_recipe_metadata(recipe.id, metadata=new_metadata)

SEE ALSO

Recipe.update

Notes

Updating the sql metadata field on a non-SQL type recipe will convert the recipe to a SQL type recipe.

classmethod set_settings(recipe_id, settings)

Update the settings for a recipe. Consider using update() instead to update a Recipe instance.

  • Parameters:
    • recipe_id (str) – Recipe ID.
    • settings (RecipeSettings) – RecipeSettings containing the settings to be applied.
  • Returns: recipe – Recipe with updated settings.
  • Return type: Recipe

Examples

>>> import datarobot as dr
>>> from datarobot.models.recipe import RecipeSettings
>>> recipe = dr.Recipe.get("690bbf77aa31530d8287ae5f")
>>> new_settings = RecipeSettings(
...     weights_feature="feature_weights"
... )
>>> recipe = dr.Recipe.set_settings(recipe.id, settings=new_settings)

SEE ALSO

Recipe.update

classmethod list(search=None, dialect=None, status=None, recipe_type=None, order_by=None, created_by_user_id=None, created_by_username=None)

List recipes. Apply filters to narrow down results.

  • Parameters:
    • search (Optional[str]) – Recipe name to filter by.
    • dialect (Optional[DataWranglingDialect]) – Recipe dialect to filter by.
    • status (Optional[str]) – Recipe status to filter by. E.g., draft, published.
    • recipe_type (Optional[RecipeType]) – Recipe type to filter by.
    • order_by (Optional[str]) – Field to order results by. For reverse ordering prefix with ‘-’, e.g. -recipe_id.
    • created_by_user_id (Optional[str]) – Return recipes created by the user(s) associated with this user ID.
    • created_by_username (Optional[str]) – Return recipes created by the user(s) associated with this username.
  • Returns: recipes – List of recipes matching the filter criteria.
  • Return type: List[Recipe]

Examples

>>> import datarobot as dr
>>> recipes = dr.Recipe.list()
>>> recipes
[Recipe(
    dialect='spark',
    id='690bbf77aa31530d8287ae5f',
    name='Sample Recipe',
    status='draft',
    recipe_type='SQL',
    inputs=[...],
    operations=[...],
    downsampling=...,
    settings=...,
), ...]
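
Filters can be combined, for example listing draft recipes whose name matches 'Sample', in reverse recipe ID order:

>>> drafts = dr.Recipe.list(search='Sample', status='draft', order_by='-recipe_id')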

SEE ALSO

Recipe.get

classmethod get(recipe_id)

Retrieve a recipe by ID.

  • Parameters: recipe_id (str) – The ID of the recipe to retrieve.
  • Returns: recipe – The recipe with the specified ID.
  • Return type: Recipe

Examples

>>> import datarobot as dr
>>> recipe = dr.Recipe.get("690bbf77aa31530d8287ae5f")
>>> recipe
Recipe(
    dialect='spark',
    id='690bbf77aa31530d8287ae5f',
    name='Sample Recipe',
    status='draft',
    recipe_type='SQL',
    inputs=[...],
    operations=[...],
    downsampling=...,
    settings=...,
)

SEE ALSO

Recipe.list

get_sql(operations=None)

Generate SQL for the recipe, taking into account its operations and inputs. This does not modify the recipe.

  • Parameters: operations (Optional[List[WranglingOperation]]) – If provided, generate SQL for the given list of operations instead of the recipe’s operations, using the recipe’s inputs as the base.

    Deprecated since version 3.10: The operations parameter is deprecated and will be removed in 3.12. Use the generate_sql_for_operations class method instead.
  • Returns: sql – Generated SQL string.
  • Return type: str

Examples

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import FilterOperation, FilterCondition
>>> from datarobot.enums import FilterOperationFunctions
>>> recipe = dr.Recipe.get("690bbf77aa31530d8287ae5f")
>>> recipe.update(operations=[
...    FilterOperation(
...        conditions=[
...            FilterCondition(
...                column="column_A",
...                function=FilterOperationFunctions.GREATER_THAN,
...                function_arguments=[100]
...            )
...        ]
...    )
... ])
>>> recipe.get_sql()
"SELECT `sample_dataset`.`column_A` FROM `sample_dataset` WHERE `sample_dataset`.`column_A` > 100"

SEE ALSO

Recipe.generate_sql_for_operations

classmethod generate_sql_for_operations(recipe_id, operations)

Generate SQL for an arbitrary list of operations, using an existing recipe as a base. This does not modify the recipe. If you want to generate SQL for a recipe’s operations, use get_sql() instead.

  • Parameters:
    • recipe_id (str) – The ID of the recipe to use as a base. The SQL generation will use the recipe’s inputs and dialect.
    • operations (List[WranglingOperation]) – The list of operations to generate SQL for.
  • Returns: sql – Generated SQL string.
  • Return type: str

Examples

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import FilterOperation, FilterCondition
>>> from datarobot.enums import FilterOperationFunctions
>>> dr.Recipe.generate_sql_for_operations(
...    recipe_id="690bbf77aa31530d8287ae5f",
...    operations=[
...        FilterOperation(
...            conditions=[
...                FilterCondition(
...                    column="column_A",
...                    function=FilterOperationFunctions.LESS_THAN,
...                    function_arguments=[20]
...                )
...            ]
...        )
...    ]
... )
"SELECT `sample_dataset`.`column_A` FROM `sample_dataset` WHERE `sample_dataset`.`column_A` < 20"

classmethod from_data_store(use_case, data_store, data_source_type, dialect, data_source_inputs, recipe_type=RecipeType.WRANGLING)

Create a recipe using one or more data sources from a data store as input.

  • Parameters:
    • use_case (UseCase) – The use case where the recipe should be created.
    • data_store (DataStore) – The data store containing the data to use as input for the recipe.
    • data_source_type (DataWranglingDataSourceTypes) – The type of data source to use when connecting to the data store.
    • dialect (DataWranglingDialect) – The dialect of the recipe.
    • data_source_inputs (List[DataSourceInput]) – List of data source inputs for the recipe. Each input will be used to create a data source. If specifying multiple data source inputs, the first input is used as the primary data input. Secondary data sources must be joined or otherwise combined with the primary data source to appear in the data preview or published dataset.
    • recipe_type (RecipeType) – The type of the recipe. Only the SQL and WRANGLING recipe types are supported.
  • Returns: recipe – The recipe created.
  • Return type: Recipe

Examples

Create a wrangling recipe with 2 Snowflake tables as inputs from a data store:

>>> import datarobot as dr
>>> from datarobot.enums import DataWranglingDialect, DataWranglingDataSourceTypes, RecipeType
>>> from datarobot.models.recipe import DataSourceInput
>>> from datarobot.models.recipe_operation import LimitSamplingOperation
>>> from datetime import datetime
>>> my_use_case = dr.UseCase.list(search_params={"search": "My Use Case"})[0]
>>> data_store = dr.DataStore.list(name="Snowflake Data Store")[0]
>>> now = datetime.now().strftime("%Y%m%d_%H%M%S")
>>> primary_input = DataSourceInput(
...     canonical_name=f"Data source for stock_trades {now}",
...     table="stock_trades",
...     schema="PUBLIC",
...     sampling=LimitSamplingOperation(rows=1000)
... )
>>> secondary_input = DataSourceInput(
...     canonical_name=f"Data source for hist_stock_prices {now}",
...     table="hist_stock_prices",
...     schema="PUBLIC"
... )
>>> recipe = dr.Recipe.from_data_store(
...     use_case=my_use_case,
...     data_store=data_store,
...     data_source_type=DataWranglingDataSourceTypes.JDBC,
...     dialect=DataWranglingDialect.SNOWFLAKE,
...     data_source_inputs=[primary_input, secondary_input],
...     recipe_type=RecipeType.WRANGLING
... )
>>> recipe.update(name="My Snowflake wrangling recipe for stock trades and historical prices")

classmethod from_dataset(use_case, dataset, dialect=None, inputs=None, recipe_type=RecipeType.WRANGLING, snapshot_policy=DataWranglingSnapshotPolicy.LATEST, sampling=None)

Create a recipe using a dataset as input.

  • Parameters:
    • use_case (UseCase) – The use case where the recipe should be created.
    • dataset (Dataset) – The dataset to use as input for the recipe.
    • dialect (Optional[DataWranglingDialect]) – The dialect of the recipe. Required for most recipe types.
    • inputs (Optional[List[DatasetInput]]) – The configuration for the dataset input of the recipe. Currently only supports sampling configuration. Must be a list of length 1.

      Deprecated since version 3.10: This parameter is deprecated and will be removed in 3.12. Use the sampling parameter instead.
    • recipe_type (RecipeType) – The type of the recipe.
    • snapshot_policy (Optional[DataWranglingSnapshotPolicy]) – The snapshot policy to use for the dataset input.
    • sampling (Optional[SamplingOperation]) – The sampling configuration to use for the input dataset. This determines which rows from the dataset are used when generating a data preview. If not provided, the full dataset will be used.
  • Returns: recipe – The recipe created.
  • Return type: Recipe

Examples

Create a wrangling recipe to work with a Snowflake dataset:

>>> import datarobot as dr
>>> from datarobot.enums import DataWranglingDialect, RecipeType
>>> from datarobot.models.recipe_operation import RandomSamplingOperation
>>> my_use_case = dr.UseCase.list(search_params={"search": "My Use Case"})[0]
>>> dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f123')
>>> recipe = dr.Recipe.from_dataset(
...     use_case=my_use_case,
...     dataset=dataset,
...     dialect=DataWranglingDialect.SNOWFLAKE,
...     recipe_type=RecipeType.WRANGLING,
...     sampling=RandomSamplingOperation(rows=200)
... )
>>> recipe.update(name='My Snowflake wrangling recipe for dataset X')

class datarobot.models.recipe.RecipeSettings

Recipe settings for optional parameters that support or modify other recipe features or interactions. For example, some downsampling strategies require target and weights_feature to be set.

  • Parameters:
    • target (Optional[str]) – The feature to use as the target at the modeling stage.
    • weights_feature (Optional[str]) – The feature denoting weights.
    • prediction_point (Optional[str]) – The date column to be used as the prediction point for time-based feature engineering.
    • relationships_configuration_id (Optional[str]) – Deprecated since version 3.10: relationships_configuration_id is deprecated and has no effect. It will be removed in 3.12 or later.
    • feature_discovery_supervised_feature_reduction (Optional[bool]) – Whether to run supervised feature reduction for feature discovery.
    • spark_instance_size (Optional[SparkInstanceSizes]) – The Spark instance size to use, if applicable.
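
Examples

A minimal sketch of settings for weighted modeling (the column names are illustrative):

>>> from datarobot.models.recipe import RecipeSettings
>>> settings = RecipeSettings(
...     target='churn',
...     weights_feature='observation_weights'
... )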

class datarobot.models.recipe.RecipeMetadata

Recipe metadata for metadata fields that can be set on a recipe, e.g. name, description, etc.

  • Variables:
    • name (Optional[str]) – The name of the recipe.
    • description (Optional[str]) – The description of the recipe.
    • recipe_type (Optional[RecipeType]) – The type of the recipe.
    • sql (Optional[str]) – The SQL query of the transformation that the recipe performs.

class datarobot.models.recipe.RecipePreview

A preview of data output from the application of a recipe.

  • Variables:
    • columns (List[str]) – List of column names in the preview.
    • count (int) – Number of rows in the preview.
    • data (List[List[Any]]) – The preview data as a list of rows, where each row is a list of values.
    • total_count (int) – Total number of rows in the dataset.
    • byte_size (int) – Data memory usage in bytes.
    • result_schema (List[Dict[str, Any]]) – JDBC result schema for the preview data.
    • stored_count (int) – Number of rows available for preview.
    • estimated_size_exceeds_limit (bool) – Whether the estimated data size exceeds the sample size limit, indicating that downsampling should be applied.
    • next (Optional[RecipePreview]) – The next set of preview data, if available, otherwise None.
    • previous (Optional[RecipePreview]) – The previous set of preview data, if available, otherwise None.
    • df (pandas.DataFrame) – The preview data as a pandas DataFrame.
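
Examples

Page through preview data via the next attribute (assumes a preview retrieved with Recipe.get_preview()):

>>> import datarobot as dr
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> page = recipe.get_preview()
>>> while page is not None:
...     print(page.df.shape)
...     page = page.next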

Recipe Inputs

Inputs are datasets or data sources fed into recipes to be joined, filtered, or otherwise transformed by recipe operations. Recipes have a single primary data input, which is used as the base of the recipe data, and can have multiple secondary data inputs.

class datarobot.models.recipe.RecipeDatasetInput

Dataset input configuration used to specify a dataset as an input to a recipe.

  • Parameters:
    • input_type (RecipeInputType) – The type of the recipe input. Must be RecipeInputType.DATASET.
    • dataset_id (str) – ID of the dataset to use as an input.
    • dataset_version_id (Optional[str]) – ID of the dataset version to use as an input, if the snapshot policy is not set to latest.
    • snapshot_policy (Optional[DataWranglingSnapshotPolicy]) – The snapshot policy to use when selecting the dataset version.
    • sampling (Union[SamplingOperation, Dict[str, Any], None]) – Sampling operation to apply to the dataset input. This determines how much data is used from the dataset when generating a data preview of the recipe. Only required if this dataset is the primary input.
    • alias (Optional[str]) – Alias for the dataset input. Useful when crafting SQL statements used in transformations.

Examples

>>> import datarobot as dr
>>> from datarobot.models.recipe import RecipeDatasetInput
>>> from datarobot.enums import RecipeInputType, DataWranglingSnapshotPolicy
>>> from datarobot.models.recipe_operation import LimitSamplingOperation
>>> input_dataset = RecipeDatasetInput(
...     input_type=RecipeInputType.DATASET,
...     dataset_id='5f43a1b2c9e77f0001e6f123',
...     snapshot_policy=DataWranglingSnapshotPolicy.LATEST,
...     sampling=LimitSamplingOperation(rows=250)
... )

Create a RecipeDatasetInput from a Dataset.

>>> import datarobot as dr
>>> from datarobot.models.recipe import RecipeDatasetInput
>>> from datarobot.models.recipe_operation import LimitSamplingOperation
>>> dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f123')
>>> input_dataset = RecipeDatasetInput.from_dataset(
...     dataset=dataset,
...     sampling=LimitSamplingOperation(rows=250),
...     alias='my_dataset'
... )

classmethod from_dataset(dataset, snapshot_policy=DataWranglingSnapshotPolicy.LATEST, sampling=None, alias=None)

Create RecipeDatasetInput configuration for a Dataset.

  • Parameters:
    • dataset (Dataset) – Dataset to create RecipeDatasetInput from.
    • snapshot_policy (Optional[DataWranglingSnapshotPolicy]) – Snapshot policy to use when selecting the dataset version.
    • sampling (Optional[SamplingOperation]) – Sampling operation to apply to the input dataset. This determines how much data is used from the dataset when generating a data preview of the recipe. Only required if this dataset is the primary input.
    • alias (Optional[str]) – Alias for the dataset input. Useful when crafting SQL statements used in transformations.
  • Returns: The recipe dataset input created.
  • Return type: RecipeDatasetInput

class datarobot.models.recipe.JDBCTableDataSourceInput

JDBC data source input configuration used to specify a table from a JDBC data source as an input to a recipe.

  • Parameters:
    • input_type (RecipeInputType) – The type of the input. Must be RecipeInputType.DATASOURCE.
    • data_source_id (str) – ID of the JDBC data source.
    • data_store_id (str) – ID of the data store the data source connects to.
    • dataset_id (Optional[str]) – The ID of the dataset created from the data source.
    • sampling (Union[SamplingOperation, Dict[str, Any], None]) – Sampling operation to apply to the input dataset. This determines how much data is used from the dataset when generating a data preview of the recipe. Only required if this dataset is the primary input.
    • alias (Optional[str]) – Alias for the JDBC data source input. Useful when crafting SQL statements used in transformations.

Examples

>>> import datarobot as dr
>>> from datarobot.models.recipe import JDBCTableDataSourceInput
>>> from datarobot.enums import RecipeInputType
>>> from datarobot.models.recipe_operation import LimitSamplingOperation
>>> data_store = dr.DataStore.list(name="Snowflake Connection")[0]
>>> data_source = dr.DataSource.create(
...     data_source_type="jdbc",
...     canonical_name="My Snowflake connection",
...     params=dr.DataSourceParameters(
...         data_store_id=data_store.id,
...         schema="PUBLIC",
...         table="stock_prices",
...     )
... )
>>> dataset = data_source.create_dataset(do_snapshot=True)
>>> jdbc_input = JDBCTableDataSourceInput(
...     input_type=RecipeInputType.DATASOURCE,
...     data_source_id=data_source.id,
...     data_store_id=data_store.id,
...     dataset_id=dataset.id,
...     sampling=LimitSamplingOperation(rows=250),
...     alias='my_table_alias'
... )
>>> recipe = dr.Recipe.get('690e0ee89676e54e365b32e5')
>>> recipe.update(inputs=[jdbc_input])

class datarobot.models.recipe.DataSourceInput

Data source input configuration used to create a new recipe from a data store.

  • Parameters:
    • canonical_name (str) – The unique name of the data source.
    • table (str) – Table or view name in the data store.
    • schema (Optional[str]) – Schema associated with the table or view in the data store.
    • catalog (Optional[str]) – Catalog name in the data source, if supported.
    • sampling (SamplingOperation) – Sampling operation to apply to the data source input. This determines how much data is used from the data source for generating a data preview of the recipe. If specifying multiple data source inputs, only provide sampling for the first data source input.

Examples

Note: Canonical name must be unique to avoid collisions when creating a recipe from a data store. Append a unique identifier if necessary.

>>> from datarobot.models.recipe import DataSourceInput
>>> from datarobot.models.recipe_operation import RandomSamplingOperation
>>> from datetime import datetime
>>> now = datetime.now().strftime("%Y-%m-%d-%H_%M_%S")
>>> input_config = DataSourceInput(
...     canonical_name=f'Snowflake connection stock_prices_{now}',
...     table='stock_prices',
...     schema='PUBLIC',
...     sampling=RandomSamplingOperation(rows=500)
... )

class datarobot.models.recipe.DatasetInput

Wrapper for dataset input configuration passed when creating a new recipe from a dataset.

Deprecated since version 3.10: This class is deprecated and may be removed in 3.12 or later. Use the sampling parameter of Recipe.from_dataset() instead.

  • Parameters: sampling (SamplingOperation) – Sampling operation to apply to the dataset input.

Examples

>>> import datarobot as dr
>>> from datarobot.models.recipe import DatasetInput
>>> from datarobot.models.recipe_operation import LimitSamplingOperation
>>> from datarobot.enums import DataWranglingDialect, RecipeType
>>> my_use_case = dr.UseCase.list(search_params={"search": "My Use Case"})[0]
>>> dataset = dr.Dataset.get('5f43a1b2c9e77f0001e6f123')
>>> input_config = DatasetInput(
...     sampling=LimitSamplingOperation(rows=250)
... )
>>> recipe = dr.Recipe.from_dataset(
...     use_case=my_use_case,
...     dataset=dataset,
...     dialect=DataWranglingDialect.SPARK,
...     recipe_type=RecipeType.SQL,
...     inputs=[input_config]
... )

Recipe Operations

class datarobot.models.recipe_operation.BaseOperation

Single base transformation unit in a data wrangling recipe.

Sampling Operations

Sampling determines which rows from the recipe’s primary data input are used when generating a data preview. The primary data input is the first input in the list of recipe inputs. Only set sampling on the recipe’s primary data input.

class datarobot.models.recipe_operation.SamplingOperation

Base class for sampling operations.

class datarobot.models.recipe_operation.RandomSamplingOperation

A sampling technique that randomly selects the specified number of rows from the input when generating the sample data for a recipe.

  • Parameters:
    • rows (int) – The number of rows to sample.
    • seed (Optional[int]) – The random seed to use for sampling. Optional.

Examples

Using the default seed:

>>> from datarobot.models.recipe_operation import RandomSamplingOperation
>>> op = RandomSamplingOperation(rows=500)

Randomly generating a seed value:

>>> from datarobot.models.recipe_operation import RandomSamplingOperation
>>> import random
>>> random_op = RandomSamplingOperation(rows=500, seed=random.randint(1, 10000))

class datarobot.models.recipe_operation.LimitSamplingOperation

A sampling technique that samples the first N rows from the input when generating the sample data for a recipe.

  • Parameters: rows (int) – The number of rows to sample.

Examples

Using the limit sampling operation to sample the first 100 rows:

>>> from datarobot.models.recipe_operation import LimitSamplingOperation
>>> op = LimitSamplingOperation(rows=100)

class datarobot.models.recipe_operation.DatetimeSamplingOperation

A sampling technique that orders rows by a datetime partition column and samples the specified number of rows according to the chosen strategy (e.g. latest, earliest). Supports multiseries data.

  • Parameters:
    • datetime_partition_column (str) – The datetime partition column to order by.
    • rows (int) – The number of rows to sample.
    • strategy (Union[str, DatetimeSamplingStrategy, None]) – The datetime sampling strategy to use. Optional.
    • multiseries_id_column (Optional[str]) – Column name used to identify each time series within the input data. Required only for multiseries data.
    • selected_series (Optional[List[str]]) – The list of series identifiers to include when sampling multiseries data. Requires multiseries_id_column to be set.

Examples

Create a sampling operation to sample the latest 200 stock trades for tickers ‘AAPL’ and ‘MSFT’:

>>> from datarobot.models.recipe_operation import DatetimeSamplingOperation
>>> from datarobot.enums import DatetimeSamplingStrategy
>>> op = DatetimeSamplingOperation(
...     datetime_partition_column='trade_date',
...     rows=200,
...     strategy=DatetimeSamplingStrategy.LATEST,
...     multiseries_id_column='ticker',
...     selected_series=['AAPL', 'MSFT']
... )

class datarobot.models.recipe_operation.TableSampleSamplingOperation

A sampling technique that uses a table sample method to randomly select a percentage of rows from the input when generating the sample data for a recipe. Not supported for all data inputs. For data stores that support table sampling, this method is generally more efficient than random sampling.

  • Parameters:
    • percent (int) – The percentage (%) of rows to sample (0-100).
    • seed (Optional[int]) – The random seed to use for sampling. Optional.

Examples

Sample using 50% of the input datasource using the default seed:

>>> from datarobot.models.recipe_operation import TableSampleSamplingOperation
>>> op = TableSampleSamplingOperation(percent=50)

Downsampling Operations

Downsampling reduces the size of the dataset published for faster experimentation.

class datarobot.models.recipe_operation.DownsamplingOperation

Base class for downsampling operations.

class datarobot.models.recipe_operation.RandomDownsamplingOperation

A downsampling technique that reduces the size of the majority class using random sampling (i.e., each sample has an equal probability of being chosen).

  • Parameters:
    • max_rows (int) – The maximum number of rows to downsample to.
    • seed (Optional[int]) – The random seed to use for downsampling.

Examples

Using the default seed:

>>> from datarobot.models.recipe_operation import RandomDownsamplingOperation
>>> op = RandomDownsamplingOperation(max_rows=600)

Randomly generating a seed value:

>>> from datarobot.models.recipe_operation import RandomDownsamplingOperation
>>> import random
>>> random_op = RandomDownsamplingOperation(max_rows=600, seed=random.randint(1, 10000))

class datarobot.models.recipe_operation.SmartDownsamplingOperation

A downsampling technique that relies on the distribution of target values to reduce the dataset size, recording how much each class was sampled in a new column.

For this technique to work, ensure the recipe’s settings have target and weights_feature set.

  • Parameters:
    • max_rows (int) – The maximum number of rows to downsample to.
    • method (SmartDownsamplingMethod) – The downsampling method to use.
    • seed (Optional[int]) – The random seed to use for downsampling.

Examples

>>> from datarobot.models.recipe_operation import SmartDownsamplingOperation
>>> from datarobot.enums import SmartDownsamplingMethod
>>> op = SmartDownsamplingOperation(max_rows=1000, method=SmartDownsamplingMethod.BINARY)
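
Smart downsampling requires target and weights_feature in the recipe settings; a sketch of wiring both together (column names are illustrative):

>>> import datarobot as dr
>>> from datarobot.models.recipe import RecipeSettings
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> recipe.update(
...     settings=RecipeSettings(target='churn', weights_feature='weights'),
...     downsampling=op
... )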

Wrangling Operations

class datarobot.models.recipe_operation.WranglingOperation

Base class for data wrangling operations.

class datarobot.models.recipe_operation.LagsOperation

Data wrangling operation to create one or more lags for a feature, ordered by a datetime partition column. This operation will create a new column for each lag order specified.

  • Parameters:
    • column (str) – Column name to create lags for.
    • orders (List[int]) – List of lag orders to create.
    • datetime_partition_column (str) – Column name used to partition the data by datetime. Used to order the data for lag creation.
    • multiseries_id_column (Optional[str]) – Column name used to identify time series within the data. Required only for multiseries.

Examples

Create lags of orders 1, 5 and 30 in stock price data on opening price column “open_price”, ordered by datetime column “date”. The data contains multiple time series identified by “ticker_symbol”:

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import LagsOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> lags_op = LagsOperation(
...     column="open_price",
...     orders=[1, 5, 30],
...     datetime_partition_column="date",
...     multiseries_id_column="ticker_symbol",
... )
>>> recipe.update(operations=[lags_op])

class datarobot.models.recipe_operation.WindowCategoricalStatsOperation

Data wrangling operation to calculate categorical statistics for a rolling window. This operation will create a new column for each method specified.

  • Parameters:
    • column (str) – Column name to create rolling statistics for.
    • window_size (int) – Number of rows to include in the rolling window.
    • methods (List[CategoricalStatsMethods]) – List of methods to apply for rolling statistics. Currently only supports datarobot.enums.CategoricalStatsMethods.MOST_FREQUENT.
    • datetime_partition_column (str) – Column name used to partition the data by datetime. Used to order the timeseries data.
    • multiseries_id_column (Optional[str]) – Column name used to identify each time series within the data. Required only for multiseries.
    • rolling_most_frequent_udf (Optional[str]) – Fully qualified path to a rolling most frequent user-defined function. Used to optimize SQL execution with Snowflake.

Examples

Create rolling categorical statistics to track the most frequent product category purchased by customers based on their last 50 purchases:

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import WindowCategoricalStatsOperation
>>> from datarobot.enums import CategoricalStatsMethods
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> window_cat_stats_op = WindowCategoricalStatsOperation(
...     column="product_category",
...     window_size=50,
...     methods=[CategoricalStatsMethods.MOST_FREQUENT],
...     datetime_partition_column="purchase_date",
...     multiseries_id_column="customer_id",
... )
>>> recipe.update(operations=[window_cat_stats_op])

class datarobot.models.recipe_operation.WindowNumericStatsOperation

Data wrangling operation to calculate numeric statistics for a rolling window. This operation will create one or more new columns.

  • Parameters:
    • column (str) – Column name to create rolling statistics for.
    • window_size (int) – Number of rows to include in the rolling window.
    • methods (List[NumericStatsMethods]) – List of methods to apply for rolling statistics. A new column will be created for each method.
    • datetime_partition_column (str) – Column name used to partition the data by datetime. Used to order the timeseries data.
    • multiseries_id_column (Optional[str]) – Column name used to identify each time series within the data. Required only for multiseries.
    • rolling_median_udf (Optional[str]) – Fully qualified path to a rolling median user-defined function. Used to optimize SQL execution with Snowflake.

Examples

Create rolling numeric statistics to track the maximum, minimum, and median stock prices over the last 7 trading sessions:

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import WindowNumericStatsOperation
>>> from datarobot.enums import NumericStatsMethods
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> window_num_stats_op = WindowNumericStatsOperation(
...     column="stock_price",
...     window_size=7,
...     methods=[
...         NumericStatsMethods.MAX,
...         NumericStatsMethods.MIN,
...         NumericStatsMethods.MEDIAN,
...     ],
...     datetime_partition_column="trading_date",
...     multiseries_id_column="ticker_symbol",
... )
>>> recipe.update(operations=[window_num_stats_op])

class datarobot.models.recipe_operation.TimeSeriesOperation

Data wrangling operation to generate a dataset ready for time series modeling, with a forecast point, forecast distances, known-in-advance columns, etc.

  • Parameters:
    • target_column (str) – Target column to use for generating naive baseline features during feature reduction.
    • datetime_partition_column (str) – Column name used to partition the data by datetime. Used to order the time series data.
    • forecast_distances (List[int]) – List of forecast distances to generate features for. Each distance represents a relative position that determines how many rows ahead to predict.
    • task_plan (List[TaskPlanElement]) – List of task plans for each column.
    • baseline_periods (Optional[List[int]]) – List of integers representing the periodicities used to generate naive baseline features from the target. Baseline period = 1 corresponds to the naive latest baseline.
    • known_in_advance_columns (Optional[List[str]]) – List of columns that are known in advance at prediction time, i.e. features that do not need to be lagged.
    • multiseries_id_column (Optional[str]) – Column name used to identify each time series within the data. Required only for multiseries.
    • rolling_median_udf (Optional[str]) – Fully qualified path to a rolling median user-defined function. Used to optimize SQL execution with Snowflake.
    • rolling_most_frequent_udf (Optional[str]) – Fully qualified path to a rolling most frequent user-defined function. Used to optimize SQL execution with Snowflake.
    • forecast_point (Optional[datetime]) – The forecast point to use at prediction time.

Examples

Create a time series operation for sales forecasting with forecast distances of 7 and 30 days, using the sale amount as the target column, the date of the sale for datetime ordering, and “store_id” as the multiseries identifier. The operation includes a task plan to compute lags of orders 1, 7, and 30 on the sales amount, and specifies known in advance columns “promotion” and “holiday_flag”:

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import TimeSeriesOperation, TaskPlanElement, Lags
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> task_plan = [
...     TaskPlanElement(
...         column="sales_amount",
...         task_list=[Lags(orders=[1, 7, 30])]
...     )
... ]
>>> time_series_op = TimeSeriesOperation(
...     target_column="sales_amount",
...     datetime_partition_column="sale_date",
...     forecast_distances=[7, 30],
...     task_plan=task_plan,
...     known_in_advance_columns=["promotion", "holiday_flag"],
...     multiseries_id_column="store_id"
... )
>>> recipe.update(operations=[time_series_op])

class datarobot.models.recipe_operation.ComputeNewOperation

Data wrangling operation to create a new feature computed using a SQL expression.

  • Parameters:
    • expression (str) – SQL expression to compute the new feature.
    • new_feature_name (str) – Name of the new feature.

Examples

Create a new feature “total_sales” by summing the total of “online_sales” and “in_store_sales”, rounded to the nearest dollar:

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import ComputeNewOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> compute_new_op = ComputeNewOperation(
...     expression="ROUND(online_sales + in_store_sales, 0)",
...     new_feature_name="total_sales"
... )
>>> recipe.update(operations=[compute_new_op])

class datarobot.models.recipe_operation.RenameColumnsOperation

Data wrangling operation to rename one or more columns.

  • Parameters: column_mappings (Dict[str, str]) – Mapping of original column names to new column names.

Examples

Rename columns “old_name1” to “new_name1” and “old_name2” to “new_name2”:

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import RenameColumnsOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> rename_op = RenameColumnsOperation(
...     column_mappings={'old_name1': 'new_name1', 'old_name2': 'new_name2'}
... )
>>> recipe.update(operations=[rename_op])

class datarobot.models.recipe_operation.FilterOperation

Data wrangling operation to filter rows based on one or more conditions.

  • Parameters:
    • conditions (List[FilterCondition]) – List of conditions to filter on.
    • keep_rows (Optional[bool]) – Whether matching rows should be kept (True) or dropped (False).
    • operator (Optional[str]) – Operator to use between conditions when using multiple conditions. Allowed values: [and, or].

Examples

Filter input to only keep users older than 18:

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import FilterOperation, FilterCondition
>>> from datarobot.enums import FilterOperationFunctions
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> condition = FilterCondition(
...     column="age",
...     function=FilterOperationFunctions.GREATER_THAN,
...     function_arguments=[18]
... )
>>> filter_op = FilterOperation(conditions=[condition], keep_rows=True)
>>> recipe.update(operations=[filter_op])

Filter input to filter out rows where “status” is either “inactive” or “banned”:

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import FilterOperation, FilterCondition
>>> from datarobot.enums import FilterOperationFunctions
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> inactive_cond = FilterCondition(
...     column="status",
...     function=FilterOperationFunctions.EQUALS,
...     function_arguments=["inactive"]
... )
>>> banned_cond = FilterCondition(
...     column="status",
...     function=FilterOperationFunctions.EQUALS,
...     function_arguments=["banned"]
... )
>>> filter_op = FilterOperation(
...     conditions=[inactive_cond, banned_cond],
...     keep_rows=False,
...     operator="or"
... )
>>> recipe.update(operations=[filter_op])

class datarobot.models.recipe_operation.DropColumnsOperation

Data wrangling operation to drop one or more columns.

  • Parameters: columns (List[str]) – Columns to drop.

Examples

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import DropColumnsOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> drop_op = DropColumnsOperation(columns=['col1', 'col2'])
>>> recipe.update(operations=[drop_op])

class datarobot.models.recipe_operation.DedupeRowsOperation

Data wrangling operation to remove duplicate rows. Uses values from all columns.

Examples

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import DedupeRowsOperation
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> dedupe_op = DedupeRowsOperation()
>>> recipe.update(operations=[dedupe_op])

class datarobot.models.recipe_operation.FindAndReplaceOperation

Data wrangling operation to find and replace strings in a column.

  • Parameters:
    • column (str) – Column name to perform find and replace on.
    • find (str) – String or expression to find.
    • replace_with (str) – String to replace with.
    • match_mode (FindAndReplaceMatchMode) – Match mode to use when finding strings.
    • is_case_sensitive (bool) – Whether the find operation should be case sensitive.

Examples

Set Recipe operations to search for exact match of “old_value” in column “col1” and replace with “new_value”:

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import FindAndReplaceOperation
>>> from datarobot.enums import FindAndReplaceMatchMode
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> find_replace_op = FindAndReplaceOperation(
...     column="col1",
...     find="old_value",
...     replace_with="new_value",
...     match_mode=FindAndReplaceMatchMode.EXACT,
...     is_case_sensitive=True
... )
>>> recipe.update(operations=[find_replace_op])

Set Recipe operations to use a regular expression to find names starting with “Brand” in column “name” and replace them with “Lyra”:

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import FindAndReplaceOperation
>>> from datarobot.enums import FindAndReplaceMatchMode
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> find_replace_op = FindAndReplaceOperation(
...     column="name",
...     find="^Brand.*",
...     replace_with="Lyra",
...     match_mode=FindAndReplaceMatchMode.REGEX
... )
>>> recipe.update(operations=[find_replace_op])

class datarobot.models.recipe_operation.AggregationOperation

Data wrangling operation to compute aggregate metrics for one or more features by grouping data by one or more columns. This operation will retain all group by columns in the output dataset and create a new column for each aggregation function applied to each feature chosen for aggregation.

  • Parameters:
    • aggregations (List[AggregateFeature]) – List of features to aggregate with the aggregation functions to apply on each feature. Any features in the list of aggregations should not appear in the group_by_columns list.
    • group_by_columns (List[str]) – List of columns to group by. Any column name in this list should not appear in the list of features to aggregate.

Examples

Create an aggregation operation to compute the total and average sales amounts, and total sales quantity per region. This will create 3 new columns sales_amount_sum, sales_amount_avg, and sales_quantity_sum in the output dataset:

>>> import datarobot as dr
>>> from datarobot.models.recipe_operation import AggregationOperation, AggregateFeature
>>> from datarobot.enums import AggregationFunctions
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> agg_sales = AggregateFeature(
...     feature="sales_amount",
...     functions=[AggregationFunctions.SUM, AggregationFunctions.AVG]
... )
>>> agg_quantity = AggregateFeature(
...     feature="sales_quantity",
...     functions=[AggregationFunctions.SUM]
... )
>>> aggregation_op = AggregationOperation(
...     aggregations=[agg_sales, agg_quantity],
...     group_by_columns=["region"]
... )
>>> recipe.update(operations=[aggregation_op])

class datarobot.models.recipe_operation.JoinOperation

Data wrangling operation to join an additional data input to the current data. The additional data input is treated as the right side of the join. The additional data input must be added to the recipe inputs when updating the recipe with this operation.

The join condition only supports equality predicates. Multiple fields are combined with AND operators (e.g., JOIN A, B ON A.x = B.y AND A.z = B.z AND A.t = B.t).

Examples

Join customer details with an additional dataset of credit card information using customer id:

>>> import datarobot as dr
>>> from datarobot.models.recipe import RecipeDatasetInput
>>> from datarobot.models.recipe_operation import JoinOperation
>>> from datarobot.enums import JoinType
>>> cc_dataset = dr.Dataset.get('5f43a1e2e4b0c123456789ab')
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> inputs = [recipe.inputs[0], RecipeDatasetInput.from_dataset(cc_dataset)]
>>> join_op = JoinOperation.join_dataset(
...     dataset=cc_dataset,
...     join_type=JoinType.INNER,
...     right_prefix='cc_',
...     left_keys=['customer_id'],
...     right_keys=['customer_id']
... )
>>> recipe.update(operations=[join_op], inputs=inputs)

Join sales data with a reference table of sales targets that applies to all stores (Cartesian join to broadcast targets to every sales record):

>>> import datarobot as dr
>>> from datarobot.models.recipe import JDBCTableDataSourceInput
>>> from datarobot.models.recipe_operation import JoinOperation
>>> from datarobot.enums import JoinType, RecipeInputType
>>> recipe = dr.Recipe.get('690bbf77aa31530d8287ae5f')
>>> data_source_id = "647873c5a721e5647c15bbdc"
>>> reference_table_input = JDBCTableDataSourceInput(
...     input_type=RecipeInputType.DATASOURCE,
...     data_source_id=data_source_id,
...     data_store_id="6418452b8a79f972e8ffe208",
...     alias="targets_table"
... )
>>> inputs = [recipe.inputs[0], reference_table_input]
>>> join_op = JoinOperation.join_jdbc_data_source_table(
...     data_source_id=data_source_id,
...     join_type=JoinType.CARTESIAN,
...     right_prefix='ref_'
... )
>>> recipe.update(operations=[join_op], inputs=inputs)

classmethod join_dataset(dataset, join_type, right_prefix=None, left_keys=None, right_keys=None)

Create a JoinOperation to join a dataset to the data in the recipe.

  • Parameters:
    • dataset (Dataset) – Dataset to join with. This dataset must already be added to the recipe inputs.
    • join_type (JoinType) – Type of join to perform.
    • right_prefix (Optional[str]) – Optional prefix to add to all column names from the joined dataset in the join result.
    • left_keys (Optional[List[str]]) – List of column names to be used in the “ON” clause for the left side of the join. Required for inner and left joins, not used for Cartesian joins.
    • right_keys (Optional[List[str]]) – List of column names to be used in the “ON” clause for the right side of the join. Required for inner and left joins, not used for Cartesian joins.
  • Return type: JoinOperation

classmethod join_jdbc_data_source_table(data_source_id, join_type, right_prefix=None, left_keys=None, right_keys=None)

Create a JoinOperation to join a JDBC table input from a data source to the data in the recipe.

  • Parameters:
    • data_source_id (str) – Data source ID for the JDBC table to join with. This data source must already be added to the recipe inputs.
    • join_type (JoinType) – Type of join to perform.
    • right_prefix (Optional[str]) – Optional prefix to add to all column names from the joined table in the join result.
    • left_keys (Optional[List[str]]) – List of column names to be used in the “ON” clause for the left side of the join. Required for inner and left joins, not used for Cartesian joins.
    • right_keys (Optional[List[str]]) – List of column names to be used in the “ON” clause for the right side of the join. Required for inner and left joins, not used for Cartesian joins.
  • Return type: JoinOperation

Enums and Helpers

class datarobot.models.recipe_operation.TaskPlanElement

Represents a task plan element for a specific column in a time series operation.

  • Parameters:
    • column (str) – Column name for which the task plan is defined.
    • task_list (List[BaseTimeAwareTask]) – List of time-aware tasks to be applied to the column.
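
Examples

A brief sketch combining a lag task and a rolling numeric statistic for one column (names follow the TimeSeriesOperation example above):

>>> from datarobot.models.recipe_operation import TaskPlanElement, Lags, NumericStats
>>> from datarobot.enums import NumericStatsMethods
>>> element = TaskPlanElement(
...     column='sales_amount',
...     task_list=[
...         Lags(orders=[1, 7, 30]),
...         NumericStats(methods=[NumericStatsMethods.MEDIAN], window_size=7)
...     ]
... )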

class datarobot.models.recipe_operation.BaseTimeAwareTask

Base class for time-aware tasks in time series operation task plan.

class datarobot.models.recipe_operation.CategoricalStats

Time-aware task to compute categorical statistics for a rolling window.

  • Parameters:
    • methods (List[CategoricalStatsMethods]) – List of categorical statistical methods to apply for rolling statistics.
    • window_size (int) – Number of rows to include in the rolling window.
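
Examples

A minimal sketch using the MOST_FREQUENT method shown elsewhere in this document:

>>> from datarobot.models.recipe_operation import CategoricalStats
>>> from datarobot.enums import CategoricalStatsMethods
>>> task = CategoricalStats(
...     methods=[CategoricalStatsMethods.MOST_FREQUENT],
...     window_size=50
... )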

class datarobot.models.recipe_operation.NumericStats

Time-aware task to compute numeric statistics for a rolling window.

  • Parameters:
    • methods (List[NumericStatsMethods]) – List of numeric statistical methods to apply for rolling statistics.
    • window_size (int) – Number of rows to include in the rolling window.

class datarobot.models.recipe_operation.Lags

Time-aware task to create one or more lags for a feature.

  • Parameters: orders (List[int]) – List of lag orders to create.

class datarobot.enums.CategoricalStatsMethods

Supported categorical stats methods for data wrangling.

class datarobot.enums.NumericStatsMethods

Supported numeric stats methods for data wrangling.

class datarobot.models.recipe_operation.FilterCondition

Condition to filter rows in a FilterOperation.

  • Parameters:
    • column (str) – Column name to apply the condition on.
    • function (FilterOperationFunctions) – The filtering function to use.
    • function_arguments (List[Union[str, int, float]]) – The list of arguments for the filtering function.

Examples

FilterCondition to filter rows where “age” is between 18 and 65:

>>> from datarobot.models.recipe_operation import FilterCondition
>>> from datarobot.enums import FilterOperationFunctions
>>> condition = FilterCondition(
...     column="age",
...     function=FilterOperationFunctions.BETWEEN,
...     function_arguments=[18, 65]
... )

class datarobot.enums.FilterOperationFunctions

Operations supported in a FilterCondition.

class datarobot.models.recipe_operation.AggregateFeature

Feature to aggregate and the aggregation functions to apply in an AggregationOperation.

  • Parameters:
    • feature (str) – Feature to aggregate.
    • functions (List[AggregationFunctions]) – List of aggregation functions to apply. A new column will be created for each function. Some feature types may not support all aggregation functions, e.g. categorical features do not support numeric aggregation functions like SUM or AVG.

Examples

AggregateFeature to compute the sum and average of sales:

>>> from datarobot.models.recipe_operation import AggregateFeature
>>> from datarobot.enums import AggregationFunctions
>>> aggregate_feature = AggregateFeature(
...     feature="sales_amount",
...     functions=[AggregationFunctions.SUM, AggregationFunctions.AVG]
... )

class datarobot.enums.AggregationFunctions

Supported aggregation functions for data wrangling.

class datarobot.enums.FindAndReplaceMatchMode

Find and replace modes used when searching for strings to replace.

class datarobot.enums.DatetimeSamplingStrategy

Supported datetime sampling strategies.

class datarobot.enums.SmartDownsamplingMethod

Smart downsampling methods.

class datarobot.enums.DataWranglingSnapshotPolicy

Data wrangling snapshot policy options.

class datarobot.enums.RecipeType

Data wrangling supported recipe types.

class datarobot.enums.DataWranglingDialect

Data wrangling supported dialects.

class datarobot.enums.DataWranglingDataSourceTypes

Data wrangling supported data source types.

class datarobot.enums.RecipeInputType

Data wrangling supported recipe input types.