A project built from a particular training dataset
Variables:
id (str) – the id of the project
project_name (str) – the name of the project
project_description (str) – an optional description for the project
mode (int) – The current autopilot mode. 0: Full Autopilot. 2: Manual Mode.
4: Comprehensive Autopilot. null: Mode not set.
target (str) – the name of the selected target features
target_type (str) – Indicating what kind of modeling is being done in this project Options are: ‘Regression’,
‘Binary’ (Binary classification), ‘Multiclass’ (Multiclass classification),
‘Multilabel’ (Multilabel classification)
holdout_unlocked (bool) – whether the holdout has been unlocked
metric (str) – the selected project metric (e.g. LogLoss)
stage (str) – the stage the project has reached - one of datarobot.enums.PROJECT_STAGE
partition (dict) – information about the selected partitioning options
positive_class (str) – for binary classification projects, the selected positive class; otherwise, None
created (datetime.datetime) – the time the project was created
advanced_options (AdvancedOptions) – information on the advanced options that were selected for the project settings,
e.g. a weights column or a cap of the runtime of models that can advance autopilot stages
max_train_pct (float) – The maximum percentage of the project dataset that can be used without going into the
validation data or being too large to submit any blueprint for training
max_train_rows (int) – the maximum number of rows that can be trained on without going into the validation data
or being too large to submit any blueprint for training
file_name (str) – The name of the file uploaded for the project dataset
credentials (Optional[List]) – A list of credentials for the datasets used in relationship configuration
(previously graphs). For Feature Discovery projects, the list must be formatted
in dictionary record format. Provide the catalogVersionId and credentialId
for each dataset that is to be used in the project that requires authentication.
feature_engineering_prediction_point (Optional[str]) – For time-aware Feature Engineering, this parameter specifies the column from the
primary dataset to use as the prediction point.
unsupervised_mode (Optional[bool]) – (New in version v2.20) defaults to False, indicates whether this is an unsupervised project.
relationships_configuration_id (Optional[str]) – (New in version v2.21) id of the relationships configuration to use
query_generator_id (Optional[str]) – (New in version v2.27) id of the query generator applied for time series data prep
segmentation (dict, optional) – information on the segmentation options for segmented project
partitioning_method (PartitioningMethod, optional) – (New in version v3.0) The partitioning class for this project. This attribute should only be used
with newly-created projects and before calling Project.analyze_and_model(). After the project has been
aimed, see Project.partition for actual partitioning options.
catalog_id (str) – (New in version v3.0) ID of the dataset used during creation of the project.
catalog_version_id (str) – (New in version v3.0) The object ID of the catalog_version which the project’s dataset belongs to.
use_gpu (bool) – (New in version v3.2) Whether project allows usage of GPUs
Accepts either an AdvancedOptions object or individual keyword arguments.
This is an in-place update.
Raises:ValueError – Raised if an object passed to the options parameter is not an AdvancedOptions instance,
a valid keyword argument from the AdvancedOptions class, or a combination of an AdvancedOptions
instance AND keyword arguments.
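Example (illustrative sketch): saving advanced options on an existing project via set_options, which later sections reference. Assumes project is an existing Project instance; the weights column name is a placeholder.

from datarobot import AdvancedOptions

# Replace all saved advanced options with a new AdvancedOptions object
project.set_options(options=AdvancedOptions(weights="sample_weight"))

# Or update individual options with keyword arguments instead
project.set_options(weights="sample_weight")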
Project creation is an asynchronous process: after the initial request, the client
keeps polling the status of the async process responsible for project creation
until it finishes.
For SDK users this only means that this method might raise
exceptions related to its asynchronous nature.
Parameters:
sourcedata (basestring, file, pathlib.Path or pandas.DataFrame) – Dataset to use for the project.
If a string, it can be either a path to a local file, a URL to a publicly
available file, or raw file content. If using a file, the filename
must consist of ASCII characters only.
project_name (str, unicode, optional) – The name to assign to the empty project.
max_wait (Optional[int]) – Time in seconds after which project creation is considered
unsuccessful
read_timeout (int) – The maximum number of seconds to wait for the server to respond indicating that the
initial upload is complete
dataset_filename (string or None, optional) – (New in version v2.14) File name to use for dataset.
Ignored for url and file path sources.
use_case (UseCase | string, optional) – A single UseCase object or ID to add this new Project to. Must be a kwarg.
AsyncFailureError – Polling for status of async process resulted in response
with unsupported status code. Beginning in version 2.1, this
will be ProjectAsyncFailureError, a subclass of AsyncFailureError
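Example (illustrative): a minimal Project.create call with a local file; the file and project names are placeholders.

import datarobot as dr

project = dr.Project.create(
    sourcedata="10kDiabetes.csv",
    project_name="Diabetes readmission",
)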
Create a project from a datasource on a WebHDFS server.
Parameters:
url (str) – The location of the WebHDFS file, both server and full path. Per the DataRobot
specification, must begin with hdfs://, e.g. hdfs:///tmp/10kDiabetes.csv
port (Optional[int]) – The port to use. If not specified, will default to the server default (50070)
project_name (Optional[str]) – A name to give to the project
max_wait (int) – The maximum number of seconds to wait before giving up.
p = Project.create_from_hdfs('hdfs:///tmp/somedataset.csv', project_name="New API project")
p.id
>>> '5921731dkqshda8yd28h'
p.project_name
>>> 'New API project'
Create a project from a data source. Either data_source or data_source_id
should be specified.
Parameters:
data_source_id (str) – the identifier of the data source.
username (Optional[str]) – The username for database authentication. If supplied password must also be supplied.
password (Optional[str]) – The password for database authentication. The password is encrypted
at server side and never saved / stored. If supplied username must also be supplied.
credential_id (Optional[str]) – The ID of the set of credentials to
use instead of user and password. Note that with this change, username and password
will become optional.
use_kerberos (Optional[bool]) – Server default is False.
If true, use kerberos authentication for database authentication.
credential_data (dict, optional) – The credentials to authenticate with the database, to use instead of user/password or
credential ID.
project_name (Optional[str]) – optional, a name to give to the project.
max_wait (int) – optional, the maximum number of seconds to wait before giving up.
use_case (UseCase | string, optional) – A single UseCase object or ID to add this new Project to. Must be a kwarg.
Raises:InvalidUsageError – Raised if either username or password is passed without the other.
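Example (illustrative sketch, assuming this is Project.create_from_data_source): creating a project from a data source with stored credentials. The IDs shown are placeholders.

import datarobot as dr

project = dr.Project.create_from_data_source(
    data_source_id="5e8c6a34d2426053ab9a39ed",
    credential_id="5e8c6a34d2426053ab9a39ee",
    project_name="Project from data source",
)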
dataset_id (string) – The ID of the dataset entry to use for the project’s dataset
dataset_version_id (string, optional) – The ID of the dataset version to use for the project dataset. If not specified - uses
latest version associated with dataset_id
project_name (string, optional) – The name of the project to be created.
If not specified, will be “Untitled Project” for database connections, otherwise
the project name will be based on the file used.
user (string, optional) – The username for database authentication.
password (string, optional) – The password (in cleartext) for database authentication. The password
will be encrypted on the server side in scope of HTTP request and never saved or stored
credential_id (string, optional) – The ID of the set of credentials to use instead of user and password.
use_kerberos (Optional[bool]) – Server default is False.
If true, use kerberos authentication for database authentication.
use_sample_from_dataset (Optional[bool]) – Server default is False.
If true, use the EDA sample for the project instead of the full data.
It is optional for datasets between 500 MB and 10 GB.
For datasets over 10 GB, this is always set to True on the server side.
credential_data (dict, optional) – The credentials to authenticate with the database, to use instead of user/password or
credential ID.
max_wait (int) – optional, the maximum number of seconds to wait before giving up.
use_case (UseCase | string, optional) – A single UseCase object or ID to add this new Project to. Must be a kwarg.
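Example (illustrative sketch, assuming this is Project.create_from_dataset): the dataset ID and project name are placeholders.

import datarobot as dr

project = dr.Project.create_from_dataset(
    dataset_id="5e8c6a34d2426053ab9a39ed",
    project_name="Project from catalog dataset",
)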
Given a temporary async status location poll for no more than max_wait seconds
until the async process (project creation or setting the target, for example)
finishes successfully, then return the ready project
Parameters:
async_location (str) – The URL for the temporary async status resource. This is returned
as a header in the response to a request that initiates an
async process
max_wait (int) – The maximum number of seconds to wait before giving up.
Chain together project creation, file upload, and target selection.
Notes
While this function provides a simple means to get started, it does not expose
all possible parameters. For advanced usage, using create, set_advanced_options
and analyze_and_model directly is recommended.
Parameters:
sourcedata (str or pandas.DataFrame) – The path to the file to upload. Can be either a path to a
local file or a publicly accessible URL (starting with http://, https://,
file://, or s3://). If the source is a DataFrame, it will be serialized to a
temporary buffer.
If using a file, the filename must consist of ASCII
characters only.
target (Optional[str]) – The name of the target column in the uploaded file. Should not be provided if
unsupervised_mode is True.
project_name (str) – The project name.
worker_count (Optional[int]) – The number of workers that you want to allocate to this project.
metric (Optional[str]) – The name of metric to use.
autopilot_on (boolean, default True) – Whether or not to begin modeling automatically.
blueprint_threshold (Optional[int]) – Number of hours the model is permitted to run.
Minimum 1
response_cap (Optional[float]) – Quantile of the response distribution to use for response capping
Must be in range 0.5 .. 1.0
positive_class (str, float, or int; optional) – Specifies a level of the target column that should be treated as the
positive class for binary classification. May only be specified
for binary classification targets.
target_type (Optional[str]) – Override the automatically selected target_type. An example usage would be setting the
target_type=’Multiclass’ when you want to perform a multiclass classification task on a
numeric column that has a low cardinality.
You can use TARGET_TYPE enum.
unsupervised_mode (boolean, default False) – Specifies whether to create an unsupervised project.
blend_best_models (Optional[bool]) – blend best models during Autopilot run
scoring_code_only (Optional[bool]) – Keep only models that can be converted to scorable java code during Autopilot run.
shap_only_mode (Optional[bool]) – Keep only models that support SHAP values during Autopilot run. Use SHAP-based insights
wherever possible. Defaults to False.
prepare_model_for_deployment (Optional[bool]) – Prepare model for deployment during Autopilot run.
The preparation includes creating reduced feature list models, retraining best model on
higher sample size, computing insights and assigning “RECOMMENDED FOR DEPLOYMENT” label.
consider_blenders_in_recommendation (Optional[bool]) – Include blenders when selecting a model to prepare for deployment in an Autopilot Run.
Defaults to False.
min_secondary_validation_model_count (Optional[int]) – Compute “All backtest” scores (datetime models) or cross validation scores
for the specified number of highest ranking models on the Leaderboard,
if over the Autopilot default.
relationships_configuration_id (Optional[str]) – (New in version v2.23) id of the relationships configuration to use
autopilot_with_feature_discovery (Optional[bool].) – (New in version v2.23) If true, autopilot will run on a feature list that includes
features found via search for interactions.
feature_discovery_supervised_feature_reduction (Optional[bool]) – (New in version v2.23) Run supervised feature reduction for feature discovery projects.
unsupervised_type (UnsupervisedTypeEnum, optional) – (New in version v2.27) Specifies whether an unsupervised project is anomaly detection
or clustering.
autopilot_cluster_list (list(int), optional) – (New in version v2.27) Specifies the list of clusters to build for each model during
Autopilot. Specifying multiple values in a list will build models with each number
of clusters for the Leaderboard.
bias_mitigation_feature_name (Optional[str]) – The feature from protected features that will be used in a bias mitigation task to
mitigate bias
bias_mitigation_technique (Optional[str]) – One of datarobot.enums.BiasMitigationTechnique
Options:
‘preprocessingReweighing’
‘postProcessingRejectionOptionBasedClassification’
The technique by which we’ll mitigate bias, which will inform which bias mitigation task
we insert into blueprints
include_bias_mitigation_feature_as_predictor_variable (Optional[bool]) – Whether we should also use the mitigation feature as in input to the modeler just like
any other categorical used for training, i.e. do we want the model to “train on” this
feature in addition to using it for bias mitigation
use_case (UseCase | string, optional) – A single UseCase object or ID to add this new Project to. Must be a kwarg.
Returns:project – The newly created and initialized project.
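Example (illustrative): a minimal Project.start call; the file, target, and project names are placeholders.

import datarobot as dr

project = dr.Project.start(
    sourcedata="10kDiabetes.csv",
    target="readmitted",
    project_name="Diabetes readmission",
)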
Returns the projects associated with this account.
Parameters:
search_params (dict, optional.) –
If not None, the returned projects are filtered by lookup.
Currently you can query projects by:
* project_name
* use_cases (Union[UseCase, List[UseCase], str, List[str]], optional.) – If not None, the returned projects are filtered to those associated
with a specific Use Case or Use Cases. Accepts either the entity or the ID.
* offset (Optional[int]) – If provided, specifies the number of results to skip.
* limit (Optional[int]) – If provided, specifies the maximum number of results to return. If not provided,
returns a maximum of 1000 results.
* Returns:projects – Contains a list of projects associated with this user
account.
* Return type:list of Project instances
* Raises:TypeError – Raised if search_params parameter is provided,
but is not of supported type.
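Example (illustrative sketch, assuming this is Project.list): listing projects filtered by name; the search term is a placeholder.

import datarobot as dr

projects = dr.Project.list(search_params={"project_name": "Diabetes"})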
Set target variable of an existing project and begin the autopilot process or send data to DataRobot
for feature analysis only if manual mode is specified.
Any options saved using set_options will be used if nothing is passed to advanced_options.
However, saved options will be ignored if advanced_options are passed.
Target setting is an asynchronous process: after the initial request, the client
keeps polling the status of the async process responsible for target setting
until it finishes.
For SDK users this only means that this method might raise
exceptions related to its asynchronous nature.
When execution returns to the caller, the autopilot process will already have commenced
(again, unless manual mode is specified).
Parameters:
target (Optional[str]) – The name of the target column in the uploaded file. Should not be provided if
unsupervised_mode is True.
mode (Optional[str]) –
You can use AUTOPILOT_MODE enum to choose between
* AUTOPILOT_MODE.FULL_AUTO
* AUTOPILOT_MODE.MANUAL
* AUTOPILOT_MODE.QUICK
* AUTOPILOT_MODE.COMPREHENSIVE: Runs all blueprints in the repository (warning:
this may be extremely slow).
If unspecified, QUICK is used. If the MANUAL value is used, the model
creation process will need to be started by executing the start_autopilot
function with the desired featurelist. It will start immediately otherwise.
* metric (Optional[str]) – Name of the metric to use for evaluating models. You can query
the metrics available for the target by way of
Project.get_metrics. If none is specified, then the default
recommended by DataRobot is used.
* worker_count (Optional[int]) – The number of concurrent workers to request for this project. If
None, then the default is used.
(New in version v2.14) Setting this to -1 will request the maximum number
available to your account.
* partitioning_method (PartitioningMethod object, optional) – Instance of one of the Partition Classes defined in
datarobot.helpers.partitioning_methods. As an alternative, use
Project.set_partitioning_method
or Project.set_datetime_partitioning
to set the partitioning for the project.
* positive_class (str, float, or int; optional) – Specifies a level of the target column that should be treated as the
positive class for binary classification. May only be specified
for binary classification targets.
* featurelist_id (Optional[str]) – Specifies which feature list to use.
* advanced_options (AdvancedOptions, optional) – Used to set advanced options of project creation. Will override any options saved using set_options.
* max_wait (Optional[int]) – Time in seconds after which target setting is considered
unsuccessful.
* target_type (Optional[str]) – Override the automatically selected target_type. An example usage would be setting the
target_type=’Multiclass’ when you want to perform a multiclass classification task on a
numeric column that has a low cardinality. You can use TARGET_TYPE enum.
* credentials (Optional[List]) – a list of credentials for the datasets used in relationship configuration
(previously graphs).
* feature_engineering_prediction_point (Optional[str]) – additional aim parameter.
* unsupervised_mode (boolean, default False) – (New in version v2.20) Specifies whether to create an unsupervised project. If True,
target may not be provided.
* relationships_configuration_id (Optional[str]) – (New in version v2.21) ID of the relationships configuration to use.
* segmentation_task_id (str or SegmentationTask, optional) – (New in version v2.28) The segmentation task that should be used to split the project
for segmented modeling.
* unsupervised_type (UnsupervisedTypeEnum, optional) – (New in version v2.27) Specifies whether an unsupervised project is anomaly detection
or clustering.
* autopilot_cluster_list (list(int), optional) – (New in version v2.27) Specifies the list of clusters to build for each model during
Autopilot. Specifying multiple values in a list will build models with each number
of clusters for the Leaderboard.
* use_gpu (Optional[bool]) – (New in version v3.2) Specifies whether project should use GPUs
* Returns:project – The instance with updated attributes.
* Return type:Project
* Raises:
* AsyncFailureError – Polling for status of async process resulted in response
with unsupported status code
* AsyncProcessUnsuccessfulError – Raised if target setting was unsuccessful
* AsyncTimeoutError – Raised if target setting took more time than specified
by the max_wait parameter
* TypeError – Raised if advanced_options, partitioning_method or target_type is
provided, but is not of supported type
datarobot.models.Project.start
: combines project creation, file upload, and target selection. Provides fewer options, but is useful for getting started quickly.
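Example (illustrative): analyze_and_model on an existing project; the project ID and target name are placeholders.

import datarobot as dr
from datarobot.enums import AUTOPILOT_MODE

project = dr.Project.get("5e8c6a34d2426053ab9a39ed")
project.analyze_and_model(
    target="readmitted",
    mode=AUTOPILOT_MODE.QUICK,
    worker_count=-1,
)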
Set target variable of an existing project and begin the Autopilot process (unless manual
mode is specified).
Target setting is an asynchronous process, which means that after
initial request DataRobot keeps polling status of an async process
that is responsible for target setting until it’s finished.
For SDK users, this method might raise
exceptions related to its async nature.
When execution returns to the caller, the Autopilot process will already have commenced
(again, unless manual mode is specified).
Parameters:
target (Optional[str]) – The name of the target column in the uploaded file. Should not be provided if
unsupervised_mode is True.
mode (Optional[str]) –
You can use AUTOPILOT_MODE enum to choose between
* AUTOPILOT_MODE.FULL_AUTO
* AUTOPILOT_MODE.MANUAL
* AUTOPILOT_MODE.QUICK
* AUTOPILOT_MODE.COMPREHENSIVE: Runs all blueprints in the repository (warning:
this may be extremely slow).
If unspecified, QUICK mode is used. If the MANUAL value is used, the model
creation process needs to be started by executing the start_autopilot
function with the desired feature list. It will start immediately otherwise.
* metric (Optional[str]) – Name of the metric to use for evaluating models. You can query
the metrics available for the target by way of
Project.get_metrics. If none is specified, then the default
recommended by DataRobot is used.
* worker_count (Optional[int]) – The number of concurrent workers to request for this project. If
None, then the default is used.
(New in version v2.14) Setting this to -1 will request the maximum number
available to your account.
* positive_class (str, float, or int; optional) – Specifies a level of the target column that should be treated as the
positive class for binary classification. May only be specified
for binary classification targets.
* partitioning_method (PartitioningMethod object, optional) – Instance of one of the Partition Classes defined in
datarobot.helpers.partitioning_methods. As an alternative, use
Project.set_partitioning_method
or Project.set_datetime_partitioning
to set the partitioning for the project.
* featurelist_id (Optional[str]) – Specifies which feature list to use.
* advanced_options (AdvancedOptions, optional) – Used to set advanced options of project creation.
* max_wait (Optional[int]) – Time in seconds after which target setting is considered
unsuccessful.
* target_type (Optional[str]) – Override the automatically selected target_type. An example usage would be setting the
target_type=’Multiclass’ when you want to perform a multiclass classification task on a
numeric column that has a low cardinality. You can use the TARGET_TYPE enum.
* credentials (Optional[List]) – A list of credentials for the datasets used in relationship configuration
(previously graphs).
* feature_engineering_prediction_point (Optional[str]) – For time-aware Feature Engineering, this parameter specifies the column from the
primary dataset to use as the prediction point.
* unsupervised_mode (boolean, default False) – (New in version v2.20) Specifies whether to create an unsupervised project. If True,
target may not be provided.
* relationships_configuration_id (Optional[str]) – (New in version v2.21) ID of the relationships configuration to use.
* class_mapping_aggregation_settings (ClassMappingAggregationSettings, optional) – Instance of datarobot.helpers.ClassMappingAggregationSettings
* segmentation_task_id (str or SegmentationTask, optional) – (New in version v2.28) The segmentation task that should be used to split the project
for segmented modeling.
* unsupervised_type (Optional[UnsupervisedTypeEnum]) – (New in version v2.27) Specifies whether an unsupervised project is anomaly detection
or clustering.
* autopilot_cluster_list (Optional[List[int]]) – (New in version v2.27) Specifies the list of clusters to build for each model during
Autopilot. Specifying multiple values in a list will build models with each number
of clusters for the Leaderboard.
* Returns:project – The instance with updated attributes.
* Return type:Project
* Raises:
* AsyncFailureError – Polling for status of async process resulted in response
with unsupported status code.
* AsyncProcessUnsuccessfulError – Raised if target setting was unsuccessful.
* AsyncTimeoutError – Raised if target setting took more time than specified
by the max_wait parameter.
* TypeError – Raised if advanced_options, partitioning_method or target_type is
provided, but is not of supported type.
datarobot.models.Project.start
: Combines project creation, file upload, and target selection. Provides fewer options, but is useful for getting started quickly.
Retrieve paginated model records, sorted by scores, with optional filtering.
Parameters:
sort_by_partition (str, one of validation, backtesting, crossValidation or holdout) – Set the partition to use for sorted (by score) list of models. validation is the default.
sort_by_metric (str) –
Set the project metric to use for model sorting. The DataRobot-selected project optimization metric is the default.
* with_metric (str) – For a single-metric list of results, specify that project metric.
* search_term (str) – If specified, only models containing the term in their name or processes are returned.
* featurelists (List[str]) – If specified, only models trained on selected featurelists are returned.
* families (List[str]) – If specified, only models belonging to selected families are returned.
* blueprints (List[str]) – If specified, only models trained on specified blueprint IDs are returned.
* labels (List[str], starred or prepared for deployment) – If specified, only models tagged with all listed labels are returned.
* characteristics (List[str]) – If specified, only models matching all listed characteristics are returned. Possible values:
“frozen”, “trained on gpu”, “with exportable coefficients”, “with mono constraints”,
“with rating table”, “with scoring code”, “new series optimized”.
* training_filters (List[str]) – If specified, only models matching at least one of the listed training conditions are returned.
The following formats are supported for AutoML and datetime partitioned projects:
- number of rows in the training subset
For datetime partitioned projects:
- a training duration, for example P6Y0M0D
- a training duration combined with the sampling rate and sampling method, for example P6Y0M0D-78-Random
(returns models trained on 6 years of data, sampling rate 78%, random sampling)
- start/end date
- project settings
* number_of_clusters (list of int) – Filter models by number of clusters. Applicable only in unsupervised clustering projects.
* limit (int)
* offset (int)
* Returns:generic_models
* Return type:list of GenericModel
* order_by (str or List[str], optional) – If not None, the returned models are ordered by this
attribute. If None, models are returned in the order of the
default project metric.
Allowed attributes to sort by are:
* metric
* sample_pct
If the sort attribute is preceded by a hyphen, models will be sorted in descending
order, otherwise in ascending order.
Multiple sort attributes can be included as a comma-delimited string or in a list,
e.g. order_by='sample_pct,-metric' or order_by=['sample_pct', '-metric'].
Sorting by metric orders models by their validation score on the project metric.
* search_params (dict, optional.) –
If not None, the returned models are filtered by lookup.
Currently you can query models by:
* name
* sample_pct
* is_starred
* with_metric (Optional[str].) – If not None, the returned models will only have scores for this
metric. Otherwise all the metrics are returned.
* use_new_models_retrieval (bool, False by default) – If true, a new retrieval route is used, which supports filtering and returns fewer attributes per
individual model. The following attributes are absent and can be retrieved from the blueprint level:
monotonic_increasing_featurelist_id, monotonic_decreasing_featurelist_id, supports_composable_ml
and supports_monotonic_constraints. The following attributes are absent and can be retrieved from
the individual model level: has_empty_clusters, is_n_clusters_dynamically_determined,
prediction_threshold and prediction_threshold_read_only. The attribute n_clusters in Model is
renamed to number_of_clusters in GenericModel and is returned for unsupervised clustering models.
* Returns:models – All models trained in the project.
* Return type:a list of Model, or a list of GenericModel if use_new_models_retrieval is True.
* Raises:TypeError – Raised if order_by or search_params parameter is provided,
but is not of supported type.
Examples
Project.get('pid').get_models(order_by=['-sample_pct', 'metric'])

# Getting models that contain "Ridge" in name
Project.get('pid').get_models(search_params={'name': "Ridge"})

# Filtering models based on 'starred' flag:
Project.get('pid').get_models(search_params={'is_starred': True})
# retrieve additional attributes for the model
model_records = project.get_models(use_new_models_retrieval=True)
model_record = model_records[0]
blueprint_id = model_record.blueprint_id
blueprint = dr.Blueprint.get(project.id, blueprint_id)
model_record.number_of_clusters
blueprint.supports_composable_ml
blueprint.supports_monotonic_constraints
blueprint.monotonic_decreasing_featurelist_id
blueprint.monotonic_increasing_featurelist_id
model = dr.Model.get(project.id, model_record.id)
model.prediction_threshold
model.prediction_threshold_read_only
model.has_empty_clusters
model.is_n_clusters_dynamically_determined
Raises:ValueError – Raised if the project is unsupervised.
Raised if the project has no target set.
Raised if no metric was passed or the project has no metric.
Raised if the metric passed is not used by the models on the leaderboard.
sourcedata (str, file or pandas.DataFrame) – Data to be used for predictions. If string, can be either a path to a local file,
a publicly accessible URL (starting with http://, https://, file://), or
raw file content. If using a file on disk, the filename must consist of ASCII
characters only.
max_wait (Optional[int]) – The maximum number of seconds to wait for the uploaded dataset to be processed before
raising an error.
read_timeout (Optional[int]) – The maximum number of seconds to wait for the server to respond indicating that the
initial upload is complete
forecast_point (datetime.datetime or None, optional) – (New in version v2.8) May only be specified for time series projects, otherwise the
upload will be rejected. The time in the dataset relative to which predictions should be
generated in a time series project. See the Time Series documentation for more information. If not provided, will default to using the
latest forecast point in the dataset.
predictions_start_date (datetime.datetime or None, optional) – (New in version v2.11) May only be specified for time series projects. The start date
for bulk predictions. Note that this parameter is for generating historical predictions
using the training data. This parameter should be provided in conjunction with
predictions_end_date. Cannot be provided with the forecast_point parameter.
predictions_end_date (datetime.datetime or None, optional) – (New in version v2.11) May only be specified for time series projects. The end date
for bulk predictions, exclusive. Note that this parameter is for generating
historical predictions using the training data. This parameter should be provided in
conjunction with predictions_start_date.
Cannot be provided with the forecast_point parameter.
actual_value_column (string, optional) – (New in version v2.21) Actual value column name, valid for the prediction
files if the project is unsupervised and the dataset is considered as bulk predictions
dataset. Cannot be provided with the forecast_point parameter.
dataset_filename (string or None, optional) – (New in version v2.14) File name to use for the dataset.
Ignored for url and file path sources.
relax_known_in_advance_features_check (Optional[bool]) – (New in version v2.15) For time series projects only. If True, missing values in the
known in advance features are allowed in the forecast window at the prediction time.
If omitted or False, missing values are not allowed.
credentials (Optional[List]) – A list of credentials for the datasets used in a Feature Discovery project.
secondary_datasets_config_id (string or None, optional) – (New in version v2.23) The Id of the alternative secondary dataset config
to use during prediction for Feature discovery project.
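Example (illustrative sketch, assuming this is Project.upload_dataset and that project is an existing Project instance): the file name is a placeholder.

prediction_dataset = project.upload_dataset("new_data.csv", max_wait=600)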
Upload a new dataset from a data source to make predictions against
Parameters:
data_source_id (str) – The identifier of the data source.
username (str) – The username for database authentication.
password (str) – The password for database authentication. The password is encrypted
at server side and never saved / stored.
max_wait (Optional[int]) – Optional, the maximum number of seconds to wait before giving up.
forecast_point (datetime.datetime or None, optional) – (New in version v2.8) For time series projects only. This is the default point relative
to which predictions will be generated, based on the forecast window of the project. See
the time series prediction documentation for more
information.
relax_known_in_advance_features_check (Optional[bool]) – (New in version v2.15) For time series projects only. If True, missing values in the
known in advance features are allowed in the forecast window at the prediction time.
If omitted or False, missing values are not allowed.
credentials (Optional[List]) – A list of credentials for the datasets used in a Feature Discovery project.
predictions_start_date (datetime.datetime or None, optional) – (New in version v2.20) For time series projects only. The start date for bulk
predictions. Note that this parameter is for generating historical predictions using the
training data. This parameter should be provided in conjunction with
predictions_end_date. Can’t be provided with the forecast_point parameter.
predictions_end_date (datetime.datetime or None, optional) – (New in version v2.20) For time series projects only. The end date for bulk predictions,
exclusive. Note that this parameter is for generating historical predictions using the
training data. This parameter should be provided in conjunction with
predictions_start_date. Can’t be provided with the forecast_point parameter.
actual_value_column (string, optional) – (New in version v2.21) Actual value column name, valid for the prediction
files if the project is unsupervised and the dataset is considered as bulk predictions
dataset. Cannot be provided with the forecast_point parameter.
secondary_datasets_config_id (string or None, optional) – (New in version v2.23) The Id of the alternative secondary dataset config
to use during prediction for Feature discovery project.
Credential data of the catalog dataset to upload. credential_data can be in
one of the following forms:
Basic Credentials:
: - credentialType (str)
: The credential type. For basic credentials, this value must be CredentialTypes.BASIC.
- user (str)
: The username for database authentication.
- password (str)
: The password for database authentication.
The password is encrypted at rest and never saved or stored.
S3 Credentials
: - credentialType (str)
: The credential type. For S3 credentials, this value must be CredentialTypes.S3.
- awsAccessKeyId (Optional[str])
: The S3 AWS access key ID.
- awsSecretAccessKey (Optional[str])
: The S3 AWS secret access key.
- awsSessionToken (Optional[str])
: The S3 AWS session token.
- config_id (Optional[str])
: The ID of the saved shared secure configuration. If specified, cannot include awsAccessKeyId,
awsSecretAccessKey or awsSessionToken.
OAuth Credentials
: - credentialType (str)
: The credential type. For OAuth credentials, this value must be CredentialTypes.OAUTH.
- oauthRefreshToken (str)
: The oauth refresh token.
- oauthClientId (str)
: The oauth client ID.
- oauthClientSecret (str)
: The oauth client secret.
- oauthAccessToken (str)
: The oauth access token.
Snowflake Key Pair Credentials
: - credentialType (str)
: The credential type. For Snowflake Key Pair, this value must be
CredentialTypes.SNOWFLAKE_KEY_PAIR_AUTH.
- user (Optional[str])
: The Snowflake login name.
- privateKeyStr (Optional[str])
: The private key, copied exactly from the user’s private key file. Since it contains
multiple lines, put the key string inside triple quotes when assigning it to a variable.
- passphrase (Optional[str])
: The string used to encrypt the private key.
- configId (Optional[str])
: The ID of the saved shared secure configuration. If specified, cannot include user,
privateKeyStr or passphrase.
Databricks Access Token Credentials
: - credentialType (str)
: The credential type. For a Databricks access token, this value must be
CredentialTypes.DATABRICKS_ACCESS_TOKEN.
- databricksAccessToken (str)
: The Databricks personal access token.
Databricks Service Principal Credentials
: - credentialType (str)
: The credential type. For Databricks service principal, this value must be
CredentialTypes.DATABRICKS_SERVICE_PRINCIPAL.
- clientId (Optional[str])
: The client ID for Databricks service principal.
- clientSecret (Optional[str])
: The client secret for Databricks service principal.
- configId (Optional[str])
: The ID of the saved shared secure configuration. If specified, cannot include clientId
and clientSecret.
Azure Service Principal Credentials
: - credentialType (str)
: The credential type. For Azure service principal, this value must be
CredentialTypes.AZURE_SERVICE_PRINCIPAL.
- clientId (Optional[str])
: The client ID for Azure service principal.
- clientSecret (Optional[str])
: The client secret for Azure service principal.
- azureTenantId (Optional[str])
: The azure tenant ID for Azure service principal.
- configId (Optional[str])
: The ID of the saved shared secure configuration. If specified, cannot include clientId
and clientSecret.
* dataset_version_id (Optional[str]) – The version id of the dataset to use.
* max_wait (Optional[int]) – Optional, the maximum number of seconds to wait before giving up.
* forecast_point (datetime.datetime or None, optional) – For time series projects only. This is the default point relative
to which predictions will be generated, based on the forecast window of the project. See
the time series prediction documentation for more
information.
* relax_known_in_advance_features_check (Optional[bool]) – For time series projects only. If True, missing values in the
known in advance features are allowed in the forecast window at the prediction time.
If omitted or False, missing values are not allowed.
* credentials (list[BasicCredentialsDict | CredentialIdCredentialsDict], optional) –
A list of credentials for the datasets used in Feature discovery project.
Items in credentials can have the following forms:
Basic Credentials
: - user (str)
: The username for database authentication.
- password (str)
: The password (in cleartext) for database authentication. The password
will be encrypted on the server side in scope of HTTP request
and never saved or stored.
Credential ID
: - credentialId (str)
: The ID of the set of credentials to use instead of user and password.
Note that with this change, username and password will become optional.
* predictions_start_date (datetime.datetime or None, optional) – For time series projects only. The start date for bulk
predictions. Note that this parameter is for generating historical predictions using the
training data. This parameter should be provided in conjunction with
predictions_end_date. Can’t be provided with the forecast_point parameter.
* predictions_end_date (datetime.datetime or None, optional) – For time series projects only. The end date for bulk predictions,
exclusive. Note that this parameter is for generating historical predictions using the
training data. This parameter should be provided in conjunction with
predictions_start_date. Can’t be provided with the forecast_point parameter.
* actual_value_column (string, optional) – Actual value column name, valid for the prediction
files if the project is unsupervised and the dataset is considered as bulk predictions
dataset. Cannot be provided with the forecast_point parameter.
* secondary_datasets_config_id (string or None, optional) – The Id of the alternative secondary dataset config
to use during prediction for Feature discovery project.
* Returns:dataset – the newly uploaded dataset
* Return type:PredictionDataset
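Example (illustrative sketch, assuming this is Project.upload_dataset_from_catalog, that project is an existing Project instance, and that the basic credential type is passed as the string shown): the IDs and credential values are placeholders.

prediction_dataset = project.upload_dataset_from_catalog(
    dataset_id="5e8c6a34d2426053ab9a39ed",
    credential_data={
        "credentialType": "basic",  # assumed string form of CredentialTypes.BASIC
        "user": "db_user",
        "password": "db_password",
    },
)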
Only available once the target and partitioning settings have been set. For more
information on the distinction between input and modeling features, see the
time series documentation.
Parameters:batch_size (Optional[int]) – The number of features to retrieve in a single API call. If specified, the client may
make multiple calls to retrieve the full list of features. If not specified, an
appropriate default will be chosen by the server.
Get a sample of the actual values used to measure the association
between a pair of features
Added in version v2.17.
Parameters:
feature1 (str) – Feature name for the first feature of interest
feature2 (str) – Feature name for the second feature of interest
Returns:
dict – This data has four keys: chart_type, features, values, and types
chart_type (str) – Type of plotting the pair of features gets in the UI.
e.g. ‘HORIZONTAL_BOX’, ‘VERTICAL_BOX’, ‘SCATTER’ or ‘CONTINGENCY’
values (list) – A list of triplet lists, e.g.
{“values”: [[460.0, 428.5, 0.001], [1679.3, 259.0, 0.001], …]}
The first entry of each list is a value of feature1, the second entry of
each list is a value of feature2, and the third is the relative frequency of
the pair of datapoints in the sample.
features (List[str]) – A list of the passed features, [feature1, feature2]
types (List[str]) – A list of the passed features’ types inferred by DataRobot.
e.g. [‘NUMERIC’, ‘CATEGORICAL’]
List all modeling featurelists created for this project
Modeling featurelists can only be created after the target and partitioning options have
been set for a project. In time series projects, these are the featurelists that can be
used for modeling; in other projects, they behave the same as regular featurelists.
Parameters:batch_size (Optional[int]) – The number of featurelists to retrieve in a single API call. If specified, the client
may make multiple calls to retrieve the full list of features. If not specified, an
appropriate default will be chosen by the server.
Returns:
all modeling featurelists in this project
Create a new feature by transforming the type of an existing feature in the project
Note that only the following transformations are supported:
Text to categorical or numeric
Categorical to text or numeric
Numeric to categorical
Date to categorical or numeric
Notes
Special considerations when casting numeric to categorical
There are two parameters which can be used for variableType to convert numeric
data to categorical levels. These differ in the assumptions they make about the input
data, and are very important when considering the data that will be used to make
predictions. The assumptions that each makes are:
categorical : The data in the column is all integral, and there are no missing
values. If either of these conditions do not hold in the training set, the
transformation will be rejected. During predictions, if any of the values in the
parent column are missing, the predictions will error.
categoricalInt : New in v2.6
All of the data in the column should be considered categorical in its string form when
cast to an int by truncation. For example the value 3 will be cast as the string
3 and the value 3.14 will also be cast as the string 3. Further, the
value -3.6 will become the string -3.
Missing values will still be recognized as missing.
For convenience these are represented in the enum VARIABLE_TYPE_TRANSFORM with the
names CATEGORICAL and CATEGORICAL_INT.
Parameters:
name (str) – The name to give to the new feature
parent_name (str) – The name of the feature to transform
variable_type (str) – The type the new column should have. See the values within
datarobot.enums.VARIABLE_TYPE_TRANSFORM.
replacement (str or Optional[float]) – The value that missing or unconvertible data should have
date_extraction (Optional[str]) – Must be specified when parent_name is a date column (and left None otherwise).
Specifies which value from a date should be extracted. See the list of values in
datarobot.enums.DATE_EXTRACTION
max_wait (Optional[int]) – The maximum amount of time to wait for DataRobot to finish processing the new column.
This process can take more time with more data to process. If this operation times
out, an AsyncTimeoutError will occur. DataRobot continues the processing and the
new column may successfully be constructed.
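Example (illustrative sketch, assuming this is Project.create_type_transform_feature and that project is an existing Project instance): the column names are placeholders.

from datarobot.enums import VARIABLE_TYPE_TRANSFORM

new_feature = project.create_type_transform_feature(
    name="age (categorical)",
    parent_name="age",
    variable_type=VARIABLE_TYPE_TRANSFORM.CATEGORICAL_INT,
)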
name (Optional[str]) – The name to give to this new featurelist. Names must be unique, so
an error will be returned from the server if this name has already
been used in this project. We dynamically create a name if none is
provided.
features (list of Optional[str]) – The names of the features. Each feature must exist in the project
already.
starting_featurelist (Featurelist, optional) – The featurelist to use as the basis when creating a new featurelist.
starting_featurelist.features will be read to get the list of features
that we will manipulate.
starting_featurelist_id (Optional[str]) – The featurelist ID used instead of passing an object instance.
starting_featurelist_name (Optional[str]) – The featurelist name like “Informative Features” to find a featurelist
via the API, and use to fetch features.
features_to_include (list of Optional[str]) – The list of the feature names to include in new featurelist. Throws an
error if an item in this list is not in the featurelist that was passed,
or that was retrieved from the API. If nothing is passed, all features
are included from the starting featurelist.
features_to_exclude (list of Optional[str]) – The list of the feature names to exclude in the new featurelist. Throws
an error if an item in this list is not in the featurelist that was
passed, also throws an error if a feature is in this list as well as
features_to_include. Method cannot use both at the same time.
InvalidUsageError – Raised if the method is called with incompatible arguments
Examples
project = Project.get('5223deadbeefdeadbeef0101')
flists = project.get_featurelists()

# Create a new featurelist using a subset of features from an
# existing featurelist
flist = flists[0]
features = flist.features[::2]  # Half of the features
new_flist = project.create_featurelist(
    name='Feature Subset',
    features=features,
)
project = Project.get('5223deadbeefdeadbeef0101')

# Create a new featurelist using a subset of features from an
# existing featurelist by using features_to_exclude param
new_flist = project.create_featurelist(
    name='Feature Subset of Existing Featurelist',
    starting_featurelist_name="Informative Features",
    features_to_exclude=["metformin", "weight", "age"],
)
Modeling featurelists can only be created after the target and partitioning options have
been set for a project. In time series projects, these are the featurelists that can be
used for modeling; in other projects, they behave the same as regular featurelists.
name (str) – the name of the modeling featurelist to create. Names must be unique within the
project, or the server will return an error.
features (List[str]) – the names of the features to include in the modeling featurelist. Each feature must
be a modeling feature.
skip_datetime_partition_column (boolean, optional) – False by default. If True, the featurelist will not contain the datetime partition column.
Use this to create monotonic feature lists in time series projects. The setting makes no difference
for non-time-series projects. Monotonic featurelists cannot be used for modeling.
Returns:featurelist – the newly created featurelist
project = Project.get('1234deadbeeffeeddead4321')
modeling_features = project.get_modeling_features()
selected_features = [feat.name for feat in modeling_features][:5]  # select first five
new_flist = project.create_modeling_featurelist('Model This', selected_features)
Get the metrics recommended for modeling on the given feature.
Parameters:feature_name (str) – The name of the feature to query regarding which metrics are
recommended for modeling.
Returns:
feature_name (str) – The name of the feature that was looked up
available_metrics (List[str]) – An array of strings representing the appropriate metrics. If the feature
cannot be selected as the target, then this array will be empty.
metric_details (list of dict) – The list of metricDetails objects
metric_name: str
: Name of the metric
supports_timeseries: boolean
: This metric is valid for timeseries
supports_multiclass: boolean
: This metric is valid for multiclass classification
supports_binary: boolean
: This metric is valid for binary classification
supports_regression: boolean
: This metric is valid for regression
ascending: boolean
: Should the metric be sorted in ascending order
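Example (illustrative sketch, assuming this is Project.get_metrics and that project is an existing Project instance): the column name is a placeholder.

metrics_info = project.get_metrics("readmitted")
print(metrics_info["feature_name"])
print(metrics_info["available_metrics"])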
Start Autopilot on provided featurelist with the specified Autopilot settings,
halting the current Autopilot run.
Only one Autopilot can run at a time, so any ongoing Autopilot on a different
featurelist will be halted. Modeling jobs already in the queue are not affected,
but the halted Autopilot will not add new jobs to the queue.
Parameters:
featurelist_id (str) – Identifier of featurelist that should be used for autopilot
mode (Optional[str]) –
The Autopilot mode to run. You can use AUTOPILOT_MODE enum to choose between
* AUTOPILOT_MODE.FULL_AUTO
* AUTOPILOT_MODE.QUICK
* AUTOPILOT_MODE.COMPREHENSIVE
If unspecified, AUTOPILOT_MODE.QUICK is used.
* blend_best_models (Optional[bool]) – Blend best models during Autopilot run. This option is not supported in SHAP-only mode.
* scoring_code_only (Optional[bool]) – Keep only models that can be converted to scorable java code during Autopilot run.
* prepare_model_for_deployment (Optional[bool]) – Prepare model for deployment during Autopilot run. The preparation includes creating
reduced feature list models, retraining best model on higher sample size,
computing insights and assigning “RECOMMENDED FOR DEPLOYMENT” label.
* consider_blenders_in_recommendation (Optional[bool]) – Include blenders when selecting a model to prepare for deployment in an Autopilot Run.
This option is not supported in SHAP-only mode or for multilabel projects.
* run_leakage_removed_feature_list (Optional[bool]) – Run Autopilot on Leakage Removed feature list (if exists).
* autopilot_cluster_list (list of Optional[int]) – (New in v2.27) A list of integers, where each value will be used as the number of
clusters in Autopilot model(s) for unsupervised clustering projects. Cannot be specified
unless project unsupervisedMode is true and unsupervisedType is set to ‘clustering’.
* Raises:AppPlatformError – Raised if the project’s target was not selected or the Autopilot settings are invalid
for the project.
* Return type:None
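Example (illustrative): restarting Autopilot on a specific featurelist; assumes project is an existing Project instance, and the featurelist and feature names are placeholders.

from datarobot.enums import AUTOPILOT_MODE

featurelist = project.create_featurelist("Reduced set", ["age", "bmi", "num_procedures"])
project.start_autopilot(featurelist.id, mode=AUTOPILOT_MODE.QUICK)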
Either sample_pct or training_row_count can be used to specify the amount of data to
use, but not both. If neither are specified, a default of the maximum amount of data that
can safely be used to train any blueprint without going into the validation data will be
selected.
In smart-sampled projects, sample_pct and training_row_count are assumed to be in terms
of rows of the minority class.
For str, this is assumed to be a blueprint_id. If no
source_project_id is provided, the project_id will be assumed
to be the project that this instance represents.
Otherwise, for a Blueprint, it contains the
blueprint_id and source_project_id that we want
to use. featurelist_id will assume the default for this project
if not provided, and sample_pct will default to using the maximum
training value allowed for this project’s partition setup.
source_project_id will be ignored if a
Blueprint instance is used for this parameter
* sample_pct (Optional[float]) – The amount of data to use for training, as a percentage of the project dataset from 0
to 100.
* featurelist_id (Optional[str]) – The identifier of the featurelist to use. If not defined, the
default for this project is used.
* source_project_id (Optional[str]) – Which project created this blueprint_id. If None, it defaults
to looking in this project. Note that you must have read
permissions in this project.
* scoring_type (Optional[str]) – Either validation or crossValidation (also dr.SCORING_TYPE.validation
or dr.SCORING_TYPE.cross_validation). validation is available for every
partitioning type, and indicates that the default model validation should be
used for the project.
If the project uses a form of cross-validation partitioning,
crossValidation can also be used to indicate
that all of the available training/validation combinations
should be used to evaluate the model.
* training_row_count (Optional[int]) – The number of rows to use to train the requested model.
* monotonic_increasing_featurelist_id (Optional[str]) – (new in version 2.11) the id of the featurelist that defines the set of features with
a monotonically increasing relationship to the target. Passing None disables
increasing monotonicity constraint. Default
(dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
* monotonic_decreasing_featurelist_id (Optional[str]) – (new in version 2.11) the id of the featurelist that defines the set of features with
a monotonically decreasing relationship to the target. Passing None disables
decreasing monotonicity constraint. Default
(dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
* n_clusters (Optional[int]) – (new in version 2.27) Number of clusters to use in an unsupervised clustering model.
This parameter is used only for unsupervised clustering models that don’t automatically
determine the number of clusters.
* Returns:model_job_id – id of created job, can be used as parameter to ModelJob.get
method or wait_for_async_model_creation function
* Return type:str
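Example (illustrative sketch, assuming this is Project.train, that project is an existing Project instance with its target already set, and that Project.get_blueprints returns the project's blueprints):

import datarobot as dr

blueprint = project.get_blueprints()[0]
model_job_id = project.train(blueprint, sample_pct=64)
model_job = dr.ModelJob.get(project.id, model_job_id)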
Use a blueprint_id, which is a string. By default, it is assumed that the blueprint
was created by this project. If you are using a blueprint from another project,
you will need to pass the id of that other project as well.
blueprint_id (str) – the blueprint to use to train the model
featurelist_id (Optional[str]) – the featurelist to use to train the model. If not specified, the project default will
be used.
training_row_count (Optional[int]) – the number of rows of data that should be used to train the model. If specified,
neither training_duration nor use_project_settings may be specified.
training_duration (Optional[str]) – a duration string specifying what time range the data used to train the model should
span. If specified, neither training_row_count nor use_project_settings may be
specified.
sampling_method (Optional[str]) – (New in version v2.23) defines the way training data is selected. Can be either
random or latest. In combination with training_row_count defines how rows
are selected from backtest (latest by default). When training data is defined using
time range (training_duration or use_project_settings) this setting changes the
way time_window_sample_pct is applied (random by default). Applicable to OTV
projects only.
use_project_settings (Optional[bool]) – (New in version v2.20) defaults to False. If True, indicates that the custom
backtest partitioning settings specified by the user will be used to train the model and
evaluate backtest scores. If specified, neither training_row_count nor
training_duration may be specified.
source_project_id (Optional[str]) – the id of the project this blueprint comes from, if not this project. If left
unspecified, the blueprint must belong to this project.
monotonic_increasing_featurelist_id (Optional[str]) – (New in version v2.18) optional, the id of the featurelist that defines
the set of features with a monotonically increasing relationship to the target.
Passing None disables increasing monotonicity constraint. Default
(dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
monotonic_decreasing_featurelist_id (Optional[str]) – (New in version v2.18) optional, the id of the featurelist that defines
the set of features with a monotonically decreasing relationship to the target.
Passing None disables decreasing monotonicity constraint. Default
(dr.enums.MONOTONICITY_FEATURELIST_DEFAULT) is the one specified by the blueprint.
n_clusters (Optional[int]) – The number of clusters to use in the specified unsupervised clustering model.
ONLY VALID IN UNSUPERVISED CLUSTERING PROJECTS
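Example (illustrative sketch, assuming this is Project.train_datetime on an existing datetime-partitioned project instance named project): trains on one year of data.

blueprint = project.get_blueprints()[0]
model_job = project.train_datetime(
    blueprint.id,
    training_duration="P1Y0M0D",  # duration string: one year of data
)
model = model_job.get_result_when_complete()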
Submit a job for creating blender model. Upon success, the new job will
be added to the end of the queue.
Parameters:
model_ids (List[str]) – List of model ids that will be used to create blender. These models should have
completed validation stage without errors, and can’t be blenders or DataRobot Prime
blender_method (str) – Chosen blend method, one from datarobot.enums.BLENDER_METHOD. If this is a time
series project, only methods in datarobot.enums.TS_BLENDER_METHOD are allowed.
Returns:model_job – New ModelJob instance for the blender creation job in queue.
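Example (illustrative sketch, assuming this is Project.blend and that project is an existing Project instance): blends two leaderboard models that have completed validation.

from datarobot.enums import BLENDER_METHOD

models = project.get_models()[:2]
blend_job = project.blend([m.id for m in models], BLENDER_METHOD.AVERAGE)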
Check if the specified models can be successfully blended
Parameters:
model_ids (List[str]) – List of model ids that will be used to create blender. These models should have
completed validation stage without errors, and can’t be blenders or DataRobot Prime
blender_method (str) – Chosen blend method, one from datarobot.enums.BLENDER_METHOD. If this is a time
series project, only methods in datarobot.enums.TS_BLENDER_METHOD are allowed.
The requested model will be trained on the maximum autopilot size then go through the
recommendation stages. For datetime partitioned projects, this includes the feature impact
stage, retraining on a reduced feature list, and retraining the best of the reduced
feature list model and the max autopilot original model on recent data. For non-datetime
partitioned projects, this includes the feature impact stage, retraining on a reduced
feature list, retraining the best of the reduced feature list model and the max autopilot
original model up to the holdout size, then retraining the up-to-the holdout model on the
full dataset.
Parameters:model_id (str) – The model to prepare for deployment.
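As a sketch, assuming the method described above is Project.start_prepare_model_for_deployment (the project id is a placeholder and the model choice is illustrative):
import datarobot as dr

project = dr.Project.get("<project-id>")   # placeholder id
model = project.get_models()[0]            # e.g. the current leaderboard leader
project.start_prepare_model_for_deployment(model_id=model.id)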
Retrieve a list of all models belonging to the segments/child projects
of the segmented project.
Parameters:combined_model_id (Optional[str]) – Id of the combined model to get segments for. If there is only a single
combined model it can be retrieved automatically, but this must be
specified when there are > 1 combined models.
Returns:segments_models – A list of dictionaries containing all of the segments/child projects,
each with a list of their models ordered by metric from best to worst.
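A hedged sketch, assuming this describes Project.get_segments_models on a segmented (parent) project; the id is a placeholder:
import datarobot as dr

parent_project = dr.Project.get("<segmented-project-id>")   # placeholder id
segments_models = parent_project.get_segments_models()      # pass combined_model_id=... if several combined models exist
for segment in segments_models:
    print(segment)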
If called with QUEUE_STATUS.INPROGRESS, will return the modeling jobs
that are currently running.
If called with QUEUE_STATUS.QUEUE, will return the modeling jobs that
are waiting to be run.
If called with QUEUE_STATUS.ERROR, will return the modeling jobs that
have errored.
If no value is provided, will return all modeling jobs currently running
or waiting to be run.
Returns:jobs – Each is an instance of ModelJob
Return type:list
If called with QUEUE_STATUS.INPROGRESS, will return the prediction jobs
that are currently running.
If called with QUEUE_STATUS.QUEUE, will return the prediction jobs that
are waiting to be run.
If called with QUEUE_STATUS.ERROR, will return the prediction jobs that
have errored.
If called without a status, will return all prediction jobs currently running
or waiting to be run.
Returns:jobs – Each is an instance of PredictJob
Return type:list
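For example, a sketch assuming these describe Project.get_model_jobs and Project.get_predict_jobs (the project id is a placeholder):
import datarobot as dr
from datarobot.enums import QUEUE_STATUS

project = dr.Project.get("<project-id>")   # placeholder id
running_model_jobs = project.get_model_jobs(status=QUEUE_STATUS.INPROGRESS)
queued_predict_jobs = project.get_predict_jobs(status=QUEUE_STATUS.QUEUE)
all_model_jobs = project.get_model_jobs()  # running or queued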
Blocks until autopilot is finished. This will raise an exception if the autopilot
mode is changed from AUTOPILOT_MODE.FULL_AUTO.
It makes API calls to sync the project state with the server and to look at
which jobs are enqueued.
Parameters:
check_interval (float or int) – The maximum time (in seconds) to wait between checks for whether autopilot is finished
timeout (float or int or None) – After this long (in seconds), we give up. If None, never timeout.
verbosity (Union[int, Enum]) – This should be VERBOSITY_LEVEL.SILENT or VERBOSITY_LEVEL.VERBOSE.
For VERBOSITY_LEVEL.SILENT, nothing will be displayed about progress.
For VERBOSITY_LEVEL.VERBOSE, the number of jobs in progress or queued is shown.
Note that new jobs are added to the queue along the way.
Raises:
AsyncTimeoutError – If autopilot does not finish in the amount of time specified
RuntimeError – If a condition is detected that indicates that autopilot will not complete
on its own
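A sketch of a typical call, assuming this describes Project.wait_for_autopilot (the project id is a placeholder):
import datarobot as dr
from datarobot.enums import VERBOSITY_LEVEL

project = dr.Project.get("<project-id>")   # placeholder id
project.wait_for_autopilot(
    check_interval=30,                     # poll at most every 30 seconds
    timeout=4 * 60 * 60,                   # give up after 4 hours; None means never time out
    verbosity=VERBOSITY_LEVEL.SILENT,
)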
Sets the number of workers allocated to this project.
Note that this value is limited to the number allowed by your account.
Lowering the number will not stop currently running jobs, but will
cause the queue to wait for the appropriate number of jobs to finish
before attempting to run more jobs.
Parameters:worker_count (int) – The number of concurrent workers to request from the pool of workers.
(New in version v2.14) Setting this to -1 will update the number of workers to the
maximum available to your account.
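For example, assuming this describes Project.set_worker_count (the project id is a placeholder):
import datarobot as dr

project = dr.Project.get("<project-id>")   # placeholder id
project.set_worker_count(4)                # request four concurrent workers
project.set_worker_count(-1)               # or request the account maximum (v2.14+)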
Project options will not be stored at the database level, so the options
set via this method will only be attached to a project instance for the lifetime of a
client session (if you quit your session and reopen a new one before running autopilot,
the advanced options will be lost).
Either accepts an AdvancedOptions object to replace all advanced options or individual keyword
arguments. This is an inplace update, not a new object. The options set will only remain for the
life of this project instance within a given session.
Parameters:
advanced_options (AdvancedOptions, optional) – AdvancedOptions instance as an alternative to passing individual parameters.
weights (string, optional) – The name of a column indicating the weight of each row
response_cap (float in [0.5, 1), optional) – Quantile of the response distribution to use for response capping.
blueprint_threshold (Optional[int]) – Number of hours models are permitted to run before being excluded from later autopilot
stages
Minimum 1
seed (Optional[int]) – a seed to use for randomization
smart_downsampled (Optional[bool]) – whether to use smart downsampling to throw away excess rows of the majority class. Only
applicable to classification and zero-boosted regression projects.
majority_downsampling_rate (Optional[float]) – The percentage between 0 and 100 of the majority rows that should be kept. Specify only if
using smart downsampling. May not cause the majority class to become smaller than the
minority class.
offset (list of Optional[str]) – (New in version v2.6) the list of the names of the columns containing the offset
of each row
exposure (string, optional) – (New in version v2.6) the name of a column containing the exposure of each row
accuracy_optimized_mb (Optional[bool]) – (New in version v2.6) Include additional, longer-running models that will be run by the
autopilot and available to run manually.
events_count (string, optional) – (New in version v2.8) the name of a column specifying events count.
monotonic_increasing_featurelist_id (string, optional) – (new in version 2.11) the id of the featurelist that defines the set of features
with a monotonically increasing relationship to the target. If None,
no such constraints are enforced. When specified, this will set a default for the project
that can be overridden at model submission time if desired.
monotonic_decreasing_featurelist_id (string, optional) – (new in version 2.11) the id of the featurelist that defines the set of features
with a monotonically decreasing relationship to the target. If None,
no such constraints are enforced. When specified, this will set a default for the project
that can be overridden at model submission time if desired.
only_include_monotonic_blueprints (Optional[bool]) – (new in version 2.11) when true, only blueprints that support enforcing
monotonic constraints will be available in the project or selected for the autopilot.
allowed_pairwise_interaction_groups (list of tuple, optional) – (New in version v2.19) For GA2M models - specify groups of columns for which pairwise
interactions will be allowed. E.g. if set to [(A, B, C), (C, D)] then GA2M models will
allow interactions between columns A x B, B x C, A x C, C x D. All others (A x D, B x D) will
not be considered.
blend_best_models (Optional[bool]) – (New in version v2.19) blend best models during Autopilot run
scoring_code_only (Optional[bool]) – (New in version v2.19) Keep only models that can be converted to scorable java code
during Autopilot run
shap_only_mode (Optional[bool]) – (New in version v2.21) Keep only models that support SHAP values during Autopilot run. Use
SHAP-based insights wherever possible. Defaults to False.
prepare_model_for_deployment (Optional[bool]) – (New in version v2.19) Prepare model for deployment during Autopilot run.
The preparation includes creating reduced feature list models, retraining best model
on higher sample size, computing insights and assigning “RECOMMENDED FOR DEPLOYMENT” label.
consider_blenders_in_recommendation (Optional[bool]) – (New in version 2.22.0) Include blenders when selecting a model to prepare for
deployment in an Autopilot Run. Defaults to False.
min_secondary_validation_model_count (Optional[int]) – (New in version v2.19) Compute “All backtest” scores (datetime models) or cross validation
scores for the specified number of highest ranking models on the Leaderboard,
if over the Autopilot default.
autopilot_data_sampling_method (Optional[str]) – (New in version v2.23) one of datarobot.enums.DATETIME_AUTOPILOT_DATA_SAMPLING_METHOD.
Applicable for OTV projects only, defines if autopilot uses “random” or “latest” sampling
when iteratively building models on various training samples. Defaults to “random” for
duration-based projects and to “latest” for row-based projects.
run_leakage_removed_feature_list (Optional[bool]) – (New in version v2.23) Run Autopilot on Leakage Removed feature list (if exists).
autopilot_with_feature_discovery (Optional[bool].) – (New in version v2.23) If true, autopilot will run on a feature list that includes features
found via search for interactions.
feature_discovery_supervised_feature_reduction (Optional[bool]) – (New in version v2.23) Run supervised feature reduction for feature discovery projects.
exponentially_weighted_moving_alpha (Optional[float]) – (New in version v2.26) defaults to None, value between 0 and 1 (inclusive), indicates
alpha parameter used in exponentially weighted moving average within feature derivation
window.
external_time_series_baseline_dataset_id (Optional[str].) – (New in version v2.26) If provided, will generate metrics scaled by external model
predictions metric for time series projects. The external predictions catalog
must be validated before autopilot starts, see
Project.validate_external_time_series_baseline and
external baseline predictions documentation
for further explanation.
use_supervised_feature_reduction (bool, default True, optional) – Time Series only. When true, during feature generation DataRobot runs a supervised
algorithm to retain only qualifying features. Setting to false can
severely impact autopilot duration, especially for datasets with many features.
primary_location_column (Optional[str].) – The name of primary location column.
protected_features (list of Optional[str].) – (New in version v2.24) A list of project features to mark as protected for
Bias and Fairness testing calculations. Max number of protected features allowed is 10.
preferable_target_value (Optional[str].) – (New in version v2.24) A target value that should be treated as a favorable outcome
for the prediction. For example, if we want to check gender discrimination for
giving a loan and our target is named is_bad, then the positive outcome for
the prediction would be No, which means that the loan is good and that’s
what we treat as a favorable result for the loaner.
fairness_metrics_set (Optional[str].) – (New in version v2.24) Metric to use for calculating fairness.
Can be one of proportionalParity, equalParity, predictionBalance,
trueFavorableAndUnfavorableRateParity or
favorableAndUnfavorablePredictiveValueParity.
Used and required only if Bias & Fairness in AutoML feature is enabled.
fairness_threshold (Optional[str].) – (New in version v2.24) Threshold value for the fairness metric.
Can be in a range of [0.0, 1.0]. If the relative (i.e. normalized) fairness
score is below the threshold, the user will see a visual indication.
bias_mitigation_feature_name (Optional[str]) – The feature from protected features that will be used in a bias mitigation task to
mitigate bias
bias_mitigation_technique (Optional[str]) – One of datarobot.enums.BiasMitigationTechnique
Options:
‘preprocessingReweighing’
‘postProcessingRejectionOptionBasedClassification’
The technique by which we’ll mitigate bias, which will inform which bias mitigation task
we insert into blueprints
include_bias_mitigation_feature_as_predictor_variable (Optional[bool]) – Whether we should also use the mitigation feature as in input to the modeler just like
any other categorical used for training, i.e. do we want the model to “train on” this
feature in addition to using it for bias mitigation
series_id (string, optional) – (New in version v3.6) The name of a column containing the series ID for each row.
forecast_distance (string, optional) – (New in version v3.6) The name of a column containing the forecast distance for each row.
forecast_offsets (list of Optional[str]) – (New in version v3.6) The list of the names of the columns containing the forecast offsets
for each row.
incremental_learning_only_mode (Optional[bool]) – (New in version v3.4) Keep only models that support incremental learning during Autopilot run.
incremental_learning_on_best_model (Optional[bool]) – (New in version v3.4) Run incremental learning on the best model during Autopilot run.
chunk_definition_id (string, optional) – (New in version v3.4) Unique definition for chunks needed to run automated incremental learning.
incremental_learning_early_stopping_rounds (Optional[int]) – (New in version v3.4) Early stopping rounds used in the automated incremental learning service.
number_of_incremental_learning_iterations_before_best_model_selection (Optional[int] = None) – (New in version v3.6) Number of iterations top 5 models complete prior to best model selection.
The minimum is 1, which means no additional iterations after the first iteration (initial model)
will be run. The maximum is 10.
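Putting a few of the options above together, a sketch assuming this describes Project.set_options (the project id and the weights column name are placeholders):
import datarobot as dr

project = dr.Project.get("<project-id>")   # placeholder id
# Pass a complete AdvancedOptions object...
project.set_options(dr.AdvancedOptions(weights="row_weight", seed=42))
# ...or individual keyword arguments for an in-place update.
project.set_options(smart_downsampled=True, majority_downsampling_rate=50.0)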
Configures the partitioning method for this project.
If this project does not already have a partitioning method set, creates
a new configuration based on provided args.
If the partitioning_method arg is set, that configuration will instead be used.
Notes
This is an inplace update, not a new object. The options set will only remain for the
life of this project instance within a given session. You must still call set_target
to make this change permanent for the project. Calling refresh without first calling
set_target will invalidate this configuration. Similarly, calling get to retrieve a
second copy of the project will not include this configuration.
Added in version v3.0.
Parameters:
cv_method (str) – The partitioning method used. Supported values can be found in datarobot.enums.CV_METHOD.
validation_type (str) – May be “CV” (K-fold cross-validation) or “TVH” (Training, validation, and holdout).
seed (int) – A seed to use for randomization.
reps (int) – Number of cross validation folds to use.
user_partition_col (str) – The name of the column containing the partition assignments.
training_level (Union[str,int]) – The value of the partition column indicating a row is part of the training set.
validation_level (Union[str,int]) – The value of the partition column indicating a row is part of the validation set.
holdout_level (Union[str,int]) – The value of the partition column indicating a row is part of the holdout set (use
None if you want no holdout set).
cv_holdout_level (Union[str,int]) – The value of the partition column indicating a row is part of the holdout set.
validation_pct (int) – The desired percentage of dataset to assign to validation set.
holdout_pct (int) – The desired percentage of dataset to assign to holdout set.
partition_key_cols (list) – A list containing a single string, where the string is the name of the column whose
values should remain together in partitioning.
partitioning_method (PartitioningMethod, optional) – An instance of datarobot.helpers.partitioning_methods.PartitioningMethod that will
be used instead of creating a new instance from the other args.
Raises:
TypeError – If cv_method or validation_type are not set and partitioning_method is not set.
InvalidUsageError – If invoked after project.set_target or project.start, or
if invoked with the wrong combination of args for a given partitioning method.
Returns:project – The instance with updated attributes.
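For example, a sketch assuming this is Project.set_partitioning_method (the project id and target column are placeholders):
import datarobot as dr
from datarobot.enums import CV_METHOD

project = dr.Project.get("<project-id>")   # placeholder id
# Stratified 5-fold CV with a 20% holdout, configured before analyze_and_model().
project.set_partitioning_method(
    cv_method=CV_METHOD.STRATIFIED,
    validation_type="CV",
    reps=5,
    holdout_pct=20,
    seed=0,
)
project.analyze_and_model(target="<target-column>")   # makes the configuration permanent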
send_notification (boolean, default None) – (New in version v2.21) optional, whether or not an email notification should be sent;
defaults to None
include_feature_discovery_entities (boolean, default None) – (New in version v2.21) optional (default: None), whether or not to share all the
related entities, i.e., datasets for a project with Feature Discovery enabled
Return type:None
Raises:datarobot.ClientError : – if you do not have permission to share this project, if the user you’re sharing with
doesn’t exist, if the same user appears multiple times in the access_list, or if these
changes would leave the project without an owner
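A hedged sketch of sharing, assuming this describes Project.share with a list of SharingAccess entries (the project id and username are placeholders):
import datarobot as dr

project = dr.Project.get("<project-id>")   # placeholder id
access_list = [
    dr.SharingAccess("colleague@example.com", dr.enums.SHARING_ROLE.READ_WRITE),
]
project.share(access_list, send_notification=True)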
Create new features by transforming the type of existing ones.
Added in version v2.17.
Notes
The following transformations are only supported in batch mode:
Text to categorical or numeric
Categorical to text or numeric
Numeric to categorical
See the related documentation for special considerations when casting
numeric to categorical.
Date to categorical or numeric transformations are not currently supported for batch
mode but can be performed individually using create_type_transform_feature.
Parameters:
parent_names (list[str]) – The list of variable names to be transformed.
variable_type (str) – The type new columns should have. Can be one of ‘categorical’, ‘categoricalInt’,
‘numeric’, and ‘text’ - supported values can be found in
datarobot.enums.VARIABLE_TYPE_TRANSFORM.
prefix (Optional[str]) – The string that will preface all feature names. At least one of prefix and
suffix must be specified.
suffix (Optional[str]) – The string that will be appended to the end of all feature names. At least one of
prefix and suffix must be specified.
max_wait (Optional[int]) – The maximum amount of time to wait for DataRobot to finish processing the new column.
This process can take more time with more data to process. If this operation times
out, an AsyncTimeoutError will occur. DataRobot continues the processing and the
new column may successfully be constructed.
Returns:
all features for this project after transformation.
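For example, a sketch assuming this describes Project.batch_features_type_transform (the project id and column names are placeholders):
import datarobot as dr
from datarobot.enums import VARIABLE_TYPE_TRANSFORM

project = dr.Project.get("<project-id>")   # placeholder id
new_features = project.batch_features_type_transform(
    parent_names=["notes", "comments"],                 # hypothetical text columns
    variable_type=VARIABLE_TYPE_TRANSFORM.CATEGORICAL,
    prefix="cat_",
)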
ClientError – If the requested Interaction feature cannot be created. Possible reasons include:
* one of the features either does not exist or is of an unsupported type
* a feature with the requested name already exists
* an invalid separator character was submitted.
file_name (str) – File path where dataset will be saved.
model_id (Optional[str]) – ID of the model to export SQL for.
If specified, SQL to generate only features used by the model will be exported.
If not specified, SQL to generate all features will be exported.
max_wait (Optional[int]) – Time in seconds after which export is considered unsuccessful.
Raises:
ClientError – If requested SQL cannot be exported. Possible reason is the feature is not
available to user.
AsyncFailureError – If any of the responses from the server are unexpected.
The forecast windows settings, validation and holdout duration specified in the
datetime specification must be consistent with project settings as these parameters
are used to check whether the specified catalog version id has been validated or not.
See external baseline predictions documentation
for example usage.
Parameters:
catalog_version_id (str) – Id of the catalog version for validating external baseline predictions.
Instance of the DatetimePartitioning defined in
datarobot.helpers.partitioning_methods.
Attributes of the object used to check the validation are:
* datetime_partition_column
* forecast_window_start
* forecast_window_end
* holdout_start_date
* holdout_end_date
* backtests
* multiseries_id_columns
If the above attributes are different from the project settings, the catalog version
will not pass the validation check in the autopilot.
max_wait (Optional[int]) – The maximum number of seconds to wait for the catalog version to be validated before
raising an error.
Returns:external_baseline_validation_info – Validation result of the specified catalog version.
Return type:ExternalBaselineValidationInfo
Raises:AsyncTimeoutError – Raised if the catalog version validation took more time than specified
by the max_wait parameter.
Download multicategorical data format errors to a CSV file. If any format errors
were detected in potentially multicategorical features, the resulting file will contain
at most 10 entries. The CSV content contains the feature name, the dataset index in which the
error was detected, the row value, and the type of error detected. If there were no
errors, or none of the features were potentially multicategorical, the CSV file will be
empty, containing only the header.
Parameters:file_name (str) – File path where CSV file will be saved.
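A sketch, assuming this corresponds to Project.download_multicategorical_data_format_errors (the project id and file path are placeholders):
import datarobot as dr

project = dr.Project.get("<project-id>")   # placeholder id
project.download_multicategorical_data_format_errors("format_errors.csv")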
For a multiseries time series project, returns all distinct entries in the
multiseries column. For a non-time series project, returns an empty list.
Returns:multiseries_names – List of all distinct entries in the multiseries column
Segment restart is allowed only for segments that haven’t reached modeling phase.
Restart will permanently remove previous project and trigger set up of a new one
for particular segment.
Apply bias mitigation to an existing model by training a version of that model but with
bias mitigation applied.
An error will be returned if the model does not support bias mitigation with the technique
requested.
Added in version v2.29.
Parameters:
bias_mitigation_parent_leaderboard_id (str) – The leaderboard id of the model to apply bias mitigation to
bias_mitigation_feature_name (str) – The feature name of the protected features that will be used in a bias mitigation task to
attempt to mitigate bias
bias_mitigation_technique (Optional[str]) – One of datarobot.enums.BiasMitigationTechnique
Options:
‘preprocessingReweighing’
‘postProcessingRejectionOptionBasedClassification’
The technique by which we’ll mitigate bias, which will inform which bias mitigation task
we insert into blueprints
include_bias_mitigation_feature_as_predictor_variable (bool) – Whether we should also use the mitigation feature as in input to the modeler just like
any other categorical used for training, i.e. do we want the model to “train on” this
feature in addition to using it for bias mitigation
Returns:
the job of the model with bias mitigation applied that was just submitted for training
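A hedged sketch, assuming this describes Project.apply_bias_mitigation (the project id is a placeholder, the protected feature name is hypothetical, and the parent model choice is illustrative):
import datarobot as dr

project = dr.Project.get("<project-id>")   # placeholder id
parent_model = project.get_models()[0]     # the model to mitigate (illustrative choice)
job = project.apply_bias_mitigation(
    bias_mitigation_parent_leaderboard_id=parent_model.id,
    bias_mitigation_feature_name="gender",                 # hypothetical protected feature
    bias_mitigation_technique="preprocessingReweighing",
    include_bias_mitigation_feature_as_predictor_variable=False,
)
mitigated_model = job.get_result_when_complete()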
Request a compute job for bias mitigation feature info for a given feature, which will
include
- if there are any rare classes
- if there are any combinations of the target values and the feature values that never occur
in the same row
- if the feature has a high number of missing values.
Note that this feature check is dependent on the current target selected for the project.
Added in version v2.29.
Parameters:bias_mitigation_feature_name (str) – The feature name of the protected features that will be used in a bias mitigation task to
attempt to mitigate bias
Returns:
Bias mitigation feature info model for the requested feature
Get the computed bias mitigation feature info for a given feature, which will include
- if there are any rare classes
- if there are any combinations of the target values and the feature values that never occur
in the same row
- if the feature has a high number of missing values.
Note that this feature check is dependent on the current target selected for the project.
If this info has not already been computed, this will raise a 404 error.
Added in version v2.29.
Parameters:bias_mitigation_feature_name (str) – The feature name of the protected features that will be used in a bias mitigation task to
attempt to mitigate bias
Returns:
Bias mitigation feature info model for the requested feature
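A sketch of the request/retrieve pair, assuming these describe Project.request_bias_mitigation_feature_info and Project.get_bias_mitigation_feature_info (the project id is a placeholder and the feature name is hypothetical):
import datarobot as dr

project = dr.Project.get("<project-id>")   # placeholder id
project.request_bias_mitigation_feature_info("gender")     # start the computation
info = project.get_bias_mitigation_feature_info("gender")  # raises a 404 error if not yet computed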
Instantiate an object of this class using the data directly from the server,
meaning that the keys may have the wrong camel casing
Parameters:
data (dict) – The directly translated dict of JSON from the server. No casing fixes have
taken place
keep_attrs (iterable) – List, set or tuple of the dotted namespace notations for attributes to keep within the
object structure even if their values are None
Set the datetime partitioning method for a time series project by either passing in
a DatetimePartitioningSpecification instance or any individual attributes of that class.
Updates self.partitioning_method if already set previously (does not replace it).
Parameters:datetime_partition_spec (DatetimePartitioningSpecification) – DatetimePartitioningSpecification,
optional
The customizable aspects of datetime partitioning for a time series project. An alternative
to passing individual settings (attributes of the DatetimePartitioningSpecification class).
Returns:
Full partitioning including user-specified attributes as well as those determined by DR
based on the dataset.
This method makes an API call to retrieve settings from the DB if the project is in the modeling
stage, i.e. if analyze_and_model (autopilot) has already been called.
If analyze_and_model has not yet been called, this method will instead simply print
settings from project.partitioning_method.
Added in version v3.0.
Return type:DatetimePartitioningSpecification or None
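For example, a sketch assuming the setter described above is Project.set_datetime_partitioning, called with individual attributes rather than a full specification (the project id and column name are placeholders):
import datarobot as dr

project = dr.Project.get("<project-id>")    # placeholder id
full_partitioning = project.set_datetime_partitioning(
    datetime_partition_column="timestamp",  # hypothetical date column
    use_time_series=True,
    forecast_window_start=1,
    forecast_window_end=7,
)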
class datarobot.helpers.eligibility_result.EligibilityResult¶
Represents whether a particular operation is supported
For instance, a function to check whether a set of models can be blended can return an
EligibilityResult specifying whether or not blending is supported and why it may not be
supported.
Variables:
supported (bool) – whether the operation this result represents is supported
reason (str) – why the operation is or is not supported
Used when setting the target of a project to set advanced options of modeling process.
Parameters:
weights (Optional[str]) – The name of a column indicating the weight of each row
response_cap (Optional[bool] or Optional[float in [0.5, 1)]) – Defaults to None here, but the server defaults to False.
If specified, it is the quantile of the response distribution to use for response capping.
blueprint_threshold (Optional[int]) – Number of hours models are permitted to run before being excluded from later autopilot
stages
Minimum 1
seed (Optional[int]) – a seed to use for randomization
smart_downsampled (Optional[bool]) – whether to use smart downsampling to throw away excess rows of the majority class. Only
applicable to classification and zero-boosted regression projects.
majority_downsampling_rate (Optional[float]) – the percentage between 0 and 100 of the majority rows that should be kept. Specify only if
using smart downsampling. May not cause the majority class to become smaller than the
minority class.
offset (list of Optional[str]) – (New in version v2.6) the list of the names of the columns containing the offset
of each row
exposure (Optional[str]) – (New in version v2.6) the name of a column containing the exposure of each row
accuracy_optimized_mb (Optional[bool]) – (New in version v2.6) Include additional, longer-running models that will be run by the
autopilot and available to run manually.
scaleout_modeling_mode (Optional[str]) – (Deprecated in 2.28. Will be removed in 2.30) DataRobot no longer supports scaleout models.
Please remove any usage of this parameter as it will be removed from the API soon.
events_count (Optional[str]) – (New in version v2.8) the name of a column specifying events count.
monotonic_increasing_featurelist_id (Optional[str]) – (new in version 2.11) the id of the featurelist that defines the set of features
with a monotonically increasing relationship to the target. If None,
no such constraints are enforced. When specified, this will set a default for the project
that can be overridden at model submission time if desired.
monotonic_decreasing_featurelist_id (Optional[str]) – (new in version 2.11) the id of the featurelist that defines the set of features
with a monotonically decreasing relationship to the target. If None,
no such constraints are enforced. When specified, this will set a default for the project
that can be overridden at model submission time if desired.
only_include_monotonic_blueprints (Optional[bool]) – (new in version 2.11) when true, only blueprints that support enforcing
monotonic constraints will be available in the project or selected for the autopilot.
allowed_pairwise_interaction_groups (Optional[List[Tuple[str, ...]]]) – (New in version v2.19) For GA2M models - specify groups of columns for which pairwise
interactions will be allowed. E.g. if set to [(A, B, C), (C, D)] then GA2M models will
allow interactions between columns A x B, B x C, A x C, C x D. All others (A x D, B x D) will
not be considered.
blend_best_models (Optional[bool]) – (New in version v2.19) blend best models during Autopilot run.
scoring_code_only (Optional[bool]) – (New in version v2.19) Keep only models that can be converted to scorable java code
during Autopilot run
shap_only_mode (Optional[bool]) – (New in version v2.21) Keep only models that support SHAP values during Autopilot run. Use
SHAP-based insights wherever possible. Defaults to False.
prepare_model_for_deployment (Optional[bool]) – (New in version v2.19) Prepare model for deployment during Autopilot run.
The preparation includes creating reduced feature list models, retraining best model
on higher sample size, computing insights and assigning “RECOMMENDED FOR DEPLOYMENT” label.
consider_blenders_in_recommendation (Optional[bool]) – (New in version 2.22.0) Include blenders when selecting a model to prepare for
deployment in an Autopilot Run. Defaults to False.
min_secondary_validation_model_count (Optional[int]) – (New in version v2.19) Compute “All backtest” scores (datetime models) or cross validation
scores for the specified number of the highest ranking models on the Leaderboard,
if over the Autopilot default.
autopilot_data_sampling_method (Optional[str]) – (New in version v2.23) one of datarobot.enums.DATETIME_AUTOPILOT_DATA_SAMPLING_METHOD.
Applicable for OTV projects only, defines if autopilot uses “random” or “latest” sampling
when iteratively building models on various training samples. Defaults to “random” for
duration-based projects and to “latest” for row-based projects.
run_leakage_removed_feature_list (Optional[bool]) – (New in version v2.23) Run Autopilot on Leakage Removed feature list (if exists).
autopilot_with_feature_discovery (Optional[bool]) – default = False
(New in version v2.23) If true, autopilot will run on a feature list that includes features
found via search for interactions.
feature_discovery_supervised_feature_reduction (Optional[bool]) – (New in version v2.23) Run supervised feature reduction for feature discovery projects.
exponentially_weighted_moving_alpha (Optional[float]) – (New in version v2.26) defaults to None, value between 0 and 1 (inclusive), indicates
alpha parameter used in exponentially weighted moving average within feature derivation
window.
external_time_series_baseline_dataset_id (Optional[str]) – (New in version v2.26) If provided, will generate metrics scaled by external model
predictions metric for time series projects. The external predictions catalog
must be validated before autopilot starts, see
Project.validate_external_time_series_baseline and
external baseline predictions documentation
for further explanation.
use_supervised_feature_reduction (Optional[bool]) – defaults to True,
Time Series only. When true, during feature generation DataRobot runs a supervised
algorithm to retain only qualifying features. Setting to false can
severely impact autopilot duration, especially for datasets with many features.
primary_location_column (Optional[str].) – The name of primary location column.
protected_features (list of Optional[str].) – (New in version v2.24) A list of project features to mark as protected for
Bias and Fairness testing calculations. Max number of protected features allowed is 10.
preferable_target_value (Optional[str].) – (New in version v2.24) A target value that should be treated as a favorable outcome
for the prediction. For example, if we want to check gender discrimination for
giving a loan and our target is named is_bad, then the positive outcome for
the prediction would be No, which means that the loan is good and that’s
what we treat as a favorable result for the loaner.
fairness_metrics_set (Optional[str].) – (New in version v2.24) Metric to use for calculating fairness.
Can be one of proportionalParity, equalParity, predictionBalance,
trueFavorableAndUnfavorableRateParity or
favorableAndUnfavorablePredictiveValueParity.
Used and required only if Bias & Fairness in AutoML feature is enabled.
fairness_threshold (Optional[str].) – (New in version v2.24) Threshold value for the fairness metric.
Can be in a range of [0.0, 1.0]. If the relative (i.e. normalized) fairness
score is below the threshold, the user will see a visual indication.
bias_mitigation_feature_name (Optional[str]) – The feature from protected features that will be used in a bias mitigation task to
mitigate bias
bias_mitigation_technique (Optional[str]) – One of datarobot.enums.BiasMitigationTechnique
Options:
‘preprocessingReweighing’
‘postProcessingRejectionOptionBasedClassification’
The technique by which we’ll mitigate bias, which will inform which bias mitigation task
we insert into blueprints
include_bias_mitigation_feature_as_predictor_variable (Optional[bool]) – Whether we should also use the mitigation feature as in input to the modeler just like
any other categorical used for training, i.e. do we want the model to “train on” this
feature in addition to using it for bias mitigation
default_monotonic_increasing_featurelist_id (Optional[str]) – Returned from server on Project GET request - not able to be updated by user
default_monotonic_decreasing_featurelist_id (Optional[str]) – Returned from server on Project GET request - not able to be updated by user
model_group_id (Optional[str] = None) – (New in version v3.3) The name of a column containing the model group id for each row.
model_regime_id (Optional[str] = None) – (New in version v3.3) The name of a column containing the model regime id for each row.
model_baselines (Optional[List[str]] = None) – (New in version v3.3) The list of the names of the columns containing the model baselines
series_id (Optional[str] = None) – (New in version v3.6) The name of a column containing the series id for each row.
forecast_distance (Optional[str] = None) – (New in version v3.6) The name of a column containing the forecast distance for each row.
forecast_offsets (Optional[List[str]] = None) – (New in version v3.6) The list of the names of the columns containing the forecast offsets
for each row.
incremental_learning_only_mode (Optional[bool] = None) – (New in version v3.4) Keep only models that support incremental learning during Autopilot run.
incremental_learning_on_best_model (Optional[bool] = None) – (New in version v3.4) Run incremental learning on the best model during Autopilot run.
chunk_definition_id (Optional[str]) – (New in version v3.4) Unique definition for chunks needed to run automated incremental learning.
incremental_learning_early_stopping_rounds (Optional[int] = None) – (New in version v3.4) Early stopping rounds used in the automated incremental learning service.
number_of_incremental_learning_iterations_before_best_model_selection (Optional[int] = None) – (New in version v3.6) Number of iterations top 5 models complete prior to best model selection.
The minimum is 1, which means no additional iterations after the first iteration (initial model) will be run.
The maximum is 10.
feature_engineering_prediction_point (Optional[str] = None) – (New in version v3.7) The date column to be used as the prediction point for time-based feature engineering.
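Putting a few of these options together, a sketch (the project id, weights column, and target column are placeholders):
import datarobot as dr

options = dr.AdvancedOptions(
    weights="row_weight",                 # hypothetical column name
    seed=42,
    smart_downsampled=True,
    majority_downsampling_rate=75.0,
    blend_best_models=False,
)
project = dr.Project.get("<project-id>")  # placeholder id
project.analyze_and_model(target="<target-column>", advanced_options=options)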
A partition in which observations are randomly assigned to cross-validation groups
and the holdout set, preserving in each group the same ratio of positive to negative cases as in
the original data.
Parameters:
holdout_pct (int) – the desired percentage of dataset to assign to holdout set
reps (int) – number of cross validation folds to use
A partition in which one column is specified, and rows sharing a common value
for that column are guaranteed to stay together in the partitioning into cross-validation
groups and the holdout set.
Parameters:
holdout_pct (int) – the desired percentage of dataset to assign to holdout set
reps (int) – number of cross validation folds to use
partition_key_cols (list) – a list containing a single string, where the string is the name of the column whose
values should remain together in partitioning
A partition in which observations are randomly assigned to train, validation, and
holdout sets, preserving in each group the same ratio of positive to negative cases as in the
original data.
Parameters:
holdout_pct (int) – the desired percentage of dataset to assign to holdout set
validation_pct (int) – the desired percentage of dataset to assign to validation set
A partition in which one column is specified, and rows sharing a common value
for that column are guaranteed to stay together in the partitioning into the training,
validation, and holdout sets.
Parameters:
holdout_pct (int) – the desired percentage of dataset to assign to holdout set
validation_pct (int) – the desired percentage of dataset to assign to validation set
partition_key_cols (list) – a list containing a single string, where the string is the name of the column whose
values should remain together in partitioning
seed (int) – a seed to use for randomization
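The class names are not shown in the text above, but these descriptions appear to correspond to datarobot's StratifiedCV, GroupCV, StratifiedTVH, and GroupTVH partition helpers; assuming that reading, a sketch (the project id, target, and grouping column are placeholders):
import datarobot as dr

partition = dr.StratifiedCV(holdout_pct=20, reps=5, seed=0)
# or keep rows for the same entity together (hypothetical column name):
# partition = dr.GroupTVH(holdout_pct=20, validation_pct=16,
#                         partition_key_cols=["customer_id"], seed=0)

project = dr.Project.get("<project-id>")  # placeholder id
project.analyze_and_model(target="<target-column>", partitioning_method=partition)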
class datarobot.DatetimePartitioningSpecification¶
Uniquely defines a DatetimePartitioning for some project
Includes only the attributes of DatetimePartitioning that are directly controllable by users,
not those determined by the DataRobot application based on the project dataset and the
user-controlled settings.
Note that either (holdout_start_date, holdout_duration) or (holdout_start_date,
holdout_end_date) can be used to specify holdout partitioning settings.
Variables:
datetime_partition_column (str) – the name of the column whose values as dates are used to assign a row
to a particular partition
autopilot_data_selection_method (str) – one of datarobot.enums.DATETIME_AUTOPILOT_DATA_SELECTION_METHOD. Whether models created
by the autopilot should use “rowCount” or “duration” as their data_selection_method.
validation_duration (str or None) – the default validation_duration for the backtests
holdout_start_date (datetime.datetime or None) – The start date of holdout scoring data. If holdout_start_date is specified,
either holdout_duration or holdout_end_date must also be specified. If
disable_holdout is set to True, holdout_start_date, holdout_duration, and
holdout_end_date may not be specified.
holdout_duration (str or None) – The duration of the holdout scoring data. If holdout_duration is specified,
holdout_start_date must also be specified. If disable_holdout is set to True,
holdout_duration, holdout_start_date, and holdout_end_date may not be specified.
holdout_end_date (datetime.datetime or None) – The end date of holdout scoring data. If holdout_end_date is specified,
holdout_start_date must also be specified. If disable_holdout is set to True,
holdout_end_date, holdout_start_date, and holdout_duration may not be specified.
disable_holdout (bool or None) – (New in version v2.8) Whether to suppress allocating a holdout fold.
If set to True, holdout_start_date, holdout_duration, and holdout_end_date
may not be specified.
gap_duration (str or None) – The duration of the gap between training and holdout scoring data
number_of_backtests (int or None) – the number of backtests to use
backtests (list of BacktestSpecification) – the exact specification of backtests to use. The indices of the specified backtests should
range from 0 to number_of_backtests - 1. If any backtest is left unspecified, a default
configuration will be chosen.
use_time_series (bool) – (New in version v2.8) Whether to create a time series project (if True) or an OTV
project which uses datetime partitioning (if False). The default behavior is to create
an OTV project.
default_to_known_in_advance (bool) – (New in version v2.11) Optional, default False. Used for time series projects only. Sets
whether all features default to being treated as known in advance. Known in advance features
are expected to be known for dates in the future when making predictions, e.g., “is this a
holiday?”. Individual features can be set to a value different than the default using the
feature_settings parameter.
default_to_do_not_derive (bool) – (New in v2.17) Optional, default False. Used for time series projects only. Sets whether
all features default to being treated as do-not-derive features, excluding them from feature
derivation. Individual features can be set to a value different than the default by using
the feature_settings parameter.
feature_derivation_window_start (int or None) – (New in version v2.8) Only used for time series projects. Offset into the past to define how
far back relative to the forecast point the feature derivation window should start.
Expressed in terms of the windows_basis_unit and should be a negative value or zero.
feature_derivation_window_end (int or None) – (New in version v2.8) Only used for time series projects. Offset into the past to define how
far back relative to the forecast point the feature derivation window should end. Expressed
in terms of the windows_basis_unit and should be a negative value or zero.
feature_settings (list of FeatureSettings) – (New in version v2.9) Optional, a list specifying per feature settings, can be
left unspecified.
forecast_window_start (int or None) – (New in version v2.8) Only used for time series projects. Offset into the future to define
how far forward relative to the forecast point the forecast window should start. Expressed
in terms of the windows_basis_unit.
forecast_window_end (int or None) – (New in version v2.8) Only used for time series projects. Offset into the future to define
how far forward relative to the forecast point the forecast window should end. Expressed
in terms of the windows_basis_unit.
windows_basis_unit (string, optional) – (New in version v2.14) Only used for time series projects. Indicates which unit is
a basis for feature derivation window and forecast window. Valid options are detected time
unit (one of the datarobot.enums.TIME_UNITS) or “ROW”.
If omitted, the default value is the detected time unit.
treat_as_exponential (string, optional) – (New in version v2.9) defaults to “auto”. Used to specify whether to treat data
as exponential trend and apply transformations like log-transform. Use values from the
datarobot.enums.TREAT_AS_EXPONENTIAL enum.
differencing_method (string, optional) – (New in version v2.9) defaults to “auto”. Used to specify which differencing method to
apply in case the data is not stationary. Use values from the
datarobot.enums.DIFFERENCING_METHOD enum.
periodicities (list of Periodicity, optional) – (New in version v2.9) a list of datarobot.Periodicity. Periodicities units
should be “ROW”, if the windows_basis_unit is “ROW”.
multiseries_id_columns (List[str] or null) – (New in version v2.11) a list of the names of multiseries id columns to define series
within the training data. Currently only one multiseries id column is supported.
use_cross_series_features (bool) – (New in version v2.14) Whether to use cross series features.
aggregation_type (Optional[str]) – (New in version v2.14) The aggregation type to apply when creating
cross series features. Optional, must be one of “total” or “average”.
cross_series_group_by_columns (list of Optional[str]) – (New in version v2.15) List of columns (currently of length 1).
Optional setting that indicates how to further split series into
related groups. For example, if every series is sales of an individual product, the series
group-by could be the product category with values like “men’s clothing”,
“sports equipment”, etc.. Can only be used in a multiseries project with
use_cross_series_features set to True.
calendar_id (Optional[str]) – (New in version v2.15) The id of the CalendarFile to
use with this project.
unsupervised_mode (Optional[bool]) – (New in version v2.20) defaults to False, indicates whether partitioning should be
constructed for the unsupervised project.
model_splits (Optional[int]) – (New in version v2.21) Sets the cap on the number of jobs per model used when
building models to control number of jobs in the queue. Higher number of model splits
will allow for less downsampling leading to the use of more post-processed data.
allow_partial_history_time_series_predictions (Optional[bool]) – (New in version v2.24) Whether to allow time series models to make predictions using
partial historical data.
unsupervised_type (Optional[str]) – (New in version v3.2) The unsupervised project type, only valid if unsupervised_mode is
True. Use values from datarobot.enums.UnsupervisedTypeEnum enum.
If not specified then the project defaults to ‘anomaly’ when unsupervised_mode is True.
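A sketch of a multiseries time series specification using a few of these attributes (column names, project id, and target are placeholders):
import datarobot as dr

spec = dr.DatetimePartitioningSpecification(
    datetime_partition_column="date",      # hypothetical date column
    use_time_series=True,
    multiseries_id_columns=["store_id"],   # hypothetical series id column
    feature_derivation_window_start=-28,
    feature_derivation_window_end=0,
    forecast_window_start=1,
    forecast_window_end=7,
    number_of_backtests=3,
)
project = dr.Project.get("<project-id>")   # placeholder id
project.analyze_and_model(target="<target-column>", partitioning_method=spec)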
Uniquely defines a Backtest used in a DatetimePartitioning
Includes only the attributes of a backtest directly controllable by users. The other attributes
are assigned by the DataRobot application based on the project dataset and the user-controlled
settings.
There are two ways to specify an individual backtest:
Option 1: Use index, gap_duration, validation_start_date, and
validation_duration. All durations should be specified with a duration string such as those
returned by the partitioning_methods.construct_duration_string helper method.
import datarobot as dr
from datetime import datetime

partitioning_spec = dr.DatetimePartitioningSpecification(
    backtests=[
        # modify the first backtest using option 1
        dr.BacktestSpecification(
            index=0,
            gap_duration=dr.partitioning_methods.construct_duration_string(),
            validation_start_date=datetime(year=2010, month=1, day=1),
            validation_duration=dr.partitioning_methods.construct_duration_string(years=1),
        )
    ],
    # other partitioning settings...
)
Option 2 (New in version v2.20): Use index, primary_training_start_date,
primary_training_end_date, validation_start_date, and validation_end_date. In this
case, note that setting primary_training_end_date and validation_start_date to the same
timestamp will result with no gap being created.
import datarobot as dr
from datetime import datetime

partitioning_spec = dr.DatetimePartitioningSpecification(
    backtests=[
        # modify the first backtest using option 2
        dr.BacktestSpecification(
            index=0,
            primary_training_start_date=datetime(year=2005, month=1, day=1),
            primary_training_end_date=datetime(year=2010, month=1, day=1),
            validation_start_date=datetime(year=2010, month=1, day=1),
            validation_end_date=datetime(year=2011, month=1, day=1),
        )
    ],
    # other partitioning settings...
)
known_in_advance (bool) – (New in version v2.11) Optional, for time series projects
only. Sets whether the feature is known in advance, i.e., values for future dates are known
at prediction time. If not specified, the feature uses the value from the
default_to_known_in_advance flag.
do_not_derive (bool) – (New in v2.17) Optional, for time series projects only.
Sets whether the feature is excluded from feature derivation. If not
specified, the feature uses the value from the default_to_do_not_derive flag.
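For example, a sketch of per-feature settings attached to a specification (the column names are hypothetical):
import datarobot as dr

feature_settings = [
    dr.FeatureSettings("holiday", known_in_advance=True),
    dr.FeatureSettings("internal_metric", do_not_derive=True),
]
spec = dr.DatetimePartitioningSpecification(
    datetime_partition_column="date",      # hypothetical date column
    use_time_series=True,
    feature_settings=feature_settings,
)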
Includes both the attributes specified by the user, as well as those determined by the DataRobot
application based on the project dataset. In order to use a partitioning to set the target,
call to_specification and pass the
resulting
DatetimePartitioningSpecification to
Project.analyze_and_model via the partitioning_method
parameter.
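A sketch of that flow, assuming the full partitioning is first generated from a specification and then reused to aim the project (ids and column names are placeholders):
import datarobot as dr

project = dr.Project.get("<project-id>")                  # placeholder id
spec = dr.DatetimePartitioningSpecification("date")       # hypothetical date column
full = dr.DatetimePartitioning.generate(project.id, spec)
project.analyze_and_model(
    target="<target-column>",
    partitioning_method=full.to_specification(),
)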
The available training data corresponds to all the data available for training, while the
primary training data corresponds to the data that can be used to train while ensuring that all
backtests are available. If a model is trained with more data than is available in the primary
training data, then all backtests may not have scores available.
project_id (str) – the id of the project this partitioning applies to
datetime_partitioning_id (str or None) – the id of the datetime partitioning, if it is an optimized partitioning
datetime_partition_column (str) – the name of the column whose values as dates are used to assign a row
to a particular partition
date_format (str) – the format (e.g. “%Y-%m-%d %H:%M:%S”) by which the partition column was interpreted
(compatible with strftime)
autopilot_data_selection_method (str) – one of datarobot.enums.DATETIME_AUTOPILOT_DATA_SELECTION_METHOD. Whether models created
by the autopilot use “rowCount” or “duration” as their data_selection_method.
validation_duration (str or None) – the validation duration specified when initializing the partitioning - not directly
significant if the backtests have been modified, but used as the default validation_duration
for the backtests. Can be absent if this is a time series project with an irregular primary
date/time feature.
available_training_start_date (datetime.datetime) – The start date of the available training data for scoring the holdout
available_training_duration (str) – The duration of the available training data for scoring the holdout
available_training_row_count (int or None) – The number of rows in the available training data for scoring the holdout. Only available
when retrieving the partitioning after setting the target.
available_training_end_date (datetime.datetime) – The end date of the available training data for scoring the holdout
primary_training_start_date (datetime.datetime or None) – The start date of primary training data for scoring the holdout.
Unavailable when the holdout fold is disabled.
primary_training_duration (str) – The duration of the primary training data for scoring the holdout
primary_training_row_count (int or None) – The number of rows in the primary training data for scoring the holdout. Only available
when retrieving the partitioning after setting the target.
primary_training_end_date (datetime.datetime or None) – The end date of the primary training data for scoring the holdout.
Unavailable when the holdout fold is disabled.
gap_start_date (datetime.datetime or None) – The start date of the gap between training and holdout scoring data.
Unavailable when the holdout fold is disabled.
gap_duration (str) – The duration of the gap between training and holdout scoring data
gap_row_count (int or None) – The number of rows in the gap between training and holdout scoring data. Only available
when retrieving the partitioning after setting the target.
gap_end_date (datetime.datetime or None) – The end date of the gap between training and holdout scoring data.
Unavailable when the holdout fold is disabled.
disable_holdout (bool or None) – Whether to suppress allocating a holdout fold.
If set to True, holdout_start_date, holdout_duration, and holdout_end_date
may not be specified.
holdout_start_date (datetime.datetime or None) – The start date of holdout scoring data.
Unavailable when the holdout fold is disabled.
holdout_duration (str) – The duration of the holdout scoring data
holdout_row_count (int or None) – The number of rows in the holdout scoring data. Only available when retrieving the
partitioning after setting the target.
holdout_end_date (datetime.datetime or None) – The end date of the holdout scoring data. Unavailable when the holdout fold is disabled.
number_of_backtests (int) – the number of backtests used.
backtests (list of Backtest) – the configured backtests.
total_row_count (int) – the number of rows in the project dataset. Only available when retrieving the partitioning
after setting the target.
use_time_series (bool) – (New in version v2.8) Whether to create a time series project (if True) or an OTV
project which uses datetime partitioning (if False). The default behavior is to create
an OTV project.
default_to_known_in_advance (bool) – (New in version v2.11) Optional, default False. Used for time series projects only. Sets
whether all features default to being treated as known in advance. Known in advance features
are expected to be known for dates in the future when making predictions, e.g., “is this a
holiday?”. Individual features can be set to a value different from the default using the
feature_settings parameter.
default_to_do_not_derive (bool) – (New in v2.17) Optional, default False. Used for time series projects only. Sets whether
all features default to being treated as do-not-derive features, excluding them from feature
derivation. Individual features can be set to a value different from the default by using
the feature_settings parameter.
feature_derivation_window_start (int or None) – (New in version v2.8) Only used for time series projects. Offset into the past to define
how far back relative to the forecast point the feature derivation window should start.
Expressed in terms of the windows_basis_unit.
feature_derivation_window_end (int or None) – (New in version v2.8) Only used for time series projects. Offset into the past to define how
far back relative to the forecast point the feature derivation window should end. Expressed
in terms of the windows_basis_unit.
feature_settings (list of FeatureSettings) – (New in version v2.9) Optional, a list specifying per feature settings, can be
left unspecified.
forecast_window_start (int or None) – (New in version v2.8) Only used for time series projects. Offset into the future to define
how far forward relative to the forecast point the forecast window should start. Expressed
in terms of the windows_basis_unit.
forecast_window_end (int or None) – (New in version v2.8) Only used for time series projects. Offset into the future to define
how far forward relative to the forecast point the forecast window should end. Expressed in
terms of the windows_basis_unit.
windows_basis_unit (string, optional) – (New in version v2.14) Only used for time series projects. Indicates which unit is
a basis for feature derivation window and forecast window. Valid options are detected time
unit (one of the datarobot.enums.TIME_UNITS) or “ROW”.
If omitted, the default value is the detected time unit.
treat_as_exponential (string, optional) – (New in version v2.9) defaults to “auto”. Used to specify whether to treat data
as exponential trend and apply transformations like log-transform. Use values from the
datarobot.enums.TREAT_AS_EXPONENTIAL enum.
differencing_method (string, optional) – (New in version v2.9) defaults to “auto”. Used to specify which differencing method to
apply in case the data is not stationary. Use values from the
datarobot.enums.DIFFERENCING_METHOD enum.
periodicities (list of Periodicity, optional) – (New in version v2.9) a list of datarobot.Periodicity. Periodicities units
should be “ROW”, if the windows_basis_unit is “ROW”.
multiseries_id_columns (List[str] or null) – (New in version v2.11) a list of the names of multiseries id columns to define series
within the training data. Currently only one multiseries id column is supported.
number_of_known_in_advance_features (int) – (New in version v2.14) Number of features that are marked as known in advance.
number_of_do_not_derive_features (int) – (New in v2.17) Number of features that are excluded from derivation.
use_cross_series_features (bool) – (New in version v2.14) Whether to use cross series features.
aggregation_type (Optional[str]) – (New in version v2.14) The aggregation type to apply when creating cross series
features. Optional, must be one of “total” or “average”.
cross_series_group_by_columns (list of Optional[str]) – (New in version v2.15) List of columns (currently of length 1).
Optional setting that indicates how to further split series into
related groups. For example, if every series is sales of an individual product, the series
group-by could be the product category with values like “men’s clothing”,
“sports equipment”, etc.. Can only be used in a multiseries project with
use_cross_series_features set to True.
calendar_id (Optional[str]) – (New in version v2.15) Only available for time series projects. The id of the
CalendarFile to use with this project.
calendar_name (Optional[str]) – (New in version v2.17) Only available for time series projects. The name of the
CalendarFile used with this project.
model_splits (Optional[int]) – (New in version v2.21) Sets the cap on the number of jobs per model used when
building models to control number of jobs in the queue. Higher number of model splits
will allow for less downsampling leading to the use of more post-processed data.
allow_partial_history_time_series_predictions (Optional[bool]) – (New in version v2.24) Whether to allow time series models to make predictions using
partial historical data.
unsupervised_mode (Optional[bool]) – (New in version v3.1) Whether the date/time partitioning is for an unsupervised project
unsupervised_type (Optional[str]) – (New in version v3.2) The unsupervised project type, only valid if unsupervised_mode is
True. Use values from datarobot.enums.UnsupervisedTypeEnum enum.
If not specified then the project defaults to ‘anomaly’ when unsupervised_mode is True.
Preview the full partitioning determined by a DatetimePartitioningSpecification
Based on the project dataset and the partitioning specification, inspect the full
partitioning that would be used if the same specification were passed into
Project.analyze_and_model.
Parameters:
project_id (str) – the id of the project
spec (DatetimePartitioningSpec) – the desired partitioning
max_wait (Optional[int]) – For some settings (e.g. generating a partitioning preview for a multiseries project for
the first time), an asynchronous task must be run to analyze the dataset. max_wait
governs the maximum time (in seconds) to wait before giving up. In all non-multiseries
projects, this is unused.
target (Optional[str]) – the name of the target column. For unsupervised projects target may be None. Providing
a target will ensure that partitions are correctly optimized for your dataset.
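As a sketch, assuming the classmethod described is DatetimePartitioning.generate (ids and column names are placeholders):
import datarobot as dr

spec = dr.DatetimePartitioningSpecification(
    datetime_partition_column="date",   # hypothetical date column
    number_of_backtests=3,
)
preview = dr.DatetimePartitioning.generate(
    "<project-id>", spec, max_wait=600, target="<target-column>"
)
print(preview.holdout_start_date, preview.holdout_end_date)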
Preview the full partitioning determined by a DatetimePartitioningSpecification
Based on the project dataset and the partitioning specification, inspect the full
partitioning that would be used if the same specification were passed into
Project.analyze_and_model.
Retrieve an Optimized DatetimePartitioning from a project for the specified
datetime_partitioning_id. A datetime_partitioning_id is created by using the
generate_optimized function.
Parameters:
project_id (str) – the id of the project to retrieve partitioning for
datetime_partitioning_id (ObjectId) – the ObjectId associated with the project to retrieve from Mongo
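A short sketch of retrieving such a partitioning, assuming the get_optimized classmethod described above (both ids are placeholders):
import datarobot as dr

# Retrieve an optimized partitioning previously created with generate_optimized()
optimized_partitioning = dr.DatetimePartitioning.get_optimized(
    project_id='5c1d4904211c0a061bc93013',              # placeholder project id
    datetime_partitioning_id='5da9bb21962d746f97e4daee',  # placeholder partitioning id
)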
Retrieve the feature derivation log content and log length for a time series project.
The Time Series Feature Log provides details about the feature generation process for a
time series project. It includes information about which features are generated and their
priority, as well as the detected properties of the time series data such as whether the
series is stationary, and periodicities detected.
This route is only supported for time series projects that have finished partitioning.
The feature derivation log will include information about:
Detected stationarity of the series:
e.g. ‘Series detected as non-stationary’
Detected presence of multiplicative trend in the series:
e.g. ‘Multiplicative trend detected’
Detected periodicities in the series:
e.g. ‘Detected periodicities: 7 day’
Maximum number of features to be generated:
e.g. ‘Maximum number of features to be generated is 1440’
Window sizes used in rolling statistics / lag extractors:
e.g. ‘The window sizes chosen to be: 2 months
(because the time step is 1 month and Feature Derivation Window is 2 months)’
Features that are specified as known-in-advance:
e.g. ‘Variables treated as apriori: holiday’
Details about why certain variables are transformed in the input data:
e.g. ‘Generating variable “y (log)” from “y” because multiplicative trend
is detected’
Details about features generated as time series features, and their priority:
e.g. ‘Generating feature “date (actual)” from “date” (priority: 1)’
Parameters:
project_id (str) – project id to retrieve a feature derivation log for.
offset (int) – optional, defaults to 0, this many results will be skipped.
limit (int) – optional, defaults to 100, at most this many results are returned. To specify
no limit, use 0. The default may change without notice.
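For example, assuming this describes the feature_log_list classmethod of DatetimePartitioning, a page of the log could be fetched as follows (the project id is a placeholder):
import datarobot as dr

# Fetch the first 100 entries of the time series feature derivation log
feature_log = dr.DatetimePartitioning.feature_log_list(
    '5c1d4904211c0a061bc93013',  # placeholder project id
    offset=0,
    limit=100,
)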
Retrieve the feature derivation log content and log length for a time series project.
The Time Series Feature Log provides details about the feature generation process for a
time series project. It includes information about which features are generated and their
priority, as well as the detected properties of the time series data such as whether the
series is stationary, and periodicities detected.
This route is only supported for time series projects that have finished partitioning.
The feature derivation log will include information about:
Detected stationarity of the series:
e.g. ‘Series detected as non-stationary’
Detected presence of multiplicative trend in the series:
e.g. ‘Multiplicative trend detected’
Detected periodicities in the series:
e.g. ‘Detected periodicities: 7 day’
Maximum number of features to be generated:
e.g. ‘Maximum number of features to be generated is 1440’
Window sizes used in rolling statistics / lag extractors:
e.g. ‘The window sizes chosen to be: 2 months
(because the time step is 1 month and Feature Derivation Window is 2 months)’
Features that are specified as known-in-advance:
e.g. ‘Variables treated as apriori: holiday’
Details about why certain variables are transformed in the input data:
e.g. ‘Generating variable “y (log)” from “y” because multiplicative trend
is detected’
Details about features generated as time series features, and their priority:
e.g. ‘Generating feature “date (actual)” from “date” (priority: 1)’
Parameters:project_id (str) – project id to retrieve a feature derivation log for.
The resulting specification can be used when setting the target, and contains only the
attributes directly controllable by users.
Parameters:
use_holdout_start_end_format (Optional[bool]) – Defaults to False. If True, will use holdout_end_date when configuring the
holdout partition. If False, will use holdout_duration instead.
use_backtest_start_end_format (Optional[bool]) – Defaults to False. If False, will use a duration-based approach for specifying
backtests (gap_duration, validation_start_date, and validation_duration).
If True, will use a start/end date approach for specifying
backtests (primary_training_start_date, primary_training_end_date,
validation_start_date, validation_end_date).
In contrast, projects created in the Web UI will use the start/end date approach for specifying
backtests. Set this parameter to True to mirror the behavior in the Web UI.
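A hedged sketch of this round trip, assuming the partitioning has been retrieved with DatetimePartitioning.get (the project id is a placeholder):
import datarobot as dr

# Retrieve the project's partitioning and convert it back into a specification,
# using the start/end date style that mirrors the Web UI
partitioning = dr.DatetimePartitioning.get('5c1d4904211c0a061bc93013')  # placeholder project id
spec = partitioning.to_specification(
    use_holdout_start_end_format=True,
    use_backtest_start_end_format=True,
)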
Render the partitioning settings as a dataframe for convenience of display
Excludes project_id, datetime_partition_column, date_format,
autopilot_data_selection_method, validation_duration,
and number_of_backtests, as well as the row count information, if present.
Also excludes the time series specific parameters for use_time_series,
default_to_known_in_advance, default_to_do_not_derive, and defining the feature
derivation and forecast windows.
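For instance, a sketch of rendering the settings (the project id is a placeholder):
import datarobot as dr

# Retrieve the partitioning for an aimed project and render it as a DataFrame for display
partitioning = dr.DatetimePartitioning.get('5c1d4904211c0a061bc93013')  # placeholder project id
print(partitioning.to_dataframe())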
Retrieve the datetime partitioning log content and log length for an optimized
datetime partitioning.
The Datetime Partitioning Log provides details about the partitioning process for an OTV
or Time Series project.
Parameters:
project_id (str) – project id of the project associated with the datetime partitioning.
datetime_partitioning_id (str) – id of the optimized datetime partitioning
offset (int or None) – optional, defaults to 0, this many results will be skipped.
limit (int or None) – optional, defaults to 100, at most this many results are returned. To specify
no limit, use 0. The default may change without notice.
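As a sketch, and assuming the classmethod for this route is named datetime_partitioning_log_list (the method name is not confirmed by the text above; both ids are placeholders):
import datarobot as dr

# Page through the datetime partitioning log for an optimized partitioning
log_page = dr.DatetimePartitioning.datetime_partitioning_log_list(
    '5c1d4904211c0a061bc93013',    # placeholder project id
    '5da9bb21962d746f97e4daee',    # placeholder datetime partitioning id
    offset=0,
    limit=100,
)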
Retrieve the input used to create an optimized DatetimePartitioning from a project for
the specified datetime_partitioning_id. A datetime_partitioning_id is created by using the
generate_optimized function.
Parameters:
project_id (str) – The ID of the project to retrieve partitioning for.
datetime_partitioning_id (ObjectId) – The ObjectId associated with the project to retrieve from Mongo.
Returns:The input used to create the optimized datetime partitioning.
Return type:DatetimePartitioningInput
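A short sketch, assuming the retrieval classmethod for this route is get_input_data (both ids are placeholders):
import datarobot as dr

# Retrieve the DatetimePartitioningInput used to create the optimized partitioning
partitioning_input = dr.DatetimePartitioning.get_input_data(
    '5c1d4904211c0a061bc93013',    # placeholder project id
    '5da9bb21962d746f97e4daee',    # placeholder datetime partitioning id
)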
class datarobot.helpers.partitioning_methods.DatetimePartitioningId
Defines a DatetimePartitioningId used for datetime partitioning.
This class only includes the datetime_partitioning_id that identifies a previously
optimized datetime partitioning and the project_id for the associated project.
Update this instance, matching attributes to kwargs
Mainly used for the datetime partitioning spec but implemented in general for consistency
Return type:NoReturn
class datarobot.helpers.partitioning_methods.Backtest
A backtest used to evaluate models trained in a datetime partitioned project
When setting up a datetime partitioning project, backtests are specified by a
BacktestSpecification.
The available training data corresponds to all the data available for training, while the
primary training data corresponds to the data that can be used to train while ensuring that all
backtests are available. If a model is trained with more data than is available in the primary
training data, then some backtests may not have scores available.
available_training_start_date (datetime.datetime) – the start date of the available training data for this backtest
available_training_duration (str) – the duration of available training data for this backtest
available_training_row_count (int or None) – the number of rows of available training data for this backtest. Only available when
retrieving from a project where the target is set.
available_training_end_date (datetime.datetime) – the end date of the available training data for this backtest
primary_training_start_date (datetime.datetime) – the start date of the primary training data for this backtest
primary_training_duration (str) – the duration of the primary training data for this backtest
primary_training_row_count (int or None) – the number of rows of primary training data for this backtest. Only available when
retrieving from a project where the target is set.
primary_training_end_date (datetime.datetime) – the end date of the primary training data for this backtest
gap_start_date (datetime.datetime) – the start date of the gap between training and validation scoring data for this backtest
gap_duration (str) – the duration of the gap between training and validation scoring data for this backtest
gap_row_count (int or None) – the number of rows in the gap between training and validation scoring data for this
backtest. Only available when retrieving from a project where the target is set.
gap_end_date (datetime.datetime) – the end date of the gap between training and validation scoring data for this backtest
validation_start_date (datetime.datetime) – the start date of the validation scoring data for this backtest
validation_duration (str) – the duration of the validation scoring data for this backtest
validation_row_count (int or None) – the number of rows of validation scoring data for this backtest. Only available when
retrieving from a project where the target is set.
validation_end_date (datetime.datetime) – the end date of the validation scoring data for this backtest
total_row_count (int or None) – the number of rows in this backtest. Only available when retrieving from a project where
the target is set.
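For example, the backtest windows of a retrieved partitioning can be inspected through these attributes (a sketch; the project id is a placeholder):
import datarobot as dr

# Inspect each backtest's validation window and row counts
partitioning = dr.DatetimePartitioning.get('5c1d4904211c0a061bc93013')  # placeholder project id
for backtest in partitioning.backtests:
    print(
        backtest.validation_start_date,
        backtest.validation_end_date,
        backtest.total_row_count,
    )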
The resulting specification includes only the attributes users can directly control, not
those indirectly determined by the project dataset.
Parameters:use_start_end_format (bool) – Default False. If False, will use a duration-based approach for specifying
backtests (gap_duration, validation_start_date, and validation_duration).
If True, will use a start/end date approach for specifying
backtests (primary_training_start_date, primary_training_end_date,
validation_start_date, validation_end_date).
In contrast, projects created in the Web UI will use the start/end date approach for specifying
backtests. Set this parameter to True to mirror the behavior in the Web UI.
Wait for the job to complete, then attempt to convert the resulting JSON into an object of type
self.resource_type.
Return type:A newly created resource of type self.resource_type
Update a segment champion in a combined model by setting the model_id
that belongs to the child project_id as the champion.
Parameters:
project_id (str) – The project id for the child model that contains the model id.
model_id (str) – Id of the model to mark as the champion
clone (bool) – (New in version v2.29) optional, defaults to False.
Whether the combined model should be cloned before the champion is set
(if True, the champion is set on the new combined model).
Returns:combined_model_id – Id of the combined model that was updated
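A minimal sketch, assuming this is CombinedModel.set_segment_champion (both ids are placeholders):
import datarobot as dr

# Mark a child project's model as the segment champion in the combined model
combined_model_id = dr.CombinedModel.set_segment_champion(
    project_id='5c1d4904211c0a061bc93014',  # placeholder child project id
    model_id='5c1d4904211c0a061bc93015',    # placeholder model id
    clone=False,
)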
A Segmentation Task is used for segmenting an existing project into multiple child
projects. Each child project (or segment) will be a separate autopilot run. Currently
only user defined segmentation is supported.
Example for creating a new SegmentationTask for Time Series segmentation with a
user defined id column:
from datarobot import SegmentationTask

# Create the SegmentationTask
segmentation_task_results = SegmentationTask.create(
    project_id=project.id,
    target=target,
    use_time_series=True,
    datetime_partition_column=datetime_partition_column,
    multiseries_id_columns=[multiseries_id_column],
    user_defined_segment_id_columns=[user_defined_segment_id_column],
)

# Retrieve the completed SegmentationTask object from the job results
segmentation_task = segmentation_task_results['completedJobs'][0]
Variables:
id (bson.ObjectId) – The id of the segmentation task.
project_id (bson.ObjectId) – The associated id of the parent project.
type (str) – What type of job the segmentation task is associated with, e.g. auto_ml or auto_ts.
created (datetime.datetime) – The date this segmentation task was created.
segments_count (int) – The number of segments the segmentation task generated.
segments (list[str]) – The segment names that the segmentation task generated.
metadata (dict) – List of features that help to identify the parameters used by the segmentation task.
data (dict) – Optional parameters that are associated with enabled metadata for the segmentation task.
Creates segmentation tasks for the project based on the defined parameters.
Parameters:
project_id (str) – The associated id of the parent project.
target (str) – The column that represents the target in the dataset.
use_time_series (bool) – Whether AutoTS or AutoML segmentations should be generated.
datetime_partition_column (str or null) – Required for Time Series.
The name of the column whose values as dates are used to assign a row
to a particular partition.
multiseries_id_columns (List[str] or null) – Required for Time Series.
A list of the names of multiseries id columns to define series within the training
data. Currently only one multiseries id column is supported.
user_defined_segment_id_columns (List[str] or null) – Required when using a column for segmentation.
A list of the segment id columns to use to define what columns are used to manually
segment data. Currently only one user defined segment id column is supported.
model_package_id (str) – Required when using automated segmentation.
The associated id of the model in the DataRobot Model Registry that will be used to
perform automated segmentation on a dataset.
max_wait (integer) – The number of seconds to wait
Returns:segmentation_tasks – Dictionary containing the numberOfJobs, completedJobs, and failedJobs. completedJobs
is a list of SegmentationTask objects, while failedJobs is a list of dictionaries
indicating problems with submitted tasks.
class datarobot.models.segmentation.SegmentationTask
class datarobot.models.external_baseline_validation.ExternalBaselineValidationInfo
An object containing information about external time series baseline predictions
validation results.
Variables:
baseline_validation_job_id (str) – the identifier of the baseline validation job
project_id (str) – the identifier of the project
catalog_version_id (str) – the identifier of the catalog version used in the validation job
target (str) – the name of the target feature
datetime_partition_column (str) – the name of the column whose values as dates are used to assign a row
to a particular partition
is_external_baseline_dataset_valid (bool) – whether the external baseline dataset passes the validation check
multiseries_id_columns (List[str] or null) – a list of the names of multiseries id columns to define series
within the training data. Currently only one multiseries id column is supported.
holdout_start_date (str or None) – the start date of holdout scoring data
holdout_end_date (str or None) – the end date of holdout scoring data
backtests (list of dicts containing validation_start_date and validation_end_date or None) – the configured backtests of the time series project
forecast_window_start (int) – offset into the future to define how far forward relative to the forecast point the
forecast window should start.
forecast_window_end (int) – offset into the future to define how far forward relative to the forecast point the
forecast window should end.
message (str or None) – the description of the issue with the external baseline validation job
calendar_start_date (str) – The earliest date in the calendar.
calendar_end_date (str) – The last date in the calendar.
created (str) – The date this calendar was created, i.e. uploaded to DR.
name (str) – The name of the calendar.
num_event_types (int) – The number of different event types.
num_events (int) – The number of events this calendar has.
project_ids (list of strings) – A list containing the projectIds of the projects using this calendar.
multiseries_id_columns (List[str] or None) – A list of columns in calendar which uniquely identify events for different series.
Currently, only one column is supported.
If multiseries id columns are not provided, calendar is considered to be single series.
role (str) – The access role the user has for this calendar.
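For example, a calendar can be retrieved with CalendarFile.get (also used in the deletion example further below) and these attributes inspected; the id is a placeholder:
import datarobot as dr

# Retrieve a calendar and inspect the attributes described above
cal = dr.CalendarFile.get('5c1d4904211c0a061bc93013')  # placeholder calendar id
print(cal.name, cal.num_events, cal.calendar_start_date, cal.calendar_end_date, cal.role)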
A header row is required, and the “Series ID” and “Event Duration” columns are optional.
Once the CalendarFile has been created, pass its ID with
the DatetimePartitioningSpecification
when setting the target for a time series project in order to use it.
Parameters:
file_path (string) – A string representing a path to a local csv file.
calendar_name (string, optional) – A name to assign to the calendar. Defaults to the name of the file if not provided.
multiseries_id_columns (List[str] or None) – A list of the names of multiseries id columns to define which series an event
belongs to. Currently only one multiseries id column is supported.
Returns:calendar_file – Instance with initialized data.
# Creating a calendar with a specified name
cal = dr.CalendarFile.create('/home/calendars/somecalendar.csv',
                             calendar_name='Some Calendar Name')
cal.id
>>> 5c1d4904211c0a061bc93013
cal.name
>>> Some Calendar Name

# Creating a calendar without specifying a name
cal = dr.CalendarFile.create('/home/calendars/somecalendar.csv')
cal.id
>>> 5c1d4904211c0a061bc93012
cal.name
>>> somecalendar.csv

# Creating a calendar with multiseries id columns
cal = dr.CalendarFile.create('/home/calendars/somemultiseriescalendar.csv',
                             calendar_name='Some Multiseries Calendar Name',
                             multiseries_id_columns=['series_id'])
cal.id
>>> 5da9bb21962d746f97e4daee
cal.name
>>> Some Multiseries Calendar Name
cal.multiseries_id_columns
>>> ['series_id']
The “Series ID” and “Event Duration” columns are optional.
Once the CalendarFile has been created, pass its ID with
the DatetimePartitioningSpecification
when setting the target for a time series project in order to use it.
Parameters:
dataset_id (string) – The identifier of the dataset from which to create the calendar.
dataset_version_id (string, optional) – The identifier of the dataset version from which to create the calendar.
calendar_name (string, optional) – A name to assign to the calendar. Defaults to the name of the dataset if not provided.
multiseries_id_columns (list of Optional[str]) – A list of the names of multiseries id columns to define which series an event
belongs to. Currently only one multiseries id column is supported.
delete_on_error (boolean, optional) – Whether to delete the calendar file from the Catalog if it is not valid.
Returns:calendar_file – Instance with initialized data.
# Creating a calendar from a dataset
dataset = dr.Dataset.create_from_file('/home/calendars/somecalendar.csv')
cal = dr.CalendarFile.create_calendar_from_dataset(
    dataset.id, calendar_name='Some Calendar Name'
)
cal.id
>>> 5c1d4904211c0a061bc93013
cal.name
>>> Some Calendar Name

# Creating a calendar from a new dataset version
new_dataset_version = dr.Dataset.create_version_from_file(
    dataset.id, '/home/calendars/anothercalendar.csv'
)
cal = dr.CalendarFile.create_calendar_from_dataset(
    new_dataset_version.id, dataset_version_id=new_dataset_version.version_id
)
cal.id
>>> 5c1d4904211c0a061bc93012
cal.name
>>> anothercalendar.csv
Generates a calendar based on the provided country code and dataset start and end
dates. The provided country code should be uppercase and 2-3 characters long. See
CalendarFile.get_allowed_country_codes for a list of allowed country codes.
Parameters:
country_code (string) – The country code for the country to use for generating the calendar.
start_date (datetime.datetime) – The earliest date to include in the generated calendar.
end_date (datetime.datetime) – The latest date to include in the generated calendar.
Returns:calendar_file – Instance with initialized data.
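A hedged sketch, assuming this is CalendarFile.create_calendar_from_country_code and that 'US' appears in CalendarFile.get_allowed_country_codes:
from datetime import datetime

import datarobot as dr

# Generate a holiday calendar for the given country over the dataset's date range
cal = dr.CalendarFile.create_calendar_from_country_code(
    country_code='US',
    start_date=datetime(2018, 1, 1),
    end_date=datetime(2020, 12, 31),
)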
Gets the details of all calendars this user has view access for.
Parameters:
project_id (Optional[str]) – If provided, will filter for calendars associated only with the specified project.
batch_size (Optional[int]) – The number of calendars to retrieve in a single API call. If specified, the client may
make multiple calls to retrieve the full list of calendars. If not specified, an
appropriate default will be chosen by the server.
Returns:calendar_list – A list of CalendarFile objects.
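For example (a sketch; the project id is a placeholder):
import datarobot as dr

# List calendars visible to this user, filtered to a single project
calendars = dr.CalendarFile.list(project_id='5c1d4904211c0a061bc93013', batch_size=50)
for cal in calendars:
    print(cal.id, cal.name, cal.role)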
Parameters:calendar_id (str) – The id of the calendar to delete.
The requester must have OWNER access for this calendar.
Raises:ClientError – Raised if an invalid calendar_id is provided.
Return type:None
Examples
# Deleting with a valid calendar_id
status_code = dr.CalendarFile.delete(some_calendar_id)
status_code
>>> 204
dr.CalendarFile.get(some_calendar_id)
>>> ClientError: Item not found
Shares the calendar with the specified users, assigning the specified roles.
Parameters:
calendar_id (str) – The id of the calendar to update
access_list (List[SharingAccess]) – A list of dr.SharingAccess objects. Specify None for the role to delete a user’s
access from the specified CalendarFile. For more information on specific access levels,
see the sharing documentation.
Returns:status_code – 200 for success
Return type:int
Raises:
ClientError – Raised if unable to update permissions for a user.
AssertionError – Raised if access_list is invalid.
Examples
# assuming some_user is a valid user, share this calendar with some_user
sharing_list = [dr.SharingAccess(some_user_username,
                                 dr.enums.SHARING_ROLE.READ_WRITE)]
response = dr.CalendarFile.share(some_calendar_id, sharing_list)
response.status_code
>>> 200

# delete some_user from this calendar, assuming they have access of some kind already
delete_sharing_list = [dr.SharingAccess(some_user_username, None)]
response = dr.CalendarFile.share(some_calendar_id, delete_sharing_list)
response.status_code
>>> 200

# Attempt to add an invalid user to a calendar
invalid_sharing_list = [dr.SharingAccess(invalid_username,
                                         dr.enums.SHARING_ROLE.READ_WRITE)]
dr.CalendarFile.share(some_calendar_id, invalid_sharing_list)
>>> ClientError: Unable to update access for this calendar
Retrieve a list of users that have access to this calendar.
Parameters:
calendar_id (str) – The id of the calendar to retrieve the access list for.
batch_size (Optional[int]) – The number of access records to retrieve in a single API call. If specified, the client
may make multiple calls to retrieve the full access list. If not specified, an
appropriate default will be chosen by the server.
Returns:access_control_list – A list of SharingAccess objects.
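A minimal sketch, assuming this is CalendarFile.get_access_list (the calendar id is a placeholder):
import datarobot as dr

# Retrieve the users who have access to this calendar and their roles
access_list = dr.CalendarFile.get_access_list('5c1d4904211c0a061bc93013')
for access in access_list:
    print(access.username, access.role)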