Configure datetime partitioning
This notebook outlines how to use datetime partitioning with version 3.0 of DataRobot's Python client.
When dividing your data for model training and validation, DataRobot randomly chooses a set of rows from the training dataset to assign among different cross-validation folds. This process verifies that you have not overfit your model to the training set and that the model can perform well on new data.
However, when your data has an intrinsic time-based component, you must be cautious about target leakage. Although DataRobot offers datetime partitioning to guard against target leakage, you should always use your domain expertise to evaluate features prior to modeling.
The project in this notebook simulates a project with a time-based component that uses out-of-time validation (OTV) modeling. Note that this is not the same as time series modeling, even though the way DataRobot defines backtests for time series is very similar.
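To make the idea of out-of-time validation concrete, the following minimal sketch splits rows strictly by date rather than at random, so every holdout row is dated after every training row. It uses only the Python standard library. The `rows` data and the `out_of_time_split` helper are hypothetical illustrations and are not part of DataRobot's client.

```python
from datetime import datetime


def out_of_time_split(rows, date_key, holdout_start):
    """Split rows so that every holdout row is dated at or after holdout_start."""
    ordered = sorted(rows, key=lambda row: row[date_key])
    training = [row for row in ordered if row[date_key] < holdout_start]
    holdout = [row for row in ordered if row[date_key] >= holdout_start]
    return training, holdout


# Hypothetical example data: one observation per year, deliberately unsorted
rows = [
    {"Date": datetime(2017, 6, 1), "target": 4},
    {"Date": datetime(2015, 6, 1), "target": 2},
    {"Date": datetime(2016, 6, 1), "target": 3},
    {"Date": datetime(2014, 6, 1), "target": 1},
]

training, holdout = out_of_time_split(rows, "Date", datetime(2017, 1, 2))
```

Because the split is based on the date column alone, the model is always evaluated on rows that are later in time than anything it was trained on.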
This notebook requires the following:
- Python version 3.7+.
- DataRobot Python client version 3.0+.
- A pandas DataFrame (df) that includes the target feature.
Find reference documentation for DataRobot's Python client here.
import datarobot as dr
from datetime import datetime
dr.Client()
# The `config_path` should only be specified if the config file is not in
# the default location described in the API Quickstart guide
# dr.Client(config_path = 'path-to-drconfig.yaml')
<datarobot.rest.RESTClientObject at 0x7fbf184801c0>
spec = dr.DatetimePartitioningSpecification(
    datetime_partition_column='Date',
    holdout_start_date=datetime(2017, 1, 2),
    holdout_duration='P1Y0M0DT0H0M0S',
    number_of_backtests=2,
    use_time_series=False,
)

# Generate a preview based on your project's data
# `project` must be an existing dr.Project (one is created later in this notebook)
partitioning_preview = dr.DatetimePartitioning.generate(project.id, spec)
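Durations such as 'P1Y0M0DT0H0M0S' are ISO 8601 duration strings (here, one year). As a minimal sketch, the hypothetical helper below builds strings in the same layout used in this notebook. It is not part of the client. The client documentation describes a similar helper, construct_duration_string, and you should check the reference documentation for its exact location and signature before relying on it.

```python
def iso_duration(years=0, months=0, days=0, hours=0, minutes=0, seconds=0):
    """Build an ISO 8601 duration string in the P#Y#M#DT#H#M#S layout.

    Hypothetical helper for illustration only; it performs no validation
    of its inputs.
    """
    return f"P{years}Y{months}M{days}DT{hours}H{minutes}M{seconds}S"


one_year = iso_duration(years=1)  # same value as holdout_duration in this notebook
no_gap = iso_duration()           # a zero-length duration
```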
As of v3.0, Project.set_datetime_partitioning() and Project.list_datetime_partition_spec() are available as an alternative:
# View partitioning settings
project.list_datetime_partition_spec()

# Uncomment to disable holdout before you begin modeling
# project.set_datetime_partitioning(disable_holdout=True)
Create backtest specifications
DataRobot also lets you specify the validation start date and the validation duration of each backtest, as shown in the following cells. This approach applies to both time series and out-of-time validation projects. The snippet provided sets use_time_series = False in dr.DatetimePartitioningSpecification() to initiate an OTV project.
The methods used in the snippet below change the backtest specification for the first and second backtests. DataRobot recommends taking advantage of automated partitioning by setting use_time_series=True after you specify the number of backtests.
# Set duration of the validation backtests
duration_1y = 'P1Y0M0DT0H0M0S'
duration_0s = 'P0Y0M0DT0H0M0S'

# Note that the dates are not project-specific; they are example dates
spec.backtests = [
    dr.BacktestSpecification(
        0,
        gap_duration='P0Y0M0DT0H0M0S',
        validation_start_date=datetime(2016, 1, 2),
        validation_duration=duration_1y,
    ),
    dr.BacktestSpecification(
        1,
        gap_duration='P0Y0M0DT0H0M0S',
        validation_start_date=datetime(2015, 1, 2),
        validation_duration=duration_0s,
    ),
]

# Uncomment if you want more backtests
# spec.number_of_backtests = 5

# Use the lines below to initiate the project
project = dr.Project.create(sourcedata=df, project_name='Project Name')
project.analyze_and_model('target_column', partitioning_method=spec)
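The example validation start dates above step back one year at a time from the holdout start date. The minimal sketch below reproduces that layout. It assumes every validation window lasts exactly one year with no gap, and yearly_backtest_windows is a hypothetical helper. It only illustrates how the example dates line up and is not how DataRobot computes its automated partitions.

```python
from datetime import datetime


def yearly_backtest_windows(holdout_start, number_of_backtests):
    """Return (index, validation_start, validation_end) tuples, newest first.

    Assumes one-year validation windows with no gap between them. Note that
    datetime.replace raises ValueError if holdout_start falls on February 29.
    """
    windows = []
    end = holdout_start
    for index in range(number_of_backtests):
        start = end.replace(year=end.year - 1)
        windows.append((index, start, end))
        end = start  # the next, older window ends where this one starts
    return windows


windows = yearly_backtest_windows(datetime(2017, 1, 2), 2)
for index, start, end in windows:
    print(index, start.date(), end.date())
```

Backtest 0 is the most recent window and ends where the holdout begins; each higher index moves one year further into the past.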
Once backtests are configured for your project, you can proceed to modeling. See the use case for predicting CO₂ levels as an example.