Configure datetime partitioning
This notebook outlines how to use datetime partitioning with DataRobot's Python client.
When dividing your data for model training and validation, DataRobot randomly assigns rows from the training dataset to different cross-validation folds. This process helps verify that the model has not overfit the training set and can perform well on new data.
However, when your data has an intrinsic time-based component, you must be cautious about target leakage. Although DataRobot offers datetime partitioning to guard against target leakage, you should always use your domain expertise to evaluate features prior to modeling.
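To see why random folds are risky for time-dependent data, consider the following minimal sketch. It uses only the Python standard library and does not call DataRobot. The toy rows, the 70/30 split, and the helper `trains_on_future` are hypothetical and exist only for illustration. A random split will usually place some training rows after the earliest validation row, while a date cutoff never does.

```python
import random
from datetime import date, timedelta

# Hypothetical toy data: one row per day, identified only by its date.
start = date(2016, 1, 1)
rows = [start + timedelta(days=i) for i in range(10)]

# Random partitioning: shuffle the rows, then hold out the last 30%.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
shuffled = rows[:]
rng.shuffle(shuffled)
random_train, random_valid = shuffled[:7], shuffled[7:]

# Date-based partitioning: rows before the cutoff train,
# rows on or after the cutoff validate.
cutoff = date(2016, 1, 8)
time_train = [d for d in rows if d < cutoff]
time_valid = [d for d in rows if d >= cutoff]


def trains_on_future(train, valid):
    """Return True if any training row is dated after the earliest validation row."""
    return max(train) > min(valid)


print("random split trains on future rows:", trains_on_future(random_train, random_valid))
print("date split trains on future rows:", trains_on_future(time_train, time_valid))
```

The date-based split is the idea behind out-of-time validation: the model is always scored on rows that come later than anything it was trained on.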
The project in this notebook simulates a project with a time-based component that uses out-of-time validation (OTV) modeling. Note that this is not the same as Time Series modeling, even though the way DataRobot defines backtests for time series is very similar. You can download the notebook from the API user guide home page.
- Python version 3.7.3+.
- DataRobot API version 2.20.0+.
- A Pandas dataframe (df) with an indicated target feature.
See the reference documentation for DataRobot's Python client.
import datarobot as dr
from datetime import datetime
dr.Client()
# The `config_path` should only be specified if the config file is not in the default location described in the API Quickstart guide
# dr.Client(config_path = 'path-to-drconfig.yaml')
<datarobot.rest.RESTClientObject at 0x7fbf184801c0>
spec = dr.DatetimePartitioningSpecification(
    datetime_partition_column='Date',
    holdout_start_date=datetime(2017, 1, 2),
    holdout_duration='P1Y0M0DT0H0M0S',
    number_of_backtests=2,
    use_time_series=False,
)
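The duration arguments above are ISO 8601 duration strings: `P` starts the period, `T` separates the date part from the time part, and each number is followed by its unit. As a minimal sketch, the hypothetical helper below (`duration_string` is not part of the DataRobot client) builds strings in the same fully spelled-out form. The client is understood to provide a comparable helper, `construct_duration_string` in `datarobot.helpers.partitioning_methods`; treat that as an assumption and confirm it in the reference documentation for your client version.

```python
def duration_string(years=0, months=0, days=0, hours=0, minutes=0, seconds=0):
    """Build an ISO 8601 duration with every component written out."""
    return f"P{years}Y{months}M{days}DT{hours}H{minutes}M{seconds}S"


one_year = duration_string(years=1)
no_gap = duration_string()

print(one_year)  # P1Y0M0DT0H0M0S
print(no_gap)    # P0Y0M0DT0H0M0S
```

Building the strings from named arguments makes it harder to mistake the month field for the minute field, since both use the letter `M`.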
DataRobot recommends taking advantage of the automated partitioning by setting use_time_series=True after you specify the number of backtests. DataRobot also lets you specify the validation start date and duration of each backtest. You can view an example in the following cells.
Create backtest specifications
The approach below applies to both time series and out-of-time validation projects. The snippet provided passes use_time_series = False to dr.DatetimePartitioningSpecification() to set up an OTV project.
The snippet below replaces the specifications of the first and second backtests (indices 0 and 1).
# Note that the dates are not project-specific; they are example dates
spec.backtests = [
    dr.BacktestSpecification(
        0,
        gap_duration='P0Y0M0DT0H0M0S',
        validation_start_date=datetime(2016, 1, 2),
        validation_duration='P1Y0M0DT0H0M0S',
    ),
    dr.BacktestSpecification(
        1,
        gap_duration='P0Y0M0DT0H0M0S',
        validation_start_date=datetime(2015, 1, 2),
        validation_duration='P1Y0M0DT0H0M0S',
    ),
]

# Use the lines below to initiate the project
project = dr.Project.create(sourcedata=df, project_name='Project Name')
project.set_target('target_column', partitioning_method=spec)
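Before submitting a specification like the one above, it can help to confirm that the validation windows line up as intended. The sketch below is a standard-library check, not a DataRobot call. It reuses the example dates from this notebook, and the helper `add_years` is hypothetical. It assumes one-year validation durations and zero gaps, as in the example.

```python
from datetime import datetime


def add_years(start, years):
    """Shift a datetime by whole years (assumes the date is not February 29)."""
    return start.replace(year=start.year + years)


holdout_start = datetime(2017, 1, 2)

# (backtest index, validation start, validation length in years)
backtests = [
    (0, datetime(2016, 1, 2), 1),
    (1, datetime(2015, 1, 2), 1),
]

# Map each backtest index to its (validation start, validation end) pair.
windows = {index: (start, add_years(start, years)) for index, start, years in backtests}

for index, (start, end) in windows.items():
    print(f"backtest {index}: {start:%Y-%m-%d} to {end:%Y-%m-%d}")

# With zero gaps, each window should end where the next, more recent one begins.
assert windows[0][1] == holdout_start
assert windows[1][1] == windows[0][0]
```

With the example dates, backtest 0 validates on the year immediately before the holdout and backtest 1 on the year before that, so no validation window overlaps the holdout period.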