Configure datetime partitioning
This notebook outlines how to use datetime partitioning with DataRobot's Python client.
By default, when dividing your data for model training and validation, DataRobot randomly assigns rows from the training dataset to different cross-validation folds. This process helps verify that the model has not overfit the training set and that it can perform well on new data.
However, when your data has an intrinsic time-based component, you must be cautious about target leakage. Although DataRobot offers datetime partitioning to guard against target leakage, you should always use your domain expertise to evaluate features prior to modeling.
The project in this notebook simulates a project with a time-based component that uses out-of-time validation (OTV) modeling. Note that this is not the same as Time Series modeling, even though the way DataRobot defines backtests for time series is very similar. You can download the notebook from the API user guide home page.
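To make the idea behind out-of-time validation concrete, the following is a minimal sketch in plain Python, independent of DataRobot. The helper `out_of_time_split` and the toy rows are hypothetical and only illustrate the principle: every training row is dated before every validation row, so the model never sees data from the period it is evaluated on.

```python
from datetime import datetime


def out_of_time_split(rows, date_key, cutoff):
    """Split rows into (train, validation) around a cutoff timestamp.

    Rows dated strictly before `cutoff` go to training; all other rows
    go to validation.
    """
    train = [row for row in rows if row[date_key] < cutoff]
    validation = [row for row in rows if row[date_key] >= cutoff]
    return train, validation


# Hypothetical toy data: one row per year from 2012 through 2017
rows = [
    {"Date": datetime(year, 6, 1), "target": year % 2}
    for year in range(2012, 2018)
]

train, validation = out_of_time_split(rows, "Date", datetime(2016, 1, 2))

# The newest training row is older than the oldest validation row
latest_train = max(row["Date"] for row in train)
earliest_validation = min(row["Date"] for row in validation)
```

A random split of the same rows could place 2017 data in training and 2013 data in validation, which is the kind of leakage a time-ordered split avoids.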
Requirements
- Python version 3.7.3+.
- DataRobot API version 2.20.0+.
- A Pandas dataframe (df) with an indicated target feature.
See the reference documentation for DataRobot's Python client.
Import libraries
from datetime import datetime
import datarobot as dr
Connect to DataRobot
Read more about different options for connecting to DataRobot from the client.
# If the config file is not in the default location described in the API Quickstart guide, '~/.config/datarobot/drconfig.yaml', then you will need to call
# dr.Client(config_path='path-to-drconfig.yaml')
spec = dr.DatetimePartitioningSpecification(
    datetime_partition_column="Date",
    holdout_start_date=datetime(2017, 1, 2),
    holdout_duration="P1Y0M0DT0H0M0S",
    number_of_backtests=2,
    use_time_series=False,
)
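The duration arguments in the specification above, such as `holdout_duration`, are ISO 8601 duration strings with every component spelled out. As a minimal sketch, the hypothetical helper `iso_duration` below builds strings in that form from keyword arguments. It is a standalone stand-in written for illustration; the client's reference documentation lists a similar helper, `construct_duration_string`, which is worth checking against your client version.

```python
def iso_duration(years=0, months=0, days=0, hours=0, minutes=0, seconds=0):
    """Build an ISO 8601 duration string in the fully spelled-out form."""
    parts = (years, months, days, hours, minutes, seconds)
    if any(part < 0 for part in parts):
        raise ValueError("duration components must be non-negative")
    return f"P{years}Y{months}M{days}DT{hours}H{minutes}M{seconds}S"


one_year = iso_duration(years=1)  # same form as the holdout duration above
no_gap = iso_duration()           # a zero-length duration
```

Building the strings from named components makes typos, such as swapping the month and minute fields, easier to spot than in hand-written literals.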
DataRobot recommends taking advantage of the automated partitioning by setting use_time_series=True after you specify the number of backtests. For finer control, you can also specify each backtest's validation start date and duration. You can view an example in the following cells.
Create backtest specifications
The method below is applicable to both time series and out-of-time validation projects. The snippet provided passes use_time_series=False to the dr.DatetimePartitioningSpecification constructor to initiate an OTV project.
The snippet below sets the backtest specification for the first and second backtests (indices 0 and 1).
# Note that the dates are not project-specific; they are example dates
spec.backtests = [
    dr.BacktestSpecification(
        0,
        gap_duration="P0Y0M0DT0H0M0S",
        validation_start_date=datetime(2016, 1, 2),
        validation_duration="P1Y0M0DT0H0M0S",
    ),
    dr.BacktestSpecification(
        1,
        gap_duration="P0Y0M0DT0H0M0S",
        validation_start_date=datetime(2015, 1, 2),
        validation_duration="P1Y0M0DT0H0M0S",
    ),
]
# Use the lines below to initiate the project
project = dr.Project.create(sourcedata=df, project_name="Project Name")
project.set_target("target_column", partitioning_method=spec)
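The example dates above line up back to back: each one-year validation window ends where the next one starts, and the last one ends at the holdout start date. The sketch below checks that property with the standard library only. The helpers `add_years` and `find_misaligned` are hypothetical, handle only whole-year durations, and are not part of the DataRobot client; whether your own windows should be contiguous depends on your project.

```python
from datetime import datetime


def add_years(start, years):
    """Shift a datetime by whole years (assumes the date is not February 29)."""
    return start.replace(year=start.year + years)


# (label, validation start, duration in whole years), oldest first,
# mirroring the example dates used in the specification above
windows = [
    ("backtest 1", datetime(2015, 1, 2), 1),
    ("backtest 0", datetime(2016, 1, 2), 1),
    ("holdout", datetime(2017, 1, 2), 1),
]


def find_misaligned(windows):
    """Return labels of windows that do not start where the previous one ends."""
    misaligned = []
    for (_, prev_start, prev_years), (label, start, _) in zip(windows, windows[1:]):
        if add_years(prev_start, prev_years) != start:
            misaligned.append(label)
    return misaligned


misaligned = find_misaligned(windows)  # empty when every window is contiguous
holdout_end = add_years(datetime(2017, 1, 2), 1)
```

Running a check like this before creating the project can catch an accidental gap or overlap between windows when you adapt the dates to your own data.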