Configure datetime partitioning¶
This notebook outlines how to use datetime partitioning with version 3.0 of DataRobot's Python client.
When dividing your data for model training and validation, DataRobot randomly assigns rows from the training dataset to cross-validation folds. This process helps verify that you have not overfit your model to the training set and that the model can perform well on new data.
However, when your data has an intrinsic time-based component, you must be cautious about target leakage. Although DataRobot offers datetime partitioning to guard against target leakage, you should always use your domain expertise to evaluate features prior to modeling.
The project in this notebook simulates a project with a time-based component that uses out-of-time validation (OTV) modeling. Note that this is not the same as time series modeling, even though the way DataRobot defines backtests for time series is very similar.
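To illustrate the idea behind out-of-time validation, here is a minimal sketch in plain Python, independent of DataRobot. The `out_of_time_split` function and the toy rows are hypothetical and only show the principle: rows dated before a cutoff are used for training, and later rows are held out, so the model is never validated on data older than what it was trained on.

```python
from datetime import datetime
import random


def out_of_time_split(rows, date_key, holdout_start):
    """Split rows chronologically: rows before holdout_start train, the rest are held out."""
    train = [r for r in rows if r[date_key] < holdout_start]
    holdout = [r for r in rows if r[date_key] >= holdout_start]
    return train, holdout


# Hypothetical toy data: one row per year from 2013 through 2017
rows = [{"Date": datetime(year, 6, 1), "y": year % 7} for year in range(2013, 2018)]

train, holdout = out_of_time_split(rows, "Date", datetime(2017, 1, 2))

# A random split, by contrast, can put later rows in training and earlier
# rows in validation, which is where time-based leakage can creep in.
shuffled = rows[:]
random.Random(0).shuffle(shuffled)
random_train, random_valid = shuffled[:4], shuffled[4:]
```

In the chronological split, every training row is older than every holdout row; the random split gives no such guarantee.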
Requirements¶
- Python version 3.7+.
- DataRobot Python client version 3.0+.
- A Pandas dataframe (df) with an indicated target feature.
Find reference documentation for DataRobot's Python client here.
Import libraries¶
from datetime import datetime
import datarobot as dr
Connect to DataRobot¶
Read more about different options for connecting to DataRobot from the client.
# If the config file is not in the default location described in the API Quickstart guide, '~/.config/datarobot/drconfig.yaml', then you will need to call
# dr.Client(config_path='path-to-drconfig.yaml')
spec = dr.DatetimePartitioningSpecification(
    datetime_partition_column="Date",
    holdout_start_date=datetime(2017, 1, 2),
    holdout_duration="P1Y0M0DT0H0M0S",
    number_of_backtests=2,
    use_time_series=False,
)
# Generate a preview based on your project's data
# (this call needs an existing project; see the project creation lines later in this notebook)
partitioning_preview = dr.DatetimePartitioning.generate(project.id, spec)
As of v3.0, Project.set_datetime_partitioning() and Project.list_datetime_partition_spec() are available as an alternative:
# View partitioning settings
project.list_datetime_partition_spec()
# Uncomment to disable holdout before you begin modeling
# project.set_datetime_partitioning(disable_holdout=True)
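The holdout_duration value in the specification above is a duration string in the PnYnMnDTnHnMnS layout. As a minimal sketch, a small helper can build such strings instead of typing them by hand. The `duration_string` function below is hypothetical and is not part of the DataRobot client. The client reference also documents a helper named construct_duration_string for the same purpose; check the reference documentation for its import path and signature before relying on it.

```python
def duration_string(years=0, months=0, days=0, hours=0, minutes=0, seconds=0):
    """Build a duration string in the PnYnMnDTnHnMnS layout (hypothetical helper)."""
    parts = (years, months, days, hours, minutes, seconds)
    if any(p < 0 for p in parts):
        raise ValueError("duration components must be non-negative")
    return f"P{years}Y{months}M{days}DT{hours}H{minutes}M{seconds}S"


one_year = duration_string(years=1)  # same layout as the holdout_duration value
no_gap = duration_string()           # a zero-length duration
```

Building the strings from named arguments makes it harder to mix up the month and minute fields, which both use the letter M.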
Create backtest specifications¶
DataRobot provides further control by letting you specify the validation start date and duration of each backtest, as shown in the following cells. This approach applies to both time series and out-of-time validation projects. The snippet passes use_time_series=False to the dr.DatetimePartitioningSpecification() constructor to initiate an OTV project.
The methods used in the snippet below change the backtest specification for the first and second backtests. DataRobot recommends taking advantage of automated partitioning by setting use_time_series=True after you specify the number of backtests.
# Set duration of the validation backtests
duration_1y = "P1Y0M0DT0H0M0S"
duration_0s = "P0Y0M0DT0H0M0S"
# Note that the dates are not project-specific; they are example dates
spec.backtests = [
    # Backtest 0: one year of validation ending at the holdout start date
    dr.BacktestSpecification(
        0,
        gap_duration=duration_0s,
        validation_start_date=datetime(2016, 1, 2),
        validation_duration=duration_1y,
    ),
    # Backtest 1: one year of validation ending where backtest 0 begins
    # (a zero-length validation window would leave nothing to score)
    dr.BacktestSpecification(
        1,
        gap_duration=duration_0s,
        validation_start_date=datetime(2015, 1, 2),
        validation_duration=duration_1y,
    ),
]
# Uncomment if you want more backtests
# spec.number_of_backtests = 5
# Use the lines below to initiate the project
project = dr.Project.create(sourcedata=df, project_name="Project Name")
project.analyze_and_model("target_column", partitioning_method=spec)
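The example validation start dates in the backtest specification above step back one year at a time from the holdout start date, with no gap. The following sketch shows one way such dates could be derived rather than typed by hand. The `shift_years` and `yearly_backtest_windows` functions are hypothetical helpers, not part of the DataRobot client, and they assume whole-year windows and a start date that is not February 29.

```python
from datetime import datetime


def shift_years(moment, years):
    """Move `moment` by a whole number of years (assumes the date is not Feb 29)."""
    return moment.replace(year=moment.year + years)


def yearly_backtest_windows(holdout_start, number_of_backtests):
    """Return (index, validation_start, validation_end) tuples, newest backtest first."""
    windows = []
    end = holdout_start
    for index in range(number_of_backtests):
        start = shift_years(end, -1)
        windows.append((index, start, end))
        end = start  # the next (older) backtest ends where this one starts
    return windows


windows = yearly_backtest_windows(datetime(2017, 1, 2), 2)
for index, start, end in windows:
    print(index, start.date(), "to", end.date())
```

Each tuple's start date can then be passed as validation_start_date for the backtest with the matching index.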
Once backtests are configured for your project, you can proceed to modeling. See the use case for predicting CO₂ levels as an example.