Time-aware feature engineering
Time-aware feature engineering in Feature Discovery projects relies on a date feature in the primary table. This date prevents any feature derivation beyond the prediction point. In time-aware projects, the partition date column is, by default, used as the prediction point.
How time series features are derived
For information about how DataRobot derives time series features, see time series feature derivation. See also specific details about derived date features.
In non-time-aware projects, where there is no partition date feature, you can set a prediction point to enable time-aware feature engineering.
Prediction point and time indexes
In most cases, the primary dataset includes a prediction point feature, which indicates when the prediction would have had to be made. For example, in a loan request the prediction point feature might be the "loan request date," because each time a customer requests a loan the model must generate a prediction to decide whether to approve or decline.
In some cases, the primary dataset is built from one or more extracts taken at regular points in the past. For instance, to predict on the first of the month, you would want a monthly prediction point when building the training dataset (e.g., 2019-10-01, 2019-11-01, etc.). In this example, the prediction point feature might be “extract_date.”
In both cases, you want to avoid using information from secondary datasets that was not available before the prediction point (for example, transactions that happened after the loan request). To avoid this "time travel paradox," DataRobot integrates time-aware feature engineering capabilities and allows you to configure a feature derivation window (FDW), which defines a rolling window of past values that models use to generate features before the prediction point. With Feature Discovery, setting FDWs from the relationship editor can be understood as:
Using the loan application example, the loan request date would be the prediction point. If you have only a date (e.g., 02-14-20) and not a timestamp, you cannot tell whether an event happened before or after the specific loan request in terms of the actual hour, minute, and so on. To be conservative, DataRobot excludes everything on that exact date so that the model doesn't inadvertently include data from after the request. Using time-aware settings, you can set a rolling window to ensure that the most relevant data is included.
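The conservative date-only behavior described above can be illustrated with a minimal sketch. The function name `usable_before` and the sample dates are hypothetical and are not part of DataRobot; the sketch only shows the idea of excluding everything dated on or after the prediction date.

```python
from datetime import date
from typing import List


def usable_before(prediction_date: date, event_dates: List[date]) -> List[date]:
    """Keep only events dated strictly before the prediction date.

    With date-only values there is no way to tell whether a same-day event
    happened before or after the request, so same-day events are dropped.
    """
    return [d for d in event_dates if d < prediction_date]


# Hypothetical example: a loan requested on 2020-02-14.
loan_request = date(2020, 2, 14)
events = [date(2020, 2, 12), date(2020, 2, 14), date(2020, 2, 15)]
kept = usable_before(loan_request, events)
```

Here only the event from 2020-02-12 is kept; the same-day event and the later event are both excluded.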
Configure time-aware feature engineering
After a join is saved, if the added dataset has a date feature and if you set a prediction point, the Save and configure time-aware option becomes available. Click to open the Time-aware feature engineering editor.
Set the date/time feature of the secondary dataset to ensure that it only uses records happening before the prediction point to generate features. Once set, the FDW settings can be modified.
Set the boundaries of the FDW to determine how much historical data to use. By default, DataRobot sets the window to 30 to 0 days (e.g., transactions that happened in the 30 days before the "loan request date"). You can change a boundary by entering a new value and setting the increment. Keep in mind that a larger FDW slows down the Feature Discovery process.
DataRobot also automatically calculates additional, smaller FDWs for the project, in addition to the window you specify. For example, if you set the FDW parameters to "30 to 0 Days," DataRobot selects additional candidate durations (perhaps 1 to 0 weeks, 1 to 0 days, and 6 to 0 hours) and derives features from those windows. The new candidate window sizes are based on an internal algorithm that:
- Chooses additional windows between 50% and 0.5% of the original FDW size.
- Ensures the additional windows do not use a time unit with a smaller granularity than what is relevant for the primary date/time feature format.
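To make the first rule concrete, the following sketch computes the 0.5% and 50% bounds for a given FDW and checks the example durations mentioned above against them. It is an illustration only, not DataRobot's internal algorithm: the function name `candidate_bounds` is hypothetical, and the granularity rule from the second bullet is not modeled.

```python
from datetime import timedelta
from typing import Tuple


def candidate_bounds(fdw: timedelta) -> Tuple[timedelta, timedelta]:
    """Return the (smallest, largest) size allowed for additional windows."""
    smallest = fdw / 200  # 0.5% of the original window
    largest = fdw / 2     # 50% of the original window
    return smallest, largest


# A "30 to 0 Days" FDW and the example candidates from the text.
lo, hi = candidate_bounds(timedelta(days=30))
candidates = {
    "1 to 0 weeks": timedelta(weeks=1),
    "1 to 0 days": timedelta(days=1),
    "6 to 0 hours": timedelta(hours=6),
}
in_range = {name: lo <= size <= hi for name, size in candidates.items()}
```

For a 30-day window the bounds work out to 3 hours 36 minutes and 15 days, so all three example candidates fall inside the allowed range.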
If the time index doesn’t reflect the time when data becomes accessible, you can change the FDW end boundary to account for the delay. For example, perhaps a secondary dataset is supplied by an external data provider that gives you access with a two-day delay. You can specify a gap of two days (before the prediction point).
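As a rough illustration of how a gap shifts the window, the sketch below computes the start and end of a window whose end boundary is moved two days before the prediction point. The helper `fdw_bounds` and the sample timestamp are assumptions made for this example, not part of the product.

```python
from datetime import datetime, timedelta


def fdw_bounds(prediction_point, start_offset, end_offset):
    """Return (start, end) of the window.

    Both offsets are measured backward from the prediction point, so an
    end_offset greater than zero leaves a gap before the prediction point.
    """
    return prediction_point - start_offset, prediction_point - end_offset


# Hypothetical "30 to 2 days" window for a prediction at 2020-01-15 08:13.
pp = datetime(2020, 1, 15, 8, 13)
start, end = fdw_bounds(pp, timedelta(days=30), timedelta(days=2))
```

With these values the window runs from 2019-12-16 08:13 to 2020-01-13 08:13, so records from the two days that the provider has not yet delivered are never used.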
The FDW is reflected in the dataset tile:
Prediction point rounding
If a prediction point has many distinct values, the Feature Discovery process may be slow. To speed up processing, DataRobot, by default, rounds the prediction point down to the nearest minute. For example, if a loan has a prediction point ("loan_request_date") of 2020-01-15 08:13:53, DataRobot rounds that value down to 2020-01-15 08:13, dropping the seconds.
While rounding makes the Feature Discovery process faster, it comes at the cost of potentially losing the most recent secondary dataset records. In this example, records that occurred between 2020-01-15 08:13:00 and 2020-01-15 08:13:53 would not be used.
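The rounding in this example can be reproduced with a short sketch. The function name `round_down_to_minute` is hypothetical; the code only mirrors the arithmetic described above and is not how DataRobot implements rounding.

```python
from datetime import datetime


def round_down_to_minute(ts: datetime) -> datetime:
    """Drop seconds and microseconds, keeping everything from the minute up."""
    return ts.replace(second=0, microsecond=0)


pp = datetime(2020, 1, 15, 8, 13, 53)
rounded = round_down_to_minute(pp)
lost = pp - rounded  # span of recent records that fall after the rounded point
```

The difference between the original and rounded values is the 53-second span of records that the rounded prediction point no longer covers.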
If your project is sensitive to that level of record loss, you can change the default rounding from nearest minute to a more suitable selection:
Determine the final cutoff
Once Feature Discovery applies prediction point rounding and the FDW end, DataRobot derives the final "cutoff" used for time-aware feature engineering. The cutoff is the point in time beyond which DataRobot does not use records when generating features. In other words, the FDW (the rolling window of past values) is bounded by the furthest time back and the nearest time, both adjusted by the rounding selection.
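One way to picture how rounding and the FDW boundaries combine is the sketch below. It is a simplified model under stated assumptions: the helpers `floor_to` and `derivation_window` are hypothetical, the rounding unit is assumed to divide a day evenly, and whether the boundaries themselves are inclusive is not specified here.

```python
from datetime import datetime, timedelta


def floor_to(ts: datetime, unit: timedelta) -> datetime:
    """Round ts down to a multiple of unit, counted from midnight that day.

    Assumes unit divides 24 hours evenly (for example, one minute).
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = ts - midnight
    return midnight + (elapsed // unit) * unit


def derivation_window(prediction_point, rounding, fdw_start, fdw_end):
    """Apply rounding first, then step back by the FDW offsets."""
    rounded = floor_to(prediction_point, rounding)
    return rounded - fdw_start, rounded - fdw_end


# Default-style settings: minute rounding and a "30 to 0 days" window.
window_start, cutoff = derivation_window(
    datetime(2020, 1, 15, 8, 13, 53),
    rounding=timedelta(minutes=1),
    fdw_start=timedelta(days=30),
    fdw_end=timedelta(0),
)
```

Under these assumptions the window starts at 2019-12-16 08:13:00 and the cutoff is 2020-01-15 08:13:00; a nonzero `fdw_end` would move the cutoff further back by that amount.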
For example, this setting:
Can be understood conceptually as: