Data prep for time series¶
The time series data preparation tool, available as a public preview feature, is off by default. Contact your DataRobot representative or administrator for information on enabling the feature.
Feature flag: Enable Time Series Data Prep
When starting a time series project, DataRobot's data quality detection evaluates whether the time step is irregular. Irregular time steps can result in significant gaps in some series and preclude the use of seasonal differencing and cross-series features that can improve accuracy. To avoid the inaccurate rolling statistics these gaps can cause, you can:
- Let DataRobot use row-based partitioning.
- Fix the gaps with the time series data prep tool and then use duration-based partitioning.
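To make the idea of an irregular time step concrete, the following is a minimal sketch in plain Python. It is not DataRobot's detection logic: the function name `find_gaps` and the rule "a gap is any spacing larger than the median spacing" are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta
from statistics import median_low


def find_gaps(timestamps):
    """Return (typical_step, gaps) for a list of at least two datetimes.

    typical_step is the median spacing between consecutive rows. A gap is any
    consecutive pair spaced further apart than that (illustrative rule only).
    """
    ordered = sorted(timestamps)
    pairs = list(zip(ordered, ordered[1:]))
    # median_low returns an observed spacing, so no averaging of timedeltas is needed.
    typical_step = median_low([later - earlier for earlier, later in pairs])
    gaps = [(earlier, later) for earlier, later in pairs
            if later - earlier > typical_step]
    return typical_step, gaps


# Daily data in which January 4 and 5 are missing.
days = [datetime(2024, 1, d) for d in (1, 2, 3, 6, 7)]
step, gaps = find_gaps(days)
```

Here `step` is one day and `gaps` holds the single pair (January 3, January 6), which is the kind of hole that distorts rolling statistics.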
Generally speaking, the data prep tool first aggregates the dataset to the selected time step, and, if there are still missing rows, imputes the target value. It allows you to choose aggregation methods for numeric, categorical, and text values. You can also use it to explore modeling at different time scales. The resulting dataset is then published to the AI Catalog.
Access the data prep tool¶
The data prep tool can be accessed from the Start screen or directly from the AI Catalog.
From the Start screen¶
The data prep tool becomes available after initial setup (target, date/time feature, forecasting or nowcasting, series ID, if applicable). Click Fix Gaps For Duration-Based to use the tool when DataRobot detects that the time steps are irregular:
Or, even if the time steps are regular, use it to apply dataset customizations:
Click Time series data prep to open and modify the dataset in the AI Catalog.
If you choose to continue to the catalog from the Start screen, any manually created feature transformations or feature lists are lost and must be re-created in the new project.
From the AI Catalog¶
The method to modify a dataset in the AI Catalog is the same regardless of whether you started from the Start screen (described above) or from the catalog. In the catalog, open the dataset from the inventory and, from the menu, select Prepare time series dataset:
There are two mechanisms for modifying a dataset using the data prep tool:
- Manually, using dropdowns and selectors to generate code that sets the aggregation and imputation methods.
- Modifying a prepopulated Spark SQL query. (To create a dataset from a blank Spark SQL query, use the AI Catalog's Prepare data with Spark SQL functionality.)
Use Manual settings to generate code that prepares the modeling dataset. For further customization, complete each field and then click the Spark SQL query link to pre-populate the SQL query with these values.
For the Prepare time series dataset option to be enabled, you must have permission to modify the dataset. Additionally, the dataset must:
- Have a status of static or Spark.
- Have at least one date/time feature.
- Have at least one numeric feature.
Set manual options¶
Once you select Prepare time series dataset, the Manual settings page opens.
Complete the fields that will be used as the basis for the imputation and aggregation that DataRobot computes. You cannot save the query or edit it in Spark SQL until all required fields are complete.
|Field||Description||Required?|
|Target feature||Numeric column in the dataset to predict.||Yes|
|Primary date/time feature||Time feature used as the basis for partitioning. Use the dropdown or select from the identified features.||Yes|
|Series ID||Column containing the series identifier, which allows DataRobot to process the dataset as multiple separate time series.||No|
|Series start date (only available once series ID is set)||Basis for the series start date, either the earliest date for each series (per series) or the earliest date found for any series (global).||Defaults to per-series|
|Series end date (only available once series ID is set)||Basis for the series end date, either the last entry date for each series (per series) or the latest date found for any series (global).||Defaults to per-series|
|Target and numeric feature aggregation & imputation||Aggregate the target using either mean & most recent or sum and zero. In other words, the time step's aggregation is created using either the sum or the mean of the values. If there are still missing target values after aggregating, those values are imputed with zero (if
|Categorical feature aggregation & imputation||Aggregate categorical features using the most frequent value or the last value within the aggregation time step. Imputation only applies to features that are constant within a series (for example, the cross-series groupby column); these are imputed so that they remain constant within the series.||Yes|
|Text feature aggregation & imputation (only available if text features are present.)||Choose
|Time Step: The components—frequency and unit—that make up the detected median time delta between rows in the new dataset. For example, 15 (frequency) days (unit).|
|Frequency||Number of (time) units that make up the time step.||Defaults to the detected frequency|
|Unit||Time unit (seconds, days, months, etc.) that makes up the time step. Select from the dropdown.||Defaults to the detected unit|
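The difference between the per-series and global options for the series start and end dates can be illustrated with a short sketch. This is an illustration under stated assumptions, not the tool's implementation: the function `series_ranges` and the option strings `"per_series"` and `"global"` are hypothetical names.

```python
from datetime import date


def series_ranges(rows, start="per_series", end="per_series"):
    """Map each series ID to its (start, end) date range.

    rows is an iterable of (series_id, date) pairs. With "per_series", each
    series uses its own earliest/latest date; with "global", every series
    uses the earliest/latest date found in any series.
    """
    by_series = {}
    for series_id, day in rows:
        by_series.setdefault(series_id, []).append(day)

    all_days = [day for days in by_series.values() for day in days]
    global_start, global_end = min(all_days), max(all_days)

    return {
        series_id: (
            min(days) if start == "per_series" else global_start,
            max(days) if end == "per_series" else global_end,
        )
        for series_id, days in by_series.items()
    }


rows = [
    ("A", date(2024, 1, 1)), ("A", date(2024, 1, 5)),
    ("B", date(2024, 1, 3)), ("B", date(2024, 1, 10)),
]
per_series = series_ranges(rows)
global_range = series_ranges(rows, start="global", end="global")
```

With the per-series setting, series A spans January 1 to 5 and series B spans January 3 to 10. With the global setting, both series span January 1 to 10, so more rows would need to be filled in for each series.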
The data prep tool only imputes the target. Additional feature imputation is handled in each blueprint.
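As a rough illustration of aggregating to a time step and then imputing the target, consider the sketch below. It is not the code the tool generates. The function `prepare_target`, the method names `"sum_zero"` and `"mean_recent"`, anchoring the time steps at the first timestamp, and reading "most recent" as carrying the previous step's value forward are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def prepare_target(rows, step, method="sum_zero"):
    """Aggregate (timestamp, value) rows to a fixed step and impute empty steps.

    Returns a list of (step_start, value, imputed) tuples. "sum_zero" sums the
    values in each step and fills empty steps with 0.0; "mean_recent" averages
    them and fills empty steps with the previous step's value.
    """
    ordered = sorted(rows, key=lambda row: row[0])
    origin = ordered[0][0]  # assumption: steps are anchored at the first row

    buckets = {}
    for timestamp, value in ordered:
        index = (timestamp - origin) // step  # timedelta // timedelta -> int
        buckets.setdefault(index, []).append(value)

    prepared = []
    previous = None
    for index in range(max(buckets) + 1):
        values = buckets.get(index)
        if values:
            total = sum(values)
            value = total if method == "sum_zero" else total / len(values)
            imputed = False
        else:
            # Step 0 always contains the first row, so previous is set by now.
            value = 0.0 if method == "sum_zero" else previous
            imputed = True
        prepared.append((origin + index * step, value, imputed))
        previous = value
    return prepared


readings = [
    (datetime(2024, 1, 1, 0), 2.0),
    (datetime(2024, 1, 1, 12), 4.0),
    (datetime(2024, 1, 3, 6), 5.0),
]
daily_sum = prepare_target(readings, timedelta(days=1), "sum_zero")
daily_mean = prepare_target(readings, timedelta(days=1), "mean_recent")
```

In this example January 2 has no rows, so it is imputed: with 0.0 under the sum-based method and with the January 1 mean of 3.0 under the mean-based method.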
When the target is imputed with zeros, DataRobot randomly imputes known in advance (KA) features while deriving features in a time series project created from the prepared dataset. This prevents a missing feature value from becoming a known in advance indicator that would leak the zero-target value and lead to overly optimistic accuracy metrics.
Once all required fields are complete, three options become available:
- Click Run to preview the first 10,000 results of the query (the resulting dataset).
- Click Save to create a new Spark SQL dataset in the AI Catalog. DataRobot opens the Info tab for that dataset, which you can then use to create a new project or with any other option available for a Spark SQL dataset in the AI Catalog. If more than 50% of the dataset's rows are imputed, DataRobot displays a warning message.
- Click Edit Spark SQL query to open the Spark SQL editor and modify the initial query.
Edit the Spark SQL query¶
When you complete the Manual settings and click to open the Spark SQL editor, DataRobot populates the edit window with an initial query based on the manual settings. The script is customizable, just like any other Spark SQL query, allowing you to create a new dataset or a new version of the existing dataset.
When you have finished making changes, click Run to preview the results. If satisfied, click Save to add the new dataset to the AI Catalog. Or, click Back to manual settings to return to the dropdown-based entry. Because switching back to Manual settings discards all Spark SQL dataset preparation, you can use it as a way to undo modifications. If more than 50% of the dataset's rows are imputed, DataRobot displays a warning message.
When changes made with the data prep tool result in more than 50% of target rows being imputed, DataRobot alerts you with both:
- An alert on the catalog item's Info page.
- A badge on the dataset in the AI Catalog inventory:
It is best practice to review the dataset before using it for training or predictions to ensure that changes will not impact accuracy.
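One way to review a prepared dataset before training is to compute the share of imputed target rows yourself. The sketch below assumes you have a per-row flag marking imputed values. That flag, the function names, and the strict "more than 50%" comparison are assumptions for illustration and do not describe how DataRobot computes its warning.

```python
def imputed_fraction(imputed_flags):
    """Return the fraction of rows whose target was imputed (0.0 if no rows)."""
    flags = list(imputed_flags)
    return sum(flags) / len(flags) if flags else 0.0


def exceeds_imputation_threshold(imputed_flags, threshold=0.5):
    """Return True when more than `threshold` of the rows are imputed."""
    return imputed_fraction(imputed_flags) > threshold


# Hypothetical flags for a six-row prepared dataset: four rows were imputed.
flags = [False, True, True, False, True, True]
fraction = imputed_fraction(flags)
should_warn = exceeds_imputation_threshold(flags)
```

A high fraction suggests that the selected time step may be too fine for the data, and that a coarser time step could be worth exploring.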
When a project is created from a dataset that was modified by the data prep tool, you can automatically apply the transformations to a corresponding prediction dataset. On the Make Predictions tab, toggle the option to make your selection:
When on, DataRobot applies the same transformations to the dataset that you upload. Click Review transformations in AI Catalog to view a read-only version of the manual and Spark SQL settings, for example:
Consider the following when doing gap handling and aggregation:
- Data Prep is not supported for deployments or for use with the API.
- Only numeric targets are supported.
- Only numeric, categorical, text, and primary date columns are included in the output.
- The smallest allowed time step for aggregation is 1 minute.