Time series modeling¶
Contact your DataRobot representative for information on enabling automated time series (AutoTS) modeling.
See the documented file requirements for information on file size and series limit considerations.
The following sections provide information on using the time series feature of DataRobot's forecasting (modeling to predict future values) and nowcasting (modeling to determine current values) functionality. What follows is a brief overview of time series modeling and then a detailed workflow.
See this article for a more technical discussion of the general framework for developing time series models, including generating features and preprocessing the data as well as automating the process to apply advanced machine learning algorithms to almost any time series problem.
Time series overview¶
When working with time series data, ask yourself: How long do I want to look into the past and how far into the future do I want to predict? Once you determine those answers, you can configure DataRobot so that your time-sensitive data uses advanced DataRobot modeling techniques to create forecasts from your data.
DataRobot automatically creates and selects time series features in the modeling data. You can constrain the features (for example, minimum and maximum lags, etc.) by configuring the time series framework on the Start screen. Based on your settings and the analysis of the raw dataset, DataRobot derives new features and creates a modeling dataset. Because time shifts, lags, and features have already been applied, DataRobot can use general machine learning algorithms to build models with the new modeling dataset.
In general, the time series model building process is as follows:
- Upload your raw data; DataRobot runs EDA1.
- Set window parameters, such as the feature derivation window and forecast window.
- DataRobot applies that framework to the dataset and creates a new modeling dataset with time series features.
First, though, be certain that your data is the correct type to employ forecasting or nowcasting. DataRobot categorizes data based on the time step—the typical time difference between rows—as one of three types:
|Regular||Regularly spaced events||Monday through Sunday|
|Semi-regular||Data that is mostly regularly spaced||Every business day but not weekends.|
|Irregular||No consistent time step||Random birthdays|
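As a rough illustration of the three categories, the sketch below labels a list of dates by how often the most common gap between rows occurs. The function name `classify_time_step` and the `semi_regular_share` threshold are assumptions made for this example; the documentation does not state how DataRobot makes this decision internally.

```python
from collections import Counter
from datetime import date, timedelta


def classify_time_step(timestamps, semi_regular_share=0.8):
    """Label a sorted list of dates as regular, semi-regular, or irregular.

    `semi_regular_share` is a hypothetical threshold chosen for this
    sketch; it is not a documented DataRobot setting.
    """
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not deltas:
        raise ValueError("need at least two timestamps")
    # Most common gap between consecutive rows and how often it occurs.
    step, count = Counter(deltas).most_common(1)[0]
    share = count / len(deltas)
    if share == 1.0:
        return "regular", step
    if share >= semi_regular_share:
        return "semi-regular", step
    return "irregular", None


start = date(2024, 1, 1)  # a Monday
every_day = [start + timedelta(days=i) for i in range(14)]
business_days = [d for d in every_day if d.weekday() < 5]
scattered = [start + timedelta(days=i) for i in (0, 3, 4, 11, 30, 31, 58)]

print(classify_time_step(every_day)[0])
print(classify_time_step(business_days)[0])
print(classify_time_step(scattered)[0])
```

The daily series has a single gap size, the business-day series is mostly one-day gaps with a three-day gap over the weekend, and the scattered dates have no dominant gap.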
Assuming a regular or semi-regular time step, DataRobot's time series functionality works by encoding time-sensitive components as features, transforming your original input dataset into a modeling dataset that can use conventional machine learning techniques. (Note that a time step is different than a time interval, which is described below.) For each original row of your data, the modeling dataset includes both:
- new rows representing examples of predicting different distances into the future
- for each input feature, new columns of lagged features and rolling statistics for predicting that new distance.
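The expansion described above can be sketched in plain Python. This is a simplified illustration, not DataRobot's feature engineering: the helper `derive_rows`, the column names, and the choice of lags and window size are all assumptions made for the example.

```python
from statistics import mean


def derive_rows(target, lags=(1, 2), window=3, distances=(1, 2)):
    """Expand a raw series into one row per (forecast point, distance).

    Hypothetical helper: each row carries lagged values and a rolling
    mean computed only from data at or before the forecast point `t`,
    plus the value to predict at `t + distance`.
    """
    rows = []
    first_t = max(max(lags), window - 1)       # enough history for features
    last_t = len(target) - 1 - max(distances)  # enough future for targets
    for t in range(first_t, last_t + 1):
        features = {f"lag_{k}": target[t - k] for k in lags}
        features[f"rolling_mean_{window}"] = mean(target[t - window + 1 : t + 1])
        for d in distances:
            rows.append(
                {"forecast_point": t, "forecast_distance": d, **features,
                 "y": target[t + d]}
            )
    return rows


sales = [10, 12, 11, 13, 15, 14, 16]
rows = derive_rows(sales)
for row in rows[:2]:
    print(row)
```

Each original row that has enough history and enough future produces one new row per forecast distance, which is why the modeling dataset can then be handed to conventional machine learning algorithms.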
DataRobot’s time series modeling supports both regression and binary classification projects. Each type has a full selection of models available from Autopilot or the Repository, specific to the project type. Both types have generally the same workflow and options, with the following differences found in binary classification projects:
- The Treat as exponential trend? and Apply differencing? advanced options are disabled, as is the Exposure setting.
- Simple and seasonal differencing are not applied.
- Only classification metrics are supported.
- No differencing is performed, so feature lists using a differenced target are not created. By default, Autopilot runs on Baseline only (average baseline) and Time Series Informative Features. Note that "average baseline" refers to the average of the target in the feature derivation window.
- Classification blueprints do not use naive predictions as offset in modeling.
Time series forecast modeling is based on the following framework; see the reference section for a description of the framework elements. See the section on nowcasting to better understand that framework.
The following provides detailed steps for enabling time series modeling:
After uploading a time series-friendly dataset, select a target and click on Set up time-aware modeling:
From the dropdown, select the primary date/time feature. The dropdown lists all date/time features that DataRobot detected during EDA1.
After selecting a feature (note that DataRobot detects the time unit), DataRobot computes and then loads a histogram of the time feature plotted against the target feature (feature-over-time). Note that if your dataset qualifies for multiseries modeling, this histogram represents the average of the time feature values across all series plotted against the target feature. Review the histogram:
This example plots sales, per week, over time. In these two years of data, you can see seasonal spikes and that the business is growing overall.
Select forecasting or nowcasting as the time-series approach you would like to apply:
If DataRobot detects multiple series in your dataset:
- To enable multiseries modeling, set the series identifier.
- If DataRobot does not detect a series but your dataset qualifies, set the series identifier using Advanced options.
To enable segmented modeling, after selecting the series identifier, click to change the value of Segmentation method from None to your segment ID.
Then, return to the next step to complete the time series configuration.
Configure the time series model, i.e., set the windows of time DataRobot will use to derive features and the window basis.
If using nowcasting, these window settings differ.
Set the training window format, either Duration or Row Count, to specify how Autopilot chooses training periods when building models. Before setting this value, see the details of row count vs. duration and how they apply to different folds. Note that, for irregular datasets, the setting defaults to Row Count. While you can change this setting, it is highly recommended that you leave it, as changing to Duration may result in unexpected training windows or model errors.
Consider whether to set "known in advance" (KA) features or upload an event calendar in the advanced options. Here you can identify the features to be treated as KA variables, setting them to be used unlagged when making predictions. Also, you can specify a calendar listing the events for DataRobot to use when automatically deriving time series features (setting features as unlagged when making predictions).
Then, explore what a feature looks like over time to view its trends and determine whether there are gaps in your data (which is a data flaw you need to know about). To access the histogram, expand a numeric feature and click the Over time option:
In this example, you can see a strong weekly pattern as well as a seasonal pattern. You can also change the resolution to see how the data aggregates at different intervals. The binned data (blue bars at the bottom of the plot) represents the number of rows per bin. Visualization of data density can provide information about potential missing values.
If desired, set Advanced options > Time Series.
Click Start. DataRobot then takes the framework you configured and engineers new features to create the time series modeling dataset.
Display the Data page to watch the new features as they are created. By default DataRobot displays the Derived Modeling Data panel; to see your original data, click Original Time Series Data.
After reviewing the dataset, consider whether you want to restore any features that were pruned by the feature reduction process.
Consider the Leaderboard¶
Once modeling begins, DataRobot displays models on the Leaderboard as they complete. Because time series modeling uses date/time partitioning, you can run backtests, change window sampling, change training periods, etc. from the Leaderboard (described here).
Some notes on time series models:
DataRobot builds both the standard algorithms and special time series blueprints to run specific models for time series. As always, you can run any time series models that DataRobot did not run from the Repository.
DataRobot generates both traditional time series models (e.g., the ARIMA family) and advanced time series models (e.g., XGBoost).
For models with the suffix "with Forecast Distance Modeling," DataRobot builds a different model for each distance in the future, each having a unique blueprint to make that prediction.
The "Baseline prediction using most recent value" model (also known as "naive predictions") uses the most recent value or seasonal differences as the prediction; this model can be used as a baseline for judging performance.
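To make the idea of a naive baseline concrete, here is a minimal sketch of two common variants: repeating the most recent value, and repeating the value from one seasonal period earlier. The function names are hypothetical, and the seasonal variant is a standard "seasonal naive" forecast used here as an assumption; the exact computation DataRobot performs is not described in this section.

```python
def last_value_baseline(history, horizon):
    """Predict the most recent observed value for every future step."""
    return [history[-1]] * horizon


def seasonal_naive_baseline(history, horizon, period):
    """Predict the value observed one seasonal period before each step."""
    if len(history) < period:
        raise ValueError("history must cover at least one full period")
    n = len(history)
    return [history[n - period + (h % period)] for h in range(horizon)]


weekly_pattern = [5, 7, 9, 5, 7, 9, 6]
print(last_value_baseline(weekly_pattern, 4))
print(seasonal_naive_baseline(weekly_pattern, 4, period=3))
```

A model that cannot beat either of these on the backtests is adding little beyond what the recent history already says.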
Make Predictions tab¶
There are two methods for making predictions with time series models:
For prediction datasets that are less than 1GB, use the Make Predictions tab from the Leaderboard. This is the method described below.
Be aware that using a forecasting range with time series predictions can result in a significant increase over the original dataset size. Use the batch predictions capabilities to avoid out-of-memory errors.
The Leaderboard Make Predictions tab works slightly differently than with traditional modeling. The following describes, briefly, using Make Predictions with time series; see the full Make Predictions tab details for more information.
ARIMA model blueprints must be provided with full history when making batch predictions.
The Make Predictions tab provides summaries to help determine how much recent data—either time unit or rows, depending on how you configured your feature derivation and forecast point windows—is required in the prediction dataset and to review the forecast rows and KA settings. Note that the list of features displayed as KA only includes those KA features that are part of the feature list used to build the current model. The Forecast Settings tab provides an overview of the prediction dataset for help in changing settings as well as access to the auto-generated prediction file template.
In this example, the prediction dataset needs at least 28 days of historical data and can predict (return) up to 7 rows. (Although the model was configured for 21 days before the forecast point, seven days are added to the required history because the model uses seven-day differencing.)
The following provides an overview to making predictions with time series modeling:
Once you have selected a model to use for predictions, if you haven't already done so you are prompted to unlock holdout and retrain the model. It's a good idea to complete this step so that the model uses the most recent data, but it is not required.
Optionally, change the forecast point—the date to begin making predictions from—from the DataRobot default.
Create a prediction-ready dataset¶
If you choose to manually create a prediction dataset, use the provided summary to determine the number of historical rows needed. Optionally, open Forecast Settings to change the forecast point, making sure that the historical row requirements from your new forecast point are met in the prediction dataset. If needed, click See an example dataset for a visual representation of the format required for the CSV file.
The following example shows that you would leave the target and non-KA values in rows 7 through 9 (the "Forecast rows") blank; DataRobot fills in those rows with the prediction values when you compute predictions.
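A prediction file of that shape could be assembled as sketched below. The column names (`date`, `sales`, `temperature`, `promo`) and the helper `build_prediction_csv` are invented for illustration; here `sales` plays the role of the target, `temperature` a non-KA feature, and `promo` a KA feature.

```python
import csv
import io
from datetime import date, timedelta


def build_prediction_csv(history, forecast_dates, known_in_advance):
    """Return CSV text: history rows followed by blank forecast rows.

    history          : list of dicts with every column filled in
    forecast_dates   : dates of the forecast rows to append
    known_in_advance : {date: {ka_column: value}} for the forecast rows
    """
    columns = list(history[0].keys())
    rows = list(history)
    for day in forecast_dates:
        row = {name: "" for name in columns}  # target and non-KA stay blank
        row["date"] = day.isoformat()
        row.update(known_in_advance.get(day, {}))  # only KA values are filled
        rows.append(row)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


first_day = date(2024, 6, 1)
history = [
    {"date": (first_day + timedelta(days=i)).isoformat(),
     "sales": 100 + i, "temperature": 20 + i, "promo": i % 2}
    for i in range(6)
]
forecast_dates = [first_day + timedelta(days=i) for i in range(6, 9)]
ka_values = {day: {"promo": 1} for day in forecast_dates}
csv_text = build_prediction_csv(history, forecast_dates, ka_values)
print(csv_text)
```

With six history rows, the three appended rows land in data rows 7 through 9, each with only the date and the KA column populated.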
When your prediction dataset is in the appropriate format, click Import data from to select and upload it into DataRobot. Then, compute predictions.
While KA features can have missing values in the prediction data inside of the forecast window, that configuration may affect prediction accuracy. DataRobot surfaces a warning and also an information message beneath the affected dataset. Also, if you have missing history when picking a forecast point that is later than the default, DataRobot will still allow you to compute predictions.
Prediction file template¶
If your forecast point setting requires additional forecast rows be added to the original prediction dataset, DataRobot automatically generates a template file that appends those needed rows. Use the auto-generated prediction template as-is or download and make modifications. To create the template, click Import data from to select and upload the intended dataset. DataRobot generates the template if, after the default forecast point, it finds no row with an empty target value, that is, no row that can serve as a forecast row.
For example, suppose your forecast window is +5 ... +6 and, counting from the default forecast point, the rows for the distances in that window (through t6) are missing, while earlier points are present. In this case, DataRobot generates the extended file because it found no forecast rows within the forecast window (up to t6) after the default forecast point.
For DataRobot to generate a template, the following conditions must be met:
- There are no supported forecast rows (empty target rows that fall within the forecast window).
- The generated template file size is less than the upload file limit.
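The two conditions can be expressed as a small check. This is a sketch of the rule as stated in the list above, using daily data; the function `needs_template` and its parameters are hypothetical names, and the byte sizes in the example are made up.

```python
from datetime import date, timedelta


def needs_template(rows, forecast_point, fw_start, fw_end,
                   template_bytes, upload_limit_bytes):
    """rows: list of (date, target) pairs; target is None when empty."""
    window_start = forecast_point + timedelta(days=fw_start)
    window_end = forecast_point + timedelta(days=fw_end)
    # A supported forecast row has an empty target inside the forecast window.
    has_forecast_row = any(
        target is None and window_start <= day <= window_end
        for day, target in rows
    )
    return (not has_forecast_row) and template_bytes < upload_limit_bytes


forecast_point = date(2024, 6, 6)
history_only = [(date(2024, 6, 1) + timedelta(days=i), 100 + i) for i in range(6)]
with_forecast_row = history_only + [(date(2024, 6, 7), None)]

print(needs_template(history_only, forecast_point, 1, 7, 1_000, 1_000_000))
print(needs_template(with_forecast_row, forecast_point, 1, 7, 1_000, 1_000_000))
```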
Use the template as-is¶
Use the template as-is if you do not need to modify the forecast rows or add any KA features. DataRobot will set the forecast point and add the full number of rows required to satisfy the project's forecast window configuration.
Use the default auto-expansion if you are using the most recent data as your forecast point, have no gaps, and want the full number of rows. In this case, you can upload the dataset and compute predictions.
Modify the template¶
DataRobot generates the prediction file template as soon as you upload a prediction dataset. However, there are cases where you may want to modify that template before computing predictions:
You have identified a column as a KA feature and need to enter relevant information in the forecast rows.
You have multiple series and want to predict on fewer than every series in the dataset. (DataRobot adds the necessary number of rows for each series in the dataset.)
Based on your settings DataRobot would have generated several additional rows but you want to predict on fewer.
To modify a template:
Click Forecast Settings (Forecast Point Predictions tab), expand the Advanced options link, and download the auto-generated prediction file template:
Open the template and add any required information to the new forecast rows or remove rows you don't need as they will only slow predictions.
Save the modified template and upload it back into DataRobot using Import data from.
Optionally, set the forecast point to something other than the default.
The Forecast Settings modal provides configuration options for making two kinds of predictions:
Use Forecast Point Predictions to select the specific date (forecast point) from which you want to begin making predictions. By default, the forecast point is the most recent valid timestamp that maximizes the usage of time history within the feature derivation window. You can select any date shown since DataRobot trains models using all potential forecast points. Be sure, if you select a different forecast point, that your dataset has enough history.
Use Forecast Range Predictions for making predictions on all forecast distances within the selected date range. This option provides bulk predictions on an external dataset, including all forecast distance predictions for all rows in the dataset. Use the results for validating the model, not for making future predictions.
Forecast Point Predictions¶
The Forecast Settings > Forecast Point Predictions modal provides help in setting a forecast point different from the default point set by DataRobot:
Elements of the modal are described in the table below:
|Prediction type selector (1)||Selects either Forecast Point (this page) or Forecast Range (bulk predictions).|
|Advanced options (2)||Expands to download the prediction file template (if created).|
|Row summary (3)||The same summary information as that on the Make Predictions tab. Colors correspond to the visualization below (6), showing the historical and forecast rows set during original project creation.|
|Valid forecast point range (4)||In the context of the date span for the entire dataset (5), the colored bar above the full range indicates the range of dates that are valid forecast point settings (dates that will produce valid predictions). While the entire bar indicates possible valid options, dates within the yellow range are those that extend beyond DataRobot's suggested forecast point because they have missing history or KA features. Also, if there are gaps inside this range, the predictions may still fail (due to insufficient time history or no forecast row). See more date information.|
|Dataset start and end (5)||The full range of dates found in the dataset. In cases where DataRobot created a prediction file template, the dataset end date and template file end date are both represented. If the dataset end and max forecast distance are the same, the display does not show the dataset end. The historical and forecast rows summarized above (3) are also overlaid on the span. The overlay moves as the forecast point setting changes. See more date information.|
|Historical and forecast zoom (6)||A zoomed view of the relevant historical rows and forecast rows, intended to simplify selecting a forecast point (7).|
|Forecast point selector (7)||A calendar picker for setting the forecast point. Invalid dates—those not indicated in the valid forecast range (4)—are disabled in the calendar. See more date information.|
|Close modal options (8)||Initiate prediction computation (same as Compute Predictions on the Make Predictions page). Or, save the settings and close the modal without computing predictions. New settings are reflected on the Make Predictions page, and clicking Compute Predictions from there at any future time will use these settings. Alternatively, click the X to close without saving changes.|
The default forecast point (1) is either the most recent row in the dataset that contains a valid target value or, if you configured gaps during project setup, it is the row in the dataset that satisfies the feature derivation window’s history requirements. Open Forecast settings (2) to customize the forecast point.
You must use the default forecast point for fractional-second forecasts.
Forecast Range Predictions¶
Forecast Range Predictions are helpful for validating model accuracy. DataRobot extracts the actual values for all points in time from the dataset. Set the prediction start and end dates to define the historical range of time for which you want bulk predictions. Because this model evaluation process uses actual values, DataRobot only generates predictions for timestamps that can support predictions for every forecast distance.
Understand dates in forecast settings¶
When you upload a prediction dataset, DataRobot detects the range of dates (the valid forecast range) available for use as the forecast point. It also determines a default forecast point, which is the latest timestamp available for making predictions with full history.
The following timestamps are marked in the visualization:
- Data start is the timestamp of the first row detected in the dataset.
- Data end is the timestamp of the last row detected in the dataset, whether it is the original or the auto-generated template.
- Max forecast distance is the timestamp of the last possible forecast distance in the dataset.
Before modifying the forecast point, review the basic time series modeling framework.
Some things to consider:
What is the most recent valid forecast point? The most recent valid forecast point is the maximum forecast point that can be used to run predictions without error. It may differ from the default forecast point because the default forecast point takes the time history usage into consideration.
Based on the forecast window, what is the timestamp of the last prediction that was output? The forecast window is defined relative to the forecast point; the last prediction timestamp is a function of both the forecast window and the timestamp inside the prediction dataset.
For example, consider a forecast window from 1 to 7 days. The forecast point is 2001-01-01, but the max date in the dataset is 2001-01-05. In this case, the max forecast timestamp is 2001-01-05 as there are no rows from 2001-01-06 to 2001-01-08.
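The arithmetic in this example can be checked with a few lines of Python. The helper name `last_prediction_timestamp` is hypothetical, and the sketch assumes daily data in which a prediction is only produced for timestamps that appear as rows in the dataset.

```python
from datetime import date, timedelta


def last_prediction_timestamp(forecast_point, fw_start, fw_end, dataset_dates):
    """Latest forecast-window date that also exists as a row in the dataset."""
    window = [forecast_point + timedelta(days=d)
              for d in range(fw_start, fw_end + 1)]
    available = [day for day in window if day in dataset_dates]
    return max(available) if available else None


# Rows from 2000-12-01 through 2001-01-05 (36 daily rows).
dataset_dates = {date(2000, 12, 1) + timedelta(days=i) for i in range(36)}
last = last_prediction_timestamp(date(2001, 1, 1), 1, 7, dataset_dates)
print(last)
```

The forecast window spans 2001-01-02 through 2001-01-08, but only the dates up to 2001-01-05 exist in the dataset, so that is the last prediction timestamp.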
Consider the length of your forecast window. That is, after the final row with actual values, do you have at least one forecast row (within the boundaries of the forecast window)? If you do, DataRobot will not generate a template; if you do not, DataRobot will generate forecast rows based on the project configuration.
Use the Forecast settings modal to get an overview of the prediction dataset, which aids in choosing settings like the forecast point and prediction start and end dates. In addition, DataRobot generates forecast rows after the final row with actual values (if there are no forecast rows based on the default forecast point), simplifying the prediction workflow. The actual values are the data taken from the last row of each and every series ID and duplicated to the forecast rows.
Time series prediction dataset validation
DataRobot validates a time series prediction dataset once it is uploaded, checking whether there are sufficient historical rows to produce the engineered features required by the project.
If seasonality is detected in the project, additional historical rows, longer than the feature derivation window (FDW), are required. For example, a project with an FDW of [-14, 0] and 7-day seasonality will require 21 historical days in the prediction dataset to accommodate target differenced features (such as target (7 day diff) (mean)) and differencing features (such as target (14 day max) (diff 7 day mean)). If multiple seasonalities are detected, the longest seasonality is used to perform the validation check.
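The history requirement in this example, and in the earlier Make Predictions example (21 days plus seven-day differencing gives 28), follows the same simple sum. The sketch below assumes the rule is "FDW length plus the longest detected seasonality"; the function name is hypothetical and the real validation may account for other factors.

```python
def required_history(fdw_length, seasonalities=()):
    """Historical time units needed: FDW length plus the longest seasonality."""
    return fdw_length + max(seasonalities, default=0)


print(required_history(14, [7]))      # FDW of [-14, 0] with 7-day seasonality
print(required_history(21, [7]))      # 21-day FDW with seven-day differencing
print(required_history(14, [7, 28]))  # only the longest seasonality counts
```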
DataRobot does not require the presence of all historical rows when computing window statistics features (for example, target (7 day mean) or feature (14 day max)). Depending on the FDW settings, DataRobot predetermines the minimum required historical rows for predictions. If there are too many missing historical rows in the prediction dataset, predictions will error.
If a multiplicative trend is detected, DataRobot requires all historical target values in the prediction dataset to be strictly positive (> 0). Zero or negative target value(s) violate the model assumption that the dataset is multiplicative and the prediction generates an error. To correct it, check whether the training dataset is representative of the use case during prediction time or disable the advanced option Treat as exponential trend and recreate the project.
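Before uploading, you can look for offending rows yourself. The sketch below is a hypothetical pre-check, not part of DataRobot: it lists historical rows whose target is zero or negative and skips forecast rows, which have no target yet.

```python
def non_positive_targets(rows):
    """Return (timestamp, value) pairs whose target is zero or negative.

    rows: list of (timestamp, target) pairs; target is None in forecast rows.
    """
    return [(ts, value) for ts, value in rows
            if value is not None and value <= 0]


rows = [
    ("2024-01-01", 12.0),
    ("2024-01-02", 0.0),   # violates the strictly positive requirement
    ("2024-01-03", 7.5),
    ("2024-01-04", None),  # forecast row, target left blank
]
print(non_positive_targets(rows))
```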
Compute and access predictions¶
When the forecast point is set and the dataset is in the correct format and successfully uploaded, it's time to compute predictions.
There are two methods for computing predictions. Click either:
- the Compute Predictions button on the Forecast Settings modal.
- the Compute Predictions link (next to the Forecast Settings link) on the Make Predictions page.
When processing completes, preview the historical data and predictions from the dataset or download a CSV of your predictions. To download, click Download to access predictions:
Notes on prediction output:
• Depending on your permissions, you may see the column, "Original Format Timestamp". This provides the same values provided by the "Timestamp" column but uses the timestamp format from the original prediction dataset. Your administrator can enable this permission for you.
• When working with downloaded predictions, be aware that in time series projects, row_id does not represent the row position from the original project data (for training predictions) or uploaded prediction data for a given timestamp and/or series_id. Instead it is a derived value specific to the project.
With some spreadsheet software you could go on to graph your prediction output. For example, the sample data shows predicted sales for the next day through the next 7 days, which can then be acted on for inventory and staffing decisions.
After you have computed predictions, click the Preview link to display a plot of the predictions over time, in the context of the historical data. This plot shows the prediction for each forecast distance at once, relative to a single forecast point.
By default, the prediction interval (shaded in blue) represents the area in which 80% of predictions fall. The intervals estimate the range of values DataRobot expects actual values of the target to fall within. They are similar to a prediction's confidence interval, but are instead based on the residual errors measured during the model's backtesting.
For charts meeting the following criteria, the chart displays an estimated prediction interval:
All backtests must be trained. In this way, DataRobot can use all available validation rows and prevent different interval values based on the available information.
There must be at least 10 data points per forecast distance value.
If the above criteria are not met, DataRobot displays only the prediction values (orange points).
You can specify a prediction interval size, which specifies the desired probability of actual values falling within the interval range. Larger values are less precise, but more conservative. For example, the default value of 80% results in a lower bound of 10% and an upper bound of 90%. To change the predictions interval, click the Options link and DataRobot recalculates the display:
You can also set the prediction interval when making predictions.
Prediction intervals are estimated based on the quantiles of the out-of-sample residuals and as a result may not be symmetrical. DataRobot calculates, independently, per series (if applicable) and per forecast distance, so intervals may increase with distance, and/or have a range specific to each series. If you predict on a new series, or a series in which there was no overlap with validation, DataRobot uses the average across all series.
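The relationship between the interval size and the residual quantiles can be sketched as follows. This is an illustration under stated assumptions rather than DataRobot's implementation: residuals are taken as actual minus predicted, quantiles use simple linear interpolation, and the function names and residual values are invented for the example.

```python
def interval_quantiles(interval_size):
    """Map an interval size in percent to (lower, upper) percentiles."""
    tail = (100 - interval_size) / 2
    return tail, 100 - tail


def percentile(sorted_values, q):
    """Linear-interpolation percentile of an already sorted list."""
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def prediction_interval(prediction, residuals, interval_size=80):
    """Shift the prediction by the lower and upper residual quantiles."""
    low_q, high_q = interval_quantiles(interval_size)
    ordered = sorted(residuals)
    return (prediction + percentile(ordered, low_q),
            prediction + percentile(ordered, high_q))


# Made-up out-of-sample residuals (actual - predicted), skewed to the right.
residuals = [-5, -4, -3, -3, -2, -2, -1, -1, -1, 0, 0,
             0, 1, 1, 2, 3, 4, 6, 9, 12, 15]
print(interval_quantiles(80))
print(prediction_interval(100, residuals))
```

Because the residuals here are skewed, the resulting interval extends further above the prediction than below it, which is why intervals built this way need not be symmetrical.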
Hover over a point in the preview graph, left of the forecast point, to display the value from the historical data:
Or to the right of the forecast point to view the forecast (prediction):
When used with multiseries modeling, you have an option to select which series to preview. This overview indicates how the target, feature, or accuracy changes over time for an individual series and provides a forecast for that series. From the dropdown, select a series. Or, page through the series options using the left and right arrows. By comparing the prediction intervals for each series, you can better identify the series that provide the most accurate predictions.
Note that you can also download predictions from within the preview plot.
The following sections provide some additional background discussion relevant to time-aware modeling:
- Understanding the time series framework
- Setting the window values
- Setting duration or row count
- Understanding the derived modeling dataset
- Allowable time intervals
- Time series feature lists
- Looking for common patterns in time series data
Set window values¶
Use the Feature Derivation Window (FDW) and Forecast Window (FW) to configure how DataRobot derives features for the modeling dataset.
On the left, the FDW (1) constrains the time history. That is, it defines how many values to look at (no further back than x, no more recent than y), which determines how much data you need to provide to make a prediction. In the example above, DataRobot will use the most recent 28 days of data.
On the right, the FW (2) sets the time range of predictions the model outputs. The example configures DataRobot to make predictions on days 1 through 7 after the forecast point. Note that the time unit displayed (days, in this case) is based on the unit detected when you selected a date/time feature.
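To see how the two windows translate into calendar dates, the sketch below applies a 28-day FDW and a 1 to 7 day FW to a chosen forecast point. The function `window_dates`, the dictionary keys, and the example forecast point are assumptions for illustration; how DataRobot treats the window boundaries (inclusive or exclusive) is not specified in this section.

```python
from datetime import date, timedelta


def window_dates(forecast_point, fdw=(-28, 0), fw=(1, 7)):
    """Translate window offsets (in days) into dates around a forecast point."""
    return {
        "derivation_start": forecast_point + timedelta(days=fdw[0]),
        "derivation_end": forecast_point + timedelta(days=fdw[1]),
        "forecast_start": forecast_point + timedelta(days=fw[0]),
        "forecast_end": forecast_point + timedelta(days=fw[1]),
    }


windows = window_dates(date(2024, 3, 29))
for name, day in windows.items():
    print(name, day)
```

Shifting either tuple, for example `fw=(2, 7)`, moves the corresponding dates without changing the other window.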
You can specify either the time unit detected or a number of rows for the windows (they are synchronized to be the same). DataRobot calculates rolling statistics using that selection (e.g., Price (7 days average) or Price (7 rows average)). Note that when you configure for row-based windows, DataRobot does not detect common event patterns or seasonalities. DataRobot provides special handling for datasets with irregularly spaced date/time features, however. If your dataset is irregular, the window settings default to row-based.
You can change these values (and notice that the visualization updates to reflect your change). For example, you may not have real-time access to the data or don't want the model to be dependent on data that is too new. In that case, change the FDW. If you don't care about tomorrow's prediction because it is too soon to take action on, change the FW to the point from which you want predictions forward. This changes how DataRobot optimizes models and ranks them on the Leaderboard, as it only compares for accuracy against the configured range.
Create non-forecasting time series models¶
There are times when you may want to create time series models that predict current values, not future values. For example, in an anomaly detection project you may want to answer the question, "is the observation I see right now an anomaly?" Or, in some situations you might want to use time series values to understand the current value of the target given the current parameters (features) and their recent values. For this type of project, use DataRobot's nowcasting capabilities.
Duration and Row Count¶
If your data is evenly spaced, Duration and Row Count give the same results. It is not uncommon, however, for date/time datasets to have unevenly spaced data with noticeable gaps along the time axis. This can impact how Duration and Row Count are handled by DataRobot. If the data has gaps:
- Row Count results in the same number of rows per backtest (although some of them may cover longer time periods). Row Count models can, in certain situations, use more RAM than Duration models over the same number of rows.
- Duration results in a consistent length-of-time per backtest (but some may have more or fewer rows).
Additionally, these values have different meanings depending on whether they are being applied to training or validation.
For irregular datasets, note that the setting for Training Window Format defaults to Row Count. Although you can change the setting to Duration, it is highly recommended that you leave it, as changing may result in unexpected training windows or model errors.
Handle training folds¶
The values for Duration and Row Count in training data are set in the training window format section of the Time Series Modeling configuration.
When you select Duration, DataRobot selects a default fold size—a particular period of time—to train models, based on the duration of your training data. For example, you can tell DataRobot "always use three months of data." With Row Count, models use a specific number of rows (e.g., always use 1000 rows) for training models. The training data will have exactly that many rows.
For example, consider a dataset that includes fraudulent and non-fraudulent transactions where the frequency of transactions is increasing over time (the number is increasing per time period). Set Row Count if you want to keep the number of training examples constant through the backtests in the training data. It may be that the first backtest is only trained on a short time period. Select Duration to keep the time period constant between backtests, regardless of the number of rows. In either case, models will not be trained into data more recent than the start of the holdout data.
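The difference between the two formats shows up clearly on data whose frequency increases over time. The sketch below uses invented transaction dates and hypothetical helper names; it only illustrates how the two selection rules pick training rows before a given cutoff, not how DataRobot builds its backtests.

```python
from datetime import date, timedelta


def train_by_row_count(timestamps, train_end, n_rows):
    """Most recent `n_rows` rows strictly before `train_end`."""
    eligible = [t for t in timestamps if t < train_end]
    return eligible[-n_rows:]


def train_by_duration(timestamps, train_end, duration):
    """All rows in the `duration` immediately before `train_end`."""
    return [t for t in timestamps if train_end - duration <= t < train_end]


base = date(2024, 1, 1)
# Transactions become more frequent as time goes on.
transactions = [base + timedelta(days=d)
                for d in (0, 30, 60, 75, 85, 90, 93, 95, 97, 99)]
early_end = base + timedelta(days=80)
late_end = base + timedelta(days=100)

for end in (early_end, late_end):
    by_rows = train_by_row_count(transactions, end, 4)
    by_time = train_by_duration(transactions, end, timedelta(days=30))
    print(end, len(by_rows), (by_rows[-1] - by_rows[0]).days, len(by_time))
```

Row Count always yields four rows, but they span 75 days for the earlier cutoff and only 6 days for the later one; Duration always covers 30 days, but yields 2 rows and then 7.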
Handle the validation fold¶
Validation is always set in terms of duration (even if training is specified in terms of rows). When you select Row Count, DataRobot sets the Validation Length based on the row count.
Time series interval units¶
Although many of the examples in this documentation show a time unit of "days," DataRobot supports several intervals for time series and multiseries modeling. Currently, DataRobot supports time steps that are integer multiples of the following units:
For example, the time step between rows can be every 15 minutes (a multiple of minutes) but cannot be a fraction such as 13.23 minutes. DataRobot automatically detects the time unit and time step, and if it cannot, rejects the dataset as irregular. Datasets using milliseconds as a time unit must specify training and partitioning boundaries at the second level, and must span multiple seconds, for partitioning to operate correctly. Additionally, fractional-second forecasts must use the default forecast point.
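The integer-multiple rule from the example can be checked with Python's `timedelta` arithmetic. The function name is hypothetical, and the sketch only tests a step against a single unit; it does not reproduce DataRobot's detection of the time unit itself.

```python
from datetime import timedelta


def is_integer_multiple(time_step, unit):
    """True when `time_step` is a positive whole-number multiple of `unit`."""
    if time_step <= timedelta(0):
        return False
    return time_step % unit == timedelta(0)


one_minute = timedelta(minutes=1)
print(is_integer_multiple(timedelta(minutes=15), one_minute))
print(is_integer_multiple(timedelta(minutes=13.23), one_minute))
```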
Common patterns of time series data¶
Time series models are built based on consideration of common patterns in time series data:
Linearity: A specific type of trend. Searching on the term "machine learning," you see an increase over time. The following shows the linear trend (you can also view it as a non-linear trend) created by the search term, showing that interest may fluctuate but is growing over time:
Seasonality: Searching on the term "Thanksgiving" shows periodicity. In other words, spikes and dips are closely related to calendar events (for example, each year starting to grow in July, falling in late November):
Cycles: Cycles are similar to seasonality, except that they do not necessarily have a fixed period and generally require a minimum of four years of data to be qualified as such. Usually related to global macroeconomic events or changes in the political landscape, cycles can be seen as a series of expansions and recessions:
Combinations: Data can combine patterns as well. Consider searching the term "gym." Search interest spikes every January with lows over the holidays. Interest, however, increases over time. In this example you can see both seasonality and a linear trend: