What is time-based modeling?
DataRobot offers two mechanisms for time-aware modeling, both of which are implemented using date/time partitioning:
- Use out-of-time validation (OTV) when your data is time-relevant but you are not forecasting (instead, you are predicting the target value on each individual row). For example: "How do I interpret this housing data?"
- Use time series (single series or multiseries) when you want to:
  - Forecast multiple future values of the target, for example: "What will sales be like next week, Monday through Friday?"
  - "Nowcast" an unknown current value of a time series, for example: "What is this month's inflation rate based on recent history?"
Contact your DataRobot representative for information on enabling time series modeling.
See below for more specific information on reasons to use time-aware modeling and how it fits in the context of supervised learning. Follow the suggested reading path to locate the documentation appropriate to your understanding and requirements.
See the file size documentation for information on file size and series limit considerations.
Why use it?
People frequently use time-aware models to predict future events while training those models on past data. A major difference between time-aware and conventional modeling is in how validation data, which is used to judge performance, is selected. For conventional modeling it is common practice to select rows from the dataset for validation without regard to their time period. Time-aware modeling modifies this practice to prevent validation scores that are overly optimistic and misleading, and that could lead to damaging conclusions and actions. Time-aware modeling does not assume that the relationship between predictors and the target is constant over time.
A simple example: Let’s say you want to forecast housing prices. You have a variety of data about each house in your dataset and plan to use that data to predict the sales price. You will build a model using some of the data and make predictions using other parts of the data. The problem is that randomly selecting rows from your dataset means you are randomly selecting across time as well, so the model can be trained on sales that happened after the sales it is validated on. In other words, the resulting model doesn't predict the future from the past. Using time-aware modeling, you can train and test models using time-based folds, which ensures that your models are always validated on house price data that is later than the training data (the purpose of your forecast). It isn’t necessary to use the most recent data to make predictions; you only need data that is more recent than the data used for model training to ensure that model predictions about the future hold up.
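The housing example can be sketched in a few lines of Python. This is an illustration only, not DataRobot's partitioning implementation: the records, the `sale_date` and `price` field names, and the `out_of_time_split` helper are all hypothetical.

```python
from datetime import date

# Hypothetical sale records (made-up values, for illustration only).
sales = [
    {"sale_date": date(2020, 1, 15), "price": 210_000},
    {"sale_date": date(2020, 6, 3), "price": 225_000},
    {"sale_date": date(2021, 2, 20), "price": 240_000},
    {"sale_date": date(2021, 9, 9), "price": 262_000},
    {"sale_date": date(2022, 4, 1), "price": 281_000},
]


def out_of_time_split(rows, date_key, cutoff):
    """Train on rows dated before the cutoff, validate on rows on or after it."""
    ordered = sorted(rows, key=lambda r: r[date_key])
    train = [r for r in ordered if r[date_key] < cutoff]
    validation = [r for r in ordered if r[date_key] >= cutoff]
    return train, validation


train, validation = out_of_time_split(sales, "sale_date", date(2021, 6, 1))

# Every validation row is later than every training row, so a validation
# score measures how well the model predicts the future from the past.
latest_train = max(r["sale_date"] for r in train)
earliest_validation = min(r["sale_date"] for r in validation)
print(latest_train < earliest_validation)
```

A random split of the same rows gives no such guarantee: a 2022 sale could land in training while a 2020 sale lands in validation.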
With time-aware modeling, you think of data in terms of time. When determining how much data you need to build an accurate model, the answer is expressed, for example, in days, in months, or as the most recent x rows: “How long of a data history will I need, and how much will my model improve with more time?” DataRobot partitions the data so that it can evaluate models with an awareness of the data’s time component, providing:
- Improved performance through better model selection
- More accurate validation scores
- Improved support for date variables as predictors
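As a rough sketch of what thinking in terms of time means for partitioning, the following Python builds a set of time-based folds from durations rather than row counts. The `backtest_folds` helper and its parameter names are assumptions made for this example and do not reflect DataRobot's configuration options.

```python
from datetime import date, timedelta


def backtest_folds(end, n_folds, validation_days, training_days):
    """Return (train_start, train_end, validation_end) windows, newest first.

    Each fold validates on the `validation_days` that immediately follow its
    training window, so validation data is always later than training data.
    """
    folds = []
    validation_end = end
    for _ in range(n_folds):
        train_end = validation_end - timedelta(days=validation_days)
        train_start = train_end - timedelta(days=training_days)
        folds.append((train_start, train_end, validation_end))
        validation_end = train_end  # slide the whole window back in time
    return folds


folds = backtest_folds(
    date(2022, 12, 31), n_folds=3, validation_days=30, training_days=365
)
for train_start, train_end, validation_end in folds:
    print(f"train [{train_start}, {train_end})  validate [{train_end}, {validation_end})")
```

Changing `training_days` and comparing validation scores across folds is one way to explore how much history a model actually needs.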
Supervised learning models
In conventional supervised learning, you work with raw training data that contains features and labels. DataRobot trains models to predict a specified target based on those features. DataRobot creates a model, tunes it, and then tests it on unseen (out-of-sample) data. That test results in a validation score, which can be considered a measure of confidence in how ready the model is for deployment. Once the model is deployed, you can score new data with it. Feed the new data into DataRobot, where the application extracts features from the data and feeds them into the model. The model then makes predictions on those features to provide information about the target.
When DataRobot trains a model, it makes some decisions based on the training data. By making assumptions about the function or the data, for example, DataRobot can estimate parameter values based on those assumptions. Different modeling approaches make different assumptions. DataRobot's large repository of available models exercises many different functions (aspects), allowing you to pick the model type that best suits the data.
Supervised learning in time-aware mode
Supervised learning assumes that training examples are independent and identically distributed (IID). That kind of modeling makes predictions based on each row of the dataset, without taking the neighboring rows into account. The assumption is that training samples are independent of each other. Another problematic assumption with supervised learning is that the data you train on and the data you see in the future will have the same distribution.
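One informal way to see whether rows are independent of their neighbors is to measure how strongly each value correlates with the one before it. The sketch below uses a made-up weekly sales series, and the `lag1_autocorrelation` helper is a hypothetical name introduced for this example; it is not part of DataRobot.

```python
def lag1_autocorrelation(values):
    """Pearson correlation between each value and the value just before it."""
    current, previous = values[1:], values[:-1]
    n = len(current)
    mean_c = sum(current) / n
    mean_p = sum(previous) / n
    cov = sum((c - mean_c) * (p - mean_p) for c, p in zip(current, previous))
    var_c = sum((c - mean_c) ** 2 for c in current)
    var_p = sum((p - mean_p) ** 2 for p in previous)
    return cov / (var_c * var_p) ** 0.5


# Synthetic weekly sales with a steady upward trend (made-up numbers).
weekly_sales = [100, 104, 103, 109, 112, 115, 114, 120, 123, 127]
print(round(lag1_autocorrelation(weekly_sales), 2))
```

A value close to 1 or -1 indicates that neighboring rows carry information about each other, which is exactly what the independence assumption ignores.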
With time-dependent data, the traditional machine-learning assumptions do not hold. Consider Google search trends for the term "DataRobot" in the period of July through November, 2017. Within that window, the search interest is fairly uniform.
If you check the same search trend across the life of DataRobot, you will notice that the time series behaves very differently toward the more recent dates. If you train a model on the earlier data, say 2013-2016, the model will be ineffective on recent data because that data does not follow the same distribution.
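The effect of a changing distribution can be illustrated with a deliberately naive model. The numbers below are invented for this sketch and are not real Google Trends data; the "model" simply predicts the mean of its training period.

```python
from statistics import mean

# Made-up yearly "search interest" values, NOT real Google Trends data.
interest_by_year = {2013: 2, 2014: 3, 2015: 5, 2016: 8, 2017: 40, 2018: 65}

early = [v for year, v in interest_by_year.items() if year <= 2016]
recent = [v for year, v in interest_by_year.items() if year > 2016]

# A deliberately naive "model": always predict the training-period mean.
prediction = mean(early)


def mean_absolute_error(actuals, predicted):
    """Average absolute gap between the actual values and one fixed prediction."""
    return mean(abs(a - predicted) for a in actuals)


print("error on training period:", mean_absolute_error(early, prediction))
print("error on recent period:  ", mean_absolute_error(recent, prediction))
```

The error on the recent period is far larger than on the training period, even though nothing about the model changed; only the distribution of the data did.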
Suggested reading path
To help locate the documentation appropriate to your understanding and requirements, the following table describes the pages in the time-aware modeling workflow:
| Topic | Description | Applies to |
|---|---|---|
| Why use time-aware modeling? (this page) | A simple description of time-aware modeling application and advantages. | OTV and time series |
| Workflow overview | General steps of the time-aware modeling workflow. | OTV and time series |
| Date/time partitioning | A detailed description of the partitioning method used to implement both OTV and time series. Describes the workflow in full detail as well as background information on components of the method. | OTV and time series |
| Out-of-time validation (OTV) | Implemented in full by date/time partitioning. | OTV |
| Multistep OTV | Implemented in full by date/time partitioning. | OTV |
| Time series modeling | A detailed description of the time series workflow, as well as background material and important differences from conventional supervised learning. | Time series |
| Multiseries modeling | A section specific to the multiseries modeling workflow where it differs from the general time series modeling description (above). Multiseries is applicable if the data contains separate groups of rows that represent the time sequence of different objects (e.g., multiple stores). | Time series |
| Feature engineering reference | A detailed reference, with examples, of the feature derivation process. | Time series |
| Glossary | Quick definitions of terminology referred to in these time-aware modeling pages. | OTV and time series |