What is time-aware modeling?¶
Contact your DataRobot representative for information on enabling time series modeling.
DataRobot offers two mechanisms for time-aware modeling—time series and OTV—both of which are implemented using date/time partitioning:
Use the following types of time series modeling when you want to:
- Time series: Forecast multiple future values of the target—"What will sales be like next week, Monday through Friday?"
- Multiseries: Model datasets that contain multiple time series based on a common set of input features.
- Segmented: Group your series into segments to improve demand forecasting—"What will sales of avocados look like in the northeast in January?"
- Nowcast: Estimate an unknown current value of a time series—"What is this month's inflation rate based on recent history?"
Use out-of-time validation (OTV) when your data is time-relevant but you are not forecasting (instead, you are predicting the target value on each individual row). "How do I interpret this housing data?" This type of time-aware modeling is described in the OTV specialized workflow section.
See below for more specific information on reasons to use time-aware modeling and how to put it in context with supervised learning. Follow the suggested reading path to help locate the documentation appropriate to your understanding and requirements.
Why use it?¶
People frequently use time-aware models to predict future events while training those models on past data. A major difference between time-aware and conventional modeling is in how validation data—used to judge performance—is selected. In conventional modeling, it is common practice to select validation rows from the dataset without regard to their time period. Time-aware modeling modifies this practice to prevent validation scores that are overly optimistic and misleading (and that can lead to damaging conclusions and actions). Unlike conventional modeling, time-aware modeling does not assume that the relationship between predictors and the target is constant over time.
A simple example: Let’s say you want to forecast housing prices. You have a variety of data about each house in your dataset and plan to use that data to predict the sales price. You will build a model using some of the data and make predictions using other parts of the data. The problem is that randomly selecting rows for validation also selects randomly across time. In other words, the resulting model doesn't predict the future from the past. Using time-aware modeling, you can train and test models using time-based folds, which ensures that your models are always validated on future house price data (the purpose of your forecast). To ensure that model predictions about the future hold up, it isn't necessary to validate on the most recent data—only on data that is more recent than the data used for model training.
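The time-based folds described above can be sketched with a minimal illustration. Everything here is hypothetical example code (the rows, column names, and cutoff date are invented), not DataRobot's implementation:

```python
from datetime import date

# Hypothetical housing sales, each row tagged with its sale date.
rows = [
    {"sale_date": date(2015, 3, 1),  "sqft": 1200, "price": 250_000},
    {"sale_date": date(2016, 7, 15), "sqft": 1500, "price": 310_000},
    {"sale_date": date(2017, 1, 10), "sqft": 900,  "price": 190_000},
    {"sale_date": date(2018, 5, 20), "sqft": 2000, "price": 420_000},
    {"sale_date": date(2019, 9, 3),  "sqft": 1100, "price": 260_000},
]

def time_based_split(rows, cutoff):
    """Train on rows strictly before the cutoff; validate on rows at or after it.

    Unlike a random split, every validation row is more recent than every
    training row, so validation always simulates predicting the future.
    """
    train = [r for r in rows if r["sale_date"] < cutoff]
    valid = [r for r in rows if r["sale_date"] >= cutoff]
    return train, valid

train, valid = time_based_split(rows, date(2018, 1, 1))
print(len(train), len(valid))  # 3 training rows before 2018, 2 validation rows after
```

The key property is in the split itself: no validation row predates any training row, so the validation score reflects the model's ability to predict forward in time.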
With time-aware modeling, you think of data in terms of time. When determining how much data you need to build an accurate model, the answer, for example, is in days or months or most recent x number of rows. “How long of a data history will I need and how much will my model improve with more time?” DataRobot partitions the data so that it can evaluate models with an awareness of the data’s time component, providing:
- Improved performance through better model selection
- More accurate validation scores
- Improved support for date variables as predictors
Time series overview¶
When working with time series data, ask yourself: How far into the past do I want to look and how far into the future do I want to predict? Once you answer those questions, you can configure DataRobot so that your time-sensitive data uses advanced DataRobot modeling techniques to create forecasts. (See also the section on why to use time series modeling.)
DataRobot automatically creates and selects time series features in the modeling data. You can constrain the features (for example, minimum and maximum lags, etc.) by configuring the time series framework on the Start screen. Based on your settings and the analysis of the raw dataset, DataRobot derives new features and creates a modeling dataset. Because time shifts, lags, and features have already been applied, DataRobot can use general machine learning algorithms to build models with the new modeling dataset.
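This kind of feature derivation can be illustrated outside DataRobot with a small pandas sketch. The series, lag choices, and column names below are hypothetical; DataRobot's actual derivation is automatic and far more extensive:

```python
import pandas as pd

# Hypothetical daily sales series.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=8, freq="D"),
    "sales": [10, 12, 11, 15, 14, 16, 18, 17],
})

# Derive lagged and rolling features so a general-purpose (row-wise)
# learner can be trained on the resulting modeling dataset.
df["sales_lag_1"] = df["sales"].shift(1)            # yesterday's sales
df["sales_lag_2"] = df["sales"].shift(2)            # sales two days ago
df["sales_roll_mean_3"] = (
    df["sales"].shift(1).rolling(window=3).mean()   # 3-day trailing mean, excluding today
)

# Rows whose lags reach back before the start of the history are dropped.
modeling_df = df.dropna().reset_index(drop=True)
print(modeling_df)
```

Once the shifts and rolling windows are baked into columns, each row carries its own history, which is what lets general machine learning algorithms treat the problem as ordinary row-wise supervised learning.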
Supervised learning models¶
In conventional supervised learning, you work with raw training data—rows of features, each labeled with the target. DataRobot trains models to predict the specified target based on those features. DataRobot creates a model, tunes it, and then tests it on unseen (out-of-sample) data. That test produces a validation score, which can be considered a measure of confidence in how ready the model is for deployment. Once deployed, you can score new data with the model. Feed the new data into DataRobot, where the application extracts features from the data and feeds them into the model. The model then makes predictions from those features to provide information about the target.
When DataRobot trains a model, it makes some decisions based on the training data. For example, by making assumptions about the underlying function or the data, it can estimate parameter values. Different modeling approaches make different assumptions. DataRobot's large repository of available models exercises many different functions, allowing you to pick the model type that best suits the data.
Supervised learning in time-aware mode¶
Supervised learning assumes that training examples are independent and identically distributed (IID). That kind of modeling makes predictions based on each row of the dataset without taking neighboring rows into account; the assumption is that training samples are independent of each other. Another problematic assumption of supervised learning is that the data you train on and the future data you predict on will have the same distribution.
With time-dependent data, the traditional machine-learning assumptions don't hold. Consider Google search trends for the term "DataRobot" from July through November 2017. The search interest is fairly uniform:
If you check the same search trend across the life of DataRobot, you will notice that the time series behaves very differently toward the more recent dates. If you trained a model on the earlier data, say 2013–2016, the model would be ineffective on recent data because the data no longer follows the same distribution.
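A toy experiment makes this concrete. The data below is entirely synthetic (a flat period followed by a steep trend, loosely echoing the search-interest example), and the "model" is just a mean predictor: on trending data, a shuffled split produces a flattering validation error, while a time-ordered split reveals how the model actually fares on future data.

```python
import random

random.seed(0)

# Synthetic "search interest": flat for a long stretch, then a steep upward trend.
series = [10 + random.gauss(0, 1) for _ in range(80)]           # early, uniform period
series += [10 + 2 * t + random.gauss(0, 1) for t in range(20)]  # recent, trending period

def mean_model_mae(train, valid):
    """Predict the training mean for every validation point; return the MAE."""
    pred = sum(train) / len(train)
    return sum(abs(v - pred) for v in valid) / len(valid)

# Random split: validation rows are drawn from all time periods,
# so they leak information about the trend into validation.
shuffled = series[:]
random.shuffle(shuffled)
random_mae = mean_model_mae(shuffled[:80], shuffled[80:])

# Time-based split: validate only on the most recent 20 points.
time_mae = mean_model_mae(series[:80], series[80:])

print(f"random-split MAE: {random_mae:.1f}")
print(f"time-split MAE:   {time_mae:.1f}")
# The time-based score is substantially worse, exposing the
# distribution shift that the random split hides.
```

The random split looks reassuring precisely because its validation set is drawn from the same mixed distribution as its training set; only the time-based split measures what the model is actually asked to do in production.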