Forecast sales with multiseries modeling¶
The use case in this notebook forecasts future sales for multiple stores using multiseries modeling. Multiseries modeling allows you to model datasets that contain multiple time series based on a common set of input features: in other words, a dataset that can be thought of as multiple individual time-series datasets stacked together, with one column of labels indicating which series each row belongs to. This column is known as the series ID column.
Multiseries modeling is useful for large chain businesses that want forecasts so they can order the right amount of inventory and staff each store for the predicted volume. In this example, an analyst managing the stores uses DataRobot to build time series models that predict daily sales.
Import libraries¶
import os
from datetime import date, datetime

import datarobot as dr
import pandas as pd
Connect to DataRobot¶
Read more about the different options for connecting to DataRobot from the Python client.
# If the config file is not in the default location described in the API Quickstart guide, '~/.config/datarobot/drconfig.yaml', then you will need to call
# dr.Client(config_path='path-to-drconfig.yaml')
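A minimal connection sketch, in case you prefer to pass credentials directly rather than use a config file (the endpoint and token below are placeholders; substitute your own):
# Connect with an explicit endpoint and API token instead of drconfig.yaml:
# dr.Client(
#     endpoint='https://app.datarobot.com/api/v2',  # placeholder endpoint
#     token='YOUR_API_TOKEN',                       # placeholder token
# )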
data_path = "https://docs.datarobot.com/en/docs/api/guide/common-case/DR_Demo_Sales_Multiseries_training.csv"
df = pd.read_csv(
    data_path,
    parse_dates=['Date'],  # parse the Date column as datetimes
)
df.head(5)
|   | Store | Date | Sales | Store_Size | Num_Employees | Num_Customers | Returns_Pct | Pct_On_Sale | Pct_Promotional | Marketing | TouristEvent | Econ_ChangeGDP | EconJobsChange | AnnualizedCPI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Louisville | 2012-07-01 | 109673 | 20100 | 68 | 531 | 1.03 | 9.96 | 0.000047 | July In Store Credit Card Signup Discount; In ... | No | 0.5 | NaN | 0.02 |
| 1 | Louisville | 2012-07-02 | 131791 | 20100 | 34 | 476 | 0.41 | 8.65 | 0.000047 | July In Store Credit Card Signup Discount; In ... | No | NaN | NaN | NaN |
| 2 | Louisville | 2012-07-03 | 134711 | 20100 | 42 | 578 | 0.31 | 8.96 | 0.000047 | July In Store Credit Card Signup Discount; In ... | No | NaN | NaN | NaN |
| 3 | Louisville | 2012-07-04 | 97640 | 20100 | 54 | 569 | 0.83 | 10.08 | 0.000047 | July In Store Credit Card Signup Discount; In ... | No | NaN | NaN | NaN |
| 4 | Louisville | 2012-07-05 | 129538 | 20100 | 62 | 486 | 0.51 | 9.80 | 0.000047 | July In Store Credit Card Signup Discount; ID5... | No | NaN | NaN | NaN |
Plot the sales of each store¶
df.pivot(index='Date', columns='Store', values='Sales').plot(figsize=(18, 8));
# Optional holdout settings (None accepts DataRobot's defaults)
HOLDOUT_START_DATE = None  # e.g., pd.to_datetime('2014-04-12')
HOLDOUT_DURATION = None  # e.g., dr.helpers.partitioning_methods.construct_duration_string(years=0, months=0, days=64)
# Known In Advance columns
KIA_VARS = ['Store_Size', 'Marketing', 'TouristEvent']
FEATURE_SETTINGS = []
for column in KIA_VARS:
FEATURE_SETTINGS.append(dr.FeatureSettings(column, known_in_advance=True, do_not_derive=False))
# Create a calendar from a dataset
data_path = 'https://docs.datarobot.com/en/docs/api/guide/common-case/Calendar.csv'
dataset = dr.Dataset.create_from_url(data_path)
CAL = dr.CalendarFile.create_calendar_from_dataset(
dataset.id
)
CAL_ID = CAL.id
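For reference, a DataRobot calendar file is a CSV with a date column and, optionally, an event name for each row. The sketch below illustrates the general layout; the rows are hypothetical, not the contents of the actual Calendar.csv:
# Illustrative calendar layout (hypothetical rows):
# Date,Name
# 2012-07-04,Independence Day
# 2012-11-22,Thanksgiving
# 2012-12-25,Christmas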
print(FEATURE_SETTINGS)
[FeatureSettings(feature_name='Store_Size', known_in_advance=True, do_not_derive=False), FeatureSettings(feature_name='Marketing', known_in_advance=True, do_not_derive=False), FeatureSettings(feature_name='TouristEvent', known_in_advance=True, do_not_derive=False)]
Configure date/time partitioning¶
The snippet below outlines how to configure date/time partitioning for modeling. Specify the datetime partition column and the series ID column, along with the calendar ID and any feature settings (e.g., Known in Advance features).
time_partition = dr.DatetimePartitioningSpecification(
    use_time_series = True,
    datetime_partition_column = 'Date',  # primary date column
    multiseries_id_columns = ['Store'],  # series ID column
    feature_settings = FEATURE_SETTINGS,  # Known in Advance features
    calendar_id = CAL_ID,  # event calendar created above
    holdout_start_date = HOLDOUT_START_DATE,  # None accepts DataRobot's defaults
    holdout_duration = HOLDOUT_DURATION
)
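If you need windows other than DataRobot's defaults, the same specification also accepts the feature derivation window and forecast window as offsets from the forecast point. A sketch with illustrative values, replacing the specification above (derive features from the prior 35 days, forecast 1-14 days ahead):
# Hypothetical custom windows; the offsets are illustrative
time_partition = dr.DatetimePartitioningSpecification(
    use_time_series = True,
    datetime_partition_column = 'Date',
    multiseries_id_columns = ['Store'],
    feature_settings = FEATURE_SETTINGS,
    calendar_id = CAL_ID,
    feature_derivation_window_start = -35,  # look back 35 days
    feature_derivation_window_end = 0,
    forecast_window_start = 1,  # forecast 1 to 14 days ahead
    forecast_window_end = 14
)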
Create a project¶
Use the snippet below to create a project, upload the dataset, and set the project name.
project = dr.Project.create(
project_name = 'Sales_Forecast',
sourcedata = df,
dataset_filename = 'Sales_Multiseries_training.csv'
)
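Project creation waits for the upload and initial EDA to finish. Once it returns, you can print a link to the project in the DataRobot UI (optional):
# Optional: link to the project's Leaderboard in the DataRobot UI
print(project.get_leaderboard_ui_permalink())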
Initiate modeling¶
This snippet starts Autopilot in Quick mode for the project, applying the partitioning specification configured in the previous steps.
project.set_target(
target = 'Sales',
    mode = dr.AUTOPILOT_MODE.QUICK,  # To use regular Autopilot, replace with dr.AUTOPILOT_MODE.FULL_AUTO
partitioning_method = time_partition,
worker_count = -1 # Use all available workers
)
Project(Sales_Forecast)
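Autopilot runs asynchronously, so block until it finishes before pulling Leaderboard results (a standard client call, shown here as an optional step):
# Block until Autopilot has finished training models
project.wait_for_autopilot()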
Retrieve results from the Leaderboard¶
When Autopilot completes, pull the results from the Leaderboard. The code below collects each model's type, feature list, and accuracy scores, and adds them to a pandas dataframe named scores.
lb = project.get_datetime_models()
# Keep only models with an all-backtests score, sorted best (lowest error) first
best_models = sorted(
    [model for model in lb if model.metrics[project.metric]['backtesting']],
    key=lambda m: m.metrics[project.metric]['backtesting'],
)
rows = []
for m in best_models:
    rows.append(
        {
            'Project_Name': project.project_name,
            'Project_ID': project.id,
            'Model_ID': m.id,
            'Model_Type': m.model_type,
            'Featurelist': m.featurelist_name,
            'Optimization_Metric': project.metric,
            'Scores': m.metrics,
        }
    )
# DataFrame.append was removed in pandas 2.0; build the frame in one step instead
scores = pd.DataFrame(rows)
scores = scores.join(pd.json_normalize(scores["Scores"].tolist())).drop(labels=['Scores'], axis=1)
# Drop empty cross-validation columns (time series projects use backtests instead)
scores = scores[scores.columns.drop(list(scores.filter(regex='crossValidation$')))]
# Rename columns (regex=False keeps the '.' literal)
scores.columns = scores.columns.str.replace(".backtesting", "_All_BT", regex=False)
scores.columns = scores.columns.str.replace(".holdout", "_Holdout", regex=False)
scores.columns = scores.columns.str.replace(".validation", "_BT_1", regex=False)
scores.columns = scores.columns.str.replace(' ', '_')
# Drop the per-backtest score lists, keeping only the aggregate values
scores = scores[scores.columns.drop(list(scores.filter(regex='_All_BTScores$')))]
# Filter accuracy metrics
METRICS = scores.filter(regex='MASE|RMSE').columns.to_list()
PROJECT = ['Project_Name', 'Project_ID', 'Model_ID', 'Model_Type', 'Featurelist']
COLS = PROJECT + METRICS
scores = scores[COLS]
scores
Get the top-performing model¶
# Keep models with an all-backtests RMSE, then take the lowest (best) one
hrmse = scores.loc[scores['RMSE_All_BT'].notnull()]
best_model = pd.DataFrame(hrmse.loc[hrmse.RMSE_All_BT.idxmin()]).transpose()
best_model
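Alternatively, DataRobot can surface its own recommended model, so you don't have to rank the Leaderboard yourself (optional; a standard client call):
# Optional: retrieve DataRobot's recommended model for this project
recommendation = dr.ModelRecommendation.get(project.id)
print(recommendation.get_model().model_type)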
Test predictions¶
After determining the top-performing model from the Leaderboard (stored in the best_model dataframe), upload the prediction test dataset to verify that the model generates predictions successfully before deploying it to a production environment. After the model generates predictions, download the results as a pandas dataframe.
%%time
PID = best_model['Project_ID'].values[0]
MID = best_model['Model_ID'].values[0]
project = dr.Project.get(PID)
model = dr.Model.get(PID, MID)
# Upload the prediction dataset and request predictions from the selected model
dataset = project.upload_dataset('DR_Demo_Sales_Multiseries_prediction.csv')
pred_job = model.request_predictions(dataset_id = dataset.id)
preds = pred_job.get_result_when_complete()
preds.head(5)
CPU times: user 138 ms, sys: 18.4 ms, total: 156 ms
Wall time: 1min 26s
|   | series_id | forecast_point | row_id | timestamp | forecast_distance | prediction |
|---|---|---|---|---|---|---|
| 0 | Louisville | 2014-06-14T00:00:00.000000Z | 50 | 2014-06-15T00:00:00.000000Z | 1 | 139721.407523 |
| 1 | Louisville | 2014-06-14T00:00:00.000000Z | 51 | 2014-06-16T00:00:00.000000Z | 2 | 129506.536886 |
| 2 | Louisville | 2014-06-14T00:00:00.000000Z | 52 | 2014-06-17T00:00:00.000000Z | 3 | 130143.459674 |
| 3 | Louisville | 2014-06-14T00:00:00.000000Z | 53 | 2014-06-18T00:00:00.000000Z | 4 | 132023.333215 |
| 4 | Louisville | 2014-06-14T00:00:00.000000Z | 54 | 2014-06-19T00:00:00.000000Z | 5 | 131219.489447 |
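To eyeball the forecasts, you can pivot the predictions by series and plot them, mirroring the earlier sales plot (a sketch based on the prediction columns shown above):
# Plot each store's forecast across the forecast window
preds['timestamp'] = pd.to_datetime(preds['timestamp'])
preds.pivot(index='timestamp', columns='series_id', values='prediction').plot(figsize=(18, 8));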
Deploy the model¶
Deploy the model to a prediction server to generate predictions in a production environment. A dedicated prediction server is used only to serve predictions.
now = datetime.now()
current_time = now.strftime("%H:%M:%S")
# DataRobot project and model selected above
project = dr.Project.get(project_id=PID)
model_id = MID
# Deploy the model to the first available prediction server
prediction_server = dr.PredictionServer.list()[0]
DATAROBOT_KEY = prediction_server.datarobot_key
PREDICTIONSENDPOINT = prediction_server.url
PREDICTIONSHEADERS = {'Content-Type': 'application/json', 'datarobot-key': DATAROBOT_KEY}
deployment = dr.Deployment.create_from_learning_model(
model_id = MID,
label = 'Store Sales - Deployment ' + str(date.today()) + " " + str(current_time),
description = 'Store Sales - Example Deployment ' + str(date.today()),
default_prediction_server_id = prediction_server.id)
DEPLOYMENT_ID = deployment.id
# Write the deployment ID to a text file in the current working directory
# to reference later if needed
with open('./deployment_id.txt', 'w') as f:
    f.write(str(deployment.id))
print('Deployment Name is: ', deployment.label)
Deployment Name is: Store Sales - Deployment 2021-06-10 13:46:18
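Optionally, enable drift tracking on the new deployment so DataRobot monitors prediction data over time (standard Deployment settings; adjust to your monitoring needs):
# Optional: turn on target and feature drift tracking for the deployment
deployment.update_drift_tracking_settings(
    target_drift_enabled=True,
    feature_drift_enabled=True,
)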
Configure batch predictions¶
Once the model is successfully deployed, access your deployment from the DataRobot application to make predictions. To do so, navigate to the Deployments page, select the new deployment, and go to the Predictions > Prediction API tab. Select "Batch" in the Prediction Type field and "API Client" in the Interface field.
Once your predictions are configured, copy the generated script, paste it into a .py file, and save it as datarobot-predict.py.
from IPython import display
display.Image("./pred-script.png")
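As an alternative to the generated script, the Python client can submit batch prediction jobs directly. A minimal sketch, assuming you score the same local prediction file used above (file names and paths are placeholders; adjust to your environment):
# Score a local file against the deployment and write the results to disk
job = dr.BatchPredictionJob.score(
    deployment=DEPLOYMENT_ID,
    intake_settings={
        'type': 'localFile',
        'file': 'DR_Demo_Sales_Multiseries_prediction.csv',
    },
    output_settings={
        'type': 'localFile',
        'path': './predictions.csv',
    },
    # Forecast from the latest valid forecast point in the dataset
    timeseries_settings={'type': 'forecast'},
)
job.wait_for_completion()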