Forecast sales with multiseries modeling¶
The use case in this notebook forecasts future sales for multiple stores using multiseries modeling. Multiseries modeling allows you to model datasets that contain multiple time series based on a common set of input features: in other words, a dataset that can be thought of as multiple individual time-series datasets stacked together, with one column of labels indicating which series each row belongs to. This column is known as the series ID column.
Multiseries modeling is useful for large chain businesses that want forecasts so they can order the right amount of inventory and staff each store for the predicted volume. In this example, an analyst managing the stores uses DataRobot to build time series models that predict daily sales.
Import libraries¶
import os
from datetime import date, datetime

import datarobot as dr
import pandas as pd
Connect to DataRobot¶
Read more about the different options for connecting to DataRobot from the Python client.
# If the config file is not in the default location described in the API Quickstart guide, '~/.config/datarobot/drconfig.yaml', then you will need to call
# dr.Client(config_path='path-to-drconfig.yaml')
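A minimal connection sketch, in case you prefer to pass credentials directly rather than use a config file (the endpoint and token below are placeholders; substitute your own):
# Connect with an explicit endpoint and API token instead of drconfig.yaml:
# dr.Client(
#     endpoint='https://app.datarobot.com/api/v2',  # placeholder endpoint
#     token='YOUR_API_TOKEN',                       # placeholder token
# )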
data_path = "https://docs.datarobot.com/en/docs/api/guide/common-case/DR_Demo_Sales_Multiseries_training.csv"
df = pd.read_csv(
    data_path,
    parse_dates=['Date'],  # parse the Date column as datetimes
)
df.head(5)
|   | Store | Date | Sales | Store_Size | Num_Employees | Num_Customers | Returns_Pct | Pct_On_Sale | Pct_Promotional | Marketing | TouristEvent | Econ_ChangeGDP | EconJobsChange | AnnualizedCPI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Louisville | 2012-07-01 | 109673 | 20100 | 68 | 531 | 1.03 | 9.96 | 0.000047 | July In Store Credit Card Signup Discount; In ... | No | 0.5 | NaN | 0.02 |
| 1 | Louisville | 2012-07-02 | 131791 | 20100 | 34 | 476 | 0.41 | 8.65 | 0.000047 | July In Store Credit Card Signup Discount; In ... | No | NaN | NaN | NaN |
| 2 | Louisville | 2012-07-03 | 134711 | 20100 | 42 | 578 | 0.31 | 8.96 | 0.000047 | July In Store Credit Card Signup Discount; In ... | No | NaN | NaN | NaN |
| 3 | Louisville | 2012-07-04 | 97640 | 20100 | 54 | 569 | 0.83 | 10.08 | 0.000047 | July In Store Credit Card Signup Discount; In ... | No | NaN | NaN | NaN |
| 4 | Louisville | 2012-07-05 | 129538 | 20100 | 62 | 486 | 0.51 | 9.80 | 0.000047 | July In Store Credit Card Signup Discount; ID5... | No | NaN | NaN | NaN |
Plot the sales of each store¶
df.pivot(index='Date', columns='Store', values='Sales').plot(figsize=(18, 8));
# Optional holdout settings (None accepts DataRobot's defaults)
HOLDOUT_START_DATE = None  # e.g., pd.to_datetime('2014-04-12')
HOLDOUT_DURATION = None  # e.g., dr.helpers.partitioning_methods.construct_duration_string(years=0, months=0, days=64)
# Known In Advance columns
KIA_VARS = ['Store_Size', 'Marketing', 'TouristEvent']
FEATURE_SETTINGS = []
for column in KIA_VARS:
FEATURE_SETTINGS.append(dr.FeatureSettings(column, known_in_advance=True, do_not_derive=False))
# Create a calendar from a dataset
data_path = 'https://docs.datarobot.com/en/docs/api/guide/common-case/Calendar.csv'
dataset = dr.Dataset.create_from_url(data_path)
CAL = dr.CalendarFile.create_calendar_from_dataset(
dataset.id
)
CAL_ID = CAL.id
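For reference, a DataRobot calendar file is a CSV with a date column and, optionally, an event name for each row. The sketch below illustrates the general layout; the rows are hypothetical, not the contents of the actual Calendar.csv:
# Illustrative calendar layout (hypothetical rows):
# Date,Name
# 2012-07-04,Independence Day
# 2012-11-22,Thanksgiving
# 2012-12-25,Christmas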
print(FEATURE_SETTINGS)
[FeatureSettings(feature_name='Store_Size', known_in_advance=True, do_not_derive=False), FeatureSettings(feature_name='Marketing', known_in_advance=True, do_not_derive=False), FeatureSettings(feature_name='TouristEvent', known_in_advance=True, do_not_derive=False)]
Configure date/time partitioning¶
The snippet below outlines how to configure date/time partitioning for modeling. Specify the datetime partition column and the series ID column, along with the calendar ID and any feature settings (e.g., Known in Advance features).
time_partition = dr.DatetimePartitioningSpecification(
    use_time_series = True,
    datetime_partition_column = 'Date',  # primary date column
    multiseries_id_columns = ['Store'],  # series ID column
    feature_settings = FEATURE_SETTINGS,  # Known in Advance features
    calendar_id = CAL_ID,  # event calendar created above
    holdout_start_date = HOLDOUT_START_DATE,  # None accepts DataRobot's defaults
    holdout_duration = HOLDOUT_DURATION
)
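If you need windows other than DataRobot's defaults, the same specification also accepts the feature derivation window and forecast window as offsets from the forecast point. A sketch with illustrative values, replacing the specification above (derive features from the prior 35 days, forecast 1-14 days ahead):
# Hypothetical custom windows; the offsets are illustrative
time_partition = dr.DatetimePartitioningSpecification(
    use_time_series = True,
    datetime_partition_column = 'Date',
    multiseries_id_columns = ['Store'],
    feature_settings = FEATURE_SETTINGS,
    calendar_id = CAL_ID,
    feature_derivation_window_start = -35,  # look back 35 days
    feature_derivation_window_end = 0,
    forecast_window_start = 1,  # forecast 1 to 14 days ahead
    forecast_window_end = 14
)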
Create a project¶
Use the snippet below to create a project, upload the dataset, and set the project name.
project = dr.Project.create(
project_name = 'Sales_Forecast',
sourcedata = df,
dataset_filename = 'Sales_Multiseries_training.csv'
)
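Project creation waits for the upload and initial EDA to finish. Once it returns, you can print a link to the project in the DataRobot UI (optional):
# Optional: link to the project's Leaderboard in the DataRobot UI
print(project.get_leaderboard_ui_permalink())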
Initiate modeling¶
This snippet starts Autopilot in Quick mode for the project, applying the partitioning specification configured in the previous steps.
project.set_target(
target = 'Sales',
    mode = dr.AUTOPILOT_MODE.QUICK,  # To use regular Autopilot, replace with dr.AUTOPILOT_MODE.FULL_AUTO
partitioning_method = time_partition,
worker_count = -1 # Use all available workers
)
Project(Sales_Forecast)
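Autopilot runs asynchronously, so block until it finishes before pulling Leaderboard results (a standard client call, shown here as an optional step):
# Block until Autopilot has finished training models
project.wait_for_autopilot()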
Retrieve results from the Leaderboard¶
When Autopilot completes, pull the results from the Leaderboard. The code below collects each model's type, feature list, and accuracy scores, and adds them to a pandas dataframe named scores.
lb = project.get_datetime_models()
# Keep only models with an all-backtests score, sorted best (lowest error) first
best_models = sorted(
    [model for model in lb if model.metrics[project.metric]['backtesting']],
    key=lambda m: m.metrics[project.metric]['backtesting'],
)
rows = []
for m in best_models:
    rows.append(
        {
            'Project_Name': project.project_name,
            'Project_ID': project.id,
            'Model_ID': m.id,
            'Model_Type': m.model_type,
            'Featurelist': m.featurelist_name,
            'Optimization_Metric': project.metric,
            'Scores': m.metrics,
        }
    )
# DataFrame.append was removed in pandas 2.0; build the frame in one step instead
scores = pd.DataFrame(rows)
scores = scores.join(pd.json_normalize(scores["Scores"].tolist())).drop(labels=['Scores'], axis=1)
# Drop empty cross-validation columns (time series projects use backtests instead)
scores = scores[scores.columns.drop(list(scores.filter(regex='crossValidation$')))]
# Rename columns (regex=False keeps the '.' literal)
scores.columns = scores.columns.str.replace(".backtesting", "_All_BT", regex=False)
scores.columns = scores.columns.str.replace(".holdout", "_Holdout", regex=False)
scores.columns = scores.columns.str.replace(".validation", "_BT_1", regex=False)
scores.columns = scores.columns.str.replace(' ', '_')
# Drop the per-backtest score lists, keeping only the aggregate values
scores = scores[scores.columns.drop(list(scores.filter(regex='_All_BTScores$')))]
# Filter accuracy metrics
METRICS = scores.filter(regex='MASE|RMSE').columns.to_list()
PROJECT = ['Project_Name', 'Project_ID', 'Model_ID', 'Model_Type', 'Featurelist']
COLS = PROJECT + METRICS
scores = scores[COLS]
scores
Get the top-performing model¶
# Keep models with an all-backtests RMSE, then take the lowest (best) one
hrmse = scores.loc[scores['RMSE_All_BT'].notnull()]
best_model = pd.DataFrame(hrmse.loc[hrmse.RMSE_All_BT.idxmin()]).transpose()
best_model
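Alternatively, DataRobot can surface its own recommended model, so you don't have to rank the Leaderboard yourself (optional; a standard client call):
# Optional: retrieve DataRobot's recommended model for this project
recommendation = dr.ModelRecommendation.get(project.id)
print(recommendation.get_model().model_type)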
Test predictions¶
After determining the top-performing model from the Leaderboard (stored in the best_model dataframe), upload the prediction test dataset to verify that the model generates predictions successfully before deploying it to a production environment. After the model generates predictions, download the results as a pandas dataframe.
%%time
PID = best_model['Project_ID'].values[0]
MID = best_model['Model_ID'].values[0]
project = dr.Project.get(PID)
model = dr.Model.get(PID, MID)
# Upload the prediction dataset and request predictions from the selected model
dataset = project.upload_dataset('DR_Demo_Sales_Multiseries_prediction.csv')
pred_job = model.request_predictions(dataset_id = dataset.id)
preds = pred_job.get_result_when_complete()
preds.head(5)
CPU times: user 138 ms, sys: 18.4 ms, total: 156 ms
Wall time: 1min 26s
|   | series_id | forecast_point | row_id | timestamp | forecast_distance | prediction |
|---|---|---|---|---|---|---|
| 0 | Louisville | 2014-06-14T00:00:00.000000Z | 50 | 2014-06-15T00:00:00.000000Z | 1 | 139721.407523 |
| 1 | Louisville | 2014-06-14T00:00:00.000000Z | 51 | 2014-06-16T00:00:00.000000Z | 2 | 129506.536886 |
| 2 | Louisville | 2014-06-14T00:00:00.000000Z | 52 | 2014-06-17T00:00:00.000000Z | 3 | 130143.459674 |
| 3 | Louisville | 2014-06-14T00:00:00.000000Z | 53 | 2014-06-18T00:00:00.000000Z | 4 | 132023.333215 |
| 4 | Louisville | 2014-06-14T00:00:00.000000Z | 54 | 2014-06-19T00:00:00.000000Z | 5 | 131219.489447 |
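To eyeball the forecasts, you can pivot the predictions by series and plot them, mirroring the earlier sales plot (a sketch based on the prediction columns shown above):
# Plot each store's forecast across the forecast window
preds['timestamp'] = pd.to_datetime(preds['timestamp'])
preds.pivot(index='timestamp', columns='series_id', values='prediction').plot(figsize=(18, 8));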
Deploy the model¶
Deploy the model to a prediction server to generate predictions in a production environment. A dedicated prediction server is used only to serve predictions.
now = datetime.now()
current_time = now.strftime("%H:%M:%S")
# DataRobot project and model selected above
project = dr.Project.get(project_id=PID)
model_id = MID
# Deploy the model to the first available prediction server
prediction_server = dr.PredictionServer.list()[0]
DATAROBOT_KEY = prediction_server.datarobot_key
PREDICTIONSENDPOINT = prediction_server.url
PREDICTIONSHEADERS = {'Content-Type': 'application/json', 'datarobot-key': DATAROBOT_KEY}
deployment = dr.Deployment.create_from_learning_model(
model_id = MID,
label = 'Store Sales - Deployment ' + str(date.today()) + " " + str(current_time),
description = 'Store Sales - Example Deployment ' + str(date.today()),
default_prediction_server_id = prediction_server.id)
DEPLOYMENT_ID = deployment.id
# Write the deployment ID to a text file in the current working directory
# to reference later if needed
with open('./deployment_id.txt', 'w') as f:
    f.write(str(deployment.id))
print('Deployment Name is: ', deployment.label)
Deployment Name is: Store Sales - Deployment 2021-06-10 13:46:18
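Optionally, enable drift tracking on the new deployment so DataRobot monitors prediction data over time (standard Deployment settings; adjust to your monitoring needs):
# Optional: turn on target and feature drift tracking for the deployment
deployment.update_drift_tracking_settings(
    target_drift_enabled=True,
    feature_drift_enabled=True,
)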
Configure batch predictions¶
Once the model is successfully deployed, access your deployment from the DataRobot application to make predictions. To do so, navigate to the Deployments page, select the new deployment, and go to the Predictions > Prediction API tab. Select "Batch" in the Prediction Type field and "API Client" in the Interface field.
Once your predictions are configured, copy the generated script, paste it into a .py file, and save it as datarobot-predict.py.
from IPython import display
display.Image("./pred-script.png")
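As an alternative to the generated script, the Python client can submit batch prediction jobs directly. A minimal sketch, assuming you score the same local prediction file used above (file names and paths are placeholders; adjust to your environment):
# Score a local file against the deployment and write the results to disk
job = dr.BatchPredictionJob.score(
    deployment=DEPLOYMENT_ID,
    intake_settings={
        'type': 'localFile',
        'file': 'DR_Demo_Sales_Multiseries_prediction.csv',
    },
    output_settings={
        'type': 'localFile',
        'path': './predictions.csv',
    },
    # Forecast from the latest valid forecast point in the dataset
    timeseries_settings={'type': 'forecast'},
)
job.wait_for_completion()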