Large-scale demand forecasting
This notebook outlines how to perform large-scale demand forecasting using DataRobot's Python package. A single model rarely handles extreme data diversity or captures the complexity of human buying patterns at a detailed level. Complex demand forecasting typically requires deep statistical know-how and lengthy development projects built around big data architectures. This notebook builds a model factory that automates much of this work by creating multiple projects "under the hood."
For sample data, DataRobot recommends this demand forecasting dataset, hosted on Kaggle.
# !pip install --upgrade datarobot
# !pip install umap-learn
# !pip install -Iv hdbscan==0.8.24
# !pip install pandas-profiling
# !pip install ipywidgets
Import libraries
import datarobot as dr
from datarobot import Project, Deployment
import pandas as pd
# json_normalize is exposed at the top level in pandas 1.0 and later
from pandas import json_normalize
import numpy as np
from datetime import date, timedelta
from datetime import datetime
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import matplotlib.ticker as mtick
from matplotlib.ticker import FormatStrFormatter
import datetime as dt
import dateutil.parser
import os
import re
from importlib import reload
import random
import math
import umap
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from scipy.signal import savgol_filter
from pandas_profiling import ProfileReport
import ipywidgets
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Set Pandas configuration to show all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.float_format', lambda x: '%.5f' % x)
import warnings
warnings.filterwarnings('ignore')
Connect to DataRobot
DataRobot recommends providing a configuration file containing your credentials (endpoint and API key) to connect to DataRobot. For more information about authentication, see the API Quickstart guide.
dr.Client()
# The `config_path` should only be specified if the config file is not in the default location described in the API Quickstart guide
# dr.Client(config_path = 'path-to-drconfig.yaml')
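As an alternative to a configuration file, the credentials can be collected from environment variables. The sketch below assumes the variable names `DATAROBOT_ENDPOINT` and `DATAROBOT_API_TOKEN`, an assumed default endpoint, and a hypothetical helper named `load_credentials`; none of these come from the notebook itself. The `dr.Client` call is left commented out so the snippet runs without a live account.

```python
import os

# Assumed endpoint for the managed cloud; adjust for your own installation
DEFAULT_ENDPOINT = "https://app.datarobot.com/api/v2"


def load_credentials(env=None):
    """Hypothetical helper: collect the endpoint and token from environment variables."""
    env = os.environ if env is None else env
    token = env.get("DATAROBOT_API_TOKEN")
    if not token:
        raise RuntimeError("Set DATAROBOT_API_TOKEN before connecting.")
    return {
        "endpoint": env.get("DATAROBOT_ENDPOINT", DEFAULT_ENDPOINT),
        "token": token,
    }


# creds = load_credentials()
# dr.Client(endpoint=creds["endpoint"], token=creds["token"])
```

Keeping the token out of the notebook source this way avoids committing it to version control by accident.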
Upload and analyze data
The following steps in this notebook outline how to upload data, perform EDA, and visualize unique insights for your time series dataset.
data_path = "https://docs.datarobot.com/en/docs/api/guide/common-case/demand-forecast/train_data.csv"
df = pd.read_csv(data_path,
                 infer_datetime_format=True,
                 parse_dates=['Date'],
                 engine='c',
                )
# Extract Month (To be used later for Clustering)
# df['Month'] = df['Date'].dt.strftime('%b') # Month Name
# Drop columns
df.drop(labels=['Category'], axis=1, inplace=True)
df.tail(5)
| | Series | Date | SalesQty | AvgPrice | numTransactions | Weight | Advertised | FrontPg | OnDisplay | WeekNbr | Cat | Store | Item | Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 341611 | 01115_Soft_Drinks_3722462659 | 2018-09-22 | 2.00000 | 4.99000 | 2.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 38.00000 | Soft_Drinks | 1115.00000 | 3722462659.00000 | September |
| 341612 | 01115_Soft_Drinks_3722462659 | 2018-09-23 | 3.00000 | 4.99000 | 3.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 38.00000 | Soft_Drinks | 1115.00000 | 3722462659.00000 | September |
| 341613 | 01115_Soft_Drinks_3722462659 | 2018-09-24 | 2.00000 | 4.99000 | 2.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 39.00000 | Soft_Drinks | 1115.00000 | 3722462659.00000 | September |
| 341614 | 01115_Soft_Drinks_3722462659 | 2018-09-25 | 0.00000 | 4.99000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 39.00000 | Soft_Drinks | 1115.00000 | 3722462659.00000 | September |
| 341615 | 01115_Soft_Drinks_3722462659 | 2018-09-26 | 2.00000 | 4.99000 | 2.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 39.00000 | Soft_Drinks | 1115.00000 | 3722462659.00000 | September |
# df.to_csv('data/Acme_Sampled_Original.csv', index=False)
Set variables
Use the code below to set the variables that are referenced throughout the rest of this notebook. The code that follows then prints the earliest and latest dates and the row count, which confirms that the variables point at the intended columns.
SERIES_ID = 'Series'
DATE_COL = 'Date'
TARGET = 'SalesQty'
FREQ = '1D'
print(df[DATE_COL].min())
print(df[DATE_COL].max())
STARTING_LEN = len(df)
print(STARTING_LEN)
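Beyond the overall date range, it can help to confirm that each series is complete at the `FREQ` spacing. The sketch below runs on a small made-up frame that reuses the notebook's column names; the helper name `summarize_series` and the toy values are illustrative assumptions, and the gap count assumes at most one row per series and date.

```python
import pandas as pd

SERIES_ID = 'Series'
DATE_COL = 'Date'
TARGET = 'SalesQty'
FREQ = '1D'


def summarize_series(frame, series_col=SERIES_ID, date_col=DATE_COL, freq=FREQ):
    """Hypothetical helper: per-series start, end, row count, and missing time steps."""
    summary = frame.groupby(series_col)[date_col].agg(start='min', end='max', rows='count')
    step = pd.Timedelta(freq)
    # Number of time steps a gap-free series would contain between start and end
    expected = ((summary['end'] - summary['start']) / step).astype(int) + 1
    summary['missing_steps'] = expected - summary['rows']
    return summary


# Made-up data: series A is complete, series B skips two days
toy = pd.DataFrame({
    SERIES_ID: ['A', 'A', 'A', 'B', 'B'],
    DATE_COL: pd.to_datetime(['2018-09-01', '2018-09-02', '2018-09-03',
                              '2018-09-01', '2018-09-04']),
    TARGET: [2, 3, 0, 1, 4],
})
print(summarize_series(toy))
```

On the real data, the same call would be `summarize_series(df)`; series with a nonzero `missing_steps` value may need gap filling or separate handling before modeling.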
Set the random state variable to ensure reproducibility. Without a fixed value, results can differ each time you run a cell, because a different partition may be used during modeling.
RANDOM_STATE = 1234
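The effect of a fixed seed can be seen in a small sketch: two draws that start from the same seed return the same sample. The helper `sample_ids` and the series IDs are made up for illustration and are not part of the notebook.

```python
import numpy as np

RANDOM_STATE = 1234


def sample_ids(ids, k, seed=RANDOM_STATE):
    """Illustrative helper: draw k IDs without replacement from a seeded generator."""
    rng = np.random.default_rng(seed)
    return list(rng.choice(ids, size=k, replace=False))


series_ids = [f"store_{i:02d}" for i in range(10)]  # made-up IDs
first = sample_ids(series_ids, 3)
second = sample_ids(series_ids, 3)
print(first == second)  # True: same seed, same draw
```

The same value can be passed wherever a library accepts a seed, for example the `random_state` argument of scikit-learn's `KMeans`.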
Perform EDA
Perform EDA using pandas-profiling. The profile report visualizes aspects of the data and helps explain the distribution of each variable. Note that your specific dataset might require more analysis in certain areas.
profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)
profile
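A full profiling report can be slow on a large frame, so a lighter first pass with plain pandas is sometimes useful. The sketch below uses a made-up frame with a few of the notebook's column names; the helper name `quick_profile` is hypothetical and is not part of pandas-profiling.

```python
import pandas as pd


def quick_profile(frame):
    """Hypothetical helper: dtype, missing share, and distinct count per column."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'missing_pct': frame.isna().mean() * 100,
        'n_unique': frame.nunique(),
    })


# Made-up data with one missing target value and one zero-sales row
toy = pd.DataFrame({
    'Series': ['A', 'A', 'B', 'B'],
    'SalesQty': [2.0, 0.0, None, 3.0],
    'AvgPrice': [4.99, 4.99, 1.50, 1.50],
})
report = quick_profile(toy)
print(report)

# Share of zero-sales rows, which is often high in item-level demand data
zero_share = (toy['SalesQty'] == 0).mean()
print(zero_share)
```

On the real data, `quick_profile(df)` gives a quick view of which columns need attention before running the full report.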