
Setup

Use the following code to install the required libraries, connect to DataRobot, and curate the data for modeling later in the workflow.

For a sample dataset, DataRobot recommends using this sample demand forecasting dataset, hosted on Kaggle.

Install prerequisites

Uncomment and run the pip install commands listed below as needed.

# !pip install --upgrade datarobot
# !pip install umap-learn
# !pip install -Iv hdbscan==0.8.24
# !pip install pandas-profiling
# !pip install ipywidgets

Import libraries

import datarobot as dr
from datarobot import Project, Deployment
import pandas as pd
from pandas import json_normalize  # pandas.io.json path is deprecated
import numpy as np
from datetime import date, timedelta
from datetime import datetime

import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import matplotlib.ticker as mtick
from matplotlib.ticker import FormatStrFormatter

import datetime as dt
import dateutil.parser
import os
import re 
from importlib import reload
import random
import math
import umap

from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from scipy.signal import savgol_filter
from pandas_profiling import ProfileReport
import ipywidgets

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Set Pandas configuration to show all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

import warnings
warnings.filterwarnings('ignore')

Connect to DataRobot

DataRobot recommends providing a configuration file containing your credentials (endpoint and API key) to connect to DataRobot. For more information about authentication, reference the API Quickstart guide.

dr.Client(config_path='/path/to/drconfig.yaml');
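For reference, a minimal configuration file contains just the endpoint and your API token. The values below are placeholders, not working credentials:

```yaml
# drconfig.yaml
endpoint: https://app.datarobot.com/api/v2
token: YOUR_API_TOKEN
```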

Upload and analyze data

The following steps in this notebook outline how to upload data, perform EDA, and visualize unique insights for your time series dataset.

df = pd.read_csv('Acme_Train.csv',
                 infer_datetime_format=True,
                 parse_dates=['Date'],
                 engine='c',
                 )

# Extract Month (To be used later for Clustering)
# df['Month'] = df['date'].dt.strftime('%b')          # Month Name

# Drop columns
df.drop(labels=['Category'], axis=1, inplace=True)

df.tail(5)
Series Date SalesQty AvgPrice numTransactions Weight Advertised FrontPg OnDisplay WeekNbr Cat Store Item Month
341611 01115_Soft_Drinks_3722462659 2018-09-22 2.00000 4.99000 2.00000 0.00000 0.00000 0.00000 0.00000 38.00000 Soft_Drinks 1115.00000 3722462659.00000 September
341612 01115_Soft_Drinks_3722462659 2018-09-23 3.00000 4.99000 3.00000 0.00000 0.00000 0.00000 0.00000 38.00000 Soft_Drinks 1115.00000 3722462659.00000 September
341613 01115_Soft_Drinks_3722462659 2018-09-24 2.00000 4.99000 2.00000 0.00000 0.00000 0.00000 0.00000 39.00000 Soft_Drinks 1115.00000 3722462659.00000 September
341614 01115_Soft_Drinks_3722462659 2018-09-25 0.00000 4.99000 0.00000 0.00000 0.00000 0.00000 0.00000 39.00000 Soft_Drinks 1115.00000 3722462659.00000 September
341615 01115_Soft_Drinks_3722462659 2018-09-26 2.00000 4.99000 2.00000 0.00000 0.00000 0.00000 0.00000 39.00000 Soft_Drinks 1115.00000 3722462659.00000 September
# df.to_csv('data/Acme_Sampled_Original.csv', index=False)
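The commented-out month extraction above can be sketched on a toy frame. The `Date` column name mirrors the dataset; `dt.strftime('%b')` yields the abbreviated month name:

```python
import pandas as pd

# Toy frame standing in for the Acme data (assumed column name "Date")
toy = pd.DataFrame({"Date": pd.to_datetime(["2018-09-22", "2018-10-01"])})

# Abbreviated month name, e.g. "Sep"; use "%B" for the full name
toy["Month"] = toy["Date"].dt.strftime("%b")
print(toy["Month"].tolist())
```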

Set variables

Use the code below to set the variables that are referenced throughout the rest of this notebook, then print the date range and row count to verify that the variables are bound correctly.

SERIES_ID = 'Series'
DATE_COL  = 'Date'
TARGET    = 'SalesQty'
FREQ      = '1D'
print(df[DATE_COL].min())
print(df[DATE_COL].max())
STARTING_LEN = len(df)
print(STARTING_LEN)
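Since `FREQ` declares a daily series, it can be worth confirming that each series actually has a row for every day. A minimal sketch on a toy frame (the `Series`/`Date` column names mirror the variables set above):

```python
import pandas as pd

# Toy frame with a deliberate one-day gap (2018-09-03 is missing)
toy = pd.DataFrame({
    "Series": ["A"] * 3,
    "Date": pd.to_datetime(["2018-09-01", "2018-09-02", "2018-09-04"]),
})

# For each series, compare the observed dates against a complete daily range
gaps = {}
for name, g in toy.groupby("Series"):
    full = pd.date_range(g["Date"].min(), g["Date"].max(), freq="1D")
    gaps[name] = full.difference(g["Date"])

print(gaps["A"].tolist())  # the missing day(s)
```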

Set the random state variable to ensure reproducibility. Without it, the results differ each time you run a cell, because a different partition is used each time during modeling.

RANDOM_STATE = 1234
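A quick standalone check that seeding makes random draws repeatable (this seeds Python's and NumPy's generators; pass `RANDOM_STATE` explicitly to any scikit-learn estimator that accepts it):

```python
import random
import numpy as np

RANDOM_STATE = 1234
random.seed(RANDOM_STATE)

# The same seed yields the same draw on every run
np.random.seed(RANDOM_STATE)
a = np.random.rand(3)
np.random.seed(RANDOM_STATE)
b = np.random.rand(3)
print(bool((a == b).all()))
```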

Perform EDA

Perform EDA using pandas-profiling. The profiling report visualizes the data and describes the distribution of each variable. Note that your specific dataset might require deeper analysis in certain areas.

profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)
profile
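If pandas-profiling is unavailable, a minimal manual pass covers some of the same ground with plain pandas. A sketch on a toy frame (column names assumed to mirror the dataset):

```python
import pandas as pd

# Toy frame standing in for the Acme data, with one missing value
toy = pd.DataFrame({
    "SalesQty": [2.0, 3.0, None, 2.0],
    "Cat": ["Soft_Drinks"] * 4,
})

print(toy.describe(include="all"))  # per-column summary statistics
print(toy.isna().sum())             # missing-value counts per column
```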