Predict CO₂ levels with out-of-time validation modeling¶
This notebook demonstrates how to use out-of-time validation (OTV) modeling with DataRobot's Python client to predict monthly CO₂ levels for one of Hawaii's active volcanoes, Mauna Loa. The dataset used in this notebook can be accessed here (select the first dataset listed to emulate results displayed below), but DataRobot provides a ready-to-use version of this dataset below. For this notebook, the target feature is "interpolated," because the "average" column has some missing values that should be skipped.
OTV is a useful modeling method when you know that your data's distribution changes over time. In that case, randomly sampling training and testing datasets would not yield accuracy estimates representative of how the model performs when making predictions in a production environment. Note that OTV can be applied to both classification and regression projects. It partitions your data using the backtesting method, which is also used in time series modeling.
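To make the backtesting idea concrete, the sketch below builds rolling-origin train/validation splits with plain pandas. This is a simplified illustration only, not DataRobot's internal partitioning logic; the function name and the 12-month validation window are assumptions chosen for the example.

```python
import pandas as pd

def rolling_origin_backtests(dates, n_backtests=4, validation_months=12):
    """Split a sorted monthly date series into rolling-origin (backtest) folds.

    Each fold trains on all data up to a cutoff and validates on the
    following `validation_months` of data; earlier folds use earlier
    cutoffs. A simplified illustration of backtesting, not DataRobot's
    internal partitioning.
    """
    end = dates.max()
    folds = []
    for i in range(n_backtests):
        # Fold 0 validates on the most recent window; each subsequent
        # fold shifts the validation window back by validation_months.
        val_end = end - pd.DateOffset(months=i * validation_months)
        val_start = val_end - pd.DateOffset(months=validation_months)
        train = dates[dates <= val_start]
        validation = dates[(dates > val_start) & (dates <= val_end)]
        folds.append((train, validation))
    return folds

dates = pd.Series(pd.date_range("1958-11-01", "2018-12-01", freq="MS"))
folds = rolling_origin_backtests(dates)
for i, (train, val) in enumerate(folds, 1):
    print(f"Backtest {i}: train through {train.max().date()}, "
          f"validate {val.min().date()}..{val.max().date()}")
```

Each fold's training data always ends before its validation window begins, which is the key property that random sampling fails to preserve.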
Requirements¶
- Python version 3.7.3.
- DataRobot API version 2.21.0.
Small adjustments to the code below may be required depending on the Python version and DataRobot API version used.
Reference documentation for DataRobot's Python client here.
Import libraries¶
import datarobot as dr
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
data_path = "https://docs.datarobot.com/en/docs/api/guide/common-case/co2_mm_mlo.csv"
df = pd.read_csv(data_path) # Add your dataset here
df["day"] = 1  # Add an arbitrary "day" value so an exact "date" feature can be constructed later
df.head()
| | year | month | decimal date | average | interpolated | trend | ndays | day |
|---|---|---|---|---|---|---|---|---|
| 0 | 1958 | 3 | 1958.208 | 315.71 | 315.71 | 314.62 | -1 | 1 |
| 1 | 1958 | 4 | 1958.292 | 317.45 | 317.45 | 315.29 | -1 | 1 |
| 2 | 1958 | 5 | 1958.375 | 317.50 | 317.50 | 314.71 | -1 | 1 |
| 3 | 1958 | 6 | 1958.458 | -99.99 | 317.10 | 314.85 | -1 | 1 |
| 4 | 1958 | 7 | 1958.542 | 315.86 | 315.86 | 314.98 | -1 | 1 |
Connect to DataRobot¶
Use the snippet below to authenticate and connect to DataRobot. You can read more about different options for connecting to DataRobot from the client.
# If the config file is not in the default location described in the API Quickstart guide, '~/.config/datarobot/drconfig.yaml', then you will need to call
# dr.Client(config_path='path-to-drconfig.yaml')
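If you prefer not to use a config file, you can also pass credentials directly to the client. The endpoint and token below are placeholders, not real values; substitute your own deployment URL and API token.

```python
import datarobot as dr

# Option 1: read credentials from a YAML config file
# dr.Client(config_path="~/.config/datarobot/drconfig.yaml")

# Option 2: pass the endpoint and API token directly
# (placeholder values -- substitute your own)
dr.Client(
    endpoint="https://app.datarobot.com/api/v2",
    token="YOUR_API_TOKEN",
)
```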
Preprocessing¶
Before you begin modeling, you must complete the following steps:
- Create an accurate "date" feature.
- Remove all unnecessary features.
- Create four monthly lag features.
You can create many more features (such as aggregates on a monthly level or percentages), but for the purposes of OTV, this is not required.
df["date"] = pd.to_datetime(df[["year", "month", "day"]])
df.drop(["year", "month", "decimal date", "average", "ndays", "day"], inplace=True, axis=1)
# Create four monthly lag features (lag_1 through lag_4)
for i in range(1, 5):
    df["lag_{}".format(i)] = df["interpolated"].shift(i)
# Drop the earliest rows so that no missing lag values remain
df = df.iloc[8:]
df.head()
| | interpolated | trend | date | lag_1 | lag_2 | lag_3 | lag_4 |
|---|---|---|---|---|---|---|---|
| 8 | 313.33 | 315.31 | 1958-11-01 | 312.66 | 313.20 | 314.93 | 315.86 |
| 9 | 314.67 | 315.61 | 1958-12-01 | 313.33 | 312.66 | 313.20 | 314.93 |
| 10 | 315.62 | 315.70 | 1959-01-01 | 314.67 | 313.33 | 312.66 | 313.20 |
| 11 | 316.38 | 315.88 | 1959-02-01 | 315.62 | 314.67 | 313.33 | 312.66 |
| 12 | 316.71 | 315.62 | 1959-03-01 | 316.38 | 315.62 | 314.67 | 313.33 |
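As a quick sanity check, each shifted column should equal the target value from the corresponding number of months earlier. The sketch below demonstrates this on a toy series (synthetic values, not the CO₂ data):

```python
import pandas as pd

# Toy monthly series standing in for the "interpolated" column
toy = pd.DataFrame({"interpolated": [313.3, 314.7, 315.6, 316.4, 316.7]})

for i in range(1, 5):
    toy["lag_{}".format(i)] = toy["interpolated"].shift(i)

# lag_i at row r equals the target at row r - i
row = 4
for i in range(1, 5):
    assert toy.loc[row, "lag_{}".format(i)] == toy.loc[row - i, "interpolated"]
print(toy.dropna())
```

Note that `shift` introduces missing values in the first rows, which is why the earliest rows are dropped before modeling.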
Plot the data¶
Plotting the data (displayed below) shows that it follows a clear upward trend. Because of this, randomly partitioning the data for testing would produce accuracy metrics that are not representative of performance on future data.
sns.lineplot(x="date", y="interpolated", data=df)
<AxesSubplot:xlabel='date', ylabel='interpolated'>
Define datetime partitioning¶
Use the snippet below to define datetime partitioning for the data. You can also reference a more complete example of Datetime Partitioning.
spec = dr.DatetimePartitioningSpecification(
datetime_partition_column="date", number_of_backtests=4, use_time_series=False
)
Start the project¶
The snippet below passes the spec object to the partitioning_method argument of the analyze_and_model method. This starts the project with the designated settings.
project = dr.Project.create(df, project_name="Predicting CO2 levels for Mauna Loa")
project.analyze_and_model("interpolated", partitioning_method=spec, worker_count=-1)
project.wait_for_autopilot()
Access insights¶
All model insights are available via the API. The example below displays Feature Impact calculated for one of the trained models.
Access more examples and sample code for extracting insights from the DataRobot Community.
model = project.get_top_model()
# Get Feature Impact
feature_impact = model.get_or_request_feature_impact()
# Save feature impact in pandas dataframe
fi_df = pd.DataFrame(feature_impact)
fig, ax = plt.subplots(figsize=(12, 5))
# Plot feature impact
sns.barplot(x="featureName", y="impactNormalized", data=fi_df[0:5], color="g")
<AxesSubplot:xlabel='featureName', ylabel='impactNormalized'>
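Because Feature Impact is returned as a list of records, you can also extract the top drivers programmatically instead of plotting them. The sketch below uses synthetic values in the same shape as the API response (the feature names and numbers are illustrative, not actual results):

```python
import pandas as pd

# Synthetic records in the shape returned by
# model.get_or_request_feature_impact() (illustrative values only)
feature_impact = [
    {"featureName": "lag_1", "impactNormalized": 1.00, "impactUnnormalized": 52.3},
    {"featureName": "trend", "impactNormalized": 0.41, "impactUnnormalized": 21.4},
    {"featureName": "lag_2", "impactNormalized": 0.08, "impactUnnormalized": 4.2},
]

fi_df = pd.DataFrame(feature_impact)
# Sort by normalized impact and keep the two strongest features
top = fi_df.sort_values("impactNormalized", ascending=False).head(2)
print(top["featureName"].tolist())
```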