Modeling workflow overview

This code example outlines how to use DataRobot's clients, both Python and R, to train and experiment with models. It also offers ideas for integrating DataRobot with other products via the API.

Specifically, you will:

  • Create a project and run Autopilot.
  • Experiment with feature lists, modeling algorithms, and hyperparameters.
  • Choose the best model.
  • Perform an in-depth evaluation of the selected model.
  • Deploy a model into production in a few lines of code.

In addition to this walkthrough, you can download a Jupyter notebook for each language:

Data used for this example

This walkthrough uses a synthetic dataset that illustrates a credit card company’s anti-money laundering (AML) compliance program, with the intent of detecting the following money-laundering scenarios:

  • A customer spends on the card, but overpays their credit card bill and seeks a cash refund for the difference.
  • A customer receives credits from a merchant without offsetting transactions, and either spends the money or requests a cash refund from the bank.

A rule-based engine is in place to produce an alert when it detects potentially suspicious activity consistent with the scenarios above. The engine triggers an alert whenever a customer requests a refund of any amount. Small refund requests are included because they could be a money launderer’s way of testing the refund mechanism or trying to establish refund requests as a normal pattern for their account.

The target feature is SAR (suspicious activity report). It indicates whether or not the alert resulted in an SAR after manual review by investigators, which makes this project a binary classification problem. The unit of analysis is an individual alert, so the model is built at the alert level. Each alert gets a score ranging from 0 to 1, indicating the probability that the alert leads to an SAR. The data consists of a mixture of numeric, categorical, and text features.
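Because each alert receives a probability between 0 and 1, a downstream process can rank or triage alerts by score. The sketch below is illustrative only: the alert IDs, the scores, and the 0.5 and 0.1 cutoffs are hypothetical and are not values produced by this project.

```python
# Hypothetical alert scores: probability that an alert leads to an SAR.
alert_scores = {"alert_001": 0.82, "alert_002": 0.07, "alert_003": 0.34}


def triage(score, high=0.5, low=0.1):
    """Map an SAR probability to a review queue (cutoffs are assumptions)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability between 0 and 1")
    if score >= high:
        return "priority review"
    if score < low:
        return "low-risk queue"
    return "standard review"


queues = {alert_id: triage(score) for alert_id, score in alert_scores.items()}
for alert_id, queue in sorted(queues.items()):
    print(alert_id, queue)
```

In practice, the cutoffs would be chosen from the model's evaluation results (for example, the ROC curve retrieved later in this walkthrough) rather than fixed in advance.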

Setup

Import libraries

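The first step in creating a project is uploading a dataset. This example uses the dataset DR_Demo_AML_Alert.csv (described above), which the Upload a dataset step reads from a public S3 URL. Before that, import the libraries used throughout the walkthrough.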

import datarobot as dr
from datarobot_bp_workshop import Workshop, Visualize
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import time
import warnings
import graphviz
import plotly.express as px
warnings.filterwarnings('ignore')
w = Workshop()

# wider .head()s
pd.options.display.width = 0
pd.options.display.max_columns = 200
pd.options.display.max_rows = 2000

sns.set_theme(style="darkgrid")

library(dplyr)
library(ggplot2)
library(datarobot)

Connect to DataRobot

Read more about different options for connecting to DataRobot from the client.

dr.Client(config_path = '<file-path-to-drconfig.yaml>')

datarobot::ConnectToDataRobot(configPath = '<file-path-to-drconfig.yaml>')
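The configuration file referenced above is a small YAML file. As a minimal sketch, assuming the common two-key layout (endpoint and token), it could be generated as follows. The endpoint URL and the token placeholder are assumptions; replace them with the values for your own installation, and note that write_drconfig is a hypothetical helper, not part of the client.

```python
import os
import tempfile


def write_drconfig(path, endpoint, token):
    """Write a two-key YAML config (assumed layout: endpoint + token)."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(f"endpoint: '{endpoint}'\n")
        fh.write(f"token: '{token}'\n")
    return path


config_path = write_drconfig(
    os.path.join(tempfile.mkdtemp(), "drconfig.yaml"),
    endpoint="https://app.datarobot.com/api/v2",  # assumed endpoint
    token="<your-api-token>",  # placeholder, not a real token
)

with open(config_path, encoding="utf-8") as fh:
    config_text = fh.read()
print(config_text)
```

Keeping credentials in a file outside the notebook avoids pasting an API token into code that may be shared.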

Upload a dataset

# To read from a local file, uncomment and use:
# df = pd.read_csv('./data/DR_Demo_AML_Alert.csv')

# To read from an s3 bucket:
df = pd.read_csv('https://s3.amazonaws.com/datarobot_public_datasets/DR_Demo_AML_Alert.csv')
df.head()

# To view target distribution:
# (rename_axis/reset_index gives the same 'SAR' and 'Count' columns across pandas versions)
df_target_summary = df['SAR'].value_counts().rename_axis('SAR').reset_index(name='Count')
ax = sns.barplot(x='SAR',y='Count',
             data=df_target_summary)

for index, row in df_target_summary.iterrows():
    ax.text(row.SAR,row.Count, round(row.Count,2), color='black', ha="center")

plt.show()

# Set to the location of the training data via a local file or URL
# Sample file location: '/Users/myuser/Downloads/DR_Demo_AML_Alert.csv'
dataset_file_path <- "https://s3.amazonaws.com/datarobot_public_datasets/DR_Demo_AML_Alert.csv"
training_data <- utils::read.csv(dataset_file_path)
test_data <- training_data[ -c(2) ]
head(training_data)

Create a project and train models with Autopilot

In this step, you create a project from the uploaded data, set SAR as the target, and start Autopilot. The models trained here are the ones you later evaluate and deploy.

After you have deployed a model, you can use the DataRobot Prediction API to make predictions. Deployments also give you access to advanced model management features such as data drift, accuracy, and service health statistics.

You can also copy a Python prediction snippet from the UI: on the Deployments page, select your deployment and go to Predictions > Prediction API.

from datetime import datetime

# Create a project by uploading data. This will take a few minutes.
project = dr.Project.create(sourcedata=df,
                            project_name='DR_Demo_API_alert_AML_{}'.format(datetime.now().strftime('%Y-%m-%d %H:%M')))

# Set the project's target and initiate Autopilot in Quick mode.
project.set_target(target='SAR', mode='quick', worker_count=-1)

# Open the project's Leaderboard to monitor the progress in UI.
project.open_leaderboard_browser()

# Wait for Autopilot to finish. You can set verbosity to 0 if you do not wish to see progress updates.
project.wait_for_autopilot(verbosity=1)

# Create a project by uploading data. This will take a few minutes.
project <- SetupProject(dataSource = training_data, projectName = "SAR Detection", maxWait = 60 * 60)

# Set the project target and initiate Autopilot
SetTarget(project,
          target = "SAR")

# Block execution until Autopilot is complete
# The `WaitForAutopilot()` function forces the R kernel to wait until DataRobot has finished modeling before executing the next series of commands.
WaitForAutopilot(project)

# Open the project's Leaderboard to monitor the progress in UI.
ViewWebProject("620423876638a2187c5aa876") # Provide the project ID

Retrieve and review results from the Leaderboard

def get_top_of_leaderboard(project, verbose = True):
    # A helper method to assemble a dataframe with Leaderboard results and print a summary:
    leaderboard = []
    for m in project.get_models():
        leaderboard.append([m.blueprint_id, m.featurelist.id, m.id, m.model_type, m.sample_pct, m.metrics['AUC']['validation'], m.metrics['AUC']['crossValidation']])
    leaderboard_df = pd.DataFrame(columns = ['bp_id', 'featurelist', 'model_id', 'model', 'pct', 'validation', 'cross_validation'], data = leaderboard)

    # Keep the models trained on 64% of the data, sorted by cross-validation AUC.
    # This is computed outside the verbose block so the return value always exists.
    leaderboard_top = leaderboard_df[leaderboard_df['pct'] == 64].sort_values(by = 'cross_validation', ascending = False).head().reset_index(drop = True)

    if verbose:
        # Print a Leaderboard summary:
        print("Unique blueprints tested: " + str(len(leaderboard_df['bp_id'].unique())))
        print("Feature lists tested: " + str(len(leaderboard_df['featurelist'].unique())))
        print("Models trained: " + str(len(leaderboard_df)))
        print("Blueprints in the project repository: " + str(len(project.get_blueprints())))

        # Print the essential information for the top models:
        print("\n\nTop models in the leaderboard:")
        display(leaderboard_top.drop(columns = ['bp_id', 'featurelist']))

        # Show blueprints of top models:
        for index, m in leaderboard_top.iterrows():
            Visualize.show_dr_blueprint(dr.Blueprint.get(project.id, m['bp_id']))

    return leaderboard_top

leaderboard_top = get_top_of_leaderboard(project)

# Use the `ListModels()` function to retrieve a list of all the trained DataRobot models for a specified project.
ListModels(project)

# Retrieve the model DataRobot recommends for deployment
model <- GetRecommendedModel(project, type = RecommendedModelType$RecommendedForDeployment)

# Get a model's blueprint
GetModelBlueprintChart(project, "<model-id>") # Provide the model ID

Experiment to get better results

When you run a project using Autopilot, DataRobot first creates blueprints based on the characteristics of your data and puts them in the Repository. Then, it chooses a subset from these to train; when training completes, these are the blueprints you’ll find in the Leaderboard. After the Leaderboard is populated, it can be useful to train some of those blueprints that DataRobot skipped. For example, you can try a more complex Keras blueprint like Keras Residual AutoInt Classifier using Training Schedule (3 Attention Layers with 2 Heads, 2 Layers: 100, 100 Units). In some cases, you may want to directly access the trained model through R and retrain it with a different feature list or tune its hyperparameters.

Find blueprints not yet trained for the project from the Repository

blueprints = project.get_blueprints()

# After retrieving the blueprints, you can search for a specific blueprint
# In the example below, search for all models that have "Gradient" in their name

models_to_run = []
for blueprint in blueprints:
    if 'Gradient' in blueprint.model_type:
        models_to_run.append(blueprint)
models_to_run

modelsInLeaderboard <- ListModels(project)
modelsInLeaderboard_df <- as.data.frame(modelsInLeaderboard)
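To restrict the search to blueprints that have not been trained yet, the Repository can be compared with the Leaderboard by blueprint ID. The sketch below uses hypothetical in-memory records rather than client objects; with the Python client, the IDs would presumably come from the id attribute of each blueprint and the blueprint_id attribute of each model.

```python
# Hypothetical stand-ins for Repository blueprints and Leaderboard models.
repository = [
    {"id": "bp1", "model_type": "Gradient Boosted Trees Classifier"},
    {"id": "bp2", "model_type": "Elastic-Net Classifier"},
    {"id": "bp3", "model_type": "Light Gradient Boosting on ElasticNet Predictions"},
]
leaderboard = [{"model_id": "m1", "blueprint_id": "bp1"}]

# Blueprint IDs that already have a trained model on the Leaderboard.
trained_ids = {model["blueprint_id"] for model in leaderboard}

# Repository blueprints that were skipped, then narrowed by name.
untrained = [bp for bp in repository if bp["id"] not in trained_ids]
untrained_gradient = [bp for bp in untrained if "Gradient" in bp["model_type"]]

print([bp["id"] for bp in untrained_gradient])
```

Filtering on the set difference first avoids retraining a blueprint that Autopilot has already run.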

Python: define and train a custom blueprint

Python

This section, exclusive to the Python client, describes how to use various DataRobot features to improve the results returned from models. Use the snippet below to define and train a custom blueprint. You can read more about composing custom blueprints via code by visiting DataRobot's blueprint workshop.

pdm3 = w.Tasks.PDM3(w.TaskInputs.CAT)
pdm3.set_task_parameters(cm=50000, sc=10)

ndc = w.Tasks.NDC(w.TaskInputs.NUM)
rdt5 = w.Tasks.RDT5(ndc)

ptm3 = w.Tasks.PTM3(w.TaskInputs.TXT)
ptm3.set_task_parameters(d2=0.2, mxf=20000, d1=5, n='l2', id=True)

kerasc = w.Tasks.KERASC(rdt5, pdm3, ptm3)
kerasc.set_task_parameters(always_use_test_set=1, epochs=4, hidden_batch_norm=1, hidden_units='list(64)', hidden_use_bias=0, learning_rate=0.03, use_training_schedule=1)

# Check task documentation:
# kerasc.documentation()

kerasc_blueprint = w.BlueprintGraph(kerasc, name='A Custom Keras BP (1 Layer: 64 Units)').save()
kerasc_blueprint.show()
kerasc_blueprint.train(project_id = project.id, sample_pct = 64)

The last call in the snippet above, kerasc_blueprint.train(), trains a model from the custom blueprint in the current project on 64% of the data.

Train a model using a different feature list

# Select a model from the Leaderboard:
model = dr.Model.get(project = project.id, model_id = leaderboard_top.iloc[0]['model_id'])

# Retrieve Feature Impact:
feature_impact = model.get_or_request_feature_impact()

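# Create a feature list using the top 25 features based on feature impact
# (sort explicitly so the selection does not depend on the order returned):
feature_impact = sorted(feature_impact, key=lambda f: f["impactNormalized"], reverse=True)
feature_list = [f["featureName"] for f in feature_impact[:25]]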
new_list = project.create_featurelist('new_feat_list', feature_list)

# Retrain models using the new feature list:
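model.retrain(featurelist_id = new_list.id)

# Train blueprints from the Repository (here, those with "Gradient" in their name).
# Assumes each entry returned by ListBlueprints() has a modelType field.
models_to_run <- Filter(function(bp) grepl("Gradient", bp$modelType), ListBlueprints(project))
for (blueprint in models_to_run) {
    job <- RequestNewModel(project, blueprint)
    WaitForJobToComplete(project, job, maxWait = 600)
}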
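
The feature selection step above comes down to ranking Feature Impact records and keeping the first N names. A minimal sketch with hypothetical records, using the same featureName and impactNormalized keys as the Python snippet above (the feature names and values are invented for illustration, and top_features is a hypothetical helper):

```python
# Hypothetical Feature Impact records, deliberately out of order.
feature_impact_records = [
    {"featureName": "feature_b", "impactNormalized": 0.35},
    {"featureName": "feature_a", "impactNormalized": 1.00},
    {"featureName": "feature_d", "impactNormalized": 0.02},
    {"featureName": "feature_c", "impactNormalized": 0.18},
]


def top_features(records, n):
    """Return the names of the n features with the highest normalized impact."""
    ranked = sorted(records, key=lambda r: r["impactNormalized"], reverse=True)
    return [r["featureName"] for r in ranked[:n]]


selected = top_features(feature_impact_records, n=3)
print(selected)
```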

Tune hyperparameters for a model

tune = model.start_advanced_tuning_session()

# Get available task names,
# and available parameter names for a task name that exists on this model
tasks = tune.get_task_names()
tune.get_parameter_names(tasks[1])

# Adjust this section as required as it may differ depending on task/parameter names as well as acceptable values
tune.set_parameter(
    task_name=tasks[1],
    parameter_name='n_estimators',
    value=200)

job = tune.run()

StartTuningSession(model)

Select the top-performing model

# View the top models on the Leaderboard
leaderboard_top = get_top_of_leaderboard(project)
# Select the model based on accuracy (AUC)
top_model = dr.Model.get(project = project.id, model_id = leaderboard_top.iloc[0]['model_id'])

# Use the `ListModels()` function to retrieve a list of all the trained DataRobot models for a specified project
ListModels(project)

# Retrieve the model DataRobot recommends for deployment
model <- GetRecommendedModel(project, type = RecommendedModelType$RecommendedForDeployment)

Model evaluation

Retrieve and plot Feature Impact

max_num_features = 15

# Retrieve Feature Impact
feature_impacts = top_model.get_or_request_feature_impact()

# Plot permutation-based Feature Impact as a horizontal bar chart
feature_impacts.sort(key=lambda x: x['impactNormalized'], reverse=True)
FeatureImpactDF = pd.DataFrame([{'Impact Normalized': f["impactNormalized"],
                                 'Feature Name': f["featureName"]}
                                for f in feature_impacts[:max_num_features]])
sns.barplot(y="Feature Name", x="Impact Normalized", data=FeatureImpactDF)
plt.show()

# Retrieve the top 10 most impactful features:
feature_impact <- GetFeatureImpact(model)
feature_impact <- feature_impact[order(-feature_impact$impactNormalized), ] %>% slice(1:10)

# Create plot of top 10 features based on Feature Impact
ggplot(data = feature_impact,
       mapping = aes(
          x = featureName,
          y = impactNormalized)) +
    geom_col() +
    labs(x = "Feature")

Retrieve and plot the ROC curve

roc = top_model.get_roc_curve('validation')
df_roc = pd.DataFrame(roc.roc_points)
dr_dark_blue = '#08233F'
dr_roc_green = '#03c75f'
white = '#ffffff'

fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)

plt.scatter(df_roc.false_positive_rate, df_roc.true_positive_rate, color=dr_roc_green)
plt.plot(df_roc.false_positive_rate, df_roc.true_positive_rate, color=dr_roc_green)
plt.plot([0, 1], [0, 1], color=white, alpha=0.25)
plt.title('ROC curve')
plt.xlabel('False Positive Rate (Fallout)')
plt.xlim([0, 1])
plt.ylabel('True Positive Rate (Sensitivity)')
plt.ylim([0, 1])
plt.show()

roc <- GetRocCurve(model,
               source = 'validation')
roc_df <- roc$rocPoints
head(roc_df)
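
The ROC points can also be summarized numerically. As a sketch, the area under the curve can be approximated with the trapezoidal rule. The points below are hypothetical and only mimic the false_positive_rate and true_positive_rate fields used in the plot above; auc_trapezoid is an illustrative helper, not a client function.

```python
def auc_trapezoid(points):
    """Approximate the area under a ROC curve with the trapezoidal rule."""
    pts = sorted(
        (p["false_positive_rate"], p["true_positive_rate"]) for p in points
    )
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Width of the slice times the average height of its two edges.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical ROC points, not taken from the project's model.
roc_points = [
    {"false_positive_rate": 0.0, "true_positive_rate": 0.0},
    {"false_positive_rate": 0.2, "true_positive_rate": 0.6},
    {"false_positive_rate": 0.5, "true_positive_rate": 0.9},
    {"false_positive_rate": 1.0, "true_positive_rate": 1.0},
]

auc = auc_trapezoid(roc_points)
print(round(auc, 4))
```

A curve that follows the diagonal yields an area of 0.5, which is why the plot above draws the diagonal as a reference line.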

Retrieve and plot Feature Effects

feature_effects = top_model.get_or_request_feature_effect(source='validation')
max_features = 5

for f in feature_effects.feature_effects[:max_features]:
    plt.figure(figsize = (9,6))
    d = pd.DataFrame(f['partial_dependence']['data'])
    if f['feature_type'] == 'numeric':
        d = d[d['label'] != 'nan']
        d['label'] = pd.to_numeric(d['label'])
        sns.lineplot(x="label", y="dependence", data = d).set_title(f['feature_name'] + ": importance=" + str(round(f['feature_impact_score'], 2)))
    else:
        sns.scatterplot(x="label", y="dependence", data = d).set_title(f['feature_name'] + ": importance=" + str(round(f['feature_impact_score'], 2)))

Score data before deployment

# Use training data to test how the model makes predictions
test_data = df.head(50)

dataset_from_file = project.upload_dataset(test_data)
predict_job_1 = top_model.request_predictions(dataset_from_file.id)

predictions = predict_job_1.get_result_when_complete()
display(predictions.head())

test_data <- training_data[ -c(2) ]
head(test_data)
# Uploading the testing dataset
scoring <- UploadPredictionDataset(project, dataSource = test_data)

# Requesting prediction
predict_job_id <- RequestPredictions(project, modelId = model$modelId, datasetId = scoring$id)

# Grabbing predictions
predictions_prob <- GetPredictions(project, predictId = predict_job_id, type = "probability")
head(predictions_prob)

Compute Prediction Explanations

# Prepare prediction explanations
pe_job = dr.PredictionExplanationsInitialization.create(project.id, top_model.id)
pe_job.wait_for_completion()
# Compute prediction explanations, setting the number of explanations and the thresholds explicitly
pe_job2 = dr.PredictionExplanations.create(project.id,
                                           top_model.id,
                                           dataset_from_file.id,
                                           max_explanations=3,
                                           threshold_low = 0.1,
                                           threshold_high = 0.5)
pe = pe_job2.get_result_when_complete()
display(pe.get_all_as_dataframe().head())

GetPredictionExplanations(model, test_data)
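
The threshold_low and threshold_high arguments in the Python snippet above limit which rows receive explanations. A reasonable reading, and an assumption worth checking against the client documentation, is that explanations are computed only for rows whose prediction is below threshold_low or above threshold_high. The sketch below applies that rule to hypothetical scores; needs_explanation is an illustrative helper, not a client function.

```python
def needs_explanation(prediction, threshold_low=0.1, threshold_high=0.5):
    """Assumed rule: explain only predictions outside the two thresholds."""
    return prediction < threshold_low or prediction > threshold_high


# Hypothetical predicted probabilities for five alerts.
scores = [0.03, 0.10, 0.27, 0.50, 0.91]
explained = [s for s in scores if needs_explanation(s)]
print(explained)
```

Restricting explanations to the most and least suspicious alerts keeps the computation focused on the rows investigators are most likely to act on.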

Deploy a model

After identifying the best-performing models, you can deploy them and use DataRobot's REST API to make HTTP requests and return predictions. You can also configure batch jobs to write back into your environment of choice.

Once deployed, access monitoring capabilities such as data drift, accuracy, and service health statistics.

# Copy and paste the model ID from previous steps or from the UI:
model_id = top_model.id
prediction_server_id = dr.PredictionServer.list()[0].id

deployment = dr.Deployment.create_from_learning_model(
    model_id,
    label = 'New Deployment',
    description = 'A new deployment',
    default_prediction_server_id = prediction_server_id)
deployment

prediction_server <- ListPredictionServers()[[1]]

deployment <- CreateDeployment(model,
                               label = 'New Deployment',
                               description = 'A new deployment',
                               defaultPredictionServerId = prediction_server$id)
deployment
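
Once a deployment exists, a prediction request over HTTP might be assembled as sketched below. The URL path, the header names, the JSON payload shape, and every placeholder value are assumptions for illustration; the prediction snippet on the deployment's Predictions > Prediction API tab is the authoritative source. The request is only constructed here, not sent, and build_prediction_request is a hypothetical helper.

```python
import json
import urllib.request


def build_prediction_request(server_url, deployment_id, api_token, datarobot_key, rows):
    """Build (but do not send) a POST request carrying rows to score as JSON."""
    # Assumed URL layout; copy the real one from the deployment's UI snippet.
    url = f"{server_url.rstrip('/')}/predApi/v1.0/deployments/{deployment_id}/predictions"
    headers = {
        "Content-Type": "application/json; charset=UTF-8",
        "Authorization": f"Bearer {api_token}",
        "DataRobot-Key": datarobot_key,  # assumed header; may not apply to every installation
    }
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_prediction_request(
    server_url="https://example.invalid",  # placeholder prediction server
    deployment_id="<deployment-id>",
    api_token="<api-token>",
    datarobot_key="<datarobot-key>",
    rows=[{"feature_a": 1, "feature_b": "x"}],  # hypothetical feature names
)
print(request.get_method(), request.full_url)
```

Sending the request would then be a call to urllib.request.urlopen(request) against a real prediction server, with the response parsed as JSON.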

Updated April 28, 2022