Python modeling workflow overview¶
This code example outlines how to use DataRobot's Python API client to train and experiment with models. It also offers ideas for integrating DataRobot with other products via the API.
Specifically, this example shows how to:
Create a project and run Autopilot.
Experiment with feature lists, modeling algorithms, and hyperparameters.
Select the best model.
Run an in-depth evaluation of the selected model.
Deploy a model into production with a few lines of code.
Data used in this example¶
This walkthrough uses a synthetic dataset that illustrates a credit card company's anti-money laundering (AML) compliance program, with the goal of detecting the following money laundering scenarios:
A customer spends on the card, but overpays their credit card bill and seeks a cash refund for the difference.
A customer receives a credit from a merchant without an offsetting transaction, and either spends the money or requests a cash refund from the bank.
A rule-based engine is in place to generate an alert when it detects potentially suspicious activity consistent with the scenarios above. The engine triggers an alert whenever a customer requests a refund of any amount. Small refund requests are included because they may be how money launderers test the refund mechanism or attempt to establish refund requests as a normal pattern for their account.
The target feature is SAR (Suspicious Activity Report), which indicates whether an alert resulted in an SAR after manual review by investigators; this makes the project a binary classification problem. Because the unit of analysis is an individual alert, the model is built at the alert level. Each alert receives a score ranging from 0 to 1, indicating the probability that it will lead to an SAR. The data consists of a mixture of numeric, categorical, and text features.
Setup¶
Import libraries¶
import time
import warnings

import datarobot as dr
from datarobot_bp_workshop import Visualize, Workshop
import graphviz
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import seaborn as sns

%matplotlib inline
warnings.filterwarnings("ignore")
w = Workshop()
# Widen pandas display options for .head() output:
pd.options.display.width = 0
pd.options.display.max_columns = 200
pd.options.display.max_rows = 2000
sns.set_theme(style="darkgrid")
Connect to DataRobot¶
Read more about different options for connecting to DataRobot from the client.
# If the config file is not in the default location described in the API Quickstart guide, '~/.config/datarobot/drconfig.yaml', then you will need to call
# dr.Client(config_path='path-to-drconfig.yaml')
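Alternatively, you can pass an endpoint and API token to the client directly; the values below are placeholders for your own credentials.
# Connect by passing credentials explicitly (placeholder values):
# dr.Client(
#     endpoint="https://app.datarobot.com/api/v2",
#     token="YOUR_API_TOKEN",
# )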
Select a training dataset¶
# To read from a local file, uncomment and use:
# df = pd.read_csv('./data/DR_Demo_AML_Alert.csv')
# To read from an S3 bucket:
df = pd.read_csv("https://s3.amazonaws.com/datarobot_public_datasets/DR_Demo_AML_Alert.csv")
df.head()
# View the target distribution:
df_target_summary = df["SAR"].value_counts().rename_axis("SAR").reset_index(name="Count")
ax = sns.barplot(x="SAR", y="Count", data=df_target_summary)
for index, row in df_target_summary.iterrows():
    ax.text(index, row.Count, row.Count, color="black", ha="center")
plt.show()
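Because SAR events are typically a small minority of alerts, it is also worth checking the class balance as a fraction; a minimal one-line sketch:
# Share of alerts by SAR outcome:
df["SAR"].value_counts(normalize=True)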
Create a project and train models with Autopilot¶
# Create a project by uploading data. This will take a few minutes.
project = dr.Project.create(
    sourcedata=df,
    project_name="DR_Demo_API_alert_AML_{}".format(pd.Timestamp.now().strftime("%Y-%m-%d %H:%M")),
)
# Set the project's target and initiate Autopilot in Quick mode (the default).
project.analyze_and_model(target="SAR", worker_count=-1)
# Wait for Autopilot to finish. Set verbosity to 0 if you do not wish to see progress updates.
project.wait_for_autopilot(verbosity=1)
# Open the project's Leaderboard to monitor progress in the UI.
project.open_in_browser()
Retrieve and review results from the Leaderboard¶
def get_top_of_leaderboard(project, verbose=True):
    # A helper method to assemble a dataframe with Leaderboard results and print a summary:
    leaderboard = []
    for m in project.get_models():
        leaderboard.append(
            [
                m.blueprint_id,
                m.featurelist.id,
                m.id,
                m.model_type,
                m.sample_pct,
                m.metrics["AUC"]["validation"],
                m.metrics["AUC"]["crossValidation"],
            ]
        )
    leaderboard_df = pd.DataFrame(
        columns=[
            "bp_id",
            "featurelist",
            "model_id",
            "model",
            "pct",
            "validation",
            "cross_validation",
        ],
        data=leaderboard,
    )
    # Keep models trained on 64% of the data, sorted by cross-validation accuracy:
    leaderboard_top = (
        leaderboard_df[leaderboard_df["pct"] == 64]
        .sort_values(by="cross_validation", ascending=False)
        .head()
        .reset_index(drop=True)
    )
    if verbose:
        # Print a Leaderboard summary:
        print("Unique blueprints tested: " + str(len(leaderboard_df["bp_id"].unique())))
        print("Feature lists tested: " + str(len(leaderboard_df["featurelist"].unique())))
        print("Models trained: " + str(len(leaderboard_df)))
        print("Blueprints in the project repository: " + str(len(project.get_blueprints())))
        # Print the essential information for the top models:
        print("\n\nTop models on the Leaderboard:")
        display(leaderboard_top.drop(columns=["bp_id", "featurelist"], inplace=False))
        # Show blueprints of top models:
        for _, m in leaderboard_top.iterrows():
            Visualize.show_dr_blueprint(dr.Blueprint.get(project.id, m["bp_id"]))
    return leaderboard_top
leaderboard_top = get_top_of_leaderboard(project)
Unique blueprints tested: 15
Feature lists tested: 4
Models trained: 28
Blueprints in the project repository: 81

Top models on the Leaderboard:
| | model_id | model | pct | validation | cross_validation |
|---|---|---|---|---|---|
| 0 | 61ae672ab9e0c15c325b05b2 | RandomForest Classifier (Gini) | 64.0 | 0.94577 | 0.945790 |
| 1 | 61ae73aa2a2a4649c04e2c73 | RandomForest Classifier (Gini) | 64.0 | 0.94598 | 0.945712 |
| 2 | 61ae69df4e24ec75154c24ee | AVG Blender | 64.0 | 0.94573 | 0.945318 |
| 3 | 61ae65a640d62771a45b0594 | eXtreme Gradient Boosted Trees Classifier with... | 64.0 | 0.94675 | 0.945166 |
| 4 | 61ae65a540d62771a45b0592 | RandomForest Classifier (Gini) | 64.0 | 0.94542 | 0.944690 |
Experiment to yield better results¶
When you run a project with Autopilot, DataRobot first creates blueprints based on the characteristics of your data and places them in the Repository. It then chooses a subset of them to train; when training completes, these are the blueprints you'll find on the Leaderboard. After the Leaderboard is populated, it can be useful to train some of the blueprints that DataRobot skipped. For example, you can try a more complex Keras blueprint such as Keras Residual AutoInt Classifier using Training Schedule (3 Attention Layers with 2 Heads, 2 Layers: 100, 100 Units). In some cases, you may want to directly access a trained model and retrain it with a different feature list or tune its hyperparameters.
Find blueprints for the project in the Repository¶
blueprints = project.get_blueprints()
# After retrieving the blueprints, you can search for a specific blueprint.
# In the example below, search for all blueprints that have "Gradient" in their name:
models_to_run = []
for blueprint in blueprints:
    if "Gradient" in blueprint.model_type:
        models_to_run.append(blueprint)
models_to_run
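To train any of the matching blueprints, you can submit them to the project; a minimal sketch, assuming the project's default feature list and the same sample size used above:
# Queue a training job for each matching blueprint:
for blueprint in models_to_run:
    project.train(blueprint, sample_pct=64)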
Define and train a custom blueprint¶
If you wish to instead create a custom blueprint rather than finding one from the Repository, use the following snippet. You can read more about composing custom blueprints via code by visiting the blueprint workshop in DataRobot.
pdm3 = w.Tasks.PDM3(w.TaskInputs.CAT)
pdm3.set_task_parameters(cm=50000, sc=10)
ndc = w.Tasks.NDC(w.TaskInputs.NUM)
rdt5 = w.Tasks.RDT5(ndc)
ptm3 = w.Tasks.PTM3(w.TaskInputs.TXT)
ptm3.set_task_parameters(d2=0.2, mxf=20000, d1=5, n="l2", id=True)
kerasc = w.Tasks.KERASC(rdt5, pdm3, ptm3)
kerasc.set_task_parameters(
    always_use_test_set=1,
    epochs=4,
    hidden_batch_norm=1,
    hidden_units="list(64)",
    hidden_use_bias=0,
    learning_rate=0.03,
    use_training_schedule=1,
)
# Check task documentation:
# kerasc.documentation()
kerasc_blueprint = w.BlueprintGraph(kerasc, name="A Custom Keras BP (1 Layer: 64 Units)").save()
kerasc_blueprint.show()
kerasc_blueprint.train(project_id=project.id, sample_pct=64)
Training requested! Blueprint Id: 4f5c40cbacfa89b3e37dc2f6d5c169a2
Name: 'A Custom Keras BP (1 Layer: 64 Units)'
Input Data: Categorical | Numeric | Text
Tasks: One-Hot Encoding | Numeric Data Cleansing | Smooth Ridit Transform | Matrix of word-grams occurrences | Keras Neural Network Classifier
Train a model with a different feature list¶
# Select a model from the Leaderboard:
model = dr.Model.get(project=project.id, model_id=leaderboard_top.iloc[0]["model_id"])
# Retrieve Feature Impact:
feature_impact = model.get_or_request_feature_impact()
# Create a feature list using the top 25 features based on feature impact:
feature_list = [f["featureName"] for f in feature_impact[:25]]
new_list = project.create_featurelist("new_feat_list", feature_list)
# Retrain the model using the new feature list; retrain() returns a ModelJob:
retrain_job = model.retrain(featurelist_id=new_list.id)
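If you want to block until retraining finishes, you can wait on the returned job; a minimal sketch:
# Wait for the retraining job and retrieve the resulting model:
retrained_model = retrain_job.get_result_when_complete()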
Tune the model's hyperparameters¶
tune = model.start_advanced_tuning_session()
# Get available task names, and the available parameter names
# for a task that exists on this model:
tasks = tune.get_task_names()
tune.get_parameter_names(tasks[1])
# Adjust this section as required; the task and parameter names, as well as acceptable values, differ by model:
tune.set_parameter(task_name=tasks[1], parameter_name="n_estimators", value=200)
job = tune.run()
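run() returns a model job, so you can wait for the tuned model in the same way as other jobs; a minimal sketch:
# Wait for the tuning job and retrieve the tuned model:
tuned_model = job.get_result_when_complete()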
Select the best model¶
# View the top models on the Leaderboard
leaderboard_top = get_top_of_leaderboard(project)
Unique blueprints tested: 15
Feature lists tested: 4
Models trained: 28
Blueprints in the project repository: 81

Top models on the Leaderboard:
| | model_id | model | pct | validation | cross_validation |
|---|---|---|---|---|---|
| 0 | 61ae672ab9e0c15c325b05b2 | RandomForest Classifier (Gini) | 64.0 | 0.94577 | 0.945790 |
| 1 | 61ae73aa2a2a4649c04e2c73 | RandomForest Classifier (Gini) | 64.0 | 0.94598 | 0.945712 |
| 2 | 61ae69df4e24ec75154c24ee | AVG Blender | 64.0 | 0.94573 | 0.945318 |
| 3 | 61ae65a640d62771a45b0594 | eXtreme Gradient Boosted Trees Classifier with... | 64.0 | 0.94675 | 0.945166 |
| 4 | 61ae65a540d62771a45b0592 | RandomForest Classifier (Gini) | 64.0 | 0.94542 | 0.944690 |
# Select the top model based on cross-validation accuracy (AUC):
top_model = dr.Model.get(project=project.id, model_id=leaderboard_top.iloc[0]["model_id"])
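Before an in-depth evaluation, you may also want a holdout score for the selected model. A minimal sketch; note that unlocking the holdout is irreversible for the project:
# Unlock the holdout partition and read the holdout AUC for the selected model:
project.unlock_holdout()
top_model = dr.Model.get(project=project.id, model_id=top_model.id)  # Refresh metrics
print(top_model.metrics["AUC"]["holdout"])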
In-depth model evaluation¶
Retrieve and plot the ROC curve¶
roc = top_model.get_roc_curve("validation")
df_roc = pd.DataFrame(roc.roc_points)
dr_dark_blue = "#08233F"
dr_roc_green = "#03c75f"
white = "#ffffff"
fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)
plt.scatter(df_roc.false_positive_rate, df_roc.true_positive_rate, color=dr_roc_green)
plt.plot(df_roc.false_positive_rate, df_roc.true_positive_rate, color=dr_roc_green)
plt.plot([0, 1], [0, 1], color=white, alpha=0.25)
plt.title("ROC curve")
plt.xlabel("False Positive Rate (Fallout)")
plt.xlim([0, 1])
plt.ylabel("True Positive Rate (Sensitivity)")
plt.ylim([0, 1])
plt.show()
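Each ROC point returned by the client also carries its score threshold, so you can pick an operating point directly from the retrieved data; a minimal sketch that maximizes Youden's J statistic (TPR − FPR):
# Find the threshold that maximizes TPR - FPR across the retrieved ROC points:
df_roc["youden_j"] = df_roc["true_positive_rate"] - df_roc["false_positive_rate"]
best_point = df_roc.loc[df_roc["youden_j"].idxmax()]
print("Threshold maximizing Youden's J:", best_point["threshold"])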
Retrieve and plot Feature Impact¶
max_num_features = 15
# Retrieve Feature Impact:
feature_impacts = top_model.get_or_request_feature_impact()
# Sort and plot permutation-based Feature Impact for the top features:
feature_impacts.sort(key=lambda x: x["impactNormalized"], reverse=True)
feature_impact_df = pd.DataFrame(
    [
        {"Impact Normalized": f["impactNormalized"], "Feature Name": f["featureName"]}
        for f in feature_impacts[:max_num_features]
    ]
)
plt.figure(figsize=(8, 6))
sns.barplot(y="Feature Name", x="Impact Normalized", data=feature_impact_df)
Retrieve and plot Feature Effects¶
feature_effects = top_model.get_or_request_feature_effect(source="validation")
max_features = 5
for f in feature_effects.feature_effects[:max_features]:
    plt.figure(figsize=(9, 6))
    d = pd.DataFrame(f["partial_dependence"]["data"])
    title = f["feature_name"] + ": importance=" + str(round(f["feature_impact_score"], 2))
    if f["feature_type"] == "numeric":
        # Drop missing-value bins and treat labels as numeric for a line plot:
        d = d[d["label"] != "nan"]
        d["label"] = pd.to_numeric(d["label"])
        sns.lineplot(x="label", y="dependence", data=d).set_title(title)
    else:
        sns.scatterplot(x="label", y="dependence", data=d).set_title(title)
Score data before deployment¶
# Use training data to test how the model makes predictions
test_data = df.head(50)
dataset_from_file = project.upload_dataset(test_data)
predict_job_1 = top_model.request_predictions(dataset_from_file.id)
predictions = predict_job_1.get_result_when_complete()
display(predictions.head())
| | row_id | prediction | positive_probability | prediction_threshold | class_0.0 | class_1.0 |
|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0.000192 | 0.5 | 0.999808 | 0.000192 |
| 1 | 1 | 0.0 | 0.214922 | 0.5 | 0.785078 | 0.214922 |
| 2 | 2 | 0.0 | 0.256123 | 0.5 | 0.743877 | 0.256123 |
| 3 | 3 | 0.0 | 0.000051 | 0.5 | 0.999949 | 0.000051 |
| 4 | 4 | 0.0 | 0.215951 | 0.5 | 0.784049 | 0.215951 |
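To review predictions next to the original records, you can join them back to the scored rows; a minimal sketch using the columns shown above:
# Join predictions back to the scored rows for side-by-side inspection:
scored = test_data.reset_index(drop=True).join(
    predictions[["prediction", "positive_probability"]]
)
scored.head()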
Compute Prediction Explanations¶
# Prepare prediction explanations
pe_job = dr.PredictionExplanationsInitialization.create(project.id, top_model.id)
pe_job.wait_for_completion()
# Compute prediction explanations (up to 3 per row) for the scored dataset:
pe_job2 = dr.PredictionExplanations.create(
    project.id,
    top_model.id,
    dataset_from_file.id,
    max_explanations=3,
    threshold_low=0.1,
    threshold_high=0.5,
)
pe = pe_job2.get_result_when_complete()
display(pe.get_all_as_dataframe().head())
Deploy the model¶
Once you have identified the best-performing model, you can deploy it and use DataRobot's REST API to make HTTP requests that return predictions. You can also set up batch jobs to write predictions back to the environment of your choice.
After deployment, you can take advantage of monitoring capabilities such as:
Model retraining
# Use the model ID from the previous steps (or copy one from the UI):
model_id = top_model.id
prediction_server_id = dr.PredictionServer.list()[0].id
deployment = dr.Deployment.create_from_learning_model(
    model_id,
    label="New Deployment",
    description="A new deployment",
    default_prediction_server_id=prediction_server_id,
)
deployment
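With the deployment created, one way to score new data is DataRobot's batch prediction API; a minimal sketch, where new_alerts.csv and predictions.csv are placeholder local paths:
# Score a local file against the deployment and write results back to disk:
batch_job = dr.BatchPredictionJob.score(
    deployment.id,
    intake_settings={"type": "localFile", "file": "new_alerts.csv"},
    output_settings={"type": "localFile", "path": "predictions.csv"},
)
batch_job.wait_for_completion()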