Create a clustering project¶
This notebook shows how to create a clustering project and start modeling in Manual mode via DataRobot's REST API. Manual mode lets you select and train specific blueprints. In Comprehensive Autopilot mode, some clustering blueprints can take a long time to complete; HDBSCAN, for example, is inherently slow to train. To keep run time short, this notebook trains only one blueprint (K-Means) with several different cluster counts.
Requirements¶
- DataRobot recommends Python version 3.7 or later.
- DataRobot API version 2.28.0
Import libraries¶
import datetime
import json
import time
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated
import requests
import yaml
Set credentials¶
FILE_CREDENTIALS = (
    "/Volumes/GoogleDrive/My Drive/rodrigo.miranda/mlops-admin/rodrigo.miranda_drconfig.yaml"
)
# The YAML file is expected to contain "endpoint" and "token" keys.
with open(FILE_CREDENTIALS) as config_file:
    parsed_file = yaml.safe_load(config_file)
DR_ENDPOINT = parsed_file["endpoint"]
API_TOKEN = parsed_file["token"]
AUTH_HEADERS = {"Authorization": "token %s" % API_TOKEN}
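As an alternative to a hard-coded file path, the same two values could be read from environment variables. The sketch below is only an illustration and is not part of the original notebook; the variable names `DATAROBOT_ENDPOINT` and `DATAROBOT_API_TOKEN` and the helper `credentials_from_env` are assumptions chosen for this example.

```python
import os


def credentials_from_env(env=None):
    """Return (endpoint, auth_headers) built from environment-style variables.

    The variable names used here are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    try:
        # Drop a trailing slash so "%s/projects/" does not produce "//".
        endpoint = env["DATAROBOT_ENDPOINT"].rstrip("/")
        token = env["DATAROBOT_API_TOKEN"]
    except KeyError as missing:
        raise RuntimeError("Missing credential variable: %s" % missing) from None
    # Same header format as AUTH_HEADERS in the notebook.
    return endpoint, {"Authorization": "token %s" % token}
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real environment.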
Define functions¶
The functions below handle responses, including asynchronous calls.
def wait_for_async_resolution(status_url):
    """Poll the status URL every 10 seconds until the job is no longer pending."""
    while True:
        resp = requests.get(status_url, headers=AUTH_HEADERS)
        r = json.loads(resp.content)
        try:
            statusjob = r["status"].upper()
        except (KeyError, TypeError, AttributeError):
            # No usable "status" field: treat the job as no longer pending.
            statusjob = ""
        if resp.status_code == 200 and statusjob not in ("RUNNING", "INITIALIZED"):
            print("Finished: " + str(datetime.datetime.now()))
            return resp
        print("Waiting: " + str(datetime.datetime.now()))
        time.sleep(10)  # Delays for 10 seconds.
def wait_for_result(response):
    """Return the JSON body of a response, following async jobs when needed."""
    assert response.status_code in (200, 201, 202), response.content
    if response.status_code == 200:
        data = response.json()
    elif response.status_code == 201:
        # Created: fetch the new resource from the Location header.
        status_url = response.headers["Location"]
        resp = requests.get(status_url, headers=AUTH_HEADERS)
        assert resp.status_code == 200, resp.content
        data = resp.json()
    elif response.status_code == 202:
        # Accepted: poll the status URL until the job completes.
        status_url = response.headers["Location"]
        resp = wait_for_async_resolution(status_url)
        data = resp.json()
    return data
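The polling loop above waits indefinitely if a job never leaves the `RUNNING` or `INITIALIZED` state. One way to bound the wait is sketched below. It is a minimal illustration rather than the notebook's method: `poll_until_done`, `fetch_status`, and `max_wait` are hypothetical names, and `fetch_status` stands in for the GET request so the logic can be exercised without network access.

```python
import time


def poll_until_done(fetch_status, interval=10, max_wait=600, sleep=time.sleep):
    """Call fetch_status() until it reports a non-pending status or time runs out.

    fetch_status is any zero-argument callable returning a status string
    (or None); it is a stand-in for the GET request in the notebook.
    """
    pending = {"RUNNING", "INITIALIZED"}
    waited = 0
    while True:
        status = (fetch_status() or "").upper()
        if status not in pending:
            # As in the notebook, a missing status counts as finished.
            return status
        if waited >= max_wait:
            raise TimeoutError("Job still %s after %s seconds" % (status, waited))
        sleep(interval)
        waited += interval
```

Injecting `sleep` keeps the helper testable; in real use the default `time.sleep` applies.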
Create a project¶
Endpoint: POST /api/v2/projects/
FILE_DATASET = (
    "/Volumes/GoogleDrive/My Drive/Datasets/Customer Invoices/clustering_customer_invoices.csv"
)
# Open the dataset in binary mode and close it once the upload request returns.
with open(FILE_DATASET, "rb") as dataset_file:
    payload = {"file": ("Clustering - Customer Invoices 02", dataset_file)}
    response = requests.post(
        "%s/projects/" % (DR_ENDPOINT), headers=AUTH_HEADERS, files=payload, timeout=60
    )
response
<Response [202]>
# Wait for async task to complete
print("Uploading dataset and creating Project...")
projectCreation_response = wait_for_result(response)
project_id = projectCreation_response["id"]
print("\nProject ID: " + project_id)
Uploading dataset and creating Project...
Waiting: 2022-08-09 14:54:27.008806
Waiting: 2022-08-09 14:54:37.578847
Waiting: 2022-08-09 14:54:48.139574
Waiting: 2022-08-09 14:54:58.846699
Waiting: 2022-08-09 14:55:09.401604
Waiting: 2022-08-09 14:55:19.981831
Waiting: 2022-08-09 14:55:30.551482
Finished: 2022-08-09 14:55:41.361760

Project ID: 62f25900543b1c01e5bdaf59
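The requests in this notebook build URLs with `%` formatting, which would yield a double slash if `DR_ENDPOINT` ends with `/`. A small helper along the following lines could normalize the pieces. The name `api_url` is hypothetical and the helper is only a sketch; it does not come from DataRobot's API or client library.

```python
def api_url(endpoint, *parts):
    """Join an API base URL and path segments, ending with a trailing slash."""
    segments = [endpoint.rstrip("/")]
    # Strip stray slashes from each segment before joining.
    segments.extend(str(part).strip("/") for part in parts)
    return "/".join(segments) + "/"
```

For example, `api_url(DR_ENDPOINT, "projects", project_id, "aim")` would produce the path used in the next step.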
Initiate Autopilot¶
This snippet begins modeling in Manual mode.
Endpoint: PATCH /api/v2/projects/(projectId)/aim/
payload = {"unsupervisedMode": True, "unsupervisedType": "clustering", "mode": "manual"}
response = requests.patch(
    "%s/projects/%s/aim/" % (DR_ENDPOINT, project_id),
    headers=AUTH_HEADERS,
    json=payload,
    timeout=60,
)
response
<Response [202]>
print("Creating project in Manual mode...")
project_response = wait_for_result(response)
Creating project in Manual mode...
Waiting: 2022-08-09 14:55:43.398565
Waiting: 2022-08-09 14:55:53.961746
Waiting: 2022-08-09 14:56:04.548413
Waiting: 2022-08-09 14:56:15.131273
Waiting: 2022-08-09 14:56:25.715646
Waiting: 2022-08-09 14:56:36.270683
Waiting: 2022-08-09 14:56:46.828606
Waiting: 2022-08-09 14:56:57.391746
Waiting: 2022-08-09 14:57:08.098635
Waiting: 2022-08-09 14:57:18.650575
Finished: 2022-08-09 14:57:29.453144
Retrieve blueprints¶
Endpoint: GET /api/v2/projects/(projectId)/blueprints/
response = requests.get(
    "%s/projects/%s/blueprints/" % (DR_ENDPOINT, project_id), headers=AUTH_HEADERS
)
response
<Response [200]>
r = json.loads(response.content)
r
print("Available blueprints:\n")
for bp in r:
    print(bp["modelType"])
    print(bp["id"] + "\n")
Available blueprints:

Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)
500ce93b06e38c4df2800f62ade6650d

Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) DBSCAN Hybrid Model
59478b8603dd3e270bd4277a9c1456d7

Gaussian Mixture Model
68a6aa4312a27d1fc55580f7fb1121bc

K-Means Clustering
9c08a327281f53fa366bb52c817499d2
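Blueprint IDs can differ between projects, so looking up the ID by model type is less fragile than copying it by hand. The sketch below assumes the response is a list of dictionaries with `modelType` and `id` keys, as the loop above suggests; `find_blueprint_id` is a hypothetical helper, and the sample entries are copied from the output above.

```python
def find_blueprint_id(blueprints, model_type):
    """Return the id of the single blueprint whose modelType equals model_type."""
    matches = [bp["id"] for bp in blueprints if bp["modelType"] == model_type]
    if len(matches) != 1:
        raise LookupError(
            "Expected one blueprint named %r, found %d" % (model_type, len(matches))
        )
    return matches[0]


# Sample entries copied from the output above (only the two fields used here).
sample = [
    {"modelType": "Gaussian Mixture Model", "id": "68a6aa4312a27d1fc55580f7fb1121bc"},
    {"modelType": "K-Means Clustering", "id": "9c08a327281f53fa366bb52c817499d2"},
]
kmeans_id = find_blueprint_id(sample, "K-Means Clustering")
```

In the notebook, the parsed response `r` would take the place of `sample`.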
Build models¶
Endpoint: POST /api/v2/projects/(projectId)/models/
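Next, train three K-Means models concurrently, each with a different number of clusters (3, 5, and 10). The blueprint ID used below is the K-Means Clustering ID returned in the previous step.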
payload = {"blueprintId": "9c08a327281f53fa366bb52c817499d2", "nClusters": 3}
response1 = requests.post(
    "%s/projects/%s/models/" % (DR_ENDPOINT, project_id),
    headers=AUTH_HEADERS,
    json=payload,
    timeout=60,
)
response1
<Response [202]>
payload = {"blueprintId": "9c08a327281f53fa366bb52c817499d2", "nClusters": 5}
response2 = requests.post(
    "%s/projects/%s/models/" % (DR_ENDPOINT, project_id),
    headers=AUTH_HEADERS,
    json=payload,
    timeout=60,
)
response2
<Response [202]>
payload = {"blueprintId": "9c08a327281f53fa366bb52c817499d2", "nClusters": 10}
response3 = requests.post(
    "%s/projects/%s/models/" % (DR_ENDPOINT, project_id),
    headers=AUTH_HEADERS,
    json=payload,
    timeout=60,
)
response3
<Response [202]>
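print("Waiting for models training to finish...")
# Wait for each training job; the "Finished" lines below come from these calls.
for model_response in (response1, response2, response3):
    wait_for_result(model_response)
Waiting for models training to finish...
Finished: 2022-08-09 15:06:54.415661
Finished: 2022-08-09 15:06:55.300248
Finished: 2022-08-09 15:06:56.155223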
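The three requests above differ only in `nClusters`, so they could also be generated in a loop. The following is a sketch under that assumption; `kmeans_payloads` and `submit_all` are hypothetical helpers, and `post` is an injected callable so the example runs without network access.

```python
def kmeans_payloads(blueprint_id, cluster_counts):
    """Build one model-training payload per requested number of clusters."""
    return [{"blueprintId": blueprint_id, "nClusters": n} for n in cluster_counts]


def submit_all(post, url, payloads):
    """Send each payload with post(url, payload) and collect the responses.

    post is a hypothetical two-argument callable; with requests it could wrap
    requests.post(url, headers=AUTH_HEADERS, json=payload, timeout=60).
    """
    return [post(url, payload) for payload in payloads]
```

Each collected response could then be passed to `wait_for_result`, as in the cell above.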