Create a clustering project

This notebook shows how to create a clustering project and start modeling in Manual mode through DataRobot's REST API. Manual mode lets you select and train specific blueprints. In Comprehensive Autopilot mode, some clustering blueprints can take a long time to complete; HDBSCAN, for example, is inherently slow to train. Because of these time constraints, this notebook runs only one blueprint (K-Means) and trains it with several different numbers of clusters.

Requirements

DataRobot recommends Python version 3.7 or later.
DataRobot API version 2.28.0

Import libraries

In [1]:

import datetime
import json
import time

from pandas import json_normalize  # top-level import; pandas.io.json.json_normalize is deprecated since pandas 1.0
import requests
import yaml

Set credentials

In [2]:

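# Path to a YAML file that stores the "endpoint" and "token" values.
FILE_CREDENTIALS = (
    "/Volumes/GoogleDrive/My Drive/rodrigo.miranda/mlops-admin/rodrigo.miranda_drconfig.yaml"
)

# safe_load is sufficient for a plain key/value config file; the context manager closes the file.
with open(FILE_CREDENTIALS) as config_file:
    parsed_file = yaml.safe_load(config_file)

DR_ENDPOINT = parsed_file["endpoint"]
API_TOKEN = parsed_file["token"]
AUTH_HEADERS = {"Authorization": "token %s" % API_TOKEN}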
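The credentials cell reads two keys, `endpoint` and `token`, from the parsed YAML file. The sketch below shows one way such a parsed mapping could be validated before use. The helper name `load_credentials` and the example endpoint and token values are hypothetical and are not part of the original notebook. It also strips a trailing slash, since the later cells build URLs as `"%s/projects/" % DR_ENDPOINT`.

```python
def load_credentials(config):
    """Validate a parsed config mapping and return (endpoint, headers).

    `config` is assumed to be the dict produced by parsing the YAML file.
    """
    missing = [key for key in ("endpoint", "token") if not config.get(key)]
    if missing:
        raise KeyError("Missing credential keys: " + ", ".join(missing))

    # Drop a trailing slash so "%s/projects/" does not produce "//projects/".
    endpoint = config["endpoint"].rstrip("/")
    headers = {"Authorization": "token %s" % config["token"]}
    return endpoint, headers


# Placeholder values for illustration only.
example_config = {"endpoint": "https://example.invalid/api/v2/", "token": "YOUR_API_TOKEN"}
endpoint, headers = load_credentials(example_config)
print(endpoint)
print(headers)
```

Failing early on a missing key gives a clearer error than a failed request later in the notebook.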

Define functions

The functions below handle responses, including asynchronous calls.

In [3]:

def wait_for_async_resolution(status_url):
    """Poll status_url every 10 seconds until the job is no longer running."""
    while True:
        resp = requests.get(status_url, headers=AUTH_HEADERS)

        # Some responses carry no "status" field; treat those as not running.
        try:
            statusjob = resp.json()["status"].upper()
        except (ValueError, KeyError, TypeError, AttributeError):
            statusjob = ""

        if resp.status_code == 200 and statusjob not in ("RUNNING", "INITIALIZED"):
            print("Finished: " + str(datetime.datetime.now()))
            return resp

        print("Waiting: " + str(datetime.datetime.now()))
        time.sleep(10)  # Delays for 10 seconds.


def wait_for_result(response):
    """Return the JSON body for a synchronous (200/201) or asynchronous (202) response."""
    assert response.status_code in (200, 201, 202), response.content

    if response.status_code == 200:
        # The result is already in the response body.
        data = response.json()

    elif response.status_code == 201:
        # The resource was created; fetch it from the Location header.
        status_url = response.headers["Location"]
        resp = requests.get(status_url, headers=AUTH_HEADERS)
        assert resp.status_code == 200, resp.content
        data = resp.json()

    elif response.status_code == 202:
        # The job was accepted; poll the Location header until it finishes.
        status_url = response.headers["Location"]
        resp = wait_for_async_resolution(status_url)
        data = resp.json()

    return data
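The polling loop above has no upper bound, so a job that never leaves the running state would block the notebook indefinitely. The sketch below shows the same polling idea with a timeout. It is an illustration, not the notebook's implementation: `poll_until_done` and `fetch_status` are hypothetical names, `fetch_status` stands in for the HTTP request so that the example runs without network access, and the `"COMPLETED"` status string is a made-up value for the simulation.

```python
import time


def poll_until_done(fetch_status, interval=10, timeout=600,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until the job is no longer RUNNING or INITIALIZED.

    fetch_status is a hypothetical callable that returns a
    (status_code, status_string) tuple for the job being polled.
    Raises TimeoutError once `timeout` seconds have elapsed without a result.
    """
    deadline = clock() + timeout
    while True:
        status_code, status = fetch_status()
        if status_code == 200 and status.upper() not in ("RUNNING", "INITIALIZED"):
            return status
        if clock() >= deadline:
            raise TimeoutError("Job did not finish within %s seconds" % timeout)
        sleep(interval)


# Simulated job that finishes on the third poll.
simulated = iter([(200, "INITIALIZED"), (200, "RUNNING"), (200, "COMPLETED")])
final_status = poll_until_done(lambda: next(simulated), interval=0)
print(final_status)
```

Passing the sleep and clock functions as parameters keeps the loop easy to test; in a notebook the defaults would normally be left unchanged.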

Create a project

Endpoint: POST /api/v2/projects/

In [4]:

FILE_DATASET = (
    "/Volumes/GoogleDrive/My Drive/Datasets/Customer Invoices/clustering_customer_invoices.csv"
)

In [5]:

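# Upload the dataset in binary mode; the first tuple element is sent as the file name.
with open(FILE_DATASET, "rb") as dataset:
    payload = {"file": ("Clustering - Customer Invoices 02", dataset)}

    response = requests.post(
        "%s/projects/" % (DR_ENDPOINT), headers=AUTH_HEADERS, files=payload, timeout=60
    )

response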

Out[5]:

<Response [202]>

In [6]:

# Wait for async task to complete

print("Uploading dataset and creating Project...")

projectCreation_response = wait_for_result(response)

project_id = projectCreation_response["id"]
print("\nProject ID: " + project_id)

Uploading dataset and creating Project...
Waiting: 2022-08-09 14:54:27.008806
Waiting: 2022-08-09 14:54:37.578847
Waiting: 2022-08-09 14:54:48.139574
Waiting: 2022-08-09 14:54:58.846699
Waiting: 2022-08-09 14:55:09.401604
Waiting: 2022-08-09 14:55:19.981831
Waiting: 2022-08-09 14:55:30.551482
Finished: 2022-08-09 14:55:41.361760

Project ID: 62f25900543b1c01e5bdaf59

Initiate Autopilot

The request below configures the project for unsupervised clustering and starts modeling in Manual mode.

Endpoint: PATCH /api/v2/projects/(projectId)/aim/

In [7]:

payload = {"unsupervisedMode": True, "unsupervisedType": "clustering", "mode": "manual"}

response = requests.patch(
    "%s/projects/%s/aim/" % (DR_ENDPOINT, project_id),
    headers=AUTH_HEADERS,
    json=payload,
    timeout=60,
)

response

Out[7]:

<Response [202]>

In [8]:

print("Creating project in Manual mode...")

project_response = wait_for_result(response)

Creating project in Manual mode...
Waiting: 2022-08-09 14:55:43.398565
Waiting: 2022-08-09 14:55:53.961746
Waiting: 2022-08-09 14:56:04.548413
Waiting: 2022-08-09 14:56:15.131273
Waiting: 2022-08-09 14:56:25.715646
Waiting: 2022-08-09 14:56:36.270683
Waiting: 2022-08-09 14:56:46.828606
Waiting: 2022-08-09 14:56:57.391746
Waiting: 2022-08-09 14:57:08.098635
Waiting: 2022-08-09 14:57:18.650575
Finished: 2022-08-09 14:57:29.453144

Retrieve blueprints

Endpoint: GET /api/v2/projects/(projectId)/blueprints/

In [9]:

response = requests.get(
    "%s/projects/%s/blueprints/" % (DR_ENDPOINT, project_id), headers=AUTH_HEADERS
)

response

Out[9]:

<Response [200]>

In [ ]:

r = json.loads(response.content)

r

In [12]:

print("Available blueprints:\n")
for bp in r:
    print(bp["modelType"])
    print(bp["id"] + "\n")

Available blueprints:

Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)
500ce93b06e38c4df2800f62ade6650d

Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) DBSCAN Hybrid Model
59478b8603dd3e270bd4277a9c1456d7

Gaussian Mixture Model
68a6aa4312a27d1fc55580f7fb1121bc

K-Means Clustering
9c08a327281f53fa366bb52c817499d2
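The blueprint IDs listed above belong to this particular project, so hard-coding one in later cells may not carry over to a different dataset. As a minimal sketch, the ID could instead be looked up by its `modelType` name. The helper `find_blueprint_id` is a hypothetical name, and the sample list reuses only the `modelType` and `id` fields printed above; a real response contains more fields.

```python
def find_blueprint_id(blueprints, model_type):
    """Return the id of the first blueprint whose modelType matches exactly."""
    for bp in blueprints:
        if bp["modelType"] == model_type:
            return bp["id"]
    raise LookupError("No blueprint with modelType %r" % model_type)


# Trimmed sample shaped like the list printed above.
sample_blueprints = [
    {"modelType": "Gaussian Mixture Model", "id": "68a6aa4312a27d1fc55580f7fb1121bc"},
    {"modelType": "K-Means Clustering", "id": "9c08a327281f53fa366bb52c817499d2"},
]

kmeans_id = find_blueprint_id(sample_blueprints, "K-Means Clustering")
print(kmeans_id)
```

In the notebook, the parsed response `r` would take the place of `sample_blueprints`.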

Build models

Endpoint: POST /api/v2/projects/(projectId)/models/

Next, train three K-Means models, each with a different number of clusters (3, 5, and 10). The requests are submitted back to back, so the models train simultaneously.

In [13]:

payload = {"blueprintId": "9c08a327281f53fa366bb52c817499d2", "nClusters": 3}

response1 = requests.post(
    "%s/projects/%s/models/" % (DR_ENDPOINT, project_id),
    headers=AUTH_HEADERS,
    json=payload,
    timeout=60,
)

response1

Out[13]:

<Response [202]>

In [14]:

payload = {"blueprintId": "9c08a327281f53fa366bb52c817499d2", "nClusters": 5}

response2 = requests.post(
    "%s/projects/%s/models/" % (DR_ENDPOINT, project_id),
    headers=AUTH_HEADERS,
    json=payload,
    timeout=60,
)

response2

Out[14]:

<Response [202]>

In [15]:

payload = {"blueprintId": "9c08a327281f53fa366bb52c817499d2", "nClusters": 10}

response3 = requests.post(
    "%s/projects/%s/models/" % (DR_ENDPOINT, project_id),
    headers=AUTH_HEADERS,
    json=payload,
    timeout=60,
)

response3

Out[15]:

<Response [202]>
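The three cells above differ only in the `nClusters` value. One way to avoid the repetition is to generate the payloads from a list of cluster counts, as in the sketch below. The helper `build_model_payloads` is a hypothetical name; the payload keys `blueprintId` and `nClusters` are the ones used in the cells above.

```python
def build_model_payloads(blueprint_id, cluster_counts):
    """Build one request payload per requested number of clusters."""
    return [{"blueprintId": blueprint_id, "nClusters": n} for n in cluster_counts]


payloads = build_model_payloads("9c08a327281f53fa366bb52c817499d2", [3, 5, 10])
for payload in payloads:
    print(payload)
```

Each payload could then be sent with the same `requests.post` call shown above, collecting the responses in a list instead of separate `response1`, `response2`, and `response3` variables.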

In [29]:

print("Waiting for models training to finish...")
model_results = [wait_for_result(resp) for resp in (response1, response2, response3)]  # one "Finished" line per model

Waiting for models training to finish...
Finished: 2022-08-09 15:06:54.415661
Finished: 2022-08-09 15:06:55.300248
Finished: 2022-08-09 15:06:56.155223