Deploy the MLOps agent on GKE

The following steps describe how to deploy the MLOps agent on Google Kubernetes Engine (GKE) with Pub/Sub as a spooler, which allows you to monitor a custom Python model developed outside DataRobot. The custom model is scored on your local machine and sends its statistics to Google Cloud Platform (GCP) Pub/Sub. The agent, deployed on GKE, then consumes this data and sends it to the DataRobot MLOps dashboard.

Overview

DataRobot MLOps offers the ability to monitor all your ML models (trained in DataRobot or outside) in a centralized dashboard with the DataRobot MLOps agent. The agent, a Java utility running in parallel with the deployed model, can monitor models developed in Java, Python, and R programming languages.

The MLOps agent communicates with the model via a spooler (i.e., file system, GCP Pub/Sub, AWS SQS, or RabbitMQ) and sends model statistics back to the MLOps dashboard. These can include the number of scored records, number of features, scoring time, data drift, and more. You can embed the agent into a Docker image and deploy it on a Kubernetes cluster for scalability and robustness.

Prerequisites

You must complete the following steps before creating the main configuration.

  1. Install the Google Cloud SDK specific to your operating system.

  2. Run the following at a command prompt:

    gcloud init

    You will be asked to choose an existing project or create a new one, as well as to select the compute zone.

  3. Install the Kubernetes command-line tool:

    gcloud components install kubectl

  4. Retrieve your Google Cloud service account credentials to call Google Cloud APIs. If you don’t have a default service account, you can create it by following this procedure.

  5. Once credentials are in place, download the JSON file that contains them. Later, when it is time to pass your credentials to the application that will call Google Cloud APIs, this guide does so through the GOOGLE_APPLICATION_CREDENTIALS environment variable.
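
Before passing the key file to an application, a quick check of its contents can catch a wrong or truncated download. The following is a minimal sketch that assumes the standard layout of a Google Cloud service account key file; the function name and the list of checked fields are illustrative, not part of any DataRobot or Google tooling.

```python
import json
from pathlib import Path

# Fields a service account key file is expected to contain (a subset, chosen for this sketch).
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def missing_key_fields(key_path):
    """Return a list of problems found in the key file; an empty list means none were found."""
    data = json.loads(Path(key_path).read_text(encoding="utf-8"))
    problems = [name for name in REQUIRED_FIELDS if not data.get(name)]
    # Other credential types (for example, user credentials) use a different "type" value.
    if data.get("type") and data["type"] != "service_account":
        problems.append("type is not 'service_account'")
    return problems
```

For example, `missing_key_fields("your-google-application-credentials.json")` returns an empty list when all of the checked fields are present.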

Procedure

The following sections, each a step in the process, describe the procedure for deploying the MLOps agent on GKE with Pub/Sub.

Create an external deployment

First, create an external deployment. You will use the resulting model ID and deployment ID to configure communications with the agent (described in the instructions for running Docker locally).

Create a Pub/Sub topic and subscription

Second, create a Pub/Sub topic and subscription:

  1. Go to your Google Cloud console Pub/Sub service and create a topic (i.e., a named resource where publishers can send messages).

  2. Create a subscription—a named resource representing the stream of messages from a single, specific topic, to be delivered to the subscribing application. Use the Pub/Sub topic from the previous step and set Delivery type to Pull. This provides a Subscription ID.

    Additionally, you can configure message retention duration and other parameters.
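
If you prefer to script this step instead of using the console, the sketch below shows one possible approach. It assumes the google-cloud-pubsub client library (2.x) is installed and that application default credentials are available; the function names are illustrative. The two path helpers only build the fully qualified resource names that Pub/Sub uses.

```python
def topic_path(project_id, topic_id):
    """Fully qualified Pub/Sub topic name."""
    return f"projects/{project_id}/topics/{topic_id}"


def subscription_path(project_id, subscription_id):
    """Fully qualified Pub/Sub subscription name."""
    return f"projects/{project_id}/subscriptions/{subscription_id}"


def create_topic_and_subscription(project_id, topic_id, subscription_id):
    """Create the topic, then a pull subscription attached to it."""
    # Imported here so the path helpers work without the client library installed.
    from google.cloud import pubsub_v1

    topic = topic_path(project_id, topic_id)
    subscription = subscription_path(project_id, subscription_id)

    pubsub_v1.PublisherClient().create_topic(request={"name": topic})
    # No push configuration is supplied, so the subscription uses pull delivery.
    pubsub_v1.SubscriberClient().create_subscription(
        request={"name": subscription, "topic": topic}
    )
    return topic, subscription
```

Calling `create_topic_and_subscription("YOUR-GOOGLE-PROJECT-ID", "YOUR-PUBSUB-TOPIC-ID", "YOUR-SUBSCRIPTION-ID")` contacts Google Cloud and fails if either resource already exists.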

Embed MLOps agent in Docker

To create a Docker image that embeds the agent:

  1. Create the working directory on the machine where you will prepare the necessary files.

  2. Create a directory named conf.

  3. Download and unzip the tarball file with the MLOps agent from Developer Tools.

  4. Copy the mlops.log4j2.properties file from <unzipped directory>/conf to your <working directory>/conf.

  5. Copy the file mlops.agent.conf.yaml to the working directory. Provide the following parameters (the example uses defaults for all other parameters):

    Parameter    Definition
    mlopsUrl     Installation URL for Self-Managed AI Platform; app.datarobot.com for managed AI Platform
    apiToken     DataRobot key
    projectId    GCP ProjectId
    topicName    Created in the Pub/Sub section

    For example:

    mlopsUrl: "MLOPS-URL"
    apiToken: "YOUR-DR-API-TOKEN"
    channelConfigs:
    - type: "PUBSUB_SPOOL"
      details: {name: "pubsub", projectId: "YOUR-GOOGLE-PROJECT-ID", topicName: "YOUR-PUBSUB-TOPIC-ID-DEFINED-AT-STEP-2"}
    
  6. Copy the <unzipped directory>/lib/mlops-agent-X.X.X.jar file to your working directory.

  7. In the working directory, create the Dockerfile using the following content:

    FROM openjdk:8
    
    ENV AGENT_BASE_LOC=/opt/datarobot/ma
    ENV AGENT_LOG_PROPERTIES=mlops.log4j2.properties
    ENV AGENT_CONF_LOC=$AGENT_BASE_LOC/conf/mlops.agent.conf.yaml
    
    COPY mlops-agent-*.jar ${AGENT_BASE_LOC}/mlops-agent.jar
    COPY conf $AGENT_BASE_LOC/conf
    COPY entrypoint.sh /
    
    RUN chmod +x /entrypoint.sh
    
    ENTRYPOINT ["./entrypoint.sh"]
    
  8. Create entrypoint.sh with the following content:

    #!/bin/sh
    
    echo "######## STARTING MLOPS-AGENT ########"
    echo
    
    exec java -Dlog.file=$AGENT_BASE_LOC/logs/mlops.agent.log \
        -Dlog4j.configurationFile=file:$AGENT_BASE_LOC/conf/$AGENT_LOG_PROPERTIES \
        -cp $AGENT_BASE_LOC/mlops-agent.jar com.datarobot.mlops.agent.Agent \
        --config $AGENT_CONF_LOC
    
  9. Create the Docker image, ensuring you include the period (.) at the end of the Docker build command.

    export PROJECT_ID=ai-XXXXXXX-111111
    
    docker build -t gcr.io/${PROJECT_ID}/monitoring-agents:v1 .
    
  10. Run the docker images command to verify a successful build.
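
A build can fail late if one of the files from the steps above is missing. The following sketch checks the working directory before you run docker build. The file names come from the steps above; the function name is illustrative, and the single-jar check reflects the assumption that the Dockerfile's COPY of mlops-agent-*.jar to one destination file expects exactly one match.

```python
from pathlib import Path

# Files the steps above place in the working directory.
# mlops.agent.conf.yaml is mounted at run time rather than copied by the Dockerfile.
EXPECTED_FILES = (
    "Dockerfile",
    "entrypoint.sh",
    "mlops.agent.conf.yaml",
    "conf/mlops.log4j2.properties",
)


def missing_build_inputs(workdir):
    """Return a list of problems found in the working directory; empty means ready to build."""
    root = Path(workdir)
    problems = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    jars = list(root.glob("mlops-agent-*.jar"))
    if len(jars) != 1:
        problems.append(f"expected one mlops-agent-*.jar, found {len(jars)}")
    return problems
```

Run `missing_build_inputs(".")` from the working directory; an empty list means every expected file was found.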

Run Docker locally

Note

While technically an optional step, best practice advises always testing your image locally to save time and network bandwidth.

The monitoring agent tarball includes the necessary library (along with Java and R libraries) for sending statistics from the custom Python model back to MLOps. You can find the libraries in the lib directory.

To run locally:

  1. Install the DataRobot_MLOps library for Python:

    pip install datarobot_mlops_package-<VERSION>/lib/datarobot_mlops-<VERSION>-py2.py3-none-any.whl

  2. Run your Docker container image.

    Note

    You will need the JSON file with credentials that you downloaded in the prerequisites (the step that describes downloading Google Cloud account credentials).

    docker run -it --rm --name ma \
        -v /path-to-your-directory/mlops.agent.conf.yaml:/opt/datarobot/ma/conf/mlops.agent.conf.yaml \
        -v /path-to-your-directory/your-google-application-credentials.json:/opt/datarobot/ma/conf/gac.json \
        -e GOOGLE_APPLICATION_CREDENTIALS="/opt/datarobot/ma/conf/gac.json" \
        gcr.io/${PROJECT_ID}/monitoring-agents:v1
    

    The following is an example of the Python code where your model is scored:

    import time
    
    import pandas as pd
    from datarobot_mlops.mlops import MLOps
    
    DEPLOYMENT_ID = "EXTERNAL-DEPLOYMENT-ID-DEFINED-AT-STEP-1"
    MODEL_ID = "EXTERNAL-MODEL-ID-DEFINED-AT-STEP-1"
    PROJECT_ID = "YOUR-GOOGLE-PROJECT-ID"
    TOPIC_ID = "YOUR-PUBSUB-TOPIC-ID-DEFINED-AT-STEP-2"
    
    # MLOPS: initialize the MLOps instance
    mlops = MLOps() \
        .set_deployment_id(DEPLOYMENT_ID) \
        .set_model_id(MODEL_ID) \
        .set_pubsub_spooler(PROJECT_ID, TOPIC_ID) \
        .init()
    
    # Read your custom model pickle file (model has been trained outside DataRobot)
    model = pd.read_pickle('custom_model.pickle')
    
    # Read scoring data
    features_df_scoring = pd.read_csv('features.csv')
    
    # Get predictions and measure how long scoring takes
    start_time = time.time()
    predictions = model.predict_proba(features_df_scoring)
    predictions = predictions.tolist()
    num_predictions = len(predictions)
    end_time = time.time()
    
    # MLOPS: report the number of predictions in the request and the execution time
    mlops.report_deployment_stats(num_predictions, end_time - start_time)
    
    # MLOPS: report the features and predictions
    mlops.report_predictions_data(features_df=features_df_scoring, predictions=predictions)
    
    # MLOPS: release MLOps resources when finished
    mlops.shutdown()
    
  3. Set the GOOGLE_APPLICATION_CREDENTIALS environment variable:

    export GOOGLE_APPLICATION_CREDENTIALS="<your-google-application-credentials.json>"

  4. Score your data locally to test if the model works as expected. You will then be able to see a new record in the monitoring agent log:

    python score-your-model.py

    The statistics in the MLOps dashboard are updated as well.
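
To exercise the scoring logic without a running agent or Pub/Sub topic, you can separate the scoring and reporting calls from the MLOps object itself. The sketch below is one way to do that: score_and_report accepts any object that offers the two reporting methods used in the scoring example, and RecordingReporter is a hypothetical stand-in that only records what it is given. Neither name is part of the DataRobot MLOps library.

```python
import time


class RecordingReporter:
    """Hypothetical stand-in for the MLOps object; it records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def report_deployment_stats(self, num_predictions, execution_time):
        self.calls.append(("stats", num_predictions, execution_time))

    def report_predictions_data(self, features_df, predictions):
        self.calls.append(("data", len(predictions)))


def score_and_report(model, features, reporter):
    """Score the features, then report the prediction count, timing, and prediction data."""
    start_time = time.time()
    raw = model.predict_proba(features)
    # NumPy arrays are converted to plain lists, as in the scoring example.
    predictions = raw.tolist() if hasattr(raw, "tolist") else list(raw)
    elapsed = time.time() - start_time

    reporter.report_deployment_stats(len(predictions), elapsed)
    reporter.report_predictions_data(features_df=features, predictions=predictions)
    return predictions
```

In the real script you would pass the initialized MLOps instance as the reporter; in a dry run, pass a RecordingReporter and inspect its calls list.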

Push Docker image to the Container Registry

After you have tested and validated the container image locally, upload it to a registry so that your Google Kubernetes Engine (GKE) cluster can download and run it.

  1. Configure the Docker command-line tool to authenticate to Container Registry:

    gcloud auth configure-docker

  2. Push the Docker image you built to the Container Registry:

    docker push gcr.io/${PROJECT_ID}/monitoring-agents:v1

Create the GKE cluster

After storing the Docker image in the Container Registry, create a GKE cluster as follows:

  1. Set your project ID and Compute Engine zone options for the gcloud tool:

    gcloud config set project $PROJECT_ID

    gcloud config set compute/zone europe-west1-b

  2. Create a cluster.

    Note

    This example, for simplicity, creates a private cluster with unrestricted access to the public endpoint. For security, be sure to restrict access to the control plane for your production environment. Find detailed information about configuring different GKE private clusters here.

    gcloud container clusters create monitoring-agents-cluster \
        --network default \
        --create-subnetwork name=my-subnet-0 \
        --no-enable-master-authorized-networks \
        --enable-ip-alias \
        --enable-private-nodes \
        --master-ipv4-cidr 172.16.0.32/28 \
        --no-enable-basic-auth \
        --no-issue-client-certificate
    

    Where:

    Parameter                                 Result
    --create-subnetwork name=my-subnet-0      Causes GKE to automatically create a subnet named my-subnet-0.
    --no-enable-master-authorized-networks    Disables authorized networks for the cluster.
    --enable-ip-alias                         Makes the cluster VPC-native.
    --enable-private-nodes                    Indicates that the cluster's nodes do not have external IP addresses.
    --master-ipv4-cidr 172.16.0.32/28         Specifies an internal address range for the control plane. This setting is permanent for this cluster.
    --no-enable-basic-auth                    Disables basic auth for the cluster.
    --no-issue-client-certificate             Disables issuing a client certificate.

  3. Run the following command to see the cluster worker instances:

    gcloud compute instances list
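
Because the control plane range is permanent for the cluster, it can be worth checking the value before you create the cluster. The sketch below uses Python's ipaddress module and assumes, based on the gcloud reference for --master-ipv4-cidr, that the range must be a /28 and should not overlap other ranges in the VPC. The function name and the other_ranges parameter are illustrative.

```python
import ipaddress


def check_control_plane_cidr(cidr, other_ranges=()):
    """Return a list of problems with the proposed control plane range; empty means none found."""
    # ip_network raises ValueError if the address has host bits set (for example, 172.16.0.33/28).
    network = ipaddress.ip_network(cidr)
    problems = []
    if network.prefixlen != 28:
        problems.append(f"{cidr} is a /{network.prefixlen}, expected a /28")
    for other in other_ranges:
        if network.overlaps(ipaddress.ip_network(other)):
            problems.append(f"{cidr} overlaps {other}")
    return problems
```

For the value used in this example, `check_control_plane_cidr("172.16.0.32/28")` reports no problems; pass the subnet ranges already used in your VPC as other_ranges to check for overlap.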

Create a cloud router

The MLOps agent running on a GKE private cluster needs access to the DataRobot MLOps service. To do this, you must give the private nodes outbound access to the internet, which you can achieve using a NAT cloud router (Google documentation here).

  1. Create a cloud router:

    gcloud compute routers create nat-router \
        --network default \
        --region europe-west1
    
  2. Add configuration to the router.

    gcloud compute routers nats create nat-config \
        --router-region europe-west1 \
        --router nat-router \
        --nat-all-subnet-ip-ranges \
        --auto-allocate-nat-external-ips
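
The router commands take a region, while the cluster was configured with a zone, and the two need to agree. As a small sketch, the helper below derives the region from a Compute Engine zone name and builds the argument lists for the two commands in this section. The function names are illustrative; the flags are the ones shown in this section.

```python
def region_from_zone(zone):
    """Return the region part of a zone name such as 'europe-west1-b'."""
    region, separator, suffix = zone.rpartition("-")
    if not (region and separator and suffix):
        raise ValueError(f"not a zone name: {zone!r}")
    return region


def nat_router_commands(zone, network="default", router="nat-router", nat="nat-config"):
    """Build the two gcloud argument lists for the router and its NAT configuration."""
    region = region_from_zone(zone)
    create_router = [
        "gcloud", "compute", "routers", "create", router,
        "--network", network,
        "--region", region,
    ]
    create_nat = [
        "gcloud", "compute", "routers", "nats", "create", nat,
        "--router-region", region,
        "--router", router,
        "--nat-all-subnet-ip-ranges",
        "--auto-allocate-nat-external-ips",
    ]
    return [create_router, create_nat]
```

Each list can be passed to subprocess.run(command, check=True) if you want to execute it from Python rather than from a shell.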
    

Create K8s ConfigMaps

With the cloud router configured, you can now create K8s ConfigMaps to contain the MLOps agent configuration and Google credentials. You will need the downloaded JSON credentials file created during the prerequisites stage.

Note

Use K8s Secrets to save your configuration files for production usage.

Use the following code to create ConfigMaps:

kubectl create configmap ma-configmap --from-file=mlops.agent.conf.yaml=your-path/mlops.agent.conf.yaml

kubectl create configmap gac-configmap --from-file=gac.json=your-google-application-credentials.json
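
For production, where the note above recommends Secrets, one option is to generate a Secret manifest from the same files. The following is a minimal sketch, not a DataRobot-provided tool: it builds an Opaque Secret as a Python dictionary with base64-encoded file contents, which you can write out as JSON and apply with kubectl. The function name and the Secret name in the usage note are hypothetical.

```python
import base64
import json
from pathlib import Path


def build_secret_manifest(name, files):
    """Build a Kubernetes Secret manifest; files maps each Secret key to a local file path."""
    data = {
        key: base64.b64encode(Path(path).read_bytes()).decode("ascii")
        for key, path in files.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": data,
    }


def write_manifest(manifest, output_path):
    """Write the manifest as JSON, which kubectl apply -f accepts alongside YAML."""
    Path(output_path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

For example, `write_manifest(build_secret_manifest("gac-secret", {"gac.json": "your-google-application-credentials.json"}), "gac-secret.json")` produces a file you can apply with kubectl apply -f gac-secret.json. A Deployment that uses it would reference the Secret through a secret volume (with secretName) instead of a configMap volume.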

Create the K8s Deployment

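To create the deployment, create the ma-deployment.yaml file with the following content. Replace ${PROJECT_ID} in the image field with your actual project ID, because kubectl does not expand shell variables in manifest files: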

Note

This example uses three always-running replicas; for autoscaling, use kubectl autoscale deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ma-deployment
  labels:
    app: ma
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ma
  template:
    metadata:
      labels:
        app: ma
    spec:
      containers:
      - name: ma
        image: gcr.io/${PROJECT_ID}/monitoring-agents:v1
        volumeMounts:
        - name:  agent-conf-volume
          mountPath: /opt/datarobot/ma/conf/mlops.agent.conf.yaml
          subPath: mlops.agent.conf.yaml
        - name:  gac-conf-volume
          mountPath: /opt/datarobot/ma/conf/gac.json
          subPath: gac.json
        env:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /opt/datarobot/ma/conf/gac.json
        ports:
        - containerPort: 80
      volumes:
      - name:  agent-conf-volume
        configMap:
          items:
          - key: mlops.agent.conf.yaml
            path: mlops.agent.conf.yaml
          name: ma-configmap
      - name:  gac-conf-volume
        configMap:
          items:
          - key: gac.json
            path: gac.json
          name: gac-configmap
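
One way to fill in the image placeholder without editing the file by hand is to render the manifest with Python's string.Template, which understands the ${NAME} syntax used in the image field. This is a sketch under that assumption; the function name and the output file name in the usage note are illustrative.

```python
from pathlib import Path
from string import Template


def render_manifest(text, project_id):
    """Replace ${PROJECT_ID} in the manifest text with the given project ID."""
    # safe_substitute leaves any other $-style placeholders untouched instead of raising.
    return Template(text).safe_substitute(PROJECT_ID=project_id)


def render_manifest_file(source_path, output_path, project_id):
    """Read a manifest, substitute the project ID, and write the result to a new file."""
    rendered = render_manifest(Path(source_path).read_text(encoding="utf-8"), project_id)
    Path(output_path).write_text(rendered, encoding="utf-8")
    return rendered
```

For example, `render_manifest_file("ma-deployment.yaml", "ma-deployment.rendered.yaml", "YOUR-GOOGLE-PROJECT-ID")` writes a copy with the image reference filled in, which you can then pass to kubectl apply -f.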

Next, create the deployment with the following command:

kubectl apply -f ma-deployment.yaml

Finally, check the running pods:

kubectl get pods

Score the model

Score your local model and verify the output.

  1. Score your local model:

    python score-your-model.py

  2. Check the GKE Pod log; it shows that one record has been sent to DataRobot.

  3. Check the Pub/Sub log.

  4. Check the DataRobot MLOps dashboard.

Clean up

  1. Delete the NAT in the cloud router:

    gcloud compute routers nats delete nat-config --router=nat-router --router-region=europe-west1

  2. Delete the cloud router:

    gcloud compute routers delete nat-router --region=europe-west1

  3. Delete the cluster:

    gcloud container clusters delete monitoring-agents-cluster


Updated August 6, 2024