Configure management agent environment plugins¶
Management agent plugins deploy and manage models in a given prediction environment. The management agent submits commands to the plugin, and the plugin executes them and returns the status of the command to the management agent. To facilitate this interaction, you provide prediction environment details during plugin configuration, allowing the plugin to execute commands in that environment. For example, a Kubernetes plugin can launch a deployment (container) in a Kubernetes cluster, replace a model in the deployment, stop the container, etc.
The MLOps management agent contains the following example plugins:
- Filesystem plugin.
- Docker plugin.
- Kubernetes plugin.
- Test plugin.
These example plugins are installed as part of the datarobot_bosun-*-py3-none-any.whl wheel file.
Configure example plugins¶
The following example plugins require additional configuration for use with the management agent:
To enable communication between the management agent and the deployment, the filesystem plugin creates one directory per deployment in the local filesystem and downloads each deployment's model package and configuration .yaml file into the deployment's local directory. These artifacts can then be used to serve predictions from a PPS container.
```yaml
# The top-level directory that will be used to store each deployment directory
baseDir: "."

# Each deployment directory will be prefixed with the following string
deploymentDirPrefix: "deployment_"

# The name of the deployment config file to create inside the deployment directory.
# Note: If working with the PPS, DO NOT change this name; the PPS expects this filename.
deploymentInfoFile: "config.yml"

# If defined, this string will be prefixed to the predictions URL for this deployment,
# and the URL will be returned, with the deployment id suffixed to the end with the
# /predict endpoint.
deploymentPredictionBaseUrl: "http://localhost:8080"

# If defined, create a yaml file with the kv of the deployment.
# If the name of the file is the same as the deploymentInfoFile,
# the key values are added to the same file as the other config.
# deploymentKVFile: "kv.yaml"
```
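Per the configuration above, when deploymentPredictionBaseUrl is defined, the prediction URL for a deployment is the base URL with the deployment ID and the /predict endpoint appended. A minimal sketch of that composition (the helper name is hypothetical, not part of the plugin):

```python
# Hypothetical helper illustrating the URL pattern described above:
# <deploymentPredictionBaseUrl>/<deployment_id>/predict
def build_prediction_url(base_url: str, deployment_id: str) -> str:
    return f"{base_url.rstrip('/')}/{deployment_id}/predict"

print(build_prediction_url("http://localhost:8080", "dep-123"))
# http://localhost:8080/dep-123/predict
```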
The Docker plugin can deploy native DataRobot models and custom models on a Docker server. In addition, the plugin automatically runs the monitoring agent to monitor deployed models and uses the traefik reverse proxy to provide a single prediction endpoint for each deployment.
The management agent's Docker plugin supports the use of the Portable Prediction Server, allowing a single Docker container to serve multiple models. It enables you to configure the PPS to indicate where models for each deployment are located and gives you the ability to start, stop, and manage deployments.
The Docker plugin can:
- Retrieve a model package from DataRobot for a deployment.
- Launch the DataRobot model within the Docker container.
- Shut down and clean up the Docker container.
- Report status back via events.
- Monitor predictions using the monitoring agent.
To configure the Docker plugin, take the following steps:
Set up the environment required for the Docker plugin:
```sh
docker pull rabbitmq:3-management
docker pull traefik:2.3.3
docker network create bosun
```
Build the monitoring agent container image:
```sh
cd datarobot_mlops_package-*/
cd tools/agent_docker
make build
```
Download the Portable Prediction Server from the DataRobot UI. If you are planning to use a custom model image, make sure the image is built and accessible to the Docker service.
Configure the Docker plugin configuration file, plugin.docker.conf.yaml:
```yaml
# Docker network on which to run all containers.
# This network must be created prior to running
# the agent (i.e., 'docker network create <NAME>')
dockerNetwork: "bosun"

# Traefik image to use
traefikImage: "traefik:2.3.3"

# Address that will be reported to DataRobot
outfacingPredictionURLPrefix: "http://10.10.12.22:81"

# MLOps agent image to use for monitoring
agentImage: "datarobot/mlops-tracking-agent:latest"

# RabbitMQ image to use for building a channel
rabbitmqImage: "rabbitmq:3-management"

# PPS base image
ppsBaseImage: "datarobot/datarobot-portable-prediction-api:latest"

# Prefix for generated images
generatedImagePrefix: "mlops_"

# Prefix for running containers
containerNamePrefix: "mlops_"

# Mapping of traefik proxy ports (not mandatory)
traefikPortMapping:
  80: 81
  8080: 8081

# Mapping of RabbitMQ ports (not mandatory)
rabbitmqPortMapping:
  15672: 15673
  5672: 5673
```
DataRobot provides a plugin to deploy and manage models in your Kubernetes cluster without writing any additional code. For configuration information, see the README file in the tools/charts/datarobot-management-agent folder in the tarball.
```yaml
## The following settings are related to connecting to your Kubernetes cluster
#
# The name of the kube-config context to use (similar to the --context argument of kubectl). There is a special
# `IN_CLUSTER` string to be used if you are running the plugin inside a cluster. The default is "IN_CLUSTER".
# kubeConfigContext: IN_CLUSTER

# The namespace in which you want to create and manage external deployments (similar to the --namespace
# argument of kubectl). You can leave this as `null` to use the "default" namespace, the namespace defined
# in your context, or (if running `IN_CLUSTER`) manage resources in the same namespace the plugin is
# executing in.
# kubeNamespace:

## The following settings are related to whether or not MLOps monitoring is enabled
#
# We need to know the location of the dockerized agent image that can be launched into your Kubernetes cluster.
# You can build the image by running `make build` in the tools/agent_docker/ directory, retagging the image,
# and pushing it to your registry.
agentImage: "<FILL-IN-DOCKER-REGISTRY>/mlops-tracking-agent:latest"

## The following settings are all related to accessing the model from outside the Kubernetes cluster
#
# The URL prefix used to access the deployed model, e.g., https://example.com/deployments/
# The model will be accessible via <outfacingPredictionURLPrefix>/<model_id>/predict
outfacingPredictionURLPrefix: "<FILL-CORRECT-URL-FOR-K8S-INGRESS>"

# We are still using the beta Ingress resource API, so a class must be provided. If your cluster
# doesn't have a default ingress class, please provide one.
# ingressClass:

## The following settings are all related to building the finalized model image (base image + mlpkg)
#
# The location of the Portable Prediction Server base image. You can download it from DataRobot's developer
# tools section, retag it, and push it to your registry.
ppsBaseImage: "<FILL-IN-DOCKER-REGISTRY>/datarobot-portable-prediction-api:latest"

# The Docker repo to which this plugin can push finalized models. The built images will be tagged
# as follows: <generatedImageRepo>:m-<model_pkg_id>
generatedImageRepo: "<FILL-IN-DOCKER-REGISTRY>/mlops-model"

# We use Kaniko to build the finalized image. See https://github.com/GoogleContainerTools/kaniko#readme.
# The default is to use the image below.
# kanikoImage: "gcr.io/kaniko-project/executor:v1.5.2"

# The name of the Kaniko ConfigMap to use. This provides the settings Kaniko will need to be able to push to
# your registry type. See https://github.com/GoogleContainerTools/kaniko#pushing-to-different-registries.
# The default is to not use any additional configuration.
# kanikoConfigmapName: "docker-config"

# The name of the Kaniko Secret to use. This provides the settings Kaniko will need to be able to push to
# your registry type. See https://github.com/GoogleContainerTools/kaniko#pushing-to-different-registries.
# The default is to not use any additional secrets. The secret must be of the type: kubernetes.io/dockerconfigjson
# kanikoSecretName: "registry-credentials"

# The name of a service account to use for running Kaniko if you want to run it in a more secure fashion.
# See https://github.com/GoogleContainerTools/kaniko#security.
# The default is to use the "default" service account in the namespace in which the pod runs.
# kanikoServiceAccount: default
```
To configure the test plugin, use the --plugin test option and set the temporary directory and the sleep time (in seconds) for each action executed by the test plugin. For example, with launch_time_sec set to 1 in the test plugin configuration below, a deployment launch creates a temporary file for the deployment, sleeps for 1 second, and then returns.
```yaml
tmp_dir: "/tmp"
launch_time_sec: 1
stop_time_sec: 1
replace_model_time_sec: 1
pe_status_time_sec: 1
deployment_status_time_sec: 1
deployment_list_time_sec: 1
plugin_start_time: 1
plugin_stop_time: 1
```
Create a custom plugin¶
The management agent's plugin framework is flexible enough to accommodate custom plugins. This flexibility is helpful when you have a custom prediction environment (different from, for example, the standard Docker or Kubernetes environment) in which you deploy your models. You can implement a plugin for such a prediction environment either by modifying the existing plugin or by implementing one from scratch. You can use the filesystem plugin as a reference when creating a custom Python plugin.
Currently, custom Java plugins are not supported.
If you decide to write a custom plugin, the following section describes the interface definition provided to write a Python plugin.
Implement the plugin interface¶
The management agent Python package defines the abstract base class BosunPluginBase. Each management agent plugin must inherit and implement the interface defined by this base class.
To start implementing a custom plugin (SamplePlugin below), inherit from the BosunPluginBase base class. As an example, implement the plugin in the sample_plugin.py file under the sample_plugin directory:
```python
class SamplePlugin(BosunPluginBase):
    def __init__(self, plugin_config, private_config_file=None, pe_info=None, dry_run=False):
```
Python plugin arguments¶
The constructor is invoked with the following arguments:
| Argument | Description |
|----------|-------------|
| plugin_config | A dictionary containing general information about the plugin; its details are covered in the following section. |
| private_config_file | Path to the private configuration file for the plugin, as passed in by the management agent. |
| pe_info | An instance of the prediction environment information (peInfo) structure, described below. |
| dry_run | Indicates whether the invocation is a dry run (development) or an actual run. |
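To see how these arguments fit together without the datarobot-bosun package installed, the constructor can be exercised against a stand-in base class (the stub below only mirrors the signature; the real BosunPluginBase provides much more):

```python
# Stub standing in for the real BosunPluginBase (signature only; illustrative).
class BosunPluginBase:
    def __init__(self, plugin_config, private_config_file=None,
                 pe_info=None, dry_run=False):
        self._plugin_config = plugin_config
        self._private_config_file = private_config_file
        self._pe_info = pe_info
        self._dry_run = dry_run

class SamplePlugin(BosunPluginBase):
    pass

# A dry-run instantiation with a minimal plugin_config dictionary.
plugin = SamplePlugin({"name": "ExternalCommand-1", "type": "ExternalCommand"},
                      dry_run=True)
```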
Python plugin methods¶
This class implements the following methods:
The return type for each of the following methods must be ActionStatusInfo (described below).
def plugin_start(self):

This method initializes the plugin; for example, it can check if the plugin can connect with the prediction environment (e.g., Docker, Kubernetes). In the case of the filesystem plugin, this method checks if the baseDir exists on the filesystem. The management agent typically invokes this method only once, during the startup process. This method is guaranteed to be called before any deployment-specific action is invoked.
def plugin_stop(self):

This method implements any teardown process, for example, closing client connections to the prediction environment. The management agent typically invokes this method only once, during the shutdown process. This plugin method is guaranteed to be called after all deployment-specific actions are done.
def deployment_list(self):

This method returns the list of deployments already running in the given prediction environment. The management agent typically invokes this method during startup to determine which deployments are already running in the prediction environment. The list of deployments is returned as a map of deployment_id to deployment information, using the data field in the ActionStatusInfo object (described below).
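A self-contained sketch of the deployment_list contract described above, using a stand-in ActionStatusInfo and an illustrative status string (the real class and status constants come from the management agent package):

```python
# Stand-in for the real ActionStatusInfo class (same constructor shape).
class ActionStatusInfo:
    def __init__(self, status, msg=None, state=None, duration=None, data=None):
        self.status, self.msg, self.state = status, msg, state
        self.duration, self.data = duration, data

def deployment_list(running_containers):
    # Return a map of deployment_id -> deployment information in `data`.
    deployments = {
        c["deployment_id"]: {"model_id": c["model_id"], "state": "running"}
        for c in running_containers
    }
    return ActionStatusInfo("OK", data=deployments)

status = deployment_list([{"deployment_id": "dep-1", "model_id": "m-1"}])
```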
def deployment_start(self, deployment_info):
This method implements the deployment launch process. The management agent invokes this method when a deployment is created or activated in DataRobot. For example, this method can launch a container in the Kubernetes or Docker service. In the case of the filesystem plugin, this method creates a directory with the name deployment_<deployment_id> and then places the deployment's model and a YAML configuration file under the new directory. The plugin should ensure that the deployment in the prediction environment is uniquely identifiable by the deployment ID and, ideally, by the paired deployment ID and model ID. For example, the built-in Docker plugin includes the deployment ID in the name of the container it launches.
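The filesystem-style launch described above can be sketched as follows (the directory naming follows the deployment_ prefix from the filesystem plugin config; everything else is illustrative):

```python
import shutil
import tempfile
from pathlib import Path

# Create a per-deployment directory and place the model artifact in it,
# mirroring the filesystem plugin behavior described above (illustrative).
def deployment_start(base_dir, deployment_id, model_artifact):
    dep_dir = Path(base_dir) / f"deployment_{deployment_id}"
    dep_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(model_artifact, dep_dir / Path(model_artifact).name)
    return dep_dir

base = tempfile.mkdtemp()
model = Path(base) / "model-A.txt"
model.write_text("model bytes")
dep_dir = deployment_start(base, "dep-1", model)
```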
def deployment_stop(self, deployment_info):
This method implements the deployment stop process. The management agent invokes this method when a deployment is deactivated or deleted in DataRobot. For example, this method can stop the container in the Kubernetes or Docker service. The deployment ID and model ID from deployment_info uniquely identify the container that needs to be stopped. In the case of the filesystem plugin, this method removes the directory created for that deployment by the deployment_start method.
def deployment_replace_model(self, deployment_info):
This method implements the model replacement process in a deployment. The management agent invokes this method when a model is replaced in a deployment in DataRobot. The modelArtifact field contains the path to the new model, and newModelId contains the ID of the new model to use for the replacement. In the case of the Docker or Kubernetes plugin, a potential implementation of this method could stop the container with the old model ID and then start a new container with the new model. In the case of the filesystem plugin, it removes the old deployment directory and creates a new one with the new model.
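The stop-then-start replacement strategy suggested above for the Docker or Kubernetes plugin can be illustrated with a toy in-memory container registry (no real containers are involved):

```python
# Toy registry simulating running containers: deployment_id -> model_id.
running = {}

def deployment_start(deployment_id, model_id):
    running[deployment_id] = model_id

def deployment_stop(deployment_id):
    running.pop(deployment_id, None)

def deployment_replace_model(deployment_id, new_model_id):
    deployment_stop(deployment_id)                  # stop the old model's container
    deployment_start(deployment_id, new_model_id)   # start one with the new model

deployment_start("dep-1", "model-A")
deployment_replace_model("dep-1", "model-B")
```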
def pe_status(self):

This method queries the status of the prediction environment, for example, whether the Kubernetes or Docker service is still reachable. The management agent periodically invokes this method to ensure the prediction environment is in a good state. To improve the experience, the plugin can also support queries for the status of the deployments running in the prediction environment, in addition to the status of the prediction environment itself. In this case, the IDs of the deployments are included in the deployments field of the peInfo structure (described below), and the status of each deployment is returned using the data field in the ActionStatusInfo object (described below). The deployment status is returned as a map of deployment_id to deployment information.
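A sketch of a pe_status implementation that also reports per-deployment statuses through the data field, as described above (stand-in ActionStatusInfo; the status strings are illustrative, not the package's actual constants):

```python
class ActionStatusInfo:
    # Stand-in for the real class (same constructor shape).
    def __init__(self, status, msg=None, state=None, duration=None, data=None):
        self.status, self.msg, self.state = status, msg, state
        self.duration, self.data = duration, data

def pe_status(pe_reachable, deployment_ids, is_running):
    if not pe_reachable:
        return ActionStatusInfo("ERROR", msg="prediction environment unreachable")
    # Map each requested deployment id to its status information.
    data = {d: {"status": "OK" if is_running(d) else "ERROR"}
            for d in deployment_ids}
    return ActionStatusInfo("OK", data=data)

status = pe_status(True, ["dep-1", "dep-2"], lambda d: d == "dep-1")
```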
def deployment_status(self, deployment_info):

This method queries the status of a deployment in the prediction environment, for example, whether the container corresponding to the deployment is still up and running. The management agent periodically invokes this method to ensure that the deployment is in a good state.
def deployment_relaunch(self, deployment_info):
This method implements the process of relaunching (stopping and then starting) a deployment. The management agent Python package already provides a default implementation of this method, which invokes deployment_stop followed by deployment_start; however, the plugin can implement its own relaunch mechanism if there is a more optimal way to relaunch a deployment.
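The default behavior described above (stop followed by start) can be sketched with stubs; the class names below are illustrative, not the package's actual implementation:

```python
class BasePlugin:
    # Default relaunch: stop the deployment, then start it again.
    def deployment_relaunch(self, deployment_info):
        self.deployment_stop(deployment_info)
        return self.deployment_start(deployment_info)

class SamplePlugin(BasePlugin):
    def __init__(self):
        self.calls = []
    def deployment_stop(self, deployment_info):
        self.calls.append(("stop", deployment_info["id"]))
    def deployment_start(self, deployment_info):
        self.calls.append(("start", deployment_info["id"]))

plugin = SamplePlugin()
plugin.deployment_relaunch({"id": "dep-1"})
```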
Python plugin return value¶
The return value for all of these operations is an ActionStatusInfo object providing the status of the action:
```python
class ActionStatusInfo:
    def __init__(self, status, msg=None, state=None, duration=None, data=None):
```
This object contains the following fields:
| Field | Description |
|-------|-------------|
| status | Indicates the status of the action. |
| state | Indicates the state of the deployment after the execution of the action. |
| duration | Indicates the time the action took to execute. |
| data | Returns information that the plugin can forward to the management agent. |
The base class automatically adds a timestamp to the object to keep track of different action status values.
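The constructor shape above, plus the automatically added timestamp, can be mimicked with a stand-in (the real class lives in the management agent package; the timestamp mechanics here are simulated):

```python
import datetime

class ActionStatusInfo:
    # Stand-in mirroring the constructor shown above; the timestamp is
    # simulated here because the real base class adds it for you.
    def __init__(self, status, msg=None, state=None, duration=None, data=None):
        self.status, self.msg, self.state = status, msg, state
        self.duration, self.data = duration, data
        self.timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()

status = ActionStatusInfo("OK", msg="deployment launched", duration=1.2)
```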
Use the bosun-plugin-runner¶
The management agent Python package provides the bosun-plugin-runner CLI tool, which allows you to invoke the custom plugin class and run a specific action. Using this tool, you can run your plugin in standalone mode while developing and debugging it.
```sh
bosun-plugin-runner \
  --plugin sample_plugin/sample_plugin \
  --action pe_status \
  --config sample_configs/action_config_pe_status_only.yaml \
  --private-config sample_configs/sample_plugin_config.yaml \
  --status-file /tmp/status.yaml \
  --show-status
```
bosun-plugin-runner accepts the following arguments:
| Argument | Description |
|----------|-------------|
| --plugin | Specifies the module containing the plugin class. In this case, sample_plugin/sample_plugin is used because the plugin class is inside the sample_plugin directory, in the sample_plugin.py file. |
| --action | Specifies the action to run; here, the pe_status action. |
| --config | Provides the configuration file to use for the specified action, described in more detail in the next section. When your plugin runs as part of the management agent service, this file is generated for you, but when testing specific actions manually via the bosun-plugin-runner, you may need to write it yourself. |
| --private-config | Provides a plugin-specific configuration file used only by the plugin. |
| --status-file | Provides a path for saving the plugin status that results from the action. |
| --show-status | Shows the contents of the status file after the action completes. |
To view the list of actions supported by bosun-plugin-runner, use the --list-actions flag:

```sh
bosun-plugin-runner --list-actions
# plugin_start
# plugin_stop
# deployment_start
# deployment_stop
# deployment_replace_model
# deployment_status
# pe_status
# deployment_list
```
Create the action config file¶
The --config flag is used to pass a YAML configuration file to the plugin. This is the structure of the configuration that the management agent prepares and invokes the plugin action with; however, during plugin development, you may need to write this configuration file yourself.
The typical contents of such a config file are shown below:
```yaml
pluginConfig:
  name: "ExternalCommand-1"
  type: "ExternalCommand"
  platform: "os"
  commandPrefix: "python3 sample_plugin.py"
  mlopsUrl: "https://app.datarobot.com"
peInfo:
  id: "0x2345"
  name: "Sample-PE"
  description: "some description"
  createdOn: "iso formatted date"
  createdBy: "some username"
  deployments: ["deployment-1", "deployment-2"]
  keyValueConfig:
    max_models: 5
deploymentInfo:
  id: "deployment-1"
  name: "deployment-1"
  description: "Deployment 1 for testing"
  modelId: "model-A"
  modelArtifact: "/tmp/model-A.txt"
  modelExecutionType: "dedicated"
  keyValueConfig:
    key1: "some-value-for-key-1"
```
The action configuration file contains three sections:
The pluginConfig section contains general information about the plugin, for example, the ID of the prediction environment, its type, and the platform. It may also contain mlopsUrl, the address of the MLOps service (DataRobot), in case the plugin needs to connect to it. This section translates to the pluginConfig dictionary passed as a constructor argument.
The peInfo section contains information about the prediction environment this action refers to. Typically, this information is used for the pe_status action. If the deployments key contains valid deployment IDs, the plugin is expected to return not only the status of the prediction environment but also the status of the deployments listed under that key.
The deploymentInfo section contains information about the deployment in the prediction environment this action refers to. All deployment-related actions use this section to identify which deployment and model to work on. As this is a particularly important section of the config, some of its key fields are described below:
- description: Provides information about the deployment as set in DataRobot.
- modelId and modelArtifact: Indicate the ID of the model and the path where the model can be found. Note that the management agent places the right model at this path before invoking the action.
- keyValueConfig: Lists additional configuration for the deployment. Note that this additional config can be set on the deployment in DataRobot; for example, it can be used to specify how much memory the container corresponding to this deployment should use.
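Once the action config has been parsed (e.g., with a YAML loader) into a dictionary, pulling out the deploymentInfo fields is straightforward. The sketch below uses a dict literal matching the sample config above so it stays self-contained:

```python
# Parsed form of the deploymentInfo section from the sample config above.
action_config = {
    "deploymentInfo": {
        "id": "deployment-1",
        "modelId": "model-A",
        "modelArtifact": "/tmp/model-A.txt",
        "keyValueConfig": {"key1": "some-value-for-key-1"},
    }
}

info = action_config["deploymentInfo"]
deployment_id = info["id"]
model_id = info["modelId"]
model_path = info["modelArtifact"]              # where the agent placed the model
extra_config = info.get("keyValueConfig", {})   # optional per-deployment settings
```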
Run actions with bosun-plugin-runner¶
As covered above, during plugin development you can use the bosun-plugin-runner to invoke actions. For example, here is how a deployment_start action can be invoked, using the same config described in the previous section, dumped to a file:
```sh
bosun-plugin-runner \
  --plugin sample_plugin/sample_plugin \
  --config sample_configs/action_config_deployment_1_model_A.yaml \
  --private-config sample_configs/sample_plugin_config.yaml \
  --action deployment_start \
  --status-file /tmp/status.yaml \
  --show-status
```
The status of this deployment_start action is captured in the file /tmp/status.yaml.
Configure the command prefix¶
Now that your plugin is ready for the management agent, you can configure the command prefix in the management agent configuration file as follows:
```yaml
command: "<BOSUN_VENV_PATH>/bin/bosun-plugin-runner --plugin sample_plugin --private-config <CONF_PATH>/plugin.sample_plugin_.conf.yaml"
```