Configure management agent environment plugins¶
Management agent plugins deploy and manage models in a given prediction environment. The management agent submits commands to the plugin, and the plugin executes them and returns the status of the command to the management agent. To facilitate this interaction, you provide prediction environment details during plugin configuration, allowing the plugin to execute commands in that environment. For example, a Kubernetes plugin can launch a deployment (container) in a Kubernetes cluster, replace a model in the deployment, stop the container, etc.
The MLOps management agent contains the following example plugins:
- Filesystem plugin.
- Docker plugin.
- Kubernetes plugin.
- Test plugin.
These example plugins are installed as part of the datarobot_bosun-*-py3-none-any.whl wheel file.
Configure example plugins¶
The following example plugins require additional configuration for use with the management agent:
To enable communication between the management agent and the deployment, the filesystem plugin creates one directory per deployment in the local filesystem and downloads each deployment's model package and configuration .yaml file into the deployment's local directory. These artifacts can then be used to serve predictions from a PPS container.
```yaml
# The top-level directory that will be used to store each deployment directory
baseDir: "."

# Each deployment directory will be prefixed with the following string
deploymentDirPrefix: "deployment_"

# The name of the deployment config file to create inside the deployment directory.
# Note: If working with the PPS, DO NOT change this name; the PPS expects this filename.
deploymentInfoFile: "config.yml"

# If defined, this string will be prefixed to the predictions URL for this deployment,
# and the URL will be returned, with the deployment id suffixed to the end with the
# /predict endpoint.
deploymentPredictionBaseUrl: "http://localhost:8080"

# If defined, create a yaml file with the kv of the deployment.
# If the name of the file is the same as the deploymentInfoFile,
# the key values are added to the same file as the other config.
# deploymentKVFile: "kv.yaml"
```
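Per the configuration above, when deploymentPredictionBaseUrl is defined, the prediction URL for a deployment is the base URL with the deployment ID and the /predict endpoint appended. A minimal sketch of that composition (the helper name is hypothetical, not part of the plugin):

```python
# Hypothetical helper illustrating the URL pattern described above:
# <deploymentPredictionBaseUrl>/<deployment_id>/predict
def build_prediction_url(base_url: str, deployment_id: str) -> str:
    return f"{base_url.rstrip('/')}/{deployment_id}/predict"

print(build_prediction_url("http://localhost:8080", "dep-123"))
# http://localhost:8080/dep-123/predict
```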
The Docker plugin can deploy native DataRobot models and custom models on a Docker server. In addition, the plugin automatically runs the monitoring agent to monitor deployed models and uses the traefik reverse proxy to provide a single prediction endpoint for each deployment.
The management agent's Docker plugin supports the use of the Portable Prediction Server, allowing a single Docker container to serve multiple models. It enables you to configure the PPS to indicate where models for each deployment are located and gives you the ability to start, stop, and manage deployments.
The Docker plugin can:
- Retrieve a model package from DataRobot for a deployment.
- Launch the DataRobot model within the Docker container.
- Shut down and clean up the Docker container.
- Report status back via events.
- Monitor predictions using the monitoring agent.
To configure the Docker plugin, take the following steps:
Set up the environment required for the Docker plugin:
```sh
docker pull rabbitmq:3-management
docker pull traefik:2.3.3
docker network create bosun
```
Build the monitoring agent container image:
```sh
cd datarobot_mlops_package-*/
cd tools/agent_docker
make build
```
Download the Portable Prediction Server from the DataRobot UI. If you are planning to use a custom model image, make sure the image is built and accessible to the Docker service.
Configure the Docker plugin configuration file, plugin.docker.conf.yaml:
```yaml
# Docker network on which to run all containers.
# This network must be created prior to running
# the agent (i.e., 'docker network create <NAME>')
dockerNetwork: "bosun"

# Traefik image to use
traefikImage: "traefik:2.3.3"

# Address that will be reported to DataRobot
outfacingPredictionURLPrefix: "http://10.10.12.22:81"

# MLOps agent image to use for monitoring
agentImage: "datarobot/mlops-tracking-agent:latest"

# RabbitMQ image to use for building a channel
rabbitmqImage: "rabbitmq:3-management"

# PPS base image
ppsBaseImage: "datarobot/datarobot-portable-prediction-api:latest"

# Prefix for generated images
generatedImagePrefix: "mlops_"

# Prefix for running containers
containerNamePrefix: "mlops_"

# Mapping of traefik proxy ports (not mandatory)
traefikPortMapping:
  80: 81
  8080: 8081

# Mapping of RabbitMQ ports (not mandatory)
rabbitmqPortMapping:
  15672: 15673
  5672: 5673
```
DataRobot provides a plugin to deploy and manage models in your Kubernetes cluster without writing any additional code. For configuration information, see the README file in the tools/charts/datarobot-management-agent folder in the tarball.
```yaml
## The following settings are related to connecting to your Kubernetes cluster
#
# The name of the kube-config context to use (similar to the --context argument of kubectl). There is a special
# `IN_CLUSTER` string to be used if you are running the plugin inside a cluster. The default is "IN_CLUSTER".
# kubeConfigContext: IN_CLUSTER

# The namespace in which you want to create and manage external deployments (similar to the --namespace
# argument of kubectl). You can leave this as `null` to use the "default" namespace, the namespace defined
# in your context, or (if running `IN_CLUSTER`) manage resources in the same namespace the plugin is
# executing in.
# kubeNamespace:

## The following settings are related to whether or not MLOps monitoring is enabled
#
# We need to know the location of the dockerized agent image that can be launched into your Kubernetes cluster.
# You can build the image by running `make build` in the tools/agent_docker/ directory, retagging the image,
# and pushing it to your registry.
agentImage: "<FILL-IN-DOCKER-REGISTRY>/mlops-tracking-agent:latest"

## The following settings are all related to accessing the model from outside the Kubernetes cluster
#
# The URL prefix used to access the deployed model, e.g., https://example.com/deployments/
# The model will be accessible via <outfacingPredictionURLPrefix>/<model_id>/predict
outfacingPredictionURLPrefix: "<FILL-CORRECT-URL-FOR-K8S-INGRESS>"

# We are still using the beta Ingress resource API, so a class must be provided. If your cluster
# doesn't have a default ingress class, please provide one.
# ingressClass:

## The following settings are all related to building the finalized model image (base image + mlpkg)
#
# The location of the Portable Prediction Server base image. You can download it from DataRobot's developer
# tools section, retag it, and push it to your registry.
ppsBaseImage: "<FILL-IN-DOCKER-REGISTRY>/datarobot-portable-prediction-api:latest"

# The Docker repo to which this plugin can push finalized models. The built images will be tagged
# as follows: <generatedImageRepo>:m-<model_pkg_id>
generatedImageRepo: "<FILL-IN-DOCKER-REGISTRY>/mlops-model"

# We use Kaniko to build the finalized image. See https://github.com/GoogleContainerTools/kaniko#readme.
# The default is to use the image below.
# kanikoImage: "gcr.io/kaniko-project/executor:v1.5.2"

# The name of the Kaniko ConfigMap to use. This provides the settings Kaniko will need to be able to push to
# your registry type. See https://github.com/GoogleContainerTools/kaniko#pushing-to-different-registries.
# The default is to not use any additional configuration.
# kanikoConfigmapName: "docker-config"

# The name of the Kaniko Secret to use. This provides the settings Kaniko will need to be able to push to
# your registry type. See https://github.com/GoogleContainerTools/kaniko#pushing-to-different-registries.
# The default is to not use any additional secrets. The secret must be of the type: kubernetes.io/dockerconfigjson
# kanikoSecretName: "registry-credentials"

# The name of a service account to use for running Kaniko if you want to run it in a more secure fashion.
# See https://github.com/GoogleContainerTools/kaniko#security.
# The default is to use the "default" service account in the namespace in which the pod runs.
# kanikoServiceAccount: default
```
To configure the test plugin, use the --plugin test option and set the temporary directory and the sleep time (in seconds) for each action executed by the test plugin. For example, with launch_time_sec set to 1 in the test plugin configuration below, a deployment launch creates a temporary file for the deployment, sleeps for 1 second, and then returns.
```yaml
tmp_dir: "/tmp"
launch_time_sec: 1
stop_time_sec: 1
replace_model_time_sec: 1
pe_status_time_sec: 1
deployment_status_time_sec: 1
deployment_list_time_sec: 1
plugin_start_time: 1
plugin_stop_time: 1
```
Create a custom plugin¶
The management agent's plugin framework is flexible enough to accommodate custom plugins. This flexibility is helpful when you have a custom prediction environment (different from, for example, the standard Docker or Kubernetes environment) in which you deploy your models. You can implement a plugin for such a prediction environment either by modifying the existing plugin or by implementing one from scratch. You can use the filesystem plugin as a reference when creating a custom Python plugin.
Currently, custom Java plugins are not supported.
If you decide to write a custom plugin, the following section describes the interface definition provided to write a Python plugin.
Implement the plugin interface¶
The management agent Python package defines the abstract base class BosunPluginBase. Each management agent plugin must inherit and implement the interface defined by this base class.
To start implementing a custom plugin (SamplePlugin below), inherit from the BosunPluginBase base class. As an example, implement the plugin in the sample_plugin.py file under the sample_plugin directory:
```python
class SamplePlugin(BosunPluginBase):
    def __init__(self, plugin_config, private_config_file=None, pe_info=None, dry_run=False):
```
Python plugin arguments¶
The constructor is invoked with the following arguments:
| Argument | Description |
|----------|-------------|
| plugin_config | A dictionary containing general information about the plugin; its details are covered in the following section. |
| private_config_file | Path to the private configuration file for the plugin, as passed in by the management agent. |
| pe_info | An instance of the prediction environment information (peInfo) structure, described below. |
| dry_run | Indicates whether the invocation is a dry run (development) or an actual run. |
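To see how these arguments fit together without the datarobot-bosun package installed, the constructor can be exercised against a stand-in base class (the stub below only mirrors the signature; the real BosunPluginBase provides much more):

```python
# Stub standing in for the real BosunPluginBase (signature only; illustrative).
class BosunPluginBase:
    def __init__(self, plugin_config, private_config_file=None,
                 pe_info=None, dry_run=False):
        self._plugin_config = plugin_config
        self._private_config_file = private_config_file
        self._pe_info = pe_info
        self._dry_run = dry_run

class SamplePlugin(BosunPluginBase):
    pass

# A dry-run instantiation with a minimal plugin_config dictionary.
plugin = SamplePlugin({"name": "ExternalCommand-1", "type": "ExternalCommand"},
                      dry_run=True)
```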
Python plugin methods¶
This class implements the following methods:
The return type for each of the following methods must be ActionStatusInfo (described below).
def plugin_start(self):

This method initializes the plugin; for example, it can check if the plugin can connect with the prediction environment (e.g., Docker, Kubernetes). In the case of the filesystem plugin, this method checks if the baseDir exists on the filesystem. The management agent typically invokes this method only once, during the startup process. This method is guaranteed to be called before any deployment-specific action is invoked.
def plugin_stop(self):

This method implements any teardown process, for example, closing client connections to the prediction environment. The management agent typically invokes this method only once, during the shutdown process. This plugin method is guaranteed to be called after all deployment-specific actions are done.
def deployment_list(self):

This method returns the list of deployments already running in the given prediction environment. The management agent typically invokes this method during startup to determine which deployments are already running in the prediction environment. The list of deployments is returned as a map of deployment_id to deployment information, using the data field in the ActionStatusInfo object (described below).
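A self-contained sketch of the deployment_list contract described above, using a stand-in ActionStatusInfo and an illustrative status string (the real class and status constants come from the management agent package):

```python
# Stand-in for the real ActionStatusInfo class (same constructor shape).
class ActionStatusInfo:
    def __init__(self, status, msg=None, state=None, duration=None, data=None):
        self.status, self.msg, self.state = status, msg, state
        self.duration, self.data = duration, data

def deployment_list(running_containers):
    # Return a map of deployment_id -> deployment information in `data`.
    deployments = {
        c["deployment_id"]: {"model_id": c["model_id"], "state": "running"}
        for c in running_containers
    }
    return ActionStatusInfo("OK", data=deployments)

status = deployment_list([{"deployment_id": "dep-1", "model_id": "m-1"}])
```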
def deployment_start(self, deployment_info):
This method implements the deployment launch process. The management agent invokes this method when a deployment is created or activated in DataRobot. For example, this method can launch a container in the Kubernetes or Docker service. In the case of the filesystem plugin, this method creates a directory with the name deployment_<deployment_id> and then places the deployment's model and a YAML configuration file under the new directory. The plugin should ensure that the deployment in the prediction environment is uniquely identifiable by the deployment ID and, ideally, by the paired deployment ID and model ID. For example, the built-in Docker plugin includes the deployment ID in the name of the container it launches.
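The filesystem-style launch described above can be sketched as follows (the directory naming follows the deployment_ prefix from the filesystem plugin config; everything else is illustrative):

```python
import shutil
import tempfile
from pathlib import Path

# Create a per-deployment directory and place the model artifact in it,
# mirroring the filesystem plugin behavior described above (illustrative).
def deployment_start(base_dir, deployment_id, model_artifact):
    dep_dir = Path(base_dir) / f"deployment_{deployment_id}"
    dep_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(model_artifact, dep_dir / Path(model_artifact).name)
    return dep_dir

base = tempfile.mkdtemp()
model = Path(base) / "model-A.txt"
model.write_text("model bytes")
dep_dir = deployment_start(base, "dep-1", model)
```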
def deployment_stop(self, deployment_info):
This method implements the deployment stop process. The management agent invokes this method when a deployment is deactivated or deleted in DataRobot. For example, this method can stop the container in the Kubernetes or Docker service. The deployment ID and model ID from deployment_info uniquely identify the container that needs to be stopped. In the case of the filesystem plugin, this method removes the directory created for that deployment by the deployment_start method.
def deployment_replace_model(self, deployment_info):
This method implements the model replacement process in a deployment. The management agent invokes this method when a model is replaced in a deployment in DataRobot. The modelArtifact field contains the path to the new model, and newModelId contains the ID of the new model to use for the replacement. In the case of the Docker or Kubernetes plugin, a potential implementation of this method could stop the container with the old model ID and then start a new container with the new model. In the case of the filesystem plugin, it removes the old deployment directory and creates a new one with the new model.
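The stop-then-start replacement strategy suggested above for the Docker or Kubernetes plugin can be illustrated with a toy in-memory container registry (no real containers are involved):

```python
# Toy registry simulating running containers: deployment_id -> model_id.
running = {}

def deployment_start(deployment_id, model_id):
    running[deployment_id] = model_id

def deployment_stop(deployment_id):
    running.pop(deployment_id, None)

def deployment_replace_model(deployment_id, new_model_id):
    deployment_stop(deployment_id)                  # stop the old model's container
    deployment_start(deployment_id, new_model_id)   # start one with the new model

deployment_start("dep-1", "model-A")
deployment_replace_model("dep-1", "model-B")
```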
def pe_status(self):

This method queries the status of the prediction environment, for example, whether the Kubernetes or Docker service is still reachable. The management agent periodically invokes this method to ensure the prediction environment is in a good state. To improve the experience, the plugin can also support queries for the status of the deployments running in the prediction environment, in addition to the status of the prediction environment itself. In this case, the IDs of the deployments are included in the deployments field of the peInfo structure (described below), and the status of each deployment is returned using the data field in the ActionStatusInfo object (described below). The deployment status is returned as a map of deployment_id to deployment information.
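A sketch of a pe_status implementation that also reports per-deployment statuses through the data field, as described above (stand-in ActionStatusInfo; the status strings are illustrative, not the package's actual constants):

```python
class ActionStatusInfo:
    # Stand-in for the real class (same constructor shape).
    def __init__(self, status, msg=None, state=None, duration=None, data=None):
        self.status, self.msg, self.state = status, msg, state
        self.duration, self.data = duration, data

def pe_status(pe_reachable, deployment_ids, is_running):
    if not pe_reachable:
        return ActionStatusInfo("ERROR", msg="prediction environment unreachable")
    # Map each requested deployment id to its status information.
    data = {d: {"status": "OK" if is_running(d) else "ERROR"}
            for d in deployment_ids}
    return ActionStatusInfo("OK", data=data)

status = pe_status(True, ["dep-1", "dep-2"], lambda d: d == "dep-1")
```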
def deployment_status(self, deployment_info):

This method queries the status of a deployment in the prediction environment, for example, whether the container corresponding to the deployment is still up and running. The management agent periodically invokes this method to ensure that the deployment is in a good state.
def deployment_relaunch(self, deployment_info):
This method implements the process of relaunching (stopping and then starting) a deployment. The management agent Python package already provides a default implementation of this method, which invokes deployment_stop followed by deployment_start; however, the plugin can implement its own relaunch mechanism if there is a more optimal way to relaunch a deployment.
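The default behavior described above (stop followed by start) can be sketched with stubs; the class names below are illustrative, not the package's actual implementation:

```python
class BasePlugin:
    # Default relaunch: stop the deployment, then start it again.
    def deployment_relaunch(self, deployment_info):
        self.deployment_stop(deployment_info)
        return self.deployment_start(deployment_info)

class SamplePlugin(BasePlugin):
    def __init__(self):
        self.calls = []
    def deployment_stop(self, deployment_info):
        self.calls.append(("stop", deployment_info["id"]))
    def deployment_start(self, deployment_info):
        self.calls.append(("start", deployment_info["id"]))

plugin = SamplePlugin()
plugin.deployment_relaunch({"id": "dep-1"})
```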
Python plugin return value¶
The return value for all of these operations is an ActionStatusInfo object providing the status of the action:
```python
class ActionStatusInfo:
    def __init__(self, status, msg=None, state=None, duration=None, data=None):
```
This object contains the following fields:
| Field | Description |
|-------|-------------|
| status | Indicates the status of the action. |
| state | Indicates the state of the deployment after the execution of the action. |
| duration | Indicates the time the action took to execute. |
| data | Returns information that the plugin can forward to the management agent. |
The base class automatically adds a timestamp to the object to keep track of different action status values.
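The constructor shape above, plus the automatically added timestamp, can be mimicked with a stand-in (the real class lives in the management agent package; the timestamp mechanics here are simulated):

```python
import datetime

class ActionStatusInfo:
    # Stand-in mirroring the constructor shown above; the timestamp is
    # simulated here because the real base class adds it for you.
    def __init__(self, status, msg=None, state=None, duration=None, data=None):
        self.status, self.msg, self.state = status, msg, state
        self.duration, self.data = duration, data
        self.timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()

status = ActionStatusInfo("OK", msg="deployment launched", duration=1.2)
```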
Use the bosun-plugin-runner¶
The management agent Python package provides the bosun-plugin-runner CLI tool, which allows you to invoke the custom plugin class and run a specific action. Using this tool, you can run your plugin in standalone mode while developing and debugging it.
```sh
bosun-plugin-runner \
  --plugin sample_plugin/sample_plugin \
  --action pe_status \
  --config sample_configs/action_config_pe_status_only.yaml \
  --private-config sample_configs/sample_plugin_config.yaml \
  --status-file /tmp/status.yaml \
  --show-status
```
bosun-plugin-runner accepts the following arguments:
| Argument | Description |
|----------|-------------|
| --plugin | Specifies the module containing the plugin class. In this case, sample_plugin/sample_plugin is used because the plugin class is inside the sample_plugin directory, in the sample_plugin.py file. |
| --action | Specifies the action to run; here, the pe_status action. |
| --config | Provides the configuration file to use for the specified action, described in more detail in the next section. When your plugin runs as part of the management agent service, this file is generated for you, but when testing specific actions manually via the bosun-plugin-runner, you may need to write it yourself. |
| --private-config | Provides a plugin-specific configuration file used only by the plugin. |
| --status-file | Provides a path for saving the plugin status that results from the action. |
| --show-status | Shows the contents of the status file after the action completes. |
To view the list of actions supported by bosun-plugin-runner, use the --list-actions flag:

```sh
bosun-plugin-runner --list-actions
# plugin_start
# plugin_stop
# deployment_start
# deployment_stop
# deployment_replace_model
# deployment_status
# pe_status
# deployment_list
```
Create the action config file¶
The --config flag is used to pass a YAML configuration file to the plugin. This is the structure of the configuration that the management agent prepares and invokes the plugin action with; however, during plugin development, you may need to write this configuration file yourself.
The typical contents of such a config file are shown below:
```yaml
pluginConfig:
  name: "ExternalCommand-1"
  type: "ExternalCommand"
  platform: "os"
  commandPrefix: "python3 sample_plugin.py"
  mlopsUrl: "https://app.datarobot.com"
peInfo:
  id: "0x2345"
  name: "Sample-PE"
  description: "some description"
  createdOn: "iso formatted date"
  createdBy: "some username"
  deployments: ["deployment-1", "deployment-2"]
  keyValueConfig:
    max_models: 5
deploymentInfo:
  id: "deployment-1"
  name: "deployment-1"
  description: "Deployment 1 for testing"
  modelId: "model-A"
  modelArtifact: "/tmp/model-A.txt"
  modelExecutionType: "dedicated"
  keyValueConfig:
    key1: "some-value-for-key-1"
```
The action configuration file contains three sections:
The pluginConfig section contains general information about the plugin, for example, the ID of the prediction environment, its type, and the platform. It may also contain mlopsUrl, the address of the MLOps service (DataRobot), in case the plugin needs to connect to it. This section translates to the pluginConfig dictionary passed as a constructor argument.
The peInfo section contains information about the prediction environment this action refers to. Typically, this information is used for the pe_status action. If the deployments key contains valid deployment IDs, the plugin is expected to return not only the status of the prediction environment but also the status of the deployments listed under that key.
The deploymentInfo section contains information about the deployment in the prediction environment this action refers to. All deployment-related actions use this section to identify which deployment and model to work on. As this is a particularly important section of the config, some of its key fields are described below:
- description: Provides information about the deployment as set in DataRobot.
- modelId and modelArtifact: Indicate the ID of the model and the path where the model can be found. Note that the management agent places the right model at this path before invoking the action.
- keyValueConfig: Lists additional configuration for the deployment. Note that this additional config can be set on the deployment in DataRobot; for example, it can be used to specify how much memory the container corresponding to this deployment should use.
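Once the action config has been parsed (e.g., with a YAML loader) into a dictionary, pulling out the deploymentInfo fields is straightforward. The sketch below uses a dict literal matching the sample config above so it stays self-contained:

```python
# Parsed form of the deploymentInfo section from the sample config above.
action_config = {
    "deploymentInfo": {
        "id": "deployment-1",
        "modelId": "model-A",
        "modelArtifact": "/tmp/model-A.txt",
        "keyValueConfig": {"key1": "some-value-for-key-1"},
    }
}

info = action_config["deploymentInfo"]
deployment_id = info["id"]
model_id = info["modelId"]
model_path = info["modelArtifact"]              # where the agent placed the model
extra_config = info.get("keyValueConfig", {})   # optional per-deployment settings
```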
Run actions with bosun-plugin-runner¶
As covered above, during plugin development you can use the bosun-plugin-runner to invoke actions. For example, here is how a deployment_start action can be invoked, using the same config described in the previous section, dumped to a file:
```sh
bosun-plugin-runner \
  --plugin sample_plugin/sample_plugin \
  --config sample_configs/action_config_deployment_1_model_A.yaml \
  --private-config sample_configs/sample_plugin_config.yaml \
  --action deployment_start \
  --status-file /tmp/status.yaml \
  --show-status
```
The status of this deployment_start action is captured in the file /tmp/status.yaml.
Configure the command prefix¶
Now that your plugin is ready for the management agent, you can configure the command prefix in the management agent configuration file as follows:
```yaml
command: "<BOSUN_VENV_PATH>/bin/bosun-plugin-runner --plugin sample_plugin --private-config <CONF_PATH>/plugin.sample_plugin_.conf.yaml"
```