The Kubernetes Availability Monitor

DataRobot 10.X.X ships with a set of dedicated monitoring jobs: the Kubernetes Availability Monitor services.

Kavmon (Kubernetes Availability Monitor) is a CLI application that replaces and improves on the DataRobot availability-monitor (AVMon) service used in earlier DataRobot versions. It re-implements a subset of the checks developed for AVMon and adds new checks that are specific to a Kubernetes environment. The implemented checks are listed in the Check Definitions sections for reference.

Kavmon is implemented as a new command-line tool that is shipped in the datarobot-runtime image.

Kavmon is designed to run as native Kubernetes cron jobs, so check schedules are tuned in the cron jobs' configuration. When troubleshooting a running cluster, an operator can also exec into a running container and use the Kavmon CLI to run checks on demand in the foreground, or trigger an unscheduled run of a specific cron job.
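One common way to trigger an unscheduled run is `kubectl create job --from=cronjob/...`, which creates a one-off Job from a CronJob's template. The sketch below only builds that command line in Python and does not execute it. The cron job name `kavmon-info` and the namespace value are hypothetical placeholders, not names confirmed by this document; list the real cron jobs with `kubectl get cronjobs -n <namespace>` first.

```python
import shlex
import time
from typing import List


def build_manual_run_command(cronjob: str, namespace: str, suffix: str = "") -> List[str]:
    """Build a kubectl command that creates a one-off Job from a CronJob template."""
    # Job names must be unique in a namespace, so default to a timestamp suffix.
    suffix = suffix or str(int(time.time()))
    job_name = f"{cronjob}-manual-{suffix}"
    return [
        "kubectl", "create", "job", job_name,
        f"--from=cronjob/{cronjob}",
        "-n", namespace,
    ]


# "kavmon-info" and "DR_CORE_NAMESPACE" are hypothetical placeholders.
command = build_manual_run_command("kavmon-info", "DR_CORE_NAMESPACE", suffix="001")
print(shlex.join(command))
```

Passing the resulting list to `subprocess.run(command, check=True)` would run it from a host that has cluster access.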

Because Kavmon is an on-demand CLI tool, it does not include a storage backend. By default, Kavmon sends metrics to the OTEL collector if telemetry is enabled for the DataRobot namespace. See the Observability with OpenTelemetry section for the available configuration. Additionally, Kavmon logs can be used to produce check reports with an external observability stack, such as ELK.

Querying The Kubernetes Availability Monitor

The Kavmon CLI provides a single subcommand, check:

$ datarobot-kavmon check --help
Usage: run.py check [OPTIONS]
  Run checks based on provided filters in the running cluster.
  Example:     `kavmon check -g info -s datasetsserviceapi`
  The above command will execute all checks that belong to group ‘info’ and
  service ‘datasetsserviceapi’...
Options:
  -g, --group TEXT           Filter by check 'group' (can specify multiple
                             groups separated by comma).  [default: ]
  -n, --name TEXT            Filter by check 'name' (can specify multiple
                             names separated by comma).  [default: ]
  -s, --service TEXT         Filter by check 'service' (can specify multiple
                             services separated by comma).  [default: ]
  -w, --num-workers INTEGER  Number of parallel workers to use (default: use
                             all available CPUs).  [default: 0]
  -t, --timeout INTEGER      Override timeout for all checks, in seconds
                             [default: 0]
  --cronjob                  Control exit code of 'kavmon check' script.
                             Always return exit code 0.
  -l, --log-level TEXT       Logging level for kavmon cli and kavmon checks.
  --no-metrics               Force checks not generating metrics.  [default:
                             False]
  --help                     Show this message and exit. 

Run the command in a container that uses the datarobot-runtime image inside the Kubernetes cluster. It executes the requested checks in parallel, using all CPUs available to the Kavmon pod by default; the number of workers can be tuned with the --num-workers argument.
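The worker-count convention above (0 means "use every available CPU") can be sketched as follows. This is an illustration only, not Kavmon's code: `resolve_num_workers`, `run_checks`, and the dummy checks are hypothetical names, and a thread pool stands in for whatever worker pool Kavmon actually uses.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def resolve_num_workers(requested: int) -> int:
    """Map the CLI convention (0 = all CPUs) to a concrete worker count."""
    if requested > 0:
        return requested
    return os.cpu_count() or 1  # os.cpu_count() may return None


def run_checks(checks: Dict[str, Callable[[], bool]], num_workers: int = 0) -> Dict[str, bool]:
    """Run every check concurrently and return {check name: healthy}."""
    workers = resolve_num_workers(num_workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        return {name: future.result() for name, future in futures.items()}


# Dummy checks, for illustration only.
results = run_checks({"always-ok": lambda: True, "always-bad": lambda: False}, num_workers=2)
print(results)
```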

Select the checks to run with the --group, --name, and --service filters, alone or in combination; if no filter is provided, all checks are performed.
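The filter behavior can be pictured with a small sketch. It assumes that different filters combine with AND (as in the help text example, group info and service datasetsserviceapi), that comma-separated values within one filter combine with OR, and that an empty filter matches everything. The `Check` class, `select_checks`, and the two checks other than "License Valid and Non-Expired" are hypothetical and do not come from Kavmon.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Check:
    name: str
    group: str
    service: str


def parse_filter(value: str) -> Set[str]:
    """Split a comma-separated filter; an empty string yields an empty set."""
    return {item.strip() for item in value.split(",") if item.strip()}


def select_checks(checks: List[Check], group: str = "", name: str = "", service: str = "") -> List[Check]:
    """Keep checks that match every non-empty filter (assumed AND semantics)."""
    groups, names, services = parse_filter(group), parse_filter(name), parse_filter(service)
    return [
        c for c in checks
        if (not groups or c.group in groups)
        and (not names or c.name in names)
        and (not services or c.service in services)
    ]


checks = [
    Check("License Valid and Non-Expired", "info", "app"),
    Check("Example Info Check", "info", "datasetsserviceapi"),  # hypothetical
    Check("Example DB Check", "dbHealth", "exampledb"),         # hypothetical
]

selected = select_checks(checks, group="info", service="datasetsserviceapi")
print([c.name for c in selected])
```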

By default, Kavmon also sends job metrics to the configured metrics system; this can be disabled with the --no-metrics argument. To view only the final output of the checks (in a format closely matching the JSON output of AVMon), set the log level to warning or critical with the -l argument.

The Kavmon CLI exits with code 0 if all checks are healthy, and with code 1 if any check is unhealthy. If the --cronjob argument is supplied, the CLI always exits with code 0; this prevents Kubernetes from repeatedly restarting the cron job when checks fail.
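The exit-code rule fits in a few lines. The function below is a hypothetical restatement of the rule described above, not Kavmon's implementation.

```python
def kavmon_exit_code(all_healthy: bool, cronjob: bool = False) -> int:
    """Return the process exit code implied by the documented rule."""
    if cronjob:
        # With --cronjob the result of the checks never affects the exit code.
        return 0
    return 0 if all_healthy else 1


for all_healthy in (True, False):
    for cronjob in (False, True):
        print(f"all_healthy={all_healthy} cronjob={cronjob} -> {kavmon_exit_code(all_healthy, cronjob)}")
```

A practical consequence is that a wrapper script which wants to alert on failures should not pass --cronjob, since only the log output would then reveal unhealthy checks.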

Command examples

Command                                               Description
datarobot-kavmon check -s app                         Execute the health checks for the main application.
datarobot-kavmon check -g dbHealth                    Execute the checks belonging to group dbHealth.
datarobot-kavmon check -g info -s datasetsserviceapi  Execute the checks belonging to group info and service datasetsserviceapi.
datarobot-kavmon check                                Execute all checks.

To run these commands manually, follow the steps below from a host that has access to the cluster and to the namespace in which DataRobot is installed.

Connect to a pod shell

Step one: connect to a pod that uses the datarobot-runtime image. This example uses the mmapp-app service's pod. The generated suffix at the end of the pod name will differ in your environment.

# find the pod name
$ kubectl get pods -n DR_CORE_NAMESPACE | grep mmapp
mmapp-app-5d45968868-h9924                                        1/1     Running     0             63m 
# connect to a shell on the running pod
$ kubectl exec -ti -n DR_CORE_NAMESPACE mmapp-app-5d45968868-h9924 -- /entrypoint bash
WARNING: Generating random home
[INFO quantum_env.activate] Running activate script
[INFO quantum_env.activate] quantum_env 3.1.4 - Extended Virtualenv Support for DataRobot
bash-4.4$ 

Execute the commands

Step two: invoke the datarobot-kavmon CLI. This example requests the status of the app service.

datarobot-kavmon check -s app 

The expected output will look something like this:

{
  "@message": "Kavmon starting",
  "@timestamp": "2023-01-30T17:00:31.455233Z",
  "@source_host": "mmapp-app-5d45968868-h9924",
  "@fields": {
    "name": "__main__",
    "args": [],
    "levelname": "INFO",
    "levelno": 20,
    "pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/cli/run.py",
    "filename": "run.py",
    "module": "run",
    "stack_info": null,
    "lineno": 127,
    "funcName": "check",
    "created": 1675098031.4551651,
    "msecs": 455.16514778137207,
    "relativeCreated": 3607.522964477539,
    "thread": 139640676325184,
    "threadName": "MainThread",
    "processName": "MainProcess",
    "process": 754,
    "datarobot_service_id": "default",
    "arguments": {
      "group": "",
      "name": "",
      "service": "app",
      "num_workers": 0,
      "timeout": 0,
      "cronjob": false,
      "log_level": "info",
      "no_metrics": false,
      "provided_log_level": 20
    },
    "DataRobotLogger_filter_disallow": false,
    "DataRobotLogger_filter_aliases": false,
    "DataRobotLogger_keys_disallow": [
      "arguments"
    ]
  }
}
{
  "@message": "Picked checks based on provided filters",
  "@timestamp": "2023-01-30T17:00:31.455453Z",
  "@source_host": "mmapp-app-5d45968868-h9924",
  "@fields": {
    "name": "support.kavmon.cli.run_manager",
    "args": [],
    "levelname": "INFO",
    "levelno": 20,
    "pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/cli/run_manager.py",
    "filename": "run_manager.py",
    "module": "run_manager",
    "stack_info": null,
    "lineno": 89,
    "funcName": "get_matched_checks",
    "created": 1675098031.4554243,
    "msecs": 455.42430877685547,
    "relativeCreated": 3607.7821254730225,
    "thread": 139640676325184,
    "threadName": "MainThread",
    "processName": "MainProcess",
    "process": 754,
    "datarobot_service_id": "default",
    "health_checks": [
      "License Valid and Non-Expired"
    ],
    "DataRobotLogger_filter_disallow": false,
    "DataRobotLogger_filter_aliases": false,
    "DataRobotLogger_keys_disallow": [
      "health_checks"
    ]
  }
}
{
  "@message": "Health check execution started",
  "@timestamp": "2023-01-30T17:00:31.465817Z",
  "@source_host": "mmapp-app-5d45968868-h9924",
  "@fields": {
    "name": "support.kavmon.base.health_check",
    "args": [],
    "levelname": "INFO",
    "levelno": 20,
    "pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/base/health_check.py",
    "filename": "health_check.py",
    "module": "health_check",
    "stack_info": null,
    "lineno": 94,
    "funcName": "run_wrapper",
    "created": 1675098031.4656253,
    "msecs": 465.6252861022949,
    "relativeCreated": 3617.983102798462,
    "thread": 139640676325184,
    "threadName": "MainThread",
    "processName": "ForkProcess-1",
    "process": 818,
    "datarobot_service_id": "default",
    "health_check": "License Valid and Non-Expired",
    "DataRobotLogger_filter_disallow": false,
    "DataRobotLogger_filter_aliases": false,
    "DataRobotLogger_keys_disallow": [
      "health_check"
    ]
  }
}
{
  "@message": "Health check execution finished",
  "@timestamp": "2023-01-30T17:00:31.502575Z",
  "@source_host": "mmapp-app-5d45968868-h9924",
  "@fields": {
    "name": "support.kavmon.base.health_check",
    "args": [],
    "levelname": "INFO",
    "levelno": 20,
    "pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/base/health_check.py",
    "filename": "health_check.py",
    "module": "health_check",
    "stack_info": null,
    "lineno": 140,
    "funcName": "run_wrapper",
    "created": 1675098031.5025504,
    "msecs": 502.5503635406494,
    "relativeCreated": 3654.9081802368164,
    "thread": 139640676325184,
    "threadName": "MainThread",
    "processName": "ForkProcess-1",
    "process": 818,
    "datarobot_service_id": "default",
    "health_check": "License Valid and Non-Expired",
    "healthy": true,
    "timed_out": false,
    "health_check_group": "info",
    "health_check_service": "app",
    "health_check_result": "{\"group\": \"info\", \"name\": \"License Valid and Non-Expired\", \"service\": \"app\", \"healthy\": true, \"timestamp_start\": \"2023-01-30 17:00:31.466165\", \"concurrentWorkersCount\": 500, \"expirationTimestamp\": \"2023-02-03T16:26:16.000000Z\", \"expired\": false, \"maximumActiveUsers\": 0, \"prepaidDeploymentLimit\": 0, \"maxDeploymentLimit\": 0, \"timestamp_end\": \"2023-01-30 17:00:31.502487\"}",
    "DataRobotLogger_filter_disallow": false,
    "DataRobotLogger_filter_aliases": false,
    "DataRobotLogger_keys_disallow": [
      "health_check",
      "healthy",
      "timed_out",
      "health_check_group",
      "health_check_service",
      "health_check_result"
    ]
  }
}
{
  "@message": "Check succeeded.",
  "@timestamp": "2023-01-30T17:00:31.502700Z",
  "@source_host": "mmapp-app-5d45968868-h9924",
  "@fields": {
    "name": "support.kavmon.base.health_check",
    "args": [],
    "levelname": "INFO",
    "levelno": 20,
    "pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/base/health_check.py",
    "filename": "health_check.py",
    "module": "health_check",
    "stack_info": null,
    "lineno": 143,
    "funcName": "run_wrapper",
    "created": 1675098031.5026789,
    "msecs": 502.67887115478516,
    "relativeCreated": 3655.036687850952,
    "thread": 139640676325184,
    "threadName": "MainThread",
    "processName": "ForkProcess-1",
    "process": 818,
    "datarobot_service_id": "default",
    "health_check": "License Valid and Non-Expired",
    "DataRobotLogger_filter_disallow": false,
    "DataRobotLogger_filter_aliases": false,
    "DataRobotLogger_keys_disallow": [
      "health_check"
    ]
  }
}
{
  "@message": "Finished processing all health checks. All checks were healthy.",
  "@timestamp": "2023-01-30T17:00:31.512450Z",
  "@source_host": "mmapp-app-5d45968868-h9924",
  "@fields": {
    "name": "support.kavmon.cli.run_manager",
    "args": [],
    "levelname": "INFO",
    "levelno": 20,
    "pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/cli/run_manager.py",
    "filename": "run_manager.py",
    "module": "run_manager",
    "stack_info": null,
    "lineno": 60,
    "funcName": "run_check_manager",
    "created": 1675098031.5123436,
    "msecs": 512.3436450958252,
    "relativeCreated": 3664.701461791992,
    "thread": 139640676325184,
    "threadName": "MainThread",
    "processName": "MainProcess",
    "process": 754,
    "datarobot_service_id": "default",
    "DataRobotLogger_filter_disallow": false,
    "DataRobotLogger_filter_aliases": false
  }
}
{
  "@message": "Kavmon checks finished",
  "@timestamp": "2023-01-30T17:00:31.512757Z",
  "@source_host": "mmapp-app-5d45968868-h9924",
  "@fields": {
    "name": "support.kavmon.cli.run_manager",
    "args": [],
    "levelname": "INFO",
    "levelno": 20,
    "pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/cli/run_manager.py",
    "filename": "run_manager.py",
    "module": "run_manager",
    "stack_info": null,
    "lineno": 70,
    "funcName": "run_check_manager",
    "created": 1675098031.5127304,
    "msecs": 512.7303600311279,
    "relativeCreated": 3665.088176727295,
    "thread": 139640676325184,
    "threadName": "MainThread",
    "processName": "MainProcess",
    "process": 754,
    "datarobot_service_id": "default",
    "all_healthy": true,
    "failed_checks": "",
    "check_groups": "info",
    "health_check_results": "{\"info\": [{\"name\": \"License Valid and Non-Expired\", \"service\": \"app\", \"healthy\": true, \"timestamp_start\": \"2023-01-30 17:00:31.466165\", \"concurrentWorkersCount\": 500, \"expirationTimestamp\": \"2023-02-03T16:26:16.000000Z\", \"expired\": false, \"maximumActiveUsers\": 0, \"prepaidDeploymentLimit\": 0, \"maxDeploymentLimit\": 0, \"timestamp_end\": \"2023-01-30 17:00:31.502487\"}]}",
    "DataRobotLogger_filter_disallow": false,
    "DataRobotLogger_filter_aliases": false,
    "DataRobotLogger_keys_disallow": [
      "all_healthy",
      "failed_checks",
      "check_groups",
      "health_check_results"
    ]
  }
}
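Because the output is a sequence of JSON records, a short script can reduce it to a report. The sketch below assumes the captured text contains only JSON objects (no plain-text lines mixed in) and relies on the field names shown in the sample output above. Note that `health_check_results` is itself a JSON-encoded string, so it is decoded a second time. The helper names `iter_json_objects` and `final_report` are hypothetical, and the sample input is an abbreviated stand-in for real output.

```python
import json
from typing import Iterator


def iter_json_objects(text: str) -> Iterator[dict]:
    """Yield each object from a stream of concatenated JSON objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not accept leading whitespace, so skip it first.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def final_report(log_text: str) -> dict:
    """Extract the summary carried by the 'Kavmon checks finished' record."""
    for record in iter_json_objects(log_text):
        if record.get("@message") == "Kavmon checks finished":
            fields = record["@fields"]
            return {
                "all_healthy": fields["all_healthy"],
                "failed_checks": fields["failed_checks"],
                # The results are stored as a JSON string inside the record.
                "results": json.loads(fields["health_check_results"]),
            }
    raise ValueError("no 'Kavmon checks finished' record found")


# Abbreviated sample built to mirror the structure of the output above.
sample = "\n".join([
    json.dumps({"@message": "Kavmon starting", "@fields": {"levelname": "INFO"}}, indent=2),
    json.dumps({
        "@message": "Kavmon checks finished",
        "@fields": {
            "all_healthy": True,
            "failed_checks": "",
            "health_check_results": json.dumps(
                {"info": [{"name": "License Valid and Non-Expired", "service": "app", "healthy": True}]}
            ),
        },
    }, indent=2),
])

report = final_report(sample)
for group, checks in report["results"].items():
    for check in checks:
        status = "healthy" if check["healthy"] else "UNHEALTHY"
        print(f"{group}/{check['service']}: {check['name']} is {status}")
```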