Kubernetes Availability Monitor

DataRobot 10.x.x ships with a series of dedicated monitoring jobs. These are the Kubernetes Availability Monitor services.

Kavmon (Kubernetes Availability Monitor) is a CLI application that replaces and improves on the DataRobot availability-monitor (AVMon) service used in prior DataRobot versions. It re-implements a subset of the checks developed for AVMon and adds new checks that are specific to a Kubernetes environment. The implemented checks are listed in the Check Definitions sections for reference.

Kavmon is implemented as a new command-line tool that's shipped in the datarobot-runtime image.

Kavmon is designed to run as native Kubernetes cron jobs, and check schedules can be tuned in the cron job configuration. When troubleshooting a running cluster, an operator can also use the Kavmon CLI tool (by exec-ing into a running container) to run checks on demand in the foreground, or trigger an unscheduled run of a specific cron job.
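As a minimal sketch of triggering an unscheduled run, the helper below assembles the standard `kubectl create job --from=cronjob/...` command. The helper name `manual_run_command` and the CronJob name `kavmon-info` are hypothetical and used for illustration only; list the actual cron jobs in your DataRobot namespace with `kubectl get cronjobs -n DR_CORE_NAMESPACE` before using it.

```python
import time


def manual_run_command(cronjob, namespace, job_name=None):
    """Build the kubectl argv that creates a one-off Job from a CronJob."""
    if job_name is None:
        # Job names must be unique within a namespace, so add a timestamp suffix.
        job_name = f"{cronjob}-manual-{int(time.time())}"
    return [
        "kubectl", "create", "job", job_name,
        f"--from=cronjob/{cronjob}",
        "-n", namespace,
    ]


# "kavmon-info" is a hypothetical CronJob name; substitute a real one.
cmd = manual_run_command("kavmon-info", "DR_CORE_NAMESPACE")
print(" ".join(cmd))
# To create the Job from a host with cluster access:
#   import subprocess; subprocess.run(cmd, check=True)
```

The sketch only prints the command, so it is safe to run anywhere; actually creating the Job requires `kubectl` and access to the cluster.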

Since Kavmon is an on-demand CLI tool, it doesn't include a storage backend. By default, Kavmon sends metrics to the OTEL collector if telemetry is enabled for the DataRobot namespace. See the Observability with OpenTelemetry section for the available configuration. Additionally, Kavmon logs can be used to produce check reports with an external observability stack, such as ELK.

Querying the Kubernetes availability monitor

The Kavmon CLI provides a single subcommand, check:

$ datarobot-kavmon check --help
Usage: run.py check [OPTIONS]
  Run checks based on provided filters in the running cluster.
  Example:     `kavmon check -g info -s datasetsserviceapi`
  The above command will execute all checks that belong to group ‘info’ and
  service ‘datasetsserviceapi’...
Options:
  -g, --group TEXT           Filter by check 'group' (can specify multiple
                             groups separated by comma).  [default: ]
  -n, --name TEXT            Filter by check 'name' (can specify multiple
                             names separated by comma).  [default: ]
  -s, --service TEXT         Filter by check 'service' (can specify multiple
                             services separated by comma).  [default: ]
  -w, --num-workers INTEGER  Number of parallel workers to use (default: use
                             all available CPUs).  [default: 0]
  -t, --timeout INTEGER      Override timeout for all checks, in seconds
                             [default: 0]
  --cronjob                  Control exit code of 'kavmon check' script.
                             Always return exit code 0.
  -l, --log-level TEXT       Logging level for kavmon cli and kavmon checks.
  --no-metrics               Force checks not generating metrics.  [default:
                             False]
  --help                     Show this message and exit.

Execute the command in a container running the datarobot-runtime image inside the Kubernetes cluster. It runs the requested checks in parallel, using all CPUs available to the Kavmon pod by default; the number of workers can be tuned with the --num-workers argument.

Checks to execute can be filtered by --group, --name, and --service filters or their combinations; if no filters are provided, all checks are performed.
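The filter flags can also be assembled programmatically. The sketch below only builds an argument list from the `-g`, `-n`, and `-s` flags shown in the help output; the helper name `build_check_command` is hypothetical and not part of Kavmon.

```python
def build_check_command(groups=(), names=(), services=()):
    """Assemble a `datarobot-kavmon check` argv from optional filters."""
    cmd = ["datarobot-kavmon", "check"]
    # Each filter accepts several values separated by commas.
    for flag, values in (("-g", groups), ("-n", names), ("-s", services)):
        if values:
            cmd += [flag, ",".join(values)]
    return cmd


print(build_check_command(groups=["info"], services=["datasetsserviceapi"]))
# ['datarobot-kavmon', 'check', '-g', 'info', '-s', 'datasetsserviceapi']

# With no filters the command runs every check.
print(build_check_command())
# ['datarobot-kavmon', 'check']
```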

By default, Kavmon also sends job metrics to a configured metrics system; you can disable this by supplying the --no-metrics argument. To view only the final output of the checks (in a format closely matching JSON outputs of AVMon), set the log level to warning or critical by using the -l argument.

Kavmon CLI exits with code 0 if all checks are healthy and code 1 if some aren't. If the --cronjob argument is supplied, the CLI always exits with code 0; this avoids constantly restarting Kubernetes cron jobs when checks fail.
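A wrapper script can rely on this exit-code contract. The following is a minimal sketch, assuming the command is run somewhere `datarobot-kavmon` is on the PATH; the function name `checks_healthy` is hypothetical, and the demonstration uses stand-in Python subprocesses so that it runs without a cluster.

```python
import subprocess
import sys


def checks_healthy(cmd):
    """Run a Kavmon command and map its exit code to a boolean.

    Exit code 0 means all checks were healthy; any other code is
    treated as unhealthy. With --cronjob the exit code is always 0,
    so health cannot be inferred and the call is rejected.
    """
    if "--cronjob" in cmd:
        raise ValueError("--cronjob forces exit code 0; health cannot be inferred")
    completed = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return completed.returncode == 0


# Stand-in commands that simply exit with a given code. On a real pod, pass
# ["datarobot-kavmon", "check", "-g", "dbHealth"] instead.
print(checks_healthy([sys.executable, "-c", "raise SystemExit(0)"]))  # True
print(checks_healthy([sys.executable, "-c", "raise SystemExit(1)"]))  # False
```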

Command examples

Command	Description
`datarobot-kavmon check -s app`	Execute checks for the health of the main application.
`datarobot-kavmon check -g dbHealth`	Execute checks belonging to group `dbHealth`.
`datarobot-kavmon check -g info -s datasetsserviceapi`	Execute checks belonging to group `info` and service `datasetsserviceapi`.
`datarobot-kavmon check`	Execute all checks.

To invoke these commands manually, follow the steps below from a host that has access to the cluster and to the namespace in which DataRobot is installed.

Connect to a pod shell

Connect to a pod that runs the datarobot-runtime image. This example uses the mmapp-app service's pod. The generated suffix at the end of the pod name will be different in your environment.

# find the pod name
$ kubectl get pods -n DR_CORE_NAMESPACE | grep mmapp
mmapp-app-5d45968868-h9924                                        1/1     Running     0             63m

# connect to a shell on the running pod
$ kubectl exec -ti mmapp-app-5d45968868-h9924 -n DR_CORE_NAMESPACE -- /entrypoint bash
WARNING: Generating random home
[INFO quantum_env.activate] Running activate script
[INFO quantum_env.activate] quantum_env 3.1.4 - Extended Virtualenv Support for DataRobot
bash-4.4$

Execute the commands

Invoke the datarobot-kavmon CLI. In this example, the status of the app service is requested.

datarobot-kavmon check -s app

The expected output looks something like this:

{
  "@message": "Kavmon starting",
  "@timestamp": "2023-01-30T17:00:31.455233Z",
  "@source_host": "mmapp-app-5d45968868-h9924",
  "@fields": {
    "name": "__main__",
    "args": [],
    "levelname": "INFO",
    "levelno": 20,
    "pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/cli/run.py",
    "filename": "run.py",
    "module": "run",
    "stack_info": null,
    "lineno": 127,
    "funcName": "check",
    "created": 1675098031.4551651,
    "msecs": 455.16514778137207,
    "relativeCreated": 3607.522964477539,
    "thread": 139640676325184,
    "threadName": "MainThread",
    "processName": "MainProcess",
    "process": 754,
    "datarobot_service_id": "default",
    "arguments": {
      "group": "",
      "name": "",
      "service": "app",
      "num_workers": 0,
      "timeout": 0,
      "cronjob": false,
      "log_level": "info",
      "no_metrics": false,
      "provided_log_level": 20
    },
    "DataRobotLogger_filter_disallow": false,
    "DataRobotLogger_filter_aliases": false,
    "DataRobotLogger_keys_disallow": [
      "arguments"
    ]
  }
}
{
  "@message": "Picked checks based on provided filters",
  "@timestamp": "2023-01-30T17:00:31.455453Z",
  "@source_host": "mmapp-app-5d45968868-h9924",
  "@fields": {
    "name": "support.kavmon.cli.run_manager",
    "args": [],
    "levelname": "INFO",
    "levelno": 20,
    "pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/cli/run_manager.py",
    "filename": "run_manager.py",
    "module": "run_manager",
    "stack_info": null,
    "lineno": 89,
    "funcName": "get_matched_checks",
    "created": 1675098031.4554243,
    "msecs": 455.42430877685547,
    "relativeCreated": 3607.7821254730225,
    "thread": 139640676325184,
    "threadName": "MainThread",
    "processName": "MainProcess",
    "process": 754,
    "datarobot_service_id": "default",
    "health_checks": [
      "License Valid and Non-Expired"
    ],
    "DataRobotLogger_filter_disallow": false,
    "DataRobotLogger_filter_aliases": false,
    "DataRobotLogger_keys_disallow": [
      "health_checks"
    ]
  }
}
{
  "@message": "Health check execution started",
  "@timestamp": "2023-01-30T17:00:31.465817Z",
  "@source_host": "mmapp-app-5d45968868-h9924",
  "@fields": {
    "name": "support.kavmon.base.health_check",
    "args": [],
    "levelname": "INFO",
    "levelno": 20,
    "pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/base/health_check.py",
    "filename": "health_check.py",
    "module": "health_check",
    "stack_info": null,
    "lineno": 94,
    "funcName": "run_wrapper",
    "created": 1675098031.4656253,
    "msecs": 465.6252861022949,
    "relativeCreated": 3617.983102798462,
    "thread": 139640676325184,
    "threadName": "MainThread",
    "processName": "ForkProcess-1",
    "process": 818,
    "datarobot_service_id": "default",
    "health_check": "License Valid and Non-Expired",
    "DataRobotLogger_filter_disallow": false,
    "DataRobotLogger_filter_aliases": false,
    "DataRobotLogger_keys_disallow": [
      "health_check"
    ]
  }
}
{
  "@message": "Health check execution finished",
  "@timestamp": "2023-01-30T17:00:31.502575Z",
  "@source_host": "mmapp-app-5d45968868-h9924",
  "@fields": {
    "name": "support.kavmon.base.health_check",
    "args": [],
    "levelname": "INFO",
    "levelno": 20,
    "pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/base/health_check.py",
    "filename": "health_check.py",
    "module": "health_check",
    "stack_info": null,
    "lineno": 140,
    "funcName": "run_wrapper",
    "created": 1675098031.5025504,
    "msecs": 502.5503635406494,
    "relativeCreated": 3654.9081802368164,
    "thread": 139640676325184,
    "threadName": "MainThread",
    "processName": "ForkProcess-1",
    "process": 818,
    "datarobot_service_id": "default",
    "health_check": "License Valid and Non-Expired",
    "healthy": true,
    "timed_out": false,
    "health_check_group": "info",
    "health_check_service": "app",
    "health_check_result": "{\"group\": \"info\", \"name\": \"License Valid and Non-Expired\", \"service\": \"app\", \"healthy\": true, \"timestamp_start\": \"2023-01-30 17:00:31.466165\", \"concurrentWorkersCount\": 500, \"expirationTimestamp\": \"2023-02-03T16:26:16.000000Z\", \"expired\": false, \"maximumActiveUsers\": 0, \"prepaidDeploymentLimit\": 0, \"maxDeploymentLimit\": 0, \"timestamp_end\": \"2023-01-30 17:00:31.502487\"}",
    "DataRobotLogger_filter_disallow": false,
    "DataRobotLogger_filter_aliases": false,
    "DataRobotLogger_keys_disallow": [
      "health_check",
      "healthy",
      "timed_out",
      "health_check_group",
      "health_check_service",
      "health_check_result"
    ]
  }
}
{
  "@message": "Check succeeded.",
  "@timestamp": "2023-01-30T17:00:31.502700Z",
  "@source_host": "mmapp-app-5d45968868-h9924",
  "@fields": {
    "name": "support.kavmon.base.health_check",
    "args": [],
    "levelname": "INFO",
    "levelno": 20,
    "pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/base/health_check.py",
    "filename": "health_check.py",
    "module": "health_check",
    "stack_info": null,
    "lineno": 143,
    "funcName": "run_wrapper",
    "created": 1675098031.5026789,
    "msecs": 502.67887115478516,
    "relativeCreated": 3655.036687850952,
    "thread": 139640676325184,
    "threadName": "MainThread",
    "processName": "ForkProcess-1",
    "process": 818,
    "datarobot_service_id": "default",
    "health_check": "License Valid and Non-Expired",
    "DataRobotLogger_filter_disallow": false,
    "DataRobotLogger_filter_aliases": false,
    "DataRobotLogger_keys_disallow": [
      "health_check"
    ]
  }
}
{
  "@message": "Finished processing all health checks. All checks were healthy.",
  "@timestamp": "2023-01-30T17:00:31.512450Z",
  "@source_host": "mmapp-app-5d45968868-h9924",
  "@fields": {
    "name": "support.kavmon.cli.run_manager",
    "args": [],
    "levelname": "INFO",
    "levelno": 20,
    "pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/cli/run_manager.py",
    "filename": "run_manager.py",
    "module": "run_manager",
    "stack_info": null,
    "lineno": 60,
    "funcName": "run_check_manager",
    "created": 1675098031.5123436,
    "msecs": 512.3436450958252,
    "relativeCreated": 3664.701461791992,
    "thread": 139640676325184,
    "threadName": "MainThread",
    "processName": "MainProcess",
    "process": 754,
    "datarobot_service_id": "default",
    "DataRobotLogger_filter_disallow": false,
    "DataRobotLogger_filter_aliases": false
  }
}
{
  "@message": "Kavmon checks finished",
  "@timestamp": "2023-01-30T17:00:31.512757Z",
  "@source_host": "mmapp-app-5d45968868-h9924",
  "@fields": {
    "name": "support.kavmon.cli.run_manager",
    "args": [],
    "levelname": "INFO",
    "levelno": 20,
    "pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/cli/run_manager.py",
    "filename": "run_manager.py",
    "module": "run_manager",
    "stack_info": null,
    "lineno": 70,
    "funcName": "run_check_manager",
    "created": 1675098031.5127304,
    "msecs": 512.7303600311279,
    "relativeCreated": 3665.088176727295,
    "thread": 139640676325184,
    "threadName": "MainThread",
    "processName": "MainProcess",
    "process": 754,
    "datarobot_service_id": "default",
    "all_healthy": true,
    "failed_checks": "",
    "check_groups": "info",
    "health_check_results": "{\"info\": [{\"name\": \"License Valid and Non-Expired\", \"service\": \"app\", \"healthy\": true, \"timestamp_start\": \"2023-01-30 17:00:31.466165\", \"concurrentWorkersCount\": 500, \"expirationTimestamp\": \"2023-02-03T16:26:16.000000Z\", \"expired\": false, \"maximumActiveUsers\": 0, \"prepaidDeploymentLimit\": 0, \"maxDeploymentLimit\": 0, \"timestamp_end\": \"2023-01-30 17:00:31.502487\"}]}",
    "DataRobotLogger_filter_disallow": false,
    "DataRobotLogger_filter_aliases": false,
    "DataRobotLogger_keys_disallow": [
      "all_healthy",
      "failed_checks",
      "check_groups",
      "health_check_results"
    ]
  }
}
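Because the output is a stream of pretty-printed JSON records rather than a single document, it cannot be parsed with one `json.loads` call. The sketch below is one possible way to extract the final summary, assuming the captured output contains only JSON records like those above; the function names are hypothetical, and the sample is a shortened stand-in for real output. Note that `health_check_results` is itself a JSON-encoded string and needs a second decode.

```python
import json


def iter_log_records(text):
    """Yield each JSON object from a stream of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        if text[pos].isspace():
            pos += 1
            continue
        record, pos = decoder.raw_decode(text, pos)
        yield record


def final_results(text):
    """Return (all_healthy, results) from the 'Kavmon checks finished' record."""
    for record in iter_log_records(text):
        if record.get("@message") == "Kavmon checks finished":
            fields = record["@fields"]
            return fields["all_healthy"], json.loads(fields["health_check_results"])
    return None


# Shortened stand-in for real Kavmon output, built with the same field names.
inner = json.dumps(
    {"info": [{"name": "License Valid and Non-Expired", "service": "app", "healthy": True}]}
)
sample = "\n".join([
    json.dumps({"@message": "Kavmon starting", "@fields": {}}, indent=2),
    json.dumps(
        {
            "@message": "Kavmon checks finished",
            "@fields": {"all_healthy": True, "failed_checks": "", "health_check_results": inner},
        },
        indent=2,
    ),
])

all_healthy, results = final_results(sample)
print(all_healthy)                 # True
print(results["info"][0]["name"])  # License Valid and Non-Expired
```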