The Kubernetes Availability Monitor
DataRobot 10.X.X ships with a set of dedicated monitoring jobs: the Kubernetes Availability Monitor services.
Kavmon (Kubernetes Availability Monitor) is a CLI application that replaces and improves on the DataRobot availability-monitor (AVMon) service used in prior DataRobot versions. It re-implements a subset of the checks developed for AVMon and adds new checks that are specific to Kubernetes environments. The implemented checks are listed in the Check Definitions sections for reference.
Kavmon is implemented as a new command-line tool that is shipped in the datarobot-runtime image.
Kavmon is designed to run as native Kubernetes cron jobs, so check schedules are tuned in each cron job's configuration. When troubleshooting a running cluster, an operator can also exec into a running container and use the Kavmon CLI to run checks on demand in the foreground, or trigger an unscheduled run of a specific cron job.
Since Kavmon is an on-demand CLI tool, it doesn't include any storage backend. By default, Kavmon sends metrics to the OTEL collector if telemetry is enabled for the DataRobot namespace. See the Observability with OpenTelemetry section for the available configuration. Additionally, Kavmon logs can be used to produce check reports with an external observability stack, such as ELK.
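The unscheduled run mentioned above can be sketched with `kubectl create job --from=cronjob/...`, which creates a one-off Job from a cron job's template. This is a minimal sketch, not part of Kavmon itself: the cron job name `kavmon-info`, the helper names, and the job naming scheme are hypothetical, so list the real cron jobs in your namespace (`kubectl get cronjobs`) before adapting it.

```python
import subprocess
import time


def build_manual_run_command(cronjob: str, namespace: str, job_name: str) -> list[str]:
    """Return the kubectl argv that creates a one-off Job from a CronJob template."""
    return [
        "kubectl", "create", "job", job_name,
        f"--from=cronjob/{cronjob}",
        "-n", namespace,
    ]


def trigger_manual_run(cronjob: str, namespace: str) -> str:
    """Create the Job and return its name; requires kubectl access to the cluster."""
    # Job names must be unique in a namespace, so append a timestamp.
    job_name = f"{cronjob}-manual-{int(time.time())}"
    subprocess.run(build_manual_run_command(cronjob, namespace, job_name), check=True)
    return job_name


if __name__ == "__main__":
    # "kavmon-info" and "DR_CORE_NAMESPACE" are placeholders, not real object names.
    print(" ".join(build_manual_run_command("kavmon-info", "DR_CORE_NAMESPACE", "kavmon-info-manual-1")))
```

Keeping the argv construction separate from the `subprocess.run` call makes it possible to review the exact command before anything is created in the cluster.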
Querying The Kubernetes Availability Monitor
The Kavmon CLI provides a single subcommand, check:
$ datarobot-kavmon check --help
Usage: run.py check [OPTIONS]

  Run checks based on provided filters in the running cluster.

  Example: `kavmon check -g info -s datasetsserviceapi`

  The above command will execute all checks that belong to group ‘info’ and
  service ‘datasetsserviceapi’...

Options:
  -g, --group TEXT           Filter by check 'group' (can specify multiple
                             groups separated by comma). [default: ]
  -n, --name TEXT            Filter by check 'name' (can specify multiple
                             names separated by comma). [default: ]
  -s, --service TEXT         Filter by check 'service' (can specify multiple
                             services separated by comma). [default: ]
  -w, --num-workers INTEGER  Number of parallel workers to use (default: use
                             all available CPUs). [default: 0]
  -t, --timeout INTEGER      Override timeout for all checks, in seconds
                             [default: 0]
  --cronjob                  Control exit code of 'kavmon check' script.
                             Always return exit code 0.
  -l, --log-level TEXT       Logging level for kavmon cli and kavmon checks.
  --no-metrics               Force checks not generating metrics. [default:
                             False]
  --help                     Show this message and exit.
The command should be executed in a container running the datarobot-runtime image inside the Kubernetes cluster. It executes the requested checks in parallel, by default using all CPUs available to the Kavmon pod; the number of workers can be tuned with the --num-workers argument.
The checks to execute can be selected with the --group, --name, and --service filters, alone or in combination; if no filters are provided, all checks are performed.
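As an illustration of how the filters combine, the sketch below assembles a `datarobot-kavmon check` command line from lists of groups, names, and services. Only flags shown in the help text are used; the helper name `build_check_command` is hypothetical, and the comma-joined values follow the help text's statement that multiple values are separated by commas.

```python
def build_check_command(groups=(), names=(), services=(), num_workers=None, timeout=None):
    """Assemble a `datarobot-kavmon check` argv from filter values."""
    argv = ["datarobot-kavmon", "check"]
    for flag, values in (("-g", groups), ("-n", names), ("-s", services)):
        if values:
            # Per the help text, multiple values are passed comma-separated.
            argv += [flag, ",".join(values)]
    if num_workers is not None:
        argv += ["-w", str(num_workers)]
    if timeout is not None:
        argv += ["-t", str(timeout)]
    return argv


if __name__ == "__main__":
    # Reproduces the example from the help text (group 'info', one service).
    print(" ".join(build_check_command(groups=["info"], services=["datasetsserviceapi"])))
    # With no filters the command selects every check.
    print(" ".join(build_check_command()))
```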
By default, Kavmon also sends job metrics to the configured metrics system; this can be disabled by supplying the --no-metrics argument. To view only the final output of the checks (in a format closely matching the JSON output of AVMon), set the log level to warning or critical with the -l argument.
The Kavmon CLI exits with code 0 if all checks are healthy and with code 1 if any check is not. If the --cronjob argument is supplied, the CLI always exits with code 0; this avoids constantly restarting Kubernetes cron jobs when checks fail.
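The exit-code contract above lends itself to simple scripting. The following is a minimal sketch of a wrapper that maps the exit code to a boolean; the function name `checks_healthy` is hypothetical, and treating any code other than 0 or 1 as an error is an assumption, since only those two codes are documented here. The demo uses a stand-in command so that it runs outside a DataRobot container.

```python
import subprocess
import sys


def checks_healthy(argv: list[str]) -> bool:
    """Run a check command and translate its exit code into a health verdict."""
    if "--cronjob" in argv:
        # With --cronjob the CLI always exits 0, so the exit code says nothing.
        raise ValueError("--cronjob forces exit code 0; drop it to read the result")
    returncode = subprocess.run(argv).returncode
    if returncode == 0:
        return True   # all checks healthy
    if returncode == 1:
        return False  # at least one check unhealthy
    # Assumption: anything else (for example a usage error) is not a check result.
    raise RuntimeError(f"unexpected exit code {returncode}")


if __name__ == "__main__":
    # Stand-in command; inside a datarobot-runtime container you would pass
    # ["datarobot-kavmon", "check", "-s", "app"] instead.
    stand_in = [sys.executable, "-c", "raise SystemExit(1)"]
    print("healthy" if checks_healthy(stand_in) else "unhealthy")
```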
Command examples
| Command | Description |
|---|---|
| datarobot-kavmon check -s app | Execute checks for the health of the main application. |
| datarobot-kavmon check -g dbHealth | Execute checks belonging to group dbHealth. |
| datarobot-kavmon check -g info -s datasetsserviceapi | Execute checks belonging to group info and service datasetsserviceapi. |
| datarobot-kavmon | Execute all checks. |
To run these commands manually, follow the steps below from a host that has access to the cluster and to the namespace in which DataRobot is installed.
Connect to a pod shell
First, connect to a pod that uses the datarobot-runtime image. This example uses the mmapp-app service's pod; the alphanumeric suffix of the pod name will be different in your environment.
# find the pod name
$ kubectl get pods -n DR_CORE_NAMESPACE | grep mmapp
mmapp-app-5d45968868-h9924 1/1 Running 0 63m
# connect to a shell on the running pod
$ kubectl exec -ti mmapp-app-5d45968868-h9924 -n DR_CORE_NAMESPACE -- /entrypoint bash
WARNING: Generating random home
[INFO quantum_env.activate] Running activate script
[INFO quantum_env.activate] quantum_env 3.1.4 - Extended Virtualenv Support for DataRobot
bash-4.4$
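Because the pod name suffix changes with every rollout, the lookup above can be scripted. This is a minimal sketch that picks a Running pod from `kubectl get pods` output and builds the matching `kubectl exec` command; the helper names are hypothetical, and the parsing assumes the default column layout (NAME, READY, STATUS, ...).

```python
def find_running_pod(kubectl_output: str, name_prefix: str):
    """Return the first Running pod whose name starts with name_prefix, or None."""
    for line in kubectl_output.splitlines():
        columns = line.split()
        # Default `kubectl get pods` columns: NAME READY STATUS RESTARTS AGE
        if len(columns) >= 3 and columns[0].startswith(name_prefix) and columns[2] == "Running":
            return columns[0]
    return None


def build_exec_command(pod: str, namespace: str) -> list[str]:
    """Return the kubectl argv that opens a shell in the pod via /entrypoint."""
    return ["kubectl", "exec", "-ti", pod, "-n", namespace, "--", "/entrypoint", "bash"]


if __name__ == "__main__":
    # Sample line taken from the session above.
    sample = "mmapp-app-5d45968868-h9924 1/1 Running 0 63m\n"
    pod = find_running_pod(sample, "mmapp-app")
    print(" ".join(build_exec_command(pod, "DR_CORE_NAMESPACE")))
```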
Execute the commands
Next, invoke the datarobot-kavmon CLI. This example requests the status of the app service.
datarobot-kavmon check -s app
The expected output will look something like this:
{
"@message": "Kavmon starting",
"@timestamp": "2023-01-30T17:00:31.455233Z",
"@source_host": "mmapp-app-5d45968868-h9924",
"@fields": {
"name": "__main__",
"args": [],
"levelname": "INFO",
"levelno": 20,
"pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/cli/run.py",
"filename": "run.py",
"module": "run",
"stack_info": null,
"lineno": 127,
"funcName": "check",
"created": 1675098031.4551651,
"msecs": 455.16514778137207,
"relativeCreated": 3607.522964477539,
"thread": 139640676325184,
"threadName": "MainThread",
"processName": "MainProcess",
"process": 754,
"datarobot_service_id": "default",
"arguments": {
"group": "",
"name": "",
"service": "app",
"num_workers": 0,
"timeout": 0,
"cronjob": false,
"log_level": "info",
"no_metrics": false,
"provided_log_level": 20
},
"DataRobotLogger_filter_disallow": false,
"DataRobotLogger_filter_aliases": false,
"DataRobotLogger_keys_disallow": [
"arguments"
]
}
}
{
"@message": "Picked checks based on provided filters",
"@timestamp": "2023-01-30T17:00:31.455453Z",
"@source_host": "mmapp-app-5d45968868-h9924",
"@fields": {
"name": "support.kavmon.cli.run_manager",
"args": [],
"levelname": "INFO",
"levelno": 20,
"pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/cli/run_manager.py",
"filename": "run_manager.py",
"module": "run_manager",
"stack_info": null,
"lineno": 89,
"funcName": "get_matched_checks",
"created": 1675098031.4554243,
"msecs": 455.42430877685547,
"relativeCreated": 3607.7821254730225,
"thread": 139640676325184,
"threadName": "MainThread",
"processName": "MainProcess",
"process": 754,
"datarobot_service_id": "default",
"health_checks": [
"License Valid and Non-Expired"
],
"DataRobotLogger_filter_disallow": false,
"DataRobotLogger_filter_aliases": false,
"DataRobotLogger_keys_disallow": [
"health_checks"
]
}
}
{
"@message": "Health check execution started",
"@timestamp": "2023-01-30T17:00:31.465817Z",
"@source_host": "mmapp-app-5d45968868-h9924",
"@fields": {
"name": "support.kavmon.base.health_check",
"args": [],
"levelname": "INFO",
"levelno": 20,
"pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/base/health_check.py",
"filename": "health_check.py",
"module": "health_check",
"stack_info": null,
"lineno": 94,
"funcName": "run_wrapper",
"created": 1675098031.4656253,
"msecs": 465.6252861022949,
"relativeCreated": 3617.983102798462,
"thread": 139640676325184,
"threadName": "MainThread",
"processName": "ForkProcess-1",
"process": 818,
"datarobot_service_id": "default",
"health_check": "License Valid and Non-Expired",
"DataRobotLogger_filter_disallow": false,
"DataRobotLogger_filter_aliases": false,
"DataRobotLogger_keys_disallow": [
"health_check"
]
}
}
{
"@message": "Health check execution finished",
"@timestamp": "2023-01-30T17:00:31.502575Z",
"@source_host": "mmapp-app-5d45968868-h9924",
"@fields": {
"name": "support.kavmon.base.health_check",
"args": [],
"levelname": "INFO",
"levelno": 20,
"pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/base/health_check.py",
"filename": "health_check.py",
"module": "health_check",
"stack_info": null,
"lineno": 140,
"funcName": "run_wrapper",
"created": 1675098031.5025504,
"msecs": 502.5503635406494,
"relativeCreated": 3654.9081802368164,
"thread": 139640676325184,
"threadName": "MainThread",
"processName": "ForkProcess-1",
"process": 818,
"datarobot_service_id": "default",
"health_check": "License Valid and Non-Expired",
"healthy": true,
"timed_out": false,
"health_check_group": "info",
"health_check_service": "app",
"health_check_result": "{\"group\": \"info\", \"name\": \"License Valid and Non-Expired\", \"service\": \"app\", \"healthy\": true, \"timestamp_start\": \"2023-01-30 17:00:31.466165\", \"concurrentWorkersCount\": 500, \"expirationTimestamp\": \"2023-02-03T16:26:16.000000Z\", \"expired\": false, \"maximumActiveUsers\": 0, \"prepaidDeploymentLimit\": 0, \"maxDeploymentLimit\": 0, \"timestamp_end\": \"2023-01-30 17:00:31.502487\"}",
"DataRobotLogger_filter_disallow": false,
"DataRobotLogger_filter_aliases": false,
"DataRobotLogger_keys_disallow": [
"health_check",
"healthy",
"timed_out",
"health_check_group",
"health_check_service",
"health_check_result"
]
}
}
{
"@message": "Check succeeded.",
"@timestamp": "2023-01-30T17:00:31.502700Z",
"@source_host": "mmapp-app-5d45968868-h9924",
"@fields": {
"name": "support.kavmon.base.health_check",
"args": [],
"levelname": "INFO",
"levelno": 20,
"pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/base/health_check.py",
"filename": "health_check.py",
"module": "health_check",
"stack_info": null,
"lineno": 143,
"funcName": "run_wrapper",
"created": 1675098031.5026789,
"msecs": 502.67887115478516,
"relativeCreated": 3655.036687850952,
"thread": 139640676325184,
"threadName": "MainThread",
"processName": "ForkProcess-1",
"process": 818,
"datarobot_service_id": "default",
"health_check": "License Valid and Non-Expired",
"DataRobotLogger_filter_disallow": false,
"DataRobotLogger_filter_aliases": false,
"DataRobotLogger_keys_disallow": [
"health_check"
]
}
}
{
"@message": "Finished processing all health checks. All checks were healthy.",
"@timestamp": "2023-01-30T17:00:31.512450Z",
"@source_host": "mmapp-app-5d45968868-h9924",
"@fields": {
"name": "support.kavmon.cli.run_manager",
"args": [],
"levelname": "INFO",
"levelno": 20,
"pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/cli/run_manager.py",
"filename": "run_manager.py",
"module": "run_manager",
"stack_info": null,
"lineno": 60,
"funcName": "run_check_manager",
"created": 1675098031.5123436,
"msecs": 512.3436450958252,
"relativeCreated": 3664.701461791992,
"thread": 139640676325184,
"threadName": "MainThread",
"processName": "MainProcess",
"process": 754,
"datarobot_service_id": "default",
"DataRobotLogger_filter_disallow": false,
"DataRobotLogger_filter_aliases": false
}
}
{
"@message": "Kavmon checks finished",
"@timestamp": "2023-01-30T17:00:31.512757Z",
"@source_host": "mmapp-app-5d45968868-h9924",
"@fields": {
"name": "support.kavmon.cli.run_manager",
"args": [],
"levelname": "INFO",
"levelno": 20,
"pathname": "/opt/datarobot-runtime/app/DataRobot/support/kavmon/cli/run_manager.py",
"filename": "run_manager.py",
"module": "run_manager",
"stack_info": null,
"lineno": 70,
"funcName": "run_check_manager",
"created": 1675098031.5127304,
"msecs": 512.7303600311279,
"relativeCreated": 3665.088176727295,
"thread": 139640676325184,
"threadName": "MainThread",
"processName": "MainProcess",
"process": 754,
"datarobot_service_id": "default",
"all_healthy": true,
"failed_checks": "",
"check_groups": "info",
"health_check_results": "{\"info\": [{\"name\": \"License Valid and Non-Expired\", \"service\": \"app\", \"healthy\": true, \"timestamp_start\": \"2023-01-30 17:00:31.466165\", \"concurrentWorkersCount\": 500, \"expirationTimestamp\": \"2023-02-03T16:26:16.000000Z\", \"expired\": false, \"maximumActiveUsers\": 0, \"prepaidDeploymentLimit\": 0, \"maxDeploymentLimit\": 0, \"timestamp_end\": \"2023-01-30 17:00:31.502487\"}]}",
"DataRobotLogger_filter_disallow": false,
"DataRobotLogger_filter_aliases": false,
"DataRobotLogger_keys_disallow": [
"all_healthy",
"failed_checks",
"check_groups",
"health_check_results"
]
}
}
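The output above is a sequence of JSON objects, and the final "Kavmon checks finished" record carries the overall verdict, with `health_check_results` stored as a JSON-encoded string inside the record. The following is a minimal sketch for extracting that summary from captured output; the function names are hypothetical, and it assumes the captured text contains only the JSON records (no interleaved shell banner lines).

```python
import json


def iter_log_records(text: str):
    """Yield each JSON object from a stream of concatenated, pretty-printed objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        record, pos = decoder.raw_decode(text, pos)
        yield record


def summarize_run(text: str) -> dict:
    """Return the overall verdict and the names of unhealthy checks."""
    for record in iter_log_records(text):
        if record.get("@message") == "Kavmon checks finished":
            fields = record["@fields"]
            # health_check_results is itself a JSON string keyed by check group.
            results = json.loads(fields["health_check_results"])
            unhealthy = [
                check["name"]
                for checks in results.values()
                for check in checks
                if not check["healthy"]
            ]
            return {"all_healthy": fields["all_healthy"], "unhealthy": unhealthy}
    raise ValueError("no 'Kavmon checks finished' record found")
```

Calling `summarize_run` on the captured output of `datarobot-kavmon check -s app` would return a small dictionary with the `all_healthy` flag and a list of failing check names, which is easier to act on than the full log stream.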