Exporting Kubernetes Availability Monitor metrics¶
Kavmon automatically sends its metrics to the OpenTelemetry (OTEL) collector when telemetry is enabled for the DataRobot namespace. Kavmon emits metrics as gauges that can be collected by any OTLP-compatible endpoint and then exported to your observability stack (including Prometheus).
OpenTelemetry Integration¶
Kavmon integrates with the OpenTelemetry standard to export metrics. The metrics are sent as gauges with labels that describe the check being performed. This allows for flexible collection, aggregation, and visualization in your monitoring system.
datarobot-kavmon jobs are ephemeral: the scheduled jobs do not run long enough to be scraped by traditional pull-based monitoring. By pushing metrics to an OTEL collector, Kavmon can report its health check results even for short-lived jobs.
Configure Kavmon¶
Kavmon will automatically send metrics when OpenTelemetry is enabled for the DataRobot namespace. For details on enabling and configuring OpenTelemetry, see the Observability with OpenTelemetry section.
Additional Labels Configuration¶
To attach additional custom labels to all Kavmon metrics, set the AM_OTEL_LABELS environment variable to a JSON object of key-value pairs:
core:
  config_env_vars:
    AM_OTEL_LABELS: '{"environment": "production", "cluster": "us-east-1", "team": "platform"}'
These labels will be merged with the default Kavmon labels (check_group, check_name, check_service, job_name) and included in all metrics.
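A malformed AM_OTEL_LABELS value is easier to catch before the release is upgraded. The following is a minimal Python sketch, not Kavmon's implementation, that validates such a value and illustrates the merge with the default labels. The function names `parse_extra_labels` and `merge_labels` are hypothetical, and two behaviors are assumptions: that label values must be strings, and that default labels win when a custom key collides with one.

```python
import json


def parse_extra_labels(raw):
    """Parse an AM_OTEL_LABELS-style string into a dict of labels."""
    if not raw:
        return {}
    parsed = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(parsed, dict):
        raise ValueError("AM_OTEL_LABELS must be a JSON object")
    for key, value in parsed.items():
        # Assumption: only string values are accepted as label values.
        if not isinstance(value, str):
            raise ValueError(f"label {key!r} must have a string value")
    return parsed


def merge_labels(default_labels, extra_labels):
    """Merge custom labels with defaults.

    Assumption: defaults take precedence on a key collision.
    """
    merged = dict(extra_labels)
    merged.update(default_labels)
    return merged


if __name__ == "__main__":
    raw = '{"environment": "production", "cluster": "us-east-1", "team": "platform"}'
    defaults = {
        "check_group": "info",
        "check_name": "License Valid and Non-Expired",
        "check_service": "app",
        "job_name": "kavmon",
    }
    print(merge_labels(defaults, parse_extra_labels(raw)))
```

Running the value through a check like this before editing the values file catches quoting mistakes, such as a JSON array or an unterminated string, that would otherwise only surface at runtime.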
To apply the changes to the cluster, upgrade the release with the helm upgrade command:
helm upgrade --install datarobot-core charts/datarobot-generic-X.X.X.tgz --wait --namespace DATAROBOT_NAMESPACE --values 02_dr-app_values.yaml --timeout 5m --debug
Metrics output format¶
When datarobot-kavmon submits metrics to an OTEL collector, the metrics are emitted as gauges with the following structure:
Metric Labels¶
Each metric includes the following labels:
| Label | Description |
|---|---|
| check_group | The name of the group to which the check belongs |
| check_name | The name of the check |
| check_service | The name of the service that is being checked |
| job_name | The name of the Kavmon job (e.g., "kavmon") |
| custom labels | Any additional labels configured via AM_OTEL_LABELS |
Metric Values¶
The health status of each check is represented by the gauge value:
- 0 means the check is UNHEALTHY
- 1 means the check is HEALTHY
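Because an unhealthy check is reported as a gauge value of 0, an alert or dashboard query can select the series that equal 0. The Python sketch below builds such a PromQL expression from a set of label matchers. The helper names are hypothetical, and the default metric name `healthy` is taken from the example Prometheus output in this document; the actual name may differ depending on how your exporter is configured.

```python
def escape_label_value(value):
    """Escape backslashes, double quotes, and newlines for a quoted label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def unhealthy_selector(metric="healthy", **labels):
    """Build a PromQL expression matching series of `metric` whose value is 0."""
    matchers = ",".join(
        f'{name}="{escape_label_value(value)}"'
        for name, value in sorted(labels.items())
    )
    selector = f"{metric}{{{matchers}}}" if matchers else metric
    return f"{selector} == 0"


if __name__ == "__main__":
    # All unhealthy checks reported by Kavmon.
    print(unhealthy_selector(job_name="kavmon"))
    # Only unhealthy checks in one group.
    print(unhealthy_selector(check_group="dbHealth", job_name="kavmon"))
```

Matchers are sorted by label name so the generated expression is stable across runs, which helps when the output is stored in version-controlled alert rules.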
Example Metrics¶
When exported to Prometheus format (via an OTEL collector with Prometheus exporter), the metrics will look similar to:
# HELP healthy Availability monitor metric healthy
# TYPE healthy gauge
healthy{check_group="datarobotQueue",check_name="Test RabbitMQ exchanges and bindings",check_service="rabbit",job_name="kavmon"} 1
healthy{check_group="dbHealth",check_name="Mongo Sync Status",check_service="mongo",job_name="kavmon"} 0
healthy{check_group="dbHealth",check_name="Rabbit Vhosts Exist",check_service="rabbit",job_name="kavmon"} 1
healthy{check_group="dbHealth",check_name="Redis connection check",check_service="redis",job_name="kavmon"} 1
healthy{check_group="dependency_check",check_name="Mongo dependency check",check_service="mongo",job_name="kavmon"} 1
healthy{check_group="info",check_name="Count DSS Workers",check_service="datasetsserviceapi",job_name="kavmon"} 1
healthy{check_group="info",check_name="Count active UI sessions",check_service="internalapi",job_name="kavmon"} 1
healthy{check_group="info",check_name="Failed Kubernetes not ready pods Job",check_service="kubernetes",job_name="kavmon"} 0
healthy{check_group="info",check_name="License Valid and Non-Expired",check_service="app",job_name="kavmon"} 1
healthy{check_group="info",check_name="Status of test job run by modmonscheduler every minute",check_service="modmonscheduler",job_name="kavmon"} 1
healthy{check_group="testJobs",check_name="Apps Builder Workers check",check_service="appsbuilderapi",job_name="kavmon"} 1
healthy{check_group="testJobs",check_name="Kubeworkers Ready",check_service="execmanagerkubeworkers",job_name="kavmon"} 1
healthy{check_group="testJobs",check_name="Notifications Broker Publish Check",check_service="notificationsbroker",job_name="kavmon"} 1
healthy{check_group="testJobs",check_name="Quick Worker Ping Job",check_service="internalapi",job_name="kavmon"} 1
If you configured additional labels via AM_OTEL_LABELS, they will also appear in each metric:
healthy{check_group="info",check_name="License Valid and Non-Expired",check_service="app",job_name="kavmon",environment="production",cluster="us-east-1",team="platform"} 1
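As a rough illustration of consuming this output, the Python sketch below parses Prometheus text-format lines like the ones above and lists the checks whose gauge is 0. The function names are hypothetical and not part of Kavmon. The regex-based parser is deliberately simple: it skips comment lines, ignores anything after the sample value, and does not unescape label values, so treat it as a starting point and not a full exposition-format parser.

```python
import re

# metric_name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# One label pair; the value may contain backslash-escaped characters.
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric_name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def unhealthy_checks(text, metric="healthy"):
    """Return the label sets of all `metric` samples whose value is 0."""
    return [
        labels
        for name, labels, value in parse_samples(text)
        if name == metric and value == 0
    ]


if __name__ == "__main__":
    sample = (
        "# TYPE healthy gauge\n"
        'healthy{check_group="dbHealth",check_name="Mongo Sync Status",'
        'check_service="mongo",job_name="kavmon"} 0\n'
        'healthy{check_group="dbHealth",check_name="Redis connection check",'
        'check_service="redis",job_name="kavmon"} 1\n'
    )
    for labels in unhealthy_checks(sample):
        print(labels["check_group"], "/", labels["check_name"])
```

Custom labels configured via AM_OTEL_LABELS come back in the same dictionary as the default labels, so the result can be filtered by, for example, `environment` or `cluster` without changing the parser.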