Exporting Kubernetes availability monitor metrics¶

Kavmon emits health check metrics as gauges that can be exported to your observability stack using one of two supported methods:

OpenTelemetry (OTel): Kavmon sends metrics to an OTLP-compatible collector, which can then forward them to any supported observability backend (including Prometheus via a remote-write exporter).
Prometheus Pushgateway: Kavmon pushes metrics directly to a Prometheus Pushgateway, where they are scraped by Prometheus.

Warning

Configure the export method that matches your observability infrastructure. If both an OTel collector and a Prometheus Pushgateway are available at the same time, Kavmon sends metrics to both, resulting in metric duplication across two collectors. All resulting metric labels are identical, so you need additional filtering to distinguish the sources. To avoid this, enable only the method appropriate for your environment.

datarobot-kavmon is ephemeral: the jobs it schedules don't run long enough for traditional pull-based monitoring. Both export methods accommodate this by pushing metrics out of the short-lived Kavmon process to a persistent store (an OTel collector or a Pushgateway).

OpenTelemetry integration¶

Kavmon integrates with the OpenTelemetry standard to export metrics. The metrics are sent as gauges with labels that describe the check being performed. This allows for flexible collection, aggregation, and visualization in your monitoring system.

Kavmon automatically sends metrics when OpenTelemetry is enabled for the DataRobot namespace. For details on enabling and configuring OpenTelemetry, see the Observability with OpenTelemetry section.

Additional labels configuration¶

You can configure additional custom labels to attach to all Kavmon OTel metrics by setting the AM_OTEL_LABELS environment variable. This accepts a JSON object with key-value pairs:

core:
  config_env_vars:
    AM_OTEL_LABELS: '{"environment": "production", "cluster": "us-east-1", "team": "platform"}'

These labels are merged with the default Kavmon labels (check_group, check_name, check_service, and job_name) and included in all metrics.
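
Because the value must be a valid JSON object, it can help to generate it rather than type it by hand. The following is a minimal sketch, not part of Kavmon: `otel_labels_json` is a hypothetical helper name, and it assumes keys and values contain no double quotes, backslashes, or control characters.

```shell
# Hypothetical helper (not part of Kavmon): build an AM_OTEL_LABELS JSON object
# from key=value arguments.
# Assumption: keys and values contain no double quotes, backslashes, or
# control characters, so no JSON escaping is needed.
otel_labels_json() {
  out=""
  for pair in "$@"; do
    key=${pair%%=*}      # text before the first "="
    value=${pair#*=}     # text after the first "="
    if [ -n "$out" ]; then
      out="$out, "       # separate entries once the first one is written
    fi
    out="$out\"$key\": \"$value\""
  done
  printf '{%s}\n' "$out"
}

otel_labels_json environment=production cluster=us-east-1 team=platform
```

The printed string can be pasted as the single-quoted AM_OTEL_LABELS value in the values file.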

To apply the changes to the cluster, run the helm upgrade command to upgrade the release:

helm upgrade --install datarobot-core charts/datarobot-generic-X.X.X.tgz --wait --namespace DATAROBOT_NAMESPACE --values 02_dr-app_values.yaml --timeout 5m --debug

Prometheus Pushgateway integration¶

If you are not using OpenTelemetry, Kavmon can push metrics directly to a Prometheus Pushgateway instead. Prometheus then scrapes the Pushgateway on a regular interval.

Note

Use this option when OpenTelemetry is not configured for the DataRobot namespace. If both methods are active simultaneously, Kavmon sends metrics to both, which duplicates metrics (see the warning above).

Configure Kavmon to use Pushgateway¶

Set the AM_PROMETHEUS_URL environment variable to point to your Prometheus Pushgateway endpoint:

core:
  config_env_vars:
    AM_PROMETHEUS_URL: "http://prometheus-pushgateway.<NAMESPACE>.svc:9091"

Replace <NAMESPACE> with the Kubernetes namespace where your Prometheus Pushgateway is deployed. For example, if the Pushgateway is in the monitoring namespace:

core:
  config_env_vars:
    AM_PROMETHEUS_URL: "http://prometheus-pushgateway.monitoring.svc:9091"

Additional labels configuration¶

You can configure additional custom grouping labels for Pushgateway metrics by setting the AM_PUSHGATEWAY_LABELS environment variable. Labels follow the Prometheus Pushgateway path format (/label_name/label_value):

core:
  config_env_vars:
    AM_PUSHGATEWAY_LABELS: "/color/blue"
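
As an illustration of the path format, the sketch below joins key=value arguments into a /label_name/label_value string. `pushgateway_labels` is a hypothetical helper name, not part of Kavmon. The Pushgateway URL format allows chaining several pairs, but whether Kavmon accepts more than one pair in AM_PUSHGATEWAY_LABELS is an assumption to verify in your environment. The sketch also assumes names and values contain no "/" characters.

```shell
# Hypothetical helper (not part of Kavmon): join key=value arguments into the
# Pushgateway path format /label_name/label_value.
# Assumption: names and values contain no "/" characters.
pushgateway_labels() {
  path=""
  for pair in "$@"; do
    # Append "/<name>/<value>" for each argument.
    path="$path/${pair%%=*}/${pair#*=}"
  done
  printf '%s\n' "$path"
}

pushgateway_labels color=blue
```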

To apply the changes to the cluster, run the helm upgrade command to upgrade the release:

helm upgrade --install datarobot-core charts/datarobot-generic-X.X.X.tgz --wait --namespace DATAROBOT_NAMESPACE --values 02_dr-app_values.yaml --timeout 5m --debug

Example: Pushgateway in Kubernetes¶

DataRobot does not provide the Prometheus Pushgateway service. This example shows the values required to configure datarobot-kavmon to push metrics to a Pushgateway that you provide, deployed in the same Kubernetes cluster.

Namespace: DR_CORE_NAMESPACE (example value: datarobot-core)
Pushgateway Helm chart: prometheus-community/prometheus-pushgateway
Release name (example): prometheus-pushgateway

1: Install the Pushgateway chart¶

Add the Helm repository and install the chart with default settings:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm upgrade --install prometheus-pushgateway prometheus-community/prometheus-pushgateway --debug --namespace DR_CORE_NAMESPACE

Replace DR_CORE_NAMESPACE with the namespace where you want to install the Pushgateway.

2: Build the `AM_PROMETHEUS_URL` value¶

Construct the URL from the following parts, in the form http://<service>.<namespace>.svc.cluster.local:9091:

The service name (the default from the chart is prometheus-pushgateway).
The namespace where the Pushgateway release is installed (in this example, your DR_CORE_NAMESPACE value, such as datarobot-core).
The suffix .svc.cluster.local:9091.

Note

9091 is the default port. This is configurable via a values.yaml entry; see the chart's documentation.

For this example, the resulting URL is http://prometheus-pushgateway.datarobot-core.svc.cluster.local:9091.
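
The URL construction can be sketched as a small shell function. `pushgateway_url` is a hypothetical helper name. The defaults follow the chart defaults named in this section, and the svc.cluster.local suffix assumes the cluster uses the default DNS domain.

```shell
# Hypothetical helper: assemble the in-cluster Pushgateway URL.
# Arguments: namespace, then optional service name and port
# (defaults follow the chart defaults named in this section).
pushgateway_url() {
  namespace=$1
  service=${2:-prometheus-pushgateway}
  port=${3:-9091}
  printf 'http://%s.%s.svc.cluster.local:%s\n' "$service" "$namespace" "$port"
}

pushgateway_url datarobot-core
```

Pass a different service name or port as the second and third arguments if you changed them in the chart values.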

3: Configure Kavmon¶

Set the AM_PROMETHEUS_URL environment variable with the URL from step 2, as shown in the Configure Kavmon to use Pushgateway section.

4: Verify it works¶

To verify that metrics are being published to the Pushgateway, connect to a pod running the datarobot-runtime image. This example uses an mmapp-app-XXXX pod. Replace DR_CORE_NAMESPACE with your namespace.

Find a running mmapp-app pod name:

kubectl get pods -n DR_CORE_NAMESPACE | grep mmapp
# Example output: mmapp-app-7fd7467b8f-5xlvl   1/1     Running     0   52m

Connect to the pod's shell. Replace mmapp-app-7fd7467b8f-5xlvl with your pod name:

kubectl exec -ti mmapp-app-7fd7467b8f-5xlvl -n DR_CORE_NAMESPACE -- /entrypoint.sh bash

From within the pod, curl the Pushgateway metrics endpoint. Replace datarobot-core with your namespace if different:
```
curl http://prometheus-pushgateway.datarobot-core.svc.cluster.local:9091/metrics 
```

Note

The /metrics path exposes the metrics page that Prometheus polls for data.
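
To turn the curl output into a quick yes/no answer, you can count the Kavmon series in it. The sketch below uses a hypothetical function name, `count_kavmon_series`, and runs against inline sample data so that it works without a cluster. It assumes the metric is named healthy, as in the example metrics on this page.

```shell
# Count the Kavmon "healthy" series in Prometheus text output read from stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the exit
# status clean for callers.
count_kavmon_series() {
  grep -c '^healthy{' || true
}

# In the pod, the input would come from the Pushgateway instead, for example:
#   curl -s http://prometheus-pushgateway.datarobot-core.svc.cluster.local:9091/metrics | count_kavmon_series
count_kavmon_series <<'EOF'
# TYPE healthy gauge
healthy{check_group="dbHealth",check_name="Redis connection check",check_service="redis",job_name="kavmon"} 1
healthy{check_group="dbHealth",check_name="Mongo Sync Status",check_service="mongo",job_name="kavmon"} 0
EOF
```

A count of 0 suggests that Kavmon has not pushed metrics to this Pushgateway yet.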

Metrics output format¶

Regardless of which export method is configured, datarobot-kavmon emits metrics as gauges with the following structure.

Metric labels¶

Each metric includes the following labels:

Label	Description
check_group	The name of the group to which the check belongs
check_name	The name of the check
check_service	The name of the service that's being checked
job_name	The name of the Kavmon job (e.g., "kavmon")
custom labels	Any additional labels configured via `AM_OTEL_LABELS` (OTel method) or `AM_PUSHGATEWAY_LABELS` (Pushgateway method)

Metric values¶

The health status of each check is represented by the gauge value:

0 means the check is UNHEALTHY
1 means the check is HEALTHY
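
Because the gauge is binary, a script can reduce a set of series to a single pass/fail result. The following is a minimal sketch with a hypothetical function name, `all_checks_healthy`. It assumes Prometheus text format on stdin, with the value as the last whitespace-separated field (no timestamps).

```shell
# Return success only when every "healthy" gauge read from stdin is 1.
# Assumes the value is the last whitespace-separated field (no timestamps).
all_checks_healthy() {
  awk '/^healthy[{]/ && $NF != 1 { bad = 1 } END { exit bad + 0 }'
}

# Inline sample data: one healthy check and one unhealthy check.
sample='healthy{check_group="dbHealth",check_name="Redis connection check",check_service="redis",job_name="kavmon"} 1
healthy{check_group="dbHealth",check_name="Mongo Sync Status",check_service="mongo",job_name="kavmon"} 0'

if printf '%s\n' "$sample" | all_checks_healthy; then
  echo "all checks healthy"
else
  echo "at least one check is unhealthy"
fi
```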

Example metrics¶

When exported to Prometheus format (via an OTel collector with a Prometheus exporter, or directly via Pushgateway), the metrics look similar to:

# HELP healthy Availability monitor metric healthy
# TYPE healthy gauge
healthy{check_group="datarobotQueue",check_name="Test RabbitMQ exchanges and bindings",check_service="rabbit",job_name="kavmon"} 1
healthy{check_group="dbHealth",check_name="Mongo Sync Status",check_service="mongo",job_name="kavmon"} 0
healthy{check_group="dbHealth",check_name="Rabbit Vhosts Exist",check_service="rabbit",job_name="kavmon"} 1
healthy{check_group="dbHealth",check_name="Redis connection check",check_service="redis",job_name="kavmon"} 1
healthy{check_group="dependency_check",check_name="Mongo dependency check",check_service="mongo",job_name="kavmon"} 1
healthy{check_group="info",check_name="Count DSS Workers",check_service="datasetsserviceapi",job_name="kavmon"} 1
healthy{check_group="info",check_name="Count active UI sessions",check_service="internalapi",job_name="kavmon"} 1
healthy{check_group="info",check_name="Failed Kubernetes not ready pods Job",check_service="kubernetes",job_name="kavmon"} 0
healthy{check_group="info",check_name="License Valid and Non-Expired",check_service="app",job_name="kavmon"} 1
healthy{check_group="info",check_name="Status of test job run by modmonscheduler every minute",check_service="modmonscheduler",job_name="kavmon"} 1
healthy{check_group="testJobs",check_name="Apps Builder Workers check",check_service="appsbuilderapi",job_name="kavmon"} 1
healthy{check_group="testJobs",check_name="Kubeworkers Ready",check_service="execmanagerkubeworkers",job_name="kavmon"} 1
healthy{check_group="testJobs",check_name="Notifications Broker Publish Check",check_service="notificationsbroker",job_name="kavmon"} 1
healthy{check_group="testJobs",check_name="Quick Worker Ping Job",check_service="internalapi",job_name="kavmon"} 1
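
Output in this format can be reduced to a short list of failing checks. The sketch below uses a hypothetical function name, `unhealthy_checks`. It assumes the labels appear in the order shown in the example above (check_name before check_service) and that the value is the last field on each line.

```shell
# Print "service: check name" for each series whose value is 0.
# Assumes check_name precedes check_service on each line and the value
# is the last whitespace-separated field.
unhealthy_checks() {
  awk '/^healthy[{]/ && $NF == 0' |
    sed -n 's/.*check_name="\([^"]*\)".*check_service="\([^"]*\)".*/\2: \1/p'
}

unhealthy_checks <<'EOF'
healthy{check_group="dbHealth",check_name="Mongo Sync Status",check_service="mongo",job_name="kavmon"} 0
healthy{check_group="dbHealth",check_name="Redis connection check",check_service="redis",job_name="kavmon"} 1
EOF
```

For the sample input, the function prints one line for the Mongo check and skips the healthy Redis check.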

If you configured additional labels via AM_OTEL_LABELS (OTel method) or AM_PUSHGATEWAY_LABELS (Pushgateway method), they also appear in each metric:

healthy{check_group="info",check_name="License Valid and Non-Expired",check_service="app",job_name="kavmon",environment="production",cluster="us-east-1"} 1
