List of Kavmon checks in the "info" group

"Availability of the Manually Configured Dynamic Workers image"

The check gets a list of all Kubernetes clusters configured with kubeworkers and checks if the datarobot-runtime image is available in the cluster’s container registry (image value is the value of DATAROBOT_JOBS_RUNTIME_IMAGE).

When running this check manually, the -w or --num-workers argument must be set to a value higher than 0.

Returns unhealthy if:

  • The check could not determine image availability in the target cluster.
  • The image is not available in the target cluster.

Check is skipped if:

  • ENABLE_EXECMANAGER_KUBEWORKERS = False (this is never set to False in DataRobot version 9+).
  • DATAROBOT_JOBS_USE_LOCAL_RUNTIME_IMAGE = True.
  • DATAROBOT_JOBS_RUNTIME_IMAGE is not set.
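The skip conditions above can be read as a small decision function. The following is a minimal sketch, not Kavmon's actual implementation: the function name skip_reason and the idea of passing the settings as an already-parsed dictionary are assumptions made for illustration. Only the three setting names come from the description above.

```python
def skip_reason(config):
    """Return why the image check would be skipped, or None if it should run.

    `config` is a hypothetical dict of already-parsed settings; how Kavmon
    actually reads its configuration is not described in this document.
    """
    if config.get("ENABLE_EXECMANAGER_KUBEWORKERS") is False:
        return "ENABLE_EXECMANAGER_KUBEWORKERS = False"
    if config.get("DATAROBOT_JOBS_USE_LOCAL_RUNTIME_IMAGE") is True:
        return "DATAROBOT_JOBS_USE_LOCAL_RUNTIME_IMAGE = True"
    if not config.get("DATAROBOT_JOBS_RUNTIME_IMAGE"):
        return "DATAROBOT_JOBS_RUNTIME_IMAGE is not set"
    return None
```

In this sketch, a configuration that sets only DATAROBOT_JOBS_RUNTIME_IMAGE yields None, meaning the check would proceed to query the registry.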

"Count DSS Workers"

The check gets the expected DSS worker counts per worker type by querying the list of Kubernetes deployments whose names start with "datarobot-datasets-service-" and collecting their desired replica counts.

Then, it calls the DSS API to check for workers connected to the queue, retrying several times if it does not get a 200 response immediately.

Finally, it compares the number of workers connected to the queue with the number of expected worker pods.

Returns unhealthy if:

  • No successful (200) response from DSS API after the number of attempts is exhausted.
  • "error" key is still present in the JSON response from DSS API after the number of attempts is exhausted.
  • The number of workers connected to the queue differs from the expected number.
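The retry-then-compare flow described above can be sketched as follows. This is an illustration under stated assumptions, not the check's real code: the function name, the fetch callable (which stands in for the DSS API call and returns a status code plus decoded JSON), the default attempt count, and the "workers" field holding per-type counts are all hypothetical. Only the 200 status and the "error" key come from the description.

```python
import time


def check_dss_workers(expected, fetch, attempts=3, delay=0.0):
    """Compare expected per-type worker counts with what the API reports.

    expected: dict mapping worker type to desired replica count.
    fetch:    hypothetical callable returning (status_code, payload_dict).
    Returns (healthy, message).
    """
    problem = "no attempts made"
    for attempt in range(attempts):
        status, payload = fetch()
        if status != 200:
            problem = f"DSS API returned status {status}"
        elif "error" in payload:
            problem = f"DSS API reported an error: {payload['error']}"
        else:
            # "workers" is an assumed field name for the connected workers.
            actual = payload.get("workers", {})
            if actual == expected:
                return True, "worker counts match"
            return False, f"expected {expected}, found {actual}"
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before the next attempt
    return False, problem
```

The sketch keeps retrying only while the API is failing or reporting an error. Once a clean response arrives, a count mismatch is reported immediately, which matches the three unhealthy conditions listed above.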

"Count active UI sessions"

Gets the number of current socket connections to the UI, through an HTTP API call to the internal API (/api/v0/resources/socketConnections).

Returns unhealthy if:

  • The HTTP request fails or the response is not decodable JSON.

"Failed Kubernetes not ready pods Job"

This check gets a list of all pods in the same Kubernetes namespace as Kavmon and checks the status of their containers. It ignores pods that have been terminated or ran to completion.

Returns unhealthy if:

  • There are pods that have containers with a non-ready status.
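The readiness logic can be sketched over plain dictionaries shaped like Kubernetes Pod objects (status.phase and status.containerStatuses[].ready are standard Pod fields). This is a sketch, not Kavmon's implementation: the function name is hypothetical, and treating the "Succeeded" and "Failed" phases as the "terminated or ran to completion" pods to ignore is an assumption.

```python
# Assumption: these phases correspond to pods the check ignores.
IGNORED_PHASES = {"Succeeded", "Failed"}


def find_not_ready_pods(pods):
    """Return names of pods that have at least one non-ready container."""
    not_ready = []
    for pod in pods:
        status = pod.get("status", {})
        if status.get("phase") in IGNORED_PHASES:
            continue  # terminated or completed pods are skipped
        containers = status.get("containerStatuses", [])
        if any(not c.get("ready", False) for c in containers):
            not_ready.append(pod["metadata"]["name"])
    return not_ready
```

Under these assumptions, the check would report unhealthy whenever the returned list is non-empty.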

"License Valid and Non-Expired"

This check retrieves the active license info from Mongo and checks if the license is valid and not expired.

Returns unhealthy if:

  • Check cannot get license details.
  • License has expired.

"Status of test job run by modmonscheduler every minute"

Gets the status of the test job that modmonscheduler schedules to run every minute, through an HTTP API call to the internal API (/api/v0/resources/scheduler).

Returns unhealthy if:

  • HTTP request fails.
  • HTTP response is not decodable JSON.
  • Test job execution delay was longer than the configured time (MODMONSCHEDULER_TEST_JOB_DELAY, defaults to 15 seconds).
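The evaluation of the scheduler response can be sketched as below. The response schema is not documented here, so the "delay_seconds" field is purely hypothetical, as is the function name. The 15-second default mirrors the documented default of MODMONSCHEDULER_TEST_JOB_DELAY. The HTTP call itself is left out, and the function takes the raw response body as text.

```python
import json

DEFAULT_MAX_DELAY = 15.0  # documented default of MODMONSCHEDULER_TEST_JOB_DELAY


def evaluate_scheduler_response(body, max_delay=DEFAULT_MAX_DELAY):
    """Return (healthy, message) for a raw scheduler response body."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False, "response is not decodable JSON"
    # "delay_seconds" is an assumed field name, used only for illustration.
    if not isinstance(data, dict) or "delay_seconds" not in data:
        return False, "response does not contain a test job delay"
    delay = float(data["delay_seconds"])
    if delay > max_delay:
        return False, f"test job delayed {delay}s (limit {max_delay}s)"
    return True, "test job ran on time"
```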

"Prediction Spooler Status Check"

This check verifies that a prediction spooler service is running within the prediction-server-app Kubernetes deployment.

Returns unhealthy if:

  • No active spoolers are running (1 should be running).

Check is skipped if:

  • PREDICTION_SPOOLER_INTERNAL_ENDPOINT is not set.


"Ping CCM API Check"

This check verifies that the CCM API responds.

Note: This is a DataRobot-internal check that is skipped for all customer-managed DataRobot instances because it is not required in those environments. It is documented here for completeness.

Returns unhealthy if:

  • No pong response is received.

Check is skipped if:

  • CCM_API_BASE_URL is not set.