List of Kavmon checks in the "testJobs" group
Many of these checks rely on the worker_service_ping process. They connect to the internal API and initiate a service ping job for a specific service by sending a POST request to api/v0/resources/servicePings. The response to this POST request contains a Location header, which is subsequently polled to get the job status. Polling continues until the URL returns JSON with "status"="COMPLETED" and a "pong_timestamp", or until the time runs out or the number of attempts is exhausted.
These checks return unhealthy if:
* worker_service_ping runs out of attempts.
* "pong_timestamp" is returned in an unexpected format.
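The polling behavior described above can be sketched roughly as follows. This is a minimal illustration, not the Kavmon implementation: the function name `poll_service_ping`, the injected `get_status` callable (standing in for one GET against the Location URL), the default attempt count and delay, and the assumption that "pong_timestamp" is an ISO 8601 string are all hypothetical.

```python
import time
from datetime import datetime
from typing import Callable


def poll_service_ping(
    get_status: Callable[[], dict],
    max_attempts: int = 5,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True (healthy) if the ping job completes with a parseable timestamp."""
    for attempt in range(max_attempts):
        body = get_status()  # hypothetical: one GET against the Location URL
        if body.get("status") == "COMPLETED" and "pong_timestamp" in body:
            try:
                # Assumption: ISO 8601 timestamp; the real expected format may differ.
                datetime.fromisoformat(body["pong_timestamp"])
            except (TypeError, ValueError):
                return False  # unexpected timestamp format -> unhealthy
            return True
        if attempt < max_attempts - 1:
            sleep(delay_seconds)  # wait before the next poll
    return False  # ran out of attempts -> unhealthy
```

Passing the HTTP call in as a callable keeps the sketch independent of any particular HTTP client and makes the two unhealthy outcomes easy to exercise in isolation.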
"EDA Worker Ping Job"
Performs a worker_service_ping check (described above) for the edaworker worker type.
Check is skipped if:
ENABLE_KUBEWORKERS_FOR_ALL_USERS = True (this is always true in DataRobot version 9+).
"GPU Worker Ping Job"
Performs a worker_service_ping check (described above) for the gpuworker worker type.
Check is skipped if:
ENABLE_EXECMANAGER_GPU is False or is not set.
"Low Latency Worker Ping Job"
Performs a worker_service_ping check (described above) for the lowlatencyworker worker type.
Check is skipped if:
- ENABLE_EXECMANAGER_LOWLATENCY = False
- ENABLE_EXECMANAGER_LOWLATENCY is not set.
"Notifications Broker Publish Check"
Performs a worker_service_ping check (described above) for the notificationsbroker worker type.
Check is skipped if:
- ENABLE_NOTIFICATION_SERVICE = False
- ENABLE_NOTIFICATION_SERVICE is not set.
"Quick Worker Ping Job"
Performs a worker_service_ping check (described above) for the quickworker worker type.
"Secure Worker Ping Job"
Performs a worker_service_ping check (described above) for the secureworker worker type.
Check is skipped if:
ENABLE_KUBEWORKERS_FOR_ALL_USERS = True
"Kubeworkers Ready"
This check will only run if there is at least one active Kubeworkers cluster that has been running for 60 minutes.
It submits a QueueBIT job by doing a POST request to /api/v0/queueBit/create/, which runs a series of checks as the job propagates through the Tasks/Queue managers and into the worker job.
The check then polls internal API URL api/v0/queueBit/{queue_bit_id}/ to get the status of the submitted job.
Returns unhealthy if:
- Any of the QueueBIT checks are unhealthy (warning, failure).
- The check runs out of attempts.
Check is skipped if:
ENABLE_EXECMANAGER_KUBEWORKERS = False (this is never set to False in DataRobot version 9+).
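As a rough illustration of how the QueueBIT results might be evaluated once polling returns, the sketch below treats any check reporting a warning or failure as unhealthy. The shape of the results (a list of mappings with "name" and "status" keys) and both function names are assumptions made for this example; the actual QueueBIT response format is not described here.

```python
from typing import Iterable, List, Mapping

# Statuses the text above treats as unhealthy.
UNHEALTHY_STATUSES = {"warning", "failure"}


def unhealthy_queuebit_checks(checks: Iterable[Mapping[str, str]]) -> List[str]:
    """Return the names of checks whose status is a warning or failure.

    Assumption: each check is a mapping with hypothetical "name" and "status" keys.
    """
    return [
        check["name"]
        for check in checks
        if check.get("status", "").lower() in UNHEALTHY_STATUSES
    ]


def kubeworkers_ready(checks: Iterable[Mapping[str, str]]) -> bool:
    """Healthy only when no individual QueueBIT check is unhealthy."""
    return not unhealthy_queuebit_checks(checks)
```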
"MMQueue project subscription health check"
The check is an end-to-end web socket communication test. It loops for several attempts to do the following:
Connect to the mmqueue service and authenticate with the queue using the AM_MMQUEUE_AUTH_USER and AM_MMQUEUE_AUTH_TOKEN environment variables. Then subscribe to a project (based on the AM_MMQUEUE_SUBSCRIBE_PROJECT environment variable). Then send a message to the “rabbitmq” connection and wait for the message to be received from mmqueue.
Returns unhealthy if:
- An exception is thrown during the check process.
- The check runs out of attempts.
Check is skipped if:
- Any of the AM_MMQUEUE_AUTH_USER, AM_MMQUEUE_AUTH_TOKEN, or AM_MMQUEUE_SUBSCRIBE_PROJECT environment variables is not set.
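The skip condition for this check can be sketched as a simple lookup over the three environment variables named above. The helper name `missing_mmqueue_vars` is hypothetical, and treating an empty string the same as an unset variable is an assumption of this sketch.

```python
import os
from typing import List, Mapping

REQUIRED_VARS = (
    "AM_MMQUEUE_AUTH_USER",
    "AM_MMQUEUE_AUTH_TOKEN",
    "AM_MMQUEUE_SUBSCRIBE_PROJECT",
)


def missing_mmqueue_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset (or empty, by assumption)."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def should_skip_mmqueue_check(env: Mapping[str, str] = os.environ) -> bool:
    """Skip the check when any required variable is missing."""
    return bool(missing_mmqueue_vars(env))
```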
"Apps Builder Workers check"
The check loops for several attempts to do the following:
Send a POST request to appsbuilderapi (/applications/current/status/workers/) to create an Apps Builder worker status test job.
* Job creation is considered successful if the returned status code is 302 and the response has a Location header with a URL to poll for job status.
Send a GET request to the appsbuilderapi polling location to get the status of the Apps Builder worker status test job.
* The test job is considered successful if the returned status code is 200 and the reported status is COMPLETED.
Returns unhealthy if:
- An exception is thrown during the check process.
- The check runs out of attempts in either of its two phases (POST and GET to appsbuilderapi).
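The two phases of this check could be sketched as below. This is an illustration under stated assumptions rather than the actual check: the `post` and `get` callables stand in for the real HTTP calls and are assumed to return a `(status_code, headers, json_body)` tuple, and the function names, attempt count, and delay are hypothetical.

```python
import time
from typing import Callable, Mapping, Optional, Tuple

# Hypothetical response shape: (status code, headers, parsed JSON body).
Response = Tuple[int, Mapping[str, str], Mapping[str, object]]


def create_test_job(post: Callable[[], Response], attempts: int,
                    delay: float, sleep: Callable[[float], None]) -> Optional[str]:
    """Phase 1: POST until we get a 302 with a Location header."""
    for _ in range(attempts):
        status, headers, _body = post()
        if status == 302 and headers.get("Location"):
            return headers["Location"]
        sleep(delay)
    return None


def wait_for_job(get: Callable[[str], Response], location: str, attempts: int,
                 delay: float, sleep: Callable[[float], None]) -> bool:
    """Phase 2: GET the polling location until the job reports COMPLETED."""
    for _ in range(attempts):
        status, _headers, body = get(location)
        if status == 200 and body.get("status") == "COMPLETED":
            return True
        sleep(delay)
    return False


def apps_builder_workers_healthy(post, get, attempts: int = 3, delay: float = 1.0,
                                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Unhealthy on any exception or when either phase runs out of attempts."""
    try:
        location = create_test_job(post, attempts, delay, sleep)
        if location is None:
            return False
        return wait_for_job(get, location, attempts, delay, sleep)
    except Exception:
        return False
```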
"Apps Builder API Health Check"
This check tests the ability to connect to the Apps Builder API by making an API call to the health endpoint /applications/current/status/health/.
Returns unhealthy if:
- An exception is thrown during the check process.
- The health status returned by the API is not True.
"Apps Builder Internal API Health Check"
This check tests the ability to connect to the Apps Builder Internal API by making an API call to the health endpoint /appsinternalapi/status/health/.
Returns unhealthy if:
- An exception is thrown during the check process.
- The health status returned by the API is not True.
"Ping MLOPS Actuals Storage API Health Check"
This check tests the ability to connect to the MLOPS Actuals Storage API URL (defined by MMM_ACTUALS_STORAGE_URL) by making an API call to the health endpoint /health/.
Returns unhealthy if:
- An exception is thrown during the check process.
- The health status returned by the API is not "healthy".
- Health status for Elasticsearch cluster is not "green".
- Connection to Elasticsearch fails.
Check is skipped if:
- ENABLE_MLOPS_ACTUALS_STORAGE = False.
- ENABLE_MLOPS_ACTUALS_STORAGE is not set.
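The unhealthy conditions for this check might be evaluated along the following lines. The payload layout is purely an assumption for illustration: the keys "status" and "elasticsearch_status" are hypothetical, and a missing Elasticsearch status is used here to represent a failed Elasticsearch connection. The real /health/ response may be structured differently.

```python
from typing import List, Mapping, Optional


def actuals_storage_problems(payload: Mapping[str, Optional[str]]) -> List[str]:
    """Return a list of reasons the check would be unhealthy (empty means healthy).

    Assumption: the payload carries hypothetical "status" and
    "elasticsearch_status" keys.
    """
    problems = []
    if payload.get("status") != "healthy":
        problems.append("API health status is not 'healthy'")
    es_status = payload.get("elasticsearch_status")
    if es_status is None:
        # Stand-in for "connection to Elasticsearch fails".
        problems.append("no Elasticsearch status available")
    elif es_status != "green":
        problems.append("Elasticsearch cluster status is not 'green'")
    return problems
```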
"RabbitMQ Connection Health Check"
This check tests the ability to connect to RabbitMQ.
Returns unhealthy if:
- RabbitMQ url cannot be found.
- Connection to RabbitMQ fails.
"Task Manager Health Check"
This check tests the ability to connect to the Task Manager service. The check makes a ping request to a RabbitMQ queue and waits for a response from Task Manager to be published back to the RabbitMQ queue.
Returns unhealthy if:
- RABBITMQ_URL_QUEUE environment variable is not found.
- Task Manager does not reply with a response within 10 seconds.
- Task Manager replies with an unexpected token id.
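The ping-and-reply pattern with a token and a 10-second timeout can be sketched as follows. To keep the example self-contained, a standard-library `queue.Queue` stands in for the RabbitMQ reply queue, and the `send_ping` callable stands in for publishing the ping; both, along with the function name and the use of a random UUID as the token, are assumptions of this sketch and not the actual Task Manager protocol.

```python
import queue
import uuid
from typing import Callable


def task_manager_healthy(
    send_ping: Callable[[str], None],
    replies: queue.Queue,
    timeout_seconds: float = 10.0,
) -> bool:
    """Send a ping carrying a fresh token and expect the same token back in time."""
    token = uuid.uuid4().hex  # hypothetical token format
    send_ping(token)  # stand-in for publishing the ping to RabbitMQ
    try:
        reply = replies.get(timeout=timeout_seconds)
    except queue.Empty:
        return False  # no reply within the timeout -> unhealthy
    return reply == token  # unexpected token id -> unhealthy
```

Generating a new token per ping lets the check distinguish a reply to its own request from a stale or unrelated message on the queue.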