Enterprise monitoring guide
This section provides an overview of and links to detailed documentation for monitoring the health, performance, and availability of your DataRobot platform. Effective monitoring is critical for ensuring that your DataRobot installation operates reliably and that any potential issues are identified and resolved quickly.
Recommended endpoints
For general health monitoring, DataRobot provides a series of REST API endpoints that can be integrated with most standard monitoring tools. These endpoints provide status checks for core services and end-to-end test jobs.
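As an illustration of how such an endpoint might be polled from a custom monitoring script, the following is a minimal sketch using only the Python standard library. The path `/example/health` and the host name are placeholders, not actual DataRobot endpoints; substitute the paths listed in the recommended endpoints documentation. The sketch also assumes that any 2xx response means "healthy", which may not match the semantics of every endpoint.

```python
import urllib.error
import urllib.request
from typing import Callable, Tuple

# Hypothetical path used for illustration only; replace it with a real
# endpoint from the recommended endpoints documentation.
HEALTH_PATH = "/example/health"


def http_fetch(url: str, timeout: float = 5.0) -> Tuple[int, str]:
    """Issue a GET request and return (status_code, body).

    Returns status 0 with the error text if the host cannot be reached.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return exc.code, exc.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, and similar.
        return 0, str(exc)


def check_endpoint(
    base_url: str,
    path: str = HEALTH_PATH,
    fetch: Callable[[str], Tuple[int, str]] = http_fetch,
) -> dict:
    """Poll one endpoint and summarize the result.

    Assumption: a 2xx status is treated as healthy.
    """
    url = base_url.rstrip("/") + path
    status, _body = fetch(url)
    return {"url": url, "status": status, "healthy": 200 <= status < 300}
```

The `fetch` parameter exists so the check can be exercised without a network connection, for example `check_endpoint("https://datarobot.example.com", fetch=lambda url: (503, ""))`. In practice, most monitoring tools provide an equivalent HTTP check out of the box.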
Kubernetes Availability Monitor (Kavmon)
DataRobot includes the Kubernetes Availability Monitor (Kavmon), a command-line tool that runs a comprehensive suite of health checks against your installation.
- Kubernetes Availability Monitor: An overview of the Kavmon tool and how to use it.
- Kavmon checks reference: An index of all available Kavmon checks, organized by group.
- Exporting Kavmon metrics to Prometheus: A guide to configuring Kavmon to send its metrics to a Prometheus Pushgateway.
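
For readers unfamiliar with the Pushgateway, the sketch below shows what a push generally looks like at the protocol level: a metric rendered in the Prometheus text exposition format and sent with an HTTP PUT to `/metrics/job/<job>`. It does not show how Kavmon itself is implemented or configured. The metric name `example_check_up`, the job name `example_job`, and the gateway address are all hypothetical.

```python
import urllib.parse
import urllib.request


def format_gauge(name: str, value: float, help_text: str) -> str:
    """Render a single gauge in the Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"  # the payload must end with a newline
    )


def build_push_request(
    gateway_url: str, job: str, payload: str
) -> urllib.request.Request:
    """Build a PUT request that replaces all metrics in the job's group."""
    # Percent-encode the job name so it is safe as a single path segment.
    job_segment = urllib.parse.quote(job, safe="")
    url = f"{gateway_url.rstrip('/')}/metrics/job/{job_segment}"
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical metric, job, and gateway address, for illustration only.
payload = format_gauge("example_check_up", 1.0, "1 if the example check passed")
request = build_push_request("http://pushgateway.example:9091", "example_job", payload)
# To send it: urllib.request.urlopen(request, timeout=5)
```

A PUT replaces every metric previously pushed under the same grouping key, whereas a POST replaces only metrics with matching names. Which behavior is appropriate depends on how the pushed metrics are consumed.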
Observability with OpenTelemetry
DataRobot includes an observability subchart, datarobot-observability-core, that installs the agents required to observe a DataRobot cluster. The subchart can be configured to work with a number of observability providers.
- Observability subchart: An overview of the subchart.
- Hyperscaler managed services for VPCs
- Observability vendors
- Subchart common configuration