Instrumenting DataRobot to emit application level metrics and traces¶
Recall from the high-level architecture
that the telemetry instrumented via the application SDK (i.e. metrics and traces emitted
explicitly from the code) is pushed directly to an OTEL deployment, as opposed
to the rest of the telemetry, which is pulled (i.e. logs and cluster-level
metrics). This needs to be explicitly enabled by setting the
global.opentelemetry.enabled key to true:
global:
  opentelemetry:
    enabled: true
With that, all OTEL-instrumented services will start sending telemetry (with the exception of Notebooks, explained below).
It's also possible to disable this for specific services, for example:
global:
  opentelemetry:
    enabled: true
ocr-service:
  opentelemetry:
    enabled: false
Note that in this release, enabling OpenTelemetry for the kubeworkers service is not a supported configuration. The recommended global setting correctly leaves it disabled, but it must not be enabled manually.
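For completeness, the per-service override shown below can be used to keep kubeworkers explicitly off even if it was enabled elsewhere in your overrides. This is only a sketch and assumes the kubeworkers subchart follows the same opentelemetry.enabled convention as the other services:

global:
  opentelemetry:
    enabled: true

# Assumed key layout: keeps kubeworkers explicitly disabled while the global
# flag is on (the recommended global setting already skips it).
kubeworkers:
  opentelemetry:
    enabled: false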
Notebooks configuration¶
Notebooks microservices follow a different convention. In these cases, the environment variables (not chart configuration keys) for the equivalent configuration are the following:
- TELEMETRY_ENABLED - to enable/disable it (true/false)
- TELEMETRY_OTLP__ENDPOINT - the specific endpoint. This must be set to http://observability-v2-otel-deployment:4317
Additionally, Notebooks services expose other variables:
- TELEMETRY_EXPORTER - The actual exporter. Must be set to otlp to actually export the telemetry (default value is console). jaeger is also available for traces
- TELEMETRY_EXPORT_INTERVAL - The interval (in milliseconds) to export telemetry at
- TELEMETRY_OTLP__PROTOCOL - The OTLP transport protocol (grpc/http)
- TELEMETRY_OTLP_HEADERS - A list of headers to apply to all outgoing data (default: N/A)
- TELEMETRY_OTLP__INSECURE - Whether to enable client transport security for the exporter's gRPC connection (true/false, default: false)
Because these are environment variables instead of chart configuration keys, they must be set directly as such. See this example:
notebooks:
  # Common config variables for all services
  <notebook-service-name>:
    configs:
      data:
        TELEMETRY_ENABLED: true
        TELEMETRY_EXPORTER: otlp
        TELEMETRY_OTLP__PROTOCOL: grpc
        TELEMETRY_OTLP__ENDPOINT: http://my-otel-endpoint:4317
        TELEMETRY_OTLP__INSECURE: true
        TELEMETRY_METRIC_EXPORT_INTERVAL: 5000
  # Service specific configs
  orchestrator:
    configs:
      data:
        TELEMETRY_OTLP__ENDPOINT: http://another-otel-endpoint:4317
  websocket:
    configs:
      data:
        TELEMETRY_ENABLED: false
That will:
- Enable OpenTelemetry by default for every service, and make all of them export telemetry to the http://my-otel-endpoint:4317 endpoint
- Configure the orchestrator service specifically to export telemetry to http://another-otel-endpoint:4317
- Disable telemetry for the websocket service
Running post-installation tests¶
The observability-core subchart comes with helm tests that can be run after the installation or upgrade of the DataRobot chart. These tests ensure that the otlp receivers of each of the deployments are receiving and accepting telemetry. Common failure modes that these tests will detect include:
- There’s a general otel configuration syntax error or misconfiguration
- There’s a specific misconfiguration in the receiver preventing the collector from receiving telemetry (this shouldn’t be the case unless the receiver was explicitly overridden)
- The service endpoints for the opentelemetry collectors are not reachable
- The daemonset instance (excluded from the previous point because it doesn’t expose an endpoint) could not bind the port on the node
The tests won’t fail if the exporter is configured for an unreachable endpoint.
In order to run these tests, run the helm test command for the release:
helm test dr --filter "name=test-otlp-receivers"
Note that without the name=test-otlp-receivers filter, all the helm tests included
in the DataRobot distribution will be run.
Debugging failed tests¶
If any of the tests fail, the command will exit with a non-zero status, and the pod where the test was run will be left in an errored state.
The first step is to identify the failed test. In order to do that, run
kubectl get pod test-otlp-receivers \
-o jsonpath='{.metadata.annotations.tested-service}'
That will point to the specific OTEL collector instance the test was run
against. If this pod is in an errored state, its logs should give a clear
explanation of what went wrong. For example, consider this incorrect
configuration for the opentelemetry-collector-daemonset, which tries to
configure processors under service instead of as a top-level key
under config:
opentelemetry-collector-daemonset:
  config:
    service:
      processors: # <- wrong, this should be a level up, under `config`
      pipelines:
        # etc
The helm test would exit with an error, with the daemonset pods outputting the following:
kubectl logs observability-v2-otel-daemonset-agent-mvqsn
Error: failed to get config: cannot unmarshal the configuration: decoding failed due to the following error(s):
'service' has invalid keys: processors
If the pod was instead healthy, the issue would likely relate to networking.
Optional: extending OTEL pipelines with custom processors¶
This section is entirely optional: it explains how to extend the OTEL pipelines with custom processors, in case you need to add e.g. a custom attribute, a metric transformer, or a resource detector. If that's not the case, it can be skipped.
To add an additional OTEL processor to the pipeline, it needs to be configured
and then included in the output pipeline of each of the respective
signal(s). These pipelines are defined in each of the example values files as
_[logs|metrics|traces]Pipeline anchors, as in the following example for AWS
with CloudWatch:
_logsPipeline: &_logsPipeline
  receivers: [forward/logs]
  processors: []
  exporters: [awscloudwatchlogs]

_metricsPipeline: &_metricsPipeline
  receivers: [forward/metrics]
  processors: []
  exporters: [prometheusremotewrite]

_tracesPipeline: &_tracesPipeline
  receivers: [forward/traces]
  processors: []
  exporters: [awsxray]
These anchors are already referenced in the respective OTEL configs.
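For orientation, these anchors are consumed from the collectors' pipeline definitions via YAML aliases, so overriding an anchor affects every pipeline that references it. The snippet below is only an illustrative sketch of that wiring; the exact subchart key and nesting in your values file may differ:

# Illustrative sketch only: the anchors defined above are aliased from the
# collector's service.pipelines section (actual key layout may differ).
opentelemetry-collector-deployment:
  config:
    service:
      pipelines:
        logs: *_logsPipeline
        metrics: *_metricsPipeline
        traces: *_tracesPipeline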
In the following example, a simple processor that adds a custom attribute is defined and then referenced in the logs and traces pipelines:
processors:
  attributes/add_custom_attribute:
    actions:
      - key: "custom_attribute"
        action: insert
        value: "custom_value"

_logsPipeline: &_logsPipeline
  receivers: [forward/logs]
  processors: [attributes/add_custom_attribute]
  exporters: [awscloudwatchlogs]

_metricsPipeline: &_metricsPipeline
  receivers: [forward/metrics]
  processors: []
  exporters: [prometheusremotewrite]

_tracesPipeline: &_tracesPipeline
  receivers: [forward/traces]
  processors: [attributes/add_custom_attribute]
  exporters: [awsxray]
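The same pattern works for any other processor. As a further, purely illustrative sketch (not part of the shipped configuration), a resource detector could be added to the metrics pipeline only; the resourcedetection processor is part of the OpenTelemetry Collector contrib distribution, so verify that the bundled collector image includes it before using it:

processors:
  # Illustrative only: adds resource attributes detected from environment
  # variables and the host system
  resourcedetection/env:
    detectors: [env, system]
    timeout: 2s

_metricsPipeline: &_metricsPipeline
  receivers: [forward/metrics]
  processors: [resourcedetection/env]
  exporters: [prometheusremotewrite]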
Daemonsets, possible conflicts, taints and tolerations¶
As mentioned in the observability subchart structure
section, there are two sets of
daemonsets, prometheus-node-exporter and opentelemetry-collector-daemonset.
There are a couple of things to bear in mind with these.
Conflicts with existing daemonsets/already bound ports¶
There are several scenarios where the daemonsets could create conflicts for an existing setup in the cluster.
Disabling the OTEL and node-exporter daemonsets¶
If there already is an OTEL/node-exporter daemonset in the cluster, and you don't want to install the ones bundled with DataRobot, you will need to disable them:
datarobot-observability-core:
  opentelemetry-collector-daemonset:
    enabled: false
  prometheus-node-exporter:
    enabled: false
Note that if the already installed daemonsets are not configured in the same way as the configuration bundled with the DataRobot chart, telemetry might be incomplete.
Conversely, you can also disable the already installed daemonsets in favor of the ones deployed with DataRobot (recommended, to make sure that telemetry is complete).
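How to disable the pre-existing daemonsets depends on how they were installed. As a hypothetical example, if the existing node exporter came from the community kube-prometheus-stack chart, it could typically be turned off with values along these lines in that release (key names vary between charts, so check the chart you actually use):

# Hypothetical values for a pre-existing kube-prometheus-stack release,
# not for the DataRobot chart; key names depend on the chart in use.
nodeExporter:
  enabled: false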
Already bound ports¶
The OTEL collector daemonset agent will try to bind ports 4317 and 4318
(for OTLP gRPC and OTLP HTTP, respectively) on the host. If these are already
bound on the node, a different set of ports needs to be configured for the
daemonset:
datarobot-observability-core:
  opentelemetry-collector-daemonset:
    ports:
      otlp:
        hostPort: 4319 # standard is 4317
      otlp-http:
        hostPort: 4320 # standard is 4318
Taints and tolerations¶
Daemonsets are expected to run on every node. If the nodes where DataRobot is expected to run are tainted, these daemonsets need to tolerate those taints.
Because this is a list and helm doesn't append values, the whole list needs to
be recreated. The only existing toleration included in the chart is for the
nvidia.com/gpu taint (also mentioned in the Enable the use of GPUs
subsection in the Custom Models Configuration section in the installation
guide), so the toleration list would need to look like the following:
runEverywhereTolerations: &runEverywhereTolerations
  tolerations:
    - key: "nvidia.com/gpu" # Already defined by the observability chart; it must be included here so it isn't clobbered
      operator: "Exists"
    - key: "existing-taint-1"
      operator: "Exists"
    - key: "existing-taint-2"
      operator: "Exists"
    # etc

datarobot-observability-core:
  opentelemetry-collector-daemonset:
    <<: *runEverywhereTolerations
  prometheus-node-exporter:
    <<: *runEverywhereTolerations
Running on a Dedicated Node Group¶
Note: more in-depth documentation about running DataRobot on a dedicated node
group can be found in the installation guide, in the Running DataRobot
Application on a Dedicated Node Group section. This section only covers the
observability subchart configuration.
As mentioned in that document, if the label used for the taint is
dedicated=DatarobotNodeGroup, it needs to be defined as a toleration for the
subcharts, along with the node selector. You can define an anchor and reference it
in the subcharts:
dedicatedNodeGroupSelectorAndTolerations: &dedicatedNodeGroupSelectorAndTolerations
  nodeSelector:
    dedicated: "DatarobotNodeGroup"
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "DatarobotNodeGroup"
      effect: "NoSchedule"

datarobot-observability-core:
  kube-state-metrics:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  prometheus-node-exporter:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-deployment:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-daemonset:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-scraper:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-scraper-static:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-statsd:
    <<: *dedicatedNodeGroupSelectorAndTolerations
Community Grafana dashboards¶
If Grafana is used for metrics visualization, there are several dashboards created and maintained by the community that can easily be imported.
Note that not all the panels will necessarily work out of the box, since they're not tailored to DataRobot: some labels might be missing, or some specific metrics might be dropped.
This section lists some of the dashboards that can be useful.
Grafana dashboards repository¶
Dashboards available in the Grafana dashboards repository can be imported directly by pasting their link in Dashboards/New/Import.
- Node information: node metrics (from Prometheus Node Exporter), including overall health, network, CPU, memory and disk information
- Docker monitoring: cAdvisor container metrics, for CPU, memory and disk IO
- Kube State Metrics v2: overview of the state and performance of Kubernetes resources within the cluster, based on both kube state and cAdvisor metrics
- OpenTelemetry collector: OTEL internal metrics for monitoring the health of the collectors
Other sources¶
To import these dashboards, the json definition needs to be pasted/loaded in Dashboards/New/Import.
- dotdc/grafana-dashboards-kubernetes: several dashboards for kubernetes components; see descriptions and installation