Instrumenting DataRobot to emit application level metrics and traces¶
Recall from the high-level architecture
that application SDK-instrumented telemetry (metrics and traces emitted
explicitly from the code) is pushed directly to an OTEL deployment, as opposed
to the rest of the telemetry, which is pulled (logs and cluster-level
metrics). Enable this explicitly by setting the
global.opentelemetry.enabled key to true:
global:
  opentelemetry:
    enabled: true
With that, all OTEL-instrumented services start sending telemetry (with the Notebooks exception, explained below).
It's also possible to disable this for specific services, for example:
global:
  opentelemetry:
    enabled: true
ocr-service:
  opentelemetry:
    enabled: false
Note that in this release, enabling OpenTelemetry for the kubeworkers service isn't a supported configuration. It isn't enabled by the recommended global setting and must not be manually enabled.
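If you want to make that exclusion explicit in your values file, the same per-service override pattern shown above can be used. This is a sketch; the kubeworkers key name is assumed from the service name:

```yaml
global:
  opentelemetry:
    enabled: true
# Hypothetical explicit safeguard: kubeworkers isn't enabled by the global
# setting in this release and must stay disabled.
kubeworkers:
  opentelemetry:
    enabled: false
```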
Notebooks configuration¶
Notebooks microservices follow a different convention. For these services, the equivalent configuration is set through environment variables (not chart configuration keys):
- TELEMETRY_ENABLED - Enables/disables telemetry (true/false)
- TELEMETRY_OTLP__ENDPOINT - The specific endpoint. This must be set to http://observability-v2-otel-deployment:4317
Additionally, Notebooks services expose other variables:
- TELEMETRY_EXPORTER - The actual exporter. Must be set to otlp to actually export the telemetry (default value is console). jaeger is also available for traces
- TELEMETRY_EXPORT_INTERVAL - The interval (in milliseconds) at which to export telemetry
- TELEMETRY_OTLP__PROTOCOL - The OTLP transport protocol (grpc/http)
- TELEMETRY_OTLP_HEADERS - A list of headers to apply to all outgoing data (default: N/A)
- TELEMETRY_OTLP__INSECURE - Whether to enable client transport security for the exporter’s gRPC connection (true/false, default: false)
Because these are environment variables instead of chart configuration keys, they must be set directly as such. See this example:
notebooks:
  # Common config variables for all services
  <notebook-service-name>:
    configs:
      data:
        TELEMETRY_ENABLED: true
        TELEMETRY_EXPORTER: otlp
        TELEMETRY_OTLP__PROTOCOL: grpc
        TELEMETRY_OTLP__ENDPOINT: http://my-otel-endpoint:4317
        TELEMETRY_OTLP__INSECURE: true
        TELEMETRY_METRIC_EXPORT_INTERVAL: 5000
  # Service specific configs
  orchestrator:
    configs:
      data:
        TELEMETRY_OTLP__ENDPOINT: http://another-otel-endpoint:4317
  websocket:
    configs:
      data:
        TELEMETRY_ENABLED: false
That configuration:
- Enables OpenTelemetry by default for every service and exports telemetry from all of them to the http://my-otel-endpoint:4317 endpoint
- Configures the orchestrator service specifically to export telemetry to http://another-otel-endpoint:4317
- Disables telemetry for the websocket service
Running post-installation tests¶
The observability-core subchart comes with Helm tests that can be run after installing or upgrading the DataRobot chart. These tests ensure that the OTLP receivers of each of the deployments are receiving and accepting telemetry. Common failure modes that these tests detect include:
- There’s a general OTEL configuration syntax error or misconfiguration
- There’s a specific misconfiguration in the receiver preventing the collector from receiving telemetry (this shouldn’t happen unless the receiver was explicitly overridden)
- The service endpoints for the opentelemetry collectors aren't reachable
- The daemonset instance (excluded from the previous point because it doesn’t expose an endpoint) couldn't bind the port on the node
The tests won’t fail if the exporter is configured for an unreachable endpoint.
To run these tests, run the helm test command for the release:
helm test dr --filter "name=test-otlp-receivers"
Note that, without the name=test-otlp-receivers filter, all helm tests included
in the DataRobot distribution are run.
Debugging failed tests¶
If any of the tests fail, the command exits with a non-zero status, and the pod where the test ran is left in an errored state.
The first step is to identify the failed test. To do that, run:
kubectl get pod test-otlp-receivers \
-o jsonpath='{.metadata.annotations.tested-service}'
This points to the specific OTEL collector instance the test ran
against. If this pod is in an errored state, its logs should give a clear
explanation of what went wrong. For example, consider this incorrect
configuration for the opentelemetry-collector-daemonset, which tried to
configure a processor within the service, instead of as a top-level key
under config:
opentelemetry-collector-daemonset:
  config:
    service:
      processors: # <- wrong, this should be a level up, under `config`
      pipelines:
        # etc
The helm test exits with an error, and the daemonset pods output the following:
kubectl logs observability-v2-otel-daemonset-agent-mvqsn
Error: failed to get config: cannot unmarshal the configuration: decoding failed due to the following error(s):
'service' has invalid keys: processors
If the pod is instead healthy, the issue likely relates to networking.
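In that case, a quick connectivity check can help narrow it down. The following is a minimal sketch, assuming the collector service name and OTLP HTTP port mentioned earlier in this guide; adjust both to your deployment:

```shell
# Spin up a throwaway pod and probe the collector's OTLP HTTP receiver.
# The service name observability-v2-otel-deployment and port 4318 are
# assumptions based on the defaults described elsewhere in this document.
kubectl run otlp-probe --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s -o /dev/null -w '%{http_code}\n' \
  -X POST -H 'Content-Type: application/json' -d '{}' \
  http://observability-v2-otel-deployment:4318/v1/traces
```

A 200 response indicates the receiver is reachable from inside the cluster; a timeout or connection refused points to networking or Service configuration rather than the collector itself.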
Optional: Extending OTEL pipelines with custom processors¶
This optional section explains how to extend the OTEL pipelines with custom processors, for example to add a custom attribute, a metric transformer, or a resource detector. If you don't need that, skip this section.
To add an OTEL processor to the pipeline, configure it and then include it in
the output pipeline of each relevant signal. These pipelines are defined in
each of the example values files as _[logs|metrics|traces]Pipeline anchors,
as in the following example for AWS with CloudWatch:
_logsPipeline: &_logsPipeline
  receivers: [forward/logs]
  processors: []
  exporters: [awscloudwatchlogs]

_metricsPipeline: &_metricsPipeline
  receivers: [forward/metrics]
  processors: []
  exporters: [prometheusremotewrite]

_tracesPipeline: &_tracesPipeline
  receivers: [forward/traces]
  processors: []
  exporters: [awsxray]
These anchors are already referenced in the respective OTEL configs.
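As a sketch of what that referencing looks like (the exact merge point is an assumption; check your values file), each anchor is aliased into the collector's service.pipelines section:

```yaml
opentelemetry-collector-deployment:
  config:
    service:
      pipelines:
        logs: *_logsPipeline      # assumed merge point
        metrics: *_metricsPipeline
        traces: *_tracesPipeline
```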
In the following example, a simple processor that adds a custom attribute is defined and then referenced in the logs and traces pipelines:
processors:
  attributes/add_custom_attribute:
    actions:
      - key: "custom_attribute"
        action: insert
        value: "custom_value"

_logsPipeline: &_logsPipeline
  receivers: [forward/logs]
  processors: [attributes/add_custom_attribute]
  exporters: [awscloudwatchlogs]

_metricsPipeline: &_metricsPipeline
  receivers: [forward/metrics]
  processors: []
  exporters: [prometheusremotewrite]

_tracesPipeline: &_tracesPipeline
  receivers: [forward/traces]
  processors: [attributes/add_custom_attribute]
  exporters: [awsxray]
Daemonsets, possible conflicts, taints and tolerations¶
As mentioned in the observability subchart structure
section, there are two
daemonsets: prometheus-node-exporter and opentelemetry-collector-daemonset.
There are a couple of things to bear in mind with these.
Conflicts with existing daemonsets/already bound ports¶
There are several scenarios where the daemonsets could create conflicts for an existing setup in the cluster.
Disabling the OTEL and node-exporter daemonsets¶
If there is already an OTEL or node-exporter daemonset in the cluster and you don't want to install the ones bundled with DataRobot, disable them:
datarobot-observability-core:
  opentelemetry-collector-daemonset:
    enabled: false
  prometheus-node-exporter:
    enabled: false
Note that if the already installed daemonsets aren't configured like the ones bundled with the DataRobot chart, telemetry might be incomplete.
Alternatively, you can disable the already installed daemonsets in favor of the ones deployed with DataRobot (recommended, to ensure that telemetry is complete).
Already bound ports¶
The OTEL collector daemon agent tries to bind ports 4317 and 4318
(for OTLP gRPC and OTLP HTTP respectively) on the host. If these are already
bound on the node, configure a different set of ports for the
daemonset:
datarobot-observability-core:
  opentelemetry-collector-daemonset:
    ports:
      otlp:
        hostPort: 4319 # standard is 4317
      otlp-http:
        hostPort: 4320 # standard is 4318
Taints and tolerations¶
Daemonsets are expected to run on every node. If nodes where DataRobot runs are tainted, these daemonsets must tolerate them.
Because tolerations are a list and Helm doesn't merge list values, the whole
list needs to be recreated. The only toleration included in the chart is for
the nvidia.com/gpu taint (also mentioned in the Enable the use of GPUs
subsection in the Custom Models Configuration section of the installation
guide), so the toleration list must look like the following:
runEverywhereTolerations: &runEverywhereTolerations
  tolerations:
    - key: "nvidia.com/gpu" # Already defined by the observability chart; keep it here so it isn't clobbered
      operator: "Exists"
    - key: "existing-taint-1"
      operator: "Exists"
    - key: "existing-taint-2"
      operator: "Exists"
    # etc

datarobot-observability-core:
  opentelemetry-collector-daemonset:
    <<: *runEverywhereTolerations
  prometheus-node-exporter:
    <<: *runEverywhereTolerations
Running on a dedicated node group¶
Note: More in-depth documentation about running DataRobot on a dedicated node
group can be found in the installation guide, in the Running DataRobot
Application on a Dedicated Node Group section. This section only covers the
observability subchart configuration.
As mentioned in that document, if the label used for the taint is
dedicated=DatarobotNodeGroup, define it as a toleration for the
subcharts, as well as the selector. You can define an anchor and reference it
in the subcharts:
dedicatedNodeGroupSelectorAndTolerations: &dedicatedNodeGroupSelectorAndTolerations
  nodeSelector:
    dedicated: "DatarobotNodeGroup"
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "DatarobotNodeGroup"
      effect: "NoSchedule"

datarobot-observability-core:
  kube-state-metrics:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  prometheus-node-exporter:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-deployment:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-daemonset:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-scraper:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-scraper-static:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-statsd:
    <<: *dedicatedNodeGroupSelectorAndTolerations
Community Grafana dashboards¶
If Grafana is used for metrics visualization, there are several dashboards created and maintained by the community that can easily be imported.
Note that not every panel necessarily works out of the box: the dashboards aren’t tailored to DataRobot, so some labels might be missing or some specific metrics might be dropped.
This section lists some of the dashboards that can be useful.
Grafana Dashboards repository¶
Dashboards available in Grafana can be imported directly by pasting the link under Dashboards/New/Import.
- Node information: node metrics (from Prometheus Node Exporter), including overall health, network, CPU, memory and disk information
- Docker monitoring: cAdvisor container metrics, for CPU, memory and disk IO
- Kube State Metrics v2: overview of the state and performance of Kubernetes resources within the cluster, based on both kube state and cAdvisor metrics
- OpenTelemetry collector: OTEL internal metrics for monitoring the health of the collectors
Other sources¶
To import these dashboards, the JSON definition needs to be pasted or loaded under Dashboards/New/Import.
- dotdc/grafana-dashboards-kubernetes: several dashboards for Kubernetes components; see descriptions and installation