Instrumenting DataRobot to emit application level metrics and traces

Recall from the high-level architecture that application-level, SDK-instrumented telemetry (metrics and traces emitted explicitly from the code) is pushed directly to an OTEL deployment, as opposed to the rest of the telemetry (logs and cluster-level metrics), which is pulled. Enable this explicitly by setting the global.opentelemetry.enabled key to true:

global:
  opentelemetry:
    enabled: true

With that, all OTEL-instrumented services start sending telemetry (except for Notebooks, explained below).

It's also possible to disable this for specific services, for example:

global:
  opentelemetry:
    enabled: true

ocr-service:
  opentelemetry:
    enabled: false

Note that in this release, enabling OpenTelemetry for the kubeworkers service isn't a supported configuration. It isn't enabled by the recommended global setting and must not be manually enabled.

Notebooks configuration

Notebooks microservices follow a different convention: they are configured through environment variables rather than chart configuration keys. The equivalent settings are the following:

  • TELEMETRY_ENABLED - Enables/disables telemetry (true/false)
  • TELEMETRY_OTLP__ENDPOINT - The target endpoint. This must be set to http://observability-v2-otel-deployment:4317

Additionally, Notebooks services expose other variables:

  • TELEMETRY_EXPORTER - The exporter to use. Must be set to otlp to actually export the telemetry (the default is console); jaeger is also available for traces
  • TELEMETRY_EXPORT_INTERVAL - The interval (in milliseconds) at which to export telemetry
  • TELEMETRY_OTLP__PROTOCOL - The OTLP transport protocol (grpc/http)
  • TELEMETRY_OTLP_HEADERS - A list of headers to apply to all outgoing data (default: N/A)
  • TELEMETRY_OTLP__INSECURE - Whether to disable client transport security (TLS) for the exporter's gRPC connection (true/false, default: false)

Because these are environment variables rather than chart configuration keys, they must be set directly as such. For example:

notebooks:
  # Common config variables for all services
  <notebook-service-name>:
    configs:
      data:
        TELEMETRY_ENABLED: true
        TELEMETRY_EXPORTER: otlp
        TELEMETRY_OTLP__PROTOCOL: grpc
        TELEMETRY_OTLP__ENDPOINT: http://my-otel-endpoint:4317
        TELEMETRY_OTLP__INSECURE: true
        TELEMETRY_METRIC_EXPORT_INTERVAL: 5000
  # Service specific configs
  orchestrator:
    configs:
      data:
        TELEMETRY_OTLP__ENDPOINT: http://another-otel-endpoint:4317
  websocket:
    configs:
      data:
        TELEMETRY_ENABLED: false

That configuration:

  • Enables OpenTelemetry by default for every service and exports telemetry from all of them to the http://my-otel-endpoint:4317 endpoint
  • Configures the orchestrator service specifically to export telemetry to http://another-otel-endpoint:4317 instead
  • Disables telemetry for the websocket service

Running post-installation tests

The observability-core subchart comes with Helm tests that can be run after an installation or upgrade of the DataRobot chart. These tests ensure that the OTLP receivers of each of the deployments are receiving and accepting telemetry. Common failure modes that these tests detect include:

  • There’s a general otel configuration syntax error or misconfiguration
  • There’s a specific misconfiguration in the receiver preventing the collector from receiving telemetry (this shouldn’t be the case unless the receiver was explicitly overridden)
  • The service endpoints for the opentelemetry collectors aren't reachable
  • The daemonset instance (excluded from the previous point because it doesn’t expose an endpoint) couldn't bind the port on the node

The tests won’t fail if the exporter is configured for an unreachable endpoint.

To run these tests, use the helm test command for the release:

helm test dr --filter "name=test-otlp-receivers"

Note that without the name=test-otlp-receivers filter, all Helm tests included in the DataRobot distribution are run.

Debugging failed tests

If any of the tests fail, the command exits with a non-zero status, and the pod where the test ran is left in an errored state.

The first step is to identify the failed test. To do so, run:

kubectl get pod test-otlp-receivers \
    -o jsonpath='{.metadata.annotations.tested-service}'

This returns the specific OTEL collector instance the test ran against. If that collector pod is in an errored state, its logs should give a clear explanation of what went wrong. For example, consider this incorrect configuration for the opentelemetry-collector-daemonset, which defines processors within the service section instead of as a top-level key under config:

opentelemetry-collector-daemonset:
  config:
    service:
      processors:  # <- wrong, this should be a level up, under `config`
      pipelines:
        # etc

The helm test exits with an error, and the daemonset pods output the following:

kubectl logs observability-v2-otel-daemonset-agent-mvqsn

Error: failed to get config: cannot unmarshal the configuration: decoding failed due to the following error(s):
'service' has invalid keys: processors
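For reference, the error disappears once processors is moved up one level, directly under config (a minimal sketch of the corrected layout; the actual processor definitions depend on your configuration):

```yaml
opentelemetry-collector-daemonset:
  config:
    processors:  # correct: a top-level key under `config`
      # processor definitions go here
    service:
      pipelines:
        # etc
```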

If the pod is instead healthy, the issue likely relates to networking.

Optional: Extending OTEL pipelines with custom processors

This optional section explains how to extend the OTEL pipelines with custom processors, for example, to add a custom attribute, a metric transformer, or a resource detector. If you don't need this, skip it.

To add an OTEL processor to a pipeline, configure it and then include it in the output pipeline of each relevant signal. These pipelines are defined in each of the example values files as _[logs|metrics|traces]Pipeline anchors, as in the following example for AWS with CloudWatch:

_logsPipeline: &_logsPipeline
  receivers: [forward/logs]
  processors: []
  exporters: [awscloudwatchlogs]

_metricsPipeline: &_metricsPipeline
  receivers: [forward/metrics]
  processors: []
  exporters: [prometheusremotewrite]

_tracesPipeline: &_tracesPipeline
  receivers: [forward/traces]
  processors: []
  exporters: [awsxray]

These anchors are already referenced in the respective OTEL configs.
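For illustration, the anchors are consumed roughly like this (a sketch only; the exact collector keys and their placement come from the bundled example values files):

```yaml
# Illustrative only: the pipeline anchors aliased from a collector config
opentelemetry-collector-deployment:
  config:
    service:
      pipelines:
        logs: *_logsPipeline
        metrics: *_metricsPipeline
        traces: *_tracesPipeline
```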

In the following example, a simple processor that adds a custom attribute is defined and then referenced in the logs and traces pipelines:

processors:
  attributes/add_custom_attribute:
    actions:
      - key: "custom_attribute"
        action: insert
        value: "custom_value"

_logsPipeline: &_logsPipeline
  receivers: [forward/logs]
  processors: [attributes/add_custom_attribute]
  exporters: [awscloudwatchlogs]

_metricsPipeline: &_metricsPipeline
  receivers: [forward/metrics]
  processors: []
  exporters: [prometheusremotewrite]

_tracesPipeline: &_tracesPipeline
  receivers: [forward/traces]
  processors: [attributes/add_custom_attribute]
  exporters: [awsxray]

Daemonsets, possible conflicts, taints and tolerations

As mentioned in the observability subchart structure section, two daemonsets are deployed: prometheus-node-exporter and opentelemetry-collector-daemonset. There are a couple of things to bear in mind with these.

Conflicts with existing daemonsets/already bound ports

There are several scenarios where the daemonsets could create conflicts for an existing setup in the cluster.

Disabling the OTEL and node-exporter daemonsets

If there is already an OTEL or node-exporter daemonset in the cluster and you don't want to install the ones bundled with DataRobot, disable them:

datarobot-observability-core:
  opentelemetry-collector-daemonset:
    enabled: false
  prometheus-node-exporter:
    enabled: false

Note that if the already installed daemonsets aren't configured in the same way as those bundled with the DataRobot chart, telemetry might be incomplete.

Alternatively, you can disable the already installed daemonsets in favor of the ones deployed with DataRobot (recommended to ensure that telemetry is complete).

Already bound ports

The OTEL collector daemon agent tries to bind ports 4317 and 4318 (for OTLP gRPC and OTLP HTTP respectively) on the host. If these are already bound on the node, configure a different set of ports for the daemonset:

datarobot-observability-core:
  opentelemetry-collector-daemonset:
    ports:
      otlp:
        hostPort: 4319  # standard is 4317
      otlp-http:
        hostPort: 4320  # standard is 4318

Taints and tolerations

Daemonsets are expected to run on every node. If nodes where DataRobot runs are tainted, these daemonsets must tolerate them.

Because this is a list and Helm doesn't merge list values, the whole list needs to be recreated. The only toleration included in the chart is for the nvidia.com/gpu taint (also mentioned in the Enable the use of GPUs subsection of the Custom Models Configuration section in the installation guide), so the toleration list must look like the following:

runEverywhereTolerations: &runEverywhereTolerations
  tolerations:
    - key: "nvidia.com/gpu"  # Already defined by the observability chart; needs to be included not to have it clobbered
      operator: "Exists"
    - key: "existing-taint-1"
      operator: "Exists"
    - key: "existing-taint-2"
      operator: "Exists"
    # etc

datarobot-observability-core:
  opentelemetry-collector-daemonset:
    <<: *runEverywhereTolerations
  prometheus-node-exporter:
    <<: *runEverywhereTolerations

Running on a dedicated node group

Note: More in-depth documentation about running DataRobot on a dedicated node group can be found in the Running DataRobot Application on a Dedicated Node Group section of the installation guide. This section only covers the observability subchart configuration.

As mentioned in that document, if the label used for the taint is dedicated=DatarobotNodeGroup, define it as a toleration for the subcharts, along with the matching node selector. You can define an anchor and reference it in the subcharts:

dedicatedNodeGroupSelectorAndTolerations: &dedicatedNodeGroupSelectorAndTolerations
  nodeSelector:
    dedicated: "DatarobotNodeGroup"
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "DatarobotNodeGroup"
      effect: "NoSchedule"

datarobot-observability-core:
  kube-state-metrics:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  prometheus-node-exporter:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-deployment:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-daemonset:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-scraper:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-scraper-static:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-statsd:
    <<: *dedicatedNodeGroupSelectorAndTolerations

Community Grafana dashboards

If Grafana is used for metrics visualization, there are several community-maintained dashboards that can be imported easily.

Note that not all panels will necessarily work out of the box: these dashboards aren't tailored to DataRobot, so some labels might be missing or some specific metrics might be dropped.

This section lists some of the dashboards that can be useful.

Grafana Dashboards repository

Dashboards from the Grafana dashboards repository can be imported directly by pasting their link in Dashboards/New/Import.

  • Node information: node metrics (from Prometheus Node Exporter), including overall health, network, CPU, memory and disk information
  • Docker monitoring: cAdvisor container metrics, for CPU, memory and disk IO
  • Kube State Metrics v2: overview of the state and performance of Kubernetes resources within the cluster, based on both kube state and cAdvisor metrics
  • OpenTelemetry collector: OTEL internal metrics for monitoring the health of the collectors

Other sources

To import these dashboards, paste or load the JSON definition in Dashboards/New/Import.
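As an alternative to manual import, if your Grafana instance runs in the same cluster with the dashboard-provisioning sidecar enabled (an assumption; this depends entirely on how Grafana was deployed), dashboards can also be provisioned as ConfigMaps. A hypothetical sketch, with placeholder names and a trivial dashboard body:

```yaml
# Hypothetical example: provisioning a dashboard via a ConfigMap.
# Assumes the Grafana sidecar watches for the `grafana_dashboard` label;
# the label key and value are configurable in the Grafana deployment.
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-dashboard  # placeholder name
  labels:
    grafana_dashboard: "1"
data:
  custom-dashboard.json: |
    {"title": "Custom dashboard", "panels": []}
```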