Instrumenting DataRobot to emit application level metrics and traces

Recall from the high-level architecture that SDK-instrumented application telemetry (i.e. metrics and traces emitted explicitly from the code) is pushed directly to an OTEL deployment, as opposed to the rest of the telemetry, which is pulled (i.e. logs and cluster-level metrics). This needs to be explicitly enabled by setting the global.opentelemetry.enabled key to true:

global:
  opentelemetry:
    enabled: true

With that, all OTEL-instrumented services will start sending telemetry (with the exception of Notebooks, explained below).

It's also possible to disable this for a specific service, for example:

global:
  opentelemetry:
    enabled: true

ocr-service:
  opentelemetry:
    enabled: false

Note that in this release, enabling OpenTelemetry for the kubeworkers service is not a supported configuration. The recommended global setting correctly leaves it disabled, but it must not be enabled manually.
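
If you prefer to make this explicit in your values file (optional, since the recommended global setting already leaves it disabled), a per-service override following the same pattern as above can be used. Note that the kubeworkers key below is an assumption and should match the service's actual chart key:

global:
  opentelemetry:
    enabled: true

kubeworkers:
  opentelemetry:
    enabled: false  # assumed key; must not be set to true in this release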

Notebooks configuration

Notebooks microservices follow a different convention. In their case, the equivalent configuration is controlled through environment variables (not chart configuration keys):

  • TELEMETRY_ENABLED - enables/disables telemetry (true/false)
  • TELEMETRY_OTLP__ENDPOINT - the specific endpoint. This must be set to http://observability-v2-otel-deployment:4317

Additionally, Notebooks services expose other variables:

  • TELEMETRY_EXPORTER - The exporter to use. Must be set to otlp to actually export the telemetry (the default value is console). jaeger is also available for traces
  • TELEMETRY_EXPORT_INTERVAL - The interval (in milliseconds) at which telemetry is exported
  • TELEMETRY_OTLP__PROTOCOL - The OTLP transport protocol (grpc/http)
  • TELEMETRY_OTLP_HEADERS - A list of headers to apply to all outgoing data (default: N/A)
  • TELEMETRY_OTLP__INSECURE - Whether to enable client transport security for the exporter’s gRPC connection (true/false, default: false)

Because these are environment variables instead of chart configuration keys, they must be set directly as such. See this example:

notebooks:
  # Common config variables for all services
  <notebook-service-name>:
    configs:
      data:
        TELEMETRY_ENABLED: true
        TELEMETRY_EXPORTER: otlp
        TELEMETRY_OTLP__PROTOCOL: grpc
        TELEMETRY_OTLP__ENDPOINT: http://my-otel-endpoint:4317
        TELEMETRY_OTLP__INSECURE: true
        TELEMETRY_METRIC_EXPORT_INTERVAL: 5000
  # Service specific configs
  orchestrator:
    configs:
      data:
        TELEMETRY_OTLP__ENDPOINT: http://another-otel-endpoint:4317
  websocket:
    configs:
      data:
        TELEMETRY_ENABLED: false

That will:

  • Enable OpenTelemetry by default for every service, and make all of them export telemetry to the http://my-otel-endpoint:4317 endpoint
  • Specifically configure the orchestrator service to export telemetry to http://another-otel-endpoint:4317
  • Disable telemetry for the websocket service

Running post-installation tests

The observability-core subchart comes with Helm tests that can be run after installing or upgrading the DataRobot chart. These tests ensure that the OTLP receivers of each of the deployments are receiving and accepting telemetry. Common failure modes that these tests will detect include:

  • There’s a general OTEL configuration syntax error or misconfiguration
  • There’s a specific misconfiguration in the receiver preventing the collector from receiving telemetry (this shouldn’t be the case unless the receiver was explicitly overridden)
  • The service endpoints for the opentelemetry collectors are not reachable
  • The daemonset instance (excluded from the previous point because it doesn’t expose an endpoint) could not bind the port on the node

The tests won’t fail if the exporter is configured for an unreachable endpoint.

To run these tests, run the helm test command for the release:

helm test dr --filter "name=test-otlp-receivers"

Note that without the name=test-otlp-receivers filter, all the Helm tests included in the DataRobot distribution would be run.
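
For example, to run every bundled test for the same release:

helm test dr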

Debugging failed tests

If any of the tests fail, the command will exit with a non-zero status, and the pod where the test was run will be left in an errored state.

The first step is to identify the failed test. To do that, run:

kubectl get pod test-otlp-receivers \
    -o jsonpath='{.metadata.annotations.tested-service}'

That will point to the specific OTEL collector instance the test was run against. If this pod is in an errored state, its logs should give a clear explanation of what went wrong. For example, consider this incorrect configuration for the opentelemetry-collector-daemonset, which tries to configure a processor within the service section instead of as a top-level key under config:

opentelemetry-collector-daemonset:
  config:
    service:
      processors:  # <- wrong, this should be a level up, under `config`
      pipelines:
        # etc

The helm test would exit with an error, and the daemonset pods would output the following:

kubectl logs observability-v2-otel-daemonset-agent-mvqsn

Error: failed to get config: cannot unmarshal the configuration: decoding failed due to the following error(s):
'service' has invalid keys: processors
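
For reference, a corrected version of the same snippet moves processors up one level, directly under config (a sketch; the actual processor definitions and pipelines depend on your configuration):

opentelemetry-collector-daemonset:
  config:
    processors:
      # processor definitions go here
    service:
      pipelines:
        # etc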

If the pod is instead healthy, the issue is most likely related to networking.
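
In that case, a reasonable first check (assuming the deployment collector's service name observability-v2-otel-deployment used earlier in this guide) is to verify that the collector service exists and has endpoints backing it:

kubectl get svc observability-v2-otel-deployment
kubectl get endpoints observability-v2-otel-deployment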

Optional: extending OTEL pipelines with custom processors

This completely optional section explains how to extend the OTEL pipelines with custom processors, in case you need to add, for example, a custom attribute, a metric transformer, or a resource detector. If that's not the case, it can be skipped.

To add an additional OTEL processor to the pipeline, configure it and then include it in the output pipeline of each of the respective signals. These pipelines are defined in each of the example values files as _[logs|metrics|traces]Pipeline anchors, as in the following example for AWS with CloudWatch:

_logsPipeline: &_logsPipeline
  receivers: [forward/logs]
  processors: []
  exporters: [awscloudwatchlogs]

_metricsPipeline: &_metricsPipeline
  receivers: [forward/metrics]
  processors: []
  exporters: [prometheusremotewrite]

_tracesPipeline: &_tracesPipeline
  receivers: [forward/traces]
  processors: []
  exporters: [awsxray]

These anchors are already referenced in the respective OTEL configs.
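
For context, they are consumed by the collector configuration roughly as follows (a sketch; the exact deployment and pipeline keys depend on the example values file in use):

opentelemetry-collector-deployment:
  config:
    service:
      pipelines:
        logs: *_logsPipeline
        metrics: *_metricsPipeline
        traces: *_tracesPipeline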

In the following example, a simple processor that adds a custom attribute is defined and then referenced in the logs and traces pipelines:

processors:
  attributes/add_custom_attribute:
    actions:
      - key: "custom_attribute"
        action: insert
        value: "custom_value"

_logsPipeline: &_logsPipeline
  receivers: [forward/logs]
  processors: [attributes/add_custom_attribute]
  exporters: [awscloudwatchlogs]

_metricsPipeline: &_metricsPipeline
  receivers: [forward/metrics]
  processors: []
  exporters: [prometheusremotewrite]

_tracesPipeline: &_tracesPipeline
  receivers: [forward/traces]
  processors: [attributes/add_custom_attribute]
  exporters: [awsxray]

Daemonsets, possible conflicts, taints and tolerations

As mentioned in the observability subchart structure section, there are two sets of daemonsets, prometheus-node-exporter and opentelemetry-collector-daemonset. There are a couple of things to bear in mind with these.

Conflicts with existing daemonsets/already bound ports

There are several scenarios where the daemonsets could create conflicts for an existing setup in the cluster.

Disabling the OTEL and node-exporter daemonsets

If there is already an OTEL or node-exporter daemonset in the cluster and you don't want to install the ones bundled with DataRobot, you will need to disable them:

datarobot-observability-core:
  opentelemetry-collector-daemonset:
    enabled: false
  prometheus-node-exporter:
    enabled: false

Note that if the already installed daemonsets are not configured equivalently to the configuration bundled with the DataRobot chart, telemetry might be incomplete.

Conversely, you can disable the already installed daemonsets in favor of the ones deployed with DataRobot (recommended, to make sure that telemetry is complete).

Already bound ports

The OTEL collector daemonset agent will try to bind ports 4317 and 4318 (for OTLP gRPC and OTLP HTTP, respectively) on the host. If these are already bound on the node, a different set of ports needs to be configured for the daemonset:

datarobot-observability-core:
  opentelemetry-collector-daemonset:
    ports:
      otlp:
        hostPort: 4319  # standard is 4317
      otlp-http:
        hostPort: 4320  # standard is 4318

Taints and tolerations

Daemonsets are expected to run on every node. If nodes where DataRobot is expected to be running are tainted, these daemonsets will need to tolerate them.

Because this is a list and Helm doesn't append list values, the whole list needs to be recreated. The only toleration already included in the chart is for the nvidia.com/gpu taint (also mentioned in the Enable the use of GPUs subsection of the Custom Models Configuration section in the installation guide), so the toleration list would need to look like the following:

runEverywhereTolerations: &runEverywhereTolerations
  tolerations:
    - key: "nvidia.com/gpu"  # Already defined by the observability chart; needs to be included not to have it clobbered
      operator: "Exists"
    - key: "existing-taint-1"
      operator: "Exists"
    - key: "existing-taint-2"
      operator: "Exists"
    # etc

datarobot-observability-core:
  opentelemetry-collector-daemonset:
    <<: *runEverywhereTolerations
  prometheus-node-exporter:
    <<: *runEverywhereTolerations

Running on a Dedicated Node Group

Note: more in-depth documentation about running DataRobot on a dedicated node group can be found in the installation guide, in the Running DataRobot Application on a Dedicated Node Group section. This section only covers the observability subchart configuration.

As mentioned in that document, if the label used for the taint is dedicated=DatarobotNodeGroup, it needs to be defined both as a toleration and as a node selector for the subcharts. You can define an anchor and reference it in the subcharts:

dedicatedNodeGroupSelectorAndTolerations: &dedicatedNodeGroupSelectorAndTolerations
  nodeSelector:
    dedicated: "DatarobotNodeGroup"
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "DatarobotNodeGroup"
      effect: "NoSchedule"

datarobot-observability-core:
  kube-state-metrics:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  prometheus-node-exporter:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-deployment:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-daemonset:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-scraper:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-scraper-static:
    <<: *dedicatedNodeGroupSelectorAndTolerations
  opentelemetry-collector-statsd:
    <<: *dedicatedNodeGroupSelectorAndTolerations

Community Grafana dashboards

In case Grafana is used for metrics visualization, there are several dashboards created and maintained by the community that can be easily imported.

Note that not all panels will necessarily work out of the box, since they're not tailored to DataRobot: some labels might be missing, or some specific metrics might be dropped.

This section lists some of the dashboards that can be useful.

Grafana dashboards repository

Dashboards available in the Grafana dashboards repository can be imported directly by pasting the dashboard link (or ID) in Dashboards/New/Import.

  • Node information: node metrics (from Prometheus Node Exporter), including overall health, network, CPU, memory and disk information
  • Docker monitoring: cAdvisor container metrics, for CPU, memory and disk IO
  • Kube State Metrics v2: overview of the state and performance of Kubernetes resources within the cluster, based on both kube state and cAdvisor metrics
  • OpenTelemetry collector: OTEL internal metrics for monitoring the health of the collectors

Other sources

To import these dashboards, the JSON definition needs to be pasted or loaded in Dashboards/New/Import.