OpenShift native observability

When DataRobot is deployed on OpenShift, the platform's native observability stack can be used instead of the datarobot-observability-core subchart. This approach leverages Red Hat operators and built-in components that are already part of the OpenShift ecosystem, avoiding the need to deploy and maintain a separate observability infrastructure.

Note

This section does not use the datarobot-observability-core subchart. The subchart and its subcharts (kube-state-metrics, prometheus-node-exporter, and OpenTelemetry collectors) must remain disabled (default behavior). All telemetry collection and storage is handled by OpenShift-native operators.

Overview

The OpenShift native observability stack covers the three telemetry signals:

| Signal | Component | Managed by |
| --- | --- | --- |
| Metrics (cluster level) | Prometheus, Thanos, Alertmanager | Cluster Monitoring Operator (CMO), built in |
| Metrics (application SDK) | OpenTelemetry Collector + Prometheus exporter | Red Hat OpenTelemetry Operator |
| Logs | LokiStack + Vector | Loki Operator + Logging Operator |
| Traces | TempoStack | Tempo Operator |
| Console UI (logs, traces) | UIPlugin CRD | Cluster Observability Operator (COO) |

CMO vs COO

OpenShift has two observability-related operators that serve different purposes:

| Operator | Full name | Ships with | Manages |
| --- | --- | --- | --- |
| CMO | Cluster Monitoring Operator | OpenShift (pre-installed) | Prometheus, Thanos, Alertmanager |
| COO | Cluster Observability Operator | OperatorHub (install separately) | UIPlugin CRD for console integration |

CMO is pre-installed and provides metrics and alerting out of the box. COO must be installed separately and provides only the UIPlugin custom resource for adding the Logs and Traces views to the OpenShift Console. COO does not manage Loki, Tempo, or any backends.

Operators

The following Red Hat operators need to be installed via OperatorHub (OLM). All operators should use the stable channel and Automatic install plan approval unless otherwise required by your organization's change management policy.

| Operator | Purpose | Reference |
| --- | --- | --- |
| Cluster Observability Operator | UIPlugin CRD for console integration (Logs/Traces views) | COO overview |
| Loki Operator | Manages LokiStack instances for log storage | Logging |
| Red Hat OpenShift Logging Operator | Manages ClusterLogForwarder (Vector collectors) | Logging |
| Tempo Operator | Manages TempoStack instances for trace storage | Distributed tracing |
| Red Hat OpenTelemetry Operator | Manages OpenTelemetryCollector instances | Red Hat build of OpenTelemetry |

Metrics

Cluster-level metrics (CMO)

OpenShift's built-in Cluster Monitoring Operator (CMO) provides Prometheus, Thanos, and Alertmanager pre-installed in the openshift-monitoring namespace. By default, CMO collects cluster-level metrics only for openshift-* namespaces. To extend the same telemetry to user namespaces (including the DataRobot namespace), user workload monitoring must be enabled:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true 

Once enabled, the same metrics that CMO already collects for openshift-* namespaces are also collected for user namespaces, including the namespace where DataRobot is deployed:

  • kube-state-metrics: Kubernetes object state (pods, deployments, services, etc.)
  • cAdvisor: Container resource usage (CPU, memory, filesystem, network)
  • kubelet: Node and pod lifecycle metrics

For full configuration details, see the Monitoring documentation.

Application SDK metrics

DataRobot services that are instrumented with the OpenTelemetry SDK emit metrics (and traces) via OTLP. These are not directly scrapeable by Prometheus. To integrate them into the OpenShift metrics stack, deploy an OpenTelemetry Collector instance (see OpenTelemetry Collector) that receives OTLP metrics and re-exposes them on a Prometheus exporter endpoint. A ServiceMonitor then enables the user workload Prometheus to scrape these metrics.

The collector configuration for SDK metrics uses the same pipeline described in the OpenTelemetry Collector section, with a prometheus exporter:

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    resource_to_telemetry_conversion:
      enabled: true 
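Putting the pieces together, the exporter plugs into an OTLP-fed metrics pipeline. The fragment below is an illustrative sketch: the receiver endpoints and pipeline wiring are conventional defaults, not values mandated by the platform.

```yaml
# Illustrative collector config fragment: an OTLP receiver feeding the
# prometheus exporter. Ports shown are the conventional OTLP defaults.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    resource_to_telemetry_conversion:
      enabled: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```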

And a corresponding ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-collector
  namespace: <OTEL_COLLECTOR_NAMESPACE>
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: otel-collector
  endpoints:
    - port: prometheus
      interval: 30s
      path: /metrics 

Logging

Log collection uses the Loki Operator for storage (LokiStack) and the Logging Operator for collection (Vector daemonset via ClusterLogForwarder).

Prerequisites

  • S3-compatible object storage for Loki. Supported backends include AWS S3, Azure Blob Storage, Google Cloud Storage, MinIO, and OpenShift Data Foundation.

Setup

  1. Install the Loki Operator and Logging Operator from OperatorHub.

  2. Create the storage secret in the openshift-logging namespace with credentials for your S3-compatible storage backend.

  3. Deploy a LokiStack instance in openshift-logging. Choose an appropriate sizing for your environment.

  4. Configure RBAC for the log collector service account. The Logging Operator requires ClusterRoleBindings for collect-application-logs, collect-infrastructure-logs, and collect-audit-logs.

  5. Create a ClusterLogForwarder named collector in openshift-logging. This deploys Vector collectors on every node that forward logs to LokiStack.

  6. Create the logging UIPlugin to add the Logs view to the OpenShift Console:

    apiVersion: observability.openshift.io/v1alpha1
    kind: UIPlugin
    metadata:
      name: logging
    spec:
      type: Logging
      logging:
        lokiStack:
          name: <LOKISTACK_NAME> 
    
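The manifests for steps 2, 3, and 5 could look like the following sketch. Resource names, sizing, storage class, and secret keys are illustrative assumptions; adapt them to your storage backend and log volume.

```yaml
# Illustrative manifests for steps 2, 3, and 5. All names and sizing
# values are examples, not required values.
apiVersion: v1
kind: Secret
metadata:
  name: logging-loki-s3        # referenced by the LokiStack below
  namespace: openshift-logging
stringData:
  access_key_id: <ACCESS_KEY_ID>
  access_key_secret: <SECRET_ACCESS_KEY>
  bucketnames: <BUCKET_NAME>
  endpoint: <S3_ENDPOINT_URL>
  region: <REGION>
---
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  size: 1x.extra-small         # choose a size appropriate for your environment
  storage:
    schemas:
      - version: v13
        effectiveDate: "2024-01-01"
    secret:
      name: logging-loki-s3
      type: s3
  storageClassName: <STORAGE_CLASS>
  tenants:
    mode: openshift-logging
---
apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: collector
  namespace: openshift-logging
spec:
  serviceAccount:
    name: collector            # the SA bound to the collect-* ClusterRoles in step 4
  outputs:
    - name: default-lokistack
      type: lokiStack
      lokiStack:
        target:
          name: logging-loki
          namespace: openshift-logging
        authentication:
          token:
            from: serviceAccount
      tls:
        ca:
          key: service-ca.crt
          configMapName: openshift-service-ca.crt
  pipelines:
    - name: default-logstore
      inputRefs: [application, infrastructure]
      outputRefs: [default-lokistack]
```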

For detailed instructions, see the Red Hat OpenShift Logging documentation.

Traces

Distributed tracing uses the Tempo Operator for storage (TempoStack) and the Red Hat OpenTelemetry Operator for the collector that forwards traces from application workloads to Tempo.

Prerequisites

  • S3-compatible object storage for Tempo. The same backends supported by Loki are available.

Setup

  1. Install the Tempo Operator from OperatorHub.

  2. Create the storage secret in the tracing namespace with credentials for your S3-compatible storage backend.

  3. Deploy a TempoStack instance. In OpenShift tenant mode (tenants.mode: openshift), the gateway enforces authentication via ServiceAccount bearer tokens and mTLS is enforced between internal components.

  4. Configure RBAC for trace reading and writing. Create ClusterRoles for tempostack-traces-reader and tempostack-traces-writer, binding readers to authenticated users and writers to the OTel Collector service account.

  5. Deploy the OpenTelemetry Collector to forward traces (see OpenTelemetry Collector).

  6. Create the distributed tracing UIPlugin to add the Traces view to the OpenShift Console:

    apiVersion: observability.openshift.io/v1alpha1
    kind: UIPlugin
    metadata:
      name: distributed-tracing
    spec:
      type: DistributedTracing
      distributedTracing:
        tempoStack:
          name: <TEMPOSTACK_NAME>
          namespace: <TEMPOSTACK_NAMESPACE> 
    
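The manifests for steps 2 through 4 could be sketched as follows. Secret keys, storage size, and role names are illustrative assumptions; the `<TENANT>` placeholder must match the tenant name used in the collector's trace exporter path.

```yaml
# Illustrative manifests for steps 2-4. Names and sizing are examples.
apiVersion: v1
kind: Secret
metadata:
  name: tempostack-s3
  namespace: <TEMPOSTACK_NAMESPACE>
stringData:
  access_key_id: <ACCESS_KEY_ID>
  access_key_secret: <SECRET_ACCESS_KEY>
  bucket: <BUCKET_NAME>
  endpoint: <S3_ENDPOINT_URL>
---
apiVersion: tempo.grafana.com/v1alpha1
kind: TempoStack
metadata:
  name: <TEMPOSTACK_NAME>
  namespace: <TEMPOSTACK_NAMESPACE>
spec:
  storage:
    secret:
      name: tempostack-s3
      type: s3
  storageSize: 10Gi
  tenants:
    mode: openshift            # gateway enforces SA bearer-token auth
    authentication:
      - tenantName: <TENANT>   # must match the tenant in the collector exporter path
        tenantId: <TENANT_UUID>
---
# Writer role for the OTel Collector service account (step 4). In Tempo's
# RBAC model, the tenant name is the resource and "traces" the resource name.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tempostack-traces-writer
rules:
  - apiGroups: [tempo.grafana.com]
    resources: [<TENANT>]
    resourceNames: [traces]
    verbs: [create]
```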

For detailed instructions, see the Red Hat OpenShift distributed tracing documentation.

OpenTelemetry Collector

An OpenTelemetry Collector acts as the bridge between DataRobot application workloads and the OpenShift tracing and metrics backends. It receives OTLP telemetry from application services (no auth required from the app side), authenticates to the TempoStack gateway using a projected ServiceAccount token, and exposes a Prometheus endpoint for SDK-emitted metrics.

Why a collector is needed

In OpenShift tenant mode (tenants.mode: openshift), the TempoStack gateway requires bearer token authentication, and the Distributor enforces mTLS. Application SDKs do not natively handle either of these. The collector abstracts this away: applications send plain OTLP to the collector, and the collector handles authentication and forwarding.

Setup

  1. Install the Red Hat OpenTelemetry Operator from OperatorHub (see Operators table).

  2. Create an OpenTelemetryCollector CR. The collector is deployed via the operator using the OpenTelemetryCollector custom resource. The key configuration elements are:

    • A projected ServiceAccount token volume for authenticating to the TempoStack gateway
    • The bearertokenauth extension pointing to the projected token
    • An otlphttp exporter targeting the TempoStack gateway endpoint
    • A prometheus exporter for re-exposing SDK metrics

    The trace exporter must target the TempoStack gateway (not the distributor directly) and include the tenant name in the path:

    exporters:
      otlphttp/traces:
        endpoint: https://<TEMPOSTACK_NAME>-tempo-gateway.<TEMPOSTACK_NAMESPACE>.svc:8080/api/traces/v1/<TENANT>
        tls:
          insecure_skip_verify: true
        auth:
          authenticator: bearertokenauth
    extensions:
      bearertokenauth:
        filename: /var/run/secrets/tempo/token 
    

    The projected token volume is configured as:

    volumes:
      - name: sa-token
        projected:
          sources:
            - serviceAccountToken:
                path: token
                expirationSeconds: 3600
    volumeMounts:
      - name: sa-token
        mountPath: /var/run/secrets/tempo
        readOnly: true 
    
  3. Configure RBAC for the collector's ServiceAccount. It needs:

    • Read access to pods, namespaces, and ReplicaSets for the k8sattributes and resourcedetection processors
    • The tempostack-traces-writer ClusterRole for writing traces to TempoStack
  4. Create a ServiceMonitor to scrape the collector's Prometheus exporter endpoint for SDK metrics (see Application SDK metrics).

Refer to the Red Hat build of OpenTelemetry documentation for full OpenTelemetryCollector CR configuration options.
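The fragments above could be assembled into a single OpenTelemetryCollector CR along these lines. This is a sketch, not a complete reference: the CR name, namespace, and ServiceAccount name are illustrative, and production deployments should mount the service CA instead of skipping TLS verification.

```yaml
# Illustrative OpenTelemetryCollector CR combining the fragments above
# (v1beta1 structured config). Names are examples.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: <OTEL_COLLECTOR_NAMESPACE>
spec:
  mode: deployment
  serviceAccount: otel-collector   # needs the tempostack-traces-writer ClusterRole
  config:
    extensions:
      bearertokenauth:
        filename: /var/run/secrets/tempo/token
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    exporters:
      otlphttp/traces:
        endpoint: https://<TEMPOSTACK_NAME>-tempo-gateway.<TEMPOSTACK_NAMESPACE>.svc:8080/api/traces/v1/<TENANT>
        tls:
          insecure_skip_verify: true   # prefer mounting the service CA in production
        auth:
          authenticator: bearertokenauth
      prometheus:
        endpoint: 0.0.0.0:8889
        resource_to_telemetry_conversion:
          enabled: true
    service:
      extensions: [bearertokenauth]
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlphttp/traces]
        metrics:
          receivers: [otlp]
          exporters: [prometheus]
  volumes:
    - name: sa-token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              expirationSeconds: 3600
  volumeMounts:
    - name: sa-token
      mountPath: /var/run/secrets/tempo
      readOnly: true
```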

DataRobot service configuration

To configure DataRobot services to send telemetry to the collector, see Instrumenting DataRobot to emit application level metrics and traces. Set the global.opentelemetry.exporterEndpoint key to the OpenTelemetry Collector service deployed above. For example:

global:
  opentelemetry:
    enabled: true
    exporterEndpoint: http://<COLLECTOR_SERVICE>.<COLLECTOR_NAMESPACE>.svc:4317 

Console access

Once all components are deployed and the UIPlugins are created:

  • Logs: OpenShift Console > Observe > Logs
  • Traces: OpenShift Console > Observe > Traces
  • Metrics: OpenShift Console > Observe > Metrics (built-in via CMO, no additional setup)