OpenShift native observability

When DataRobot is deployed on OpenShift, the platform's native observability stack can be used instead of the datarobot-observability-core subchart. This approach leverages Red Hat operators and built-in components that are already part of the OpenShift ecosystem, avoiding the need to deploy and maintain a separate observability infrastructure.

Note

This section does not use the datarobot-observability-core subchart. The subchart and its subcharts (kube-state-metrics, prometheus-node-exporter, and OpenTelemetry collectors) must remain disabled (default behavior). All telemetry collection and storage is handled by OpenShift-native operators.

Overview

The OpenShift native observability stack covers the three telemetry signals:

| Signal | Component | Managed by |
| --- | --- | --- |
| Metrics (cluster level) | Prometheus, Thanos, Alertmanager | Cluster Monitoring Operator (CMO), built in |
| Metrics (application SDK) | OpenTelemetry Collector + Prometheus exporter | Red Hat OpenTelemetry Operator |
| Logs | LokiStack + Vector | Loki Operator + Logging Operator |
| Traces | TempoStack | Tempo Operator |
| Console UI (logs, traces) | UIPlugin CRD | Cluster Observability Operator (COO) |

CMO vs COO

OpenShift has two observability-related operators that serve different purposes:

| Operator | Full name | Ships with | Manages |
| --- | --- | --- | --- |
| CMO | Cluster Monitoring Operator | OpenShift (pre-installed) | Prometheus, Thanos, Alertmanager |
| COO | Cluster Observability Operator | OperatorHub (install separately) | UIPlugin CRD for console integration |

CMO is pre-installed and provides metrics and alerting out of the box. COO must be installed separately and provides only the UIPlugin custom resource for adding the Logs and Traces views to the OpenShift Console. COO does not manage Loki, Tempo, or any backends.

Operators

The following Red Hat operators need to be installed via OperatorHub (OLM). All operators should use the stable channel and Automatic install plan approval unless otherwise required by your organization's change management policy.

| Operator | Purpose | Reference |
| --- | --- | --- |
| Cluster Observability Operator | UIPlugin CRD for console integration (Logs/Traces views) | COO overview |
| Loki Operator | Manages LokiStack instances for log storage | Logging |
| Red Hat OpenShift Logging Operator | Manages ClusterLogForwarder (Vector collectors) | Logging |
| Tempo Operator | Manages TempoStack instances for trace storage | Distributed tracing |
| Red Hat OpenTelemetry Operator | Manages OpenTelemetryCollector instances | Red Hat build of OpenTelemetry |

Metrics

Cluster-level metrics (CMO)

OpenShift's built-in Cluster Monitoring Operator (CMO) provides Prometheus, Thanos, and Alertmanager pre-installed in the openshift-monitoring namespace. By default, CMO collects cluster-level metrics only for openshift-* namespaces. To extend the same telemetry to user namespaces (including the DataRobot namespace), user workload monitoring must be enabled:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true 

Once enabled, the same metrics that CMO already collects for openshift-* namespaces are also collected for user namespaces, including the namespace where DataRobot is deployed:

  • kube-state-metrics: Kubernetes object state (pods, deployments, services, etc.)
  • cAdvisor: Container resource usage (CPU, memory, filesystem, network)
  • kubelet: Node and pod lifecycle metrics

For full configuration details, see the Monitoring documentation.

Application SDK metrics

DataRobot services that are instrumented with the OpenTelemetry SDK emit metrics (and traces) via OTLP. These are not directly scrapeable by Prometheus. To integrate them into the OpenShift metrics stack, deploy an OpenTelemetry Collector instance (see OpenTelemetry Collector) that receives OTLP metrics and re-exposes them on a Prometheus exporter endpoint. A ServiceMonitor then enables the user workload Prometheus to scrape these metrics.

The collector configuration for SDK metrics uses the same pipeline described in the OpenTelemetry Collector section, with a prometheus exporter:

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    resource_to_telemetry_conversion:
      enabled: true 
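Putting the pieces together, the exporter plugs into an OTLP-fed metrics pipeline. The fragment below is an illustrative sketch: the receiver endpoints and pipeline wiring are conventional defaults, not values mandated by the platform.

```yaml
# Illustrative collector config fragment: an OTLP receiver feeding the
# prometheus exporter. Ports shown are the conventional OTLP defaults.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    resource_to_telemetry_conversion:
      enabled: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```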

And a corresponding ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-collector
  namespace: <OTEL_COLLECTOR_NAMESPACE>
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: otel-collector
  endpoints:
    - port: prometheus
      interval: 30s
      path: /metrics 

Logging

Log collection uses the Loki Operator for storage (LokiStack) and the Logging Operator for collection (Vector daemonset via ClusterLogForwarder).

Prerequisites

  • S3-compatible object storage for Loki. Supported backends include AWS S3, Azure Blob Storage, Google Cloud Storage, MinIO, and OpenShift Data Foundation.

Setup

  1. Install the Loki Operator and Logging Operator from OperatorHub.

  2. Create the storage secret in the openshift-logging namespace with credentials for your S3-compatible storage backend.

  3. Deploy a LokiStack instance in openshift-logging. Choose an appropriate sizing for your environment.

  4. Configure RBAC for the log collector service account. The Logging Operator requires ClusterRoleBindings for collect-application-logs, collect-infrastructure-logs, and collect-audit-logs.

  5. Create a ClusterLogForwarder named collector in openshift-logging. This deploys Vector collectors on every node that forward logs to LokiStack.

  6. Create the logging UIPlugin to add the Logs view to the OpenShift Console:

    apiVersion: observability.openshift.io/v1alpha1
    kind: UIPlugin
    metadata:
      name: logging
    spec:
      type: Logging
      logging:
        lokiStack:
          name: <LOKISTACK_NAME> 
    
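The manifests for steps 2, 3, and 5 could look like the following sketch. Resource names, sizing, storage class, and secret keys are illustrative assumptions; adapt them to your storage backend and log volume.

```yaml
# Illustrative manifests for steps 2, 3, and 5. All names and sizing
# values are examples, not required values.
apiVersion: v1
kind: Secret
metadata:
  name: logging-loki-s3        # referenced by the LokiStack below
  namespace: openshift-logging
stringData:
  access_key_id: <ACCESS_KEY_ID>
  access_key_secret: <SECRET_ACCESS_KEY>
  bucketnames: <BUCKET_NAME>
  endpoint: <S3_ENDPOINT_URL>
  region: <REGION>
---
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  size: 1x.extra-small         # choose a size appropriate for your environment
  storage:
    schemas:
      - version: v13
        effectiveDate: "2024-01-01"
    secret:
      name: logging-loki-s3
      type: s3
  storageClassName: <STORAGE_CLASS>
  tenants:
    mode: openshift-logging
---
apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: collector
  namespace: openshift-logging
spec:
  serviceAccount:
    name: collector            # the SA bound to the collect-* ClusterRoles in step 4
  outputs:
    - name: default-lokistack
      type: lokiStack
      lokiStack:
        target:
          name: logging-loki
          namespace: openshift-logging
        authentication:
          token:
            from: serviceAccount
      tls:
        ca:
          key: service-ca.crt
          configMapName: openshift-service-ca.crt
  pipelines:
    - name: default-logstore
      inputRefs: [application, infrastructure]
      outputRefs: [default-lokistack]
```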

For detailed instructions, see the Red Hat OpenShift Logging documentation.

Traces

Distributed tracing uses the Tempo Operator for storage (TempoStack) and the Red Hat OpenTelemetry Operator for the collector that forwards traces from application workloads to Tempo.

Prerequisites

  • S3-compatible object storage for Tempo. The same backends supported by Loki are available.

Setup

  1. Install the Tempo Operator from OperatorHub.

  2. Create the storage secret in the tracing namespace with credentials for your S3-compatible storage backend.

  3. Deploy a TempoStack instance. In OpenShift tenant mode (tenants.mode: openshift), the gateway enforces authentication via ServiceAccount bearer tokens and mTLS is enforced between internal components.

  4. Configure RBAC for trace reading and writing. Create ClusterRoles for tempostack-traces-reader and tempostack-traces-writer, binding readers to authenticated users and writers to the OTel Collector service account.

  5. Deploy the OpenTelemetry Collector to forward traces (see OpenTelemetry Collector).

  6. Create the distributed tracing UIPlugin to add the Traces view to the OpenShift Console:

    apiVersion: observability.openshift.io/v1alpha1
    kind: UIPlugin
    metadata:
      name: distributed-tracing
    spec:
      type: DistributedTracing
      distributedTracing:
        tempoStack:
          name: <TEMPOSTACK_NAME>
          namespace: <TEMPOSTACK_NAMESPACE> 
    
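The manifests for steps 2 through 4 could be sketched as follows. Secret keys, storage size, and role names are illustrative assumptions; the `<TENANT>` placeholder must match the tenant name used in the collector's trace exporter path.

```yaml
# Illustrative manifests for steps 2-4. Names and sizing are examples.
apiVersion: v1
kind: Secret
metadata:
  name: tempostack-s3
  namespace: <TEMPOSTACK_NAMESPACE>
stringData:
  access_key_id: <ACCESS_KEY_ID>
  access_key_secret: <SECRET_ACCESS_KEY>
  bucket: <BUCKET_NAME>
  endpoint: <S3_ENDPOINT_URL>
---
apiVersion: tempo.grafana.com/v1alpha1
kind: TempoStack
metadata:
  name: <TEMPOSTACK_NAME>
  namespace: <TEMPOSTACK_NAMESPACE>
spec:
  storage:
    secret:
      name: tempostack-s3
      type: s3
  storageSize: 10Gi
  tenants:
    mode: openshift            # gateway enforces SA bearer-token auth
    authentication:
      - tenantName: <TENANT>   # must match the tenant in the collector exporter path
        tenantId: <TENANT_UUID>
---
# Writer role for the OTel Collector service account (step 4). In Tempo's
# RBAC model, the tenant name is the resource and "traces" the resource name.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tempostack-traces-writer
rules:
  - apiGroups: [tempo.grafana.com]
    resources: [<TENANT>]
    resourceNames: [traces]
    verbs: [create]
```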

For detailed instructions, see the Red Hat OpenShift distributed tracing documentation.

OpenTelemetry Collector

An OpenTelemetry Collector acts as the bridge between DataRobot application workloads and the OpenShift tracing and metrics backends. It receives OTLP telemetry from application services (no auth required from the app side), authenticates to the TempoStack gateway using a projected ServiceAccount token, and exposes a Prometheus endpoint for SDK-emitted metrics.

Why a collector is needed

In OpenShift tenant mode (tenants.mode: openshift), the TempoStack gateway requires bearer token authentication, and the Distributor enforces mTLS. Application SDKs do not natively handle either of these. The collector abstracts this away: applications send plain OTLP to the collector, and the collector handles authentication and forwarding.

Setup

  1. Install the Red Hat OpenTelemetry Operator from OperatorHub (see Operators table).

  2. Create an OpenTelemetryCollector CR. The collector is deployed via the operator using the OpenTelemetryCollector custom resource. The key configuration elements are:

    • A projected ServiceAccount token volume for authenticating to the TempoStack gateway
    • The bearertokenauth extension pointing to the projected token
    • An otlphttp exporter targeting the TempoStack gateway endpoint
    • A prometheus exporter for re-exposing SDK metrics

    The trace exporter must target the TempoStack gateway (not the distributor directly) and include the tenant name in the path:

    exporters:
      otlphttp/traces:
        endpoint: https://<TEMPOSTACK_NAME>-tempo-gateway.<TEMPOSTACK_NAMESPACE>.svc:8080/api/traces/v1/<TENANT>
        tls:
          insecure_skip_verify: true
        auth:
          authenticator: bearertokenauth
    extensions:
      bearertokenauth:
        filename: /var/run/secrets/tempo/token 
    

    The projected token volume is configured as:

    volumes:
      - name: sa-token
        projected:
          sources:
            - serviceAccountToken:
                path: token
                expirationSeconds: 3600
    volumeMounts:
      - name: sa-token
        mountPath: /var/run/secrets/tempo
        readOnly: true 
    
  3. Configure RBAC for the collector's ServiceAccount. It needs:

    • Read access to pods, namespaces, and ReplicaSets for the k8sattributes and resourcedetection processors
    • The tempostack-traces-writer ClusterRole for writing traces to TempoStack
  4. Create a ServiceMonitor to scrape the collector's Prometheus exporter endpoint for SDK metrics (see Application SDK metrics).

Refer to the Red Hat build of OpenTelemetry documentation for full OpenTelemetryCollector CR configuration options.
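The fragments above could be assembled into a single OpenTelemetryCollector CR along these lines. This is a sketch, not a complete reference: the CR name, namespace, and ServiceAccount name are illustrative, and production deployments should mount the service CA instead of skipping TLS verification.

```yaml
# Illustrative OpenTelemetryCollector CR combining the fragments above
# (v1beta1 structured config). Names are examples.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: <OTEL_COLLECTOR_NAMESPACE>
spec:
  mode: deployment
  serviceAccount: otel-collector   # needs the tempostack-traces-writer ClusterRole
  config:
    extensions:
      bearertokenauth:
        filename: /var/run/secrets/tempo/token
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    exporters:
      otlphttp/traces:
        endpoint: https://<TEMPOSTACK_NAME>-tempo-gateway.<TEMPOSTACK_NAMESPACE>.svc:8080/api/traces/v1/<TENANT>
        tls:
          insecure_skip_verify: true   # prefer mounting the service CA in production
        auth:
          authenticator: bearertokenauth
      prometheus:
        endpoint: 0.0.0.0:8889
        resource_to_telemetry_conversion:
          enabled: true
    service:
      extensions: [bearertokenauth]
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlphttp/traces]
        metrics:
          receivers: [otlp]
          exporters: [prometheus]
  volumes:
    - name: sa-token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              expirationSeconds: 3600
  volumeMounts:
    - name: sa-token
      mountPath: /var/run/secrets/tempo
      readOnly: true
```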

DataRobot service configuration

To configure DataRobot services to send telemetry to the collector, see Instrumenting DataRobot to emit application level metrics and traces. Set the global.opentelemetry.exporterEndpoint key to the OpenTelemetry Collector service deployed above. For example:

global:
  opentelemetry:
    enabled: true
    exporterEndpoint: http://<COLLECTOR_SERVICE>.<COLLECTOR_NAMESPACE>.svc:4317 

Console access

Once all components are deployed and the UIPlugins are created:

  • Logs: OpenShift Console > Observe > Logs
  • Traces: OpenShift Console > Observe > Traces
  • Metrics: OpenShift Console > Observe > Metrics (built-in via CMO, no additional setup)