OpenShift native observability¶
When DataRobot is deployed on OpenShift, the platform's native observability stack can be used instead of the datarobot-observability-core subchart. This approach leverages Red Hat operators and built-in components that are already part of the OpenShift ecosystem, avoiding the need to deploy and maintain a separate observability infrastructure.
Note
This section does not use the datarobot-observability-core subchart. The subchart and its nested subcharts (kube-state-metrics, prometheus-node-exporter, and the OpenTelemetry collectors) must remain disabled, which is the default behavior. All telemetry collection and storage is handled by OpenShift-native operators.
Overview¶
The OpenShift native observability stack covers all three telemetry signals (metrics, logs, and traces), plus the console integration:
| Signal | Component | Managed By |
|---|---|---|
| Metrics (cluster level) | Prometheus, Thanos, Alertmanager | Cluster Monitoring Operator (CMO), built-in |
| Metrics (application SDK) | OpenTelemetry Collector + Prometheus exporter | Red Hat OpenTelemetry Operator |
| Logs | LokiStack + Vector | Loki Operator + Logging Operator |
| Traces | TempoStack | Tempo Operator |
| Console UI (logs, traces) | UIPlugin CRD | Cluster Observability Operator (COO) |
CMO vs COO¶
OpenShift has two observability-related operators that serve different purposes:
| Operator | Full Name | Ships With | Manages |
|---|---|---|---|
| CMO | Cluster Monitoring Operator | OpenShift (pre-installed) | Prometheus, Thanos, Alertmanager |
| COO | Cluster Observability Operator | OperatorHub (install separately) | UIPlugin CRD for console integration |
CMO is pre-installed and provides metrics and alerting out of the box. COO must be installed separately and provides only the UIPlugin custom resource for adding the Logs and Traces views to the OpenShift Console. COO does not manage Loki, Tempo, or any backends.
Operators¶
The following Red Hat operators need to be installed via OperatorHub (OLM). All operators should use the stable channel and Automatic install plan approval unless otherwise required by your organization's change management policy.
| Operator | Purpose | Reference |
|---|---|---|
| Cluster Observability Operator | UIPlugin CRD for console integration (Logs/Traces views) | COO overview |
| Loki Operator | Manages LokiStack instances for log storage | Logging |
| Red Hat OpenShift Logging Operator | Manages ClusterLogForwarder (Vector collectors) | Logging |
| Tempo Operator | Manages TempoStack instances for trace storage | Distributed tracing |
| Red Hat OpenTelemetry Operator | Manages OpenTelemetryCollector instances | Red Hat build of OpenTelemetry |
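Operators can be installed from the console or declaratively through OLM Subscription resources. As an illustration, a Subscription for the Tempo Operator might look like the following sketch (the package name, namespace, and catalog source are assumptions to verify against your OperatorHub catalog):

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: tempo-product
  namespace: openshift-tempo-operator
spec:
  channel: stable
  installPlanApproval: Automatic
  name: tempo-product             # package name; confirm in your catalog
  source: redhat-operators
  sourceNamespace: openshift-marketplace
```

When installing via manifests rather than the console, a matching OperatorGroup in the target namespace is also required.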
Metrics¶
Cluster-level metrics (CMO)¶
OpenShift's built-in Cluster Monitoring Operator (CMO) provides Prometheus, Thanos, and Alertmanager pre-installed in the openshift-monitoring namespace. By default, CMO collects cluster-level metrics only for openshift-* namespaces. To extend the same telemetry to user namespaces (including the DataRobot namespace), user workload monitoring must be enabled:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```
Once enabled, the same metrics that CMO already collects for openshift-* namespaces are also collected for user namespaces, including the namespace where DataRobot is deployed:
- kube-state-metrics: Kubernetes object state (pods, deployments, services, etc.)
- cAdvisor: Container resource usage (CPU, memory, filesystem, network)
- kubelet: Node and pod lifecycle metrics
For full configuration details, see the Monitoring documentation.
Application SDK metrics¶
DataRobot services that are instrumented with the OpenTelemetry SDK emit metrics (and traces) via OTLP. These are not directly scrapeable by Prometheus. To integrate them into the OpenShift metrics stack, an OpenTelemetry Collector instance must be deployed (see OpenTelemetry Collector) that receives OTLP metrics and re-exposes them via a Prometheus exporter endpoint. A ServiceMonitor then enables the user workload Prometheus to scrape these metrics.
The collector configuration for SDK metrics uses the same pipeline described in the OpenTelemetry Collector section, with a prometheus exporter:
```yaml
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    resource_to_telemetry_conversion:
      enabled: true
```
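For context, a minimal collector pipeline wiring an OTLP receiver to this exporter might look like the following sketch (the receiver settings are illustrative; adapt them to your collector configuration):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```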
And a corresponding ServiceMonitor:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-collector
  namespace: <OTEL_COLLECTOR_NAMESPACE>
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: otel-collector
  endpoints:
    - port: prometheus
      interval: 30s
      path: /metrics
```
Logging¶
Log collection uses the Loki Operator for storage (LokiStack) and the Logging Operator for collection (Vector daemonset via ClusterLogForwarder).
Prerequisites¶
- S3-compatible object storage for Loki. Supported backends include AWS S3, Azure Blob Storage, Google Cloud Storage, MinIO, and OpenShift Data Foundation.
Setup¶
1. Install the Loki Operator and Logging Operator from OperatorHub.

2. Create the storage secret in the `openshift-logging` namespace with credentials for your S3-compatible storage backend.

3. Deploy a LokiStack instance in `openshift-logging`. Choose an appropriate sizing for your environment.

4. Configure RBAC for the log collector service account. The Logging Operator requires ClusterRoleBindings for `collect-application-logs`, `collect-infrastructure-logs`, and `collect-audit-logs`.

5. Create a ClusterLogForwarder named `collector` in `openshift-logging`. This deploys Vector collectors on every node that forward logs to LokiStack.

6. Create the logging UIPlugin to add the Logs view to the OpenShift Console:

    ```yaml
    apiVersion: observability.openshift.io/v1alpha1
    kind: UIPlugin
    metadata:
      name: logging
    spec:
      type: Logging
      logging:
        lokiStack:
          name: <LOKISTACK_NAME>
    ```
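As an illustrative sketch of the LokiStack and ClusterLogForwarder steps above, the two resources might look like the following (the names `logging-loki` and `logging-loki-s3`, the sizing tier, and the storage class are assumptions; the ClusterLogForwarder fields follow the Logging 6.x API and should be verified against the operator's documentation):

```yaml
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  size: 1x.small                  # sizing tier; choose per environment
  storage:
    secret:
      name: logging-loki-s3       # storage secret created earlier (assumed name)
      type: s3
  storageClassName: <STORAGE_CLASS>
  tenants:
    mode: openshift-logging
---
apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: collector
  namespace: openshift-logging
spec:
  serviceAccount:
    name: collector               # SA bound to the collect-* ClusterRoles
  outputs:
    - name: default-lokistack
      type: lokiStack
      lokiStack:
        target:
          name: logging-loki
          namespace: openshift-logging
        authentication:
          token:
            from: serviceAccount
  pipelines:
    - name: default-logs
      inputRefs:
        - application
        - infrastructure
      outputRefs:
        - default-lokistack
```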
For detailed instructions, see the Logging documentation.
Tracing¶
Distributed tracing uses the Tempo Operator for storage (TempoStack) and the Red Hat OpenTelemetry Operator for the collector that forwards traces from application workloads to Tempo.
Prerequisites¶
- S3-compatible object storage for Tempo. The same backends supported by Loki are available.
Setup¶
1. Install the Tempo Operator from OperatorHub.

2. Create the storage secret in the tracing namespace with credentials for your S3-compatible storage backend.

3. Deploy a TempoStack instance. In OpenShift tenant mode (`tenants.mode: openshift`), the gateway enforces authentication via ServiceAccount bearer tokens, and mTLS is enforced between internal components.

4. Configure RBAC for trace reading and writing. Create ClusterRoles for `tempostack-traces-reader` and `tempostack-traces-writer`, binding readers to authenticated users and writers to the OTel Collector service account.

5. Deploy the OpenTelemetry Collector to forward traces (see OpenTelemetry Collector).

6. Create the distributed tracing UIPlugin to add the Traces view to the OpenShift Console:

    ```yaml
    apiVersion: observability.openshift.io/v1alpha1
    kind: UIPlugin
    metadata:
      name: distributed-tracing
    spec:
      type: DistributedTracing
      distributedTracing:
        tempoStack:
          name: <TEMPOSTACK_NAME>
          namespace: <TEMPOSTACK_NAMESPACE>
    ```
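As an illustrative sketch of the TempoStack and RBAC steps above, the resources might look like the following (the secret name, storage size, and tenant identifiers are assumptions to adapt; verify the field layout against the Tempo Operator documentation):

```yaml
apiVersion: tempo.grafana.com/v1alpha1
kind: TempoStack
metadata:
  name: <TEMPOSTACK_NAME>
  namespace: <TEMPOSTACK_NAMESPACE>
spec:
  storage:
    secret:
      name: tempostack-s3         # storage secret created earlier (assumed name)
      type: s3
  storageSize: 10Gi
  tenants:
    mode: openshift
    authentication:
      - tenantName: <TENANT>
        tenantId: <TENANT_ID>     # any stable UUID for the tenant
---
# Writer role: lets the OTel Collector service account push traces for the tenant
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tempostack-traces-writer
rules:
  - apiGroups:
      - tempo.grafana.com
    resources:
      - <TENANT>                  # the resource is the tenant name
    resourceNames:
      - traces
    verbs:
      - create
```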
For detailed instructions, see the Distributed tracing documentation.
OpenTelemetry Collector¶
An OpenTelemetry Collector acts as the bridge between DataRobot application workloads and the OpenShift tracing and metrics backends. It receives OTLP telemetry from application services (no auth required from the app side), authenticates to the TempoStack gateway using a projected ServiceAccount token, and exposes a Prometheus endpoint for SDK-emitted metrics.
Why a collector is needed¶
In OpenShift tenant mode (tenants.mode: openshift), the TempoStack gateway requires bearer token authentication, and the Distributor enforces mTLS. Application SDKs do not natively handle either of these. The collector abstracts this away: applications send plain OTLP to the collector, and the collector handles authentication and forwarding.
Setup¶
1. Install the Red Hat OpenTelemetry Operator from OperatorHub (see the Operators table).

2. Create an `OpenTelemetryCollector` CR. The collector is deployed via the operator using the `OpenTelemetryCollector` custom resource. The key configuration elements are:

    - A projected ServiceAccount token volume for authenticating to the TempoStack gateway
    - The `bearertokenauth` extension pointing to the projected token
    - An `otlphttp` exporter targeting the TempoStack gateway endpoint
    - A `prometheus` exporter for re-exposing SDK metrics

    The trace exporter must target the TempoStack gateway (not the distributor directly) and include the tenant name in the path:

    ```yaml
    exporters:
      otlphttp/traces:
        endpoint: https://<TEMPOSTACK_NAME>-tempo-gateway.<TEMPOSTACK_NAMESPACE>.svc:8080/api/traces/v1/<TENANT>
        tls:
          insecure_skip_verify: true
        auth:
          authenticator: bearertokenauth
    extensions:
      bearertokenauth:
        filename: /var/run/secrets/tempo/token
    ```

    The projected token volume is configured as:

    ```yaml
    volumes:
      - name: sa-token
        projected:
          sources:
            - serviceAccountToken:
                path: token
                expirationSeconds: 3600
    volumeMounts:
      - name: sa-token
        mountPath: /var/run/secrets/tempo
        readOnly: true
    ```

3. Configure RBAC for the collector's ServiceAccount. It needs:

    - Read access to pods, namespaces, and ReplicaSets for the `k8sattributes` and `resourcedetection` processors
    - The `tempostack-traces-writer` ClusterRole for writing traces to TempoStack

4. Create a ServiceMonitor to scrape the collector's Prometheus exporter endpoint for SDK metrics (see Application SDK metrics).
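Putting the pieces above together, a skeleton `OpenTelemetryCollector` CR might look like the following sketch (this assumes the `opentelemetry.io/v1beta1` structured-config API; the names are illustrative, and the exporter, extension, and volume stanzas are the ones shown earlier in this section):

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: <OTEL_COLLECTOR_NAMESPACE>
spec:
  mode: deployment
  serviceAccount: otel-collector    # SA with the RBAC described above
  config:
    extensions:
      bearertokenauth:
        filename: /var/run/secrets/tempo/token
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    processors:
      k8sattributes: {}
    exporters:
      otlphttp/traces:
        endpoint: https://<TEMPOSTACK_NAME>-tempo-gateway.<TEMPOSTACK_NAMESPACE>.svc:8080/api/traces/v1/<TENANT>
        tls:
          insecure_skip_verify: true
        auth:
          authenticator: bearertokenauth
      prometheus:
        endpoint: 0.0.0.0:8889
    service:
      extensions:
        - bearertokenauth
      pipelines:
        traces:
          receivers: [otlp]
          processors: [k8sattributes]
          exporters: [otlphttp/traces]
        metrics:
          receivers: [otlp]
          exporters: [prometheus]
```

The projected ServiceAccount token volume and its mount are added under `spec.volumes` and `spec.volumeMounts` as shown earlier.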
Refer to the Red Hat build of OpenTelemetry documentation for full OpenTelemetryCollector CR configuration options.
DataRobot service configuration¶
To configure DataRobot services to send telemetry to the collector, see Instrumenting DataRobot to emit application level metrics and traces. Set the global.opentelemetry.exporterEndpoint key to the OpenTelemetry Collector service deployed above. For example:
```yaml
global:
  opentelemetry:
    enabled: true
    exporterEndpoint: http://<COLLECTOR_SERVICE>.<COLLECTOR_NAMESPACE>.svc:4317
```
Console access¶
Once all components are deployed and the UIPlugins are created:
- Logs: OpenShift Console > Observe > Logs
- Traces: OpenShift Console > Observe > Traces
- Metrics: OpenShift Console > Observe > Metrics (built-in via CMO, no additional setup)