Monitoring concepts

The Workload API exposes monitoring through two complementary surfaces. The Workload API itself owns request statistics, lifecycle events, per-replica status, and the audit trail—all keyed by Workload ID. Application OpenTelemetry (OTel) telemetry—traces, logs, and metrics emitted by your container—is exposed through a separate observability surface; the Workload API does not proxy it.

Draft and locked Workloads have the same monitoring capabilities. The only difference is the retention period: 24 hours for draft and 30 days for locked.

Which surface answers which question

Pick the surface that answers the question. Most production debugging touches both.

| If you want to know… | Use |
| --- | --- |
| How many requests is the Workload serving? What's the error rate? p50/p95 latency? | Workload API stats: GET /workloads/{id}/stats |
| Did the Workload restart, scale, or have its artifact replaced? When? | Workload API events: GET /workloads/{id}/events |
| Why is a specific replica unhealthy? What does its log tail say? | Workload API per-replica detail: GET /workloads/{id}/protons/{proton_id}/statusDetails |
| How long did this LLM call take? Which tool was invoked? What was the prompt? | OTel traces: GET /api/v2/otel/workload/{id}/traces/ |
| How many tokens were consumed this hour? What's the cache hit rate? | OTel metrics |
| What did my application code log? | OTel logs (pushed via the OTel logging handler; see Instrument a Workload with OpenTelemetry) |
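As a rough sketch of how the Workload API rows in the table above map to HTTP calls, the Python below builds (but does not send) authenticated GET requests for the stats and events endpoints. The paths come from the table. The helper name `workload_get_request`, the placeholder base URL, and the reuse of the `DATAROBOT_ENDPOINT` and `DATAROBOT_API_TOKEN` environment variables from the curl examples on this page are assumptions for illustration. Response parsing is left out because this page does not describe the response schema.

```python
import os
import urllib.request


def workload_get_request(workload_id: str, resource: str) -> urllib.request.Request:
    """Build an authenticated GET request for a Workload API monitoring resource.

    `resource` is the path suffix after the Workload ID, e.g. "stats" or "events".
    """
    # Assumption: same environment variables as the curl examples on this page.
    # The fallback base URL is a placeholder, not a real endpoint.
    base = os.environ.get("DATAROBOT_ENDPOINT", "https://example.invalid").rstrip("/")
    token = os.environ.get("DATAROBOT_API_TOKEN", "")
    url = f"{base}/workloads/{workload_id}/{resource.lstrip('/')}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


stats_req = workload_get_request("WORKLOAD_ID", "stats")
events_req = workload_get_request("WORKLOAD_ID", "events")
print(stats_req.get_method(), stats_req.full_url)
print(events_req.get_method(), events_req.full_url)

# To actually send a request (needs network access and a valid token):
# with urllib.request.urlopen(stats_req) as resp:
#     print(resp.read().decode())
```

The same helper should cover the per-replica row by passing `protons/<proton_id>/statusDetails` as `resource`.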

Monitoring capabilities

The following capabilities apply to both draft and locked Workloads, with retention as the only difference.

| Feature | Draft artifact | Locked artifact | Description |
| --- | --- | --- | --- |
| Service health | Yes | Yes | Reports request counts, latency, error rate, and requests per minute. |
| Resource utilization | Yes | Yes | Reports replica count and per-container CPU and memory consumption. |
| OTel logs | Yes | Yes | Application logs that the container emits via OpenTelemetry. |
| OTel traces | Yes | Yes | Distributed traces that the container emits via OpenTelemetry. |
| OTel metrics | Yes | Yes | Application metrics that the container emits via OpenTelemetry. |
| Events | Yes | Yes | Lifecycle audit events, including create, start, stop, replace, scale, and error events. |
| Statistics | Yes | Yes | Aggregate request statistics, including total requests, error rate, response time, and related counters. |
| Retention | 24 hours | 30 days | Matches lifecycle expectations for draft vs. locked Workloads. |

Console visibility

All Workloads (draft and locked) appear in the Console. The draft filter is off by default, so draft and locked Workloads are listed together.

Access telemetry

Workload-API surfaces (request statistics, lifecycle events, replacement history, per-replica status) are documented in REST: Monitor Workloads. Application OTel telemetry is exposed through DataRobot's separate observability surface and rendered in the Console—see Monitor deployed Workloads and View deployed Workload activity.

Retention summary

Telemetry retention depends on the artifact's lifecycle status.

| Artifact status | Telemetry retention |
| --- | --- |
| draft | 24 hours |
| locked | 30 days |
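To make the retention windows concrete, the sketch below computes the oldest timestamp that would still be queryable for each artifact status. It assumes retention behaves as a simple rolling window measured back from the current time, which this page does not spell out. The names `RETENTION` and `oldest_retained` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Retention windows as stated in the table above.
RETENTION = {
    "draft": timedelta(hours=24),
    "locked": timedelta(days=30),
}


def oldest_retained(status: str, now: datetime) -> datetime:
    """Return the earliest timestamp still inside the retention window."""
    return now - RETENTION[status]


now = datetime(2026, 4, 15, 12, 0, tzinfo=timezone.utc)
print(oldest_retained("draft", now).isoformat())   # 2026-04-14T12:00:00+00:00
print(oldest_retained("locked", now).isoformat())  # 2026-03-16T12:00:00+00:00
```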

Instrument your container with OpenTelemetry

Container stdout and stderr are captured automatically at every lifecycle stage (startup, running, errored)—they appear in the Workload's Activity log > Logs tab without any SDK setup.

Logs require OTLP push—stdout scraping does not apply

The conventional OTel pattern for logs is a collector DaemonSet that scrapes pod stdout from the host filesystem. DataRobot's collector does not scrape container stdout. Plain print() calls and unconfigured logging appear in the Activity log > Logs tab but do not reach the OTel observability surface. To get logs into the OTel surface as structured records, install the OTel logging handler so the application pushes log records via OTLP HTTP—the same transport used by traces and metrics.
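A minimal sketch of what installing the OTel logging handler can look like in Python follows. It is not the platform's official snippet (see Instrument a Workload with OpenTelemetry for that). It assumes the `opentelemetry-sdk` and `opentelemetry-exporter-otlp-proto-http` packages are installed, and it relies on the SDK's logs modules, whose underscore-prefixed names mark them as still subject to change. The function name `configure_otel_logging` is made up for this example.

```python
import logging


def configure_otel_logging(level: int = logging.INFO) -> logging.Handler:
    """Attach a handler to the root logger that pushes records via OTLP HTTP.

    Imports are local so this module still loads where the OTel packages
    are not installed.
    """
    from opentelemetry._logs import set_logger_provider
    from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter
    from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
    from opentelemetry.sdk._logs.export import BatchLogRecordProcessor

    provider = LoggerProvider()
    set_logger_provider(provider)
    # With no arguments, the exporter takes its endpoint from the
    # OTEL_EXPORTER_OTLP_ENDPOINT environment variable.
    provider.add_log_record_processor(BatchLogRecordProcessor(OTLPLogExporter()))

    handler = LoggingHandler(level=level, logger_provider=provider)
    logging.getLogger().addHandler(handler)
    return handler


# Usage, typically once at container start-up:
# configure_otel_logging()
# logging.getLogger().setLevel(logging.INFO)
# logging.getLogger(__name__).info("pushed as a structured OTel log record")
```

After the handler is attached, records sent through the standard `logging` module are exported over OTLP, while plain `print()` output continues to appear only in the Activity log.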

Explicit OTel instrumentation is needed when you want traces, application metrics, or structured logs on the OTel observability surface. When your application emits OTel signals, the platform handles transport: the OTel exporters read OTEL_EXPORTER_OTLP_ENDPOINT from the environment, which the platform sets when the container starts.
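When debugging missing telemetry, it can help to check which URL each exporter would resolve from the environment. The sketch below mirrors the OpenTelemetry convention for OTLP over HTTP: a signal-specific variable such as `OTEL_EXPORTER_OTLP_LOGS_ENDPOINT` is used as-is, otherwise `v1/<signal>` is appended to `OTEL_EXPORTER_OTLP_ENDPOINT`. The helper name `otlp_http_url` and the example collector address are hypothetical, and the value the platform actually injects is not shown on this page.

```python
import os
from typing import Mapping, Optional


def otlp_http_url(signal: str, environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Resolve the OTLP/HTTP URL an exporter would use for a signal.

    `signal` is one of "traces", "logs", or "metrics".
    """
    # A signal-specific variable, if set, is used unchanged.
    specific = environ.get(f"OTEL_EXPORTER_OTLP_{signal.upper()}_ENDPOINT")
    if specific:
        return specific
    # Otherwise the per-signal path is appended to the base endpoint.
    base = environ.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if not base:
        return None  # not set, e.g. when running outside the platform
    return f"{base.rstrip('/')}/v1/{signal}"


# Hypothetical value, for illustration only.
example_env = {"OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector.example:4318"}
for signal in ("traces", "logs", "metrics"):
    print(signal, otlp_http_url(signal, example_env))
```

Calling `otlp_http_url("logs")` with no second argument inspects the real process environment, so a `None` result inside a running container would suggest the variable was not injected.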

For copy-ready Python SDK snippets that configure tracing, logging, and metrics against the platform's OTLP endpoint, see Instrument a Workload with OpenTelemetry. The same snippets are surfaced in the Console on the Workload's Endpoints > Instrumentation sub-tab.

Reset Workload stats

POST /workloads/{workload_id}/promote resets statistics automatically so production starts from a clean baseline. To reset stats manually—for a specific time window or for a specific proton—use DELETE /workloads/{workload_id}/stats:

# Reset all stats for the current proton
curl -X DELETE "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/stats" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}"

# Reset stats for a time window
curl -X DELETE "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/stats?startTime=2026-04-01T00:00:00Z&endTime=2026-04-15T00:00:00Z" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" 

protonId, startTime, and endTime are optional query parameters; omit them to clear stats for the current proton across the full retention window.
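The curl examples do not show `protonId`, so here is a hedged Python sketch that builds the same DELETE request with any combination of the three optional parameters. It only constructs the request and never sends it. The helper name `reset_stats_request`, the placeholder base URL, and the `PROTON_ID` value are assumptions, as is the idea that `protonId` can be combined with a time window.

```python
import os
import urllib.parse
import urllib.request
from typing import Optional


def reset_stats_request(
    workload_id: str,
    proton_id: Optional[str] = None,
    start_time: Optional[str] = None,
    end_time: Optional[str] = None,
) -> urllib.request.Request:
    """Build a DELETE /workloads/{workload_id}/stats request."""
    # Assumption: same environment variables as the curl examples above.
    # The fallback base URL is a placeholder, not a real endpoint.
    base = os.environ.get("DATAROBOT_ENDPOINT", "https://example.invalid").rstrip("/")
    token = os.environ.get("DATAROBOT_API_TOKEN", "")
    params = {"protonId": proton_id, "startTime": start_time, "endTime": end_time}
    # Only parameters that were provided are sent; urlencode percent-encodes
    # the colons in ISO 8601 timestamps.
    query = urllib.parse.urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{base}/workloads/{workload_id}/stats" + (f"?{query}" if query else "")
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="DELETE",
    )


req = reset_stats_request("WORKLOAD_ID", proton_id="PROTON_ID")
print(req.get_method(), req.full_url)

# Sending the request is destructive, so it is left commented out:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```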