Monitoring concepts¶
The Workload API exposes monitoring through two complementary surfaces. The Workload API itself owns request statistics, lifecycle events, per-replica status, and the audit trail—all keyed by Workload ID. Application OpenTelemetry (OTel) telemetry—traces, logs, and metrics emitted by your container—is exposed through a separate observability surface; the Workload API does not proxy it.
Draft and locked Workloads have the same monitoring capabilities. The only difference is retention period.
Which surface answers which question¶
Pick the surface that answers the question. Most production debugging touches both.
| If you want to know… | Use |
|---|---|
| How many requests is the Workload serving? What's the error rate? p50/p95 latency? | Workload API stats—GET /workloads/{id}/stats |
| Did the Workload restart, scale, or have its artifact replaced? When? | Workload API events—GET /workloads/{id}/events |
| Why is a specific replica unhealthy? What does its log tail say? | Workload API per-replica detail—GET /workloads/{id}/protons/{proton_id}/statusDetails |
| How long did this LLM call take? Which tool was invoked? What was the prompt? | OTel traces—GET /api/v2/otel/workload/{id}/traces/ |
| How many tokens were consumed this hour? What's the cache hit rate? | OTel metrics |
| What did my application code log? | OTel logs (pushed via the OTel logging handler—see Instrument a Workload with OpenTelemetry) |
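As a rough illustration of the Workload API rows in the table above, the following sketch builds unsent GET requests for the stats, events, and per-replica paths. It assumes the base URL and token are the same values the curl examples on this page read from `DATAROBOT_ENDPOINT` and `DATAROBOT_API_TOKEN`; the helper name `build_request` and the surface keys are hypothetical. The OTel traces path is left out because the table writes it with an `/api/v2` prefix, so it may resolve against a different base.

```python
import urllib.request

# Paths copied from the table above. The dictionary keys are illustrative names,
# not identifiers defined by the Workload API.
SURFACE_PATHS = {
    "stats": "/workloads/{workload_id}/stats",
    "events": "/workloads/{workload_id}/events",
    "replica_status": "/workloads/{workload_id}/protons/{proton_id}/statusDetails",
}


def build_request(base_url, token, surface, workload_id, proton_id=None):
    """Return an unsent GET request for one Workload API monitoring surface."""
    if surface == "replica_status" and proton_id is None:
        raise ValueError("replica_status needs a proton_id")
    path = SURFACE_PATHS[surface].format(
        workload_id=workload_id, proton_id=proton_id
    )
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the call. The response schema is not described in this section, so the sketch stops before parsing.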
Monitoring capabilities¶
The following capabilities apply to both draft and locked Workloads, with retention as the only difference.
| Capability | Draft artifact | Locked artifact | Description |
|---|---|---|---|
| Service health | Yes | Yes | Reports request counts, latency, error rate, and requests per minute. |
| Resource utilization | Yes | Yes | Reports replica count and per-container CPU and memory consumption. |
| OTel logs | Yes | Yes | Application logs that the container emits via OpenTelemetry. |
| OTel traces | Yes | Yes | Distributed traces that the container emits via OpenTelemetry. |
| OTel metrics | Yes | Yes | Application metrics that the container emits via OpenTelemetry. |
| Events | Yes | Yes | Lifecycle audit events, including create, start, stop, replace, scale, and error events. |
| Statistics | Yes | Yes | Aggregate request statistics, including total requests, error rate, response time, and related counters. |
| Retention | 24 hours | 30 days | Matches lifecycle expectations for draft vs. locked Workloads. |
Console visibility¶
All Workloads (draft and locked) appear in Console. The draft filter is off by default, so all Workloads are shown.
Access telemetry¶
Workload-API surfaces (request statistics, lifecycle events, replacement history, per-replica status) are documented in REST: Monitor Workloads. Application OTel telemetry is exposed through DataRobot's separate observability surface and rendered in the Console—see Monitor deployed Workloads and View deployed Workload activity.
Retention summary¶
Telemetry retention depends on the artifact's lifecycle status.
| Artifact status | Telemetry retention |
|---|---|
| draft | 24 hours. |
| locked | 30 days. |
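The retention windows above can be turned into a quick check of whether a given record should still be queryable. This is a minimal sketch that assumes retention is measured from the time the telemetry was emitted and that the boundary is inclusive; neither detail is stated here, and the names `RETENTION`, `oldest_retained`, and `is_retained` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Retention windows taken from the table above.
RETENTION = {
    "draft": timedelta(hours=24),
    "locked": timedelta(days=30),
}


def oldest_retained(status, now=None):
    """Earliest timestamp still inside the retention window for this status."""
    now = now or datetime.now(timezone.utc)
    return now - RETENTION[status]


def is_retained(status, emitted_at, now=None):
    """True if telemetry emitted at `emitted_at` falls inside the window."""
    return emitted_at >= oldest_retained(status, now)
```

A helper like `oldest_retained` can also serve as a lower bound when choosing a start time for a query, so that requests do not ask for data that has already aged out.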
Instrument your container with OpenTelemetry¶
Container stdout and stderr are captured automatically at every lifecycle stage (startup, running, errored)—they appear in the Workload's Activity log > Logs tab without any SDK setup.
Logs require OTLP push—stdout scraping does not apply
The conventional OTel pattern for logs is a collector DaemonSet that scrapes pod stdout from the host filesystem. DataRobot's collector does not scrape container stdout. Plain print() calls and unconfigured logging appear in the Activity log > Logs tab but do not reach the OTel observability surface. To get logs into the OTel surface as structured records, install the OTel logging handler so the application pushes log records via OTLP HTTP—the same transport used by traces and metrics.
Explicit OTel instrumentation is needed when you want traces, application metrics, or structured logs on the OTel observability surface. When your application emits OTel signals, the platform handles transport: the OTel exporters read OTEL_EXPORTER_OTLP_ENDPOINT from the environment, which the platform sets when the container starts.
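To make the transport step concrete, the sketch below shows how an OTLP/HTTP exporter typically derives per-signal URLs from `OTEL_EXPORTER_OTLP_ENDPOINT`: under the OpenTelemetry exporter specification, the base endpoint gets `v1/traces`, `v1/metrics`, or `v1/logs` appended. The function name `otlp_signal_urls` is hypothetical, and the code only mirrors that convention for inspection; it is not the platform's or the SDK's own implementation.

```python
import os

# Per-signal paths defined by the OTLP/HTTP exporter convention.
OTLP_HTTP_SIGNAL_PATHS = {
    "traces": "v1/traces",
    "metrics": "v1/metrics",
    "logs": "v1/logs",
}


def otlp_signal_urls(environ=os.environ):
    """Return the per-signal OTLP/HTTP URLs, or None if no endpoint is set."""
    base = environ.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if not base:
        return None
    # Paths are appended relative to the base, so normalize the trailing slash.
    if not base.endswith("/"):
        base += "/"
    return {signal: base + path for signal, path in OTLP_HTTP_SIGNAL_PATHS.items()}
```

Calling `otlp_signal_urls()` at container startup and logging the result is one way to confirm that the variable is present before wiring up exporters. A `None` result suggests the code is running outside the platform, for example on a local machine.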
For copy-ready Python SDK snippets that configure tracing, logging, and metrics against the platform's OTLP endpoint, see Instrument a Workload with OpenTelemetry. The same snippets are surfaced in the Console on the Workload's Endpoints > Instrumentation sub-tab.
Reset Workload stats¶
POST /workloads/{workload_id}/promote resets statistics automatically so production starts from a clean baseline. To reset stats manually—for a specific time window or for a specific proton—use DELETE /workloads/{workload_id}/stats:
```bash
# Reset all stats for the current proton
curl -X DELETE "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/stats" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}"

# Reset stats for a time window
curl -X DELETE "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/stats?startTime=2026-04-01T00:00:00Z&endTime=2026-04-15T00:00:00Z" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}"
```
protonId, startTime, and endTime are optional query parameters; omit them to clear stats for the current proton across the full retention window.
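For scripted resets, the same call can be assembled in Python. The sketch below builds an unsent DELETE request and includes only the query parameters that are supplied, which matches the "omit them" behavior described above. The helper name `build_stats_reset` and its argument names are hypothetical; the query parameter names come from this page.

```python
import urllib.parse
import urllib.request


def build_stats_reset(base_url, token, workload_id,
                      proton_id=None, start_time=None, end_time=None):
    """Return an unsent DELETE request for /workloads/{workload_id}/stats."""
    params = {
        "protonId": proton_id,
        "startTime": start_time,
        "endTime": end_time,
    }
    # Drop unset parameters so the API applies its documented defaults.
    query = urllib.parse.urlencode(
        {key: value for key, value in params.items() if value is not None}
    )
    url = f"{base_url.rstrip('/')}/workloads/{workload_id}/stats"
    if query:
        url += "?" + query
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="DELETE",
    )
```

`urlencode` percent-encodes the colons in ISO 8601 timestamps, which servers decode back to the original value. Building the request separately from sending it makes it easy to log or review the exact URL before a destructive call goes out.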