Lifecycle states
A Workload's lifecycle is the set of states it transitions through from creation to teardown—and what triggers each transition. The platform derives states from the Kubernetes pods backing the Workload's protons.
The platform tracks state at two levels. ProtonStatus is the full runtime view computed from the Kubernetes pods backing each proton. WorkloadStatus is the user-facing subset surfaced on the Workload itself—it omits internal proton-lifecycle states (initializing, warming, draining, restarting) that the Workload-level API never reports.
In the combined state table, the Surfaces on column indicates where each value appears.
| State | Surfaces on | Description |
|---|---|---|
| `unknown` | Workload + proton | State not yet reported. The platform has not produced a snapshot yet. |
| `submitted` | Workload + proton | Accepted by the API, not yet scheduled. Pod is Pending with no node assigned. |
| `initializing` | Proton only | In-place update or restart while containers are being recreated and not yet ready. Workloads collapse this into adjacent Workload-level states (launching or running). |
| `provisioning` | Workload + proton | Cluster resources are being allocated: resource-bundle scheduling, PVC creation, and secret injection. Pod is Pending with a node assigned. |
| `launching` | Workload + proton | Image pull and container start in progress, or containers are up but readiness probes have not passed yet. Pod is Running with not-yet-ready containers. |
| `running` | Workload + proton | Healthy and serving requests. Pod is Running and all containers are ready. |
| `suspended` | Workload + proton | Intentionally paused. Pods are stopped, but the Workload's identity and configuration are preserved; resume restores them without re-scheduling from scratch. |
| `warming` | Proton only | Running in the warmup window during a replacement. Reflected from the candidate proton. |
| `draining` | Proton only | Alive but no longer receiving traffic. Old proton during a replacement. |
| `interrupted` | Workload + proton | Runtime preempted: node eviction, spot reclaim, or scheduler-driven displacement. The platform reschedules automatically once capacity frees up. |
| `restarting` | Proton only | Proton being recreated in place after a configuration change that requires a pod restart but preserves proton identity. The Workload-level status stays on the adjacent visible state. |
| `stopping` | Workload + proton | Graceful shutdown is in progress. Pod is terminating. |
| `stopped` | Workload + proton | Stopped; can be restarted. No pods are present, or the pod phase is Succeeded for run-to-completion artifacts. |
| `errored` | Workload + proton | Failed. CrashLoopBackOff, ImagePullBackOff, or pod phase Failed. |
| `terminated` | Workload + proton | Permanently torn down. Workload was deleted; the proton no longer exists. |
The platform evaluates pod-state predicates in priority order to derive ProtonStatus; WorkloadStatus is then derived by collapsing the proton-only states.
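The collapse from ProtonStatus to WorkloadStatus can be illustrated with a minimal Python sketch. The text does not spell out the exact collapsing rule, so the sketch assumes that a proton-only state is reported as the most recent Workload-visible state; the function name `collapse_to_workload` is hypothetical and not part of the platform API.

```python
from typing import Iterable, List

# States the text lists as proton-only; the Workload-level API never reports them.
PROTON_ONLY = {"initializing", "warming", "draining", "restarting"}


def collapse_to_workload(proton_states: Iterable[str], initial: str = "unknown") -> List[str]:
    """Map a proton-level state sequence to the states a Workload would report.

    Assumption (not confirmed by the source): a proton-only state is shown
    as the most recent Workload-visible state that preceded it.
    """
    visible = initial
    collapsed = []
    for state in proton_states:
        if state not in PROTON_ONLY:
            visible = state  # Workload-visible states pass through unchanged
        collapsed.append(visible)
    return collapsed
```

Under this assumption, an in-place update observed as `running`, `initializing`, `running` at the proton level would appear as an uninterrupted `running` on the Workload.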
Typical transitions
The following table lists the common state sequences for create, update, replace, stop, suspend, failure, and promote operations. The sequences are written at the proton level to show the full transition path; at the Workload level the proton-only states (initializing, warming, draining, restarting) collapse into adjacent Workload-visible states.
| Operation | Transition |
|---|---|
| Create | submitted -> provisioning -> launching -> running |
| Update | running -> initializing -> running |
| Replace | running -> running(active) + initializing/warming/running(candidate) -> switch -> draining(active) + running(new active) -> running(single active) |
| Stop | running -> stopping -> stopped |
| Suspend | running -> suspended |
| Failure | launching or running -> errored |
| Promote | running (draft) -> running (locked) (no restart) |
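When polling a proton's status, the linear sequences in the table can be checked mechanically. The following sketch covers only the single-proton, single-path operations (create, update, stop, suspend); replace, failure, and promote are left out because they involve two protons, alternative starting states, or no state change. The names `TYPICAL_PATHS` and `matches_typical_path` are illustrative, not platform identifiers.

```python
from typing import Dict, List, Sequence

# Proton-level sequences copied from the transitions table.
TYPICAL_PATHS: Dict[str, List[str]] = {
    "create": ["submitted", "provisioning", "launching", "running"],
    "update": ["running", "initializing", "running"],
    "stop": ["running", "stopping", "stopped"],
    "suspend": ["running", "suspended"],
}


def matches_typical_path(operation: str, observed: Sequence[str]) -> bool:
    """Return True if the observed states follow the typical path for the operation.

    Consecutive duplicates are dropped first, because a poller usually
    sees the same state several times in a row.
    """
    deduped = [s for i, s in enumerate(observed) if i == 0 or s != observed[i - 1]]
    return deduped == TYPICAL_PATHS[operation]
```

A poller that samples too slowly may miss a short-lived state, so a mismatch here is a prompt to check the events trail rather than proof of a fault.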
Pod-based truth
The platform computes Workload status from pod predicates, evaluated in priority order, with the following rules:
- For multi-replica Workloads, worst-state-wins aggregation applies.
- Sidecars factor into container readiness and failure evaluation.
- Init container statuses don't factor into the predicates. They affect startup but do not directly trigger Workload-level errored.
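Worst-state-wins aggregation across replicas can be sketched as follows. The source does not define the severity ordering explicitly, so this sketch assumes it matches the priority column of the predicate table in "How proton status is computed"; `SEVERITY` and `aggregate_replicas` are hypothetical names.

```python
from typing import Dict, Iterable

# Assumed severity ranking, borrowed from the predicate priorities
# (errored highest, running lowest). Not confirmed as the platform's ordering.
SEVERITY: Dict[str, int] = {
    "errored": 7,
    "stopped": 5,
    "submitted": 4,
    "provisioning": 4,
    "launching": 3,
    "running": 2,
}


def aggregate_replicas(pod_states: Iterable[str]) -> str:
    """Return the most severe per-pod state among all replicas."""
    states = list(pod_states)
    if not states:
        # The source maps "no pods" to stopped or errored depending on
        # context, which cannot be decided from the pod list alone.
        raise ValueError("no pods to aggregate")
    # max() keeps the first state it sees among equally severe ones.
    return max(states, key=lambda s: SEVERITY[s])
```

With this ordering, one replica in `launching` is enough to keep a three-replica Workload out of `running`, and one `errored` replica marks the whole Workload `errored`.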
Why errored is sticky
Once any container hits CrashLoopBackOff, the Workload reports errored and won't return to running until the offending pod is replaced or the failing container starts succeeding again. A Workload also stays in launching until every container's readiness probe passes—a typo in readinessProbe.path or a sidecar that never becomes ready keeps the Workload from reaching running even when the primary container is up.
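Because `errored` is sticky and `launching` can persist indefinitely, a client that waits for a Workload to come up is better off stopping on `errored` and enforcing a timeout. The sketch below shows one way to do that; `get_status` is a caller-supplied callable that returns the current Workload state as a string (for example, by calling the API), and `wait_until_running` is a hypothetical helper, not a platform function.

```python
import time
from typing import Callable


def wait_until_running(
    get_status: Callable[[], str],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the Workload reports running or errored, or the timeout expires.

    Returns "running" or "errored". Raises TimeoutError if neither state
    is reached in time, e.g. when a readiness probe never passes.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in ("running", "errored"):
            # errored is sticky, so waiting longer rarely helps; hand it back
            # to the caller for debugging instead.
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Workload still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

A `TimeoutError` while the state is still `launching` points toward the readiness-probe or sidecar causes described above.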
Debug an errored Workload
statusDetails on a proton includes the failing container name and reason. The /events endpoint gives a lifecycle audit trail; logs give application-level context.
Pull lifecycle events and per-proton status to pinpoint where a Workload went wrong:
```bash
# Lifecycle events: what changed and when
curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/events" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}"

# Per-proton status
curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/protons" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.data[] | {id, role, status}'
```
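The same per-proton check can be scripted. The sketch below is a Python equivalent of the second curl call using only the standard library. It assumes the response shape implied by the jq filter above (a top-level `data` array whose items carry `id`, `role`, and `status`) and that `status` is a plain string; both function names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, List


def fetch_protons(endpoint: str, workload_id: str, token: str) -> Dict[str, Any]:
    """GET /workloads/{workload_id}/protons and return the parsed JSON body."""
    request = urllib.request.Request(
        f"{endpoint}/workloads/{workload_id}/protons",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def errored_protons(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Keep only errored protons, reduced to the id, role, and status fields."""
    return [
        {"id": p.get("id"), "role": p.get("role"), "status": p.get("status")}
        for p in payload.get("data", [])
        if p.get("status") == "errored"  # assumes status is a plain string
    ]
```

Passing the result of `fetch_protons(...)` to `errored_protons(...)` yields the protons whose statusDetails are worth inspecting first.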
Protons: runtime backing and aggregation
A proton is the runtime primitive behind a Workload—the actual running container instances. Workloads are the governed identity; protons are the execution.
Object hierarchy
A Workload always has at least one proton. During a replace it can temporarily have two—old and new—with Workload traffic routing deciding which proton receives requests.
A proton consists of one or more Kubernetes pods:
- A single-replica Workload maps to one pod.
- A multi-replica Workload maps to one pod per replica.
A Workload is the stable address and governance wrapper; a proton is the runtime backing that executes the artifact behind that address.
How proton status is computed
The platform computes a proton's status by evaluating predicates against the state of its pods. Predicates are evaluated in priority order, and the highest-priority match wins. For multi-replica protons, worst-state-wins aggregation applies across pods.
| Pod condition | Proton state | Workload state | Priority |
|---|---|---|---|
| Any container in `CrashLoopBackOff` | `errored` | `errored` | 7 (highest) |
| Any container in `ImagePullBackOff` | `errored` | `errored` | 7 |
| Pod phase Failed | `errored` | `errored` | 6 |
| Pod phase Succeeded | `stopped` | `stopped` | 5 |
| Pod phase Pending (no node yet) | `submitted` | `submitted` | 4 |
| Pod phase Pending (scheduled to node) | `provisioning` | `provisioning` | 4 |
| Pod phase Running, not all containers ready | `launching` | `launching` | 3 |
| Pod phase Running, all containers ready | `running` | `running` | 2 |
| Node eviction in progress | `interrupted` | `interrupted` | n/a |
| User-issued suspend | `suspended` | `suspended` | n/a |
| Proton recreated after config change | `restarting` | Unchanged (proton-only state) | n/a |
| No pods present | `stopped` (during shutdown) or `errored` (unexpected) | Same as proton state | n/a |
Predicates examine every container in each pod, including sidecars. If a sidecar has a wrong image URI, the proton reports errored even if the primary container is healthy. Init container statuses are not evaluated by the predicates—they affect startup but do not directly trigger errored.
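The pod-phase rows of the predicate table can be sketched as a single function over a Kubernetes pod object. This is an illustration of the documented ordering, not the platform's implementation: it reads the standard Kubernetes fields `status.phase`, `status.containerStatuses`, and `spec.nodeName`, ignores `status.initContainerStatuses`, and does not cover `interrupted`, `suspended`, or `restarting`, which depend on information outside the pod. The name `pod_to_proton_state` is hypothetical.

```python
from typing import Any, Dict

BACKOFF_REASONS = {"CrashLoopBackOff", "ImagePullBackOff"}


def pod_to_proton_state(pod: Dict[str, Any]) -> str:
    """Derive a proton state from one pod, checking predicates highest priority first."""
    status = pod.get("status") or {}
    phase = status.get("phase")
    # containerStatuses includes sidecars; initContainerStatuses is not read.
    containers = status.get("containerStatuses") or []

    # Priority 7: any container waiting in a back-off state.
    for container in containers:
        waiting = (container.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") in BACKOFF_REASONS:
            return "errored"
    if phase == "Failed":        # priority 6
        return "errored"
    if phase == "Succeeded":     # priority 5
        return "stopped"
    if phase == "Pending":       # priority 4: split on node assignment
        node = (pod.get("spec") or {}).get("nodeName")
        return "provisioning" if node else "submitted"
    if phase == "Running":       # priorities 3 and 2
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        return "running" if all_ready else "launching"
    return "unknown"
```

Because the back-off check runs before the phase checks, a pod whose primary container is ready still maps to `errored` when a sidecar is stuck in `ImagePullBackOff`.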
Multi-container examples
| Scenario | Pod phase | Container states | Workload state |
|---|---|---|---|
| App and sidecar both healthy | Running | app ready, sidecar ready | `running` |
| App running, sidecar in `ImagePullBackOff` | Running | app ready, sidecar `ImagePullBackOff` | `errored` |
| App in `CrashLoopBackOff` | Running | app `CrashLoopBackOff` | `errored` |
| App starting, sidecar running | Running | app `ContainerCreating`, sidecar ready | `launching` |
| Pod scheduled, containers waiting | Pending | both Waiting (node assigned) | `provisioning` |
| Pod pending, no node yet | Pending | both Waiting (no node) | `submitted` |
Inspect protons
Most users interact through Workloads. Proton endpoints are useful for inspecting runtime state and testing candidates during replace operations.
List the protons backing a Workload to see the active proton and any candidate:
```bash
# List protons for a workload
curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/protons" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.'
```
The full set of proton endpoints available on the API:
```
GET /workloads/{workload_id}/protons
GET /workloads/{workload_id}/protons/{proton_id}
GET /workloads/{workload_id}/protons/{proton_id}/statusDetails
```
OpenTelemetry traces, logs, and metrics are exposed on the platform observability surface at /api/v2/otel/workload/{workload_id}/{logs,metrics,traces} (singular workload, keyed by Workload ID—not proton ID). The Workload API itself does not expose /otel/* routes.