Lifecycle states

A Workload's lifecycle covers the states it moves through from creation to teardown and the events that trigger each transition. The platform derives states from the Kubernetes pods backing the Workload's protons.

The platform tracks state at two levels. ProtonStatus is the full runtime view computed from the Kubernetes pods backing each proton. WorkloadStatus is the user-facing subset surfaced on the Workload itself—it omits internal proton-lifecycle states (initializing, warming, draining, restarting) that the Workload-level API never reports.

In the combined state table, the Surfaces on column indicates whether a value appears on both the Workload and the proton, or on the proton only.

| State | Surfaces on | Description |
| --- | --- | --- |
| unknown | Workload + proton | State not yet reported. The platform has not produced a snapshot yet. |
| submitted | Workload + proton | Accepted by the API, not yet scheduled. Pod is Pending with no node assigned. |
| initializing | Proton only | In-place update or restart while containers are being recreated and not yet ready. Workloads collapse this into adjacent Workload-level states (launching or running). |
| provisioning | Workload + proton | Cluster resources are being allocated: resource-bundle scheduling, PVC creation, and secret injection. Pod is Pending with a node assigned. |
| launching | Workload + proton | Image pull and container start in progress, or containers are up but readiness probes have not passed yet. Pod is Running with not-yet-ready containers. |
| running | Workload + proton | Healthy and serving requests. Pod is Running and all containers are ready. |
| suspended | Workload + proton | Intentionally paused. Pods are stopped, but the Workload's identity and configuration are preserved; resume restores them without re-scheduling from scratch. |
| warming | Proton only | Running in the warmup window during a replacement. Reflected from the candidate proton. |
| draining | Proton only | Alive but no longer receiving traffic. Old proton during a replacement. |
| interrupted | Workload + proton | Runtime preempted by node eviction, spot reclaim, or scheduler-driven displacement. The platform reschedules automatically once capacity frees up. |
| restarting | Proton only | Proton being recreated in place after a configuration change that requires a pod restart but preserves proton identity. The Workload-level status stays on the adjacent visible state. |
| stopping | Workload + proton | Graceful shutdown is in progress. Pod is terminating. |
| stopped | Workload + proton | Stopped; can be restarted. No pods are present, or the pod phase is Succeeded for run-to-completion artifacts. |
| errored | Workload + proton | Failed. CrashLoopBackOff, ImagePullBackOff, or pod phase Failed. |
| terminated | Workload + proton | Permanently torn down. Workload was deleted; the proton no longer exists. |

The platform evaluates pod-state predicates in priority order to derive ProtonStatus; WorkloadStatus is then derived by collapsing the proton-only states.
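
The collapse from ProtonStatus to WorkloadStatus can be illustrated with a short sketch. This is not the platform's implementation: the function name `workload_view` is hypothetical, and the sketch assumes that "adjacent" means the most recent Workload-visible state, which this page does not specify.

```python
# Proton-only states, as listed in the state table.
PROTON_ONLY = frozenset({"initializing", "warming", "draining", "restarting"})


def workload_view(proton_states):
    """Map a sequence of proton states to the states a Workload would report.

    Assumption: a proton-only state is replaced by the most recent
    Workload-visible state, or "unknown" if none has been seen yet.
    """
    visible = []
    last_visible = "unknown"
    for state in proton_states:
        if state not in PROTON_ONLY:
            last_visible = state
        visible.append(last_visible)
    return visible


# An in-place update seen from the proton...
print(workload_view(["running", "initializing", "running"]))
# ...is reported as uninterrupted "running" on the Workload:
# ['running', 'running', 'running']
```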

Typical transitions

The following table lists the common state sequences for create, update, replace, stop, suspend, failure, and promote operations. The sequences are written at the proton level to show the full transition path; at the Workload level the proton-only states (initializing, warming, draining, restarting) collapse into adjacent Workload-visible states.

| Operation | Transition |
| --- | --- |
| Create | submitted -> provisioning -> launching -> running |
| Update | running -> initializing -> running |
| Replace | running -> running(active) + initializing/warming/running(candidate) -> switch -> draining(active) + running(new active) -> running(single active) |
| Stop | running -> stopping -> stopped |
| Suspend | running -> suspended |
| Failure | launching or running -> errored |
| Promote | running (draft) -> running (locked) (no restart) |
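
One way to use these sequences is as a reference when reading a proton's state history. The sketch below encodes the single-proton rows as allowed steps and reports any step that is not listed. The names `TYPICAL_STEPS` and `unlisted_steps` are hypothetical, and because the table covers typical transitions only, an unlisted step is a prompt to look closer, not proof of a fault.

```python
# Single-proton steps taken from the transition table. Replace involves two
# protons and Promote does not change state, so both are left out.
TYPICAL_STEPS = {
    ("submitted", "provisioning"),
    ("provisioning", "launching"),
    ("launching", "running"),
    ("running", "initializing"),
    ("initializing", "running"),
    ("running", "stopping"),
    ("stopping", "stopped"),
    ("running", "suspended"),
    ("launching", "errored"),
    ("running", "errored"),
}


def unlisted_steps(history):
    """Return the consecutive (from, to) pairs in history that the table does not list."""
    pairs = zip(history, history[1:])
    return [step for step in pairs if step not in TYPICAL_STEPS]


print(unlisted_steps(["submitted", "provisioning", "launching", "running"]))  # []
print(unlisted_steps(["submitted", "running"]))  # [('submitted', 'running')]
```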

Pod-based truth

The platform computes Workload status from pod predicates evaluated in priority order, with the following rules:

  • For multi-replica Workloads, worst-state-wins aggregation applies.
  • Sidecars factor into container readiness and failure evaluation.
  • Init container statuses don't factor into the predicates—they affect startup but do not directly trigger Workload-level errored.
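
Worst-state-wins aggregation for a multi-replica Workload can be sketched as a maximum over per-pod states. The severity ranking here is an assumption: it reuses the priority numbers from the predicate table under "How proton status is computed", and it leaves out states that table does not rank (such as interrupted and suspended). The names `SEVERITY` and `aggregate` are hypothetical.

```python
# Assumption: "worst" follows the priority numbers in the predicate table
# (higher number = worse). States the table leaves unranked are not covered.
SEVERITY = {
    "errored": 7,
    "stopped": 5,
    "submitted": 4,
    "provisioning": 4,
    "launching": 3,
    "running": 2,
}


def aggregate(pod_states):
    """Return the worst state among the given per-pod states.

    Ties (submitted vs. provisioning) resolve to whichever appears first.
    The no-pods case is handled separately by the platform, so it is
    rejected here rather than guessed at.
    """
    if not pod_states:
        raise ValueError("no pod states to aggregate")
    return max(pod_states, key=SEVERITY.__getitem__)


print(aggregate(["running", "running", "launching"]))  # launching
print(aggregate(["running", "errored", "running"]))    # errored
```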

Why errored is sticky

Once any container hits CrashLoopBackOff, the Workload reports errored and won't return to running until the offending pod is replaced or the failing container starts succeeding again. A Workload also stays in launching until every container's readiness probe passes—a typo in readinessProbe.path or a sidecar that never becomes ready keeps the Workload from reaching running even when the primary container is up.
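
Because a Workload can sit in launching indefinitely when a readiness probe never passes, a client that waits for running benefits from a timeout. The following is a minimal sketch under stated assumptions: `wait_until_running` and `GIVE_UP_STATES` are hypothetical names, `get_state` stands in for whatever call you use to read the Workload state, and treating errored as a reason to stop waiting is a policy choice for this example, not platform behavior.

```python
import time

# States after which this sketch stops waiting so the caller can investigate
# (an assumption; adjust to your own policy).
GIVE_UP_STATES = frozenset({"errored", "stopped", "terminated"})


def wait_until_running(get_state, timeout_s=300.0, interval_s=5.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until it returns "running".

    get_state is any zero-argument callable returning the current state
    string. Returns the last state seen: "running" on success, a give-up
    state on failure, or the last non-running state on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state == "running" or state in GIVE_UP_STATES:
            return state
        if clock() >= deadline:
            return state
        sleep(interval_s)
```

A return value of launching after the timeout is the cue to check readiness probes and sidecars; errored is the cue to move on to the debugging steps in the next section.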

Debug an errored Workload

statusDetails on a proton includes the failing container name and reason. The /events endpoint gives a lifecycle audit trail; logs give application-level context.

Pull lifecycle events and per-proton status to pinpoint where a Workload went wrong:

# Lifecycle events: what changed and when
curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/events" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}"

# Per-proton status
curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/protons" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.data[] | {id, role, status}'
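
The same check can be scripted. The sketch below uses only the Python standard library and the environment variables from the curl calls above. It assumes the response shape implied by the jq filter (a `data` list of objects with `id`, `role`, and `status`) and that `status` is a plain string; neither is confirmed here, and the function names are hypothetical.

```python
import json
import os
import urllib.request


def fetch_json(path):
    """GET a Workload API path using the same environment variables as the curl calls."""
    request = urllib.request.Request(
        os.environ["DATAROBOT_ENDPOINT"] + path,
        headers={"Authorization": "Bearer " + os.environ["DATAROBOT_API_TOKEN"]},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def errored_protons(payload):
    """Return (id, role) for each proton whose status is "errored".

    Assumes a "data" list of objects with "id", "role", and "status" keys,
    as implied by the jq filter; missing keys are tolerated.
    """
    return [
        (proton.get("id"), proton.get("role"))
        for proton in payload.get("data", [])
        if proton.get("status") == "errored"
    ]


def report(workload_id):
    """Print one line per errored proton backing the given Workload."""
    protons = fetch_json(f"/workloads/{workload_id}/protons")
    for proton_id, role in errored_protons(protons):
        print(f"errored proton {proton_id} (role={role})")
```

Calling `report(os.environ["WORKLOAD_ID"])` prints the errored protons, whose statusDetails can then be inspected for the failing container name and reason.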

Protons: runtime backing and aggregation

A proton is the runtime primitive behind a Workload—the actual running container instances. Workloads are the governed identity; protons are the execution.

Object hierarchy

A Workload always has at least one proton. During a replace it can temporarily have two—old and new—with Workload traffic routing deciding which proton receives requests.

A proton consists of one or more Kubernetes pods:

  • A single-replica Workload maps to one pod.
  • A multi-replica Workload maps to one pod per replica.

A Workload is the stable address and governance wrapper; a proton is the runtime backing that executes the artifact behind that address.
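
The hierarchy can be pictured as three nested record types. This is an illustrative model only: the class and field names are hypothetical and do not mirror the API schema, and the role labels in the comment are assumed.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str


@dataclass
class Proton:
    proton_id: str
    role: str  # e.g. "active" or "candidate" (labels assumed for illustration)
    pods: list[Pod] = field(default_factory=list)  # one pod per replica


@dataclass
class Workload:
    workload_id: str
    protons: list[Proton] = field(default_factory=list)

    def validate(self):
        """Check the cardinality described above: one proton, or two during a replace."""
        if not 1 <= len(self.protons) <= 2:
            raise ValueError("expected one proton, or two during a replace")


# A three-replica Workload in steady state: one proton, three pods.
workload = Workload(
    "wl-example",
    [Proton("pr-1", "active", [Pod(f"pod-{i}") for i in range(3)])],
)
workload.validate()
print(len(workload.protons), len(workload.protons[0].pods))  # 1 3
```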

How proton status is computed

The platform computes a proton's status by evaluating predicates against its pods in priority order; the highest-priority match wins. For multi-replica protons, worst-state-wins aggregation applies across pods.

| Pod condition | Proton state | Workload state | Priority |
| --- | --- | --- | --- |
| Any container in CrashLoopBackOff | errored | errored | 7 (highest) |
| Any container in ImagePullBackOff | errored | errored | 7 |
| Pod phase Failed | errored | errored | 6 |
| Pod phase Succeeded | stopped | stopped | 5 |
| Pod phase Pending (no node yet) | submitted | submitted | 4 |
| Pod phase Pending (scheduled to node) | provisioning | provisioning | 4 |
| Pod phase Running, not all containers ready | launching | launching | 3 |
| Pod phase Running, all containers ready | running | running | 2 |
| Node eviction in progress | interrupted | interrupted | |
| User-issued suspend | suspended | suspended | |
| Proton recreated after config change | restarting | unchanged (proton-only state) | |
| No pods present | stopped (during shutdown) or errored (unexpected) | same | |

Predicates examine every container in each pod, including sidecars. If a sidecar has a wrong image URI, the proton reports errored even if the primary container is healthy. Init container statuses are not evaluated by the predicates—they affect startup but do not directly trigger errored.

Multi-container examples

| Scenario | Pod phase | Container states | Workload state |
| --- | --- | --- | --- |
| App and sidecar both healthy | Running | app ready, sidecar ready | running |
| App running, sidecar in ImagePullBackOff | Running | app ready, sidecar ImagePullBackOff | errored |
| App in CrashLoopBackOff | Running | app CrashLoopBackOff | errored |
| App starting, sidecar running | Running | app ContainerCreating, sidecar ready | launching |
| Pod scheduled, containers waiting | Pending | both Waiting (node assigned) | provisioning |
| Pod pending, no node yet | Pending | both Waiting (no node) | submitted |
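
The scenarios above can be reproduced with a small priority-ordered evaluator. This is a sketch of the technique, not the platform's code: `pod_state` and its input shape are hypothetical, only the prioritized rows of the predicate table are covered, and the "unknown" fallback for an unrecognized phase is an assumption.

```python
BACKOFF_REASONS = {"CrashLoopBackOff", "ImagePullBackOff"}


def pod_state(phase, containers, node_assigned=True):
    """Derive a state from the pod phase and its container statuses.

    containers: list of dicts with "ready" (bool) and optional "reason" (str),
    covering the app container and sidecars. Init containers are deliberately
    not passed in. Checks run from the highest table priority to the lowest;
    the first match wins.
    """
    if any(c.get("reason") in BACKOFF_REASONS for c in containers):
        return "errored"                                   # priority 7
    if phase == "Failed":
        return "errored"                                   # priority 6
    if phase == "Succeeded":
        return "stopped"                                   # priority 5
    if phase == "Pending":                                 # priority 4
        return "provisioning" if node_assigned else "submitted"
    if phase == "Running":
        if not all(c.get("ready") for c in containers):
            return "launching"                             # priority 3
        return "running"                                   # priority 2
    return "unknown"  # assumption: unrecognized phase


app_ready = {"ready": True}
sidecar_pull_error = {"ready": False, "reason": "ImagePullBackOff"}

# A healthy app does not mask a failing sidecar.
print(pod_state("Running", [app_ready, sidecar_pull_error]))  # errored
```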

Inspect protons

Most users interact through Workloads. Proton endpoints are useful for inspecting runtime state and testing candidates during replace operations.

List the protons backing a Workload to see the active instance and any candidate:

# List protons for a workload
curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/protons" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.'

The API exposes the following proton endpoints:

GET /workloads/{workload_id}/protons
GET /workloads/{workload_id}/protons/{proton_id}
GET /workloads/{workload_id}/protons/{proton_id}/statusDetails

OpenTelemetry traces, logs, and metrics are exposed on the platform observability surface at /api/v2/otel/workload/{workload_id}/{logs,metrics,traces} (singular workload, keyed by Workload ID—not proton ID). The Workload API itself does not expose /otel/* routes.
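
A small helper can keep the singular `workload` segment and the allowed signal names in one place. This is a sketch: `otel_url` is a hypothetical name, and `base_url` is assumed to be the platform's scheme and host, since this page does not state how it relates to `DATAROBOT_ENDPOINT`.

```python
OTEL_SIGNALS = ("logs", "metrics", "traces")


def otel_url(base_url, workload_id, signal):
    """Build the observability URL for a Workload.

    base_url: scheme and host only, with no path (an assumption of this sketch).
    signal: one of "logs", "metrics", or "traces".
    """
    if signal not in OTEL_SIGNALS:
        raise ValueError(f"signal must be one of {OTEL_SIGNALS}, got {signal!r}")
    return f"{base_url.rstrip('/')}/api/v2/otel/workload/{workload_id}/{signal}"


print(otel_url("https://example.invalid", "abc123", "logs"))
# https://example.invalid/api/v2/otel/workload/abc123/logs
```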