Lifecycle states¶

A Workload's lifecycle is the set of states it transitions through from creation to teardown—and what triggers each transition. The platform derives states from the Kubernetes pods backing the Workload's protons.

The platform tracks state at two levels. ProtonStatus is the full runtime view computed from the Kubernetes pods backing each proton. WorkloadStatus is the user-facing subset surfaced on the Workload itself: it omits the proton-only states (initializing, warming, draining, interrupted, restarting), which the Workload-level API never reports.

In the combined state table below, the "Surfaces on" column indicates where each value appears.

State	Surfaces on	Description
`unknown`	Workload + proton	State not yet reported. The platform has not produced a snapshot yet.
`submitted`	Workload + proton	Accepted by the API, not yet scheduled. Pod is `Pending` with no node assigned.
`initializing`	Proton only	Legacy state retained for backward compatibility. New protons use `provisioning` instead. Workloads collapse this into adjacent Workload-level states (`launching` or `running`).
`provisioning`	Workload + proton	Cluster resources are being allocated—resource-bundle scheduling, PVC creation, and secret injection. Pod is `Pending` with a node assigned.
`launching`	Workload + proton	Image pull and container start in progress, or containers are up but readiness probes have not passed yet. Pod is `Running` with not-yet-ready containers.
`running`	Workload + proton	Healthy and serving requests. Pod is `Running` and all containers are ready.
`suspended`	Workload + proton	Intentionally paused. Pods are stopped, but the Workload's identity and configuration are preserved; `start` restores them.
`warming`	Proton only	Running in the warmup window during a replacement. Reflected from the candidate proton.
`draining`	Proton only	Alive but no longer receiving traffic. Old proton during a replacement.
`interrupted`	Proton only	Runtime preempted—node eviction, spot reclaim, or scheduler-driven displacement. The platform reschedules automatically once capacity frees up. At the Workload level, `interrupted` maps to `stopped`.
`restarting`	Proton only	Proton being recreated in place after a configuration change that requires a pod restart but preserves proton identity. The Workload-level status stays on the adjacent visible state.
`stopping`	Workload + proton	Graceful shutdown is in progress. Pod is terminating.
`stopped`	Workload + proton	Stopped; can be restarted. No pods are present, or the pod phase is `Succeeded` for run-to-completion artifacts.
`errored`	Workload + proton	Failed. `CrashLoopBackOff`, `ImagePullBackOff`, or pod phase `Failed`.
`terminated`	Workload + proton	Permanently torn down. Workload was deleted; the proton no longer exists.

The platform evaluates pod-state predicates in priority order to derive ProtonStatus; WorkloadStatus is then derived by collapsing the proton-only states.
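
The collapse from ProtonStatus to WorkloadStatus can be sketched as a small shell function. This is an illustration built only from the state table above, not the platform's implementation: the function name `workload_status` and the second argument (the last Workload-visible state, which the caller is assumed to track) are hypothetical.

```shell
# Sketch: collapse a proton-level state into a Workload-level state.
# $1 = proton state
# $2 = last Workload-visible state (hypothetical input; the table only says
#      proton-only states collapse into the "adjacent" visible state)
workload_status() {
  case "$1" in
    interrupted)
      # The state table maps interrupted to stopped at the Workload level.
      echo "stopped" ;;
    initializing|warming|draining|restarting)
      # Proton-only states: the Workload stays on the adjacent visible state.
      echo "${2:-unknown}" ;;
    unknown|submitted|provisioning|launching|running|suspended|stopping|stopped|errored|terminated)
      # States that surface on both levels pass through unchanged.
      echo "$1" ;;
    *)
      echo "unrecognized state: $1" >&2
      return 1 ;;
  esac
}
```

For example, `workload_status warming running` prints `running`, because a candidate proton in its warmup window never changes what the Workload reports.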

Typical transitions¶

The following table lists the common state sequences for create, update, replace, stop, suspend, failure, and promote operations. The sequences are written at the proton level to show the full transition path; at the Workload level, the proton-only states (initializing, warming, draining, restarting) collapse into adjacent Workload-visible states.

Operation	Transition
Create	`submitted -> provisioning -> launching -> running`
Update	`running -> initializing -> running`
Replace	`running -> running(active) + initializing/warming/running(candidate) -> switch -> draining(active) + running(new active) -> running(single active)`
Stop	`running -> stopping -> stopped`
Suspend	`running -> suspended`
Failure	`launching|running -> errored`
Promote	`running (draft) -> running (locked)` (no restart)
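
When scripting against the Create sequence, a poll loop needs to know which states mean "keep waiting" and which are final. The sketch below encodes that decision for the states in the Create and Failure rows. The function name `create_poll_verdict` and the verdict strings are hypothetical, and treating `terminated` as a failure during a create is an assumption.

```shell
# Sketch: classify a Workload-level state observed while waiting for a create.
# Prints "ready", "failed", "waiting", or "unexpected".
create_poll_verdict() {
  case "$1" in
    running)
      echo "ready" ;;                 # end of the Create sequence
    errored|terminated)
      echo "failed" ;;                # Failure row; terminated is an assumption
    unknown|submitted|provisioning|launching)
      echo "waiting" ;;               # still on the Create path
    *)
      echo "unexpected" ;;            # e.g. stopped or suspended mid-create
  esac
}

# Hypothetical polling loop; get_workload_state is a placeholder for however
# you read the Workload's current state:
#   while [ "$(create_poll_verdict "$(get_workload_state)")" = "waiting" ]; do
#     sleep 5
#   done
```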

Pod-based truth¶

The platform computes Workload status from pod predicates evaluated in priority order. Three rules shape the result:

For multi-replica Workloads, worst-state-wins aggregation applies.
Sidecars factor into container readiness and failure evaluation.
Init container statuses don't factor into the predicates—they affect startup but do not directly trigger Workload-level errored.
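
Worst-state-wins aggregation can be pictured as picking the most severe state across replicas. The sketch below assumes that "worst" follows the predicate priorities listed under "How proton status is computed" (errored highest, running lowest); the source does not spell out the ranking, so `state_rank` and `aggregate_replicas` are hypothetical helpers, and states without a listed priority are left unranked.

```shell
# Assumed severity ranking, borrowed from the predicate priorities:
# errored > stopped > submitted/provisioning > launching > running.
state_rank() {
  case "$1" in
    errored)                echo 7 ;;
    stopped)                echo 5 ;;
    submitted|provisioning) echo 4 ;;
    launching)              echo 3 ;;
    running)                echo 2 ;;
    *)                      echo 0 ;;  # no priority given in the source
  esac
}

# Print the worst state among the per-replica states passed as arguments.
# On a tie, the first state seen is kept.
aggregate_replicas() {
  worst=""
  worst_rank=-1
  for state in "$@"; do
    rank=$(state_rank "$state")
    if [ "$rank" -gt "$worst_rank" ]; then
      worst=$state
      worst_rank=$rank
    fi
  done
  echo "$worst"
}
```

Under this assumption, `aggregate_replicas running running launching` prints `launching`: one replica that has not passed readiness keeps the whole Workload out of `running`.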

Why errored is sticky¶

Once any container hits CrashLoopBackOff, the Workload reports errored and won't return to running until the offending pod is replaced or the failing container starts succeeding again. A Workload also stays in launching until every container's readiness probe passes—a typo in readinessProbe.path or a sidecar that never becomes ready keeps the Workload from reaching running even when the primary container is up.

CrashLoopBackOff is not surfaced as a named string in the Workload or proton status field—the status reads errored with no further qualification. To confirm a crash loop, call statusDetails on the proton: a container with restartCount > 0 and status: "waiting" indicates a crash loop in progress.
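
The crash-loop rule above is simple enough to express as a predicate. The sketch below takes the two values named in the text, `restartCount` and `status`, as plain arguments. The function name is hypothetical, and the shape of the statusDetails response (for example, whether containers sit in an array) is not specified here, so extracting those two values is left to the caller.

```shell
# Returns 0 (true) when a container looks like it is crash looping:
# restartCount > 0 and status "waiting".
# $1 = restartCount, $2 = container status
is_crash_looping() {
  [ "$1" -gt 0 ] && [ "$2" = "waiting" ]
}

# Example call with values read from statusDetails:
#   if is_crash_looping 4 waiting; then echo "crash loop in progress"; fi
```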

Debug an errored Workload¶

statusDetails on a proton includes the failing container name and reason. The /events endpoint records DataRobot lifecycle transitions—state changes, replacements, promotions—and is useful for timeline reconstruction. It does not forward Kubernetes pod-level events such as crash restarts or image-pull failures; for those, statusDetails on the proton is the primary diagnostic source. Logs give application-level context.

Pull lifecycle events and per-proton status to pinpoint where a Workload went wrong:

# Lifecycle events: what changed and when
curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/events" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}"

# Per-proton status
curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/protons" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.data[] | {id, role, status}'

Protons: runtime backing and aggregation¶

A proton is the runtime primitive behind a Workload—the actual running container instances. Workloads are the governed identity; protons are the execution.

Object hierarchy¶

A Workload always has at least one proton. During a replace it can temporarily have two—old and new—with Workload traffic routing deciding which proton receives requests.

A proton consists of one or more Kubernetes pods:

A single-replica Workload maps to one pod.
A multi-replica Workload maps to one pod per replica.

A Workload is the stable address and governance wrapper; a proton is the runtime backing that executes the artifact behind that address.

How proton status is computed¶

The platform computes a proton's status by evaluating the following predicates against its pods in priority order; the highest-priority match wins. For multi-replica protons, worst-state-wins aggregation applies.

Pod condition	Proton state	Workload state	Priority
Any container in `CrashLoopBackOff`	`errored`	`errored`	7 (highest)
Any container in `ImagePullBackOff`	`errored`	`errored`	7
Pod phase `Failed`	`errored`	`errored`	6
Pod phase `Succeeded`	`stopped`	`stopped`	5
Pod phase `Pending` (no node yet)	`submitted`	`submitted`	4
Pod phase `Pending` (scheduled to node)	`provisioning`	`provisioning`	4
Pod phase `Running`, not all containers ready	`launching`	`launching`	3
Pod phase `Running`, all containers ready	`running`	`running`	2
Node eviction in progress	`interrupted`	`stopped`	—
User-issued suspend	`suspended`	`suspended`	—
Proton recreated after config change	`restarting`	(proton-only—Workload state unchanged)	—
No pods present	`stopped` (during shutdown) or `errored` (unexpected)	same	—

Predicates examine every container in each pod, including sidecars. If a sidecar has a wrong image URI, the proton reports errored even if the primary container is healthy. Init container statuses are not evaluated by the predicates—they affect startup but do not directly trigger errored.
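
The prioritized rows of the predicate table can be sketched as a single function for one pod. This is a reading of the table, not the platform's code: the function name and the argument encoding (one token per container, with `ready` and `notready` as made-up labels) are hypothetical. The rows without a numeric priority (eviction, suspend, restart, no pods) are left out, and falling back to `unknown` for an unrecognized phase is an assumption.

```shell
# Sketch: derive a proton state for one pod from the predicate table.
# $1   = pod phase (Pending, Running, Succeeded, Failed)
# $2   = "yes" if the pod has a node assigned, otherwise "no"
# $3.. = one token per container, sidecars included, init containers excluded:
#        ready, notready, CrashLoopBackOff, ImagePullBackOff
proton_state_for_pod() {
  phase=$1
  node_assigned=$2
  shift 2

  all_ready=yes
  for c in "$@"; do
    case "$c" in
      CrashLoopBackOff|ImagePullBackOff)
        echo "errored"          # priority 7: checked before the pod phase
        return 0 ;;
      ready) ;;
      *) all_ready=no ;;
    esac
  done

  case "$phase" in
    Failed)    echo "errored" ;;                          # priority 6
    Succeeded) echo "stopped" ;;                          # priority 5
    Pending)                                              # priority 4
      if [ "$node_assigned" = "yes" ]; then echo "provisioning"; else echo "submitted"; fi ;;
    Running)                                              # priorities 3 and 2
      if [ "$all_ready" = "yes" ]; then echo "running"; else echo "launching"; fi ;;
    *)         echo "unknown" ;;                          # assumption
  esac
}
```

Checking container-level failures before the pod phase is what makes a broken sidecar override an otherwise healthy pod: `proton_state_for_pod Running yes ready ImagePullBackOff` prints `errored`.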

Multi-container examples¶

Scenario	Pod phase	Container states	Workload state
App and sidecar both healthy	Running	app ready, sidecar ready	`running`
App running, sidecar in `ImagePullBackOff`	Running	app ready, sidecar `ImagePullBackOff`	`errored`
App in `CrashLoopBackOff`	Running	app `CrashLoopBackOff`	`errored`
App starting, sidecar running	Running	app `ContainerCreating`, sidecar ready	`launching`
Pod scheduled, containers waiting	Pending	both `Waiting` (node assigned)	`provisioning`
Pod pending, no node yet	Pending	both `Waiting` (no node)	`submitted`

Inspect protons¶

Most users interact through Workloads. Proton endpoints are useful for inspecting runtime state and testing candidates during replace operations.

List the protons backing a Workload to see the active instance and any candidate:

# List protons for a workload
curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/protons" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.'

The API exposes the following proton endpoints:

GET /workloads/{workload_id}/protons
GET /workloads/{workload_id}/protons/{proton_id}
GET /workloads/{workload_id}/protons/{proton_id}/statusDetails
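
A small helper can keep these three routes consistent in scripts. The sketch below builds each URL from `DATAROBOT_ENDPOINT`, the same variable the curl examples use. `proton_url` and `proton_status_details` are hypothetical names, and the second function only wraps the curl pattern already shown on this page.

```shell
# Build a proton endpoint URL from the three routes listed above.
# $1 = workload ID, $2 = proton ID (optional), $3 = "statusDetails" (optional)
proton_url() {
  url="${DATAROBOT_ENDPOINT}/workloads/$1/protons"
  if [ -n "${2:-}" ]; then url="$url/$2"; fi
  if [ "${3:-}" = "statusDetails" ]; then url="$url/statusDetails"; fi
  echo "$url"
}

# Fetch statusDetails for one proton (defined here, not called).
# $1 = workload ID, $2 = proton ID
proton_status_details() {
  curl -s "$(proton_url "$1" "$2" statusDetails)" \
    -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}"
}
```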

OpenTelemetry traces, logs, and metrics are exposed on the platform observability surface at /api/v2/otel/workload/{workload_id}/{logs,metrics,traces} (singular workload, keyed by Workload ID—not proton ID). The Workload API itself does not expose /otel/* routes.
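
Because the observability routes are keyed by Workload ID and use the singular `workload` segment, a path builder helps avoid the two most likely mistakes. The sketch below returns only the path; the host it attaches to is not stated here, so that part is left to the caller. `otel_path` is a hypothetical name.

```shell
# Print the observability path for a Workload and signal type.
# $1 = workload ID (not a proton ID), $2 = logs | metrics | traces
otel_path() {
  case "$2" in
    logs|metrics|traces)
      echo "/api/v2/otel/workload/$1/$2" ;;
    *)
      echo "unsupported signal: $2" >&2
      return 1 ;;
  esac
}
```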