# Best practices and troubleshooting

> Best practices and troubleshooting - Container design, production hardening, security, and recovery
> steps for common failures.

This Markdown file sits beside the HTML page at the same path (with a `.md` suffix). It summarizes the topic and lists links for tools and LLM context.

Companion generated at `2026-06-22T16:50:38.247721+00:00` (UTC).

## Primary page

- [Best practices and troubleshooting](https://docs.datarobot.com/en/docs/workload-api/get-started-workloads/best-practices.html.md): Full documentation for this topic (Markdown sidecar).

## Sections on this page

- [Container design](https://docs.datarobot.com/en/docs/workload-api/get-started-workloads/best-practices.html.md#container-design): In-page section heading.
- [Production hardening](https://docs.datarobot.com/en/docs/workload-api/get-started-workloads/best-practices.html.md#production-hardening): In-page section heading.
- [Security](https://docs.datarobot.com/en/docs/workload-api/get-started-workloads/best-practices.html.md#security): In-page section heading.
- [Troubleshooting](https://docs.datarobot.com/en/docs/workload-api/get-started-workloads/best-practices.html.md#troubleshooting): In-page section heading.
- [Inspect a Workload](https://docs.datarobot.com/en/docs/workload-api/get-started-workloads/best-practices.html.md#inspect-workload): In-page section heading.
- [Drill into protons](https://docs.datarobot.com/en/docs/workload-api/get-started-workloads/best-practices.html.md#inspect-protons): In-page section heading.
- [Per-replica readiness](https://docs.datarobot.com/en/docs/workload-api/get-started-workloads/best-practices.html.md#per-replica-status): In-page section heading.
- [Application logs](https://docs.datarobot.com/en/docs/workload-api/get-started-workloads/best-practices.html.md#application-logs): In-page section heading.
- [Workload stuck in launching](https://docs.datarobot.com/en/docs/workload-api/get-started-workloads/best-practices.html.md#stuck-initializing): In-page section heading.
- [Workload reports errored](https://docs.datarobot.com/en/docs/workload-api/get-started-workloads/best-practices.html.md#errored-status): In-page section heading.

## Related documentation

- [Workload API](https://docs.datarobot.com/en/docs/workload-api/index.html.md): Linked from this page.
- [Get started: Workload API](https://docs.datarobot.com/en/docs/workload-api/get-started-workloads/index.html.md): Linked from this page.
- [Artifact concepts](https://docs.datarobot.com/en/docs/workload-api/build-artifacts/artifacts-concepts.html.md): Linked from this page.
- [Workload concepts](https://docs.datarobot.com/en/docs/workload-api/create-workloads/workload-concepts.html.md): Linked from this page.
- [Lifecycle states](https://docs.datarobot.com/en/docs/workload-api/operate-workloads/lifecycle-states.html.md): Linked from this page.
- [Container health and readiness](https://docs.datarobot.com/en/docs/workload-api/monitor-workloads/health-readiness.html.md): Linked from this page.
- [Promote to production](https://docs.datarobot.com/en/docs/workload-api/update-workloads/promote-production.html.md): Linked from this page.
- [Scaling metrics](https://docs.datarobot.com/en/docs/workload-api/operate-workloads/runtime-settings.html.md#scaling-metrics): Linked from this page.
- [View Workload logs](https://docs.datarobot.com/en/docs/workload-api/monitor-workloads/activity-logs-ui/logs.html.md): Linked from this page.
- [Replace and roll out](https://docs.datarobot.com/en/docs/workload-api/update-workloads/replace-artifact-rollouts.html.md): Linked from this page.

## Documentation content

This page collects recommendations for designing containers, hardening production Workloads, handling secrets, and [recovering from common failures](https://docs.datarobot.com/en/docs/workload-api/get-started-workloads/best-practices.html.md#troubleshooting). The conceptual background is outlined in [Artifact concepts](https://docs.datarobot.com/en/docs/workload-api/build-artifacts/artifacts-concepts.html.md), [Workload concepts](https://docs.datarobot.com/en/docs/workload-api/create-workloads/workload-concepts.html.md), and [Lifecycle states](https://docs.datarobot.com/en/docs/workload-api/operate-workloads/lifecycle-states.html.md).

## Best practices

Consider the following when creating and running containerized workloads.

### Container design

The platform polls each container's readiness probe to decide when that container can receive traffic. Liveness and startup probes are optional but recommended for resilient services.

| Practice | Why it matters |
| --- | --- |
| Implement readinessProbe on every primary container, plus livenessProbe and startupProbe where they help. | Readiness gates the transition to running. Liveness restarts wedged-but-still-running containers. Startup gives slow boots more time before the other probes take over. See Container health and readiness. |
| Tune probe timing for slow-starting Workloads. | ProbeConfig defaults are 30 s for initialDelaySeconds, periodSeconds, and timeoutSeconds. Slow-starting containers may need a higher initialDelaySeconds or a dedicated startupProbe. |
| Right-size resourceAllocation (cpu, memory, gpu) per container. | Set on runtime.containerGroups[].containers[].resourceAllocation. |
| Keep sidecars healthy. | Predicates examine every container in the pod, including sidecars. A sidecar in ImagePullBackOff reports the Workload as errored even if the primary container is fine; give sidecars their own readiness probes when they take time to start. |

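As a rough sketch of the probe tuning described above, the function below prints a container fragment with a readiness probe and a more patient startup probe. The probe names, `path`, `scheme`, and the three timing fields come from this page. The `name` and `port` fields, the `/healthz` path, the chosen values, and the exact nesting are assumptions for illustration, so check the ProbeConfig schema before reusing it.

```shell
# Print an illustrative entry for runtime.containerGroups[].containers[].
# Assumed for illustration: "name", "port", the /healthz path, and the nesting.
probe_fragment() {
  cat <<'JSON'
{
  "name": "primary",
  "readinessProbe": {
    "path": "/healthz",
    "port": 8080,
    "scheme": "HTTP",
    "initialDelaySeconds": 30,
    "periodSeconds": 30,
    "timeoutSeconds": 30
  },
  "startupProbe": {
    "path": "/healthz",
    "port": 8080,
    "initialDelaySeconds": 120
  }
}
JSON
}
```

Calling `probe_fragment` writes the fragment to standard output. The larger `initialDelaySeconds` on the startup probe is one way to give a slow boot more time before the readiness probe takes over.
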
### Production hardening

Locked artifacts and explicit governance are the difference between a quick draft and a production deployment. Consider these best practices:

| Practice | Why it matters |
| --- | --- |
| Lock the artifact before serving production traffic. | Locked artifacts are immutable and can back unlimited Workloads. Locking is one-way: send PATCH /artifacts/{id} with {"status": "locked"}, or use POST /workloads/{id}/promote to lock the artifact in place. See Promote to production. |
| Set importance deliberately on locked Workloads. | importance defaults to low; set it explicitly for production Workloads. It is a priority hint the platform uses for resource prioritization and operational triage under cluster contention, and does not affect routing, autoscaling, or QoS guarantees. |
| Configure autoscaling to match traffic. | See Scaling metrics for predefined scalingMetric values and scale-to-zero behavior. Only httpRequestsConcurrency scales replicas to zero when the proton is idle (minCount: 0). |
| Use resource bundles for Workload-level resource selection. | Per-container resourceAllocation (under runtime.containerGroups[].containers[]) declares what each container gets; runtime.containerGroups[].resourceBundles selects a platform bundle for the group. See Runtime settings. |

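The two locking routes in the table could be wrapped as small shell helpers, sketched below. The endpoints and the `{"status": "locked"}` body come from this page. The function names, the `Content-Type` header, and the empty request body on the promote call are assumptions.

```shell
# Lock an artifact directly. Locking is one-way, so verify the ID first.
# Usage: lock_artifact ARTIFACT_ID
lock_artifact() {
  artifact_id="$1"
  curl -s -X PATCH "${DATAROBOT_ENDPOINT}/artifacts/${artifact_id}" \
    -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{"status": "locked"}'
}

# Promote a Workload, which locks its artifact in place.
# Usage: promote_workload WORKLOAD_ID
# Assumption: the promote endpoint needs no request body.
promote_workload() {
  workload_id="$1"
  curl -s -X POST "${DATAROBOT_ENDPOINT}/workloads/${workload_id}/promote" \
    -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}"
}
```

Use `lock_artifact` when you want to lock before any Workload exists, and `promote_workload` when a draft Workload is already running against the artifact.
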
### Security

Container images, secrets, and probe traffic all benefit from sensible defaults.

| Practice | Why it matters |
| --- | --- |
| Pull from private, authenticated registries. | The platform pulls images at scheduling time. Public images are fine for evaluation; production Workloads should use registries you control. |
| Inject secrets through CredentialEnvironmentVariable (DataRobot Credentials), not hardcoded values. | Container env vars accept a CredentialEnvironmentVariable entry with source: dr-credential, drCredentialId, and key to look up a value from the DataRobot Credentials service at runtime. Plain StringEnvironmentVariable is fine for non-sensitive configuration; never commit tokens, API keys, or database passwords as string values. |
| Use scheme: HTTPS on probes when the container terminates TLS internally. | ProbeConfig.scheme accepts HTTP (default) or HTTPS. Match the container listener so probes don't fail the handshake before traffic arrives. |

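To illustrate the credential-backed variable described above, the helper below prints one environment-variable entry. The `source: dr-credential`, `drCredentialId`, and `key` fields come from this page. The `name` field, the function name, and the sample values in the usage note are hypothetical.

```shell
# Print an illustrative CredentialEnvironmentVariable entry.
# Usage: credential_env_entry CREDENTIAL_ID KEY VARIABLE_NAME
# Assumption: the entry names the target variable with a "name" field.
credential_env_entry() {
  cred_id="$1"
  cred_key="$2"
  var_name="$3"
  cat <<JSON
{
  "name": "${var_name}",
  "source": "dr-credential",
  "drCredentialId": "${cred_id}",
  "key": "${cred_key}"
}
JSON
}
```

For example, `credential_env_entry abc123 password DB_PASSWORD` prints an entry that references the credential by ID, so the secret value itself never appears in the Workload definition.
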
## Troubleshooting

When a Workload misbehaves, start with the Workload object and its events stream, then drill into protons and per-replica detail.

### Inspect a Workload

The fastest signal is the Workload object itself, then the lifecycle event log.

```
# Workload-level summary
curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '{status, importance, artifactId, replacement}'

# Lifecycle events: what changed and when
curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/events" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}"
```

### Drill into protons

List the protons backing a Workload to see active and candidate instances and their current status.

```
curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/protons" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.data[] | {id, status, role}'
```

For container-level conditions, readiness, restart counts, and the startup `logTail`, call the per-proton status-details endpoint shown in the next section.

### Per-replica readiness

For container-level conditions and pod phase per replica, use the dedicated status-details endpoint. This is where readiness conditions, container restart counts, and per-replica failure reasons live.

```
curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/protons/${PROTON_ID}/statusDetails" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" \
  | jq '.replicas[] | {name, status, conditions, containers}'
```

A `204` response means no status snapshot has arrived yet; retry shortly.
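
One way to handle the `204` case is a small retry loop, sketched below. It assumes a populated snapshot arrives with HTTP `200`; the function name, attempt count, and delay are illustrative defaults rather than platform values.

```shell
# Fetch per-replica status details, retrying while the API answers 204.
# Usage: fetch_status_details WORKLOAD_ID PROTON_ID [MAX_ATTEMPTS] [DELAY_SECONDS]
fetch_status_details() {
  workload_id="$1"
  proton_id="$2"
  max_attempts="${3:-10}"
  delay="${4:-3}"
  body_file="$(mktemp)"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -o saves the body to a file; -w prints only the HTTP status code.
    code="$(curl -s -o "$body_file" -w '%{http_code}' \
      "${DATAROBOT_ENDPOINT}/workloads/${workload_id}/protons/${proton_id}/statusDetails" \
      -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}")"
    case "$code" in
      200) cat "$body_file"; rm -f "$body_file"; return 0 ;;
      204) sleep "$delay" ;;  # no snapshot yet; wait and retry
      *) echo "unexpected HTTP status: $code" >&2; rm -f "$body_file"; return 1 ;;
    esac
    attempt=$((attempt + 1))
  done
  rm -f "$body_file"
  echo "no status snapshot after $max_attempts attempts" >&2
  return 2
}
```

The output can be piped into the same `jq` filter shown above, for example `fetch_status_details "$WORKLOAD_ID" "$PROTON_ID" | jq '.replicas[]'`.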

### Application logs

Container output gives application-level context that complements lifecycle events and per-replica status. `stdout` and `stderr` are captured automatically at every lifecycle stage (startup, running, and errored) and surface on the Workload's Activity log > Logs tab. See [View Workload logs](https://docs.datarobot.com/en/docs/workload-api/monitor-workloads/activity-logs-ui/logs.html.md) for how to filter, search, and copy log output.

### Workload stuck in launching

If the pod is scheduled but not all containers have passed readiness yet, or cluster resources are still being allocated (`provisioning`), check these common causes:

| Cause | What to check |
| --- | --- |
| Wrong readinessProbe.path or non-2xx during warmup | Confirm the path returns 2xx when the app is ready and matches the container port. |
| resourceAllocation exceeds cluster capacity | Ensure a node can schedule the pod; watch for long Pending with insufficient CPU, memory, or GPU. |
| Sidecar still starting without its own probe | Add a readiness probe to sidecars that take time to boot so they do not block the primary container from passing readiness. |
| Long installs or model downloads | Raise initialDelaySeconds or add a startupProbe so probes do not fail the Workload early. |

Check `events`, then per-replica `statusDetails`, for unmet conditions.
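
To tell a slow start from a genuinely stuck Workload, it can help to poll the Workload status with a deadline. The sketch below reads the `status` field from the Workload object, as in the summary query earlier on this page. The function name, timeout, and polling interval are assumptions, and any status other than `running` or `errored` is treated as still in progress.

```shell
# Poll a Workload until it is running, errored, or the timeout elapses.
# Usage: wait_for_running WORKLOAD_ID [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_running() {
  workload_id="$1"
  timeout="${2:-600}"
  interval="${3:-10}"
  elapsed=0
  status=""
  while [ "$elapsed" -le "$timeout" ]; do
    status="$(curl -s "${DATAROBOT_ENDPOINT}/workloads/${workload_id}" \
      -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq -r '.status')"
    case "$status" in
      running) echo "running"; return 0 ;;
      errored) echo "errored" >&2; return 1 ;;
      *) sleep "$interval" ;;  # still provisioning or launching
    esac
    elapsed=$((elapsed + interval))
  done
  echo "timed out while status was: $status" >&2
  return 2
}
```

A return code of `2` is the cue to work through the table above; a return code of `1` points to the errored checklist instead.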

### Workload reports errored

If at least one container is in `CrashLoopBackOff` or `ImagePullBackOff`, or the pod entered phase `Failed`, check these common causes:

| Cause | What to check |
| --- | --- |
| Bad imageUri or unreachable registry | Look for ImagePullBackOff in statusDetails and verify registry auth and network paths. |
| Container exits non-zero on startup | Inspect statusDetails.logTail and OpenTelemetry logs for startup exceptions. |
| Out-of-memory kill | Raise resourceAllocation.memory or fix a memory leak in the application. |
| Sidecar misconfiguration | Predicates take the most severe container state across the entire pod—a single sidecar in CrashLoopBackOff or ImagePullBackOff marks the whole pod as unhealthy and surfaces the Workload as errored. |

`errored` is sticky: the Workload won't return to `running` until the failing pod is replaced or the failing container starts succeeding. Fix the underlying cause and allow the platform to restart the container, or trigger a replacement (see [Replace and roll out](https://docs.datarobot.com/en/docs/workload-api/update-workloads/replace-artifact-rollouts.html.md)).
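
For a quick first pass over a status-details response, the filter below scans the raw JSON for failure reasons instead of assuming a response schema. `CrashLoopBackOff` and `ImagePullBackOff` come from this page; `OOMKilled` is the usual Kubernetes reason for an out-of-memory kill, and whether status details surface it under that name is an assumption.

```shell
# Read a statusDetails JSON document on stdin and print each known
# failure reason once, sorted. Prints nothing when none are present.
scan_failure_reasons() {
  grep -o -E 'CrashLoopBackOff|ImagePullBackOff|OOMKilled' | sort -u
}
```

Pipe the per-proton status-details response into `scan_failure_reasons`, then match what it prints against the causes in the table above.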
