
Best practices and troubleshooting

This page collects recommendations for designing containers, hardening production Workloads, handling secrets, and recovering from common failures. The conceptual background is outlined in Artifact concepts, Workload concepts, and Lifecycle states.

Best practices

Consider the following when creating and running containerized workloads.

Container design

The platform polls a readiness probe to decide when a container can receive traffic. Liveness and startup probes are optional but recommended for resilient services.

Practice | Why it matters
Implement readinessProbe on every primary container, plus livenessProbe and startupProbe where they help. | Readiness gates the transition to running. Liveness restarts wedged-but-still-running containers. Startup gives slow boots more time before the other probes take over. See Container health and readiness.
Tune probe timing for slow-starting Workloads. | ProbeConfig defaults are 30 s for initialDelaySeconds, periodSeconds, and timeoutSeconds. Slow-starting containers may need a higher initialDelaySeconds or a dedicated startupProbe.
Right-size resourceAllocation (cpu, memory, gpu) per container. | Set on runtime.containerGroups[].containers[].resourceAllocation.
Keep sidecars healthy. | Predicates examine every container in the pod, including sidecars. A sidecar in ImagePullBackOff reports the Workload as errored even if the primary container is fine; give sidecars their own readiness probes when they take time to start.
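As a rough illustration of the probe tuning described above, the sketch below prints the probe portion of one container entry for a slow-starting service. The probe names and the ProbeConfig field names (path, scheme, initialDelaySeconds, periodSeconds, timeoutSeconds) come from this page; the function name, the example paths, the chosen values, and the exact nesting are assumptions made for illustration, so check them against the Workload schema before use.

```shell
# Print an illustrative probe fragment for one container entry.
# ASSUMPTION: slow_start_probes, the /ready and /healthz paths, and the
# nesting shown here are hypothetical; only the field names come from this page.
slow_start_probes() {
  cat <<'JSON'
{
  "readinessProbe": {
    "path": "/ready",
    "scheme": "HTTP",
    "initialDelaySeconds": 30,
    "periodSeconds": 30,
    "timeoutSeconds": 30
  },
  "startupProbe": {
    "path": "/healthz",
    "scheme": "HTTP",
    "initialDelaySeconds": 120,
    "periodSeconds": 30,
    "timeoutSeconds": 30
  }
}
JSON
}
```

The readiness probe here keeps the documented 30 s defaults, while the startup probe waits longer before its first check so that a long model download does not fail the Workload early.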

Production hardening

Locked artifacts and explicit governance are the difference between a quick draft and a production deployment. Consider these best practices:

Practice | Why it matters
Lock the artifact before serving production traffic. | Locked artifacts are immutable and can back unlimited Workloads. Locking is one-way: send PATCH /artifacts/{id} with {"status": "locked"}, or use POST /workloads/{id}/promote to lock it in place. See Promote to production.
Set importance deliberately on locked Workloads. | importance defaults to low; set it explicitly for production Workloads. It is a priority hint the platform uses for resource prioritization and operational triage under cluster contention, and does not affect routing, autoscaling, or QoS guarantees.
Configure autoscaling to match traffic. | See Scaling metrics for predefined scalingMetric values and scale-to-zero behavior. Only httpRequestsConcurrency scales replicas to zero when the proton is idle (minCount: 0).
Use resource bundles for Workload-level resource selection. | Per-container resourceAllocation (under runtime.containerGroups[].containers[]) declares what each container gets; runtime.containerGroups[].resourceBundles selects a platform bundle for the group. See Runtime settings.
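The two locking routes mentioned above could be wrapped as small shell helpers, in the same curl style used elsewhere on this page. This is a sketch under assumptions: the function names are hypothetical, the Content-Type header is assumed rather than documented here, and the promote call is shown without a request body because this page does not describe one.

```shell
# Lock an artifact by ID. Locking is one-way and cannot be undone.
# ASSUMPTION: lock_artifact is a hypothetical helper; the Content-Type
# header is an assumption about what the API expects.
lock_artifact() {
  artifact_id=$1
  curl -s -X PATCH "${DATAROBOT_ENDPOINT}/artifacts/${artifact_id}" \
    -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{"status": "locked"}'
}

# Promote a Workload, which locks its artifact in place.
# ASSUMPTION: sent without a body; see Promote to production for options.
promote_workload() {
  workload_id=$1
  curl -s -X POST "${DATAROBOT_ENDPOINT}/workloads/${workload_id}/promote" \
    -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}"
}
```

For example, lock_artifact "${ARTIFACT_ID}" locks the artifact directly, where ARTIFACT_ID is a placeholder for your own artifact ID.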

Security

Container images, secrets, and probe traffic all benefit from sensible defaults.

Practice | Why it matters
Pull from private, authenticated registries. | The platform pulls images at scheduling time. Public images are fine for evaluation; production Workloads should use registries you control.
Inject secrets through CredentialEnvironmentVariable (DataRobot Credentials), not hardcoded values. | Container env vars accept a CredentialEnvironmentVariable entry with source: dr-credential, drCredentialId, and key to look up a value from the DataRobot Credentials service at runtime. Plain StringEnvironmentVariable is fine for non-sensitive configuration; never commit tokens, API keys, or database passwords as string values.
Use scheme: HTTPS on probes when the container terminates TLS internally. | ProbeConfig.scheme accepts HTTP (default) or HTTPS. Match the container listener so probes don't fail the handshake before traffic arrives.
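To make the secret-injection practice concrete, the sketch below prints an environment variable list that mixes a plain string value with a credential lookup. The source, drCredentialId, and key fields come from this page; the function name, the name and value fields, the variable names, and the "password" key are assumptions for illustration only.

```shell
# Print an illustrative env var list for one container.
# ASSUMPTION: credential_env_fragment, the "name"/"value" fields, and the
# LOG_LEVEL / DB_PASSWORD names are hypothetical; source, drCredentialId,
# and key are the CredentialEnvironmentVariable fields named on this page.
credential_env_fragment() {
  credential_id=$1
  cat <<JSON
[
  {"name": "LOG_LEVEL", "value": "info"},
  {
    "name": "DB_PASSWORD",
    "source": "dr-credential",
    "drCredentialId": "${credential_id}",
    "key": "password"
  }
]
JSON
}
```

Only the credential ID appears in the configuration; the secret itself is looked up from the DataRobot Credentials service at runtime and never lands in source control.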

Troubleshooting

When a Workload misbehaves, use the events stream first, then drill into protons and per-replica detail.

Inspect a Workload

The fastest signal is the Workload object itself, then the lifecycle event log.

# Workload-level summary
curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '{status, importance, artifactId, replacement}'

# Lifecycle events: what changed and when
curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/events" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" 

Drill into protons

List the protons backing a Workload to see active and candidate instances and their current status.

curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/protons" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.data[] | {id, status, role}' 

For container-level conditions, readiness, restart counts, and the startup logTail, call the per-proton status-details endpoint shown in the next section.

Per-replica readiness

For container-level conditions and pod phase per replica, use the dedicated status-details endpoint. This is where readiness conditions, container restart counts, and per-replica failure reasons live.

curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/protons/${PROTON_ID}/statusDetails" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" \
  | jq '.replicas[] | {name, status, conditions, containers}'

A 204 response means no status snapshot has arrived yet; retry shortly.
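The retry on 204 can be scripted. The following is a minimal sketch, assuming a 200 response carries the snapshot body and that any status other than 200 or 204 should stop the loop; the function name, attempt count, and delay are hypothetical choices, not platform defaults.

```shell
# Poll statusDetails until a snapshot exists (HTTP 204 means none yet).
# ASSUMPTION: wait_for_status_details and its defaults (10 attempts,
# 5 s apart) are hypothetical; tune them for your environment.
wait_for_status_details() {
  workload_id=$1
  proton_id=$2
  max_attempts=${3:-10}
  delay_seconds=${4:-5}
  body_file=$(mktemp)
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -o saves the body to a file; -w prints only the HTTP status code.
    http_code=$(curl -s -o "$body_file" -w '%{http_code}' \
      "${DATAROBOT_ENDPOINT}/workloads/${workload_id}/protons/${proton_id}/statusDetails" \
      -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}")
    if [ "$http_code" = "200" ]; then
      cat "$body_file"
      rm -f "$body_file"
      return 0
    fi
    if [ "$http_code" != "204" ]; then
      echo "unexpected HTTP status: $http_code" >&2
      rm -f "$body_file"
      return 1
    fi
    # Still 204: wait before the next attempt, unless this was the last one.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay_seconds"
    fi
    attempt=$((attempt + 1))
  done
  rm -f "$body_file"
  echo "no status snapshot after $max_attempts attempts" >&2
  return 1
}
```

The output can be piped into the same jq filter shown above, for example wait_for_status_details "${WORKLOAD_ID}" "${PROTON_ID}" | jq '.replicas[] | {name, status}'.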

Application logs

Container output gives application-level context that complements lifecycle events and per-replica status. stdout and stderr are captured automatically at every lifecycle stage—startup, running, and errored—and surface on the Workload's Activity log > Logs tab. See View Workload logs for how to filter, search, and copy log output.

Workload stuck in launching

A Workload stays in launching while the pod is scheduled but not all containers have passed readiness, or while cluster resources are still being allocated (provisioning). If it does not progress, check these common causes:

Cause | What to check
Wrong readinessProbe.path or non-2xx during warmup | Confirm the path returns 2xx when the app is ready and matches the container port.
resourceAllocation exceeds cluster capacity | Ensure a node can schedule the pod; watch for long Pending with insufficient CPU, memory, or GPU.
Sidecar still starting without its own probe | Add a readiness probe to sidecars that take time to boot so they do not block the primary container from passing readiness.
Long installs or model downloads | Raise initialDelaySeconds or add a startupProbe so probes do not fail the Workload early.

Check events, then per-replica statusDetails, for unmet conditions.

Workload reports errored

A Workload reports errored when at least one container is in CrashLoopBackOff or ImagePullBackOff, or when the pod has entered phase Failed. Check these common causes:

Cause | What to check
Bad imageUri or unreachable registry | Look for ImagePullBackOff in statusDetails and verify registry auth and network paths.
Container exits non-zero on startup | Inspect statusDetails.logTail and OpenTelemetry logs for startup exceptions.
Out-of-memory kill | Raise resourceAllocation.memory or fix a memory leak in the application.
Sidecar misconfiguration | Predicates take the most severe container state across the entire pod: a single sidecar in CrashLoopBackOff or ImagePullBackOff marks the whole pod as unhealthy and surfaces the Workload as errored.

errored is sticky: the Workload won't return to running until the failing pod is replaced or the failing container starts succeeding. Fix the underlying cause and allow the platform to restart the container, or trigger a replacement (see Replace and roll out).