Best practices and troubleshooting

This page collects recommendations for designing containers, hardening production Workloads, handling secrets, and recovering from common failures. The conceptual background is outlined in Artifact concepts, Workload concepts, and Lifecycle states.

Best practices

Consider the following when creating and running containerized workloads.

Container design

When working with containers, the platform polls a readiness probe to decide when a container can receive traffic. Liveness and startup probes are optional but recommended for resilient services.

Practice	Why it matters
Implement `readinessProbe` on every primary container, plus `livenessProbe` and `startupProbe` where they help.	Readiness gates the transition to `running`. Liveness restarts wedged-but-still-running containers. Startup gives slow boots more time before the other probes take over. See Container health and readiness.
Tune probe timing for slow-starting Workloads.	`ProbeConfig` defaults are 30 s for `initialDelaySeconds`, `periodSeconds`, and `timeoutSeconds`. Slow-starting containers may need a higher `initialDelaySeconds` or a dedicated `startupProbe`.
Right-size `resourceAllocation` (`cpu`, `memory`, `gpu`) per container.	Set on `runtime.containerGroups[].containers[].resourceAllocation`.
Keep sidecars healthy.	Predicates examine every container in the pod, including sidecars. A sidecar in `ImagePullBackOff` reports the Workload as `errored` even if the primary container is fine; give sidecars their own readiness probes when they take time to start.
Know what setting `entrypoint` in the spec overrides.	Setting `entrypoint` in the artifact spec replaces both `ENTRYPOINT` and `CMD` from the Dockerfile. If the image relies on `CMD` for default arguments alongside a fixed `ENTRYPOINT`, include the full command in the spec's `entrypoint` array.
Pin image tags and never use `:latest`.	The platform uses `imagePullPolicy: IfNotPresent`, so a cached `:latest` image may not be re-pulled when you push a new version. Pinning to a specific digest or version tag (for example, `myimage:1.4.2`) ensures that every deployment gets the image you expect.
Use `localhost` for communication between containers in a group.	Containers in the same `containerGroup` run on the same node and can reach each other through `localhost:<port>`. Only the port of the container set to `primary: true` is exposed externally. Use this for sidecar patterns (embedder + vector database, agent + cache) that need no extra network configuration.
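The pinning rule in the table can be checked before an artifact is submitted. The following is a minimal sketch in Python. `is_pinned` is a hypothetical helper, not part of the Workload API, and it only inspects the reference string rather than contacting a registry.

```python
def is_pinned(image_uri: str) -> bool:
    """Return True when the image reference names a digest or a non-latest tag."""
    # A digest reference (name@sha256:...) always identifies exactly one image.
    if "@sha256:" in image_uri:
        return True
    # Only the last path segment can carry a tag; a colon earlier in the
    # string belongs to a registry host:port.
    last_segment = image_uri.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag means the implicit :latest
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")


if __name__ == "__main__":
    for ref in ("myimage:1.4.2", "myimage:latest", "registry:5000/myimage"):
        print(ref, is_pinned(ref))
```

A check like this could run in CI against every `imageUri` in a spec, so an unpinned reference is rejected before it reaches the platform.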

Production hardening

Locked artifacts and explicit governance are the difference between a quick draft and a production deployment. Consider these best practices:

Practice	Why it matters
Lock the artifact before serving production traffic.	Locked artifacts are immutable and can back unlimited Workloads. Locking is one-way: send `PATCH /artifacts/{id}` with `{"status": "locked"}`, or use `POST /workloads/{id}/promote` to lock in place. See Promote to production.
Set `importance` deliberately on locked Workloads.	`importance` defaults to `low`; set it explicitly for production Workloads. It is a priority hint the platform uses for resource prioritization and operational triage under cluster contention, and does not affect routing, autoscaling, or QoS guarantees.
Configure autoscaling to match traffic.	See Scaling metrics for predefined `scalingMetric` values and scale-to-zero behavior. Only `httpRequestsConcurrency` can scale the replica count down to zero when the proton is idle (`minReplicaCount: 0`).
Use resource bundles for Workload-level resource selection.	Per-container `resourceAllocation` (under `runtime.containerGroups[].containers[]`) declares what each container gets; `runtime.containerGroups[].resourceBundles` selects a platform bundle for the group. See Runtime settings.
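Because locking is one-way, it can help to build and review the request before sending it. The sketch below constructs the `PATCH /artifacts/{id}` call with Python's standard library and does not send it. `build_lock_request` is a hypothetical helper, the endpoint and IDs are placeholders, and the `Content-Type` header is an assumption. The `Authorization: Bearer` header follows the curl examples on this page.

```python
import json
import urllib.request


def build_lock_request(endpoint: str, artifact_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the PATCH request that locks an artifact."""
    body = json.dumps({"status": "locked"}).encode("utf-8")
    return urllib.request.Request(
        url=f"{endpoint.rstrip('/')}/artifacts/{artifact_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            # Assumption: the API expects a JSON content type.
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder values; locking is one-way, so nothing is sent here.
    req = build_lock_request("https://example.invalid/api", "ARTIFACT_ID", "TOKEN")
    print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the call; keeping construction separate makes the irreversible step explicit.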

Security

Container images, secrets, and probe traffic all benefit from sensible defaults.

Practice	Why it matters
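Use publicly accessible images, or check with your administrator about private registry access.	The platform pulls images from the registry at scheduling time, not at artifact creation. Images in private repositories that require authentication fail with `ImagePullBackOff` unless your DataRobot administrator has configured an image pull secret for that registry at the platform level. There is no per-artifact or per-end-user credential field for private registries. When in doubt, use a publicly accessible image.
Inject secrets through credential-backed environment variables rather than hard-coded values.	Container environment variables are a discriminated union on the `source` field; see Environment variable types for details. Use `CredentialEnvironmentVariable` (`source: "dr-credential"`) with `drCredentialId` and `key` to look up the value from the DataRobot credentials service at runtime. Use `ApiKeyEnvironmentVariable` (`source: "api-key"`) for a per-Workload API token; it is scoped to the calling user and resolved automatically when the proton is created. `name` is optional and defaults to `DATAROBOT_API_TOKEN`; set it explicitly only when the container expects a different environment variable name. A plain `StringEnvironmentVariable` (`source: "string"`, or `source` omitted) is fine for non-sensitive configuration. Do not commit tokens, API keys, or database passwords as string values.
Remember that `StringEnvironmentVariable` values are visible in API responses.	String environment variables in the artifact spec are returned in plain text by `GET /artifacts/{id}`. Any user with API read access to the artifact can retrieve these values. Use `CredentialEnvironmentVariable` for sensitive information.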
Use `scheme: HTTPS` on probes when the container terminates TLS internally.	`ProbeConfig.scheme` accepts `HTTP` (default) or `HTTPS`. Match the container listener so probes don't fail the handshake before traffic arrives.
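To illustrate the three environment variable types side by side, the sketch below builds one of each as a Python dictionary and flags plain-string variables whose names look like secrets. The `source`, `name`, `drCredentialId`, and `key` fields come from the table above. The `value` field name, the sample values, the name pattern, and the `risky_string_vars` helper are assumptions made for illustration.

```python
import re

# Assumption: name fragments that usually indicate a secret.
SECRET_NAME = re.compile(r"(TOKEN|SECRET|PASSWORD|API_?KEY)", re.IGNORECASE)


def risky_string_vars(env_vars: list) -> list:
    """Return names of plain-string variables whose names look like secrets."""
    risky = []
    for var in env_vars:
        # Omitting `source` means a StringEnvironmentVariable.
        is_string = var.get("source", "string") == "string"
        if is_string and SECRET_NAME.search(var.get("name", "")):
            risky.append(var["name"])
    return risky


env_vars = [
    {"source": "string", "name": "LOG_LEVEL", "value": "info"},
    {"source": "dr-credential", "name": "DB_PASSWORD",
     "drCredentialId": "CREDENTIAL_ID", "key": "password"},
    {"source": "api-key"},  # name defaults to DATAROBOT_API_TOKEN
    {"name": "VENDOR_API_KEY", "value": "do-not-do-this"},
]

if __name__ == "__main__":
    print(risky_string_vars(env_vars))
```

A name-based check is only a heuristic. It catches obvious mistakes, but it cannot tell whether a value is actually sensitive.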

Troubleshooting

When a Workload misbehaves, use the events stream first, then drill into protons and per-replica detail.

Inspect a Workload

The fastest signal is the Workload object itself, then the lifecycle event log.

# Workload-level summary
curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '{status, importance, artifactId, replacement}'

# Lifecycle events: what changed and when
curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/events" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}"

Drill into protons

List the protons backing a Workload to see active and candidate instances and their current status.

curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/protons" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.data[] | {id, status, role}'

For container-level conditions, readiness, restart counts, and the startup logTail, call the per-proton status-details endpoint shown in the next section.

Per-replica readiness

For container-level conditions and pod phase per replica, use the dedicated status-details endpoint. This is where readiness conditions, container restart counts, and per-replica failure reasons live.

curl -s "${DATAROBOT_ENDPOINT}/workloads/${WORKLOAD_ID}/protons/${PROTON_ID}/statusDetails" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" \
  | jq '.replicas[] | {name, status, conditions, containers}'

A 204 response means no status snapshot has arrived yet; retry shortly.
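The retry-on-204 behavior can be wrapped in a small polling loop. This is a minimal sketch: `wait_for_status_details` is a hypothetical helper, and `fetch` stands in for whatever HTTP client performs the `statusDetails` call and returns a `(status_code, body)` pair. The attempt count and delay are arbitrary defaults, not platform recommendations.

```python
import time
from typing import Callable, Optional, Tuple


def wait_for_status_details(
    fetch: Callable[[], Tuple[int, Optional[dict]]],
    attempts: int = 5,
    delay_seconds: float = 2.0,
) -> Optional[dict]:
    """Call `fetch` until it stops returning 204, or attempts run out."""
    for attempt in range(attempts):
        status_code, body = fetch()
        if status_code != 204:
            return body
        # 204: no snapshot yet. Wait before the next try, except after the last one.
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return None


if __name__ == "__main__":
    # Stand-in for the real HTTP call: two empty responses, then a snapshot.
    responses = iter([(204, None), (204, None), (200, {"replicas": []})])
    print(wait_for_status_details(lambda: next(responses), delay_seconds=0))
```

The loop returns the body for any status other than 204, so a real `fetch` should raise or otherwise handle 4xx and 5xx responses itself.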

Application logs

Container output gives application-level context that complements lifecycle events and per-replica status. stdout and stderr are captured automatically at every lifecycle stage—startup, running, and errored—and surface on the Workload's Activity log > Logs tab. See View Workload logs for how to filter, search, and copy log output.

Workload stuck in `launching`

A Workload stays in `launching` while the pod is scheduled but not all containers have passed readiness, or while cluster resources are still being allocated (provisioning). Check these common causes:

Cause	What to check
Wrong `readinessProbe.path` or non-2xx during warmup	Confirm the path returns 2xx when the app is ready and matches the container port.
`resourceAllocation` exceeds cluster capacity	Ensure a node can schedule the pod; watch for long `Pending` with insufficient CPU, memory, or GPU.
Sidecar still starting without its own probe	Add a readiness probe to sidecars that take time to boot so they do not block the primary container from passing readiness.
Long installs or model downloads	Raise `initialDelaySeconds` or add a `startupProbe` so probes do not fail the Workload early.

Check events, then per-replica statusDetails, for unmet conditions.

Workload reports `errored`

A Workload reports `errored` when at least one container is in `CrashLoopBackOff` or `ImagePullBackOff`, or when the pod has entered phase `Failed`. Check these common causes:

Cause	What to check
Bad `imageUri` or inaccessible registry	Image pull failures (`ErrImagePull`, `ImagePullBackOff`) are infrastructure-level errors and do not appear on the Activity log > Logs tab. Look for them in per-replica `statusDetails` and confirm that the image is publicly accessible.
Wrong image architecture	The platform runs on `linux/amd64`. A plain `docker build` on Apple Silicon produces an ARM image, which shows `exec format error` in `statusDetails` and falls into a crash loop. Build with `docker buildx build --platform linux/amd64` to produce the correct architecture.
Container exits non-zero on startup	Inspect `statusDetails.logTail` and OpenTelemetry logs for startup exceptions.
Out-of-memory kill	Raise `resourceAllocation.memory` or fix a memory leak in the application.
Sidecar misconfiguration	Predicates take the most severe container state across the entire pod—a single sidecar in `CrashLoopBackOff` or `ImagePullBackOff` marks the whole pod as unhealthy and surfaces the Workload as `errored`.

errored is sticky: the Workload won't return to running until the failing pod is replaced or the failing container starts succeeding. Fix the underlying cause and allow the platform to restart the container, or trigger a replacement (see Replace and roll out).
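The "most severe container state wins" rule can be pictured with a toy model. This sketch is not the platform's predicate implementation. `workload_status` is a hypothetical function, the `(state, ready)` input shape is an assumption, and the model ignores pod phase and the stickiness of `errored`. It only shows why a failing sidecar outweighs a healthy primary container.

```python
# States named on this page that mark a container as failed.
FAILED_STATES = {"CrashLoopBackOff", "ImagePullBackOff"}


def workload_status(containers: dict) -> str:
    """Toy model: map per-container (state, ready) pairs to a Workload status."""
    # Any failed container, primary or sidecar, marks the Workload errored.
    if any(state in FAILED_STATES for state, _ready in containers.values()):
        return "errored"
    # Otherwise every container must be ready before the Workload is running.
    if all(ready for _state, ready in containers.values()):
        return "running"
    return "launching"


if __name__ == "__main__":
    pod = {
        "primary": ("Running", True),
        "cache-sidecar": ("ImagePullBackOff", False),
    }
    print(workload_status(pod))
```

In this model, fixing the sidecar's image reference is the only way to move the result away from `errored`, which mirrors the guidance to fix the underlying cause rather than the primary container.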

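The /events endpoint does not show crash details

/events records DataRobot lifecycle transitions (state changes, replacements, promotions); it does not include crash-restart events or image pull failures. For a Workload in the errored state, go directly to GET /workloads/{id}/protons/{proton_id}/statusDetails to check container restart counts, failure reasons, and the log tail.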

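Replacement failed or stuck

If a replacement reaches errored, or the candidate proton stays in warming/initializing, check these common causes:

Cause	What to check
Candidate readiness probe failing	The candidate proton must pass all readiness probes before the traffic switch. Check `GET /workloads/{id}/protons` for the candidate's status and `GET /workloads/{id}/protons/{proton_id}/statusDetails` for per-replica detail.
Candidate image pull failure	A bad `imageUri` on the new artifact (or a sidecar) puts the candidate instance in `ImagePullBackOff` while the active proton keeps running. Fix the image reference and start a new replacement.
Candidate crash loop	The new artifact's container crashes on startup. Check `statusDetails` for the candidate proton's `logTail` and restart count.

To recover from a failed replacement, fix the underlying problem in the artifact and run POST /workloads/{id}/replacement again with the corrected artifactId. To revert to the previous artifact, run POST /replacement with the original artifactId.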

Error responses

The API returns a structured error body for 4xx and 5xx responses. Validation errors (422) take the following form:

{
  "detail": [
    {
      "path": "artifact.spec.containerGroups.0.containers.0.port",
      "message": "ensure this value is greater than or equal to 1024",
      "code": "value_error"
    }
  ]
}
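A client can flatten this body into one readable line per violated field. The sketch below uses only the `detail`, `path`, `message`, and `code` fields shown in the example above. `format_validation_errors` is a hypothetical helper, and the output format is an arbitrary choice.

```python
import json


def format_validation_errors(body: dict) -> list:
    """Turn a 422 error body into one readable line per violated field."""
    lines = []
    for item in body.get("detail", []):
        lines.append(f"{item['path']}: {item['message']} [{item['code']}]")
    return lines


if __name__ == "__main__":
    raw = """{"detail": [{"path": "artifact.spec.containerGroups.0.containers.0.port",
      "message": "ensure this value is greater than or equal to 1024",
      "code": "value_error"}]}"""
    for line in format_validation_errors(json.loads(raw)):
        print(line)
```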

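Common error codes

Status	When it occurs
`400`	The request body is malformed.
`401`	The authentication token is missing or invalid.
`403`	Permissions on the resource are insufficient, or an account limit has been exceeded. `GET /account/info/` returns the account's `maxConcurrentWorkloads` and `maxWorkloadReplicas`; a value of `0` means unlimited. Exceeding either limit returns `403`.
`404`	The resource was not found (or has been soft-deleted).
`409`	Conflict: for example, calling `PATCH /settings` while a replacement is in progress, or calling `DELETE /artifacts/{id}` on a locked artifact.
`422`	Validation error: a field constraint was violated (a port number below 1024, more than one primary container, a mismatched artifact status on replacement, and so on).
`502`	An upstream service (Public API, Covalent) is unavailable.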
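The account limits behind a `403` can be checked on the client before creating a Workload. This is a sketch under stated assumptions: `within_limit` and `can_create_workload` are hypothetical helpers, `account_info` stands for the parsed `GET /account/info/` response, `maxWorkloadReplicas` is treated as a per-Workload cap, and a request that exactly meets a limit is assumed to be allowed. Only the two field names and the "0 means unlimited" rule come from the table.

```python
def within_limit(limit: int, requested: int) -> bool:
    """Return True when `requested` fits under a limit; 0 means unlimited."""
    return limit == 0 or requested <= limit


def can_create_workload(account_info: dict, running_workloads: int, replicas: int) -> bool:
    """Check a planned Workload against both account limits before calling the API."""
    return (
        within_limit(account_info["maxConcurrentWorkloads"], running_workloads + 1)
        and within_limit(account_info["maxWorkloadReplicas"], replicas)
    )


if __name__ == "__main__":
    info = {"maxConcurrentWorkloads": 2, "maxWorkloadReplicas": 0}
    print(can_create_workload(info, running_workloads=1, replicas=50))
    print(can_create_workload(info, running_workloads=2, replicas=1))
```

A pre-check like this only avoids an obvious `403`. The server remains the authority, so the caller still needs to handle that status.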