Best practices and troubleshooting

This page covers recommended practices for container design, production deployments, and security, along with troubleshooting steps for common issues.

Best practices

Container design

  • Implement health checks: Always provide liveness, readiness, and startup probes.
  • Set appropriate timeouts: GPU-heavy workloads often take longer to start, so give startup probes enough time before the container is marked as failed.
  • Use appropriate resource requests: Right-size CPU and memory to avoid overprovisioning.

Production deployments

  • Promote artifacts: Lock artifacts before production deployment for immutability.
  • Enable autoscaling: Configure scaling policies appropriate for your traffic patterns.
  • Use resource bundles: Leverage predefined resource bundles for consistent GPU allocation.

Security

  • Use private registries: Store container images in secure, private registries.
  • Avoid hardcoded secrets: Use environment variables or secret management.
  • Implement HTTPS probes: Use scheme: HTTPS for health checks when appropriate.

Troubleshooting

Checking workload status details

When a workload enters an unexpected state, the statusDetails field provides diagnostic information:

curl -s -X GET "${DATAROBOT_ENDPOINT}/console/workloads/{workloadId}" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.statusDetails'

The statusDetails object contains two key fields:

Field        Description
conditions   Array of Kubernetes-style conditions indicating component states.
logTail      Array of recent container log lines captured during startup.
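
To narrow a long conditions array down to the entries that need attention, a small jq filter can help. The sketch below is illustrative only: the helper name failing_conditions is hypothetical, and it assumes each condition carries the usual Kubernetes-style keys (type, status, reason, message), which this page does not confirm. Check the keys in your own response before relying on it.

```shell
# Hypothetical helper: reads a workload JSON document on stdin and prints
# one tab-separated line per condition whose status is not "True".
# ASSUMPTION: conditions use Kubernetes-style keys (type, status, reason, message).
failing_conditions() {
  jq -r '
    (.statusDetails.conditions // [])[]
    | select(.status != "True")
    | "\(.type)\t\(.reason // "-")\t\(.message // "-")"
  '
}
```

Pipe the output of the curl command above (without the trailing jq call) into failing_conditions to list only the conditions that are not satisfied.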

Common issues

Workload stuck in initializing status

  • Check statusDetails.conditions for scheduling issues.
  • Verify the container image is accessible from the cluster.
  • Verify resource requests don't exceed cluster capacity.
  • Review startup probe configuration.
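
When working through the checks above, it can be useful to poll the workload rather than re-running the request by hand. The following is a minimal sketch, not an official tool: the function name wait_for_workload and its attempt and delay defaults are hypothetical, and it assumes the top-level status field reports the literal string "initializing" while the workload is starting.

```shell
# Hypothetical helper: polls a workload until it leaves the initializing
# status, then prints the new status. Gives up after a fixed number of tries.
# Arguments: workload ID, max attempts (default 30), delay in seconds (default 10).
# ASSUMPTION: the status field contains the literal string "initializing".
wait_for_workload() {
  workload_id=$1
  max_attempts=${2:-30}
  delay=${3:-10}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    status=$(curl -s -X GET "${DATAROBOT_ENDPOINT}/console/workloads/${workload_id}" \
      -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq -r '.status')
    if [ "$status" != "initializing" ]; then
      echo "$status"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "still initializing after ${max_attempts} attempts" >&2
  return 1
}
```

A non-zero return code means the workload never left the initializing status within the allotted attempts, which is a reasonable point to inspect statusDetails.conditions.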

Workload enters errored status

When a workload fails, start by inspecting statusDetails:

curl -s -X GET "${DATAROBOT_ENDPOINT}/console/workloads/{workloadId}" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '{status, statusDetails}'
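
Because logTail is an array of log lines, it is often easier to read as plain text than as JSON. The helper below is a small sketch under that assumption; the name print_log_tail and the default of 20 lines are hypothetical choices, not part of the API.

```shell
# Hypothetical helper: reads a workload JSON document on stdin and prints the
# last N entries of statusDetails.logTail as plain text (default: 20 lines).
# Prints nothing if statusDetails or logTail is absent.
print_log_tail() {
  count=${1:-20}
  jq -r '(.statusDetails.logTail // [])[]' | tail -n "$count"
}
```

Pipe the raw response from the curl command above (without the trailing jq call) into print_log_tail, optionally passing the number of lines to show.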