Best practices and troubleshooting
This page covers recommended practices for container design, production deployments, and security, along with troubleshooting steps for common issues.
Best practices
Container design
- Implement health checks: Always provide liveness, readiness, and startup probes.
- Set appropriate timeouts: GPU-heavy workloads need longer startup times.
- Use appropriate resource requests: Right-size CPU and memory to avoid overprovisioning.
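To make "longer startup times" concrete, the sketch below estimates how long a container has to start before its startup probe gives up. It assumes Kubernetes-style probe semantics, where the budget is roughly `initialDelaySeconds + periodSeconds * failureThreshold`. Whether probes on this platform follow exactly those semantics is an assumption, and `startup_budget_seconds` is a hypothetical helper.

```shell
# Hypothetical helper: estimate the startup budget of a Kubernetes-style
# startup probe. Assumes the platform follows Kubernetes probe semantics.
# Usage: startup_budget_seconds INITIAL_DELAY_SECONDS PERIOD_SECONDS FAILURE_THRESHOLD
startup_budget_seconds() {
  # The probe first runs after the initial delay, then may fail
  # FAILURE_THRESHOLD times, PERIOD_SECONDS apart, before the container
  # is considered failed.
  echo $(( $1 + $2 * $3 ))
}
```

For example, `startup_budget_seconds 10 10 30` gives a budget of a little over five minutes. If a GPU-heavy model takes longer than that to load, raise the threshold or period rather than relying on restarts.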
Production deployments
- Promote artifacts: Lock artifacts before production deployment for immutability.
- Enable autoscaling: Configure scaling policies appropriate for your traffic patterns.
- Use resource bundles: Leverage predefined resource bundles for consistent GPU allocation.
Security
- Use private registries: Store container images in secure, private registries.
- Avoid hardcoded secrets: Use environment variables or secret management.
- Implement HTTPS probes: Use `scheme: HTTPS` for health checks when appropriate.
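As an illustration of keeping secrets out of scripts, the sketch below reads credentials from the environment and fails fast when one is missing. The variable names are the ones used in the curl examples on this page. `require_env` is a hypothetical helper, not part of any DataRobot tooling.

```shell
# Hypothetical helper: succeed only if every named environment variable
# is set and non-empty. Pass trusted, literal variable names only,
# because the lookup uses eval.
require_env() {
  for name in "$@"; do
    # POSIX-compatible indirect lookup of the variable called $name.
    eval "value=\${${name}:-}"
    if [ -z "$value" ]; then
      echo "missing required environment variable: ${name}" >&2
      return 1
    fi
  done
  return 0
}
```

A script might call `require_env DATAROBOT_ENDPOINT DATAROBOT_API_TOKEN || exit 1` before making any API request, so a missing token surfaces as a clear error instead of a failed request.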
Troubleshooting
Checking workload status details
When a workload enters an unexpected state, the statusDetails field provides diagnostic information:
curl -s -X GET "${DATAROBOT_ENDPOINT}/console/workloads/{workloadId}" \
-H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.statusDetails'
The statusDetails object contains two key fields:
| Field | Description |
|---|---|
| `conditions` | Array of Kubernetes-style conditions indicating component states. |
| `logTail` | Array of recent container log lines captured during startup. |
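A minimal sketch of filtering `conditions` down to the entries that need attention. It assumes each condition carries `type`, `status`, and an optional `message` field, as Kubernetes conditions do. The page does not document the exact shape, so treat those field names as assumptions. `failing_conditions` is a hypothetical helper that reads a `statusDetails` object on stdin.

```shell
# Hypothetical helper: print one line per condition whose status is not "True".
# Assumes Kubernetes-style fields (type, status, message) on each condition.
failing_conditions() {
  jq -r '
    (.conditions // [])[]
    | select(.status != "True")
    | "\(.type)=\(.status)" + (if .message then " (\(.message))" else "" end)
  '
}
```

Piping the output of the curl command above into `failing_conditions` prints only the conditions that are not satisfied. Empty output means every reported condition is `True`.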
Common issues
Workload stuck in initializing status
- Check `statusDetails.conditions` for scheduling issues.
- Verify the container image is accessible from the cluster.
- Verify resource requests don't exceed cluster capacity.
- Review startup probe configuration.
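To tell a slow start from a genuinely stuck workload, it can help to poll the status with an upper bound. The sketch below assumes the API reports the status as the literal string `initializing` in a top-level `status` field, and that `WORKLOAD_ID` holds the workload ID. Both are assumptions, and the function names are hypothetical.

```shell
# Hypothetical helper: poll until the status is no longer "initializing".
# Usage: poll_until_not_initializing MAX_ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
# COMMAND must print the current workload status on stdout.
poll_until_not_initializing() {
  max_attempts=$1
  sleep_seconds=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    current=$("$@") || return 2   # the fetch command itself failed
    if [ "$current" != "initializing" ]; then
      echo "$current"
      return 0
    fi
    sleep "$sleep_seconds"
    attempt=$((attempt + 1))
  done
  echo "still initializing after ${max_attempts} attempts" >&2
  return 1
}

# Example fetcher based on the curl command used on this page.
# WORKLOAD_ID is an assumed variable holding the workload ID.
fetch_workload_status() {
  curl -s -X GET "${DATAROBOT_ENDPOINT}/console/workloads/${WORKLOAD_ID}" \
    -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq -r '.status'
}
```

For example, `poll_until_not_initializing 30 10 fetch_workload_status` waits up to about five minutes. A non-zero exit is the cue to work through the checklist above.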
Workload enters errored status
When a workload fails, start by inspecting statusDetails:
curl -s -X GET "${DATAROBOT_ENDPOINT}/console/workloads/{workloadId}" \
-H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '{status, statusDetails}'
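The raw JSON can be hard to scan when `logTail` is long. As a sketch, the hypothetical helper below reads the full workload response on stdin and prints the status followed by the captured log lines, one per line. It assumes `status` and `statusDetails` are top-level fields, as in the `jq` filter above.

```shell
# Hypothetical helper: print the workload status and its startup log tail
# in a readable form. Reads the full workload JSON response on stdin.
print_status_and_logs() {
  jq -r '
    "status: \(.status)",
    ((.statusDetails.logTail // [])[] | "  \(.)")
  '
}
```

To use it, replace the final `jq '{status, statusDetails}'` stage of the command above with `print_status_and_logs`.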