Custom Models Configuration¶
This section describes the configuration options available for Custom Models, including image building, model deployment (LRS) settings, probes, GPU support, and high availability.
Dependencies¶
- Image build service installed.
- NGINX ingress controller installed with a valid hostname for the NGINX service (with HTTPS trusted by DR Core).
Configuration Values¶
To configure these options, refer to the Tuning DataRobot Environment Variables section of this guide.
| Setting | Description | Default |
|---|---|---|
| IMAGE_BUILDER_SERVICE_URL | URL for the image build service | "http://build-service" |
| IMAGE_BUILDER_CUSTOM_MODELS_REGISTRY_HOST | Hostname of the registry used for custom model images | DOCKER_REGISTRY_URL installation variable |
| IMAGE_BUILDER_CUSTOM_MODELS_REGISTRY_REPO | Repository in the registry for custom model images | managed-image |
| IMAGE_BUILDER_EPHEMERAL_CUSTOM_MODELS_REGISTRY_REPO | Repository in the registry for ephemeral custom model images | ephemeral-image |
| IMAGE_BUILDER_CUSTOM_MODELS_ENVIRONMENT_REGISTRY_REPO | Repository in the registry for environment images | base-image |
| IMAGE_BUILDER_WORKERS | Number of concurrent image build workers | 4 |
| CUSTOM_MODELS_FQDN | NGINX Ingress controller service hostname(1) | Value of global.domain in the umbrella chart |
| LONG_RUNNING_SERVICES_OPERATOR_INSTANCE_NAME | Namespace where the LRS operator / chart is installed | Namespace of the DR Core installation |
| LONG_RUNNING_SERVICES_RESOURCES_NAMESPACE | Namespace where the model deployments are created | Value of LONG_RUNNING_SERVICES_OPERATOR_INSTANCE_NAME |
| LONG_RUNNING_SERVICES_SECURITY_CONTEXT_USER_ID | User ID for built model images and runAsUser on the deployment | 1000 |
| LONG_RUNNING_SERVICES_SECURITY_CONTEXT_GROUP_ID | Group ID for built model images and runAsGroup on the deployment | 1000 |
| LONG_RUNNING_SERVICES_READY_TIMEOUT_SECONDS | Number of seconds to wait for an LRS resource to become ready. This time can include waiting for a new node to be spun up (if cluster auto-scaling is supported), for the image to be pulled (if it is not already cached on the node), and finally for any configured health checks to succeed. | 1920 |
| LRS_CUSTOM_MODEL_STARTUP_TIMEOUT_SECONDS | Number of seconds to wait for the LRS health status to return success(2) | 960 |
| LRS_CUSTOM_MODEL_STARTUP_PROBE_DISABLED | Boolean to disable the startup probe | False |
| LRS_CUSTOM_MODEL_STARTUP_PROBE_PORT | Port for the startup probe | 8080 |
| LRS_CUSTOM_MODEL_STARTUP_PROBE_INITIAL_DELAY_SECONDS | How long the startup probe waits before sending its first request | 1 |
| LRS_CUSTOM_MODEL_STARTUP_PROBE_TIMEOUT_SECONDS | Timeout for each startup probe request | 5 |
| LRS_CUSTOM_MODEL_STARTUP_PROBE_PERIOD_SECONDS | Period between consecutive startup probe requests | 10 |
| LRS_CUSTOM_MODEL_STARTUP_PROBE_SUCCESS_THRESHOLD | Number of successful responses required to consider the container started | 1 |
| LRS_CUSTOM_MODEL_STARTUP_PROBE_FAILURE_THRESHOLD | Number of failed responses after which the container is considered failed | 360 |
| LRS_CUSTOM_MODEL_READINESS_PROBE_DISABLED | Boolean to disable the readiness probe | False |
| LRS_CUSTOM_MODEL_READINESS_PROBE_PORT | Port for the readiness probe | 8080 |
| LRS_CUSTOM_MODEL_READINESS_PROBE_INITIAL_DELAY_SECONDS | How long the readiness probe waits before sending its first request | 1 |
| LRS_CUSTOM_MODEL_READINESS_PROBE_TIMEOUT_SECONDS | Timeout for each readiness probe request | 5 |
| LRS_CUSTOM_MODEL_READINESS_PROBE_PERIOD_SECONDS | Period between consecutive readiness probe requests | 5 |
| LRS_CUSTOM_MODEL_READINESS_PROBE_SUCCESS_THRESHOLD | Number of successful responses required to consider the container ready | 1 |
| LRS_CUSTOM_MODEL_READINESS_PROBE_FAILURE_THRESHOLD | Number of failed responses after which the container is considered failed | 6 |
| LRS_CUSTOM_MODEL_LIVENESS_PROBE_DISABLED | Boolean to disable the liveness probe | False |
| LRS_CUSTOM_MODEL_LIVENESS_PROBE_PORT | Port for the liveness probe | 8080 |
| LRS_CUSTOM_MODEL_LIVENESS_PROBE_INITIAL_DELAY_SECONDS | How long the liveness probe waits before sending its first request | 3 |
| LRS_CUSTOM_MODEL_LIVENESS_PROBE_TIMEOUT_SECONDS | Timeout for each liveness probe request | 5 |
| LRS_CUSTOM_MODEL_LIVENESS_PROBE_PERIOD_SECONDS | Period between consecutive liveness probe requests | 10 |
| LRS_CUSTOM_MODEL_LIVENESS_PROBE_SUCCESS_THRESHOLD | Number of successful responses required to consider the container alive | 1 |
| LRS_CUSTOM_MODEL_LIVENESS_PROBE_FAILURE_THRESHOLD | Number of failed responses after which the container is considered failed | 30 |
| LRS_CUSTOM_MODEL_DEFAULT_NIM_HEALTH_ENDPOINT | Default NVIDIA health check endpoint used for all probes if a custom endpoint is not provided through the templates | /v1/health/ready |
| LRS_CUSTOM_MODEL_DEFAULT_NIM_PORT | Default NVIDIA health check port used for all probes if a custom port is not provided through the templates | 8000 |
| ENABLE_PDB_FOR_CUSTOM_MODELS_LRS_WITH_MULTIPLE_REPLICAS | Boolean to enable a PDB for newly created model deployment LRSes with more than one replica | False |
| ENABLE_HA_FOR_ALL_MODEL_DEPLOYMENTS | Boolean to enable HA affinity rules for all newly created model deployment LRSes | False |
(1) See Ingress Settings for details on ingress-nginx. The hostname must support HTTPS.
(2) The LRS_CUSTOM_MODEL_STARTUP_TIMEOUT_SECONDS variable was removed. Instead, use the LRS_CUSTOM_MODEL_STARTUP_PROBE_PERIOD_SECONDS and LRS_CUSTOM_MODEL_STARTUP_PROBE_FAILURE_THRESHOLD variables to define the startup timeout (startup_timeout_seconds = startup_probe_period_seconds x startup_probe_failure_threshold).
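As an illustration, a values.yaml sketch setting a couple of these variables, assuming environment variables are supplied through core.config_env_vars as in the GPU Configuration example later in this section; the values shown are examples only, not recommendations:

core:
  config_env_vars:
    # Hostname served by the NGINX ingress controller (must support HTTPS); example hostname
    CUSTOM_MODELS_FQDN: "models.example.com"
    # Effective startup timeout = period x failure threshold = 15 x 360 = 5400 seconds
    LRS_CUSTOM_MODEL_STARTUP_PROBE_PERIOD_SECONDS: "15"
    LRS_CUSTOM_MODEL_STARTUP_PROBE_FAILURE_THRESHOLD: "360"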
Image Build Performance Tuning¶
The image build worker (queue-exec-manager-build-service) starts build jobs by making calls to the image build service. Relevant configurations and chart values for image building:
- By default, each worker pod runs a maximum of four concurrent builds. The concurrency level can be tuned using the IMAGE_BUILDER_WORKERS environment variable listed in the Configuration Values section.
- The replica count for queue-exec-manager-build-service can be changed by setting the value queue-exec-manager.component.build-service.replicaCount to the desired amount at install time in the umbrella chart (see the example below).
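A minimal values.yaml sketch combining both knobs, assuming environment variables are set through core.config_env_vars as in the GPU Configuration example below; the numbers are illustrative, not recommendations:

# umbrella chart values.yaml excerpt (illustrative values)
core:
  config_env_vars:
    IMAGE_BUILDER_WORKERS: "4"   # max concurrent builds per worker pod
queue-exec-manager:
  component:
    build-service:
      replicaCount: 2            # number of image build worker pods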
The image build service manages resources and timeout for build pods. The resources can be tuned by setting the build-service.imageBuilder values in the umbrella chart. The following are the relevant values for tweaking the performance of image builds:
build-service:
  imageBuilder:
    podTimeout: "14400000"
    resources:
      limits:
        memory: "4G"
        cpu: "3"
      requests:
        memory: "4G"
        cpu: "3"
Prediction Server Performance Tuning¶
For installations that are mostly used for serving custom models as opposed to native DataRobot models, we recommend tuning the number of API workers to 2-3x the number of cores allocated to the prediction server, since the workers will be mostly I/O-heavy.
# datarobot umbrella chart values.yaml excerpt
prediction-server:
  component:
    server:
      # provided that prediction-server.component.server.computeResources.cpu is 2
      predictionApiWorkers: "6"
If usage is mixed between native and custom models and autoscaling is enabled for the prediction server, we recommend lowering the CPU threshold for smoother scaling:
# datarobot umbrella chart values.yaml excerpt
prediction-server:
  autoscaling:
    targetCPUUtilizationPercentage: 35
Service Account for Model Deployments¶
A default service account is used for all custom model deployments / LRS resources. The service account can be changed by setting the value lrs-operator.operator.config.defaultServiceAccountName at install time in the umbrella chart.
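A minimal values.yaml sketch, assuming a pre-existing service account; the name custom-models-sa is illustrative:

lrs-operator:
  operator:
    config:
      # Service account applied to all custom model deployments / LRS resources
      defaultServiceAccountName: "custom-models-sa"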
Network Access¶
Custom Models support the configuration of network isolation for models. This functionality is based on standard Kubernetes network policies. The cluster must be running a plugin that supports network policies for this to be enforced (e.g. Project Calico).
If you have Calico, Cilium, or anything else that enforces network policies, the DataRobot-provided policies for custom models require no further configuration. By default, a deny-all policy is created as part of installing custom models:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    ...
  labels:
    ...
  name: default-lrs-resource-denyall-policy
  namespace: ..
spec:
  podSelector:
    matchLabels:
      datarobot-type: lrs
  policyTypes:
  - Egress
  - Ingress
GPU Configuration¶
Starting from version 10.0, DataRobot supports inference on GPU in Custom Models as a premium entitlement. After ensuring all the steps below are complete, the customer also needs the ENABLE_CUSTOM_MODEL_GPU_INFERENCE feature flag (and its dependencies) enabled to use GPU resource bundles with custom models.
GPU Requirements¶
For simple deep learning models or LLMs that are no larger than 8 billion parameters, DataRobot recommends using the following hardware bundles:
| Bundle Name | GPU Device | GPU Memory | RAM | CPU Cores | Storage | VM Types |
|---|---|---|---|---|---|---|
| GPU - S | Nvidia T4 | 16GiB | 16GiB | 4 | 200GiB | Azure: Standard_NC4as_T4_v3 AWS: g4dn.xlarge Google Cloud: n1-standard-4 |
| GPU - M | Nvidia T4 or Nvidia L4 | 16GiB | 32GiB | 8 | 200GiB | Azure: Standard_NC8as_T4_v3 AWS: g4dn.2xlarge Google Cloud: g2-standard-8 |
| GPU - L | Nvidia A10 or Nvidia A100 | 24GiB | 32GiB | 8 | 200GiB | Azure: Standard_NC24ads_A100_v4 AWS: g5.2xlarge Google Cloud: a2-highgpu-1g |
If your hardware is substantially different from the recommended bundles above or you are interested in deploying LLMs that are larger than 8 billion parameters, please reach out to your DataRobot CFDS representative to identify an optimal configuration.
Enable the use of GPUs¶
After you have prepared your Kubernetes cluster and your GPU nodes (see
general GPU instructions), enable the usage of GPU nodes for
the Custom Models in the values.yaml file of the DataRobot helm chart. Please
use the configuration example below as a starting point.
core:
  config_env_vars:
    LRS_GPU_CONTAINER_SIZES: |-
      [
        {
          "id": "gpu.medium",
          "name": "GPU - M",
          "description": "1 x NVIDIA A10 | 24GB VRAM | 8 CPU | 32GB RAM",
          "gpu_maker": "nvidia",
          "gpu_type_label": "nvidia-a10g-1x",
          "gpu_count": 1,
          "gpu_memory_mb": 24576,
          "cpu_count": 8,
          "memory_mb": 32768,
          "use_cases": ["customModel"]
        },
        {
          "id": "h100.one",
          "name": "GPU - L",
          "description": "1 x NVIDIA H100 | 80GB VRAM | 8 CPU | 32GB RAM",
          "gpu_maker": "nvidia",
          "gpu_type_label": "nvidia-h100-1x",
          "gpu_count": 1,
          "gpu_memory_mb": 81920,
          "cpu_count": 8,
          "memory_mb": 32768,
          "use_cases": ["customModel"]
        },
        {
          "id": "h100.two",
          "name": "GPU - 2XL",
          "description": "2 x NVIDIA H100 | 160GB VRAM | 16 CPU | 64GB RAM",
          "gpu_maker": "nvidia",
          "gpu_type_label": "nvidia-h100-2x",
          "gpu_count": 2,
          "gpu_memory_mb": 163840,
          "cpu_count": 16,
          "memory_mb": 65536,
          "use_cases": ["customModel"]
        }
      ]
lrs-operator:
  operator:
    config:
      nodeGroups: |-
        {
          "labelToNodeGroupMap": {
            "datarobot-gpu-type=nvidia-h100-1x": "gpu-h100-1x",
            "datarobot-gpu-type=nvidia-h100-2x": "gpu-h100-2x",
            "datarobot-gpu-type=nvidia-a10g-1x": "gpu-a10-1x"
          },
          "nodeGroupsByName": {
            "gpu-h100-1x": {
              "taint": "nvidia.com/gpu=true:NoExecute",
              "affinityLabel": "intent=1xH100GpuWorker"
            },
            "gpu-h100-2x": {
              "taint": "nvidia.com/gpu=true:NoExecute",
              "affinityLabel": "intent=2xH100GpuWorker"
            },
            "gpu-a10-1x": {
              "taint": "nvidia.com/gpu=true:NoExecute",
              "affinityLabel": "intent=1xA10GpuWorker"
            }
          }
        }
Notes:
- Field id must be unique across all items in the list.
- Fields name and description are displayed in the UI.
- Field gpu_maker must be "nvidia".
- Field gpu_type_label must have a matching item under labelToNodeGroupMap.
- Keys under nodeGroupsByName should match the values in labelToNodeGroupMap.
- The value of the taint sub-field within nodeGroupsByName is applied as a toleration to the corresponding Kubernetes pods. If more than one toleration needs to be applied, the tolerations must be comma-separated.
- The value of the affinityLabel sub-field within nodeGroupsByName is applied as a requiredDuringSchedulingIgnoredDuringExecution node affinity. Only one affinity item is supported.
- Field gpu_memory_mb is used to help recommend appropriate resource bundles for a given LLM based on heuristics.
- LRS_GPU_CONTAINER_SIZES and nodeGroups are JSON strings, so take care that they are valid (i.e. no trailing commas). Incorrect syntax results in an error being logged, but the application will still start.
LLM Startup Tuning¶
In 10.2 we moved custom model startup to a dedicated queue so slow-to-start models do not block other operations in the application. As such, we believe we have tuned the LONG_RUNNING_SERVICES_READY_TIMEOUT_SECONDS and LRS_CUSTOM_MODEL_STARTUP_TIMEOUT_SECONDS configuration values appropriately. However, if custom models are being killed while they are still starting up, consider increasing these values. See the table above for a description of each configuration. The values can be changed on a live system using System Configuration or via values.yaml.
Note: the LRS probe configuration changed in the 11.2.0 release. Please see the Configuration Values section.
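For example, a values.yaml sketch (illustrative values only) that gives slow-starting models more time; on 11.2.0 and later, the startup window is determined by the startup probe period multiplied by the failure threshold:

core:
  config_env_vars:
    # Overall wait for the LRS resource to become ready (node spin-up, image pull, health checks)
    LONG_RUNNING_SERVICES_READY_TIMEOUT_SECONDS: "3600"
    # Startup window = 10s default period x 600 failures = 6000 seconds
    LRS_CUSTOM_MODEL_STARTUP_PROBE_FAILURE_THRESHOLD: "600"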
Serving LLMs over multiple GPUs¶
In 10.2 and beyond, it is possible to serve a single model spread across multiple GPUs, as long as the GPUs are on the same node. It is also recommended to serve an LLM over a homogeneous set of GPUs. To run an LLM over multiple GPUs, simply select a resource bundle with more than one GPU allocated (e.g. "gpu_count": 2); the GPUs will automatically be reserved by Kubernetes, and the inference server will make use of them.
Be aware that a single, large GPU will have better throughput performance than multiple small GPUs but you may see better latency characteristics using multiple, smaller GPUs.
Model-level training data assignment removal in 10.1¶
Starting in version 10.1, the model-level training data assignment API (api/v2/customModels/<custom model ID>/trainingData/) is removed after 3 deprecation cycles. Requests to the API return a 404. The new model version-level training data assignment must be used instead.
If your organization has critical flows using this API, it can be re-enabled for a 1-month grace period to migrate to the new logic. To enable the deprecated API, set the following feature flag configuration: DISABLE_CUSTOM_MODEL_DISABLE_MODEL_LEVEL_TRAINING_DATA_ASSIGNMENT = True.
High Availability scenarios for Custom Models¶
To enable high availability within the cluster, two configuration flags are available. These settings leverage Kubernetes Pod Disruption Budgets (PDBs) and pod anti-affinity rules.
Pod Disruption Budget Configuration¶
ENABLE_PDB_FOR_CUSTOM_MODELS_LRS_WITH_MULTIPLE_REPLICAS controls the application of PDB policies for custom model deployments. When enabled, any custom model with two or more replicas is assigned a PDB with maxUnavailable = 1. This ensures that Kubernetes will allow only a single replica to be taken down at any given time. By default, this setting is disabled.
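For reference, a sketch of the kind of PodDisruptionBudget this setting produces for a multi-replica model deployment; the object name is illustrative, and the selector reuses the datarobot-type: lrs label seen in the network policy above rather than the exact labels DataRobot applies:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-custom-model-pdb   # illustrative name
spec:
  maxUnavailable: 1                # only one replica may be voluntarily disrupted at a time
  selector:
    matchLabels:
      datarobot-type: lrs          # illustrative selector for the model deployment's pods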
Pod Anti-Affinity Configuration¶
Pod anti-affinity behavior is governed by the ENABLE_HA_FOR_ALL_MODEL_DEPLOYMENTS flag. When enabled, each replica of a custom model is scheduled onto a distinct node. The configuration uses requiredDuringSchedulingIgnoredDuringExecution, meaning Kubernetes can schedule only as many replicas as there are nodes available. For example, if a deployment requests four replicas but the cluster has only three nodes, only three replicas will be scheduled. By default, this setting is disabled.
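For reference, a sketch of the kind of pod anti-affinity rule this implies on each model deployment's pod spec; the selector label is illustrative, while the topology key is the standard Kubernetes hostname label:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: example-custom-model        # illustrative: selects replicas of one model deployment
      topologyKey: kubernetes.io/hostname  # forces each replica onto a distinct node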