Prepare your Kubernetes cluster to run GPU workloads¶
DataRobot can utilize GPUs in several services of the application, e.g. AutoML, Notebooks, OCR, and the Generative AI service.
All of these services use CUDA for GPU acceleration. There are a few steps you need to take to prepare your GPU nodes and the cluster to run CUDA workloads. While the following requirements apply to all of the services, also check the service-specific instructions, as these list additional requirements: Notebooks, Generative AI, OCR.
The following two sections outline the general setup and constraints. For detailed instructions specific to your vendor, refer to their documentation: Amazon EKS, Google GKE, Microsoft AKS, Red Hat OpenShift. Alternatively, you can install the NVIDIA GPU Operator, which takes care of installing the requirements on the nodes as well as in the cluster.
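If you choose the GPU Operator route, a minimal installation sketch with Helm looks like the following. This assumes Helm is installed and your kubeconfig points at the target cluster; the namespace name `gpu-operator` is a common convention, not a requirement.

```shell
# Add NVIDIA's Helm repository and refresh the local chart index.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator into its own namespace; it deploys the
# driver, container toolkit, and device plugin components on GPU nodes.
helm install --wait gpu-operator \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator
```

Note that if your nodes already have drivers or the container toolkit pre-installed (as many cloud GPU images do), the operator can be configured to skip those components via its chart values.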
GPU node requirements¶
Each GPU node in your cluster has to fulfill the following requirements. All cloud vendors either offer operating system images with these two requirements pre-installed or a way to automatically provision GPU nodes.
- The nodes need to be able to run GPU-accelerated containers. For this reason, install the NVIDIA Container Toolkit on the GPU nodes and configure it as the default container runtime.
- The GPU nodes need to have NVIDIA drivers installed. The following table lists the minimum required versions; we still advise you to install the latest available version:
| Service | CUDA version used | Required NVIDIA driver version |
|---|---|---|
| Notebooks | 12.x | >=525.60.13 |
| AutoML training | 11.8 | >=450.80.02 |
| Generative AI | 12.8 | >=525.60.13 |
| Custom models | 12.x | >=525.60.13 |
| OCR | 12.x | >=570.195.03 |
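The two node requirements can be sketched as the following commands, run on each GPU node. This is a minimal example assuming an Ubuntu node with NVIDIA's apt repository already configured and containerd as the container runtime; adjust for your distribution and runtime.

```shell
# Install the NVIDIA Container Toolkit (assumes NVIDIA's apt repo is set up).
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure containerd to use the NVIDIA runtime as its default runtime.
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd

# Verify the installed driver meets the minimum version from the table above.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```

On managed cloud node pools these steps are typically handled for you by the vendor's GPU node image, so run the `nvidia-smi` check first before installing anything manually.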
Kubernetes cluster requirements¶
Besides preparing the GPU nodes, you also have to prepare the Kubernetes cluster.
- Install the NVIDIA device plugin for Kubernetes so that Kubernetes becomes aware of the GPUs on the GPU nodes and assigns GPU capacity correctly. You can either install the device plugin standalone or as part of the NVIDIA GPU Operator.
- Assign your GPU nodes a taint of nvidia.com/gpu=true:NoExecute so that Kubernetes won't schedule non-GPU workloads on them.
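The two cluster-level steps above can be sketched as follows. The device plugin version in the manifest URL is an example only; check the NVIDIA k8s-device-plugin releases for the current version, and substitute your own node name in the taint command.

```shell
# Deploy the NVIDIA device plugin as a DaemonSet (example version; verify
# the latest release before applying).
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/deployments/static/nvidia-device-plugin.yml

# Taint a GPU node so that only pods tolerating the taint are scheduled on it.
kubectl taint nodes <gpu-node-name> nvidia.com/gpu=true:NoExecute
```

GPU workloads scheduled onto these nodes then need a matching toleration for `nvidia.com/gpu=true:NoExecute` in their pod spec.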
You can check whether your setup was successful by describing one of the GPU nodes. You should see that the node has a capacity of one or more nvidia.com/gpu resources assigned to it.
kubectl describe node ip-10-152-221-225.ec2.internal --show-events=false
Name: ip-10-152-221-225.ec2.internal
...
Taints: nvidia.com/gpu=true:NoExecute
...
Capacity:
attachable-volumes-aws-ebs: 39
cpu: 8
ephemeral-storage: 209702892Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32386544Ki
nvidia.com/gpu: 1
pods: 29
...