Migrate Custom Templates Job

The code for the custom templates is maintained in a DataRobot proprietary repository. The main artifact from that repository is a container that is responsible for two things:

1. Creating/updating the execution environments used by custom templates
2. Creating/updating the custom templates for custom-metrics, custom-applications, and NVIDIA NIM Containers

The custom template code is kept in a separate repository to allow maintenance independent of the main DataRobot repository. This allows on-prem installations to receive custom template updates without updating the entire DataRobot application. The separate repository also has the benefit of allowing more focused testing of the template code.

As of DataRobot 11.1.1, there are four execution environments. The list is subject to change over time.

Included execution environments

  • [DataRobot] NodeJS 22.14.0 Applications Base
  • [DataRobot] Python 3.11 Custom Metrics
  • [DataRobot] Python 3.12 Applications Base
  • DRUM NIM Sidecar

There are more than a hundred custom templates covering a variety of template types. These templates are created or updated as required when the migration runs.

Execution

The custom templates installation/migration means running the datarobot-custom-templates container, which contains the information required to create/update the execution environments and custom templates. An admin API key is required, because normal users cannot manage globally provided templates. Here's a sample of the permissions required to successfully install/migrate the custom templates:

  • Enable Admin Api Access
  • Enable Custom Templates (enabled by default)
  • Enable Custom Jobs Template Gallery (enabled by default)
  • Can Manage Public Custom Environments
  • Enable Resource Bundles

The following table highlights some of the options available in the container that control the installation behavior. The container's --help output provides the most accurate information and takes precedence over this table.

| CLI Option | Environment Variable | Default | Notes |
| --- | --- | --- | --- |
| --admin-api-token | DR_API_KEY | Must be provided | This is the admin API key. |
| --dr-host | DR_HOST | Must be provided | URL of the base host (e.g. https://staging.datarobot.com/), with or without /api/v2/. |
| --[no-]update-exec-envs | UPDATE_EXEC_ENVS | False (no updates) | When True, creates or updates the execution environments. When False, only creates missing execution environments. |
| --template-types | TEMPLATE_TYPES | all | Allows updating a limited set of template types for unusual circumstances. Allowed values: all, applications, jobs, metrics, or models. |
| --max-version-wait | VERSION_TIMEOUT | 1200 seconds | Updating an execution-environment version involves building a new container, which may exceed the "standard" 20-minute period. |
| --force | FORCE_TEMPLATE_UPDATE | False | Forces an update of existing templates even when they match the server. Useful if there are issues detecting whether an update is needed. |
| --use-prebuilt-images | USE_PREBUILT_IMAGES | False | Uses prebuilt execution-environment images instead of building them in the DataRobot application. |
| --sleep-seconds | (none) | 0.25 seconds | Sleep time between template updates to avoid failures due to rate limiting. |
| --log-level | (none) | INFO | Changing the log level can provide more information about execution. |
| --dry-run | (none) | False | Checks whether templates need updating without updating them. |
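For example, to check whether only the custom-metric templates need updating without changing anything, a run might look like this (a sketch; the token, host, and image tag are placeholders):

docker run -i datarobot/datarobot-custom-templates:<tag> \
  --admin-api-token <ADMIN_API_TOKEN> \
  --dr-host <DR_HOST> \
  --template-types metrics \
  --dry-run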

The above options/flags can be provided directly as arguments when running the container with docker or podman. For options/flags with an environment variable, the value can be set either through the environment variable or as a direct CLI argument.

The following command snippets are functionally equivalent:

docker run -i datarobot/datarobot-custom-templates:<tag> --max-version-wait 300
docker run -e VERSION_TIMEOUT=300 datarobot/datarobot-custom-templates:<tag>
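A complete run also needs the required token and host. A minimal sketch using the environment variables from the table above (all values are placeholders):

docker run -i \
  -e DR_API_KEY=<ADMIN_API_TOKEN> \
  -e DR_HOST=https://<your-datarobot-host>/ \
  datarobot/datarobot-custom-templates:<tag>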

Image

When a datarobot-custom-templates container is built, it contains all the execution-environment and template information, so running the container always produces the same result. Building the execution environments requires external network access, but all the template information/code is frozen at the time the datarobot-custom-templates container is built.

The image is included in the datarobot-app-charts repository's core-integration-tasks package. In your Helm artifact, this should be located at charts/core-integration-tasks/Chart.yaml.
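One way to confirm which chart version your Helm artifact carries is to inspect that file directly (this assumes the standard Chart.yaml fields; adjust the path to match your artifact layout):

grep -E '^(version|appVersion):' charts/core-integration-tasks/Chart.yaml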

No external network case

This job typically requires an external network connection because the execution environment images are built by the DR application while the job is executing. You can use the --use-prebuilt-images option (or environment variable) to use a set of pre-built images which are included in the datarobot-custom-templates image.
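For example, an installation without external network access might run something like this (a sketch; the token, host, and image tag are placeholders):

docker run -i datarobot/datarobot-custom-templates:<tag> \
  --admin-api-token <ADMIN_API_TOKEN> \
  --dr-host <DR_HOST> \
  --use-prebuilt-images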

Please contact DataRobot support for further assistance related to this topic.

Cronjob

This task is run as a cronjob because it can take a long time to build all the execution environments.

The migrate_custom_templates cronjob is defined in the chart values and is expected to run at the top of every hour (0 * * * *). It attempts to install all missing execution environments and custom templates, and to update any custom templates that need it.

[!WARNING] Local changes will be overwritten by the cronjob

The cronjob can be described using kubectl like this:

kubectl describe cronjob core-integration-tasks-custom-templates -n $NAMESPACE

The pod (while running) can be seen here:

kubectl get pods -n $NAMESPACE -l role=core-integration-tasks-custom-templates
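To trigger the job immediately instead of waiting for the next scheduled run, a one-off job can be created from the cronjob (assuming the cronjob name shown above; the job name itself is arbitrary):

kubectl create job --from=cronjob/core-integration-tasks-custom-templates custom-templates-manual-run -n $NAMESPACE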

Upgrading execution environments

When upgrading DataRobot versions, execution environments provisioned by the Custom Templates job are NOT automatically upgraded to the latest versions. It is necessary to trigger the upgrade of execution environments in order to receive the latest updates to the affected features. This operation upgrades all the execution environments mentioned above.

docker run -i datarobot/datarobot-custom-templates:<tag> \
  --admin-api-token <ADMIN_API_TOKEN> \
  --dr-host <DR_HOST> \
  --use-prebuilt-images \
  --update-exec-envs

This will upload new versions of the execution environments that exist inside the datarobot-custom-templates image. Using the prebuilt images avoids involving the Image Build Service.

This can also be accomplished using kubectl with a pod manifest that looks like the following (updated with the correct image, DR_HOST, and DR_API_KEY):

apiVersion: v1
kind: Pod
metadata:
  name: custom-template-env-upgrade
  namespace: datarobot
spec:
  containers:
    - name: custom-template-env-upgrade
      image: datarobot/datarobot-custom-templates:11.1.438-image
      env:
      - name: USE_PREBUILT_IMAGES
        value: 'true'
      - name: UPDATE_EXEC_ENVS
        value: 'true'
      - name: DR_HOST
        value: http://datarobot-public-api:8004
      - name: DR_API_KEY
        valueFrom:
          secretKeyRef:
            name: ui-admin-credentials
            key: api_key

The biggest difference between the example manifest above and the standard Helm chart is that UPDATE_EXEC_ENVS is set to true. This setting tells the pod to upload new containers for all the execution environments.

The USE_PREBUILT_IMAGES setting of true means that the execution environment containers included in the datarobot-custom-templates image will be uploaded.
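To run and monitor the upgrade pod, assuming the manifest above is saved as custom-template-env-upgrade.yaml:

kubectl apply -f custom-template-env-upgrade.yaml
kubectl logs -f custom-template-env-upgrade -n datarobot
kubectl delete pod custom-template-env-upgrade -n datarobot  # clean up once the pod completes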

Deprecated Execution Environments

In release/11.4, some of the execution environments may be renamed to include [Deprecated] in the name, with new execution environments created to replace them. The "deprecated" execution environments may still be in use, so they are not deleted. However, no new versions (e.g. CVE fixes) will be applied to those environments. Once the "deprecated" environments are no longer used, they can be removed.

Troubleshooting

Failed to build execution environment

In the event of Image Build Service errors or misconfiguration, the symptom on the application side is an execution environment installed by the Custom Templates job (e.g. [DataRobot] Python 3.11 Custom Metrics) that has failed or is constantly stuck in a submitted status that never resolves.

In this case it is possible to retrigger the job with the --update-exec-envs flag, forcing the latest updates to be uploaded as new versions of these environments.

The easiest way to do this is to manually run the job via:

docker run -i datarobot/datarobot-custom-templates:<tag> \
  --admin-api-token <ADMIN_API_TOKEN> \
  --dr-host <DR_HOST> \
  --use-prebuilt-images \
  --update-exec-envs

The above command will update the execution environments, and then point all the templates to the latest successful version of the respective execution environment.

Custom-metric jobs fail for odd reasons (e.g. function not found)

Custom-metric jobs fail with odd errors such as the following:

Traceback (most recent call last):
  File "/opt/code/main.py", line 24, in <module>
    from dmm import log_parameters
ImportError: cannot import name 'log_parameters' from 'dmm' (/usr/local/lib/python3.11/dist-packages/dmm/__init__.py)

This can happen when the templates are updated to use functionality from a newer client library that is NOT in the [DataRobot] Python 3.11 Custom Metrics container. In this case, the execution environment needs to be updated. See the Failed to build execution environment section above.

File is not utf-8 encoded

I see an error saying: datarobot.errors.ClientError: 422 client error: {'message': 'File is not utf-8 encoded.'}

The installation/migration compares all the files in a template to see whether they have changed before deciding whether the template needs to be updated. In cases where binary files are used, the comparison currently ignores those files when determining whether the template needs to be updated.

If the binary file(s) are the only changes to the template and it is important to use the latest binary file(s), use the --force flag to cause all the templates to be updated.

docker run -i datarobot/datarobot-custom-templates:<tag> \
  --admin-api-token <ADMIN_API_TOKEN> \
  --dr-host <DR_HOST> \
  --no-update-exec-envs \
  --use-prebuilt-images \
  --force

Unrecognized flag during migration

usage: migrate_templates.py [-h] [--admin-api-token ADMIN_API_TOKEN] [--dr-host DR_HOST] [--update-exec-envs | --no-update-exec-envs] [--template-types {all,applications,jobs,metrics,models}] [--delete-outdated-templates | --no-delete-outdated-templates] [--use-prebuilt-images | --no-use-prebuilt-images] [--dry-run] [--force] [--sleep-seconds SLEEP_TIME] [--max-version-wait MAX_VERSION_WAIT] [--log-level LEVEL]

migrate_templates.py: error: unrecognized arguments: --use-generic-custom-templates-api

The --use-generic-custom-templates-api and --no-use-generic-custom-templates-api flags were used to control the API endpoint that was used during migration. These flags were added when the API was moving from only custom-metrics (using /api/v2/customMetricsTemplates/) to custom-metrics, custom-jobs, custom-applications, and custom-models (using the generic path /api/v2/customTemplates/). Those flags were deprecated in 11.1, and the container is always run with the equivalent of --use-generic-custom-templates-api.

References to the old flags should be removed.
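A quick way to find lingering references in your deployment configuration (the path shown is a placeholder):

grep -rn 'use-generic-custom-templates-api' /path/to/your/deployment-config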