Long Running Services (LRS) backup considerations

DataRobot provides a Kubernetes operator known as the Long Running Services (LRS) Operator to fulfill dynamic compute requests for DataRobot Feature Services. These workloads are intended to run for long periods, or indefinitely, after deployment. The LRS Operator itself does not require any backup or restore procedures beyond the standard installation instructions. However, applications launched via the LRS Operator will have varying backup and restoration requirements depending on the workload type.

DataRobot identifies the LRS workload type using a label on the Kubernetes resource: datarobot-lrs-type=<workload-type>.
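
For example, you can list all LRS resources on the cluster together with their workload type by asking kubectl to render the label as a column:

kubectl get lrs -A -L datarobot-lrs-type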

Interactive Spark sessions

Interactive Spark sessions support much of DataRobot's wrangling capabilities. While these sessions are stateful, they are also typically short-lived. The feature is specifically designed to handle session loss and to allow automatic restoration by the application. As a result, it is unnecessary to back up any active LRS Spark sessions when the DataRobot application is brought down for a backup.

To avoid the application reconnecting to stale sessions after a restore and restart, remove any active LRS Spark sessions during the backup process using the following command:

kubectl delete lrs -A -l datarobot-lrs-type=spark_app 
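
To verify that no stale Spark sessions remain before taking the backup, list resources with the same label selector; the command should report that no resources were found:

kubectl get lrs -A -l datarobot-lrs-type=spark_app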

Custom applications

Custom Applications can be created from Application Sources, which are stored in the Mongo database. The Mongo database backup guide should cover the backup of Application Sources.

Custom Applications can also be created from Docker images. To back up the Application Image, build it with docker build --tag [IMAGE_NAME] [PATH | URL] and upload the image to the DataRobot Image Registry, which has a corresponding backup policy. The Image Registry is either a set of DataRobot-managed Docker Hub repositories or a private Elastic Container Registry.
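
As a minimal sketch (the image name and registry host below are placeholders, not DataRobot-defined values), building and uploading an Application Image might look like:

# Build the Application Image from a local source directory (names are examples)
docker build ./my-app --tag registry.example.com/custom-apps/my-app:v1

# Push the image to the registry covered by your backup policy
docker push registry.example.com/custom-apps/my-app:v1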

Custom Applications can also be created from Application Templates, which are stored in GitHub repositories. You can clone the open source Foundational Application repositories to create a local copy.
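
For example, a local copy can be kept with a plain git clone (the repository URL below is a placeholder for the template repository you use):

git clone https://github.com/<org>/<application-template-repo>.git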

Custom Applications use the DataRobot key-value store API and file storage for persistent storage. The Mongo database and file storage backup guides should cover the backup of Application data, such as chat history, user settings, preferences, datasets, and metadata.

Custom (training) tasks

Custom Tasks are covered by the Custom Models backup and require no extra actions on their own. Backup only affects deployments that use a Custom Task; all other actions (training, scoring, insights, etc.) use ephemeral LRSes that are created and destroyed as needed.

Custom models

A backup procedure for Custom Model LRSes can be useful during application upgrades or cluster migrations. There are two ways to back up Custom Models: dumping LRS YAML files directly from the k8s cluster, or deactivating and reactivating Custom Models from the UI.

Dumping YAML LRS definitions

Saving YAML files is particularly useful when operations are performed on the k8s cluster itself (such as upgrading the k8s version or migrating to a different CNI plugin). The benefit of this approach is that access to the DataRobot application is not needed; k8s access is sufficient. The downside is that it only stores the low-level definition of the Custom Model workload from the cluster, so changes from upstream sources won't be taken into account.

The following shell script can be used to get a dump of all Custom Model LRSes:

#!/bin/bash
set -euo pipefail

# DR_CORE_NAMESPACE must be set to the namespace where DataRobot is installed.
: "${DR_CORE_NAMESPACE:?Set DR_CORE_NAMESPACE to the namespace where DataRobot is installed}"

OUTPUT_DIR="lrs_backup_$(date "+%Y_%m_%d_%H_%M")"
mkdir -p "$OUTPUT_DIR"

# Save each Custom Model LRS definition to its own YAML file.
for name in $(kubectl get lrs -n "$DR_CORE_NAMESPACE" -l datarobot-lrs-type=deployment -o jsonpath='{.items[*].metadata.name}'); do
  kubectl get lrs "$name" -n "$DR_CORE_NAMESPACE" -o yaml > "$OUTPUT_DIR/${name}.yaml"
done

echo "Backup completed. Files saved in $OUTPUT_DIR/"

This script creates a dump of all existing Custom Model LRSes on the k8s cluster (both running and failed), which you can use later for restoration if something goes wrong. It writes a separate YAML file per LRS so you can easily restore any particular one. Files are placed in a newly created folder (lrs_backup_yyyy_mm_dd_HH_MM).
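
To restore a particular LRS later, re-apply its saved definition to the same namespace. For example (the folder and file names below are placeholders for the generated output):

kubectl apply -n "$DR_CORE_NAMESPACE" -f lrs_backup_yyyy_mm_dd_HH_MM/<lrs-name>.yaml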

Note

To use the backup script, you must have kubectl configured in your shell session. You must also set DR_CORE_NAMESPACE to the namespace where DataRobot is installed.

Deactivating/reactivating Custom Models from the UI

Another approach is to deactivate and then reactivate Custom Models. It is more convenient because it uses the DataRobot API to stop Custom Models, which applies all the business rules and code needed to do so, making it safer than just dumping LRS YAML definitions. This approach does create downtime for your deployments, but that is expected, since a cluster migration or application upgrade is already in progress.

Model Deployments can be deactivated manually through the UI, which is a more controlled approach. If there are many deployments and deactivating them all manually feels cumbersome, the following script can be used instead:

kubectl -n $DR_CORE_NAMESPACE exec deploy/mmapp-app -it -- /entrypoint python3 tools/custom_model/deployments_tool.py --api-url <DATAROBOT_URL> --api-token <DATAROBOT_API_TOKEN> --action deactivate 

The parameters are:

  • DR_CORE_NAMESPACE - the namespace where DataRobot is installed
  • DATAROBOT_URL - the DataRobot application endpoint (e.g., https://app.datarobot.com)
  • DATAROBOT_API_TOKEN - an API token for a user with the MLOps admin role and access to all target deployments. The token can be retrieved from the API keys and tools section in Settings in the UI.

This script will:

  • Take all deployments visible to the owner of the API token (ensure that the user has the MLOps admin role and access to all target deployments)
  • Send a deactivation request for each deployment
  • By default, the script won't wait for the deactivation process to finish, but progress is visible in the UI (the Deployments section of the Console tab)

Downsides

  • To use it, you should have a user with MLOps admin access to all deployments in the environment
  • Custom metrics and monitoring jobs scheduled on the deployments will fail during the downtime, but they will catch up after the deployments are reactivated
  • Service stats will be cleared after a deployment deactivation (they will be recalculated during reactivation)
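
After the upgrade or migration completes, deployments are reactivated with the same tool. The --action value below is an assumption inferred from the deactivate action shown above; verify the supported actions in the tool's help output before running it:

# NOTE: "activate" is an assumed action name; check the tool's help for supported actions.
kubectl -n $DR_CORE_NAMESPACE exec deploy/mmapp-app -it -- /entrypoint python3 tools/custom_model/deployments_tool.py --api-url <DATAROBOT_URL> --api-token <DATAROBOT_API_TOKEN> --action activate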

Note

The tool is available starting with DataRobot 11.1.1. If you use an earlier version of DataRobot and want to use the script, contact support to obtain it.