Long Running Services Backup

DataRobot provides a Kubernetes operator known as the Long Running Services (LRS) Operator to fulfill dynamic compute requests for DataRobot Feature Services. These workloads are intended to run for long periods of time, or indefinitely after deployment. The operator itself does not require any backup or restore procedures beyond the standard installation instructions. However, applications launched via the LRS Operator will have varying backup and restoration requirements depending on the workload type.

DataRobot identifies the LRS workload type using a label on the resource: datarobot-lrs-type=<workload-type>.
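
For example, the LRS resources for a given workload type can be listed with this label:

# show all LRS resources along with their workload type label
kubectl get lrs -A -L datarobot-lrs-type

# list only the resources for a specific workload type
kubectl get lrs -A -l datarobot-lrs-type=<workload-type>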

LRS Backup Procedure by Workload Type

Interactive Spark Sessions

Interactive Spark sessions support much of the functionality of the DataRobot Wrangler feature. While these sessions are stateful, they are also short-lived. The feature is specifically designed to handle session loss and to allow automatic restoration by the application. As a result, it is unnecessary to back up any active session once the DataRobot application has been brought down for backup.

To avoid the application reconnecting to stale sessions after restart, administrators are encouraged to remove any active LRS Spark sessions during the backup process using the following command:

kubectl delete lrs -A -l datarobot-lrs-type=spark_app 
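
To preview which sessions would be removed without actually deleting anything, a client-side dry run can be used first:

# print the sessions that would be deleted, without deleting them
kubectl delete lrs -A -l datarobot-lrs-type=spark_app --dry-run=client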

Custom Applications

Custom Applications use a three-tier storage architecture that requires comprehensive backup coverage for complete disaster recovery.

Storage Architecture

MongoDB Database Collections

The following MongoDB collections are critical for Custom Application backup:

- custom_applications - Application metadata, configuration, and status
- longrunningservices - Critical: Kubernetes deployment configurations for running applications
- custom_application_images - Application Source metadata with references to source files
- custom_application_image_versions - Application Source version information
- execute_docker_images - Docker image metadata and registry references
- workspace_items - References to source files stored in file storage
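
A minimal sketch of backing up these collections with mongodump; the host and database name below are placeholders for your installation's values, and any authentication options your MongoDB requires are omitted:

#!/bin/bash
# Dump each Custom Application collection listed above to a dated backup folder.
# <MONGO_HOST> and <DB_NAME> are placeholders; add authentication flags as needed.
COLLECTIONS="custom_applications longrunningservices custom_application_images custom_application_image_versions execute_docker_images workspace_items"

for collection in $COLLECTIONS; do
  mongodump --host <MONGO_HOST> --db <DB_NAME> --collection "$collection" --out "mongo_backup_$(date +%Y_%m_%d)"
done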

File Storage (S3/Gluster/HDFS)

File storage contains the actual application assets:

- Source code files (.py, .js, .html, .css, etc.) uploaded to Application Sources
- Runtime application data (chat history, user settings, preferences, datasets)
- Application state and persistent data saved via the key-value store API
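
For S3-backed file storage, for example, the bucket can be copied with the AWS CLI; the bucket names below are placeholders, and Gluster or HDFS installations would use their own native tooling instead:

# copy the DataRobot file storage bucket to a dedicated backup location (bucket names are placeholders)
aws s3 sync s3://<DATAROBOT_FILE_STORAGE_BUCKET> s3://<BACKUP_BUCKET>/datarobot-file-storage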

Docker Registry (ECR/Dockerhub)

Docker Registry stores:

- Built Docker images for running applications (stored in the custom-apps/managed-image repository)
- Pre-built Docker images for applications created from existing images
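
As one hedged example, an individual built image can be pulled from the registry and archived with docker; the registry host and tag are placeholders:

# pull a built application image and save it as a tar archive for offline backup
docker pull <REGISTRY_HOST>/custom-apps/managed-image:<TAG>
docker save <REGISTRY_HOST>/custom-apps/managed-image:<TAG> -o managed-image_<TAG>.tar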

Backup Strategy

Minimum Viable Recovery (Business Continuity)

- MongoDB custom_applications and longrunningservices collections
- Docker Registry images
- Enables restarting applications with full functionality
- Limitation: Loses the ability to edit source code and all runtime user data

Complete Recovery (Full Data Protection)

- All MongoDB collections listed above
- File Storage (S3/Gluster/HDFS) backup
- Docker Registry backup
- Preserves source code editing capabilities and all user data

Critical Note on LRS Objects

Important: The longrunningservices collection contains the actual Kubernetes deployment configurations. Without this collection, Custom Applications will still appear in the DataRobot UI with a status (the status comes from the custom_applications collection), but they will not be functional. Attempting to access an application without its LRS object results in "The application is temporarily unavailable" errors.

The LRS object ID intentionally matches the Custom Application ID to give a 1:1 mapping. This relationship is created when an application is started (see public_api/custom_applications/helpers.py line 1153).
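
As a post-restore sanity check, the 1:1 ID mapping can be used to spot applications whose LRS object is missing. The sketch below assumes mongosh access to the DataRobot database and that the shared ID is stored in _id in both collections; the connection string is a placeholder:

# list Custom Applications that have no matching LRS object (assumes _id holds the shared ID)
mongosh "mongodb://<MONGO_HOST>/<DB_NAME>" --quiet --eval '
  db.custom_applications.find({}, { _id: 1 }).forEach(function (app) {
    if (db.longrunningservices.countDocuments({ _id: app._id }) === 0) {
      print("Missing LRS object for application: " + app._id);
    }
  });
'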

Creation Methods

Application Sources: Backup requires MongoDB (metadata + file references) + File Storage (actual source files) + Docker Registry (built images)

Pre-built Docker Images: Backup requires MongoDB (configuration and image references) + Docker Registry (original and managed images)

Application Templates:

- Global Templates: Bundled with the DataRobot installation (YAML files); backed up with the application deployment
- Custom Templates: Stored in MongoDB (metadata) + File Storage (template files)

Custom (Training) Task

Custom Tasks are covered by the Custom Models backup and require no extra actions of their own. Backup only affects deployments with a Custom Task; all other actions (training, scoring, insights, ...) use ephemeral LRSs that are created and destroyed as needed.

Custom Models

A backup procedure for Custom Model LRSs can be useful during application upgrades or cluster migrations. There are two ways to back up Custom Models: dumping the LRS YAML definitions directly from the Kubernetes cluster, or deactivating and later reactivating Custom Models from the UI.

Dumping YAML LRS definitions

Saving the YAML files is particularly useful when operations are performed on the Kubernetes cluster itself (for example, upgrading the Kubernetes version or migrating to a different CNI plugin). The benefit of this approach is that access to the DataRobot application is not needed; Kubernetes access is sufficient. The downside is that it only captures the low-level definition of the Custom Model workload from the cluster, so changes from upstream sources are not taken into account.

The following shell script can be used to get a dump of all Custom Model LRSs:

#!/bin/bash
# Dumps every Custom Model LRS in the DataRobot namespace to a separate YAML file.
# DR_CORE_NAMESPACE must be set to the namespace where DataRobot is installed.

set -euo pipefail
: "${DR_CORE_NAMESPACE:?DR_CORE_NAMESPACE must be set to the DataRobot namespace}"

OUTPUT_DIR="lrs_backup_$(date "+%Y_%m_%d_%H_%M")"
mkdir -p "$OUTPUT_DIR"

# list Custom Model LRSs by their workload-type label and save one manifest per LRS
for name in $(kubectl get lrs -n "$DR_CORE_NAMESPACE" -l datarobot-lrs-type=deployment -o jsonpath='{.items[*].metadata.name}'); do
  kubectl get lrs "$name" -n "$DR_CORE_NAMESPACE" -o yaml > "$OUTPUT_DIR/${name}.yaml"
done

echo "Backup completed. Files saved in $OUTPUT_DIR/"

This creates a dump of all existing LRSs on the Kubernetes cluster (both running and failed), which you can use later to restore if something goes wrong. It writes a separate YAML file per LRS, so you can easily restore any particular LRS, and places the files in a newly created folder (lrs_backup_yyyy_mm_dd_HH_MM). Please note that to use it you need kubectl configured in your shell session, and you must set DR_CORE_NAMESPACE to the namespace where DataRobot is installed.
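
If an LRS later needs to be restored from this dump, the saved manifest can be re-applied to the cluster. A minimal sketch; note that depending on your Kubernetes version, server-managed fields in the dumped YAML (resourceVersion, uid, status) may need to be removed before re-applying:

# restore a single LRS from its saved manifest
kubectl apply -n "$DR_CORE_NAMESPACE" -f "lrs_backup_<timestamp>/<lrs-name>.yaml"

# or re-apply every manifest in the backup folder at once
kubectl apply -n "$DR_CORE_NAMESPACE" -f "lrs_backup_<timestamp>/"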

Deactivating/Reactivating Custom Models from the UI

Another option is deactivating and reactivating Custom Models. This is more convenient because it uses the DataRobot API to stop Custom Models, which includes all of the business rules and code needed to do so, making it safer than just dumping the LRS YAML definitions. It does create downtime for your deployments, but we assume this is acceptable since a cluster migration or application upgrade is happening anyway.

Model Deployments can be deactivated manually through the UI, which is the more controlled approach. If there are many deployments and deactivating them all by hand is cumbersome, the following script can be used:

kubectl -n $DR_CORE_NAMESPACE exec deploy/mmapp-app -it -- /entrypoint python3 tools/custom_model/deployments_tool.py --api-url <DATAROBOT_URL> --api-token <DATAROBOT_API_TOKEN> --action deactivate --wait-to-finilize 

where:

- DR_CORE_NAMESPACE - the namespace where DataRobot is installed;
- DATAROBOT_URL - the DataRobot application endpoint (e.g. https://app.datarobot.com);
- DATAROBOT_API_TOKEN - an API token for a user with the MLOps admin role and access to all target deployments in the organisation that is set up in your DataRobot app. The token can be retrieved in the API keys and tools section under Settings in the UI. Please note that if you have multiple organisations in your DataRobot app, you have to run the script for each organisation separately.

This script will:

- take all deployments visible to the owner of the API token (again, please ensure that the user has the MLOps admin role and access to all target deployments);
- send a deactivation request for each deployment;
- by default, wait for each deactivation to finish sequentially. If a deployment fails to deactivate, you will see a corresponding message. If you remove the --wait-to-finilize flag, the script will not wait for finalization (in that case you can track the progress in the Deployments section of the Console tab).

This method has a couple of downsides worth mentioning:

- to use it, you need a user with MLOps admin access to all deployments in the environment;
- custom metrics and monitoring jobs scheduled on the deployments will fail during the downtime, although they will catch up after deployment reactivation;
- service stats are cleared when a deployment is deactivated (they are recalculated during reactivation).

Please note that the tool might be missing in older versions of the app. If you tried to run the command above and it failed because the script is missing, contact our support team and they can provide the script manually. The script requires Python 3.10 with the requests and urllib3 packages installed. Assuming Python is already installed in your environment, you can run it like this:

pip3 install requests urllib3
python3 <PATH_TO_THE_SCRIPT>/deployments_tool.py --api-url <DATAROBOT_URL> --api-token <DATAROBOT_API_TOKEN> --action deactivate --wait-to-finilize
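
Once the upgrade or migration is complete, the deployments need to be reactivated. Assuming the tool supports an activate action symmetrical to deactivate (not confirmed here; check the tool's help output for the exact action name), the call would look like:

kubectl -n $DR_CORE_NAMESPACE exec deploy/mmapp-app -it -- /entrypoint python3 tools/custom_model/deployments_tool.py --api-url <DATAROBOT_URL> --api-token <DATAROBOT_API_TOKEN> --action activate --wait-to-finilize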