Long Running Services (LRS) backup considerations

DataRobot provides a Kubernetes operator known as the Long Running Services (LRS) Operator to fulfill dynamic compute requests for DataRobot Feature Services. These workloads are intended to run for long periods, or indefinitely, after deployment. The LRS Operator itself does not require any backup or restore procedures beyond the standard installation instructions. However, applications launched via the LRS Operator will have varying backup and restoration requirements depending on the workload type.

DataRobot identifies the LRS workload type using a label on the Kubernetes resource: datarobot-lrs-type=<workload-type>.
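
For example, you can list all LRS resources on the cluster together with their workload type by asking kubectl to render the label as a column:

kubectl get lrs -A -L datarobot-lrs-type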

Interactive Spark sessions

Interactive Spark sessions support much of DataRobot's wrangling capabilities. While these sessions are stateful, they are also typically short-lived. The feature is specifically designed to handle session loss and to allow automatic restoration by the application. As a result, it is unnecessary to back up any active LRS Spark sessions when the DataRobot application is brought down for a backup.

To avoid the application reconnecting to stale sessions after a restore and restart, remove any active LRS Spark sessions during the backup process using the following command:

kubectl delete lrs -A -l datarobot-lrs-type=spark_app 
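
To verify that no stale Spark sessions remain before taking the backup, list resources with the same label selector; the command should report that no resources were found:

kubectl get lrs -A -l datarobot-lrs-type=spark_app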

Custom applications

Custom Applications can be created from Application Sources, which are stored in the Mongo database. The Mongo database backup guide should cover the backup of Application Sources.

Custom Applications can also be created from Docker images. To back up the Application Image, build it with docker build --tag [IMAGE_NAME] [PATH | URL] and upload the image to the DataRobot Image Registry, which has a corresponding backup policy. The Image Registry is either a set of DataRobot-managed Docker Hub repositories or a private Elastic Container Registry.
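
As a minimal sketch (the image name and registry host below are placeholders, not DataRobot-defined values), building and uploading an Application Image might look like:

# Build the Application Image from a local source directory (names are examples)
docker build ./my-app --tag registry.example.com/custom-apps/my-app:v1

# Push the image to the registry covered by your backup policy
docker push registry.example.com/custom-apps/my-app:v1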

Custom Applications can also be created from Application Templates, which are stored in GitHub repositories. You can clone the open source Foundational Application repositories to create a local copy.
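
For example, a local copy can be kept with a plain git clone (the repository URL below is a placeholder for the template repository you use):

git clone https://github.com/<org>/<application-template-repo>.git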

Custom Applications use the DataRobot key-value store API and file storage for persistent storage. The Mongo database and file storage backup guides should cover the backup of Application data, such as chat history, user settings, preferences, datasets, and metadata.

Custom (training) tasks

Custom Tasks are covered by the Custom Models backup and require no extra actions on their own. Backup only affects deployments that use a Custom Task; all other actions (training, scoring, insights, etc.) use ephemeral LRSes that are created and destroyed as needed.

Custom models

A backup procedure for Custom Model LRSes can be useful during application upgrades or cluster migrations. There are two ways to back up Custom Models: dumping LRS YAML files directly from the k8s cluster, or deactivating and reactivating Custom Models from the UI.

Dumping YAML LRS definitions

Saving YAML files is particularly useful when operations are performed on the k8s cluster itself (such as upgrading the k8s version or migrating to a different CNI plugin). The benefit of this approach is that access to the DataRobot application is not needed; k8s access is sufficient. The downside is that it only stores the low-level definition of the Custom Model workload from the cluster, so changes from upstream sources won't be taken into account.

The following shell script can be used to get a dump of all Custom Model LRSes:

#!/bin/bash
set -euo pipefail

# DR_CORE_NAMESPACE must be set to the namespace where DataRobot is installed.
: "${DR_CORE_NAMESPACE:?Set DR_CORE_NAMESPACE to the namespace where DataRobot is installed}"

OUTPUT_DIR="lrs_backup_$(date "+%Y_%m_%d_%H_%M")"
mkdir -p "$OUTPUT_DIR"

# Save each Custom Model LRS definition to its own YAML file.
for name in $(kubectl get lrs -n "$DR_CORE_NAMESPACE" -l datarobot-lrs-type=deployment -o jsonpath='{.items[*].metadata.name}'); do
  kubectl get lrs "$name" -n "$DR_CORE_NAMESPACE" -o yaml > "$OUTPUT_DIR/${name}.yaml"
done

echo "Backup completed. Files saved in $OUTPUT_DIR/"

This script creates a dump of all existing Custom Model LRSes on the k8s cluster (both running and failed), which you can use later for restoration if something goes wrong. It writes a separate YAML file per LRS so you can easily restore any particular one. Files are placed in a newly created folder (lrs_backup_yyyy_mm_dd_HH_MM).
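
To restore a particular LRS later, re-apply its saved definition to the same namespace. For example (the folder and file names below are placeholders for the generated output):

kubectl apply -n "$DR_CORE_NAMESPACE" -f lrs_backup_yyyy_mm_dd_HH_MM/<lrs-name>.yaml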

Note

To use the backup script, you must have kubectl configured in your shell session. You must also set DR_CORE_NAMESPACE to the namespace where DataRobot is installed.

Deactivating/reactivating Custom Models from the UI

Another approach is to deactivate and then reactivate Custom Models. It is more convenient because it uses the DataRobot API to stop Custom Models, which applies all the business rules and code needed to do so, making it safer than just dumping LRS YAML definitions. This approach does create downtime for your deployments, but that is expected, since a cluster migration or application upgrade is already in progress.

Model Deployments can be deactivated manually through the UI, which is a more controlled approach. If there are many deployments and deactivating them all manually feels cumbersome, the following script can be used instead:

kubectl -n $DR_CORE_NAMESPACE exec deploy/mmapp-app -it -- /entrypoint python3 tools/custom_model/deployments_tool.py --api-url <DATAROBOT_URL> --api-token <DATAROBOT_API_TOKEN> --action deactivate 

The parameters are:

  • DR_CORE_NAMESPACE - the namespace where DataRobot is installed
  • DATAROBOT_URL - the DataRobot application endpoint (e.g., https://app.datarobot.com)
  • DATAROBOT_API_TOKEN - an API token for a user with the MLOps admin role and access to all target deployments. The token can be retrieved from the API keys and tools section in Settings in the UI.

This script will:

  • Take all deployments visible to the owner of the API token (ensure that the user has the MLOps admin role and access to all target deployments)
  • Send a deactivation request for each deployment
  • By default, the script won't wait for the deactivation process to finish, but progress is visible in the UI (the Deployments section of the Console tab)

Downsides

  • To use it, you should have a user with MLOps admin access to all deployments in the environment
  • Custom metrics and monitoring jobs scheduled on the deployments will fail during the downtime, but they will catch up after the deployments are reactivated
  • Service stats will be cleared after a deployment deactivation (they will be recalculated during reactivation)
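
After the upgrade or migration completes, deployments are reactivated with the same tool. The --action value below is an assumption inferred from the deactivate action shown above; verify the supported actions in the tool's help output before running it:

# NOTE: "activate" is an assumed action name; check the tool's help for supported actions.
kubectl -n $DR_CORE_NAMESPACE exec deploy/mmapp-app -it -- /entrypoint python3 tools/custom_model/deployments_tool.py --api-url <DATAROBOT_URL> --api-token <DATAROBOT_API_TOKEN> --action activate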

Note

The tool is available starting with DataRobot 11.1.1. If you use an earlier version of DataRobot and want to use the script, contact support to obtain it.