Restore long running services

DataRobot provides a Kubernetes operator known as the Long Running Services (LRS) operator to fulfill dynamic compute requests for DataRobot feature services. These workloads are intended to run for long periods of time, or indefinitely after deployment. The operator itself doesn't require any backup or restore procedures beyond the standard installation instructions. However, applications launched via the LRS operator have varying backup and restoration requirements depending on the workload type.

DataRobot identifies the LRS workload type using a label on the resource: datarobot-lrs-type=<workload-type>.
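
As a minimal sketch of how this label can be used, the helpers below build the selector and list resources of one workload type. The helper names, the LRS_KIND variable, and the default resource kind lrs are assumptions made for illustration and do not come from this guide; confirm the actual kind with kubectl api-resources before relying on it.

```shell
# Build the label selector for one LRS workload type.
lrs_selector() {
  printf 'datarobot-lrs-type=%s' "$1"
}

# List resources of one workload type in $NAMESPACE.
# ASSUMPTION: the resource kind is "lrs"; set LRS_KIND if your cluster
# registers the custom resource under a different name.
list_lrs_by_type() {
  kubectl -n "${NAMESPACE:?set NAMESPACE first}" \
    get "${LRS_KIND:-lrs}" -l "$(lrs_selector "$1")"
}
```

For example, with NAMESPACE set, list_lrs_by_type tracking_agent would list the resources labeled datarobot-lrs-type=tracking_agent.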

Interactive Spark sessions

No restoration process is necessary for Interactive Spark sessions. Once the core DataRobot application is back online, Wrangler automatically initiates a new LRS session, provided any stale sessions were removed during the backup procedure.

Custom applications

Custom applications require restoration of both MongoDB collections and LRS objects for complete functionality.

Prerequisites

Note

You must fulfill the prerequisites before proceeding.

Ensure the following have been restored before proceeding:

  • MongoDB collections: custom_applications, longrunningservices, custom_application_images, custom_application_image_versions, execute_docker_images, workspace_items.
  • File storage (S3/Gluster/HDFS) if applications use application sources or persistent data.
  • Docker registry images if using the same or a replicated registry.
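
The collection prerequisite can be checked with a small helper. This is a sketch only: it assumes you can produce the list of restored collection names, one per line, by whatever means your environment provides (for example, mongosh and db.getCollectionNames()). The function and variable names are hypothetical.

```shell
# Collections this guide lists as prerequisites for custom applications.
REQUIRED_COLLECTIONS="custom_applications longrunningservices custom_application_images custom_application_image_versions execute_docker_images workspace_items"

# Read collection names (one per line) on stdin, print every required
# collection that is absent, and return non-zero if any is missing.
missing_collections() {
  present=$(cat)
  status=0
  for name in $REQUIRED_COLLECTIONS; do
    # -x matches the whole line, -F treats the name as a literal string.
    if ! printf '%s\n' "$present" | grep -qxF "$name"; then
      printf '%s\n' "$name"
      status=1
    fi
  done
  return "$status"
}
```

Pipe the collection listing into missing_collections; empty output and a zero exit status indicate that every required collection is present.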

Automatic restoration via MongoDB

If you have backed up and restored the longrunningservices collection along with custom_applications, the applications should automatically be restored and functional. The LRS operator detects the restored LRS objects and recreates the necessary Kubernetes deployments.

To verify restoration:

  1. Navigate to Registry > Applications in the DataRobot UI
  2. Check that your Custom Applications appear in the list
  3. Click Open on an application
  4. If the application loads successfully, restoration is complete

Manual restoration (if LRS objects are missing)

If you didn't back up the longrunningservices collection, you must manually recreate applications through the UI. This approach requires users to re-upload or rebuild their applications.

To restore custom applications from application sources:

  1. Go to Registry > Applications > Application sources tab
  2. Select the application source from which you want to build the application
  3. Click Build application

To restore custom applications from a Docker image:

  1. Go to Registry > Applications page
  2. Click the dropdown next to Add new application source
  3. Select Upload application
  4. Upload the Application Docker Image
  5. Click Create application

To restore custom applications from application templates:

  1. Go to Registry > Applications page
  2. Click the dropdown next to Add new application source
  3. Select Create new application from template

Alternatively, on the DataRobot homepage, click Explore application templates, then click either Open in a codespace or Copy repository URL. Follow the README instructions to create the application.

Note

Custom applications are paused after a period of inactivity. To resume a paused custom application, click Open. A loading screen appears while the application restarts. Application data persists across the restart.

Custom (training) task

Custom tasks are restored as part of the custom models restore and require no extra action of their own. The restore only affects deployments with a custom task. All other actions (training, scoring, insights, etc.) use ephemeral LRSes that are created and destroyed as needed.

Custom models LRSes

If an error occurs during app upgrades or cluster migrations, you may need to perform the restore procedure for custom model LRSes. There are two ways to restore custom models: using LRS YAML files that were created according to the LRS backup instructions or reactivating custom models from the UI.

Recreate LRSes from YAML LRS definitions

This procedure assumes you have already backed up custom model LRSes into a folder, for example lrs_backup_2025_01_01_12_00, using the LRS backup instructions. To restore an LRS resource named lrs-xxxxx directly in the Kubernetes cluster, run the following command:

kubectl apply -f lrs_backup_2025_01_01_12_00/lrs-xxxxx.yaml

This recreates the LRS if it doesn't exist, or rolls it back to the backed-up state if its definition has changed since the backup.

Note

This command requires kubectl to be configured in your shell with access to the target cluster.
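
Because kubectl apply can overwrite a live definition, it may help to preview the change first. The sketch below wraps the command above with a kubectl diff step. The function name restore_lrs is hypothetical, and the sketch assumes one YAML file per LRS named after the resource, as in the example path above.

```shell
# Preview and then apply one backed-up LRS definition.
# Usage: restore_lrs BACKUP_DIR LRS_NAME
restore_lrs() {
  file="$1/$2.yaml"
  if [ ! -f "$file" ]; then
    echo "backup file not found: $file" >&2
    return 1
  fi
  # kubectl diff exits 0 when nothing would change, 1 when differences
  # exist, and a value greater than 1 on error.
  rc=0
  kubectl diff -f "$file" || rc=$?
  if [ "$rc" -gt 1 ]; then
    echo "kubectl diff failed for $file" >&2
    return "$rc"
  fi
  kubectl apply -f "$file"
}
```

For example, restore_lrs lrs_backup_2025_01_01_12_00 lrs-xxxxx prints the pending differences and then applies the backed-up definition.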

Deactivate/Reactivate custom models from the UI

The other option is to deactivate and reactivate custom models. This is more convenient because it starts custom models through the DataRobot API, which applies all the business rules and code required. It is also safer than recreating LRSes directly from YAML definitions.

Custom models can be reactivated manually from the UI, which lets you start only the deployments you need. To reactivate all deployments, use the following script:

kubectl -n $NAMESPACE exec deploy/mmapp-app -it -- /entrypoint python3 tools/custom_model/deployments_tool.py --api-url DATAROBOT_URL --api-token DATAROBOT_API_TOKEN --action activate

Note

  • Replace DATAROBOT_URL with the DataRobot app URL (e.g., https://app.datarobot.com).
  • Replace DATAROBOT_API_TOKEN with an API token for the user with the MLOps admin role and access to all target deployments. You can retrieve the token in the API keys and tools section in Account settings in the UI.

Note

If you have multiple organizations in your DataRobot app, run the script for each organization separately.

This script:

  • Takes all deployments visible to the owner of the DATAROBOT_API_TOKEN (ensure that the user has the MLOps admin role and access to all target deployments).
  • Sends an activation request for each deployment.
  • By default, does not wait for the activation process to finish. Track progress in the UI (Deployments section in the Console tab); if a deployment fails to activate, you see a corresponding message. To make the script wait for finalization, add the --wait-to-finilize flag.

Downsides

  • It requires a user with MLOps admin access to all deployments in the environment.
  • Custom metrics and monitoring jobs that are scheduled on the deployments fail during downtime. They catch up after deployment reactivation.
  • Service stats are cleared after a deployment deactivation (they're recalculated during reactivation).
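
For the multi-organization case noted above, the activation command can be wrapped in a loop. This is a sketch under a stated assumption: each organization has its own MLOps admin API token, which is not confirmed by this guide. The function name activate_all_orgs is hypothetical; the inner command is the one shown earlier in this section.

```shell
# Run the activation tool once per API token.
# ASSUMPTION: one MLOps admin token per organization.
# Usage: activate_all_orgs DATAROBOT_URL TOKEN [TOKEN ...]
activate_all_orgs() {
  api_url=$1
  shift
  for token in "$@"; do
    kubectl -n "${NAMESPACE:?set NAMESPACE first}" exec deploy/mmapp-app -it -- \
      /entrypoint python3 tools/custom_model/deployments_tool.py \
      --api-url "$api_url" --api-token "$token" --action activate
  done
}
```

Tokens passed this way are visible in the shell history and process list, so treat them accordingly.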

Note

The tool might be missing in older versions of the app. If the command above fails because the script is missing, contact DataRobot support to obtain the script. Run it with Python 3.10 and the requests and urllib3 packages installed. Assuming Python is already installed in your environment, execute the following commands:

pip3 install requests urllib3
python3 PATH_TO_THE_SCRIPT/deployments_tool.py --api-url DATAROBOT_URL --api-token DATAROBOT_API_TOKEN --action activate --wait-to-finilize

Note

  • Replace PATH_TO_THE_SCRIPT with the path to the deployments_tool.py script.
  • Replace DATAROBOT_URL with the DataRobot app URL (e.g., https://app.datarobot.com).
  • Replace DATAROBOT_API_TOKEN with an API token for the user with the MLOps admin role and access to all target deployments. The token can be retrieved in the API keys and tools section in Settings on the UI.

Tracking and management

Tracking and management agents are stateless, so all data they generate and require is already captured by the existing datastore backup and restore procedures (PostgreSQL, MongoDB, etc.). Tracking and management agent CRDs can be restored in the same way as other LRSes, using the labels datarobot-lrs-type=tracking_agent and datarobot-lrs-type=management_agent.