Long Running Services Restoration Process

DataRobot provides a Kubernetes operator known as the Long Running Services (LRS) Operator to fulfill dynamic compute requests for DataRobot Feature Services. These workloads are intended to run for long periods of time, or indefinitely after deployment. The operator itself does not require any backup or restore procedures beyond the standard installation instructions. However, applications launched via the LRS Operator will have varying backup and restoration requirements depending on the workload type.

DataRobot identifies the LRS workload type using a label on the resource: datarobot-lrs-type=<workload-type>.
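As a rough illustration, this label can be used as a kubectl label selector to list the LRS resources of one workload type. The function names below are hypothetical. The resource name passed as the second argument is an assumption, because this page does not state it; look it up in your cluster with kubectl api-resources. The workload-type value is likewise left for you to fill in.

```shell
# Build the label selector described above for a given workload type.
lrs_selector() {
    # $1: workload type (the value that follows "datarobot-lrs-type=")
    printf 'datarobot-lrs-type=%s' "$1"
}

# List LRS resources of one workload type.
# $1: namespace
# $2: resource name of the LRS objects (ASSUMPTION: not given in this guide)
# $3: workload type
list_lrs_by_type() {
    kubectl -n "$1" get "$2" -l "$(lrs_selector "$3")" --show-labels
}
```

A call would look like: list_lrs_by_type "$DR_CORE_NAMESPACE" <lrs-resource-name> <workload-type>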

LRS Restoration Procedure by Workload Type

Interactive Spark Sessions

No restoration process is necessary for Interactive Spark sessions. Once the core DataRobot application is back online, Wrangler will automatically initiate a new LRS session—provided any stale sessions were removed during the backup procedure.

Custom Applications

Custom Applications require restoration of both MongoDB collections and LRS objects for complete functionality.

Prerequisites

Ensure the following have been restored before proceeding:
- MongoDB collections: custom_applications, longrunningservices, custom_application_images, custom_application_image_versions, execute_docker_images, workspace_items
- File Storage (S3/Gluster/HDFS) if applications use Application Sources or persistent data
- Docker Registry images if using the same or replicated registry
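The collection check can be sketched as a small shell helper. It reads the names of the collections that currently exist (one per line) on standard input and prints any required collection that is missing. The helper name and the way the list is obtained are assumptions, not part of the DataRobot tooling.

```shell
# Collections this guide lists as prerequisites for Custom Applications.
REQUIRED_COLLECTIONS="custom_applications longrunningservices custom_application_images custom_application_image_versions execute_docker_images workspace_items"

# Print every required collection that is absent from the list on stdin.
# Prints nothing when all required collections are present.
missing_collections() {
    existing=$(cat)
    for name in $REQUIRED_COLLECTIONS; do
        # -x matches the whole line, -F treats the name as a literal string
        printf '%s\n' "$existing" | grep -qxF "$name" || printf '%s\n' "$name"
    done
}
```

One possible way to feed it, assuming mongosh is available and <MONGO_URI> is a placeholder pointing at the database that holds these collections: mongosh "<MONGO_URI>" --quiet --eval 'db.getCollectionNames().forEach(n => print(n))' | missing_collections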

Automatic Restoration via MongoDB

If you have backed up and restored the longrunningservices collection along with custom_applications, the applications should automatically be restored and functional. The LRS Operator will detect the restored LRS objects and recreate the necessary Kubernetes deployments.

To verify restoration:
1. Navigate to Registry → Applications in the DataRobot UI
2. Check that your Custom Applications appear in the list
3. Click "Open" on an application
4. If the application loads successfully, restoration is complete

Manual Restoration (If LRS Objects Are Missing)

If you did not back up the longrunningservices collection, you will need to manually recreate applications through the UI. This approach requires users to re-upload or rebuild their applications:

From Application Sources:
1. Navigate to Registry → Applications → Application sources tab
2. Select the application source from which you want to build the application
3. Click "Build application"

From Docker Image:
1. Navigate to Registry → Applications page
2. Click the dropdown next to "Add new application source"
3. Select "Upload application"
4. Upload the Application Docker Image
5. Click "Create application"

From Application Templates:
1. Navigate to Registry → Applications page
2. Click the dropdown next to "Add new application source"
3. Select "Create new application from template"

Alternatively, from the DataRobot homepage:
1. Click "Explore application templates"
2. Select either "Open in a codespace" or "Copy repository URL"
3. Follow the README instructions to create the Application

Troubleshooting

Symptom: Application appears in UI as "running" but shows "The application is temporarily unavailable" when accessed.

Cause: The custom_applications collection was restored but the longrunningservices collection was not.

Solution: Do one of the following:
- Restore the missing longrunningservices collection from backup, or
- Manually stop and restart the application through the UI to recreate the LRS object, or
- Delete and recreate the application through the UI

Note: Custom Applications may be paused after a period of inactivity. To resume a paused application, click "Open"; a loading screen will appear while the application restarts. Application data persisted in file storage will be available after the restart.

Custom (Training) Task

Custom Tasks are restored as part of the Custom Models restore and require no extra actions of their own. The restore only affects deployments with a Custom Task. All other actions (training, scoring, insights, ...) use ephemeral LRSes that are created and destroyed as needed.

Custom Models

A restore procedure for Custom Model LRSes may be needed after an application upgrade or cluster migration if something went wrong. There are two ways to restore Custom Models: recreate them from the LRS YAML files created by following the LRS backup documentation, or deactivate and reactivate the Custom Models from the UI.

Recreating LRSes from YAML LRS definitions

Assume you have already backed up the Custom Model LRSes into the folder lrs_backup_2025_01_01_12_00 by following the LRS backup documentation. To restore an LRS resource named lrs-xxxxx directly in the Kubernetes cluster, run the following command:

kubectl apply -f lrs_backup_2025_01_01_12_00/lrs-xxxxx.yaml 

This recreates the LRS if it does not exist, or rolls it back to the backed-up state if its definition has been changed. The command requires a working kubectl configuration (kubeconfig) for the target cluster in your shell.
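If several LRSes need to be restored, the same command can be looped over the backup folder. The sketch below is one way to do that. The function name is hypothetical, and it assumes the backup folder contains one .yaml file per LRS, as in the example above.

```shell
# Apply every LRS definition in a backup folder, one file at a time,
# and keep going when a file fails so all failures are reported.
# $1: backup folder, e.g. lrs_backup_2025_01_01_12_00
restore_lrs_backup() {
    backup_dir=$1
    failed=0
    for manifest in "$backup_dir"/*.yaml; do
        # An unmatched glob stays literal, so skip anything that is not a file
        [ -f "$manifest" ] || continue
        if kubectl apply -f "$manifest"; then
            echo "restored: $manifest"
        else
            echo "FAILED:   $manifest" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only when every manifest was applied
    [ "$failed" -eq 0 ]
}
```

To preview what would change before applying, kubectl diff -f lrs_backup_2025_01_01_12_00/lrs-xxxxx.yaml can be run on a single file first. kubectl apply -f also accepts a directory; the loop is used here only to get a per-file report.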

Deactivating/Reactivating Custom Models from the UI

Another option is to deactivate and reactivate the Custom Models. This is more convenient because it starts Custom Models through the DataRobot API, which applies all the business rules and code needed for that. It is also safer than recreating LRSes directly from YAML definitions.

Custom Models can be reactivated manually from the UI, which lets you start only the deployments you need. To reactivate all of your deployments, you can use the following script:

kubectl -n $DR_CORE_NAMESPACE exec deploy/mmapp-app -it -- /entrypoint python3 tools/custom_model/deployments_tool.py --api-url <DATAROBOT_URL> --api-token <DATAROBOT_API_TOKEN> --action activate --wait-to-finilize 

where:
- DR_CORE_NAMESPACE is the namespace where DataRobot is installed;
- DATAROBOT_URL is the URL of the DataRobot application (e.g. https://app.datarobot.com);
- DATAROBOT_API_TOKEN is an API token for a user with the MLOps admin role and access to all target deployments in the organisation set up in your DataRobot app. The token can be retrieved from the API keys and tools section of Settings in the UI.

Note that if you have multiple organisations in your DataRobot app, you must run the script separately for each organisation.

This script will:
- take all deployments visible to the owner of DATAROBOT_API_TOKEN (again, ensure that the user has the MLOps admin role and access to all target deployments);
- send an activation request for each deployment;
- by default, wait for each activation to finish before moving on to the next one. If a deployment fails to activate, a corresponding message is shown. If you remove the --wait-to-finilize flag, the script does not wait for activations to finish (in that case you can track the progress in the Deployments section of the Console tab).
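Because the script has to be run once per organisation, the command above can be wrapped in a loop over one API token per organisation. This is only a sketch: the function name is hypothetical, DR_CORE_NAMESPACE and DATAROBOT_URL are assumed to be set in the shell, and the tokens are passed as arguments (how you store them is up to you). The --wait-to-finilize flag is spelled exactly as in the command above.

```shell
# Run the activation tool once per organisation.
# Arguments: one API token per organisation (MLOps admin user in each).
# ASSUMPTION: DR_CORE_NAMESPACE and DATAROBOT_URL are already set.
activate_all_orgs() {
    failures=0
    for token in "$@"; do
        # Relies on the command's exit status, which may not reflect
        # the failure of individual deployments inside the tool.
        if ! kubectl -n "$DR_CORE_NAMESPACE" exec deploy/mmapp-app -it -- \
            /entrypoint python3 tools/custom_model/deployments_tool.py \
            --api-url "$DATAROBOT_URL" --api-token "$token" \
            --action activate --wait-to-finilize; then
            echo "activation run failed for one of the tokens" >&2
            failures=$((failures + 1))
        fi
    done
    # Succeed only when every run exited cleanly
    [ "$failures" -eq 0 ]
}
```

A call would look like: activate_all_orgs <TOKEN_FOR_ORG_1> <TOKEN_FOR_ORG_2>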

This method has a couple of downsides worth mentioning:
- it requires a user with MLOps admin access to all deployments in the environment;
- custom metrics and monitoring jobs scheduled on the deployments will fail during downtime, although they catch up after the deployment is reactivated;
- service stats are cleared when a deployment is deactivated (they are recalculated during reactivation).

Note that the tool might be missing in older versions of the app. If the command above fails because the script is missing, contact our support team and they can provide the script manually. Running the script requires Python 3.10 with the requests and urllib3 packages installed. Assuming Python is already installed in your environment, you can run it like this:

pip3 install requests urllib3
python3 <PATH_TO_THE_SCRIPT>/deployments_tool.py --api-url <DATAROBOT_URL> --api-token <DATAROBOT_API_TOKEN> --action activate --wait-to-finilize