Restore long running services¶
DataRobot provides a Kubernetes operator known as the Long Running Services (LRS) operator to fulfill dynamic compute requests for DataRobot feature services. These workloads are intended to run for long periods of time, or indefinitely after deployment. The operator itself doesn't require any backup or restore procedures beyond the standard installation instructions. However, applications launched via the LRS operator have varying backup and restoration requirements depending on the workload type.
DataRobot identifies the LRS workload type using a label on the resource: datarobot-lrs-type=<workload-type>.
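The label can be used as a kubectl selector to see which LRS resources of a given type exist. The following is a minimal sketch, not a documented DataRobot procedure: the `LRS_RESOURCE` variable and its default `lrs` are assumptions (check `kubectl api-resources` for the resource name your LRS operator registers), and `NAMESPACE` is assumed to hold the DataRobot namespace.

```shell
# Build the label selector for one LRS workload type.
lrs_selector() {
  printf 'datarobot-lrs-type=%s' "$1"
}

# List LRS resources of the given workload type.
# LRS_RESOURCE is a hypothetical variable; "lrs" is an assumed default.
list_lrs_by_type() {
  kubectl -n "$NAMESPACE" get "${LRS_RESOURCE:-lrs}" -l "$(lrs_selector "$1")"
}
```

For example, `list_lrs_by_type tracking_agent` would list the resources labeled `datarobot-lrs-type=tracking_agent`.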
Interactive Spark sessions¶
No restoration process is necessary for Interactive Spark sessions. Once the core DataRobot application is back online, Wrangler automatically initiates a new LRS session, provided any stale sessions were removed during the backup procedure.
Custom applications¶
Custom applications require restoration of both MongoDB collections and LRS objects for complete functionality.
Prerequisites¶
Note
You must fulfill the prerequisites before proceeding.
Ensure the following have been restored before proceeding:
- MongoDB collections: custom_applications, longrunningservices, custom_application_images, custom_application_image_versions, execute_docker_images, workspace_items.
- File storage (S3/Gluster/HDFS) if applications use application sources or persistent data.
- Docker registry images if using the same or a replicated registry.
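One way to confirm the MongoDB prerequisite is to compare the collections present in the restored database against the list above. This is a sketch under assumptions: `MONGO_URI` is a hypothetical connection string for the DataRobot database, and `mongosh` is assumed to be available where you run the check.

```shell
# Collections named in the prerequisites above.
REQUIRED_COLLECTIONS="custom_applications longrunningservices custom_application_images custom_application_image_versions execute_docker_images workspace_items"

# Read collection names on stdin (one per line) and print every
# required collection that is not among them.
missing_collections() {
  present=$(cat)
  for name in $REQUIRED_COLLECTIONS; do
    printf '%s\n' "$present" | grep -qxF "$name" || printf '%s\n' "$name"
  done
}
```

A possible invocation is `mongosh "$MONGO_URI" --quiet --eval 'db.getCollectionNames().forEach(n => print(n))' | missing_collections`. Empty output means all six collections are present.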
Automatic restoration via MongoDB¶
If you have backed up and restored the longrunningservices collection along with custom_applications, the applications should automatically be restored and functional. The LRS operator detects the restored LRS objects and recreates the necessary Kubernetes deployments.
To verify restoration:
- Navigate to Registry > Applications in the DataRobot UI
- Check that your Custom Applications appear in the list
- Click Open on an application
- If the application loads successfully, restoration is complete
Manual restoration (if LRS objects are missing)¶
If you didn't back up the longrunningservices collection, you must manually recreate applications through the UI. This approach requires users to re-upload or rebuild their applications.
To restore custom applications from application sources:
- Go to Registry > Applications > Application sources tab
- Select the application source from which you want to build the application
- Click Build application
To restore Custom Applications from Docker Image:
- Go to Registry > Applications page
- Click the dropdown next to Add new application source
- Select Upload application
- Upload the Application Docker Image
- Click Create application
To restore Custom Applications from Application Templates:
- Go to Registry > Applications page
- Click the dropdown next to Add new application source
- Select Create new application from template
Alternatively, on the DataRobot homepage, click Explore application templates, then select either Open in a codespace or Copy repository URL. Follow the README instructions to create the application.
Note
Custom Applications are paused after a period of inactivity. To resume a paused custom application, click Open. A loading screen appears while the Application restarts. The Application data is persisted after the Application restarts.
Custom (training) task¶
Custom tasks are restored as part of the custom models restore and require no additional action. The restore only affects deployments that use a custom task. All other actions (training, scoring, insights, etc.) use ephemeral LRSs that are created and destroyed as needed.
Custom models LRSes¶
If an error occurs during app upgrades or cluster migrations, you may need to perform the restore procedure for custom model LRSes. There are two ways to restore custom models: using LRS YAML files that were created according to the LRS backup instructions or reactivating custom models from the UI.
Recreate LRSes from YAML LRS definitions¶
Assuming you have already backed up custom model LRSes into the folder lrs_backup_2025_01_01_12_00 using the LRS backup instructions, you can restore an LRS resource named lrs-xxxxx directly in the Kubernetes cluster by running the following command:
kubectl apply -f lrs_backup_2025_01_01_12_00/lrs-xxxxx.yaml
This recreates the LRS if it doesn't exist, or rolls it back to the backed-up state if its definition has changed.
Note
This command requires kubectl to be configured in your shell with access to the target cluster.
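Before applying a backup, it can help to preview what would change. The sketch below lists the manifests in a backup folder and runs `kubectl diff` on each one, which reports differences without modifying the cluster. It assumes the backup files follow the lrs-*.yaml naming shown above; the function names are illustrative only.

```shell
# Print the LRS manifests found in a backup folder, one per line.
list_lrs_backups() {
  find "$1" -maxdepth 1 -type f -name 'lrs-*.yaml' | sort
}

# Show what applying each manifest would change, without changing the cluster.
preview_lrs_restore() {
  list_lrs_backups "$1" | while IFS= read -r manifest; do
    echo "== $manifest"
    # kubectl diff exits non-zero when differences exist, so ignore its status.
    kubectl diff -f "$manifest" || true
  done
}
```

For example, `preview_lrs_restore lrs_backup_2025_01_01_12_00` prints a diff per manifest. You can then apply only the files you actually want to restore with `kubectl apply -f`.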
Deactivate/Reactivate custom models from the UI¶
Alternatively, deactivate and reactivate the custom models. This approach is more convenient because it uses the DataRobot API to start custom models, which applies all the required business rules and code. It is safer than recreating LRSes directly from YAML definitions.
Custom models can be reactivated manually from the UI, which lets you start only the deployments you need. To reactivate all deployments, use the following script:
kubectl -n $NAMESPACE exec deploy/mmapp-app -it -- /entrypoint python3 tools/custom_model/deployments_tool.py --api-url DATAROBOT_URL --api-token DATAROBOT_API_TOKEN --action activate
Note
- Replace DATAROBOT_URL with the DataRobot app URL (e.g., https://app.datarobot.com).
- Replace DATAROBOT_API_TOKEN with an API token for the user with the MLOps admin role and access to all target deployments. You can retrieve the token in the API keys and tools section in Account settings in the UI.
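If you prefer to keep the URL and token out of the command text, one option is to hold them in environment variables and check they are set before running the activation command above. This is a sketch, not part of the DataRobot tooling: treating `DATAROBOT_URL` and `DATAROBOT_API_TOKEN` as environment variables is an assumption, and `require_vars` and `activate_all_deployments` are hypothetical helper names.

```shell
# Fail early if any named environment variable is unset or empty.
require_vars() {
  for var in "$@"; do
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $var" >&2
      return 1
    fi
  done
}

# Run the activation command shown above once the inputs are confirmed.
activate_all_deployments() {
  require_vars NAMESPACE DATAROBOT_URL DATAROBOT_API_TOKEN || return 1
  kubectl -n "$NAMESPACE" exec deploy/mmapp-app -it -- /entrypoint python3 \
    tools/custom_model/deployments_tool.py \
    --api-url "$DATAROBOT_URL" --api-token "$DATAROBOT_API_TOKEN" --action activate
}
```

The check stops the command from being sent into the pod with an empty namespace, URL, or token.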
Note
If you have multiple organizations in your DataRobot app, run the script for each organization separately.
This script:
- Takes all deployments visible to the owner of the DATAROBOT_API_TOKEN (ensure that the user has the MLOps admin role and access to all target deployments).
- Sends an activation request for each deployment.
- By default, doesn't wait for the activation process to finish. Track the progress in the Deployments section in the Console tab of the UI. If a deployment fails to activate, you see a corresponding message. To make the script wait for activation to finish, pass the --wait-to-finilize flag.
Downsides
- To use it, you should have a user with MLOps admin access to all deployments in the environment
- Custom metrics and monitoring jobs that are scheduled on the deployments fail during downtime. They catch up after deployment reactivation.
- Service stats are cleared after a deployment deactivation (they're recalculated during reactivation).
Note
The tool might be missing in older versions of the app. If the command above fails because the script is missing, contact DataRobot support to obtain the script. Run the script with Python 3.10 and the requests and urllib3 packages installed. Assuming Python is already installed in your environment, execute the following commands:
pip3 install requests urllib3
python3 PATH_TO_THE_SCRIPT/deployments_tool.py --api-url DATAROBOT_URL --api-token DATAROBOT_API_TOKEN --action activate --wait-to-finilize
Note
- Replace PATH_TO_THE_SCRIPT with the path to the deployments_tool.py script.
- Replace DATAROBOT_URL with the DataRobot app URL (e.g., https://app.datarobot.com).
- Replace DATAROBOT_API_TOKEN with an API token for the user with the MLOps admin role and access to all target deployments. The token can be retrieved in the API keys and tools section in Settings in the UI.
Tracking and management¶
Tracking and management agents are stateless, so all data they generate and require is already captured in the existing datastore backup and restore procedures (PostgreSQL, MongoDB, etc.). The agent CRDs can be restored in the same way as other LRSs, using the labels datarobot-lrs-type=management_agent and datarobot-lrs-type=tracking_agent.
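Both agent types can be selected in a single query with a set-based label selector. As a sketch under the same assumptions as any kubectl query against LRS resources: `LRS_RESOURCE` is a hypothetical variable with an assumed default of `lrs`, and `NAMESPACE` is assumed to hold the DataRobot namespace.

```shell
# Set-based selector matching both agent workload types named above.
agent_selector() {
  printf '%s' 'datarobot-lrs-type in (management_agent,tracking_agent)'
}

# List the tracking and management agent LRS resources.
# LRS_RESOURCE is a hypothetical variable; "lrs" is an assumed default.
list_agent_lrs() {
  kubectl -n "$NAMESPACE" get "${LRS_RESOURCE:-lrs}" -l "$(agent_selector)"
}
```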