# Notebooks

Hosted data science notebooks in which users can execute DataRobot API calls and other custom code-first workflows. Notebooks are Jupyter-compatible and provide centralized governance, enterprise-grade security, and reproducibility, allowing data science teams to collaborate across the code development experience.
Starting with DataRobot 9.1, Notebooks are enabled by default.
Starting with DataRobot 10.0, Codespaces are enabled by default.
Starting with DataRobot 10.2, custom execution environments are enabled by default.
Note that Notebooks require an additional CRD.
⚠️ Warning: Some notebooks services are now installed as separate sub-charts and no longer share configs via the `notebooks.<service>` section. Refer to the 11.0 changes at the end of this document for more details.

Here is the list of Notebooks services (services prefixed with `notebooks.` are configured under the `notebooks` section):
- notebooks.audit-log
- notebooks.code-assistance
- notebooks.code-nuggets
- notebooks.data-retention
- notebooks.environment-variables
- notebooks.execution-environments
- notebooks.filesystems
- notebooks.notebook-import
- notebooks.notebook-revisions
- notebooks.notebooks
- notebooks.orchestrator
- notebooks.permissions
- notebooks.scheduling
- notebooks.terminals
- notebooks.usage-tracking
- notebooks.users
- nbx-ingress
- nbx-websockets
- nbx-operator
- nbx-vcs
- cfx-session-port-proxy
By default, each service runs with a replica count of 1.
To configure the replica count for a particular service, use values.yaml like this:
```yaml
notebooks: # for services with prefix `notebooks.` from the list above
  <notebook-service-name>:
    deployment:
      replicaCount: 2
nbx-ingress: # for services without prefix `notebooks.`
  deployment:
    replicaCount: 2
cfx-session-port-proxy:
  ...
```
Alternatively, you can override the replica count for all notebooks services:
```yaml
notebooks: # only affects services with prefix `notebooks.` from the list above
  global:
    deployment:
      replicaCountOverride: 2
```
## Configuration

To override configs for a notebooks service, use the `configs.data` section:
```yaml
notebooks:
  # Common config variables for all services
  <notebook-service-name>:
    configs:
      data:
        ENVIRONMENT: production
        LOG_LEVEL: INFO
        DEBUG: false
        TELEMETRY_ENABLED: false
        TELEMETRY_EXPORTER: otlp
        TELEMETRY_OTLP__PROTOCOL: grpc
        TELEMETRY_OTLP__ENDPOINT: http://localhost:8200
        TELEMETRY_OTLP__INSECURE: true
        TELEMETRY_METRIC_EXPORT_INTERVAL: 5000
  # Service-specific configs
  orchestrator:
    configs:
      data:
        ACTIVE_NOTEBOOK_SESSIONS_LIMIT_PER_USER: 2 # limit on active running sessions per user
        RUNNER_EPHEMERAL_STORAGE_LIMIT: "2Gi"
        NOTEBOOK_STARTUP_TIMEOUT_MIN: 15 # limit for session startup
        KERNEL_EPHEMERAL_STORAGE_LIMIT: "10Gi"
  filesystems:
    configs:
      data:
        MOUNT_DIR: "/home/notebooks/storage"
        ACCESS_MODES: '["ReadWriteOnce"]'
        SIZE: "10Gi"
        STORAGE_CLASS_NAME: default # overrides the default storage class for the notebooks filesystem (https://kubernetes.io/docs/concepts/storage/storage-classes/)
  websocket:
    configs:
      data:
        WEBSOCKET_MAX_PARSER_BODY_SIZE: "64mb"
        WEBSOCKET_TOKEN_EXPIRATION_TIME: "1d"
        WEBSOCKET_IO_PATH: "/socket.io-nbx"
```
You can also set configs for all notebooks services prefixed with `notebooks.`:
```yaml
notebooks:
  global:
    configMap:
      data:
        ENVIRONMENT: production
        LOG_LEVEL: INFO
        DEBUG: false
```
One notable runner configuration is the ability to alter the default values for the liveness and readiness probes. These values in the operator's config section can optionally be altered like so:
```yaml
RUNNER_CONFIG: |
  startupProbe:
    initialDelaySeconds: 42
    failureThreshold: 333
  livenessProbe:
    initialDelaySeconds: 42
    timeoutSeconds: 41
    periodSeconds: 40
    failureThreshold: 39
    httpGet:
      path: /someOtherPath
      port: 8976
```
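For context, a sketch of where this might live in values.yaml, assuming the operator accepts its config through the same `configs.data` pattern used by the other services in this document (the exact config section name for the operator is an assumption, not confirmed by this guide):

```yaml
# Assumption: the operator takes RUNNER_CONFIG via the same configs.data
# layout used by the notebooks services above.
nbx-operator:
  configs:
    data:
      RUNNER_CONFIG: |
        livenessProbe:
          initialDelaySeconds: 42
```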
Notebooks currently has one (deprecated) private preview feature:

- Notebooks Filesystem Management - persistent filesystem management for notebooks (deprecated in 9.2)
This feature can be toggled in User Settings or in the values.yaml like this:
```yaml
core:
  config_env_vars:
    # ...
    ENABLE_NOTEBOOKS_FILESYSTEM_MANAGEMENT: "False" # deprecated in 9.2, use Codespaces instead
```
To configure these options, refer to the Tuning DataRobot Environment Variables section of this guide.
## Notebooks and Codespaces active sessions limit

As seen in the Configuration section above, there is a cluster-level environment variable ACTIVE_NOTEBOOK_SESSIONS_LIMIT_PER_USER. This is the number of active DataRobot Notebook/Codespace sessions allowed per user. Note that this works in tandem with an org-level limit for active sessions that can be set in the Admin UI; the org-level limit defaults to 5 and can be set as high as 100.
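For example, to raise the cluster-level limit, set the variable via the orchestrator's `configs.data` section shown earlier (the value 3 here is purely illustrative):

```yaml
notebooks:
  orchestrator:
    configs:
      data:
        ACTIVE_NOTEBOOK_SESSIONS_LIMIT_PER_USER: 3 # allow up to 3 concurrent sessions per user
```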
## Notebooks machine sizes

Notebooks supports the following machine sizes: XS, S, M, L, XL, XXL, and XXXL.
Each machine size corresponds to a different resource allocation (CPU, RAM).
Not all of them are available by default: initially, only XS, S, M, and L are present in the list of sizes.
To be able to select a bigger one, update the NOTEBOOKS_MAX_MACHINE_SIZE eng config on the monolith side (the default value is L), e.g.:
```yaml
core:
  config_env_vars:
    NOTEBOOKS_MAX_MACHINE_SIZE: "XL"
```
Machine size changes only the CPU and RAM resource requests. There is also a way to change the persistent file storage size:
```yaml
notebooks:
  filesystems:
    configs:
      data:
        SIZE: "20Gi"
```
## Notebooks filesystem

If either the notebook filesystem feature or the Codespaces feature (enabled by default) is enabled, persistent volumes need to be configured by setting a storage class name.
There are several ways to do this, in the following order of priority:

1. The STORAGE_CLASS_NAME config item (for more details, see the Configuration section and the example below)
2. storageClassName from the global section (for more info, see Persistent Storage)
3. If the previous values are empty, Helm will try to run a lookup operation to get the default storage class name (note: you need permissions to run such an operation)
4. The default value, ebs-standard
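A minimal sketch of option 1, overriding the storage class via the filesystems config (the class name gp3 is hypothetical; use a storage class that exists in your cluster):

```yaml
notebooks:
  filesystems:
    configs:
      data:
        STORAGE_CLASS_NAME: "gp3" # hypothetical class name; must exist in your cluster
```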
NOTE: If Volume Snapshots are not configured, then Codespaces are *not* restorable.
## Data Retention

The data-retention service handles hard deletion on a periodic basis. By default, a notebook and its related resources are soft-deleted at the time of deletion. The data-retention service then hard-deletes them (assuming there has been no request for restoration) after a certain period of time. The default retention period is 30 days; this value can be altered to be either shorter or longer. See below for an example.
```yaml
notebooks:
  data-retention:
    configs:
      data:
        RETENTION_WINDOW: "7d" # 7-day retention - examples of other valid values: "30min", "6h", "90d", etc.
```
NOTE: If Volume Snapshots are not configured, then Codespaces are *not* restorable.
## Codespaces

Codespaces are the next generation of the notebook filesystem functionality.
Starting with 9.2, the notebook filesystem feature (feature flag: ENABLE_NOTEBOOKS_FILESYSTEM_MANAGEMENT) is deprecated. For new installations (or installations where the notebook filesystem feature had not previously been enabled), only Codespaces should be used (enabled by default).
For existing installations that already have the notebook filesystem feature enabled, Codespaces should be used (enabled by default) alongside the existing notebook filesystem feature.
## Custom Environments

Custom Environments in Notebooks allow users to build, manage, and use environment images tailored to their data science workflows.
The custom environments feature has been enabled by default since DataRobot 10.2.
This feature is built on top of Custom Models, so please check that page as well.
Note that for custom environments to work properly, in addition to Custom Models being enabled and configured, the notebooks execution-environments service must be aware of the Custom Models image registry.
There are two ways to achieve that. The first is to make sure that both IMAGE_BUILDER_CUSTOM_MODELS_REGISTRY_HOST and IMAGE_BUILDER_CUSTOM_MODELS_ENVIRONMENT_REGISTRY_REPO are configured globally (via the modeling-envvars configmap) according to this section. The second is to set the EXECUTION_ENVIRONMENTS_REGISTRY_REPO config directly for the notebooks execution-environments service like this:
```yaml
notebooks:
  execution-environments:
    configs:
      data:
        EXECUTION_ENVIRONMENTS_REGISTRY_REPO: "<custom models base image registry>" # usually equals the custom models IMAGE_BUILDER_CUSTOM_MODELS_REGISTRY_HOST + IMAGE_BUILDER_CUSTOM_MODELS_ENVIRONMENT_REGISTRY_REPO configuration
        # some example values:
        # EXECUTION_ENVIRONMENTS_REGISTRY_REPO: 1234567890.dkr.ecr.us-east-1.amazonaws.com (base-image suffix added automatically)
        # EXECUTION_ENVIRONMENTS_REGISTRY_REPO: 1234567890.dkr.ecr.us-east-1.amazonaws.com/custom-models (base-image suffix added automatically)
        # EXECUTION_ENVIRONMENTS_REGISTRY_REPO: 1234567890.dkr.ecr.us-east-1.amazonaws.com/custom-models/base-image
        # EXECUTION_ENVIRONMENTS_REGISTRY_REPO: 1234567890.dkr.ecr.us-east-1.amazonaws.com/custom-models/<custom-image-name>
```
To disable custom environments at the cluster level, use the DISABLE_CUSTOM_ENVS config (false by default):
```yaml
notebooks:
  execution-environments:
    configs:
      data:
        DISABLE_CUSTOM_ENVS: true
```
### Create an environment

These steps use the "python311_notebook_base" environment.
The environment data must be uploaded as a .tar, .gz, or .zip.
We provide example public environments here: https://github.com/datarobot/datarobot-user-models/tree/master/public_dropin_notebook_environments
You can clone the repository and create the tarball yourself, or use the pre-made tarball called py311_notebook_dropin.tar.gz in the public_dropin_notebook_environments folder.

1. Navigate to Registry > Environments
2. Click "+ Add environment"
3. Give the environment a name, upload the tarball, set the language to Python, and select "Notebooks" as the environment type
4. Press the "Add environment" button
## Notebook instance deployments

Each notebook instance has two running deployments: kernel and runner. Here is the default config of the notebook operator that is responsible for running these deployments:
```yaml
nbx-operator:
  imagePullSecrets:
    - name: datarobot-image-pullsecret
```
By default, it uses datarobot-image-pullsecret (the common DataRobot secret).
If you need to pull notebooks images with other pull secrets, you can specify them here.
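For instance, a hypothetical additional secret for a private registry could be appended to the list (the second secret name is illustrative):

```yaml
nbx-operator:
  imagePullSecrets:
    - name: datarobot-image-pullsecret
    - name: my-registry-pullsecret # hypothetical secret for your private registry
```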
## Revision storage at AWS S3

On-premise installations of DataRobot support storing notebook revision files in Amazon S3.
To configure this, set an IAM role with appropriate permissions (e.g. AWS_IRSA_ROLE_NAME) in values.yaml like this:
```yaml
notebooks:
  notebook-revisions:
    serviceAccount:
      annotations:
        eks.amazonaws.com/role-arn: AWS_IRSA_ROLE_NAME
```
Notebook revision files will be stored under the <S3_BUCKET>/<FILE_STORAGE_PREFIX>nbx_rev folder. (FILE_STORAGE_PREFIX already contains a trailing slash, e.g. data/.)
## GPU Usage Configuration

To manage GPU utilization, the operator uses GPU-related configurations, which can be set via the MACHINE_GPU_1, MACHINE_GPU_2, MACHINE_GPU_3, and MACHINE_GPU_4 env vars.
By default, no kernels are configured to work with GPUs. To use a GPU, create a custom notebook environment. Our Python GPU notebook is available at https://github.com/datarobot/datarobot-user-models/tree/master/public_dropin_notebook_environments/python39_notebook_gpu.
For setup instructions, see Custom Environments.
NOTE: in release 10.2, all notebook drop-in environments except Python 3.11 Base have been deprecated and removed.
### Node Selectors

Node selectors ensure that pods are scheduled on nodes with GPU capabilities. They are key-value pairs added to the nodeSelector term of the pod specification, enforcing that the pod is scheduled on nodes with the corresponding labels.
Example:

```yaml
nodeSelectors:
  datarobot.com/instance-type: accelerated-computing
  datarobot.com/node-capability: gpu
  datarobot.com/node-type: on-demand
```
### Toleration

Tolerations enable pods to be scheduled on nodes with specific taints, generally used here for GPU resources. To disable the usage of a toleration, either omit the field or explicitly set it to null.
Example with toleration:

```yaml
toleration:
  effect: "NoExecute"
  key: "nvidia.com/gpu"
  operator: "Equal"
  value: "true"
```

Example with toleration disabled:

```yaml
toleration: null
```
### Usage
Example Helm Values:
```yaml
notebooks:
  execution-environments:
    configs:
      data:
        MACHINE_GPU_1: |
          name: "nvidia-g6-single"
          cpu: "5000m"
          memory: "54Gi"
          gpu: "1"
          ephemeral_storage: "15Gi"
          node_selector:
            datarobot.com/node-capability: gpu
            datarobot.com/node-type: on-demand
            datarobot.com/gpu-type: nvidia-g6-single
          tolerations:
            - effect: "NoExecute"
              key: "nvidia.com/gpu"
              operator: "Equal"
              value: "true"
```
where:

- name: any name for the GPU resource; this is what will be shown in the UI
- cpu, memory: provisioned resource values for CPU and RAM
- ephemeral_storage: temporary storage size (not persistent; destroyed when the notebook stops)
- gpu: GPU count (must be at least 1 for a GPU config)
- node_selector, tolerations: see the Node Selectors and Toleration sections

⚠️ Notebooks require an additional 1 GB of RAM and 0.5-1 vCPU for system needs, so it is recommended to reserve those resources. E.g., if the required GPU config is 16 GB / 5 vCPU, it is better to set it up as 15 GB / 4 vCPU.
The Notebooks service supports up to four GPU machine configurations, which can be defined via the MACHINE_GPU_1, MACHINE_GPU_2, MACHINE_GPU_3, and MACHINE_GPU_4 env vars.
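For example, a second, smaller profile could sit alongside the MACHINE_GPU_1 config from the Usage example above (all names and values here are illustrative, not recommendations; when node_selector and tolerations are omitted, the defaults from the next section apply):

```yaml
notebooks:
  execution-environments:
    configs:
      data:
        MACHINE_GPU_2: |
          name: "nvidia-t4-single" # illustrative name; shown in the UI
          cpu: "3000m"
          memory: "26Gi"
          gpu: "1"
          ephemeral_storage: "10Gi"
```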
There are no recommended RAM/CPU values for GPU configs; everything depends on node capacity and machine type. Check as reference:

- Google Compute Engine
- Amazon EC2 Instances with GPU
### Default GPU Configurations

If MACHINE_GPU_* is not defined, the execution-environments service uses the following defaults:

- NodeSelectors:
  - datarobot.com/node-capability: "gpu"
  - datarobot.com/node-type: "on-demand"
  - datarobot.com/gpu-type: "nvidia-g6-single"
- Toleration:
  - effect: "NoExecute"
  - key: "nvidia.com/gpu"
  - operator: "Equal"
  - value: "true"
## Notebook resource configuration

Each of the notebook resource types (XS, S, M, L, XL, XXL, XXXL) can be configured via environment variables. The default values are:
```yaml
notebooks:
  execution-environments:
    configs:
      data:
        # XS - 1 CPU - 4 GB RAM
        MACHINE_XSMALL__CPU: '1000m'
        MACHINE_XSMALL__RAM: '4Gi'
        # S - 2 CPU - 8 GB RAM
        MACHINE_SMALL__CPU: '2000m'
        MACHINE_SMALL__RAM: '8Gi'
        # M - 4 CPU - 16 GB RAM
        MACHINE_MEDIUM__CPU: '4000m'
        MACHINE_MEDIUM__RAM: '16Gi'
        # L - 8 CPU - 32 GB RAM
        MACHINE_LARGE__CPU: '8000m'
        MACHINE_LARGE__RAM: '32Gi'
        # XL - 15 CPU - 63 GB RAM
        MACHINE_XLARGE__CPU: '15000m'
        MACHINE_XLARGE__RAM: '63Gi'
        # XXL - 16 CPU - 256 GB RAM
        MACHINE_XXLARGE__CPU: '16000m'
        MACHINE_XXLARGE__RAM: '256Gi'
        # XXXL - 32 CPU - 256 GB RAM
        MACHINE_XXXLARGE__CPU: '32000m'
        MACHINE_XXXLARGE__RAM: '256Gi'
```
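To override a single size, set only the variables you need. A minimal sketch that bumps the M size's memory (the 24Gi value is illustrative):

```yaml
notebooks:
  execution-environments:
    configs:
      data:
        MACHINE_MEDIUM__RAM: '24Gi' # illustrative: raise M from the default 16Gi
```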
## Private Git Repositories

Private git integration has been available since DataRobot 11.0. It supports only the cloud versions of GitHub, GitLab, Bitbucket, and Microsoft Azure Repos; self-hosted versions will be supported in future releases.
If you want to work with private git repositories stored in GitHub Cloud, GitLab Cloud, Bitbucket Cloud, or Azure Repos Cloud, you should first configure the OAuth applications in the DataRobot cluster.
After that, DataRobot users will be able to authorize their accounts and get access to their private repos.
A note for GitHub users: you will have to install the OAuth application created above into your organization in order to give that OAuth application access to the organization's private repositories. Users will need to install the same OAuth application into their GitHub accounts if they want to work with their personal private git repositories.
## Kernel Gateway Authorization

There is an opt-in feature that adds the usage of a KG_AUTH_TOKEN to all kernel gateway requests, i.e. any requests made to the kernel pods (from the associated runner pod). Docs for the package can be seen here.
To opt in, add the ENABLE_KERNEL_GATEWAY_TOKEN value, set to True, to the orchestrator config.
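Following the orchestrator `configs.data` pattern from the Configuration section, a minimal sketch of the opt-in:

```yaml
notebooks:
  orchestrator:
    configs:
      data:
        ENABLE_KERNEL_GATEWAY_TOKEN: "True" # opt in to kernel gateway token auth
```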
This feature adds a layer of security to runner-to-kernel communication, which can be helpful if the default cluster networking policies are not in place. By default, those policies ensure that runner A can only communicate with kernel A and not kernel B; for an on-prem setup that does not have them, this feature can be enabled to prevent such cross-communication via 401 HTTP responses instead.
This feature is not enabled by default because it is not normally needed, and because it relies on a modified code path in the NBX operator that slightly lowers the reliability of the kernel's startup probe. In that case the probe uses TCP instead of HTTP, so there is a slight chance that the kernel gateway server is indicated as ready before it is fully up and running - however, this chance is quite small and an acceptable risk here.
## Troubleshooting

### Hydra (10.0 or below)

Notebooks services use Hydra and can occasionally be unstable. During provisioning, hydra pods may fail on DB migration.
One solution is to remove the hydra pods; they should come up healthy the next time.
If you still have issues with the notebooks services, it's good to check the custom resource oauth2clients.hydra.ory.sh.
To check it, run: kubectl -n DR_CORE_NAMESPACE get oauth2clients.hydra.ory.sh
It should return several records with the nbx- prefix (there may be more, but these should be present for notebooks).
Each of these clients should have an associated secret. To check, run kubectl -n DR_CORE_NAMESPACE get secrets | grep "oauth2".
If a secret was not created, check the hydra and hydra-maester pods.
You can try to remove these pods and wait for new ones to be created; the appropriate secrets should be created after that.
If the secret was still not created after the previous operation, you can try to delete the oauth2clients custom resources using this command:

```
kubectl get oauth2clients.hydra.ory.sh -o custom-columns=NAME:.metadata.name | grep 'nbx-' | xargs kubectl delete oauth2clients.hydra.ory.sh
```

Then run helm upgrade. That should trigger hydra-maester to retry creating the secrets.
### Scaledown to 0

Scaling the cluster down to 0 can lead to problems with hydra secrets as well. To mitigate the risk of such problems, consider keeping the hydra deployments at one pod per deployment. If this issue occurs, follow the same procedure described above:

```
kubectl -n DR_CORE_NAMESPACE delete oauth2clients.hydra.ory.sh <oauth2_client_name>
helm upgrade --install ...
```
### Helm release limitation for version 9.1.1 and prior

For version 9.1.1 and earlier, there is a limitation regarding the Helm release name: it must contain dr.
This limitation has been identified and fixed in the 9.1.2 and 9.2 releases.
## Notebooks Upgrade Guide

In Helm, the helm upgrade command does not automatically upgrade Custom Resource Definitions (CRDs). Outdated CRDs can lead to compatibility issues and may cause unexpected behavior in notebooks functionality. Follow the instructions related to notebooks CRDs.
### 9.1

In the 9.1 upgrade, we've made significant changes to the naming standards in the values file for notebooks.

#### Changes

Prior to version 9.1, services related to notebooks in the values file used the prefix notebooks-. For instance, the notebook operator was defined as:
```yaml
notebooks:
  notebooks-notebook-operator:
    imagePullSecrets:
      - name: datarobot-image-pullsecret
```
Starting with 9.1, the notebooks- prefix has been removed, leading to a simpler structure. The above example would now be:
```yaml
notebooks:
  notebook-operator:
    imagePullSecrets:
      - name: datarobot-image-pullsecret
```
### 10.1

In 10.1, we've added notebooks machine resource configuration.

#### Changes

For the execution-environments service, there is a separate configuration for each machine size (XS, S, M, L, XL, XXL, XXXL).
The default values are:
```yaml
notebooks:
  execution-environments:
    configs:
      data:
        # XS - 1 CPU - 4 GB RAM
        MACHINE_XSMALL__CPU: '1000m'
        MACHINE_XSMALL__RAM: '4Gi'
        # S - 2 CPU - 8 GB RAM
        MACHINE_SMALL__CPU: '2000m'
        MACHINE_SMALL__RAM: '8Gi'
        # M - 4 CPU - 16 GB RAM
        MACHINE_MEDIUM__CPU: '4000m'
        MACHINE_MEDIUM__RAM: '16Gi'
        # L - 8 CPU - 32 GB RAM
        MACHINE_LARGE__CPU: '8000m'
        MACHINE_LARGE__RAM: '32Gi'
        # XL - 15 CPU - 63 GB RAM
        MACHINE_XLARGE__CPU: '15000m'
        MACHINE_XLARGE__RAM: '63Gi'
        # XXL - 16 CPU - 256 GB RAM
        MACHINE_XXLARGE__CPU: '16000m'
        MACHINE_XXLARGE__RAM: '256Gi'
        # XXXL - 32 CPU - 256 GB RAM
        MACHINE_XXXLARGE__CPU: '32000m'
        MACHINE_XXXLARGE__RAM: '256Gi'
        # GPU
        MACHINE_GPU_DEFAULT__CPU: '8000m'
        MACHINE_GPU_DEFAULT__RAM: '61Gi'
```
### 10.2

In the 10.2 release, we've made significant changes to notebooks GPU configuration.

#### Changes

Notebooks GPU configs have moved to the execution-environments service, which supports up to four separate configurations. GPU_CONFIG has been removed from the operator's service.
Example of the new configuration setup:
```yaml
notebooks:
  execution-environments:
    configs:
      data:
        MACHINE_GPU_1: |
          name: "nvidia-g6-single"
          cpu: "5000m"
          memory: "54Gi"
          gpu: "1"
          ephemeral_storage: "15Gi"
          node_selector:
            datarobot.com/node-capability: gpu
            datarobot.com/node-type: on-demand
            datarobot.com/gpu-type: nvidia-g6-single
          tolerations:
            - effect: "NoExecute"
              key: "nvidia.com/gpu"
              operator: "Equal"
              value: "true"
```
### 11.0

The major change in the 11.0 release is the extraction of some notebooks services from the notebooks umbrella chart into separate sub-charts (and repositories).
This change requires special attention during the upgrade process, since some configs previously defined in the notebooks section must now be set on a per-service basis.

#### Changes
Some services have been extracted from the notebooks umbrella chart into separate sub-charts:
- notebooks.ingress -> nbx-ingress
- notebooks.websockets -> nbx-websockets
- notebooks.operator -> nbx-operator
- notebooks.session-port-proxy -> cfx-session-port-proxy
To ensure a smooth upgrade, please follow the instructions below:

1. Back up your current values.yaml file. Before making any changes, ensure you have a backup of your current configuration.
2. Move service-specific overrides for all services from the list above into dedicated top-level sections. For example, for the notebooks.notebooks-operator service:
```yaml
notebooks:
  notebooks-notebook-operator: # this is how it's configured before 11.0
    imagePullSecrets:
      - name: datarobot-image-pullsecret
nbx-operator: # this is how it must be configured after 11.0
  imagePullSecrets:
    - name: datarobot-image-pullsecret
```
3. Review the notebooks.global section, as well as any other non-service-specific overrides in the notebooks. section, and make sure to replicate them for each of the new sub-charts, following the example in point 2 (see the sketch after this list).
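For instance, a log level previously set once under notebooks.global would be replicated per sub-chart; a sketch, assuming the extracted sub-charts accept the same configs.data layout used under the notebooks section (not confirmed by this guide):

```yaml
# Assumption: each extracted sub-chart takes overrides in the same
# configs.data shape used by the notebooks services.
nbx-ingress:
  configs:
    data:
      LOG_LEVEL: INFO
nbx-websockets:
  configs:
    data:
      LOG_LEVEL: INFO
```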
### 11.4

As of 11.4, the MAX_FILES_UPLOAD_REQUEST_SIZE and MAX_FILES_UPLOAD_SIZE values for the filesystems service are no longer used.

#### Changes

The MAX_FILES_UPLOAD_REQUEST_SIZE and MAX_FILES_UPLOAD_SIZE values for the filesystems service are no longer used for the limits applied to uploading Codespace files. Instead, there is an org-level setting in the DataRobot Admin UI, "Max codespace file upload size (in megabytes)", which determines the maximum upload size allowed.
