


Object Storage Configuration

DataRobot supports the following object storage types:

  • AWS S3
  • S3 compatible
  • Azure Blob Storage
  • Google Cloud Storage

AWS S3

File storage configuration for S3

FILE_STORAGE_PREFIX: Represents the prefix applied to all paths in the file storage medium after the root path.

FILE_STORAGE_TYPE: Set to s3 for AWS storage.

global:
  filestore:
    type: s3
    environment:
      S3_HOST: s3.us-east-1.amazonaws.com
      S3_BUCKET: <bucket>
      S3_IS_SECURE: "True"
      S3_VALIDATE_CERTS: "True"
      S3_REGION: us-east-1
      S3_PORT: "443"
      S3_SERVER_SIDE_ENCRYPTION: DISABLED
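
FILE_STORAGE_PREFIX is not shown in the sample above. A minimal sketch of adding it, assuming it is passed as an environment variable alongside the other settings:

global:
  filestore:
    type: s3
    environment:
      FILE_STORAGE_PREFIX: <prefix>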

If using the s3 storage type, you must additionally set the variables below.

S3_BUCKET: Name of the S3 bucket to store DataRobot application files in. Your access key ID must belong to an account that has write, read, and list permissions on this bucket.

S3_HOST: IP or hostname of the S3 appliance (e.g. s3.us-east-1.amazonaws.com)

You may additionally set the S3_REGION variable if you want to explicitly specify which region you run in, or if you are using a storage provider that exposes an S3-compatible API.

S3_IS_SECURE: Whether the service uses HTTPS. The default value, "True", has only been tested with AWS S3.

S3_PORT: The port on which the S3 service is running

You may also set MULTI_PART_S3_UPLOAD: false to disable multipart file uploads if you encounter upload issues. In general, multipart uploads are well tested and support much larger files, so you will likely not need to change the default.
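
For example, a minimal sketch of disabling multipart uploads, assuming the flag is passed in the same environment block as the other settings:

global:
  filestore:
    type: s3
    environment:
      MULTI_PART_S3_UPLOAD: "false"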

DataRobot recommends using AWS IAM Roles for Service Accounts (IRSA) to authenticate with S3 storage. If you prefer to use access keys, or are connecting to an S3-compatible API, you will additionally need to add your credentials as environment variables:

AWS_ACCESS_KEY_ID: Access key ID for the account you want to use to connect to S3 storage.

AWS_SECRET_ACCESS_KEY: Secret access key for authenticating your AWS account.
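
A minimal sketch of supplying static credentials (placeholder values shown; prefer IRSA where available):

global:
  filestore:
    type: s3
    environment:
      S3_HOST: s3.us-east-1.amazonaws.com
      S3_BUCKET: <bucket>
      AWS_ACCESS_KEY_ID: <access-key-id>
      AWS_SECRET_ACCESS_KEY: <secret-access-key>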

S3 Ingestion

To enable data ingestion from private objects stored in S3, see the AWS S3 Ingest guide.

Disabling TLS verification

If your environment needs to connect to its own object storage endpoint over unverified TLS, set:

global:
  filestore:
    type: s3
    environment:
      S3_VALIDATE_CERTS: "False"

Server-side encryption settings

The DataRobot application can be configured to enable server-side encryption (SSE) for data at rest when it stores new files to S3 (existing files are not affected). Either an S3-managed key or a customer-managed key (CMK) can be used for encryption.

The following configuration settings are available to configure server-side encryption:

S3_SERVER_SIDE_ENCRYPTION: With the default value AES256, data is encrypted using S3-managed keys. Set to aws:kms to use server-side encryption with KMS-managed keys, or set to DISABLED to disable server-side encryption entirely.

AWS_S3_SSE_KMS_KEY_ID: Encrypt data using a specific KMS key. Set to the identity of a customer-managed key, or leave blank to let AWS create a key on your behalf (see AWS managed CMK). This setting only applies when S3_SERVER_SIDE_ENCRYPTION is set to aws:kms.
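
For example, a minimal sketch of enabling SSE with a customer-managed KMS key (the key identifier below is a placeholder):

global:
  filestore:
    type: s3
    environment:
      S3_SERVER_SIDE_ENCRYPTION: "aws:kms"
      AWS_S3_SSE_KMS_KEY_ID: <kms-key-id-or-arn>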

Note: Server-side encryption means the encryption keys are managed by the S3 service and never exposed to the DataRobot application. If the keys are deleted or access to them is lost, DataRobot will not be able to help decrypt the data.

Note: S3 makes a billable call to the AWS KMS service every time DataRobot reads or writes an encrypted object. Refer to the AWS documentation on reducing the cost of AWS KMS resource usage with SSE.

S3 compatible

DataRobot can also be configured to use MinIO. The example below assumes the Bitnami MinIO chart is deployed in the same namespace as the DataRobot installation.
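
As a hypothetical illustration, the chart could be deployed as follows (the release name, namespace, and credentials are assumptions chosen to match the example configuration below; consult the Bitnami chart documentation for the authoritative parameters):

helm install core-minio oci://registry-1.docker.io/bitnamicharts/minio \
  --namespace DR_CORE_NAMESPACE \
  --set auth.rootUser=miniodatarobot \
  --set auth.rootPassword=miniodatarobot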

File storage configuration for MinIO

global:
  filestore:
    type: s3
    environment:
      S3_HOST: core-minio.DR_CORE_NAMESPACE.svc.cluster.local
      S3_BUCKET: <bucket>
      AWS_ACCESS_KEY_ID: miniodatarobot
      AWS_SECRET_ACCESS_KEY: miniodatarobot
      S3_IS_SECURE: "False"
      S3_VALIDATE_CERTS: "False"
      S3_REGION: us-east-1
      S3_PORT: "9000" # adjust if your MinIO service listens on a different port

If using the s3 storage type, you must additionally set the variables below.

S3_BUCKET: Name of the S3 bucket to store DataRobot application files in. Your access key ID must belong to an account that has write, read, and list permissions on this bucket.

S3_HOST: IP or hostname of the S3 appliance

S3_IS_SECURE: Whether the service uses HTTPS. The default value, "True", has only been tested with AWS S3.

S3_PORT: The port on which the S3 service is running

AWS_ACCESS_KEY_ID: Access key ID for the account you want to use to connect to S3 storage.

AWS_SECRET_ACCESS_KEY: Secret access key for authenticating your AWS account.
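
If the bucket does not already exist, it can be created with the MinIO client; a hypothetical sketch (the alias name and port are assumptions):

mc alias set datarobot http://core-minio.DR_CORE_NAMESPACE.svc.cluster.local:9000 miniodatarobot miniodatarobot
mc mb datarobot/<bucket>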

Azure Blob Storage

DataRobot supports four authentication methods for accessing an Azure Blob Storage container. The recommended method is Workload Identity.

Using Storage AccountKey

When you create a storage account, Azure generates two 512-bit storage account access keys for that account. These keys can be used to authorize access to data in your storage account via Shared Key authorization, or via SAS tokens that are signed with the shared key.

Storage account access keys provide full access to the configuration of a storage account, as well as the data. Always be careful to protect your access keys.

Refer to the Manage storage account access keys Azure documentation for how to retrieve the AccountKey.
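
For example, a hedged sketch of retrieving the first key with the Azure CLI (the resource group name is a placeholder):

az storage account keys list \
  --resource-group <resource-group> \
  --account-name <AZ_BLOB_STORAGE_ACCOUNT_NAME> \
  --query "[0].value" -o tsv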

Once the information is available, the values.yaml section to fill out looks like this:

global:
  filestore:
    type: azure_blob
    environment:
      AZURE_BLOB_STORAGE_CONTAINER_NAME: <AZ_STORAGE_CONTAINER_NAME>
      AZURE_BLOB_STORAGE_ACCOUNT_NAME: <AZ_BLOB_STORAGE_ACCOUNT_NAME>
      AZURE_BLOB_STORAGE_ACCOUNT_KEY: <AZ_BLOB_STORAGE_ACCOUNT_KEY>

Using a Connection String

A connection string includes the authorization information required for your application to access data in an Azure Storage account at runtime using Shared Key authorization.

A connection string will look something like this:

DefaultEndpointsProtocol=https;AccountName=AZ_BLOB_STORAGE_ACCOUNT_NAME;AccountKey=AZ_BLOB_STORAGE_ACCOUNT_KEY;EndpointSuffix=core.windows.net

Refer to the Configure Azure Storage connection strings Azure documentation for how to build a connection string.
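
Alternatively, a hedged sketch of retrieving the connection string with the Azure CLI (the resource group name is a placeholder):

az storage account show-connection-string \
  --resource-group <resource-group> \
  --name <AZ_BLOB_STORAGE_ACCOUNT_NAME> \
  -o tsv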

Once the information is available, the values.yaml section to fill out looks like this:

global:
  filestore:
    type: azure_blob
    environment:
      AZURE_BLOB_STORAGE_CONTAINER_NAME: <AZ_STORAGE_CONTAINER_NAME>
      AZURE_BLOB_STORAGE_CONNECTION_STRING: <AZURE_BLOB_STORAGE_CONNECTION_STRING>

Using an Azure Service Principal

Registering an application with Azure Active Directory (Azure AD) creates a service principal you can use to provide access to Azure storage accounts.

Refer to the Create an Azure Active Directory application and service principal that can access resources Azure documentation for how to set up an application. Note that the application identified by AZ_CLIENT_ID must be granted the "Storage Blob Data Contributor" role on the target storage account. This role must be assigned at the storage account level, rather than at the container level, to properly configure access.
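
A hypothetical sketch with the Azure CLI (the application name, subscription, resource group, and storage account below are placeholders):

# register an application and create its service principal
az ad sp create-for-rbac --name datarobot-blob-sp

# grant the data role at the storage account scope
az role assignment create \
  --assignee <AZ_CLIENT_ID> \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"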

Once the information is available, the values.yaml section to fill out looks like this:

global:
  filestore:
    type: azure_blob
    environment:
      AZURE_BLOB_STORAGE_CONTAINER_NAME: <AZ_STORAGE_CONTAINER_NAME>
      AZURE_BLOB_STORAGE_ACCOUNT_NAME: <AZ_BLOB_STORAGE_ACCOUNT_NAME>
      AZURE_TENANT_ID: <AZ_TENANT_ID>
      AZURE_CLIENT_ID: <AZ_CLIENT_ID>
      AZURE_CLIENT_SECRET: <AZ_CLIENT_SECRET>

Workload Identity for Azure

To use Microsoft Entra Workload ID with Azure Kubernetes Service, a dedicated managed identity must be created. The managed identity needs the following role:

  • "Storage Blob Data Contributor"

The following az commands configure the managed identity and the federated credentials for the DataRobot service accounts:

# update this info according to your deployment
NS=DR_NAMESPACE
export RESOURCE_GROUP="example-datarobot-rg"
export LOCATION="westcentralus"
export STORAGE_ACCOUNT_NAME="example-datarobot-storage-account"
export AKS_NAME="example-aks-name"

####
export SUBSCRIPTION="$(az account show --query id --output tsv)"
export AKS_OIDC_ISSUER="$(az aks show -n $AKS_NAME -g "${RESOURCE_GROUP}" --query "oidcIssuerProfile.issuerUrl" -otsv)"


# Create Azure managed identity
export USER_ASSIGNED_IDENTITY_NAME="datarobot-storage-sa"
az identity create --name "${USER_ASSIGNED_IDENTITY_NAME}" --resource-group "${RESOURCE_GROUP}" --location "${LOCATION}" --subscription "${SUBSCRIPTION}"
export USER_ASSIGNED_CLIENT_ID="$(az identity show --resource-group "${RESOURCE_GROUP}" --name "${USER_ASSIGNED_IDENTITY_NAME}" --query 'clientId' -otsv)"

# assign permission to the managed identity
az role assignment create --assignee $USER_ASSIGNED_CLIENT_ID \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/$SUBSCRIPTION/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT_NAME"


SALIST=('datarobot-storage-sa' 'dynamic-worker' 'kubeworker-sa' 'prediction-server-sa' 'internal-api-sa' 'build-service' 'tileservergl-sa' 'nbx-notebook-revisions-account' 'buzok-account' 'exec-manager-qw' 'exec-manager-wrangling' 'lrs-job-manager' 'blob-view-service')
for sa in "${SALIST[@]}"
do
    echo "setting trust for $NS:$sa"
    az identity federated-credential create --name $sa --identity-name ${USER_ASSIGNED_IDENTITY_NAME} --resource-group ${RESOURCE_GROUP} --issuer ${AKS_OIDC_ISSUER} --subject system:serviceaccount:$NS:$sa
done
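
To confirm the federated credentials were created, a hedged check:

az identity federated-credential list \
  --identity-name "${USER_ASSIGNED_IDENTITY_NAME}" \
  --resource-group "${RESOURCE_GROUP}" \
  -o table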

Once completed, the values.yaml section to fill out should follow minimal_datarobot-azure_workload_identity.yaml.

NOTE: For additional information, see Use a workload identity with an application on Azure Kubernetes Service.
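
For reference, a hypothetical sketch of the annotation that Workload Identity expects on each Kubernetes service account listed above (the actual values.yaml keys are defined in minimal_datarobot-azure_workload_identity.yaml):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: datarobot-storage-sa
  namespace: DR_NAMESPACE
  annotations:
    azure.workload.identity/client-id: <USER_ASSIGNED_CLIENT_ID>

Pods that use these service accounts also need the azure.workload.identity/use: "true" label so the workload identity webhook injects credentials.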

Azure Government Config

To configure Azure Blob Storage for a Government region, you need to set the following additional parameters:

global:
  filestore:
    type: azure_blob
    environment:
      AZURE_HOST: blob.core.usgovcloudapi.net
      AZURE_BLOB_STORAGE_CONTAINER_NAME: <AZ_STORAGE_CONTAINER_NAME>
      AZURE_BLOB_STORAGE_CONNECTION_STRING: <AZURE_BLOB_STORAGE_CONNECTION_STRING>

Google Cloud Storage

DataRobot primarily uses GKE Workload Identity to access object storage buckets in Google Cloud, as it provides a secure and efficient way to authenticate Kubernetes workloads without relying on hard-coded credentials. For legacy reasons, service account keys are also supported. Supporting both methods gives a flexible and secure access strategy that accommodates modern practices as well as legacy requirements.

Workload Identity

As part of the DataRobot installation process, a dedicated service account must be created to ensure secure access to Google Cloud resources.

The service account will need the following rights:

  • roles/storage.objectUser
  • roles/storage.insightsCollectorService

The following gcloud commands configure the service account:

PROJECT_ID=YOUR_GCP_PROJECT
NS=DR_NAMESPACE
SA_NAME=datarobot-storage-sa

gcloud iam service-accounts create $SA_NAME --project $PROJECT_ID
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com --role=roles/storage.objectUser
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com --role=roles/storage.insightsCollectorService

SALIST=('datarobot-storage-sa' 'dynamic-worker' 'kubeworker-sa' 'prediction-server-sa' 'internal-api-sa' 'build-service' 'tileservergl-sa' 'nbx-notebook-revisions-account' 'buzok-account' 'exec-manager-qw' 'exec-manager-wrangling' 'lrs-job-manager' 'blob-view-service')
for sa in "${SALIST[@]}"
do
    gcloud iam service-accounts add-iam-policy-binding $SA_NAME@$PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:$PROJECT_ID.svc.id.goog[$NS/$sa]"
done
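
To confirm the workload identity bindings, a hedged check:

gcloud iam service-accounts get-iam-policy $SA_NAME@$PROJECT_ID.iam.gserviceaccount.com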

Once completed, the values.yaml section to fill out should follow minimal_datarobot-google_workload_identity.yaml.

NOTE: For additional information, see GKE Workload Identity.
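
For reference, a hypothetical sketch of the annotation that GKE Workload Identity expects on each Kubernetes service account listed above (the actual values.yaml keys are defined in minimal_datarobot-google_workload_identity.yaml):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: datarobot-storage-sa
  namespace: DR_NAMESPACE
  annotations:
    iam.gke.io/gcp-service-account: datarobot-storage-sa@YOUR_GCP_PROJECT.iam.gserviceaccount.com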

Using a Service Account Key

As part of the DataRobot installation process, a dedicated service account and key can be created (see the sketch after the list below). To grant the service account access to your bucket, the following roles must be granted, as described in the official Google documentation:

  • roles/storage.objectUser
  • roles/storage.insightsCollectorService
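
A hypothetical sketch of creating the key and base64-encoding it for GOOGLE_STORAGE_KEYFILE_CONTENTS, assuming the service account created in the previous section:

# create a JSON key for the service account
gcloud iam service-accounts keys create key.json \
  --iam-account=$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com

# base64-encode the key file on a single line (GNU base64; on macOS use: base64 -i key.json)
base64 -w0 key.json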

Once the information is available, the values.yaml section to fill out looks like this:

global:
  filestore:
    type: google
    environment:
      GOOGLE_STORAGE_CREDENTIALS_SOURCE: content
      GOOGLE_STORAGE_BUCKET: GCP_BUCKET_NAME
      GOOGLE_STORAGE_KEYFILE_CONTENTS: GCP_BASE64_SERVICE_ACCOUNT_KEY

NOTE: For additional information, see minimal_datarobot-google_values.yaml.