


Object Storage Configuration

DataRobot supports the following object storage types:

  • AWS S3
  • S3 compatible
  • Azure Blob Storage
  • Google Cloud Storage

AWS S3

File storage configuration for S3

FILE_STORAGE_PREFIX: Represents the prefix applied to all paths in the file storage medium after the root path.

FILE_STORAGE_TYPE: Set to s3 for AWS storage.

global:
  filestore:
    type: s3
    environment:
      S3_HOST: s3.us-east-1.amazonaws.com
      S3_BUCKET: <bucket>
      S3_IS_SECURE: "True"
      S3_VALIDATE_CERTS: "True"
      S3_REGION: us-east-1
      S3_PORT: "443"
      S3_SERVER_SIDE_ENCRYPTION: DISABLED
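
FILE_STORAGE_PREFIX is not shown in the sample above. A minimal sketch of adding it, assuming it is passed as an environment variable alongside the other settings:

global:
  filestore:
    type: s3
    environment:
      FILE_STORAGE_PREFIX: <prefix>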

If using the s3 storage type, you must additionally set the variables below.

S3_BUCKET: Name of the S3 bucket to store DataRobot application files in. Your access key ID must belong to an account that has write, read, and list permissions on this bucket.

S3_HOST: IP or hostname of the S3 appliance (e.g. s3.us-east-1.amazonaws.com)

You may additionally set the S3_REGION variable if you want to explicitly specify which region you run in, or if you are using a storage provider that exposes an S3-compatible API.

S3_IS_SECURE: Whether the service uses HTTPS. The default value, "True", has only been tested with AWS S3.

S3_PORT: The port on which the S3 service is running

You may also set MULTI_PART_S3_UPLOAD: false to disable multipart file uploads if you encounter upload issues. In general, multipart uploads are well tested and support much larger files, so you will likely not need to change the default.
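
For example, a minimal sketch of disabling multipart uploads, assuming the flag is passed in the same environment block as the other settings:

global:
  filestore:
    type: s3
    environment:
      MULTI_PART_S3_UPLOAD: "false"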

DataRobot recommends using AWS IAM Roles for Service Accounts (IRSA) to authenticate with S3 storage. If you prefer to use access keys, or are connecting to an S3-compatible API, you will additionally need to add your credentials as environment variables:

AWS_ACCESS_KEY_ID: Access key ID for the account you want to use to connect to S3 storage.

AWS_SECRET_ACCESS_KEY: Secret access key for authenticating your AWS account.
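
A minimal sketch of supplying static credentials (placeholder values shown; prefer IRSA where available):

global:
  filestore:
    type: s3
    environment:
      S3_HOST: s3.us-east-1.amazonaws.com
      S3_BUCKET: <bucket>
      AWS_ACCESS_KEY_ID: <access-key-id>
      AWS_SECRET_ACCESS_KEY: <secret-access-key>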

S3 Ingestion

To enable data ingestion from private objects stored in S3, see the AWS S3 Ingest guide.

Disabling TLS verification

If your environment needs to connect to its own object storage endpoint over unverified TLS, set:

global:
  filestore:
    type: s3
    environment:
      S3_VALIDATE_CERTS: "False"

Server-side encryption settings

The DataRobot application can be configured to enable server-side encryption (SSE) for data at rest when it stores new files to S3 (existing files are not affected). Either an S3-managed key or a customer-managed key (CMK) can be used for encryption.

The following configuration settings are available to configure server-side encryption:

S3_SERVER_SIDE_ENCRYPTION: With the default value AES256, data is encrypted using S3-managed keys. Set to aws:kms to use server-side encryption with KMS-managed keys, or set to DISABLED to disable server-side encryption entirely.

AWS_S3_SSE_KMS_KEY_ID: Encrypt data using a specific KMS key. Set to the identity of a customer-managed key, or leave blank to let AWS create a key on your behalf (see AWS managed CMK). This setting only applies when S3_SERVER_SIDE_ENCRYPTION is set to aws:kms.
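
For example, a minimal sketch of enabling SSE with a customer-managed KMS key (the key identifier below is a placeholder):

global:
  filestore:
    type: s3
    environment:
      S3_SERVER_SIDE_ENCRYPTION: "aws:kms"
      AWS_S3_SSE_KMS_KEY_ID: <kms-key-id-or-arn>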

Note: Server-side encryption means the encryption keys are managed by the S3 service and never exposed to the DataRobot application. If the keys are deleted or access to them is lost, DataRobot will not be able to help decrypt the data.

Note: S3 makes a billable call to the AWS KMS service every time DataRobot reads or writes an encrypted object. Refer to the AWS documentation on reducing the cost of AWS KMS resource usage with SSE.

S3 compatible

DataRobot can also be configured to use MinIO. The example below assumes the Bitnami MinIO chart is deployed in the same namespace as the DataRobot installation.
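
As a hypothetical illustration, the chart could be deployed as follows (the release name, namespace, and credentials are assumptions chosen to match the example configuration below; consult the Bitnami chart documentation for the authoritative parameters):

helm install core-minio oci://registry-1.docker.io/bitnamicharts/minio \
  --namespace DR_CORE_NAMESPACE \
  --set auth.rootUser=miniodatarobot \
  --set auth.rootPassword=miniodatarobot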

File storage configuration for MinIO

global:
  filestore:
    type: s3
    environment:
      S3_HOST: core-minio.DR_CORE_NAMESPACE.svc.cluster.local
      S3_BUCKET: <bucket>
      AWS_ACCESS_KEY_ID: miniodatarobot
      AWS_SECRET_ACCESS_KEY: miniodatarobot
      S3_IS_SECURE: "False"
      S3_VALIDATE_CERTS: "False"
      S3_REGION: us-east-1
      S3_PORT: "9000" # adjust if your MinIO service listens on a different port

If using the s3 storage type, you must additionally set the variables below.

S3_BUCKET: Name of the S3 bucket to store DataRobot application files in. Your access key ID must belong to an account that has write, read, and list permissions on this bucket.

S3_HOST: IP or hostname of the S3 appliance

S3_IS_SECURE: Whether the service uses HTTPS. The default value, "True", has only been tested with AWS S3.

S3_PORT: The port on which the S3 service is running

AWS_ACCESS_KEY_ID: Access key ID for the account you want to use to connect to S3 storage.

AWS_SECRET_ACCESS_KEY: Secret access key for authenticating your AWS account.
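
If the bucket does not already exist, it can be created with the MinIO client; a hypothetical sketch (the alias name and port are assumptions):

mc alias set datarobot http://core-minio.DR_CORE_NAMESPACE.svc.cluster.local:9000 miniodatarobot miniodatarobot
mc mb datarobot/<bucket>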

Azure Blob Storage

DataRobot supports four authentication methods for accessing an Azure Blob Storage container. The recommended method is Workload Identity.

Using Storage AccountKey

When you create a storage account, Azure generates two 512-bit storage account access keys for that account. These keys can be used to authorize access to data in your storage account via Shared Key authorization, or via SAS tokens that are signed with the shared key.

Storage account access keys provide full access to the configuration of a storage account, as well as the data. Always be careful to protect your access keys.

Refer to the Manage storage account access keys Azure documentation for how to retrieve the AccountKey.
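
For example, a hedged sketch of retrieving the first key with the Azure CLI (the resource group name is a placeholder):

az storage account keys list \
  --resource-group <resource-group> \
  --account-name <AZ_BLOB_STORAGE_ACCOUNT_NAME> \
  --query "[0].value" -o tsv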

Once the information is available, the values.yaml section to fill out looks like this:

global:
  filestore:
    type: azure_blob
    environment:
      AZURE_BLOB_STORAGE_CONTAINER_NAME: <AZ_STORAGE_CONTAINER_NAME>
      AZURE_BLOB_STORAGE_ACCOUNT_NAME: <AZ_BLOB_STORAGE_ACCOUNT_NAME>
      AZURE_BLOB_STORAGE_ACCOUNT_KEY: <AZ_BLOB_STORAGE_ACCOUNT_KEY>

Using a Connection String

A connection string includes the authorization information required for your application to access data in an Azure Storage account at runtime using Shared Key authorization.

A connection string will look something like this:

DefaultEndpointsProtocol=https;AccountName=AZ_BLOB_STORAGE_ACCOUNT_NAME;AccountKey=AZ_BLOB_STORAGE_ACCOUNT_KEY;EndpointSuffix=core.windows.net

Refer to the Configure Azure Storage connection strings Azure documentation for how to build a connection string.
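
Alternatively, a hedged sketch of retrieving the connection string with the Azure CLI (the resource group name is a placeholder):

az storage account show-connection-string \
  --resource-group <resource-group> \
  --name <AZ_BLOB_STORAGE_ACCOUNT_NAME> \
  -o tsv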

Once the information is available, the values.yaml section to fill out looks like this:

global:
  filestore:
    type: azure_blob
    environment:
      AZURE_BLOB_STORAGE_CONTAINER_NAME: <AZ_STORAGE_CONTAINER_NAME>
      AZURE_BLOB_STORAGE_CONNECTION_STRING: <AZURE_BLOB_STORAGE_CONNECTION_STRING>

Using an Azure Service Principal

Registering an application with Azure Active Directory (Azure AD) creates a service principal you can use to provide access to Azure storage accounts.

Refer to the Create an Azure Active Directory application and service principal that can access resources Azure documentation for how to set up an application. Note that the application identified by AZ_CLIENT_ID must be granted the "Storage Blob Data Contributor" role on the target storage account. This role must be assigned at the storage account level, rather than at the container level, to properly configure access.
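
A hypothetical sketch with the Azure CLI (the application name, subscription, resource group, and storage account below are placeholders):

# register an application and create its service principal
az ad sp create-for-rbac --name datarobot-blob-sp

# grant the data role at the storage account scope
az role assignment create \
  --assignee <AZ_CLIENT_ID> \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"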

Once the information is available, the values.yaml section to fill out looks like this:

global:
  filestore:
    type: azure_blob
    environment:
      AZURE_BLOB_STORAGE_CONTAINER_NAME: <AZ_STORAGE_CONTAINER_NAME>
      AZURE_BLOB_STORAGE_ACCOUNT_NAME: <AZ_BLOB_STORAGE_ACCOUNT_NAME>
      AZURE_TENANT_ID: <AZ_TENANT_ID>
      AZURE_CLIENT_ID: <AZ_CLIENT_ID>
      AZURE_CLIENT_SECRET: <AZ_CLIENT_SECRET>

Workload Identity for Azure

To use Microsoft Entra Workload ID with Azure Kubernetes Service, a dedicated managed identity must be created. The managed identity needs the following role:

  • "Storage Blob Data Contributor"

The following az commands configure the managed identity and the federated credentials for the DataRobot service accounts:

# update this info according to your deployment
NS=DR_NAMESPACE
export RESOURCE_GROUP="example-datarobot-rg"
export LOCATION="westcentralus"
export STORAGE_ACCOUNT_NAME="example-datarobot-storage-account"
export AKS_NAME="example-aks-name"

####
export SUBSCRIPTION="$(az account show --query id --output tsv)"
export AKS_OIDC_ISSUER="$(az aks show -n $AKS_NAME -g "${RESOURCE_GROUP}" --query "oidcIssuerProfile.issuerUrl" -otsv)"


# Create Azure managed identity
export USER_ASSIGNED_IDENTITY_NAME="datarobot-storage-sa"
az identity create --name "${USER_ASSIGNED_IDENTITY_NAME}" --resource-group "${RESOURCE_GROUP}" --location "${LOCATION}" --subscription "${SUBSCRIPTION}"
export USER_ASSIGNED_CLIENT_ID="$(az identity show --resource-group "${RESOURCE_GROUP}" --name "${USER_ASSIGNED_IDENTITY_NAME}" --query 'clientId' -otsv)"

# assign permission to the managed identity
az role assignment create --assignee $USER_ASSIGNED_CLIENT_ID \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/$SUBSCRIPTION/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT_NAME"


SALIST=('datarobot-storage-sa' 'dynamic-worker' 'kubeworker-sa' 'prediction-server-sa' 'internal-api-sa' 'build-service' 'tileservergl-sa' 'nbx-notebook-revisions-account' 'buzok-account' 'exec-manager-qw' 'exec-manager-wrangling' 'lrs-job-manager' 'blob-view-service')
for sa in "${SALIST[@]}"
do
    echo "setting trust for $NS:$sa"
    az identity federated-credential create --name $sa --identity-name ${USER_ASSIGNED_IDENTITY_NAME} --resource-group ${RESOURCE_GROUP} --issuer ${AKS_OIDC_ISSUER} --subject system:serviceaccount:$NS:$sa
done
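
To confirm the federated credentials were created, a hedged check:

az identity federated-credential list \
  --identity-name "${USER_ASSIGNED_IDENTITY_NAME}" \
  --resource-group "${RESOURCE_GROUP}" \
  -o table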

Once completed, the values.yaml section to fill out should follow minimal_datarobot-azure_workload_identity.yaml.

NOTE: For additional information, see Use a workload identity with an application on Azure Kubernetes Service.
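
For reference, a hypothetical sketch of the annotation that Workload Identity expects on each Kubernetes service account listed above (the actual values.yaml keys are defined in minimal_datarobot-azure_workload_identity.yaml):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: datarobot-storage-sa
  namespace: DR_NAMESPACE
  annotations:
    azure.workload.identity/client-id: <USER_ASSIGNED_CLIENT_ID>

Pods that use these service accounts also need the azure.workload.identity/use: "true" label so the workload identity webhook injects credentials.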

Azure Government Config

To configure Azure Blob Storage for a Government region, you need to set the following additional parameters:

global:
  filestore:
    type: azure_blob
    environment:
      AZURE_HOST: blob.core.usgovcloudapi.net
      AZURE_BLOB_STORAGE_CONTAINER_NAME: <AZ_STORAGE_CONTAINER_NAME>
      AZURE_BLOB_STORAGE_CONNECTION_STRING: <AZURE_BLOB_STORAGE_CONNECTION_STRING>

Google Cloud Storage

DataRobot primarily uses GKE Workload Identity to access object storage buckets in Google Cloud, as it provides a secure and efficient way to authenticate Kubernetes workloads without relying on hard-coded credentials. For legacy reasons, service account keys are also supported. Supporting both methods gives a flexible and secure access strategy that accommodates modern practices as well as legacy requirements.

Workload Identity

As part of the DataRobot installation process, a dedicated service account must be created to ensure secure access to Google Cloud resources.

The service account will need the following rights:

  • roles/storage.objectUser
  • roles/storage.insightsCollectorService

The following gcloud commands configure the service account:

PROJECT_ID=YOUR_GCP_PROJECT
NS=DR_NAMESPACE
SA_NAME=datarobot-storage-sa

gcloud iam service-accounts create $SA_NAME --project $PROJECT_ID
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com --role=roles/storage.objectUser
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com --role=roles/storage.insightsCollectorService

SALIST=('datarobot-storage-sa' 'dynamic-worker' 'kubeworker-sa' 'prediction-server-sa' 'internal-api-sa' 'build-service' 'tileservergl-sa' 'nbx-notebook-revisions-account' 'buzok-account' 'exec-manager-qw' 'exec-manager-wrangling' 'lrs-job-manager' 'blob-view-service')
for sa in "${SALIST[@]}"
do
    gcloud iam service-accounts add-iam-policy-binding $SA_NAME@$PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:$PROJECT_ID.svc.id.goog[$NS/$sa]"
done
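
To confirm the workload identity bindings, a hedged check:

gcloud iam service-accounts get-iam-policy $SA_NAME@$PROJECT_ID.iam.gserviceaccount.com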

Once completed, the values.yaml section to fill out should follow minimal_datarobot-google_workload_identity.yaml.

NOTE: For additional information, see GKE Workload Identity.
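
For reference, a hypothetical sketch of the annotation that GKE Workload Identity expects on each Kubernetes service account listed above (the actual values.yaml keys are defined in minimal_datarobot-google_workload_identity.yaml):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: datarobot-storage-sa
  namespace: DR_NAMESPACE
  annotations:
    iam.gke.io/gcp-service-account: datarobot-storage-sa@YOUR_GCP_PROJECT.iam.gserviceaccount.com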

Using a Service Account Key

As part of the DataRobot installation process, a dedicated service account and key can be created (see the sketch after the list below). To grant the service account access to your bucket, the following roles must be granted, as described in the official Google documentation:

  • roles/storage.objectUser
  • roles/storage.insightsCollectorService
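
A hypothetical sketch of creating the key and base64-encoding it for GOOGLE_STORAGE_KEYFILE_CONTENTS, assuming the service account created in the previous section:

# create a JSON key for the service account
gcloud iam service-accounts keys create key.json \
  --iam-account=$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com

# base64-encode the key file on a single line (GNU base64; on macOS use: base64 -i key.json)
base64 -w0 key.json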

Once the information is available, the values.yaml section to fill out looks like this:

global:
  filestore:
    type: google
    environment:
      GOOGLE_STORAGE_CREDENTIALS_SOURCE: content
      GOOGLE_STORAGE_BUCKET: GCP_BUCKET_NAME
      GOOGLE_STORAGE_KEYFILE_CONTENTS: GCP_BASE64_SERVICE_ACCOUNT_KEY

NOTE: For additional information, see minimal_datarobot-google_values.yaml.