
Compute Spark API

The Compute Spark API service gives DataRobot the ability to launch and manage interactive Spark sessions and, where available, cloud-based serverless batch workloads.

Feature Availability

  • AWS EKS: Interactive & Batch
  • Google GKE: Interactive & Batch
  • Azure AKS: Interactive
  • Generic: Interactive

Services

The Compute Spark API service includes the following service Deployments:

  • compute-spark-app
  • compute-spark-spark-celery
  • compute-spark-interactive-spark-watcher
  • compute-spark-batch-spark-watcher

Jobs

A database migration Job, compute-spark-db-migration-job-*, runs as an initContainer within the compute-spark-app deployment. If this migration fails, the database will be missing tables or columns. Migrations can be run manually by using kubectl to exec into a running container of any of the deployments and executing cd ./spark/db && alembic upgrade head.

A CronJob compute-spark-db-cleanup-cron-job runs on a schedule to clean up any lingering interactive Spark applications the service manages. Failures of the job should be ignored if a more recent job has run successfully.

ConfigMaps and Secrets

Configuration values are supplied to the Pods through environment variables specified in the spark-compute-envvars ConfigMap. Confidential values, such as the PostgreSQL and RabbitMQ user passwords, are provided as environment variables mounted from the spark-csp-secrets Kubernetes Secret.
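
For illustration, the sketch below shows how a container can consume both of these sources as environment variables; the actual Deployment manifests are rendered by the compute-spark chart and may wire these up differently, and the container name and image here are placeholders only.

# Illustrative sketch only, not the chart's rendered manifest
containers:
  - name: compute-spark-app            # placeholder container name
    image: <compute-spark image>       # placeholder image reference
    envFrom:
      - configMapRef:
          name: spark-compute-envvars  # non-confidential configuration values
      - secretRef:
          name: spark-csp-secrets      # confidential values such as PGSQL_PASSWORD and RABBITMQ_PASSWORD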

PCS Requirements

The Compute Spark API service depends on both PostgreSQL and RabbitMQ to meet its foundational requirements. The service requires exclusive access to a PostgreSQL database (using the public schema) and an exclusive vhost within RabbitMQ.

Helm Configuration Values

Spark Interactive Feature

LRS Image Requirements

Interactive Spark applications depend internally on DataRobot's Long Running Services (LRS) feature. Each application is deployed as an LRS service using the datarobot/livy-k8s image included with the installation image artifact. This image must be available in a registry from which DataRobot LRS can pull it, which is achieved by setting lrs-operator.operator.config.pullSecrets to the correct image pull secret.

Example LRS Config:

lrs-operator:
  image:
    pullSecrets:
    - name: datarobot-image-pullsecret
  operator:
    config:
      pullSecrets:
      - name: datarobot-image-pullsecret

If the datarobot/livy-k8s image has been re-tagged before upload to a customer registry, the image name must be configured in the livyImage map within the customer-specific values.yaml. Note: do not copy the block below unless you have changed the image name.

compute-spark:
  # Livy Image Location for Spark Interactive
  livyImage:
    registry: docker.io
    project: datarobot
    name: livy-k8s
    tag: <Livy image tag name>
...

Network security requirements

Please review the Security requirements section if you are using an external object store.

Spark Batch Feature

Important: Configuring Batch Spark is optional. Customers without AWS EMR (or an equivalent cloud configuration) can safely skip these settings without impacting interactive features such as Wrangler and the SQL Editor.

Note: The default configuration in the datarobot-prime chart sets third_party: generic. This default does not imply that Batch Spark (EMR) is required; override it only if Batch Spark is needed.
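
For reference, leaving the default in place corresponds to the minimal values fragment below; the key mirrors the provider-specific examples later in this section, and generic is replaced with aws, azure, or gcp only when Batch Spark is enabled.

compute-spark:
  third_parties:
    # Default: no cloud batch integration; interactive Spark features still work.
    third_party: "generic"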

Spark batch functionality is implemented using a cloud provider's managed service offering. Each cloud provider will have specific infrastructure tasks that must be completed by a customer administrator before attempting to enable this feature.

Batch Support on AWS

Compute Spark batch functionality on AWS relies on EMR Serverless. Where configuration details about EMR Serverless are lacking in this guide, please consult the official AWS documentation.

Prerequisites and Constraints
  • An S3 bucket for storing datasets, python files, jars, etc. for use by EMR Serverless batch jobs.
  • The ECRs hosting any Blob View Storage custom Spark images that will be used for EMR Serverless batch jobs.
  • An IAM role that EMR Serverless batch jobs will assume to access the above S3 bucket and ECRs. KMS permissions to encrypt and decrypt bucket content are required.
  • An IRSA role provisioned and applied to the compute-spark K8s service account for launching EMR Serverless batch workloads. KMS permissions to encrypt and decrypt bucket content are required.
  • The list of AWS subnets EMR Serverless batch jobs will run in. These subnets are assumed to have access to the DataRobot Public API.
  • The list of AWS security groups that will be applied to EMR Serverless batch jobs.

Setting values.yaml for AWS Batch Workloads

To utilize AWS EMR-Serverless Batch functionality, the compute-spark chart must have certain values set and environment variables provided at installation time.

compute-spark:
...
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: YOUR_AWS_IRSA_ROLE_ARN # IRSA role with permissions to create EMR Serverless applications, access S3 + KMS and ECR, and create network interfaces
...
  config_env_vars:
    SPARK_CUSTOMER_IAM_ROLE: YOUR_SPARK_CUSTOMER_IAM_ROLE # Role that EMR Serverless batch jobs assume to access the above S3 bucket and ECRs
    SPARK_NETWORK_AWS_SUBNET_IDS: # e.g. '["subnet-09324f6779738e07a","subnet-0ca401bda1a41c2a4","subnet-0ded853d655b0338b"]'
    SPARK_NETWORK_AWS_SECURITY_GROUPS: # e.g. '["sg-0e013a851ad5b55ee"]'
    SPARK_BLOB_STORAGE: YOUR_SPARK_BLOB_STORAGE # S3 bucket that stores datasets, python files, jars, etc. for use by EMR Serverless batch jobs
...
  third_parties:
    third_party: "aws"
    aws:
      aws_default_region: YOUR_AWS_S3_REGION # e.g. "us-east-1"
      aws_region: YOUR_AWS_S3_REGION

Batch Support on Azure

Compute Spark batch functionality on Azure relies on the Databricks API.

Prerequisites and Constraints for Azure

This feature requires:

  • A storage account with a storage container in it. The storage container is where you store datasets, notebooks, job files, and any other files needed by your Spark job. This feature uses [Azure Data Lake Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction), meaning communication goes to the dfs.core.windows.net endpoint instead of the blob.core.windows.net endpoint. When using private endpoint connections for Azure Storage, be sure to create private endpoints for both the Blob Storage endpoint and the Data Lake Storage endpoint.

  • Enable Hierarchical namespace for the storage account: go to the storage account's Overview page, locate Hierarchical namespace, and if it shows Disabled, click Disabled to enable it.

  • A managed identity (rather than a service principal, for simplicity). During the creation process, ensure that the managed identity is in the same resource group as the storage account.

  • Assign the managed identity the "Storage Blob Data Contributor" role (id: "ba92f5b4-2d11-453d-a403-e96b0029c9fe") over the storage container (which grants full permissions over all resources within the container). Steps here.

  • Create a Databricks workspace via these steps.

  • Assign the role "Managed Application Contributor Role" to the managed identity for the Databricks workspace.

  • Link your Databricks workspace to the managed identity. Append this path to your workspace URL: /settings/workspace/identity-and-access/service-principals/create (for example, https://adb-4324452524.9.azuredatabricks.net/settings/workspace/identity-and-access/service-principals/create). Select Microsoft Entra ID managed, enter the managed identity's "Client ID" into the "Microsoft Entra application ID" field, enter the managed identity's name into the "Service principal name" field, select "Allow cluster creation", "Databricks SQL Access", and "Allow workspace access", and then click "Add".

  • Give your managed identity federated access to your AKS cluster.

Setting values.yaml for Azure Batch Workloads

To utilize Azure Databricks Batch functionality, the compute-spark chart must have certain values set and environment variables provided at installation time.

# Example Azure batch configuration
compute-spark:
  extraLabels:
    azure.workload.identity/use: "true"
  ...
  third_parties:
    third_party: "azure"
  ...
  config_env_vars:
    SPARK_BLOB_STORAGE: # abfss storage used by Databricks to fulfill Spark job dependencies and to save job output
    AZURE_WORKSPACE_HOST: # The URL for your Databricks workspace (e.g. https://adb-800256028595268.9.azuredatabricks.net)
    AZURE_SUBSCRIPTION_ID: # The subscription id that both the managed identity and the storage account belong to; it can be found when viewing the managed identity in your Azure portal
    AZURE_RESOURCE_GROUP_NAME: # The resource group name that both the managed identity and the storage account belong to; it can be found when viewing the storage account overview page in your Azure portal
    AZURE_CLIENT_ID: # The client id can be found when viewing the managed identity overview page in your Azure portal
  ...
  serviceAccount:
    name: "spark-compute-services-sa"
    annotations:
      azure.workload.identity/client-id: # The client id can be found when viewing the managed identity in your azure portal
      azure.workload.identity/tenant-id: # The tenant id can be found via these steps: https://learn.microsoft.com/en-us/azure/azure-portal/get-subscription-tenant-id
  ...

Batch Support on Google

Compute Spark batch functionality on Google GCP relies on the Dataproc Batch Serverless API.

Prerequisites and Constraints for Google
  • This feature is limited to integration with a single Google Project in a single region.
  • The network and subnet of the GKE cluster hosting DataRobot must be configured for Dataproc communication. See Dataproc Cluster network configuration. Likewise, batch workloads reaching out to other third-party services, such as Snowflake, must have the corresponding networking configured in Dataproc.
  • Workload Identity must be enabled on the GKE cluster for authentication with GCP APIs.
  • A Cloud Storage Bucket must be made available for storing datasets, python files, jars, etc. for use by a Dataproc Serverless Batch run.
  • Custom Docker/OCI images for Spark batch workloads must be stored within a Google Artifact Registry.
  • A Google Service Account must be created for running customer batch workloads on Dataproc. This service account must have:
    ◦ The roles/dataproc.worker role to run workloads in Dataproc.
    ◦ Read/write permissions to the above Google Cloud Storage bucket. The roles/storage.objectUser role should be sufficient, but see IAM roles for Cloud Storage for more granular permissions.
    ◦ roles/artifactregistry.reader permissions to the above Google Artifact Registry for pulling custom images for batch workloads.
  • A second Google Service Account must be created for use by the compute-spark service for launching workloads in Dataproc. This service account must have:
    ◦ roles/dataproc.editor to create workloads in Dataproc.
    ◦ The roles/iam.workloadIdentityUser role, mapped to the Kubernetes service account that the compute-spark service runs under. This is required for the compute-spark service to submit GCP API requests from its pods.
    ◦ The roles/iam.serviceAccountActor role mapped to the customer service account that runs batch workloads. This is required to submit workloads as the customer.
    ◦ Read/write permissions to the above Google Cloud Storage bucket. The roles/storage.objectUser role should be sufficient, but see IAM roles for Cloud Storage for more granular permissions.
    ◦ roles/artifactregistry.reader permissions to the above Google Artifact Registry for submitting batch jobs with custom images to Dataproc.

Setting values.yaml for GCP Batch Workloads

To utilize Dataproc Serverless Batch functionality, the compute-spark chart must have certain values set and environment variables provided at installation time.

# Example GCP batch configuration
compute-spark:
  ...
  third_parties:
    third_party: "gcp"
  ...
  config_env_vars:
    GCP_DATAPROC_PROJECT_ID: YOUR_DATAPROC_PROJECT_ID # e.g. acme-project
    GCP_DATAPROC_REGION: YOUR_DATAPROC_REGION # e.g. us-east1
    GCP_DATAPROC_NETWORK: YOUR_DATAROBOT_K8S_NETWORK
    GCP_DATAPROC_SUBNET: YOUR_DATAROBOT_K8S_SUBNETWORK
    SPARK_CUSTOMER_IAM_ROLE: YOUR_CUSTOMER_DATAPROC_WORKLOAD_GOOGLE_SERVICE_ACCOUNT_EMAIL # e.g. customer-gsa@acme-project.iam.gserviceaccount.com
    SPARK_BLOB_STORAGE: YOUR_GOOGLE_BUCKET_FOR_DATAPROC
  ...
  serviceAccount:
    name: "spark-compute-services-sa"
    annotations:
      iam.gke.io/gcp-service-account: YOUR_COMPUTE_SPARK_GOOGLE_SERVICE_ACCOUNT_EMAIL # e.g. "compute-spark-gsa@acme-project.iam.gserviceaccount.com"
  ...

Database and Queue Connections

The default values of the datarobot umbrella chart will work with the DataRobot Persistent Critical Services (PCS) chart without changes. Otherwise, update the compute-spark-chart.services section with your deployment's specific connection information.

...
### Example External Services Configuration ###
compute-spark:
  ...
  services:
    postgresql:
      username: YOUR_POSTGRESQL_USER_NAME
      database: YOUR_POSTGRESQL_DATABASE_NAME
      hostname: YOUR_POSTGRESQL_HOSTNAME
      port: YOUR_POSTGRESQL_PORT
      # Additional connection options in the form of a postgres URI connections string
      # Ex: "?option1=value1&option2=value2"
      connection_options: ""
    rabbitmq:
      username: YOUR_RABBITMQ_USERNAME
      hostname: YOUR_RABBITMQ_HOSTNAME
      # amqp or amqps
      scheme: amqp
      vhost: YOUR_RABBITMQ_VHOST
      port: YOUR_RABBITMQ_AMQP_PORT
      admin_port: YOUR_RABBITMQ_ADMIN_PORT
...

Environment Variables

You can supply additional environment variables to the spark-compute-envvars ConfigMap using compute-spark-chart.config_env_vars.

...
compute-spark:
  ...
  ### Example Environment Variables ###
  config_env_vars:
    SPARK_ENVVAR_NAME: YOUR_ENVVAR_VALUE
...

Secrets and External Secrets

The compute-spark chart supports reusing PCS secrets, supplying secret values directly to create spark-csp-secrets, or using the Kubernetes External Secrets Operator.

The following keys are expected to be defined within the spark-csp-secrets Secret.

  • PGSQL_PASSWORD
  • RABBITMQ_PASSWORD

Option 1 (Default): Persistent Critical Services (PCS) chart secrets can be reused to provide Compute Spark API's PostgreSQL and RabbitMQ access.

...
compute-spark:
  ...
  createSecrets: "false"
  useSecretsVolume: "true"
...

Option 2: Secrets can be directly supplied through values.yaml.

...
compute-spark:
  ...
  ### Example Secret Variables ###
  createSecrets: "true"
  useSecretsVolume: "false"
  secrets:
    PGSQL_PASSWORD: YOUR_POSTGRESQL_PASSWORD
    RABBITMQ_PASSWORD: YOUR_RABBITMQ_PASSWORD
...

Option 3: Use AWS Secrets Manager and External Secrets operator.

This option requires the specified service account to have the appropriate IRSA roles to access AWS Secrets Manager values. All required secrets are expected to be supplied within a single AWS Secrets Manager secret.

...
### Example External Secrets ###
compute-spark:
  ...
  createSecrets: "false"
  useSecretsVolume: "false"
  secretManager:
    labels: {}
    enabled: true
    secretStore:
      useExistingSecretStore: false
      name: "csp-spark-secretstore"
    region: YOUR_AWS_REGION
    serviceAccount:
      arn: YOUR_IRSA_ROLE_ARN
      name: "compute-spark-api-tenant-secret-store"
    refreshInterval: 1m
    # Example "/csp/csp-secrets"
    key: YOUR_AWS_SECRETS_MANAGER_KEY_PATH
...

Option 4: A Kubernetes administrator can create the spark-csp-secrets Secret, containing the required keys, before installing the datarobot-aws umbrella chart.
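
For reference, below is a minimal sketch of such a pre-created Secret; the namespace is a placeholder for whichever namespace the umbrella chart is installed into, and the values shown are illustrative only.

### Example pre-created Secret (illustrative) ###
apiVersion: v1
kind: Secret
metadata:
  name: spark-csp-secrets
  namespace: <DataRobot install namespace> # placeholder: the namespace the chart deploys into
type: Opaque
stringData:
  PGSQL_PASSWORD: YOUR_POSTGRESQL_PASSWORD
  RABBITMQ_PASSWORD: YOUR_RABBITMQ_PASSWORD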