Compute Spark API¶
The Compute Spark API service gives DataRobot the ability to launch and manage interactive Spark sessions and, where available, cloud-based serverless batch workloads.
Feature Availability¶
- AWS EKS: Interactive & Batch
- Google GKE: Interactive & Batch
- Azure AKS: Interactive
- Generic: Interactive
Services¶
The Compute Spark API service includes the following service Deployments:
- compute-spark-app
- compute-spark-spark-celery
- compute-spark-interactive-spark-watcher
- compute-spark-batch-spark-watcher
Jobs¶
A database migration Job, compute-spark-db-migration-job-*, runs as an initContainer within the compute-spark-app deployment. Failure of this migration job will result in missing tables or columns. Migrations can be executed manually from any of the deployments by using kubectl to exec into a running container and running cd ./spark/db && alembic upgrade head.
A CronJob compute-spark-db-cleanup-cron-job runs on a schedule to clean up any lingering interactive Spark applications the service manages. Failures of the job should be ignored if a more recent job has run successfully.
Configmaps and Secrets¶
Configuration values are supplied to the Pods through environment variables specified within the spark-compute-envvars ConfigMap. Confidential information, such as the PostgreSQL and RabbitMQ user passwords, is provided as environment variables mounted from the spark-csp-secrets Kubernetes Secret.
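For orientation only, the rendered spark-compute-envvars ConfigMap takes roughly the following shape. This is a minimal sketch, assuming the config_env_vars mechanism described later on this page; the SPARK_BLOB_STORAGE key is just one example and its value here is hypothetical.
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-compute-envvars
data:
  # Non-sensitive settings exposed to the Pods as environment variables
  SPARK_BLOB_STORAGE: "example-bucket"  # hypothetical example value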
PCS Requirements¶
The Compute Spark API service depends on both PostgreSQL and RabbitMQ to meet its foundational requirements. The service needs exclusive access to a PostgreSQL database, using the public schema, and an exclusive vhost within RabbitMQ.
Helm Configuration Values¶
Spark Interactive Feature¶
LRS Image Requirements¶
Interactive Spark applications internally depend on DataRobot's Long Running Services (LRS) feature. Each application is deployed as an LRS service using the datarobot/livy-k8s image included with the installation image artifact. This image must be available in a registry from which DataRobot LRS can pull it, which is achieved by setting lrs-operator.operator.config.pullSecrets to the correct image pull secret.
Example LRS Config:
lrs-operator:
  image:
    pullSecrets:
      - name: datarobot-image-pullsecret
  operator:
    config:
      pullSecrets:
        - name: datarobot-image-pullsecret
If the datarobot/livy-k8s image has been re-tagged before being uploaded to a customer registry, the image name must be configured as part of the livyImage map within the customer-specific values.yaml.
Note: Do not copy this example unless you have changed the image name.
compute-spark:
  # Livy Image Location for Spark Interactive
  livyImage:
    registry: docker.io
    project: datarobot
    name: livy-k8s
    tag: <Livy image tag name>
...
Network security requirements¶
Please check the Security requirements section if you're using an external object store.
Spark Batch Feature¶
Important: Configuring Batch Spark is optional. Customers without AWS EMR (or an equivalent cloud configuration) can safely skip these settings without impacting interactive features such as Wrangler and the SQL Editor.
Note: The default configuration in the datarobot-prime chart sets third_party: generic. This value is a default setting and does not imply that Batch Spark (EMR) is required. It can be overridden if Batch Spark is needed.
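For reference, leaving this default in place corresponds to the following values.yaml snippet. This is an illustrative sketch only, assuming the same compute-spark.third_parties structure used in the cloud-specific examples below; there is no need to copy it unless you are overriding the value.
compute-spark:
  third_parties:
    # Default setting; does not enable or require Batch Spark
    third_party: "generic"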
Spark batch functionality is implemented using a cloud provider's managed service offering. Each cloud provider will have specific infrastructure tasks that must be completed by a customer administrator before attempting to enable this feature.
Batch Support on AWS¶
Compute Spark batch functionality on AWS relies on EMR Serverless. Where configuration details about EMR Serverless are lacking in this guide, please consult the official AWS documentation.
Prerequisites and Constraints¶
- An S3 bucket for storing datasets, Python files, JARs, etc. for use by an EMR Serverless batch job.
- The ECRs hosting any Blob View Storage custom Spark images that will be used for an EMR Serverless batch job.
- An IAM role that EMR Serverless batch jobs will assume to access the above S3 bucket and ECRs. KMS permissions to encrypt and decrypt bucket content are required.
- An IRSA role provisioned and applied to the compute-spark K8s service account for launching EMR Serverless batch workloads. KMS permissions to encrypt and decrypt bucket content are required.
- The list of AWS subnets EMR Serverless batch jobs will run in. These subnets are assumed to have access to the DataRobot Public API.
- The list of AWS security groups that will be applied to EMR Serverless batch jobs.
Setting values.yaml for AWS Batch Workloads¶
To utilize AWS EMR-Serverless Batch functionality, the compute-spark chart must have certain values set and environment variables provided at installation time.
compute-spark:
  ...
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: YOUR_AWS_IRSA_ROLE_ARN # with permissions to create EMR Serverless, access S3 + KMS, ECR, and create network interfaces
  ...
  config_env_vars:
    SPARK_CUSTOMER_IAM_ROLE: YOUR_SPARK_CUSTOMER_IAM_ROLE # role that EMR Serverless batch jobs will assume to access the above S3 bucket and ECRs
    SPARK_NETWORK_AWS_SUBNET_IDS: # e.g. '["subnet-09324f6779738e07a","subnet-0ca401bda1a41c2a4","subnet-0ded853d655b0338b"]'
    SPARK_NETWORK_AWS_SECURITY_GROUPS: # e.g. '["sg-0e013a851ad5b55ee"]'
    SPARK_BLOB_STORAGE: YOUR_SPARK_BLOB_STORAGE # S3 bucket that stores datasets, Python files, JARs, etc. for use by an EMR Serverless batch job
  ...
  third_parties:
    third_party: "aws"
    aws:
      aws_default_region: YOUR_AWS_S3_REGION # e.g. "us-east-1"
      aws_region: YOUR_AWS_S3_REGION
Batch Support on Azure¶
Compute Spark batch functionality on Azure relies on the Databricks API.
Prerequisites and Constraints for Azure¶
This feature requires:
- A storage account with a storage container in it. The storage container is where you store your datasets, notebooks, job files, and any other files needed by your Spark job. This feature uses [Azure Data Lake Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction), meaning communication goes to the dfs.core.windows.net endpoint instead of the blob.core.windows.net endpoint. When using private endpoint connections for Azure Storage, be sure to create a private endpoint for both the Blob Storage endpoint and the Data Lake Storage endpoint.
- Make sure to enable Hierarchical Namespace for the storage account: go to the Overview page for the storage account, search for Hierarchical namespace, and if the word Disabled appears next to it, click on the word Disabled.
- A managed identity (rather than a service principal, for simplicity). During the creation process, ensure that the managed identity is in the same resource group as the storage account.
- Assign the managed identity the "Storage Blob Data Contributor" role (id: "ba92f5b4-2d11-453d-a403-e96b0029c9fe") over the storage container (which includes full permissions over all resources within the container). Steps here.
- Create a Databricks workspace via these steps.
- Assign the role "Managed Application Contributor Role" to the managed identity for the Databricks workspace.
- Link your Databricks workspace to the managed identity. Add this ending to your workspace URL: /settings/workspace/identity-and-access/service-principals/create. Example: https://adb-4324452524.9.azuredatabricks.net/settings/workspace/identity-and-access/service-principals/create. Select Microsoft Entra ID managed, enter the managed identity's "Client ID" into the "Microsoft Entra application ID" field, enter the managed identity's name into the "Service principal name" field, select "Allow cluster creation", select "Databricks SQL Access", select "Allow workspace access", and then click "Add".
- Give your managed identity federated access to your AKS cluster.
Setting values.yaml for Azure Batch Workloads¶
To utilize Azure Databricks Batch functionality, the compute-spark chart must have certain values set and environment variables provided at installation time.
# Example Azure batch configuration
compute-spark:
  extraLabels:
    azure.workload.identity/use: "true"
  ...
  third_parties:
    third_party: "azure"
  ...
  config_env_vars:
    SPARK_BLOB_STORAGE: # abfss storage used by Databricks to fulfill Spark job dependencies and to save job output
    AZURE_WORKSPACE_HOST: # The URL for your Databricks workspace (e.g. https://adb-800256028595268.9.azuredatabricks.net)
    AZURE_SUBSCRIPTION_ID: # The subscription ID that both the managed identity and the storage account belong to. It can be found when viewing the managed identity in your Azure portal
    AZURE_RESOURCE_GROUP_NAME: # The resource group name that both the managed identity and the storage account belong to. It can be found when viewing the storage account overview page in your Azure portal
    AZURE_CLIENT_ID: # The client ID can be found when viewing the managed identity overview page in your Azure portal
  ...
  serviceAccount:
    name: "spark-compute-services-sa"
    annotations:
      azure.workload.identity/client-id: # The client ID can be found when viewing the managed identity in your Azure portal
      azure.workload.identity/tenant-id: # The tenant ID can be found via these steps: https://learn.microsoft.com/en-us/azure/azure-portal/get-subscription-tenant-id
...
Batch Support on Google¶
Compute Spark batch functionality on Google Cloud (GCP) relies on the Dataproc Serverless Batch API.
Prerequisites and Constraints for Google¶
- This feature is limited to integration with a single Google Project in a single region.
- The network and subnet of the GKE cluster hosting DataRobot must be configured for Dataproc communication. See Dataproc Cluster network configuration. Likewise, batch workloads reaching out to other third-party services, like Snowflake, must have networking set up in Dataproc accordingly.
- Workload Identity must be enabled on the GKE cluster for authentication with GCP APIs.
- A Cloud Storage Bucket must be made available for storing datasets, python files, jars, etc. for use by a Dataproc Serverless Batch run.
- Custom Docker/OCI images for Spark batch workloads must be stored within a Google Artifact Registry.
- A Google Service Account must be created for running customer batch workloads on Dataproc.
  - The service account must have the roles/dataproc.worker role to run workloads in Dataproc.
  - The service account must have read/write permissions to the above Google Cloud Storage bucket. The roles/storage.objectUser role should be sufficient, but see IAM roles for Cloud Storage for more granular permissions.
  - The service account must have roles/artifactregistry.reader permissions to the above Google Artifact Registry for pulling custom images for batch workloads.
- A Google Service Account must be created for use by the compute-spark service for launching workloads in Dataproc.
  - The service account must have roles/dataproc.editor to create workloads in Dataproc.
  - The service account must have the roles/iam.workloadIdentityUser role and be mapped to the Kubernetes service account that the compute-spark service runs under. This is required for the compute-spark service to submit GCP API requests from its pods.
  - The service account must have the roles/iam.serviceAccountActor role mapped to the customer service account that runs batch workloads. This is required to submit workloads as the customer.
  - The service account must have read/write permissions to the above Google Cloud Storage bucket. The roles/storage.objectUser role should be sufficient, but see IAM roles for Cloud Storage for more granular permissions.
  - The service account must have roles/artifactregistry.reader permissions to the above Google Artifact Registry for submitting batch jobs with custom images to Dataproc.
Setting values.yaml for GCP Batch Workloads¶
To utilize Dataproc Serverless Batch functionality, the compute-spark chart must have certain values set and environment variables provided at installation time.
# Example GCP batch configuration
compute-spark:
  ...
  third_parties:
    third_party: "gcp"
  ...
  config_env_vars:
    GCP_DATAPROC_PROJECT_ID: YOUR_DATAPROC_PROJECT_ID # e.g. acme-project
    GCP_DATAPROC_REGION: YOUR_DATAPROC_REGION # e.g. us-east1
    GCP_DATAPROC_NETWORK: YOUR_DATAROBOT_K8S_NETWORK
    GCP_DATAPROC_SUBNET: YOUR_DATAROBOT_K8S_SUBNETWORK
    SPARK_CUSTOMER_IAM_ROLE: YOUR_CUSTOMER_DATAPROC_WORKLOAD_GOOGLE_SERVICE_ACCOUNT_EMAIL # e.g. customer-gsa@acme-project.iam.gserviceaccount.com
    SPARK_BLOB_STORAGE: YOUR_GOOGLE_BUCKET_FOR_DATAPROC
  ...
  serviceAccount:
    name: "spark-compute-services-sa"
    annotations:
      iam.gke.io/gcp-service-account: YOUR_COMPUTE_SPARK_GOOGLE_SERVICE_ACCOUNT_EMAIL # e.g. "compute-spark-gsa@acme-project.iam.gserviceaccount.com"
...
Database and Queue Connections¶
The default values of the datarobot umbrella chart will work with the DataRobot Persistent Critical Services (PCS) chart without changes. Otherwise, update the compute-spark-chart.services section with your deployment's specific connection information.
...
### Example External Services Configuration ###
compute-spark:
  ...
  services:
    postgresql:
      username: YOUR_POSTGRESQL_USER_NAME
      database: YOUR_POSTGRESQL_DATABASE_NAME
      hostname: YOUR_POSTGRESQL_HOSTNAME
      port: YOUR_POSTGRESQL_PORT
      # Additional connection options in the form of a Postgres URI connection string
      # Ex: "?option1=value1&option2=value2"
      connection_options: ""
    rabbitmq:
      username: YOUR_RABBITMQ_USERNAME
      hostname: YOUR_RABBITMQ_HOSTNAME
      # amqp or amqps
      scheme: amqp
      vhost: YOUR_RABBITMQ_VHOST
      port: YOUR_RABBITMQ_AMQP_PORT
      admin_port: YOUR_RABBITMQ_ADMIN_PORT
...
Environment Variables¶
You can supply additional environment variables to the spark-compute-envvars ConfigMap using compute-spark-chart.config_env_vars.
...
compute-spark:
  ...
  ### Example Environment Variables ###
  config_env_vars:
    SPARK_ENVVAR_NAME: YOUR_ENVVAR_VALUE
...
Secrets and External Secrets¶
The compute-spark chart supports reusing PCS secrets, supplying Kubernetes Secrets directly to create spark-csp-secrets, or using the Kubernetes External Secrets Operator.
The following keys are expected to be defined within the spark-csp-secrets Secret.
- PGSQL_PASSWORD
- RABBITMQ_PASSWORD
Option 1 (Default): Persistent Critical Services (PCS) chart secrets can be reused to provide Compute Spark API's PostgreSQL and RabbitMQ access.
...
compute-spark:
  ...
  createSecrets: "false"
  useSecretsVolume: "true"
...
Option 2: Secrets can be directly supplied through values.yaml.
...
compute-spark:
  ...
  ### Example Secret Variables ###
  createSecrets: "true"
  useSecretsVolume: "false"
  secrets:
    PGSQL_PASSWORD: YOUR_POSTGRESQL_PASSWORD
    RABBITMQ_PASSWORD: YOUR_RABBITMQ_PASSWORD
...
Option 3: Use AWS Secrets Manager and External Secrets operator.
This option requires the specified service account to have the appropriate IRSA role to access AWS Secrets Manager values. All required secrets are expected to be supplied within a single AWS secret key.
...
### Example External Secrets ###
compute-spark:
  ...
  createSecrets: "false"
  useSecretsVolume: "false"
  secretManager:
    labels: {}
    enabled: true
    secretStore:
      useExistingSecretStore: false
      name: "csp-spark-secretstore"
      region: YOUR_AWS_REGION
      serviceAccount:
        arn: YOUR_IRSA_ROLE_ARN
        name: "compute-spark-api-tenant-secret-store"
    refreshInterval: 1m
    # Example "/csp/csp-secrets"
    key: YOUR_AWS_SECRETS_MANAGER_KEY_PATH
...
Option 4: A Kubernetes administrator can create a spark-csp-secrets Secret with the required keys before executing the datarobot-aws umbrella chart.
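For example, a manually created Secret could look like the following minimal sketch (namespace omitted; the password values are placeholders, and only the two keys listed earlier in this section are required):
apiVersion: v1
kind: Secret
metadata:
  name: spark-csp-secrets
type: Opaque
stringData:
  # Required keys consumed as environment variables by the Compute Spark API Pods
  PGSQL_PASSWORD: "example-postgresql-password"   # placeholder
  RABBITMQ_PASSWORD: "example-rabbitmq-password"  # placeholder
Create the Secret in the namespace where the Compute Spark API Pods will run before installing the umbrella chart.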