Databricks Connector for Data Prep

User Persona: Data Prep User, Data Prep Admin, Data Source Admin, or IT/DevOps

Note

This document covers all configuration fields available during connector setup. Some fields may have already been filled out by your Administrator at an earlier step of configuration and may not be visible to you. For more information on Data Prep's connector framework, see Data Prep Connector setup. Also, your Admin may have named this connector something else in the list of Data Sources.

Configure Data Prep

This connector allows you to connect to Databricks for Library imports and exports. It has been certified against Databricks on Azure and AWS. The following fields are used to define the connection parameters.

This connector supports import via browse, import via query, and export operations.

All actions are performed over a JDBC connection, except that during export, data is loaded directly into Databricks storage (ADLS Gen2 or an S3 bucket, depending on the Databricks service provider).

General

Name: Name of the data source as it will appear to users in the UI.

Description: Description of the data source as it will appear to users in the UI.

Tip

You can connect Data Prep to multiple Databricks accounts. Using a descriptive name can be a big help to users in identifying the appropriate data source.

Databricks Server Configuration

  • Databricks Service Provider: Set this property based on the type of Databricks service you want to connect to. Databricks on Azure and AWS are supported.

  • Databricks on Azure

  • Databricks on AWS

  • Databricks Server Settings Type: Set this property based on how you want to configure the data source connection to Databricks.

  • Basic

  • Advanced

  • Databricks Server: The hostname of the server hosting the Databricks service.

  • Databricks Port: The port of the Databricks server.

  • Use SSL: Set this property to the value specified in the 'hive.server2.use.SSL' property of your Hive configuration file (hive-site.xml).

  • Transport Mode: Set this property to the value specified in the 'hive.server2.transport.mode' property of your Hive configuration file (hive-site.xml).

  • HTTP Path: In HTTP Transport Mode, set this property to specify the path component of the URL endpoint. This property should be set to the value specified in the 'hive.server2.thrift.http.path' property of the Hive configuration file (hive-site.xml).

  • Timeout: The number of seconds to wait before an operation times out. If set to zero, operations do not time out.

  • JDBC Url: In Advanced settings, set the entire JDBC URL used to connect to Databricks. Refer to the CData JDBC driver documentation for more details.
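For orientation, the sketch below shows the general shape of a full Databricks JDBC URL in Java. The hostname, HTTP path, and token are hypothetical, and the property names follow the common Spark/Hive JDBC convention; the CData driver's URL format differs, so treat its documentation as authoritative.

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class DatabricksJdbcSmokeTest {
        public static void main(String[] args) throws Exception {
            // Hypothetical workspace values -- substitute your own hostname,
            // HTTP path, and personal access token.
            String url = "jdbc:spark://adb-1234567890123456.7.azuredatabricks.net:443/default"
                    + ";transportMode=http;ssl=1"
                    + ";httpPath=sql/protocolv1/o/1234567890123456/0123-456789-abcd123"
                    + ";AuthMech=3;UID=token;PWD=" + System.getenv("DATABRICKS_TOKEN");
            try (Connection conn = DriverManager.getConnection(url)) {
                System.out.println("Connected to: " + conn.getMetaData().getDatabaseProductName());
            }
        }
    }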

Databricks Server Authentication Configuration

  • User: The username used to authenticate with the Databricks server. This is usually 'token'.

  • Password: The personal access token used to authenticate with Databricks. You can obtain a token by navigating to the User Settings page of your Databricks instance and selecting the Access Tokens tab.

Databricks Log Settings

  • Verbosity: The verbosity level that determines the amount of detail included in the log file. Higher verbosity is useful when debugging issues in production.

  • Logfile: The path for the driver log file on the Pax server. All directories in the specified path must already exist.

Databricks Server Export Storage Layer Configuration

Azure

  • ADLS Gen2 Data Store Root Directory: The apparent root path accessible by this connector. Use '/' to store the Databricks data within the root folder of the ADLS Gen2 file system.

  • ADLS Gen2 Storage Account Name: The subdomain name of your unique Azure URL. This storage account must be associated with, and accessible by, the Databricks cluster. ADLS Gen2 storage account names must be between 3 and 24 characters long and may contain only numbers and lowercase letters. The name must be unique within Azure; no two storage accounts can have the same name.

  • ADLS Gen2 File System Name: The name of the ADLS Gen2 file system where you want to store the Databricks data within the storage account. This is sometimes called the 'container' name.

  • Authentication Type: The type of authentication used to connect to ADLS Gen2 storage: either "Storage Account Access Key" or "Active Directory Username/Password."

  • ADLS Gen2 Storage Account Access Key: Enter the Storage Account Access Key in the field. This is sometimes referred to as a “Shared Key.”

  • Active Directory Username/Password: Enter the Azure Active Directory username and password associated with your account.

Note

You must grant Data Prep access to read and write data within your Microsoft account; otherwise, you will get an error when attempting to connect. To grant access, click Test Data Source in the connector setup pane and follow the Grant Access link. This brings you to your Microsoft account, where you can log in and grant access. Then return to Data Prep to continue.
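Before entering these values in the connector, you can confirm that the storage account, file system, and access key line up. The minimal Java sketch below assumes the Azure Storage Data Lake SDK (azure-storage-file-datalake); the account and file system names are hypothetical.

    import com.azure.storage.common.StorageSharedKeyCredential;
    import com.azure.storage.file.datalake.DataLakeServiceClient;
    import com.azure.storage.file.datalake.DataLakeServiceClientBuilder;

    public class AdlsGen2AccessCheck {
        public static void main(String[] args) {
            // Hypothetical values -- use your own account, key, and file system.
            String accountName = "dataprepstorage";      // 3-24 chars, lowercase letters and digits
            String accountKey  = System.getenv("ADLS_ACCOUNT_KEY");
            String fileSystem  = "databricks-export";    // the 'container' name

            DataLakeServiceClient client = new DataLakeServiceClientBuilder()
                    .endpoint("https://" + accountName + ".dfs.core.windows.net")
                    .credential(new StorageSharedKeyCredential(accountName, accountKey))
                    .buildClient();

            // exists() succeeds only if the file system is reachable with this key.
            boolean ok = client.getFileSystemClient(fileSystem).exists();
            System.out.println("File system reachable: " + ok);
        }
    }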

AWS

  • S3 Bucket Name: The name of the S3 bucket where you want to store the Databricks data in Amazon S3. This bucket must be associated with, and accessible by, the Databricks cluster.

  • S3 Object Prefix: The apparent root path accessible by this connector. Use '/' to store the Databricks data within the root folder of the S3 bucket.

  • Authentication type: The authentication method for accessing the S3 bucket.

  • AWS Credentials: Requires each user to enter the Access Key ID and Secret Key associated with the user’s AWS Access Key. This is the default setting.

  • Instance Profile (IAM Role): Enables all users in this tenant to access AWS without needing to individually authenticate.

  • IAM Cross Account: Enables access to S3 by assuming a role in another AWS account that has access to the configured S3 bucket.

Important

For the Instance Profile (IAM Role) and IAM Cross Account options, Data Prep must be installed on your Amazon EC2 hosts.
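To illustrate what the IAM Cross Account option does conceptually, here is a minimal sketch using the AWS SDK for Java v2 with a hypothetical role ARN: access to the bucket is obtained by assuming a role in the AWS account that owns it.

    import software.amazon.awssdk.services.sts.StsClient;
    import software.amazon.awssdk.services.sts.model.AssumeRoleRequest;
    import software.amazon.awssdk.services.sts.model.Credentials;

    public class CrossAccountAssumeRole {
        public static void main(String[] args) {
            // Hypothetical role ARN in the account that owns the S3 bucket.
            String roleArn = "arn:aws:iam::111122223333:role/dataprep-s3-export";

            try (StsClient sts = StsClient.create()) {
                Credentials creds = sts.assumeRole(AssumeRoleRequest.builder()
                            .roleArn(roleArn)
                            .roleSessionName("dataprep-export")
                            .build())
                        .credentials();
                // These temporary credentials would then back the S3 calls.
                System.out.println("Temporary access key: " + creds.accessKeyId());
            }
        }
    }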

  • Encryption type:

  • None

  • SSE-S3
  • SSE-KMS

  • Bucket Region Locator: The strategy used to determine the AWS region of the S3 bucket.

  • Socket Timeout Seconds: The number of seconds to wait for a response from Amazon S3 on an established connection. The default is 300 seconds (5 minutes). Increase this value when exporting large files.
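To sanity-check that the bucket is reachable with your credentials, and to see where a longer socket timeout plugs in, here is a minimal sketch using the AWS SDK for Java v2; the bucket name and region are hypothetical.

    import java.time.Duration;
    import software.amazon.awssdk.http.apache.ApacheHttpClient;
    import software.amazon.awssdk.regions.Region;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.HeadBucketRequest;

    public class S3BucketCheck {
        public static void main(String[] args) {
            // Credentials come from the default provider chain (AWS credentials,
            // an instance profile, or temporary cross-account credentials).
            S3Client s3 = S3Client.builder()
                    .region(Region.US_EAST_1) // hypothetical region
                    .httpClientBuilder(ApacheHttpClient.builder()
                            .socketTimeout(Duration.ofSeconds(300))) // mirrors the connector setting
                    .build();

            // headBucket throws an exception if the bucket is missing or inaccessible.
            s3.headBucket(HeadBucketRequest.builder().bucket("dataprep-databricks-export").build());
            System.out.println("Bucket reachable.");
        }
    }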

  • Browse:

    • View the list of available databases and tables.

  • Import:

    • Browse: Browse to a table (partitioned or non-partitioned) and click its name to import it.

    • Query: Import the results of a legal SQL SELECT query (see the sketch after this list).

  • Export:

    • Browse to a database and export the table.
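As an illustration of the query import path, a minimal sketch that reuses the JDBC connection from the earlier example; the table name and SELECT statement are hypothetical.

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DatabricksQueryImport {
        static void preview(Connection conn) throws Exception {
            // Hypothetical table; any legal SELECT statement works as an import query.
            String sql = "SELECT * FROM default.sales_transactions LIMIT 100";
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }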

Configuration Layout

Example configuration layouts:

  • Databricks on Azure with ADLS Gen2 storage

  • Databricks on AWS with S3 bucket storage

  • Import via Browse

  • Export via Browse

Configure Databricks cluster

In addition to configuring your Databricks connector in DataRobot, you must also add Spark configurations to your Databricks cluster:

  1. Navigate to the Configuration tab of your Databricks cluster and expand Advanced Options.

  2. In the Spark tab, add and save the following configuration settings:

    spark.sql.legacy.parquet.datetimeRebaseModeInRead LEGACY
    spark.driver.maxResultSize 12g
    

Databricks Connector Known Issues and Limitations

The following features might not work in some production environments. These issues will be fixed in an upcoming release.

  • Authentication with an Azure Databricks instance and ADLS Gen2 storage using Active Directory credentials.

  • Authentication with an AWS Databricks instance and Amazon S3 storage using a cross-account bucket ARN.

  • Authentication with an AWS Databricks instance and Amazon S3 storage using an IAM role.

  • Importing tables from an AWS Databricks instance whose data is encrypted with SSE-KMS in an unencrypted S3 bucket.

  • Importing tables from an AWS Databricks instance whose data is encrypted with SSE-S3 or SSE-KMS in an encrypted S3 bucket.

  • Importing tables from an AWS Databricks instance after authenticating with a cross account and IAM role.

  • Exporting to an AWS Databricks instance after authenticating with a cross account and IAM role.


Updated October 28, 2021