

Cloudera CDH6 Impala Connector for Data Prep

User Persona: Data Prep Admin, Data Source Admin, or IT/DevOps

Availability information

This Connector is not available to Data Prep SaaS customers.

Note

This document covers all configuration fields available during connector setup. Some fields may have already been filled out by your Administrator at an earlier step of configuration and may not be visible to you. For more information on Data Prep's connector framework, see Data Prep Connector setup. Also, your Admin may have named this connector something else in the list of Data Sources.

Configuring Data Prep

This connector allows you to connect to an Impala database for imports and exports. The fields you are required to set up depend on the authentication method you select: Simple, Kerberos, or Hybrid. The authentication method you select applies to all Data Sources that you create from this connector configuration.

Note

Configuring this Connector requires file system access on the Data Prep Server and a core-site.xml with the Hadoop cluster configuration. Please reach out to your Customer Success representative for assistance with this step.

General

  • Name: Name of the data source as it will appear to users in the UI.
  • Description: Description of the data source as it will appear to users in the UI.

Tip

You can connect Data Prep to multiple Impala databases. Using a descriptive name can be a big help to users in identifying the appropriate data source.

Hadoop Cluster

  • Authentication Method: Choose Simple, Kerberos, or Hybrid. The authentication method you select applies to all Data Sources that you create from this connector configuration.
  • Cluster Core Site XML Path: Fully qualified path of core-site.xml on the webserver. Example: /path/to/core-site.xml (a minimal core-site.xml sketch follows this list).
  • Cluster HDFS Site XML Path: Fully qualified path of hdfs-site.xml on the webserver. Example: /path/to/hdfs-site.xml
  • Native Hadoop Library Path: Fully qualified path of the native Hadoop libraries on the webserver. Example: /path/to/libraries
  • HDFS User: The username on the HDFS cluster used to write files for export to Impala.
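
For reference, the core-site.xml identifies the cluster's default file system, among other settings. A minimal sketch, with a placeholder NameNode host and port (your cluster's values will differ):

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode.example.com:8020</value>
      </property>
    </configuration>

In practice, use the core-site.xml distributed by your CDH6 cluster rather than authoring one by hand.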

Impala Configuration

  • JDBC URL: The URL used to access Impala for import and registration of external tables. If Kerberos authentication is used, the following string must be added to the URL: ";auth=kerberos;impala.server2.proxy.user=${user.name}" (a sample URL follows this list).
    • If a proxy user is used, the string ${user.name} must be replaced with the proxy username.
  • Impala File Location: The location on the Hadoop cluster used to store Impala files for external tables.
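
For illustration only, a Kerberos-enabled JDBC URL might look like the following. The scheme, hostname, port, and database here are placeholders, and the exact form depends on the JDBC driver your deployment uses:

    jdbc:impala://impala-host.example.com:21050/default;auth=kerberos;impala.server2.proxy.user=${user.name}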

Kerberos Configuration

The following parameters are required for Kerberos and Hybrid authentication.

  • Principal: Kerberos Principal.
  • Realm: Kerberos Realm.
  • KDC Hostname: Kerberos Key Distribution Center Hostname.
  • Kerberos Configuration File: Fully qualified path of the Kerberos configuration file on the webserver (a minimal sketch follows this list).
  • Keytab File: Fully qualified path of the Kerberos keytab file on the webserver.
  • Use Application User: Check this box to read/write as the logged-in application user, or uncheck it to use a proxy user.
  • Proxy User: The proxy user used to authenticate with the cluster. ${user.name} can be entered as the proxy user; it works similarly to selecting Use Application User but allows for more flexibility. For example:

  • To add a domain to the user’s credentials, enter \domain_name\${user.name} in the Proxy User field. Data Prep will pass the username and the domain.

    • Example: \Accounts\${user.name} results in \Accounts\Joe (assuming Joe is the username).
  • To apply a text modifier to the username, add .modifier to the key ${user.name}. The acceptable modifiers are: toLower, toUpper, toLowerCase, toUpperCase, and trim.
    • For example, ${user.name.toLowerCase} converts Joe into joe (assuming Joe is the username).
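
For context, the Kerberos Configuration File referenced above is a standard krb5.conf. A minimal sketch, with EXAMPLE.COM and kdc.example.com as placeholder values:

    [libdefaults]
        default_realm = EXAMPLE.COM

    [realms]
        EXAMPLE.COM = {
            kdc = kdc.example.com
            admin_server = kdc.example.com
        }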

Credentials

  • Impala User: The username used to access Impala for Simple and Hybrid authentication.
  • Impala Password: The password used to access Impala for Simple and Hybrid authentication.

Visibility Settings

You can control the schemas and tables that are shown to users when they browse a data source during import. For schemas and tables you can choose to:

  • "Show only" which returns only the schemas or tables that you specify here.
  • "Hide" which hides the schemas and tables that you specify here.
  • "Show all" which is the default setting to display everything in the data source.

When you select the "Show only" or "Hide" options, a field is provided for specifying the schemas or tables on which you want the option enforced.

Note

These settings are not enforced when users query against the data source; query results still return a complete list of matches. For example, if you choose to "hide" a specific schema, users can still execute queries that pull data from tables within that schema. However, that schema will not be displayed to users when they browse the data source.

Import Configuration

  • Query Prefetch Size: Number of rows per batch.
  • Max Column Size: The maximum size in Unicode characters allowed for any value for both import and export. Values larger than this will be replaced by null.
  • PRE-IMPORT SQL: SQL to be executed before the import process. This SQL may execute multiple times (for preview and import) and may consist of multiple SQL statements, newline-delimited.
  • POST-IMPORT SQL: SQL to be executed after the import process. This SQL may execute multiple times (for preview and import) and may consist of multiple SQL statements, newline-delimited. An example follows the note below.

Note

As the Pre- and Post-Import SQL may be executed multiple times throughout the import process, please take care when specifying these values in the Connector/Datasource Configuration as they will be executed for every import performed with this configuration.
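
As a hedged illustration (the query option and table name below are placeholders, not recommendations), a Pre-Import SQL value with two newline-delimited statements might look like:

    SET MEM_LIMIT=2g
    REFRESH my_schema.my_table

Because these statements can run for every preview and import, keep them idempotent.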

Export Configuration

  • PRE-EXPORT SQL: SQL to be executed before the export process. This SQL executes once and may consist of multiple SQL statements, newline-delimited.
  • POST-EXPORT SQL: SQL to be executed after the export process. This SQL executes once and may consist of multiple SQL statements, newline-delimited. An example follows this list.
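
For example (the schema and table names are placeholders), a Post-Export SQL value might refresh metadata and compute statistics on the newly written table:

    REFRESH my_schema.exported_table
    COMPUTE STATS my_schema.exported_table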

Data Import and Export Information

Via Browsing

Browse to a table and click "Select" to import the table.

Via SQL Query

Import data using SQL SELECT queries, as in the example below.
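
For example (the schema, table, and column names are placeholders):

    SELECT customer_id, order_date, order_total
    FROM sales.orders
    WHERE order_date >= '2022-01-01'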


Updated June 17, 2022