Data Prep Spark SQL Guidelines

Structured Query Language (SQL) is a declarative language designed for managing data stored in relational databases. Spark SQL is a component of Spark that allows queries to be written against registered DataFrames (data organized into named columns) in much the same way as SQL is used to query databases. Data Prep supports a library of Spark SQL functions for use with the Spark SQL tool.

Use case

The example dataset featured in this section contains a sample of job application data with the target feature, Hired.

Query guidelines

Once a dataset is loaded into Data Prep, it is registered as a DataFrame and can be queried using the dataset alias in a Spark SQL statement.

To use Spark SQL to shape your data, enable the Spark SQL tool and enter SQL queries in the Spark SQL Statement pane.

Data Prep allows only SELECT queries. Build the queries using the column names, for example:

SELECT
    EducationLevel,
    Hired
FROM dataset

See Prohibited keywords and functions for other restrictions.

Sample queries

The following are sample Spark SQL queries used in Data Prep, along with descriptions of the results.

Note

In Data Prep, SQL statements are case sensitive. The examples in this section follow the common SQL convention of writing keywords in all caps, and column names are written exactly as they appear in the dataset.

Example 1

SELECT * FROM dataset

The updated dataset includes all columns and all rows from the dataset.

Example 2

ELECT * FROM dataset

The result is a ParseException because the SELECT keyword is misspelled.

Example 3

SELECT * from doesNotExist

The result is a NoSuchTableException because no dataset named doesNotExist is registered.

Example 4

SELECT Hired FROM dataset

The updated dataset includes only the Hired column (indicating whether the application was successful) and all rows.

Example 5

SELECT
    EducationLevel,
    Hired
FROM dataset
WHERE EducationLevel = 5

The updated dataset includes only the Hired and EducationLevel columns and only those rows where the EducationLevel is 5.

Example 6

SELECT
    EducationLevel,
    CASE WHEN Hired = 'No' THEN 0 ELSE 1 END AS HiredNum
FROM dataset
WHERE EducationLevel = 5

The updated dataset includes EducationLevel and a numeric version of the Hired column for rows where the EducationLevel is 5.

Example 7

SELECT
    EducationLevel,
    avg(CASE WHEN Hired = 'No' THEN 0 ELSE 1 END) AS acceptance_rate
FROM dataset
GROUP BY EducationLevel
ORDER BY EducationLevel

The updated dataset includes EducationLevel and the average acceptance rate in EducationLevel groups, ordered by EducationLevel.

Example 8

SELECT
    EducationLevel,
    avg(CASE WHEN Hired = 'No' THEN 0 ELSE 1 END) AS acceptance_rate,
    std(CASE WHEN Hired = 'No' THEN 0 ELSE 1 END) AS acceptance_rate_std
FROM dataset
GROUP BY EducationLevel
ORDER BY EducationLevel

The updated dataset includes EducationLevel and the average and standard deviation of the acceptance rate in EducationLevel groups, ordered by EducationLevel.

Example 9

SELECT
    EducationLevel,
    length(Summary) AS length_summary
FROM dataset

The updated dataset includes EducationLevel and the length of the Summary column (the cover letter submitted with the application).

Example 10

SELECT
    EducationLevel,
    avg(length(Summary)) AS avg_length_summary,
    std(length(Summary)) AS std_length_summary,
    std(CASE WHEN Hired = 'No' THEN 0 ELSE 1 END) AS acceptance_rate_std,
    avg(CASE WHEN Hired = 'No' THEN 0 ELSE 1 END) AS acceptance_rate
FROM dataset
GROUP BY EducationLevel
ORDER BY EducationLevel

This example puts it all together: the updated dataset includes EducationLevel, the average and standard deviation of the acceptance rate, and the average and standard deviation of the summary length in EducationLevel groups, ordered by EducationLevel.

Prohibited keywords and functions

Some Spark SQL 2.4.0 commands and functions present a potential security risk, so Data Prep prohibits their use.

The following keywords and functions are prohibited:

ALTER (DATABASE|SCHEMA)
ALTER [TABLE | VIEW]
ALTER VIEW
CREATE (DATABASE|SCHEMA)
CREATE FUNCTION
CREATE TABLE
CREATE VIEW
DROP DATABASE
DROP FUNCTION
DROP [TABLE | VIEW]
MSCK REPAIR TABLE or ALTER TABLE RECOVER PARTITIONS
TRUNCATE TABLE
USE
LOAD DATA
EXPLAIN
ADD FILE
ADD JAR
ANALYZE TABLE
CACHE TABLE
CLEAR CACHE
DESCRIBE DATABASE
DESCRIBE FUNCTION
DESCRIBE TABLE
LIST FILE
LIST JAR
REFRESH
REFRESH TABLE
RESET
SET
SHOW COLUMNS
SHOW CREATE TABLE
SHOW (DATABASES|SCHEMAS)
SHOW FUNCTIONS
SHOW PARTITIONS
SHOW TABLES or SHOW TABLE EXTENDED
SHOW TBLPROPERTIES
UNCACHE TABLE
DESCRIBE
CREATE TEMP VIEW USING
INSERT [INTO | OVERWRITE]
INSERT OVERWRITE DIRECTORY
java_method
reflect
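
As an illustration (the class, method, and argument values below are made up for this example), a query that calls one of the prohibited reflection functions, such as java_method, would be rejected even though it is otherwise a syntactically valid SELECT:

SELECT
    EducationLevel,
    java_method('java.lang.System', 'getProperty', 'user.name') AS runtime_user
FROM dataset

Removing the java_method call, or replacing it with an allowed built-in function, makes the query acceptable again.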

Considerations

  • The Data Prep Spark SQL tool supports the commands and functions available in Spark SQL 2.4.0.

  • Only SELECT queries are allowed. All other SQL operations that might change the current dataset are prohibited (e.g., INSERT, UPDATE, DELETE); see the example after this list.

  • Built-in functions that could potentially hurt the system are also prohibited (e.g., reflect, java_method). See Prohibited keywords and functions.

  • Each SQL statement is executed under its own Spark session, so data cannot leak between SQL executions.

  • For data type handling:

    • Infinity is not supported and is treated as null.
    • If a column has mixed data types, each value is converted to the specified type if convertible; otherwise, it is set to null.
    • When converting a SQL result into a Data Prep dataset, all data types from the SQL columns are converted to Data Prep data types. If the type is not supported, it is converted to String type.
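
To illustrate the SELECT-only restriction, the hypothetical statement below (the column order and inserted values are made up for this example) would be rejected because INSERT is a prohibited operation:

INSERT INTO dataset VALUES (5, 'Sample cover letter text', 'Yes')

Only statements that read from the registered dataset, such as the SELECT examples earlier on this page, are executed.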
