Data Prep Spark SQL Guidelines¶
Structured Query Language (SQL) is a declarative language designed for managing data stored in relational databases. Spark SQL is a component of Spark that allows queries to be written against registered DataFrames (data organized into named columns) in much the same way as SQL is used to query databases. Data Prep supports a library of Spark SQL functions for use with the Spark SQL tool.
Use case¶
The example dataset featured in this section contains a sample of job application data with the target feature, Hired
.
Query guidelines¶
Once a dataset is loaded into Data Prep, it is registered as a DataFrame and can now be queried using the dataset
alias in a Spark SQL statement.
To use Spark SQL to shape your data, enable the Spark SQL tool and enter SQL queries in the Spark SQL Statement pane.
Data Prep allows only SELECT
queries. Build the queries using the column names, for example:
SELECT
EducationLevel,
Hired
FROM dataset
See Prohibited keywords and functions for other restrictions.
Sample queries¶
The following are sample Spark SQL queries used in Data Prep, along with descriptions of the results.
Note
In Data Prep, SQL statements are case sensitive, adhering to the common SQL convention of using all caps for keywords and lowercase for variable names.
Example 1
SELECT * FROM dataset
The updated dataset includes all columns and all rows from the dataset.
Example 2
SELECT * FROM dataset
The result is: ParseException
Example 3
SELECT * from doesNotExist
The result is: NoSuchTableException
Example 4
SELECT Hired FROM dataset
The updated dataset includes only the Hired
column (containing whether or not the application was successful) and all rows.
Example 5
SELECT
EducationLevel,
Hired
FROM dataset
WHERE EducationLevel = 5
The updated dataset includes only the Hired
and EducationLevel
columns and only those rows where the EducationLevel
is 5.
Example 6
SELECT
EducationLevel,
CASE WHEN Hired = 'No' THEN 0 ELSE 1 END
AS HiredNum
FROM dataset
WHERE EducationLevel = 5
The updated dataset includes EducationLevel
and a numeric version of the Hired
column for rows where the EducationLevel
is 5.
Example 7
SELECT
EducationLevel,
avg(CASE WHEN Hired = 'No' THEN 0 ELSE 1 END) AS acceptance_rate
FROM dataset
GROUP BY EducationLevel
ORDER BY EducationLevel
The updated dataset includes EducationLevel
and the average acceptance rate in EducationLevel
groups, ordered by EducationLevel
.
Example 8
SELECT
EducationLevel,
avg(CASE WHEN Hired = 'No' THEN 0 ELSE 1 END) AS acceptance_rate,
std(CASE WHEN Hired = 'No' THEN 0 ELSE 1 END) AS acceptance_rate_std
FROM dataset
GROUP BY EducationLevel
ORDER BY EducationLevel
The updated dataset includes EducationLevel
and the average and standard deviation of the acceptance rate in EducationLevel
groups, ordered by EducationLevel
.
Example 9
SELECT
EducationLevel,
length(Summary) AS length_summary
FROM dataset
Here we surface the EducationLevel
and the length of the cover letter that went along with the application.
Example 10
SELECT
EducationLevel,
avg(length(Summary)) AS avg_length_summary,
std(length(Summary)) AS std_length_summary,
std(CASE WHEN Hired = 'No' THEN 0 ELSE 1 END) AS acceptance_rate_std,
avg(CASE WHEN Hired = 'No' THEN 0 ELSE 1 END) AS acceptance_rate
FROM dataset
GROUP BY EducationLevel
ORDER BY EducationLevel
This example puts it all together— the updated dataset includes EducationLevel
, the average and standard deviation of the acceptance rate, and the average and standard deviation of the summary length in EducationLevel
groups, ordered by EducationLevel
.
Prohibited keywords and functions¶
Some SparkSQL 2.4.0 functions present a security risk. Data Prep prohibits the use of commands and methods that raise a potential security risk.
Click the tabs below to view lists of prohibited keywords and functions.
ALTER (DATABASE|SCHEMA)
ALTER [TABLE | VIEW]
ALTER VIEW
CREATE (DATABASE|SCHEMA)
CREATE FUNCTION
CREATE TABLE
CREATE VIEW
DROP DATABASE
DROP FUNCTION
DROP [TABLE | VIEW]
MSCK REPAIR TABLE or ALTER TABLE RECOVER PARTITIONS
TRUNCATE TABLE
USE
LOAD DATA
EXPLAIN
ADD FILE
ADD JAR
ANALYZE TABLE
CACHE TABLE
CLEAR CACHE
DESCRIBE DATABASE
DESCRIBE FUNCTION
DESCRIBE TABLE
LIST FILE
LIST JAR
REFRESH
REFRESH TABLE
RESET
SET
SHOW COLUMNS
SHOW CREATE TABLE
SHOW (DATABASES|SCHEMAS)
SHOW FUNCTIONS
SHOW PARTITIONS
SHOW TABLES or SHOW TABLE EXTENDED
SHOW TBLPROPERTIES
UNCACHE TABLE
DESCRIBE
CREATE TEMP VIEW USING
CREATE TABLE
INSERT [INTO | OVERWRITE]
CREATE TABLE
INSERT OVERWRITE DIRECTORY
java_method
reflect
Considerations¶
-
The Data Prep Spark SQL tool supports the commands and functions available in Spark SQL 2.4.0.
-
Only
SELECT
queries are allowed. All other SQL operations that might change the current dataset are prohibited (e.g.,INSERT
,UPDATE
,DELETE
). -
Built-in functions that could potentially hurt the system are also prohibited (e.g.,
reflect
,java_method
). See Prohibited keywords and functions. -
Each SQL statement is executed under its own Spark session, so there is no possibility for data leaking between SQL executions.
-
For data type handling:
- Infinity is not supported and is treated as null.
- If a column has mixed data types, the value is converted to the specified type if convertible, otherwise, it is set to null.
- When converting a SQL result into a Data Prep dataset, all data types from the SQL columns are converted to Data Prep data types. If the type is not supported, it is converted to String type.