Once data is read into the pipeline, it typically goes through a series of transformations. Transform modules let you perform operations such as combining multiple datasets, removing duplicates, and cleaning erroneous values.
See the section on data processing limits for each module type.
Spark SQL module¶
The Spark SQL transform module lets you write SQL queries on the incoming data. This module accepts one or more input datasets. You can compose SQL queries on these datasets to generate the desired output.
In the SQL queries, you address the datasets using the input port names. For example, if the module has two input ports, Orders and Customers, the SQL query must refer to the incoming data using the port names Orders and Customers, as shown here:
SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
INNER JOIN Customers
    ON Orders.CustomerID = Customers.CustomerID
While most SQL transformations require at least one input dataset, some SQL statements can be written without any input data. In these cases, you can remove the inputs of the Spark SQL module and keep just the output.
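For instance, a statement built entirely from literal values needs no input dataset. The following sketch (the column names and values are illustrative, not from any real dataset) produces a small inline result set that downstream modules can consume:

```sql
-- Generate rows from literals only; no input port is referenced,
-- so the module's inputs can be removed.
SELECT 1 AS ProductID, 'Widget' AS ProductName
UNION ALL
SELECT 2 AS ProductID, 'Gadget' AS ProductName
```

Because the query references no port name, removing the module's input ports does not invalidate it; the output port still emits the two generated rows.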