Transform data

Once data is read into the pipeline, it typically goes through a series of transformations. Transform modules let you perform operations such as combining multiple datasets, removing duplicates, and cleaning erroneous values.

See the section on data processing limits for each module type.

Spark SQL module

The Spark SQL transform module lets you write SQL queries on the incoming data. This module accepts one or more input datasets. You can compose SQL queries on these datasets to generate the desired output.

In your SQL queries, address the datasets by their input port names. For example, if the module has two input ports, Orders and Customers, the query must refer to the incoming data using the port names Orders and Customers, as shown here:

SELECT
    Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM
    Orders
    INNER JOIN Customers
        ON Orders.CustomerID = Customers.CustomerID

Tip

While most SQL transformations require at least one input dataset, some SQL statements can be written without any input data. In these cases, you can remove the inputs of the Spark SQL module and keep just the output.
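As a sketch of what such a no-input query might look like, the following Spark SQL statement generates a small dataset from scratch using built-in functions (the column names here are arbitrary, chosen for illustration):

```sql
-- Produces five rows with no input dataset: an id column from a
-- generated sequence and a column holding the current date.
SELECT explode(sequence(1, 5)) AS id,
       current_date() AS generated_on
```

A query like this can serve as a standalone data source within the pipeline, since it requires no upstream dataset.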


Updated April 19, 2022