This accelerator outlines an example that runs Feature Discovery SQL in a Docker-based Spark cluster. It walks you through the process of setting up a Spark cluster in Docker, registering custom user-defined functions (UDFs), and executing complex SQL queries for feature engineering across multiple datasets. The same approach can be applied to other Spark environments, such as GCP Dataproc, Amazon EMR, and Cloudera CDP, providing flexibility for running Feature Discovery on various Spark platforms.
Features are commonly split across multiple data assets. Bringing these data assets together can take a lot of work, as it involves joining them and then running machine learning models. It's even more difficult when the datasets are of different granularities, as it requires you to aggregate to join the data successfully.
Feature Discovery solves this problem by automating the procedure of joining and aggregating your datasets. After you define how the datasets need to be joined, DataRobot handles feature generation and modeling.
Feature Discovery uses Spark to perform joins and aggregations, generating Spark SQL at the end of the process. In some cases, you may want to run this Spark SQL in other Spark clusters to gain more flexibility and scalability for handling larger datasets, without the need to load data directly into the DataRobot environment. This approach allows you to leverage external Spark clusters for more resource-intensive tasks.
This accelerator provides an example of running Feature Discovery SQL in Docker-based Spark cluster.
Feature Discovery SQL is compatible with Spark 3.2(.2), Spark 3.4(.1), and Scala 2.12(.15). Using different Spark & Scala versions might lead to errors.
The UDFs .jar and environment files can be obtained from the following locations. Note that environment file is only required if working with Japanese text.
Using Feature Discovery SQL in other Spark clusters.ipynbis the notebook providing a framework for running Feature Discovery SQL in a new Spark cluster on Docker.
docker-compose.yml, Dockerfile, and start-spark.sh are files used by Docker to build and start the Docker container with Spark.
utils.py includes a helper function to download datasets and the UDFs jar.
The app directory includes:
Spark SQL (a file with a .sql extension)
Datasets (files with a .csv extension)
Helper function (files with a .py extension) to parse and execute the SQL
The libs directory includes:
A user-defined functions (UDFs) JAR file
An environment file (only required if datasets include Japanese text, which requires a Mecab tokenizer to handle)
The data directory is empty, as it is used to store the output result
Note that the datasets, UDFs jar, and environment files are initially unavailable. They have to be downloaded, as described in the accelerator.