Use Feature Discovery SQL with Spark clusters¶
This accelerator outlines an example that runs Feature Discovery SQL in a Docker-based Spark cluster. It walks you through the process of setting up a Spark cluster in Docker, registering custom user-defined functions (UDFs), and executing complex SQL queries for feature engineering across multiple datasets. The same approach can be applied to other Spark environments, such as GCP Dataproc, Amazon EMR, and Cloudera CDP, providing flexibility for running Feature Discovery on various Spark platforms.
Problem framing¶
Features are commonly split across multiple data assets. Bringing these data assets together requires significant work: you must join them before you can run a machine learning model. It's even more difficult when the datasets have different granularities, as you then need to aggregate the data in order to join it successfully.
Feature Discovery solves this problem by automating the steps needed to join and aggregate datasets. After you define how the datasets relate to each other, DataRobot handles feature generation and modeling.
Feature Discovery uses Spark to perform joins and aggregations, generating Spark SQL at the end of the process. In some cases, you may want to run this Spark SQL in other Spark clusters for more flexibility and scalability when handling larger datasets, without loading data directly into the DataRobot environment. This approach lets you offload resource-intensive work to external Spark clusters.
This accelerator provides an example of running Feature Discovery SQL in a Docker-based Spark cluster.
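As a starting point, you need the generated SQL itself. The sketch below assumes a recent version of the `datarobot` Python client; the endpoint, token, and project ID are placeholders, and the export method should be verified against your client version's documentation:

```python
import datarobot as dr

# Connect to DataRobot (placeholder credentials).
dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

# Download the Spark SQL generated by Feature Discovery for a project.
# Verify this method against your `datarobot` client version's documentation.
project = dr.Project.get("YOUR_PROJECT_ID")
project.download_feature_discovery_recipe_sqls(file_name="apps/LC_FD_SQL.sql")
```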
Prerequisites¶
- Install Docker
- Install Docker Compose
- Download the required datasets, the UDFs .jar file, and an environment file (optional); a sketch of a simple download helper follows this list
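The `utils.py` file in this accelerator provides a helper for these downloads. If you need to fetch the artifacts yourself, a minimal sketch could look like the following; the URL is a hypothetical placeholder, since the actual locations are linked in the Compatibility section below:

```python
import shutil
import urllib.request

def download(url: str, dest: str) -> None:
    """Stream a remote file to a local path."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as f:
        shutil.copyfileobj(resp, f)

# Hypothetical placeholder URL; see the Compatibility section for the real links.
download("https://example.com/spark-udf-assembly-0.1.0.jar",
         "libs/spark-udf-assembly-0.1.0.jar")
```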
Compatibility¶
- Feature Discovery SQL is compatible with Spark 3.2 (specifically 3.2.2), Spark 3.4 (specifically 3.4.1), and Scala 2.12 (specifically 2.12.15). Using different Spark and Scala versions might lead to errors; a quick runtime version check is sketched after this list.
- The UDFs .jar and environment files can be obtained from the locations below. Note that the environment file is only required when working with Japanese text.
    - Spark 3.2.2
    - Spark 3.4.1
- Specific Spark versions can be obtained from here.
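As a sanity check before running the SQL, you can confirm which versions your cluster is actually running. The Scala lookup below goes through PySpark's internal `_jvm` gateway; it works in practice but is not a public API:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark version, e.g. "3.4.1"
print(spark.version)

# Scala version via the JVM gateway (not a public PySpark API),
# e.g. "version 2.12.15"
print(spark.sparkContext._jvm.scala.util.Properties.versionString())
```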
ファイル概要¶
The file structure is outlined below:
```
.
├── Using Feature Discovery SQL in other Spark clusters.ipynb
├── apps
│   ├── DataRobotRunSSSQL.py
│   ├── LC_FD_SQL.sql
│   ├── LC_profile.csv
│   ├── LC_train.csv
│   └── LC_transactions.csv
├── data
├── libs
│   ├── spark-udf-assembly-0.1.0.jar
│   └── venv.tar.gz
├── docker-compose.yml
├── Dockerfile
├── start-spark.sh
└── utils.py
```
- `Using Feature Discovery SQL in other Spark clusters.ipynb` is the notebook that provides a framework for running Feature Discovery SQL in a new Spark cluster on Docker.
- `docker-compose.yml`, `Dockerfile`, and `start-spark.sh` are the files Docker uses to build and start the container running Spark.
- `utils.py` includes a helper function to download the datasets and the UDFs JAR.
- The `apps` directory includes:
    - The Spark SQL (a file with a `.sql` extension)
    - The datasets (files with a `.csv` extension)
    - A helper script (a file with a `.py` extension) that parses and executes the SQL
- The `libs` directory includes:
    - A user-defined functions (UDFs) JAR file
    - An environment file (only required if the datasets include Japanese text, which requires a MeCab tokenizer to handle)
- The `data` directory is empty; it is used to store the output results.
- Note that the datasets, the UDFs JAR, and the environment file are not included initially; they must be downloaded, as described in the accelerator. A minimal sketch of how these pieces fit together in a Spark session follows this list.
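To give a concrete sense of how the pieces fit together, here is a minimal sketch of the flow implemented by `DataRobotRunSSSQL.py`, assuming the cluster matches the compatibility notes above. The UDF function and class names, the temp view names, and the naive semicolon split are all assumptions for illustration; the actual script and your generated SQL define the real names and parsing:

```python
from pyspark.sql import SparkSession

# Attach the UDFs JAR when creating the session so the generated SQL can call
# the custom functions it references.
spark = (
    SparkSession.builder
    .appName("feature-discovery-sql")
    .config("spark.jars", "libs/spark-udf-assembly-0.1.0.jar")
    .getOrCreate()
)

# Register a Java UDF from the JAR. The function and class names here are
# hypothetical placeholders; use the names expected by your generated SQL.
spark.udf.registerJavaFunction("dr_tokenize", "com.datarobot.udf.Tokenize")

# Expose each dataset as a temporary view. The generated SQL references tables
# by name, so these view names must match your Feature Discovery setup.
for name in ("LC_train", "LC_profile", "LC_transactions"):
    df = spark.read.csv(f"apps/{name}.csv", header=True, inferSchema=True)
    df.createOrReplaceTempView(name)

# Execute the generated SQL statement by statement. Splitting on ";" is a
# simplification; DataRobotRunSSSQL.py parses the file more carefully.
with open("apps/LC_FD_SQL.sql") as f:
    statements = [s.strip() for s in f.read().split(";") if s.strip()]

result = None
for stmt in statements:
    result = spark.sql(stmt)

# Persist the final result to the data directory.
if result is not None:
    result.write.mode("overwrite").parquet("data/feature_discovery_output")
```

In the Dockerized cluster, a script like this would typically be submitted with `spark-submit` against the master URL defined in `docker-compose.yml`.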