Deploy to Hadoop tab
Hadoop deployment and scoring, including the Standalone Scoring Engine (SSE), will be unavailable and fully deprecated (end-of-life) starting with release v7.3 (December 13th, 2021 for cloud users). After that release, do not use Hadoop to generate predictions.
The Deploy to Hadoop tab, which allows in-place scoring on Hadoop, is not available for Managed AI Cloud deployments.
DataRobot can perform distributed scoring of an in-memory model on a dataset stored on HDFS. You execute the action through the Deploy to Hadoop tab, which automates running the datarobot-scoring command on a specified Hadoop host. Using a model built and refined within a DataRobot project, the command then lets you apply that model to (potentially huge) datasets on your HDFS cluster. There are both simplified and more advanced command entry options.
As you complete fields on the page, the text below Full Command displays the syntax of the datarobot-scoring command and arguments that DataRobot will execute when you click Run. The Run button becomes available when the minimum basic fields are complete. Alternatively, you can run the datarobot-scoring command from the command line by copying and pasting the syntax from the UI window (see option #2, below).
Obtaining the model for scoring
There are two general scenarios for scoring on Hadoop:
- Using the DataRobot GUI, specifically the Deploy to Hadoop tab. This is a good option for ad hoc scoring requests. When using the UI, DataRobot handles the file download and other steps "behind the scenes." You can simply complete the fields and click Run, as described below. Note that Deploy to Hadoop is not supported for OSS (open source) models.
- Using the datarobot-scoring command from the command line (advanced). Use this option when you want to integrate scoring into a workflow manager (Oozie, for example). The command-line script requires that you export the model file. See the section on command line scoring for syntax, examples, and instructions on using the exported model file.
Completing the basic Score in Place fields
The Score in Place screen provides a simple mechanism for using an existing DataRobot model on data in your Hadoop cluster. The following table describes the required fields for scoring on Hadoop. Use the advanced options for additional functionality.
| Field | Description |
|-------|-------------|
| Model file path | The path and name to which DataRobot saves the current, trained model that you want to use for scoring. DataRobot completes this field for you with a random hex string, based on the model from which you opened the Deploy to Hadoop tab. You can change it to any string; if the file name already exists, it is overwritten. Note that if you are using the command line to run the datarobot-scoring command, you must export the model file yourself. |
| Input path | The input file/directory, Hive table, or socket. |
| Output path | A new output directory (not file) name on the Hadoop cluster where the model scores are written. Make sure that the output directory does not already exist. |
| Advanced options | Selecting this option provides an input box for more advanced command options. |
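To make the three basic fields concrete, the sketch below assembles them into a single command string. This is an illustration only: the paths are hypothetical, and the argument layout is assumed rather than taken from the product; the authoritative syntax is whatever the Full Command box in the UI displays for your model.

```shell
# Illustrative sketch only: paths are hypothetical and the argument order is
# an assumption; copy the real syntax from the Full Command box in the UI.
MODEL_FILE="/models/project_42/model_3fa9c2.drx"   # Model file path (hypothetical)
INPUT_PATH="hdfs:///data/loans/to_score.csv"       # Input path (hypothetical)
OUTPUT_DIR="hdfs:///data/loans/scores_out"         # Output path: must not already exist

CMD="datarobot-scoring ${MODEL_FILE} ${INPUT_PATH} ${OUTPUT_DIR}"
echo "${CMD}"
```

The output directory must be new; the job fails if it already exists, so pick a fresh name per run.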
Note that as you complete fields, the Full Command syntax updates. This syntax reflects the values you entered, as well as the DataRobot defaults, for select Spark configuration parameters.
Apache Spark versions below 2.0 support UTF-8 encoded text only. If you see the error message PredictionServiceUserError: Malformed CSV, please check schema and encoding, check the input file for non-UTF-8 encoding.
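One quick way to test a file for invalid bytes before scoring is iconv, which exits non-zero when its input is not valid UTF-8. The sample file below is created only for the demonstration; in practice, point the check at your scoring CSV.

```shell
# Create a small sample file; in practice, use your scoring CSV instead.
printf 'id,amount\n1,42\n' > /tmp/sample.csv

# iconv re-decodes the file as UTF-8; a non-zero exit means invalid bytes.
if iconv -f UTF-8 -t UTF-8 /tmp/sample.csv > /dev/null 2>&1; then
  RESULT="valid UTF-8"
else
  RESULT="contains non-UTF-8 bytes"
fi
echo "${RESULT}"
```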
Completing the advanced Score in Place fields
In addition to the required fields described above, DataRobot supports parameters that provide advanced features. To enter these parameters:
- Check the Advanced options box on the Score in Place page.
- Put your cursor in the resulting box, either before or after the listed defaults, and enter any of the following parameters. Note these syntax conventions:
- [ ] indicates an optional argument
- < > indicates a user-supplied value
| Parameter | Description |
|-----------|-------------|
|  | The format of the input file, either CSV, JSON, or Parquet. The default is CSV. |
| --header= | A comma-separated string supplying the column names for .csv input format. Use this field if the headers in your file are problematic or missing. |
|  | A comma-separated string of column names to include in the output. |
|  | The name and path of a Hive database used to score a Hive table. |
| batch_size= | The number of data points DataRobot scores at one time. The default is 1000 data points. |
| --skip-header | An instruction to skip the header of the input file(s). |
|  | If True, saves the output in one file per partition. The default is False. |
|  | If True, adds a row_id column to the prediction output (as the first column), where the value is the row number (starting from 0). The default is False. |
|  | If True, skips bad rows instead of failing. |
|  | The format of the serialized output record. |
| --spark-conf= | The Spark configuration parameter(s) to use when running on YARN. You can enter as many --spark-conf arguments as needed. |
Your entries, and the default Spark parameters, are reflected in the Full Command syntax:
Passing configuration parameters to Spark
By default, DataRobot supplies the following Spark configuration values:
- spark.executor.memory: 4g
- spark.memory.storageFraction: 0.2 (20%)
- spark.yarn.security.tokens.hive.enabled: false
- spark.executor.instances: 10
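Written out explicitly in the --spark-conf syntax described below, those four defaults would look like the string assembled here. DataRobot supplies these for you; the snippet only illustrates how a key=value pair maps onto a flag.

```shell
# The four default Spark values expressed as explicit --spark-conf flags.
# DataRobot adds these for you; shown here only to illustrate the syntax.
DEFAULT_CONF='--spark-conf="spark.executor.memory=4g"'
DEFAULT_CONF="${DEFAULT_CONF} --spark-conf=\"spark.memory.storageFraction=0.2\""
DEFAULT_CONF="${DEFAULT_CONF} --spark-conf=\"spark.yarn.security.tokens.hive.enabled=false\""
DEFAULT_CONF="${DEFAULT_CONF} --spark-conf=\"spark.executor.instances=10\""
echo "${DEFAULT_CONF}"
```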
You can change these values or add others by typing directly in the Advanced options box, as described above. To pass one or more additional configuration parameters to the Spark context, use the --spark-conf parameter. The following example snippet shows the syntax of some advanced options:

```
batch_size=1000 --spark-conf="spark.yarn.am.waitTime=150s" --spark-conf="spark.yarn.queue=datarobot"
```
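Combined with the basic fields, a full invocation might look like the sketch below. The advanced-options string is taken verbatim from the snippet above, but the model, input, and output arguments are hypothetical and their layout is assumed; the Full Command box in the UI shows the real syntax for your model.

```shell
# Hypothetical end-to-end invocation: the advanced options match the snippet
# above, while the basic-field arguments and their order are assumptions.
ADVANCED='batch_size=1000 --spark-conf="spark.yarn.am.waitTime=150s" --spark-conf="spark.yarn.queue=datarobot"'
CMD="datarobot-scoring /models/model_3fa9c2.drx hdfs:///data/in.csv hdfs:///data/out ${ADVANCED}"
echo "${CMD}"
```

Quoting matters here: each spark key=value pair stays inside its own quoted --spark-conf argument so YARN receives it intact.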
See the "Spark Properties" table for a complete list of other supported properties and their meanings.
Monitoring the scoring process
There are cases where you may train your models on a part of your dataset but then want to make predictions on a much larger dataset on your Hadoop cluster. Because this may take some time to complete, you can use the Spark dashboard to monitor job status.
Click the View Spark Dashboard link to open the dashboard. While the scoring job runs, clicking the link takes you to a Spark dashboard that monitors and reports job progress:
When the job is complete, clicking the link takes you to the Hadoop UI, which provides a variety of job statistics: