
Deploy to Hadoop tab

Warning

Hadoop deployment and scoring, including the Standalone Scoring Engine (SSE), will be unavailable and fully deprecated (end-of-life) starting with release v7.3 (December 13th, 2021 for cloud users). Post-deprecation, Hadoop should not be used to generate predictions.

Availability information

The Deploy to Hadoop tab, which allows in-place scoring on Hadoop, is not available for Managed AI Cloud deployments.

DataRobot can perform distributed scoring of an in-memory model on a dataset stored on HDFS. You execute the action through the Deploy to Hadoop tab, which automates running the datarobot-scoring command on a specified Hadoop host. Using a model built and refined within a DataRobot project, the command lets you apply that model to (potentially huge) datasets on your HDFS cluster. There are both simplified and more advanced command entry options.

As you complete fields on the page, the text below Full Command displays the syntax of the datarobot-scoring command and arguments that DataRobot will execute when you click Run. The Run button becomes available once the required basic fields are complete. Alternatively, you can run the datarobot-scoring command from the command line by copying and pasting the syntax from the UI window (see option #2, below).

Obtaining the model for scoring

There are two general scenarios for scoring on Hadoop:

  1. Using the DataRobot GUI, specifically the Deploy to Hadoop tab. This is a good option for ad hoc scoring requests. When using the UI, DataRobot handles the file download and other steps "behind the scenes"; simply complete the fields and click Run, as described below. Note that Deploy to Hadoop is not supported for OSS (open source) models.

  2. Using the datarobot-scoring command from the command line (advanced). Use this option when you want to integrate scoring into a workflow manager (Oozie, for example). The command line script requires that you export the model file. See the section on command line scoring for syntax, examples, and instructions on using the .drx file.

Completing the basic Score in Place fields

The Score in Place screen provides a simple mechanism for using an existing DataRobot model on data in your Hadoop cluster. The following list describes the basic fields for scoring on Hadoop; use the advanced options for additional functionality.

    • Model file path: The path and name to which DataRobot saves the current, trained model used for scoring. DataRobot completes this field for you with a random hex string, based on the model from which you opened the Deploy to Hadoop tab. You can change it to any string; if the file name already exists, it is overwritten. Note that if you run the datarobot-scoring command from the command line, you must use the fully qualified path, including the schema (hdfs://) and absolute path (e.g., /tmp).
    • Input path: The input file/directory, Hive table, or socket (tcp://<host>:<port>) containing the data the command uses for scoring.
    • Output path: A new output directory (not file) name on the Hadoop cluster where the model scores are written. Make sure that the output directory does not already exist. Do not include the hdfs:// prefix, as it causes an error.
    • Advanced options: Selecting this option provides an input box for more advanced command options.
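
For illustration, a hypothetical set of values for the basic fields might look like the following (the paths and the hex string are placeholders; DataRobot generates the real values):

Model file path: /tmp/5f3a9c2e.drx (use the fully qualified form, hdfs:///tmp/5f3a9c2e.drx, when running from the command line)
Input path: /data/scoring/input.csv
Output path: /data/scoring/output (must not already exist)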

Note that as you complete fields, the Full Command syntax updates. This syntax reflects the values you entered, as well as the DataRobot defaults, for select Spark configuration parameters.

Warning

Apache Spark versions below 2.0 support UTF-8 encoded text only. If you see the error message PredictionServiceUserError: Malformed CSV, please check schema and encoding, check whether the file uses an encoding other than UTF-8.
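
To verify or repair the encoding before scoring, standard Unix tools work on a local copy of the file. A minimal sketch, assuming a Latin-1 source file:

file -i input.csv
iconv -f ISO-8859-1 -t UTF-8 input.csv > input_utf8.csv

The first command reports the detected encoding (use file -I on macOS); the second converts the file to UTF-8.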

Completing the advanced Score in Place fields

In addition to the required fields described above, DataRobot supports parameters that provide advanced features. To enter these parameters:

  1. Check the Advanced options box on the Score in Place page.
  2. Put your cursor in the resulting box, either before or after the listed defaults, and enter any of the following parameters. Note these syntax conventions:

    • [ ] indicates an optional argument
    • < > indicates a user supplied value

    • --format=<format>: The format of the input file (csv, json, or parquet). The default is csv.
    • --header=<columns>: A comma-separated string supplying the column names for csv input format. Use this field if the headers in your file are problematic or missing.
    • --add-columns=<columns>: A comma-separated string of column names to include in the output.
    • --database=<database>: The name and path of a Hive database used to score a Hive table.
    • --batch_size=<n>: The number of data points DataRobot scores at one time. The default is 1000.
    • --skip-header: An instruction to skip the header of the input file(s).
    • --combine-output=<True|False>: If True, saves the output in one file per partition. The default is False.
    • --index=<True|False>: If True, adds a row_id column to the prediction output (as the first column), where the value is the row number (starting from 0). The default is False.
    • --skip-bad-rows=<True|False>: If True, skips bad rows. Rows with errors either produce NaN prediction values or are omitted entirely (if a row cannot be parsed). The default is False.
    • --output-format=<format>: The format of the serialized output record, either csv or json. If not set, the output format matches the input format. Note that parquet input produces json output.
    • --spark-conf=<property=value>: A Spark configuration parameter to use when running on YARN. You can enter as many --spark-conf parameters as you require. See below for more detail.
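
For example, you might enter a combination such as the following in the Advanced options box (the column names here are placeholders):

--format=csv --skip-header --add-columns=customer_id,region --index=True --spark-conf="spark.yarn.queue=datarobot"

This scores a csv input while skipping its header row, carries two input columns through to the output, prepends a row_id column, and submits the job to the datarobot YARN queue.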

Your entries and the default Spark parameters are reflected in the Full Command syntax.

Passing configuration parameters to Spark

By default, DataRobot supplies a set of Spark configuration values, which appear in the Advanced options box. You can change these values or add others by typing directly in the box, as described above. To pass one or more additional configuration parameters to the Spark context, use the --spark-conf parameter. The following example snippet shows the syntax of some advanced options:

--batch_size=1000 --spark-conf="spark.yarn.am.waitTime=150s" --spark-conf="spark.yarn.queue=datarobot"

See the "Spark Properties" table for a complete list of other supported properties and their meanings.

Monitoring the scoring process

There are cases where you may train your models on a part of your dataset but then want to make predictions on a much larger dataset on your Hadoop cluster. Because this may take some time to complete, you can use the Spark dashboard to monitor job status.

Click the View Spark Dashboard link to open the dashboard. While the scoring job runs, the dashboard monitors and reports job progress.

When the job is complete, clicking the link takes you to the Hadoop UI, which provides a variety of job statistics.
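
If you prefer the command line to the dashboard, the YARN CLI on the cluster offers similar visibility; for example:

yarn application -list -appStates RUNNING
yarn application -status <application_id>

The first command lists running applications; the second reports statistics for a single job.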

