
Using Hadoop Scoring from the command line


Hadoop deployment and scoring, including the Standalone Scoring Engine (SSE), will be unavailable and fully deprecated (end-of-life) starting with release v7.3 (December 13th, 2021 for cloud users). Post-deprecation, Hadoop should not be used to generate predictions.

More advanced users may want to run Hadoop scoring from the command line instead of from within the application. The following sections describe the necessary syntax and provide examples. Parameter descriptions are available for these advanced features.

Command syntax for datarobot-scoring

The complete datarobot-scoring command syntax is as follows:

datarobot-scoring <model_file> --input=<input> --output=<output>
                  [--spark-conf=<spark_conf> ...]
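As a sketch of a full invocation, Spark settings can be passed with repeated --spark-conf flags. In the example below, all paths and the 4g value are illustrative assumptions; spark.executor.memory is a standard Spark property. The command is built as a string and echoed so the sketch can be inspected without a cluster:

```shell
# Illustrative sketch: the model, input, and output paths and the memory
# value are assumptions. Repeat --spark-conf to pass multiple properties.
MODEL=hdfs:///tmp/dtree_calhousing.drx
CMD="datarobot-scoring $MODEL \
  --input=/tmp/cal_housing.csv \
  --output=/tmp/out.house1 \
  --spark-conf=spark.executor.memory=4g"
echo "$CMD"
```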

Note: When running the datarobot-scoring command from the command line on Hortonworks (HDP), you must manually add the DataRobot binary directory to your PATH before using the command syntax copied from the UI. For example:

export PATH=/opt/DataRobot/current/bin/:$PATH

Retrieving the model file

Using the datarobot-scoring command requires that you export a .drx file. This can be done with the Model Export functionality available through the Downloads tab. Once exported, copy the .drx file into HDFS for the cluster containing the data to be scored and where the DataRobot parcel is deployed and activated. Make sure the <model_file> argument specifies the fully qualified path to the .drx file, including the schema (hdfs://) and absolute path (e.g., /tmp).
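The staging step above can be sketched as follows. The file name is an assumption, and the hdfs command is shown as a comment because it requires a live cluster; the runnable portion only forms the fully qualified path:

```shell
# The exported file name below is an assumption. On a real cluster you would
# first copy the exported model into HDFS:
#   hdfs dfs -put dtree_calhousing.drx /tmp/
# The <model_file> argument must include the schema (hdfs://) and the
# absolute path:
MODEL_FILE="hdfs:///tmp/dtree_calhousing.drx"
echo "$MODEL_FILE"
```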

Usage examples

The following is a simple example using the datarobot-scoring command. The model file, which is automatically saved under a hexadecimal string name, has been renamed dtree_calhousing.model for simplicity.

    $ datarobot-scoring /tmp/dtree_calhousing.model --input=/tmp/cal_housing.csv --output=/tmp/out.house1

    $ hdfs dfs -ls /tmp/out.house1
    -rw-r--r--   3 datarobot supergroup     163496 2016-01-25 14:58 /tmp/out.house1

For convenience, the default mode (--combine-output=True) produces a single output file with a header. If you prefer one output file per partition, use --combine-output=False.

    $ datarobot-scoring /tmp/dtree_calhousing.model --input=/tmp/cal_housing.csv --output=/tmp/out.house1 --combine-output=False

    $ hdfs dfs -ls /tmp/out.house1
    Found 3 items
    -rw-r--r--   3 datarobot supergroup          0 2016-01-25 13:09 /tmp/out.house1/_SUCCESS
    -rw-r--r--   3 datarobot supergroup     163496 2016-01-25 13:09 /tmp/out.house1/part-00000
    -rw-r--r--   3 datarobot supergroup     163749 2016-01-25 13:09 /tmp/out.house1/part-00001

You can use the Hadoop getmerge command to merge them back together.
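On a cluster, the merge would be hdfs dfs -getmerge /tmp/out.house1 out.house1.csv. As a local illustration of what getmerge does (concatenate every part file, in sorted order, into one file), the sketch below uses stand-in part files:

```shell
# Local stand-in for the HDFS part files from the example above.
mkdir -p out.house1
printf 'row-from-part-00000\n' > out.house1/part-00000
printf 'row-from-part-00001\n' > out.house1/part-00001
# Local equivalent of: hdfs dfs -getmerge /tmp/out.house1 out.house1.csv
cat out.house1/part-* > out.house1.csv
wc -l < out.house1.csv
```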

Additional examples

In addition to scoring data stored in HDFS, the datarobot-scoring command can also be used on:

  • Multiple input files
  • Hive tables

Scoring multiple input files

This example scores multiple files. All of the files are in the /user/datarobot/lorem directory. They are .csv files without headers, so the command includes the --header option to supply column names.

hadoop_host$  datarobot-scoring --input=/user/datarobot/lorem

Conversely, if your input files do include headers, you can use the --skip-header option to ignore them.
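Putting these options together, a full invocation over the directory might be sketched as below. The model path, the output path, and the --header value format (assumed here to be a comma-separated list of column names) are assumptions, not confirmed by this page; the command is built as a string and echoed so it can be inspected without a cluster:

```shell
# All paths and the --header value format below are assumptions.
MODEL=hdfs:///user/datarobot/lorem.model
CMD="datarobot-scoring $MODEL \
  --input=/user/datarobot/lorem \
  --output=/user/datarobot/lorem.scores \
  --header=col1,col2,col3"
echo "$CMD"
```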

Scoring a Hive table

This example uses a Hive table as input for the scoring script. Here, DataRobot scores a table named lorem in the ipsum database:

hadoop_host$ datarobot-scoring --input=lorem --database=ipsum --output=/user/datarobot/lorem.scores hdfs:///user/datarobot/lorem.model

If the dataset columns contain uppercase letters, specify the headers like this:

hadoop_host$ datarobot-scoring --input=lorem

Updated October 27, 2021