

Apache Spark API for Scoring Code

The Spark API for Scoring Code library integrates DataRobot Scoring Code JARs into Spark clusters. It is available as a PySpark API and a Spark Scala API.

In previous versions, the Spark API for Scoring Code consisted of multiple libraries, each supporting a specific Spark version. Now, a single library supports all of the following Spark versions:

  • Spark 2.4.1 or greater
  • Spark 3.x

Important

Spark must be compiled for Scala 2.12.

For a list of the deprecated, Spark version-specific libraries, see the Deprecated Spark libraries section.

PySpark API

The PySpark API for Scoring Code is included in the datarobot-predict Python package, released on PyPI. The PyPI project description contains documentation and usage examples.

Spark Scala API

The Spark Scala API for Scoring Code is published on Maven as scoring-code-spark-api. For more information, see the API reference documentation.

Before using the Spark API, you must add it to the Spark classpath. For spark-shell, use the --packages parameter to load the dependencies directly from Maven:

spark-shell --conf "spark.driver.memory=2g" \
     --packages com.datarobot:scoring-code-spark-api:VERSION \
     --jars model.jar
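
The same options apply when you submit a batch application with spark-submit. A minimal sketch, where scoring-job.jar stands in for a hypothetical application JAR of your own:

spark-submit --conf "spark.driver.memory=2g" \
     --packages com.datarobot:scoring-code-spark-api:VERSION \
     --jars model.jar \
     scoring-job.jar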

Score a CSV file

The following example illustrates how you can load a CSV file into a Spark DataFrame and score it:

import com.datarobot.prediction.sparkapi.Predictors

// Read the CSV file, treating the first row as a header.
val inputDf = spark.read.option("header", true).csv("input_data.csv")

// Load the model from the Scoring Code JAR on the classpath and score the DataFrame.
val model = Predictors.getPredictor()
val output = model.transform(inputDf)

output.show()
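
To persist the scored output instead of displaying it, you can use the standard Spark DataFrameWriter. A minimal sketch, where scored_output.csv is a hypothetical output path:

// Write the scored DataFrame to CSV, including a header row.
output.write.option("header", true).csv("scored_output.csv")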

Load models at runtime

The following examples illustrate how you can load a model's JAR file at runtime instead of using the spark-shell --jars parameter:

To download the model from DataRobot, define the PROJECT_ID, the MODEL_ID, and your API_TOKEN:

val model = Predictors.getPredictorFromServer(
     "https://app.datarobot.com/projects/PROJECT_ID/models/MODEL_ID/blueprint","API_TOKEN")

To load the model from HDFS, define the path to the model JAR file and the MODEL_ID:

val model = Predictors.getPredictorFromHdfs("path/to/model.jar", spark, "MODEL_ID")
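
A model loaded at runtime is scored the same way as one loaded from the classpath. A minimal sketch, assuming the inputDf DataFrame from the CSV example above:

// Load the Scoring Code JAR from HDFS at runtime, then score as usual.
val model = Predictors.getPredictorFromHdfs("path/to/model.jar", spark, "MODEL_ID")
val output = model.transform(inputDf)
output.show()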

Time series scoring

The following examples illustrate how you can perform time series scoring with the transform method, just as you would with non-time series scoring. In addition, you can customize the time series parameters with the TimeSeriesOptions builder.

If you don't provide additional arguments for a time series model through the TimeSeriesOptions builder, the transform method returns forecast point predictions for an auto-detected forecast point:

val model = Predictors.getPredictor()
val forecastPointPredictions = model.transform(timeSeriesDf)

To define a forecast point, you can use the buildSingleForecastPointRequest() builder method:

import com.datarobot.prediction.TimeSeriesOptions

val tsOptions = new TimeSeriesOptions.Builder().buildSingleForecastPointRequest("2010-12-05")
val model = Predictors.getPredictor(modelId, tsOptions)
val output = model.transform(inputDf)

To return historical predictions, you can define a start date and end date through the buildForecastDateRangeRequest() builder method:

val tsOptions = new TimeSeriesOptions.Builder().buildForecastDateRangeRequest("2010-12-05", "2011-01-02")
val model = Predictors.getPredictor(modelId, tsOptions)

For a complete reference, see the TimeSeriesOptions javadoc.

Deprecated Spark libraries

Support for Spark versions earlier than 2.4.1, or for Spark compiled for Scala versions earlier than 2.12, is deprecated. If necessary, you can still use the deprecated libraries published on Maven Central; however, they will not receive further updates.

The following libraries are deprecated:

Name                           Spark version   Scala version
scoring-code-spark-api_1.6.0   1.6.0           2.10
scoring-code-spark-api_2.4.3   2.4.3           2.11
scoring-code-spark-api_3.0.0   3.0.0           2.12

Updated June 13, 2023