

Apache Spark API for Scoring Code

The Spark API for Scoring Code library integrates DataRobot Scoring Code JARs into Spark clusters. It is available as a PySpark API and a Spark Scala API.

In previous versions, the Spark API for Scoring Code consisted of multiple libraries, each supporting a specific Spark version. Now, a single library supports all of the following Spark versions:

  • Spark 2.4.1 or greater
  • Spark 3.x

Important

Spark must be compiled for Scala 2.12.

For a list of the deprecated, Spark version-specific libraries, see the Deprecated Spark libraries section.

PySpark API

The PySpark API for Scoring Code is included in the datarobot-predict Python package, released on PyPI. The PyPI project description contains documentation and usage examples.

Spark Scala API

The Spark Scala API for Scoring Code is published on Maven as scoring-code-spark-api. For more information, see the API reference documentation.

Before using the Spark API, you must add it to the Spark classpath. For spark-shell, use the --packages parameter to load the dependencies directly from Maven:

spark-shell --conf "spark.driver.memory=2g" \
     --packages com.datarobot:scoring-code-spark-api:VERSION \
     --jars model.jar
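
The same options apply when you submit a batch application with spark-submit. A minimal sketch, where scoring-job.jar stands in for a hypothetical application JAR of your own:

spark-submit --conf "spark.driver.memory=2g" \
     --packages com.datarobot:scoring-code-spark-api:VERSION \
     --jars model.jar \
     scoring-job.jar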

Score a CSV file

The following example illustrates how you can load a CSV file into a Spark DataFrame and score it:

import com.datarobot.prediction.sparkapi.Predictors

// Read the CSV file, treating the first row as a header.
val inputDf = spark.read.option("header", true).csv("input_data.csv")

// Load the model from the Scoring Code JAR on the classpath and score the DataFrame.
val model = Predictors.getPredictor()
val output = model.transform(inputDf)

output.show()
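
To persist the scored output instead of displaying it, you can use the standard Spark DataFrameWriter. A minimal sketch, where scored_output.csv is a hypothetical output path:

// Write the scored DataFrame to CSV, including a header row.
output.write.option("header", true).csv("scored_output.csv")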

Load models at runtime

The following examples illustrate how you can load a model's JAR file at runtime instead of using the spark-shell --jars parameter:

To download the model from DataRobot, define the PROJECT_ID, the MODEL_ID, and your API_TOKEN:

val model = Predictors.getPredictorFromServer(
     "https://app.datarobot.com/projects/PROJECT_ID/models/MODEL_ID/blueprint","API_TOKEN")

To load the model from HDFS, define the path to the model JAR file and the MODEL_ID:

val model = Predictors.getPredictorFromHdfs("path/to/model.jar", spark, "MODEL_ID")
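
A model loaded at runtime is scored the same way as one loaded from the classpath. A minimal sketch, assuming the inputDf DataFrame from the CSV example above:

// Load the Scoring Code JAR from HDFS at runtime, then score as usual.
val model = Predictors.getPredictorFromHdfs("path/to/model.jar", spark, "MODEL_ID")
val output = model.transform(inputDf)
output.show()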

Time series scoring

The following examples illustrate how you can perform time series scoring with the transform method, just as you would with non-time series scoring. In addition, you can customize the time series parameters with the TimeSeriesOptions builder.

If you don't provide additional arguments for a time series model through the TimeSeriesOptions builder, the transform method returns forecast point predictions for an auto-detected forecast point:

val model = Predictors.getPredictor()
val forecastPointPredictions = model.transform(timeSeriesDf)

To define a forecast point, you can use the buildSingleForecastPointRequest() builder method:

import com.datarobot.prediction.TimeSeriesOptions

val tsOptions = new TimeSeriesOptions.Builder().buildSingleForecastPointRequest("2010-12-05")
val model = Predictors.getPredictor(modelId, tsOptions)
val output = model.transform(inputDf)

To return historical predictions, you can define a start date and end date through the buildForecastDateRangeRequest() builder method:

val tsOptions = new TimeSeriesOptions.Builder().buildForecastDateRangeRequest("2010-12-05", "2011-01-02")
val model = Predictors.getPredictor(modelId, tsOptions)

For a complete reference, see the TimeSeriesOptions javadoc.

Deprecated Spark libraries

Support for Spark versions earlier than 2.4.1, or for Spark compiled for Scala versions earlier than 2.12, is deprecated. If necessary, you can still use the deprecated libraries published on Maven Central; however, they will not receive further updates.

The following libraries are deprecated:

Name                           Spark version   Scala version
scoring-code-spark-api_1.6.0   1.6.0           2.10
scoring-code-spark-api_2.4.3   2.4.3           2.11
scoring-code-spark-api_3.0.0   3.0.0           2.12

Updated June 13, 2023