Apache Spark API for Scoring Code¶
The Spark API for Scoring Code library integrates DataRobot Scoring Code JARs into Spark clusters. It is available as a PySpark API and a Spark Scala API.
In previous versions, the Spark API for Scoring Code consisted of multiple libraries, each supporting a specific Spark version. Now, one library supports all supported Spark versions. The following Spark versions support this feature:
- Spark 2.4.1 or greater
- Spark 3.x
Important
Spark must be compiled for Scala 2.12.
For a list of the deprecated, Spark version-specific libraries, see the Deprecated Spark libraries section.
PySpark API¶
The PySpark API for Scoring Code is included in the datarobot-predict
Python package, released on PyPI. The PyPI project description contains documentation and usage examples.
Spark Scala API¶
The Spark Scala API for Scoring Code is published on Maven as scoring-code-spark-api
. For more information, see the API reference documentation.
Before using the Spark API, you must add it to the Spark classpath. For spark-shell
, use the --packages
parameter to load the dependencies directly from Maven:
spark-shell --conf "spark.driver.memory=2g" \
--packages com.datarobot:scoring-code-spark-api:VERSION \
--jars model.jar
Score a CSV file¶
The following example illustrates how you can load a CSV file into a Spark DataFrame and score it:
import com.datarobot.prediction.sparkapi.Predictors
val inputDf = spark.read.option("header", true).csv("input_data.csv")
val model = Predictors.getPredictor()
val output = model.transform(inputDf)
output.show()
Load models at runtime¶
The following examples illustrate how you can load a model's JAR file at runtime instead of using the spark-shell --jars
parameter:
Define the PROJECT_ID
, the MODEL_ID
, and your API_TOKEN
.
val model = Predictors.getPredictorFromServer(
"https://app.datarobot.com/projects/PROJECT_ID/models/MODEL_ID/blueprint","API_TOKEN")
Define the path to the model JAR file and the MODEL_ID
.
val model = Predictors.getPredictorFromHdfs("path/to/model.jar", spark, "MODEL_ID")
Time series scoring¶
The following examples illustrate how you can perform time series scoring with the transform
method, just as you would with non-time series scoring. In addition, you can customize the time series parameters with the TimeSeriesOptions
builder.
If you don't provide additional arguments for a time series model through the TimeSeriesOptions
builder, the transform
method returns forecast point predictions for an auto-detected forecast point:
val model = Predictors.getPredictor()
val forecastPointPredictions = model.transform(timeSeriesDf)
To define a forecast point, you can use the buildSingleForecastPointRequest()
builder method:
import com.datarobot.prediction.TimeSeriesOptions
val tsOptions = new TimeSeriesOptions.Builder().buildSingleForecastPointRequest("2010-12-05")
val model = Predictors.getPredictor(modelId, tsOptions)
val output = model.transform(inputDf)
To return historical predictions, you can define a start date and end date through the buildForecastPointRequest()
builder method:
val tsOptions = new TimeSeriesOptions.Builder().buildForecastDateRangeRequest("2010-12-05", "2011-01-02")
For a complete reference, see TimeSeriesOptions javadoc.
Deprecated Spark libraries¶
Support for Spark versions earlier than 2.4.1 or Spark compiled for Scala earlier than 2.12 is deprecated. If necessary, you can access deprecated libraries published on Maven Central; however, they will not receive any further updates.
The following libraries are deprecated:
Name | Spark version | Scala version |
---|---|---|
scoring-code-spark-api_1.6.0 | 1.6.0 | 2.10 |
scoring-code-spark-api_2.4.3 | 2.4.3 | 2.11 |
scoring-code-spark-api_3.0.0 | 3.0.0 | 2.12 |