Assemble structured custom models¶
DataRobot provides built-in support for a variety of libraries to create models that use conventional target types. If your model is based on one of these libraries, DataRobot expects your model artifact to have a matching file extension:
Python libraries:

Library | File Extension | Example |
---|---|---|
Scikit-learn | *.pkl | sklearn-regressor.pkl |
XGBoost | *.pkl | xgboost-regressor.pkl |
PyTorch | *.pth | torch-regressor.pth |
tf.keras (tensorflow>=2.2.1) | *.h5 | keras-regressor.h5 |
ONNX | *.onnx | onnx-regressor.onnx |
PMML | *.pmml | pmml-regressor.pmml |
R libraries:

Library | File Extension | Example |
---|---|---|
Caret | *.rds | brnn-regressor.rds |
Java libraries:

Library | File Extension | Example |
---|---|---|
datarobot-prediction | *.jar | dr-regressor.jar |
h2o-genmodel | *.java | GBM_model_python_1589382591366_1.java (POJO) |
h2o-genmodel | *.zip | GBM_model_python_1589382591366_1.zip (MOJO) |
h2o-genmodel-ext-xgboost | *.java | XGBoost_2_AutoML_20201015_144158.java |
h2o-genmodel-ext-xgboost | *.zip | XGBoost_2_AutoML_20201015_144158.zip |
h2o-ext-mojo-pipeline | *.mojo | ... |
Note

- DRUM supports models with DataRobot-generated Scoring Code and models that implement either the IClassificationPredictor or IRegressionPredictor interface from the DataRobot-prediction library. The model artifact must have a .jar extension.
- You can define the DRUM_JAVA_XMX environment variable to set the JVM maximum heap memory size (the -Xmx Java parameter): DRUM_JAVA_XMX=512m.
- If you export an H2O model as POJO, you cannot rename the file; however, this limitation doesn't apply to models exported as MOJO, which may be named in any fashion.
- h2o-ext-mojo-pipeline requires an H2O Driverless AI license.
- Support for DAI Mojo Pipeline has not been incorporated into tests for the build of datarobot-drum.
If your model doesn't use one of the libraries listed above, you must create an unstructured custom model.
Compare the characteristics and capabilities of the two types of custom models below:
Model type | Characteristics | Capabilities |
---|---|---|
Structured | | |
Unstructured | | |
Structured custom model requirements¶
If your custom model uses one of the supported libraries, make sure it meets the following requirements:
- Data sent to a model must be usable for predictions without additional pre-processing.
- Regression models must return a single floating point per row of prediction data.
- Binary classification models must return one floating point value <= 1.0 or two floating point values that sum to 1.0 per row of prediction data (see the sketch after this list):
  - Single-value output is assumed to be the positive class probability.
  - For multi-value output, it is assumed that the first value is the negative class probability and the second is the positive class probability.
- There must be a single pkl, pth, or h5 file present.
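For illustration, a minimal Python sketch of the two-value binary output format described above; the class labels "yes" and "no" and the probability values are hypothetical:

import pandas as pd

# Hypothetical positive-class probabilities, one per row of prediction data.
positive_probability = pd.Series([0.2, 0.9, 0.55])

# Two floating point values per row that sum to 1.0.
predictions = pd.DataFrame({
    "yes": positive_probability,       # positive class probability
    "no": 1.0 - positive_probability,  # negative class probability
})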
Data format

When working with structured models, DataRobot supports data as files in csv, sparse, or arrow format. DataRobot doesn't sanitize missing or abnormal column names (for example, names containing parentheses, slashes, or symbols).
Structured custom model hooks¶
To define a custom model using DataRobot's framework, you provide hooks (functions) that define how a model is trained and how it scores new data. DataRobot automatically calls each hook and passes the parameters based on the project and blueprint configuration; however, you have full flexibility to define the logic that runs inside each hook. If necessary, you can include these hooks alongside your model artifacts in your model folder, in a file called custom.py for Python models or custom.R for R models.
Note
Training and inference hooks can be defined in the same file.
The following sections describe each hook, with examples.
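For orientation, here is a minimal custom.py sketch that combines two of the hooks described below; the artifact name model.pkl is a placeholder, and all hooks are optional:

import os
import pickle

import pandas as pd

def load_model(code_dir):
    # "model.pkl" is a hypothetical artifact name; DRUM passes code_dir.
    with open(os.path.join(code_dir, "model.pkl"), "rb") as f:
        return pickle.load(f)

def score(data, model, **kwargs):
    # Regression output: a single numeric column named "Predictions".
    return pd.DataFrame({"Predictions": model.predict(data)})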
Type annotations in hook signatures
The following hook signatures are written with Python 3 type annotations. The Python types match the following R types:
Python type | R type | Description |
---|---|---|
DataFrame | data.frame | A pandas DataFrame or R data.frame. |
None | NULL | Nothing |
str | character | String |
Any | An R object | The deserialized model. |
*args, **kwargs | ... | These are keyword arguments, not types; they serve as placeholders for additional parameters. |
init()¶

The init hook is executed only once, at the beginning of the run, to allow the model to load libraries and additional files for use in other hooks.

init(**kwargs) -> None
init() input¶

Input parameter | Description |
---|---|---|
**kwargs | Additional keyword arguments. code_dir provides a link, passed through the --code_dir parameter, to the folder where the model code is stored. |
init() example¶

The following provides a brief code snippet using init(); see a more complete example here.
Python:

def init(code_dir):
    global g_code_dir
    g_code_dir = code_dir

R:

init <- function(...) {
    library(brnn)
    library(glmnet)
}
init() output¶

The init() hook does not return anything.
load_model()¶

The load_model() hook is executed only once at the beginning of the run to load one or more trained objects from one or more artifacts. It is only required when a trained object is stored in an artifact that uses an unsupported format or when multiple artifacts are used. The load_model() hook is not required when there is a single artifact in one of the supported formats:

- Python: .pkl, .pth, .h5, .joblib
- Java: .mojo
- R: .rds

load_model(code_dir: str) -> Any
load_model() input¶

Input parameter | Description |
---|---|---|
code_dir | A link, passed through the --code_dir parameter, to the directory where the model artifact and additional code are provided. |
load_model() example¶

The following provides a brief code snippet using load_model(); see a more complete example here.

Python:

import os
import joblib

def load_model(code_dir):
    model_path = "model.pkl"
    return joblib.load(os.path.join(code_dir, model_path))

R:

load_model <- function(input_dir) {
    readRDS(file.path(input_dir, "model_name.rds"))
}
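Because load_model() can return any object, it can also bundle several trained objects into one; a minimal sketch, assuming hypothetical artifacts preprocessor.pkl and model.pkl in the code directory:

import os
import joblib

def load_model(code_dir):
    # Return both trained objects as one tuple; the score hook then
    # receives this tuple as its model argument.
    preprocessor = joblib.load(os.path.join(code_dir, "preprocessor.pkl"))
    estimator = joblib.load(os.path.join(code_dir, "model.pkl"))
    return (preprocessor, estimator)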
load_model() output¶

The load_model() hook returns a trained object (of any type).
read_input_data()¶

The read_input_data hook customizes how the model reads data; for example, with a specific encoding or custom missing-value handling.

read_input_data(input_binary_data: bytes) -> Any
read_input_data() input¶

Input parameter | Description |
---|---|---|
input_binary_data | Data passed through the --input parameter in drum score mode, or a payload submitted to the drum server /predict endpoint. |
read_input_data() example¶

Python:

import io
import pandas as pd

def read_input_data(input_binary_data):
    # Assumes prediction_value was initialized in the init hook;
    # the counter is for illustration only.
    global prediction_value
    prediction_value += 1
    return pd.read_csv(io.BytesIO(input_binary_data))

R:

read_input_data <- function(input_binary_data) {
    input_text_data <- stri_conv(input_binary_data, "utf8")
    read.csv(text=gsub("\r", "", input_text_data, fixed=TRUE))
}
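To illustrate the encoding and missing-value handling this hook allows, a hedged sketch; the latin-1 encoding and the "?" missing-value sentinel are assumptions for illustration, not DRUM defaults:

import io
import pandas as pd

def read_input_data(input_binary_data):
    # Decode latin-1 CSV data and treat "?" as a missing value.
    return pd.read_csv(
        io.BytesIO(input_binary_data),
        encoding="latin-1",
        na_values=["?"],
    )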
read_input_data() output¶

The read_input_data() hook must return a pandas DataFrame or R data.frame; otherwise, you must write your own score method.
transform()¶

The transform() hook defines the output of a custom transform and returns transformed data. This hook can be used in both transformer and estimator tasks:

- For transformers, this hook applies transformations to the data provided and passes it to downstream tasks.
- For estimators, this hook applies transformations to the prediction data before making predictions.

transform(data: DataFrame, model: Any) -> DataFrame
transform() input¶

Input parameter | Description |
---|---|---|
data | A pandas DataFrame (Python) or R data.frame containing the data that the custom model should transform. Missing values are indicated with NaN in Python and NA in R, unless otherwise overridden by the read_input_data hook. |
model | A trained object that DataRobot loads from the artifact (typically, a trained transformer) or that is loaded through the load_model hook. |
transform() example¶

The following provides a brief code snippet using transform(); see a more complete example here.

Python:

def transform(data, model):
    data = data.fillna(0)
    return data

R:

transform <- function(data, model) {
    data[is.na(data)] <- 0
    data
}
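When the artifact itself is a fitted transformer, the model argument can do the work; a minimal sketch, assuming a scikit-learn-style transformer (such as a scaler) that preserves the input columns:

import pandas as pd

def transform(data, model):
    # model is assumed to be a fitted scikit-learn-style transformer
    # loaded from the artifact or by the load_model hook.
    return pd.DataFrame(model.transform(data), columns=data.columns)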
transform() output¶

The transform() hook returns a pandas DataFrame or R data.frame with transformed data.
score()¶

The score() hook defines the output of a custom estimator and returns predictions on input data. Do not use this hook for transform models.

score(data: DataFrame, model: Any, **kwargs: Dict[str, Any]) -> DataFrame
score() input¶

Input parameter | Description |
---|---|---|
data | A pandas DataFrame (Python) or R data.frame containing the data the custom model will score. If the transform hook is used, data is the transformed data. |
model | A trained object loaded from the artifact by DataRobot or loaded through the load_model hook. |
**kwargs | Additional keyword arguments. For a binary classification model, they contain the positive and negative class labels under the positive_class_label and negative_class_label keys. |
score() examples¶

The following provides a brief code snippet using score(); see a more complete example here.

Python:

def score(data: pd.DataFrame, model: Any, **kwargs: Dict[str, Any]) -> pd.DataFrame:
    predictions = model.predict(data)
    predictions_df = pd.DataFrame(predictions, columns=[kwargs["positive_class_label"]])
    predictions_df[kwargs["negative_class_label"]] = (
        1 - predictions_df[kwargs["positive_class_label"]]
    )
    return predictions_df

R:

score <- function(data, model, ...) {
    scores <- predict(model, newdata = data, type = "prob")
    names(scores) <- c('0', '1')
    return(scores)
}
score() output¶

The score() hook should return a pandas DataFrame (or R data.frame or tibble) in the following format:

- For regression or anomaly detection projects, the output must have a single numeric column named Predictions.
- For binary or multiclass projects, the output must have one column per class, with class names used as column names. Each cell must contain the probability of the respective class, and each row must sum to 1.0.
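To make the multiclass format concrete, a minimal sketch, assuming a scikit-learn-style estimator whose classes_ attribute holds the class names:

import pandas as pd

def score(data, model, **kwargs):
    # predict_proba returns one probability per class, ordered as in
    # model.classes_; each row sums to 1.0.
    return pd.DataFrame(model.predict_proba(data), columns=model.classes_)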
post_process()¶

The post_process hook formats the prediction data returned by DataRobot or the score hook when it doesn't match the expected output format.

post_process(predictions: DataFrame, model: Any) -> DataFrame
post_process() input¶

Input parameter | Description |
---|---|---|
predictions | A pandas DataFrame (Python) or R data.frame containing the scored data produced by DataRobot or the score hook. |
model | A trained object loaded from the artifact by DataRobot or loaded through the load_model hook. |
post_process() example¶

Python:

def post_process(predictions, model):
    # Illustration only: shift every prediction by 1.
    return predictions + 1

R:

post_process <- function(predictions, model) {
    names(predictions) <- c('0', '1')
    predictions
}
post_process() output¶

The post_process hook returns a pandas DataFrame (or R data.frame or tibble) in the following format:

- For regression or anomaly detection projects, the output must have a single numeric column named Predictions.
- For binary or multiclass projects, the output must have one column per class, with class names used as column names. Each cell must contain the probability of the respective class, and each row must sum to 1.0.