DataRobot provides built-in support for a variety of libraries to create models that use conventional target types. If your model is based on one of these libraries, DataRobot expects your model artifact to have a matching file extension:
| Library | File Extension | Example |
|---------|----------------|---------|
| Scikit-learn | *.pkl | sklearn-regressor.pkl |
| XGBoost | *.pkl | xgboost-regressor.pkl |
| PyTorch | *.pth | torch-regressor.pth |
| tf.keras (tensorflow>=2.2.1) | *.h5 | keras-regressor.h5 |
| ONNX | *.onnx | onnx-regressor.onnx |
| PMML | *.pmml | pmml-regressor.pmml |
| Library | File Extension | Example |
|---------|----------------|---------|
| Caret | *.rds | brnn-regressor.rds |
| Library | File Extension | Example |
|---------|----------------|---------|
| datarobot-prediction | *.jar | dr-regressor.jar |
| h2o-genmodel | *.java | GBM_model_python_1589382591366_1.java (POJO) |
| h2o-genmodel | *.zip | GBM_model_python_1589382591366_1.zip (MOJO) |
| h2o-genmodel-ext-xgboost | *.java | XGBoost_2_AutoML_20201015_144158.java |
| h2o-genmodel-ext-xgboost | *.zip | XGBoost_2_AutoML_20201015_144158.zip |
| h2o-ext-mojo-pipeline | *.mojo | ... |
Note

- DRUM supports models with DataRobot-generated Scoring Code and models that implement either the IClassificationPredictor or IRegressionPredictor interface from the datarobot-prediction library. The model artifact must have a .jar extension.
- You can define the DRUM_JAVA_XMX environment variable to set the JVM maximum heap memory size (the -Xmx Java parameter): DRUM_JAVA_XMX=512m.
- If you export an H2O model as a POJO, you cannot rename the file; this limitation doesn't apply to models exported as MOJO, which may be named in any fashion.
- h2o-ext-mojo-pipeline requires an H2O Driverless AI license.
- Support for DAI MOJO pipelines is not covered by the tests for the datarobot-drum build.
If your model doesn't use one of these supported libraries, you must create an unstructured custom model.
Compare the characteristics and capabilities of the two types of custom models below:

| Model type | Characteristics | Capabilities |
|------------|-----------------|--------------|
| Structured | Uses a target type known to DataRobot (e.g., regression, binary classification, multiclass, and anomaly detection). | |
If your custom model uses one of the supported libraries, make sure it meets the following requirements:

- Data sent to a model must be usable for predictions without additional pre-processing.
- Regression models must return a single floating-point value per row of prediction data.
- Binary classification models must return either one floating-point value <= 1.0 per row, or two floating-point values per row that sum to 1.0:
  - Single-value output is assumed to be the positive class probability.
  - For multi-value output, it is assumed that the first value is the negative class probability and the second is the positive class probability.
- There must be a single pkl/pth/h5 file present.
Data format
When working with structured models, DataRobot supports data as files in CSV, sparse, or Arrow format. DataRobot doesn't sanitize missing or abnormal column names (e.g., names containing parentheses, slashes, or other symbols).
To define a custom model using DataRobot's framework, your model code should contain hooks (or functions) that define how a model is trained and how it scores new data. DataRobot automatically calls each hook and passes the parameters based on the project and blueprint configuration. However, you have full flexibility to define the logic that runs inside each hook. If necessary, you can include these hooks alongside your model artifacts in your model folder, in a file called custom.py for Python models or custom.R for R models.
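As a sketch, a minimal custom.py might lay out these hooks as follows. The hook names match this guide; the artifact name model.pkl and the hook bodies are purely illustrative placeholders, not DataRobot requirements:

```python
# custom.py -- illustrative skeleton; replace the bodies with your own logic.
import os
import pickle

import pandas as pd


def load_model(code_dir):
    """Optional: load a trained object from an artifact in code_dir."""
    with open(os.path.join(code_dir, "model.pkl"), "rb") as f:  # hypothetical file name
        return pickle.load(f)


def transform(data: pd.DataFrame, model) -> pd.DataFrame:
    """Optional: pre-process prediction data before scoring."""
    return data.fillna(0)


def score(data: pd.DataFrame, model, **kwargs) -> pd.DataFrame:
    """Return a DataFrame of predictions in the format DataRobot expects."""
    return pd.DataFrame({"Predictions": model.predict(data)})
```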
Note
Training and inference hooks can be defined in the same file.
The following sections describe each hook, with examples.
Type annotations in hook signatures
The following hook signatures are written with Python 3 type annotations. The Python types correspond to the following R types:

| Python type | R type | Description |
|-------------|--------|-------------|
| DataFrame | data.frame | A pandas DataFrame (Python) or R data.frame. |
| None | NULL | Nothing |
| str | character | String |
| Any | An R object | The deserialized model. |
| *args, **kwargs | ... | Not types; placeholders for additional positional and keyword arguments. |
The load_model() hook is executed only once, at the beginning of the run, to load one or more trained objects. It is required only when a trained object is stored in an artifact that uses an unsupported format or when multiple artifacts are used; it is not required when there is a single artifact in one of the supported formats.
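For instance, a load_model() hook for an artifact DRUM can't deserialize on its own might look like this. The file name model_object.bin is a hypothetical example, not a DataRobot convention:

```python
import os
import pickle


def load_model(code_dir: str):
    """Load a trained object stored in a custom format.

    DRUM passes code_dir, the directory containing the model artifacts.
    The artifact name "model_object.bin" is illustrative.
    """
    with open(os.path.join(code_dir, "model_object.bin"), "rb") as f:
        return pickle.load(f)
```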
The transform() hook defines the output of a custom transform and returns transformed data. This hook can be used in both transformer and estimator tasks:
For transformers, this hook applies transformations to the data provided and passes it to downstream tasks.
For estimators, this hook applies transformations to the prediction data before making predictions.
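A sketch of a transform() hook that imputes missing numeric values; the median-fill strategy here is just an example of transformation logic, not something DataRobot prescribes:

```python
import pandas as pd


def transform(data: pd.DataFrame, model) -> pd.DataFrame:
    """Fill numeric NaNs with each column's median; leave other columns as-is."""
    out = data.copy()
    numeric = out.select_dtypes(include="number").columns
    out[numeric] = out[numeric].fillna(out[numeric].median())
    return out
```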
The transform() hook takes the following parameters:

data
A pandas DataFrame (Python) or R data.frame containing the data that the custom model should transform. Missing values are indicated with NaN in Python and NA in R, unless otherwise overridden by the read_input_data hook.
model
A trained object DataRobot loads from the artifact (typically, a trained transformer) or loaded through the load_model hook.
The score() hook takes the following parameters:

data
A pandas DataFrame (Python) or R data.frame containing the data the custom model will score. If the transform hook is used, data will be the transformed data.
model
A trained object loaded from the artifact by DataRobot or loaded through the load_model hook.
**kwargs
Additional keyword arguments. For a binary classification model, it contains the positive and negative class labels as the following keys:

- positive_class_label
- negative_class_label
The score() hook should return a pandas DataFrame (or R data.frame or tibble) of the following format:
For regression or anomaly detection projects, the output must have a numeric column named Predictions.
For binary or multiclass projects, the output must have one column per class label, with class names used as column names. Each cell must contain the floating-point class probability of the respective class, and the values in each row must sum to 1.0.
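As a sketch, score() hooks satisfying these formats might look like the following. In custom.py the hook is always named score(); distinct names are used here only so the two target types can sit side by side, and the model internals (predict, predict_proba) are illustrative assumptions:

```python
import pandas as pd


def score_regression(data: pd.DataFrame, model, **kwargs) -> pd.DataFrame:
    # Regression: a single numeric "Predictions" column.
    return pd.DataFrame({"Predictions": model.predict(data)})


def score_binary(data: pd.DataFrame, model, **kwargs) -> pd.DataFrame:
    # Binary classification: one probability column per class label,
    # with the values in each row summing to 1.0.
    positive = kwargs["positive_class_label"]
    negative = kwargs["negative_class_label"]
    pos_prob = model.predict_proba(data)  # assumed to return P(positive) per row
    return pd.DataFrame({positive: pos_prob, negative: 1.0 - pos_prob})
```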
Additional output in prediction responses for custom models is off by default. Contact your DataRobot representative or administrator for information on enabling this feature.
Feature flag: Enable Additional Custom Model Output in Prediction Responses
The score() hook can return any number of extra columns, containing data of types string, int, float, bool, or datetime. When additional columns are returned by the score() hook, the prediction response is as follows:
For a tabular response (CSV), the additional columns are returned as part of the response table or dataframe.
For a JSON response, the extraModelOutput key is returned alongside each row. This key is a dictionary containing the values of each additional column in the row.
Examples: Return extra columns
The following score hooks for various target types return extra columns (containing random data for illustrative purposes) alongside the prediction data:
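For example, a regression score() hook returning extra columns alongside the required Predictions column; the extra column names and their contents are purely illustrative:

```python
import numpy as np
import pandas as pd


def score(data: pd.DataFrame, model, **kwargs) -> pd.DataFrame:
    preds = pd.DataFrame({"Predictions": model.predict(data)})
    # Extra columns ride alongside the required "Predictions" column
    # and appear in the CSV table or under extraModelOutput in JSON.
    preds["row_id"] = np.arange(len(preds))
    preds["note"] = "illustrative extra output"
    return preds
```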
The chat() hook allows custom models to implement the Bolt-on Governance API to provide access to chat history and streaming responses. When using the Bolt-on Governance API with a deployed LLM blueprint, see LLM availability for the recommended values of the model parameter. Alternatively, specify a reserved value, model="datarobot-deployed-llm", to let the LLM blueprint select the relevant model ID automatically when calling the LLM provider's services.
In Workbench, when adding a deployed LLM that implements the chat function, the playground uses the Bolt-on Governance API as the preferred communication method. Enter the Chat model ID associated with the LLM blueprint to set the model parameter for requests from the playground to the deployed LLM. Alternatively, enter datarobot-deployed-llm to let the LLM blueprint select the relevant model ID automatically when calling the LLM provider's services.
The chat() hook returns four keys related to citations, accessible to custom models:

| Citation key | Description |
|--------------|-------------|
| content | The contents of the custom model citations field. |
| metadata | The LangChain Document's metadata, containing source (the source of the citation, e.g., a filename in the original dataset) and page (the source page number where the citation was found). |
| link | The metadata key's page and source combined to provide a full citation. |
| vector | The embedding vector of the citation. This key is used by the custom model as part of the LLM context for monitoring on the Data exploration and Custom metrics tabs. |
The post_process hook returns a pandas DataFrame (or R data.frame or tibble) of the following format:
For regression or anomaly detection projects, the output must have a single numeric column named Predictions.
For binary or multiclass projects, the output must have one column per class, with class names used as column names. Each cell must contain the probability of the respective class, and each row must sum to 1.0.
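A sketch of a post_process hook for a regression model; the clamping logic is illustrative, and the (predictions, model) signature is assumed from the hook descriptions above:

```python
import pandas as pd


def post_process(predictions: pd.DataFrame, model) -> pd.DataFrame:
    """Clamp negative regression predictions to zero before returning them."""
    out = predictions.copy()
    out["Predictions"] = out["Predictions"].clip(lower=0.0)
    return out
```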