The ability to create new DataRobot Prime models has been removed from the application. This does not affect existing Prime models or deployments. To export Python code in the future, use the Python code export function in any RuleFit model.
DataRobot Prime builds models for use outside of the DataRobot application, which can provide multiple benefits (see Reasons to use DataRobot Prime). After a Prime model is created, you can export it as a Python module or a Java class and run the exported code.
Using a technique known as "knowledge distillation," a form of regularization, DataRobot trains a smaller (“student”) model using the original Leaderboard (“teacher”) model’s predictions as the target. Once the rule-based Prime model is on the Leaderboard, you can compare its validation score against the teacher and other models in the project.
Deep dive: Knowledge distillation
Training a model based on the outputs of another model is called knowledge distillation. In this case the initial model is the "teacher" model and the DataRobot Prime model is the "student." DataRobot Prime creates a parametric model (a model with a finite number of parameters) that performs comparably to a selected model on the Leaderboard. All metrics are calculated for the model's ability to predict the target rather than the ability to predict the teacher model's output.
Knowledge distillation is an effective regularization technique used to better predict the "truth." The meaning of truth may be different in the training data versus what comes after the model is deployed. In training data, truth is the target column. At prediction time, it is what the value of the target column will eventually be (even though it’s not known yet).
Here's a simple example:
You want to predict the likelihood that a flipped coin will come up heads. You have predictive features like temperature, wind speed, humidity, time of day, day of the week (one-hot-encoded), and many more.
The standard approach would be to predict on the raw data (0s for tails, 1s for heads). Even with data from thousands of coin flips, a model would overfit. For example, if every time the temperature was 67.013 degrees and wind speed was 10mph, the coin came up heads, the model would conclude this will universally remain true in the future.
Now, as a human you know that temperature and wind speed don’t affect a coin flip—the best prediction is “50% likelihood of heads.” So you set the target to reflect that value and train a model on data where everything is set at 50%. Your resulting (student) model always predicts 50% because that is the only target it has seen. This would yield a great model—better than the one that used the raw data.
This example is a contrived scenario that proposes that you know the accompanying features don’t actually help. Its point is to describe knowledge distillation as an approach that moves slightly towards the “human putting in 50%,” but in a way that’s appropriate for more realistic scenarios.
Raw data involves some random chance. A teacher model typically does not make predictions of 0% or 100% probability. Instead, it uncovers some of the underlying structure in how to make good predictions. (The exact amount of structure depends on the model type.) If, for instance, you used a tree-based model in the coin-flipping example, it would group “similar” coin flips together so that the predicted outcomes don't result in 0s and 1s—it would group flips together based on features. Then its prediction is the average outcome in each group.
For example, if you had 15 flips meeting some grouping criteria like:
- Between 1PM and 2PM
- Wind speed under 10mph
- Temperature between 65 and 70 degrees
Of those 15 flips, 12 (80%) were heads and 3 (20%) were tails. Given that, the model would predict 80% likelihood of heads for anything in the future meeting those criteria.
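The grouping arithmetic above can be sketched in a few lines of Python. This is only an illustration of how a tree-style teacher turns a group of similar flips into a probability. The function name and the encoding of outcomes as 1 (heads) and 0 (tails) are assumptions made for the example, not DataRobot code.

```python
def group_prediction(outcomes):
    """Return the share of heads (1s) in a group of similar coin flips."""
    if not outcomes:
        raise ValueError("a group needs at least one flip")
    return sum(outcomes) / len(outcomes)


# 15 flips that met the grouping criteria: 12 heads (1) and 3 tails (0).
flips_in_group = [1] * 12 + [0] * 3
heads_likelihood = group_prediction(flips_in_group)
print(f"{heads_likelihood:.0%}")  # prints 80%
```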
The take-away here is that the small (naive) model can benefit from the teacher model's structure, making it overfit less.
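The teacher/student relationship can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: the synthetic data, the binned-average teacher, and the one-feature straight-line student are simple stand-ins, not the models DataRobot Prime actually builds. What it does show is the flow described in this deep dive: the student is fitted to the teacher's predictions, while the score is computed against the true target.

```python
import random


def fit_teacher(xs, ys, n_bins=10):
    """Teacher: average the observed outcomes within each bin of x."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]


def fit_student(xs, soft_targets):
    """Student: least-squares line fitted to the teacher's outputs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(soft_targets) / n
    cov = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, soft_targets))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_t - slope * mean_x


def brier_score(predictions, ys):
    """Mean squared error against the true target, not the teacher's output."""
    return sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(ys)


rng = random.Random(0)
xs = [rng.random() for _ in range(2000)]
# Hypothetical ground truth: the chance of a 1 rises linearly with x.
ys = [1 if rng.random() < 0.2 + 0.6 * x else 0 for x in xs]

teacher = fit_teacher(xs, ys)
soft_targets = [teacher(x) for x in xs]  # teacher predictions become the target
slope, intercept = fit_student(xs, soft_targets)
student_predictions = [intercept + slope * x for x in xs]

print("student score vs. true target:", brier_score(student_predictions, ys))
```

The student never sees the raw 0s and 1s during fitting. It only sees the teacher's smoothed probabilities, which is the sense in which distillation acts as regularization.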
You can build a DataRobot Prime model for most models on the Leaderboard. There are, however, some situations in which this type of model cannot be built. See the associated considerations for additional information.
Creating a DataRobot Prime model¶
DataRobot Prime makes predictions using the number of features it has determined to be the optimal balance against the project's original metric. To create a DataRobot Prime model:
- Process your dataset using any of the modeling modes.
- Expand the model you want to apply DataRobot Prime to and click the DataRobot Prime tab.
- On the resulting screen, click RUN DATAROBOT PRIME. DataRobot adds the modeling job to the Worker Queue and displays a success message.
- When the job completes, the new DataRobot Prime model is available on the Leaderboard. The description below the model name contains the name and model number of the parent model, as well as the number of rules used in the downloadable code.
- Expand the new DataRobot Prime model and click the DataRobot Prime tab to view a graph (explained in Exploring the DataRobot Prime model) of 10 rule count options plotted against the resulting metric score for each.
Changing the rule count¶
Initially, DataRobot builds a model based on the best rule count choice. You may, however, prefer a different rule count, for example to get a simpler model at a small cost in score. To use a different rule count:
- Determine, from the graph, the number of rules in your chosen selection.
- Select the new rule count by clicking the associated radio button.
- Confirm the new model request by clicking CONTINUE. When you click, DataRobot generates a new DataRobot Prime model, with the new rule count, and adds the entry to the Leaderboard.
Exporting your DataRobot Prime model¶
Once you are satisfied with the performance of your DataRobot Prime model, you can generate and download production code to make predictions.
Downloading production code¶
To download production code:
- Using the Select Language dropdown in the bottom left corner, choose either Python or Java.
  Note: When using the generated source code in Python, you must specify the encoding if you are using a character set other than UTF-8.
- Click Generate and Download Code. If this is the first time you are generating code for the model, DataRobot launches a Prime Validation job to test and verify the integrity of the source code it is generating. You can monitor the job progress in the Worker Queue.
- When testing completes, DataRobot displays a message indicating whether validation passed or failed and provides a button to download the code.
- To download DataRobot Prime model code for production use, click DOWNLOAD GENERATED CODE and browse to a save location. You can now use the code outside of DataRobot to make predictions.
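The encoding note above matters when the scoring data is not UTF-8. This page does not show how the generated Python code accepts an encoding, so the sketch below takes a different, assumed route: it transcodes the input to UTF-8 before the data reaches the exported code. The helper names and the Latin-1 example are hypothetical.

```python
from pathlib import Path


def transcode_bytes(data, source_encoding):
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def transcode_file(src, dst, source_encoding):
    """Write a UTF-8 copy of src to dst; src itself is left unchanged."""
    Path(dst).write_bytes(transcode_bytes(Path(src).read_bytes(), source_encoding))


# "café" stored as Latin-1 is 4 bytes; as UTF-8 it is 5 bytes.
latin1_bytes = "café".encode("latin-1")
utf8_bytes = transcode_bytes(latin1_bytes, "latin-1")
```

Decoding with the wrong source encoding either raises `UnicodeDecodeError` or silently produces wrong characters, so the source encoding has to be known rather than guessed.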
Using debugging information¶
When creating code, DataRobot tries to predict each row and, if an exception or error occurs, records the error in the code output (stderr). Search for these messages to verify the integrity of your production code data or if you encounter problems when trying to run the production code.
For example, where "healthy" production code returns this:
    def predict_dataframe(ds):
        return ds.apply(predict, axis=1)
Code with errors returns something similar to this:
    def predict_dataframe(ds):
        try:
            return ds.apply(predict, axis=1)
        except TypeError as e:
            sys.stderr.write('Error processing column: ' + unicode(e) + '\n')
            os._exit(1)
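Searching for these messages can be automated once the stderr output has been captured as text. The sketch below is a minimal example under that assumption. The `Error processing column: ` prefix is taken from the snippet above, while the function name and the sample stderr content are hypothetical.

```python
ERROR_PREFIX = "Error processing column: "


def find_column_errors(stderr_text):
    """Return the detail part of every 'Error processing column' line."""
    details = []
    for line in stderr_text.splitlines():
        position = line.find(ERROR_PREFIX)
        if position != -1:
            details.append(line[position + len(ERROR_PREFIX):].strip())
    return details


# Hypothetical captured stderr; the second line imitates the message format above.
captured = "loading rows\nError processing column: could not convert string to float\n"
print(find_column_errors(captured))  # prints ['could not convert string to float']
```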
Using a DataRobot Prime model¶
Once you have exported your DataRobot Prime model in a selected language, you can use it for prediction. See the Prime examples section for more information.
This section provides additional details on DataRobot Prime models as well as tips in the event validation fails.
Reasons to use DataRobot Prime¶
DataRobot Prime supports the model transparency goals of DataRobot by providing:
- Generated model and scoring code.
- A coefficients model to verify data integrity.
- Multiple language support.
- DataRobot integration into systems that can’t necessarily communicate with the DataRobot environment (for example, for privacy reasons).
- Proof of performance as evidenced by the Prime model also placing on the Leaderboard.
- Low-latency scoring without the API call overhead. For example, if you run a real-time, low-latency scoring platform (GLMs with custom code, or rule-based systems in a fast language such as C++ or Java), DataRobot's Prime code export lets you score directly on that platform without the API call-time overhead.
Exploring the DataRobot Prime model¶
To view the graph of rule count options plotted against the resulting metric score for each, expand the DataRobot Prime model on the Leaderboard and click the DataRobot Prime tab.
The following table describes the elements of the DataRobot Prime tab page for existing Prime models:
|Element|Description|
|---|---|
||Displays the metric used in the original project build.|
|Rule count options (2)|Lists the 10 rule count options available for the model, with the metric value for each. Click a radio button to begin the build of a new model with a different rule count.|
|Language selection (3)|Provides a mechanism for choosing the language for your downloadable code.|
|Code generation link (4)|Begins the code generation (and, ultimately, download) process for exporting your DataRobot Prime model.|
Why to change the rule count¶
Initially, DataRobot builds a model based on the best rule count choice. You may learn from the graph, however, that there is a better rule count choice and so you can change the rule count to simplify your model. For example, a particular rule count may have fewer rules than the best selection, while only suffering a small score penalty.
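The trade-off described here, fewer rules for a small score penalty, can be expressed as a simple selection rule. The sketch below is only one way to reason about the graph: the function, the tolerance, and the (rule count, score) pairs are hypothetical, and the real values come from the DataRobot Prime tab.

```python
def simplest_within_tolerance(options, tolerance, lower_is_better=True):
    """Pick the smallest rule count whose score is within `tolerance` of the best.

    options: list of (rule_count, score) pairs read off the graph.
    """
    scores = [score for _, score in options]
    best = min(scores) if lower_is_better else max(scores)
    acceptable = [
        (count, score)
        for count, score in options
        if abs(score - best) <= tolerance
    ]
    return min(acceptable, key=lambda pair: pair[0])


# Hypothetical (rule count, LogLoss) pairs; lower LogLoss is better.
options = [(5, 0.412), (12, 0.398), (25, 0.391), (60, 0.389), (140, 0.388)]
choice = simplest_within_tolerance(options, tolerance=0.005)
print(choice)  # prints (25, 0.391)
```

With these made-up numbers, 25 rules give up 0.003 in score compared with 140 rules, which may be an acceptable price for a much shorter model.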
When you change the rule count, DataRobot builds a new DataRobot Prime model and adds it to the Leaderboard. Any previous DataRobot Prime models built from the blueprint remain available. Note that you must generate and download code for each model individually.
You may have applied a var type transformation to a feature (a change from the type DataRobot detected and assigned to a type of your own choosing) and then created a feature list using the transformed feature. You can create a DataRobot Prime model from such a feature list. If you execute the generated code on a dataset that does not contain the transformed feature, the DataRobot Prime model returns the same results as the internal prediction results. Because transformations allow you to define a "NaN" value, DataRobot replaces invalid values in the generated code with the value you defined.
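The replacement of invalid values can be pictured with a short sketch. It does not reproduce the code DataRobot generates. It is an assumed, simplified version of a categorical-to-numeric conversion in which anything that cannot be read as a number is replaced by a user-defined value. The function name and the sample column are hypothetical.

```python
import math


def to_numeric(value, nan_replacement):
    """Convert value to float, substituting nan_replacement for invalid input."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return nan_replacement
    return nan_replacement if math.isnan(number) else number


raw_column = ["3.5", "12", "n/a", None, ""]
converted = [to_numeric(v, nan_replacement=-1.0) for v in raw_column]
print(converted)  # prints [3.5, 12.0, -1.0, -1.0, -1.0]
```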
DataRobot Prime does not support user-defined, log, square, or power transformations. Specifically, you can use the following var type transformations:
If validation fails¶
Although rare, it is possible that DataRobot returns an error message when it runs validation in response to a request to generate code. There are two reasons for error; DataRobot reports the error type in the message it returns. Note that even with an error message, you can still download code. It is best to email DataRobot Customer Support describing the issue for further assistance. Reasons for failure include:
- Predictions from the generated code were not close enough to the predictions from the DataRobot Prime model. In this case, the generated code can still be run.
- The generated code could not run due to issues such as problem data or an out-of-memory error. If the cause is a problem with the data, the code most likely will not run locally either. If the cause is a memory error, the code may still run on a local machine that has more memory than the workers that tried to validate it.
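For the first failure reason, you can run a similar closeness check yourself on a handful of rows. The sketch below assumes you have predictions from DataRobot and from the downloaded code as two lists in the same row order. The tolerance is an arbitrary illustrative value, not the threshold DataRobot uses in its validation job, and the prediction values are made up.

```python
import math


def predictions_match(expected, actual, abs_tol=1e-6):
    """Return (ok, mismatches) for two equally long prediction lists."""
    if len(expected) != len(actual):
        raise ValueError("prediction lists differ in length")
    mismatches = [
        (row, e, a)
        for row, (e, a) in enumerate(zip(expected, actual))
        if not math.isclose(e, a, rel_tol=0.0, abs_tol=abs_tol)
    ]
    return not mismatches, mismatches


# Hypothetical values: predictions from DataRobot vs. from the downloaded code.
from_datarobot = [0.8012, 0.1150, 0.4977]
from_export = [0.8012, 0.1150, 0.5100]
ok, mismatches = predictions_match(from_datarobot, from_export)
print(ok, [row for row, _, _ in mismatches])  # prints False [2]
```

The row indexes of the mismatches are useful details to include when you contact Customer Support.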
You can re-run the validation if you feel circumstances may return a different result. Also, review the DataRobot Prime considerations. To re-run a validation job:
- Delete the DataRobot Prime model.
- Run the model again (either by rerunning the original model or by generating a new model from the DataRobot Prime tab graph).
- Click Generate and Download Code to run the validation job again.
If validation still fails, click the link in the modal where the failure is indicated. DataRobot opens your email client and populates a message with the DataRobot Customer Support recipient, a subject line, and message content to help Support assist you in debugging the issue. You can add any additional information, if you choose.
The following considerations apply to DataRobot Prime:
DataRobot Prime models cannot be built when:
- Image, Location, Date, or Summarized Categorical features, or derived features, are in the feature list, or when the feature list contains a single-column text list.
- The model is part of a multiclass project.
Date/time partitioning is not available for DataRobot Prime.
DataRobot Prime models are not displayed on the Learning Curve, but do display on Speed vs Accuracy.
DataRobot Prime models must be run on the same feature list, and at the same sample size, as the original model.
You cannot manually launch cross-validation from a DataRobot Prime model.
When using DataRobot Prime, you must run the model with enough data left to include a validation set. In other words, you cannot build or retrain a DataRobot Prime model on 100% of data. Instead, you can set the holdout set to 0% and make the validation set smaller. Be aware, however, that your model results will not be properly compared if the validation set is too small. Generally the validation set should be at least 10%.
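The partitioning guidance above can be turned into a quick sanity check. This is a hypothetical helper, not a DataRobot API. It takes the validation and holdout shares as percentages and uses the 10% figure from the guidance as its default minimum.

```python
def check_prime_partitions(validation_pct, holdout_pct, min_validation_pct=10.0):
    """Return a list of warnings for a validation/holdout split given in percent."""
    problems = []
    training_pct = 100.0 - validation_pct - holdout_pct
    if validation_pct <= 0:
        problems.append("no validation set: Prime cannot train on 100% of the data")
    elif validation_pct < min_validation_pct:
        problems.append("validation set below the suggested minimum")
    if training_pct <= 0:
        problems.append("no data left for training")
    return problems


print(check_prime_partitions(validation_pct=16, holdout_pct=0))  # prints []
print(check_prime_partitions(validation_pct=5, holdout_pct=0))
```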
DataRobot Prime does not employ the same level of date/time format checking as the other prediction mechanisms. As a result, date/time formatting inconsistencies between training data and prediction data may lead to incorrect predictions: the date value is imputed as NaN instead of raising an explicit error, as would happen with other DataRobot prediction mechanisms. To ensure that this does not cause a problem, verify the formats are the same before running predictions.
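One way to verify the formats before running predictions is to parse every date string in the prediction data with the format used in training and list the rows that fail. The sketch below assumes you know the training format as a `strptime` pattern. The function name, the pattern, and the sample values are hypothetical.

```python
from datetime import datetime


def rows_not_matching_format(values, date_format):
    """Return (index, value) for each string that does not parse with date_format."""
    bad = []
    for index, value in enumerate(values):
        try:
            datetime.strptime(value, date_format)
        except (TypeError, ValueError):
            bad.append((index, value))
    return bad


# Hypothetical: training data used ISO dates; one scoring row uses another layout.
training_format = "%Y-%m-%d"
scoring_dates = ["2021-03-01", "03/02/2021", "2021-03-03"]
print(rows_not_matching_format(scoring_dates, training_format))  # prints [(1, '03/02/2021')]
```

An empty result means every value parsed with the training format. Any listed row would otherwise be imputed as NaN without an error.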
DataRobot Prime is disabled when Exposure and/or Offset parameters are set.