NextGen experience > Registry > Model workshop > Configure evaluation and moderation

Configure evaluation and moderation¶

Premium

Evaluation and moderation guardrails are a premium feature. Contact your DataRobot representative or administrator for information on enabling this feature.

Feature flag: Enable Moderation Guardrails (Premium), Enable Global Models in the Model Registry (Premium), Enable Additional Custom Model Output in Prediction Responses

Evaluation and moderation guardrails help your organization block prompt injection and hateful, toxic, or inappropriate prompts and responses. It can also prevent hallucinations or low-confidence responses and, more generally, keep the model on topic. In addition, these guardrails can safeguard against the sharing of personally identifiable information (PII). Many evaluation and moderation guardrails connect a deployed text generation model (LLM) to a deployed guard model. These guard models make predictions on LLM prompts and responses, and then report these predictions and statistics to the central LLM deployment. To use evaluation and moderation guardrails, first, create and deploy guard models to make predictions on an LLM's prompts or responses; for example, a guard model could identify prompt injection or toxic responses. Then, when you create a custom model with the Text Generation target type, define one or more evaluation and moderation guardrails.

Important prerequisites

Before configuring evaluation and moderation guardrails for an LLM, follow these guidelines while deploying guard models and configuring your LLM deployment:

If using a custom guard model, before deployment, define moderations.input_column_name and moderations.output_column_name as tag-type key values on the registered model version. If you don't set these key values, any users of the guard model will have to enter the input and output column names manually.
Deploy the global or custom guard models you intend to use to monitor the central LLM before configuring evaluation and moderation.
Deploy the central LLM on a different prediction environment than the deployed guard models.
Set an association ID and enable prediction storage before you start making predictions through the deployed LLM. If you don't set an association ID and provide association IDs alongside the LLM's predictions, the metrics for the moderations won't be calculated on the Custom metrics tab.
After you define the association ID, you can enable automatic association ID generation to ensure these metrics appear on the Custom metrics tab. You can enable this setting during or after deployment.

Prediction method considerations

When making predictions outside a chat generation Q&A application, evaluations and moderations are only compatible with real-time predictions, not batch predictions. In addition, when requesting streaming responses using the Bolt-on Governance API, evaluation and moderation negates the effect of streaming. Guardrails evaluate only the complete response of the LLM and therefore return the response text in one chunk.

To select and configure evaluation and moderation guardrails:

In the Model workshop, open the Assemble tab of a custom model with the Text Generation target type and assemble a model, either manually from a custom model you created outside of DataRobot or automatically from a model built in a Use Case's LLM playground:

When you assemble a text generation model with moderations, ensure you configure any required runtime parameters (for example, credentials) or resource settings (for example, public network access). Finally, set the Base environment to a moderation-compatible environment; for example, [GenAI] Python 3.11 with Moderations:

Resource settings

DataRobot recommends creating the LLM custom model using larger resource bundles with more memory and CPU resources.
After you've configured the custom model's required settings, navigate to the Evaluation and Moderation section and click Configure:

In the Configuration summary, do either of the following:

Click View lineage to review how evaluations are executed in DataRobot. All evaluations and their respective moderations run in parallel.

Click General configuration to set the following:

Setting	Description
Timeout for moderation	Configure the maximum wait time (in seconds) for moderations before the system automatically times out.
Timeout action	Define what happens if the moderation system times out: Score prompt / response or Block prompt / response.

In the Configure evaluation and moderation panel, click one of the following metric cards to configure the required properties:

Evaluation metric	Requires	Description
Custom Deployment	Custom deployment	Use any deployment to evaluate and moderate your LLM (supported target types: regression, binary classification, multiclass, text generation).
Emotions Classifier	Emotions Classifier deployment	Classify prompt or response text by emotion.
Faithfulness	Playground LLM, vector database	Measure if the LLM response matches the source to identify possible hallucinations.
PII Detection	Presidio PII Detection	Detect Personally Identifiable Information (PII) in text using the Microsoft Presidio library.
Prompt Injection	Prompt Injection Classifier	Detect input manipulations, such as overwriting or altering system prompts, intended to modify the model's output.
Prompt tokens	N/A	Tracks the number of tokens associated with the input to the LLM.
Response tokens	N/A	Tracks the number of tokens associated with the output from the LLM.
Rouge 1	Vector database	Calculate the similarity between the response generated from an LLM blueprint and the documents retrieved from the vector database.
Stay on topic for inputs	NVIDIA NeMo guardrails configuration	Use NVIDIA NeMo Guardrails to provide topic boundaries, ensuring prompts are topic-relevant and do not use blocked terms.
Stay on topic for output	NVIDIA NeMo guardrails configuration	Use NVIDIA NeMo Guardrails to provide topic boundaries, ensuring responses are topic-relevant and do not use blocked terms.
Token Count	N/A	Track the number of tokens associated with the input to the LLM, output from the LLM, and/or retrieved text from the vector database.
Toxicity	Toxicity Classifier	Classify content toxicity to apply moderation techniques, safeguarding against dissemination of harmful content.

The deployments required for PII detection, prompt injection detection, emotion classification, and toxicity classification are available as global models in the registry.

Multiclass custom deployment metric limits

Multiclass custom deployment metrics can have:

Up to 10 classes defined in the Matches list for moderation criteria.
Up to 100 class names in the guard model.

Depending on the metric selected above, configure the following fields:

Field	Description
General settings
Name	Enter a unique name if adding multiple instances of the evaluation metric.
Apply to	Select one or both of Prompt and Response, depending on the evaluation metric. Note that when you select Prompt, it's the user prompt, not the final LLM prompt, that is used for metric calculation.
Custom Deployment, PII Detection, Prompt Injection, Emotions Classifier, and Toxicity settings
Deployment name	For evaluation metrics calculated by a guard model, select the custom model deployment.
Custom Deployment settings
Input column name	This name is defined by the custom model creator. For global models created by DataRobot, the default input column name is `text`. If the guard model for the custom deployment has the `moderations.input_column_name` key value defined, this field is populated automatically.
Output column name	This name is defined by the custom model creator, and needs to refer to the target column for the model. The target name is listed on the deployment's Overview tab (and often has `_PREDICTION` appended to it). You can confirm the column names by exporting and viewing the CSV data from the custom deployment. If the guard model for the custom deployment has the `moderations.output_column_name` key value defined, this field is populated automatically.
Faithfulness settings
LLM	Select a Playground LLM for evaluation.
Stay on topic for input/ouput settings
LLM Type	Select Azure OpenAI or OpenAI, and then, set the following: For the Azure OpenAI LLM type, enter an OpenAI API base URL, OpenAI Credentials, and OpenAI API Deployment. For the OpenAI LLM type, select Credentials. Credentials are defined on the Credentials management page.
Files	For the Stay on topic evaluations, next to a file, click to modify the NeMo guardrails configuration files. In particular, update `prompts.yml` with allowed and blocked topics and `blocked_terms.txt` with the blocked terms, providing rules for NeMo guardrails to enforce. The `blocked_terms.txt` file is shared between the input and output stay on topic metrics; therefore, modifying `blocked_terms.txt` in the input metric modifies it for the output metric and vice versa. Only two NeMo stay on topic metrics can exist in a custom model, one for input and one for output.
Moderation settings
Configure and apply moderation	Enable this setting to expand the Moderation section and define the criteria that determines when moderation logic is applied.

In the Moderation section, with Configure and apply moderation enabled, for each evaluation metric, set the following:

Setting	Description
Moderation criteria	If applicable, set the threshold settings evaluated to trigger moderation logic. For the Emotions Classifier, select Matches or Does not match and define a list of classes (emotions) to trigger moderation logic.
Moderation method	Select Report, Report and block, or Replace (if applicable).
Moderation message	If you select Report and block, you can optionally modify the default message.

After configuring the required fields, click Add to save the evaluation and return to the evaluation selection page. Then, select and configure another metric, or click Save configuration.

The guardrails you selected appear in the Evaluation and moderation section of the Assemble tab.

After you add guardrails to a text generation custom model, you can test, register, and deploy the model to make predictions in production. After making predictions, you can view the evaluation metrics on the Custom metrics tab and prompts, responses, and feedback (if configured) on the Data exploration tab.

Tracing tab

When you add moderations to an LLM deployment, you can't view custom metric data by row on the Data exploration > Tracing tab.

Global models for evaluation metric deployments¶

The deployments required for PII detection, prompt injection detection, emotion classification, and toxicity classification are available as global models in the registry. The following global models are available:

Model	Type	Target	Description
Prompt Injection Classifier	Binary	injection	Classifies text as prompt injection or legitimate. This model requires one column named `text`, containing the text to classify. For more information, see the deberta-v3-base-injection model details.
Toxicity Classifier	Binary	toxicity	Classifies text as toxic or non-toxic. This model requires one column named `text`, containing the text to classify. For more information, see the toxic-comment-model details.
Sentiment Classifier	Binary	sentiment	Classifies text sentiment as positive or negative. This model requires one column named `text`, containing the text to classify. For more information, see the distilbert-base-uncased-finetuned-sst-2-english model details.
Emotions Classifier	Multiclass	target	Classifies text by emotion. This is a multilabel model, meaning that multiple emotions can be applied to the text. This model requires one column named `text`, containing the text to classify. For more information, see the roberta-base-go_emotions-onnx model details.
Refusal Score	Regression	target	Outputs a maximum similarity score, comparing the input to a list of cases where an LLM has refused to answer a query because the prompt is outside the limits of what the model is configured to answer.
Presidio PII Detection	Binary	contains_pii	Detects and replaces Personally Identifiable Information (PII) in text. This model requires one column named `text`, containing the text to be classified. The types of PII to detect can optionally be specified in a column, 'entities', as a comma-separated string. If this column is not specified, all supported entities will be detected. Entity types can be found in the PII entities supported by Presidio documentation. In addition to the detection result, the model returns an `anonymized_text` column, containing an updated version of the input with detected PII replaced with placeholders. For more information, see the Presidio: Data Protection and De-identification SDK documentation.
Zero-shot Classifier	Binary	target	Performs zero-shot classification on text with user-specified labels. This model requires classified text in a column named `text` and class labels as a comma-seperated string in a column named `labels`. It expects the same set of labels for all rows; therefore, the labels provided in the first row are used. For more information, see the deberta-v3-large-zeroshot-v1 model details.
Python Dummy Binary Classification	Binary	target	Always yields 0.75 for the positive class. For more information, see the python3_dummy_binary model template.

View evaluation and moderation guardrails¶

When a text generation model with guardrails is registered and deployed, you can view the configured guardrails on the registered model's Overview tab and the deployment's Overview tab:

RegistryConsole

Evaluation and moderation logs

On the Activity log > Moderation tab of a deployed LLM with evaluation and moderation configured, you can view a history of evaluation and moderation-related events for the deployment to diagnose issues with a deployment's configured evaluations and moderations.

Configure evaluation and moderation¶

Global models for evaluation metric deployments¶

View evaluation and moderation guardrails¶

Was this page helpful?

Great! Let us know what you found helpful.

What can we do to improve the content?