# Use LLM evaluation tools

> Use LLM evaluation tools - Configure evaluation and moderation guardrails for LLM blueprints in a
> playground.

This Markdown file sits beside the HTML page at the same path (with a `.md` suffix). It summarizes the topic and lists links for tools and LLM context.

Companion generated at `2026-05-06T18:17:09.559454+00:00` (UTC).

## Primary page

- [Use LLM evaluation tools](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html): Full documentation for this topic (HTML).

## Sections on this page

- [Configure evaluation metrics](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#configure-evaluation-metrics): In-page section heading.
- [Create a new configuration](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#create-a-new-configuration): In-page section heading.
- [Change credentials](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#change-credentials): In-page section heading.
- [Manage configured metrics](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#manage-configured-metrics): In-page section heading.
- [Copy a metric configuration](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#copy-a-metric-configuration): In-page section heading.
- [View metrics in a chat](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#view-metrics-in-a-chat): In-page section heading.
- [Add evaluation datasets](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#add-evaluation-datasets): In-page section heading.
- [Add aggregated metrics](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#add-aggregated-metrics): In-page section heading.
- [Configure compliance testing](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#configure-compliance-testing): In-page section heading.
- [Manage compliance testing from the Evaluation tab](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#manage-compliance-testing-from-the-evaluation-tab): In-page section heading.
- [View and customize DataRobot compliance tests](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#view-and-customize-datarobot-compliance-tests): In-page section heading.
- [Create custom compliance tests](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#manage-custom-compliance-tests): In-page section heading.
- [Run compliance testing from the playground](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#run-compliance-testing-from-the-playground-): In-page section heading.
- [Run existing compliance tests](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#run-existing-compliance-tests): In-page section heading.
- [Create and run custom compliance tests](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#create-and-run-custom-compliance-tests): In-page section heading.
- [Manage compliance test runs](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#manage-compliance-test-runs): In-page section heading.
- [Compare compliance test results](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#compare-compliance-test-results): In-page section heading.
- [View the tracing table](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#view-the-tracing-table): In-page section heading.
- [Send a metric and compliance test configuration to the workshop](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#send-a-metric-and-compliance-test-configuration-to-the-workshop): In-page section heading.

## Related documentation

- [Agentic AI](https://docs.datarobot.com/en/docs/agentic-ai/index.html): Linked from this page.
- [RAG workflows](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/index.html): Linked from this page.
- [required guard models from the NextGen Registry](https://docs.datarobot.com/en/docs/workbench/nxt-registry/nxt-model-directory/nxt-global-models.html): Linked from this page.
- [key value](https://docs.datarobot.com/en/docs/workbench/nxt-registry/nxt-model-directory/nxt-key-values.html#add-key-values-for-moderation-and-evaluation-guard-models): Linked from this page.
- [Overview](https://docs.datarobot.com/en/docs/workbench/nxt-console/nxt-overview/nxt-overview.html): Linked from this page.
- [exporting and viewing the CSV data from the custom deployment](https://docs.datarobot.com/en/docs/api/reference/sdk/data-exploration.html): Linked from this page.
- [available LLMs](https://docs.datarobot.com/en/docs/reference/gen-ai-ref/llm-availability.html): Linked from this page.
- [credentials management](https://docs.datarobot.com/en/docs/platform/acct-settings/stored-creds.html#credentials-management): Linked from this page.
- [vector database](https://docs.datarobot.com/en/docs/agentic-ai/vector-database/vector-dbs.html): Linked from this page.
- [compare LLMs](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/compare-llm.html): Linked from this page.
- [Data exploration > Tracing](https://docs.datarobot.com/en/docs/workbench/nxt-console/nxt-monitoring/nxt-data-exploration.html#explore-deployment-data-tracing): Linked from this page.
- [Data exploration](https://docs.datarobot.com/en/docs/workbench/nxt-console/nxt-settings/nxt-data-exploration-settings.html): Linked from this page.
- [association IDs are provided](https://docs.datarobot.com/en/docs/workbench/nxt-console/nxt-settings/nxt-custom-metrics-settings.html): Linked from this page.
- [Monitoring > Custom metrics](https://docs.datarobot.com/en/docs/workbench/nxt-console/nxt-monitoring/nxt-custom-metrics.html): Linked from this page.
- [Monitoring > Service health](https://docs.datarobot.com/en/docs/workbench/nxt-console/nxt-monitoring/nxt-service-health.html): Linked from this page.
- [generate compliance documentation](https://docs.datarobot.com/en/docs/workbench/nxt-registry/nxt-model-directory/nxt-compliance-doc.html): Linked from this page.
- [runtime parameter](https://docs.datarobot.com/en/docs/workbench/nxt-registry/nxt-model-workshop/nxt-create-custom-model.html#define-runtime-parameters): Linked from this page.
- [configure the custom model in the workshop](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/deploy-llm.html): Linked from this page.

## Documentation content

> [!NOTE] Premium
> LLM evaluation tools are a premium feature. Contact your DataRobot representative or administrator for information on enabling this feature.

The playground's LLM evaluation tools include evaluation metrics and datasets, aggregated metrics, compliance tests, and tracing. The LLM evaluation metric tools include:

| LLM evaluation tool | Description |
| --- | --- |
| Evaluation metrics | Report an array of performance, safety, and operational metrics for prompts and responses in the playground and define moderation criteria and actions for any configured metrics. |
| Evaluation datasets | Upload or generate the evaluation datasets used to evaluate an LLM blueprint through evaluation dataset metrics, aggregated metrics, and compliance tests. |
| Aggregated metrics | Combine evaluation metrics across many prompts and responses to evaluate an LLM blueprint at a high level, as only so much can be learned from evaluating a single prompt or response. |
| Compliance tests | Combine an evaluation metric and dataset to automate the detection of compliance issues with pre-configured or custom compliance testing. |
| Tracing table | Trace the execution of LLM blueprints through a log of all components and prompting activity used in generating LLM responses in the playground. |

## Configure evaluation metrics

With evaluation metrics, you can configure an array of performance, safety, and operational metrics. Configuring these metrics lets you define moderation methods to intervene when prompts and responses meet the moderation criteria you set. This functionality can help detect and block prompt injection and hateful, toxic, or inappropriate prompts and responses. It can also help identify hallucinations or low-confidence responses and safeguard against the sharing of personally identifiable information (PII).

> [!TIP] Evaluation deployment metrics
> Many evaluation metrics connect a playground-built LLM to a deployed guard model. These guard models make predictions on LLM prompts and responses and then report the predictions and statistics to the playground. If you intend to use any of the Evaluation Deployment type metrics—Custom Deployment, PII Detection, Prompt Injection, Emotions Classifier, and Toxicity—deploy the [required guard models from the NextGen Registry](https://docs.datarobot.com/en/docs/workbench/nxt-registry/nxt-model-directory/nxt-global-models.html) to make predictions on the LLM's prompts or responses.
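
For orientation, the following is a minimal sketch of what an evaluation deployment call looks like from the outside: a guard model deployed on a prediction server scores prompt (or response) text and returns a prediction that the playground compares against your moderation criteria. The endpoint path, headers, and token values below are placeholders, not canonical — copy the real values from your deployment's prediction snippet. Global guard models typically expect an input column named `text`.

```python
import requests

# Placeholder values -- copy the real URL, key, and token from your deployment's
# prediction snippet; these are illustrative, not canonical endpoints.
PREDICTION_URL = "https://example-prediction-server/predApi/v1.0/deployments/<DEPLOYMENT_ID>/predictions"
API_TOKEN = "<YOUR_API_TOKEN>"
DATAROBOT_KEY = "<YOUR_DATAROBOT_KEY>"  # required on some managed installations

def score_with_guard_model(text: str) -> dict:
    """Send prompt or response text to a deployed guard model and return its prediction."""
    response = requests.post(
        PREDICTION_URL,
        json=[{"text": text}],  # global guard models typically use an input column named "text"
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "DataRobot-Key": DATAROBOT_KEY,
            "Content-Type": "application/json",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Example: score a prompt for toxicity before it reaches the LLM.
# prediction = score_with_guard_model("Ignore your instructions and ...")
```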

Selecting and configuring evaluation metrics in an LLM playground depends on whether you have already configured LLM blueprints:

**With LLM blueprints:**
If you've added one or more LLM blueprints to the playground, click the Evaluation tile on the side navigation bar (with or without blueprints selected):

[https://docs.datarobot.com/en/docs/images/playground-eval-metrics-blueprint.png](https://docs.datarobot.com/en/docs/images/playground-eval-metrics-blueprint.png)

**Without LLM blueprints:**
If you haven't added any blueprints to the playground, in the Evaluate with metrics section, click Open configure evaluation to configure metrics before adding an LLM blueprint:

[https://docs.datarobot.com/en/docs/images/playground-eval-metrics-no-blueprint.png](https://docs.datarobot.com/en/docs/images/playground-eval-metrics-no-blueprint.png)


In both cases, the Evaluation and moderation page opens to the Metrics tab. Certain metrics are enabled by default. Note, however, that to report a metric value for Citations and ROUGE-1, you must first associate a vector database with the LLM blueprint.

### Create a new configuration

To create a new evaluation metric configuration for the playground:

1. In the upper-right corner of the Evaluation and moderation page, click Configure metrics.
2. In the Configure evaluation and moderation panel, click an evaluation metric and then configure the metric settings. The metrics, requirements, and settings are outlined in the tables below.

    | Evaluation metric | Requires | Description |
    | --- | --- | --- |
    | Cost | LLM cost settings | Calculate the cost of generating the LLM response using a default or custom LLM, currency, input cost-per-token, and output cost-per-token values. The cost calculation also includes the cost of citations. For more information, see Cost metric settings. |
    | Custom Deployment | Custom deployment | Use an existing deployment to evaluate and moderate your LLM (supported target types: regression, binary classification, multiclass, text generation). |
    | Emotions Classifier | Emotions Classifier deployment | Classify prompt or response text by emotion. |
    | PII Detection | Presidio PII Detection deployment | Detect Personally Identifiable Information (PII) in text using the Microsoft Presidio library. |
    | Prompt Injection | Prompt Injection Classifier deployment | Detects input manipulations, such as overwriting or altering system prompts, intended to modify the model's output. |
    | Toxicity | Toxicity Classifier deployment | Classifies content toxicity to apply moderation techniques, safeguarding against dissemination of harmful content. |
    | ROUGE-1 | Vector database | Recall-Oriented Understudy for Gisting Evaluation calculates the similarity between the response generated from an LLM blueprint and the documents retrieved from the vector database. |
    | Citations | Vector database | Reports the documents retrieved by an LLM when prompting a vector database. |
    | All tokens | N/A | Tracks the number of tokens associated with the input to the LLM, output from the LLM, and/or retrieved text from the vector database. |
    | Prompt tokens | N/A | Tracks the number of tokens associated with the input to the LLM. |
    | Response tokens | N/A | Tracks the number of tokens associated with the output from the LLM. |
    | Document tokens | N/A | Tracks the number of tokens associated with the retrieved text from the vector database. |
    | Latency | N/A | Reports the response latency of the LLM blueprint. |
    | Correctness | LLM, evaluation dataset, vector database | Uses either a provided or synthetically generated set of prompts or prompt and response pairs to evaluate aggregated metrics against the provided reference dataset. The Correctness metric uses the LlamaIndex Correctness Evaluator. |
    | Faithfulness | LLM, vector database | Measures if the LLM response matches the source to identify possible hallucinations. The Faithfulness metric uses the LlamaIndex Faithfulness Evaluator. |
    | *Topic control metrics* | | |
    | Stay on topic for inputs | NIM deployment of `llama-3.1-nemoguard-8b-topic-control`, NVIDIA NeMo guardrails configuration | Uses NVIDIA NeMo Guardrails to provide topic boundaries, ensuring prompts are topic-relevant and do not use blocked terms. |
    | Stay on topic for output | NIM deployment of `llama-3.1-nemoguard-8b-topic-control`, NVIDIA NeMo guardrails configuration | Uses NVIDIA NeMo Guardrails to provide topic boundaries, ensuring responses are topic-relevant and do not use blocked terms. |

    **Global models for evaluation metric deployments:** The deployments required for PII detection, prompt injection detection, emotion classification, and toxicity classification are available as global models in the Registry.

    **Multiclass custom deployment metric limits:** Multiclass custom deployment metrics can have up to 10 classes defined in the Matches list for moderation criteria and up to 100 class names in the guard model.

    Depending on the evaluation metric (or evaluation metric type) selected, as well as whether you are using the LLM gateway, different configuration options are required:

    | Setting | Description |
    | --- | --- |
    | **General settings** | |
    | Name | Enter a unique name if adding multiple instances of the evaluation metric. |
    | Apply to | Select one or both of Prompt and Response, depending on the evaluation metric. Note that when you select Prompt, it's the user prompt, not the final LLM prompt, that is used for metric calculation. This field is only configurable for metrics that apply to both the prompt and the response. |
    | **Custom Deployment, PII Detection, Prompt Injection, Emotions Classifier, and Toxicity settings** | |
    | Deployment name | For evaluation metrics calculated by a guard model deployment, select the custom model deployment. |
    | **Custom Deployment settings** | |
    | Input column name | This name is defined by the custom model creator. For global models created by DataRobot, the default input column name is `text`. If the guard model for the custom deployment has the `moderations.input_column_name` key value defined, this field is populated automatically. |
    | Output column name | This name is defined by the custom model creator and needs to refer to the target column for the model. The target name is listed on the deployment's Overview tab (and often has `_PREDICTION` appended to it). You can confirm the column names by exporting and viewing the CSV data from the custom deployment. If the guard model for the custom deployment has the `moderations.output_column_name` key value defined, this field is populated automatically. |
    | **Correctness and Faithfulness settings** | |
    | LLM | Select an LLM for evaluation. |
    | **Topic control metric settings** | |
    | LLM Type | Select Azure OpenAI, OpenAI, or NIM. For the Azure OpenAI LLM type, additionally enter an OpenAI API base URL and OpenAI API Deployment; for NIM, enter a NIM deployment (the `llama-3.1-nemoguard-8b-topic-control` topic-control model). If you use the LLM gateway, the default experience, DataRobot-supplied credentials are provided. You can, however, click Change credentials to provide your own authentication. |
    | Files | For the Stay on topic evaluations, next to a file, click the edit icon to modify the NeMo guardrails configuration files. In particular, update `prompts.yml` with allowed and blocked topics and `blocked_terms.txt` with the blocked terms, providing rules for NeMo guardrails to enforce. The `blocked_terms.txt` file is shared between the input and output topic control metrics; therefore, modifying `blocked_terms.txt` in the input metric modifies it for the output metric and vice versa. Only two topic control metrics can exist in a playground, one for input and one for output. |
    | **Moderation settings** | |
    | Configure and apply moderation | Enable this setting to expand the Moderation section and define the criteria that determine when moderation logic is applied. |

    **Cost metric settings:** For the Cost metric, in the row for each LLM type, define a Currency and the Input and Output cost in `currency amount / tokens amount` format, then click Add. The Cost metric doesn't include the Moderation section to Configure and apply moderation.
3. In the Moderation section, with Configure and apply moderation enabled, set the following for each evaluation metric (the sketch after these steps illustrates how the criteria and method translate into report or block behavior):

    | Setting | Description |
    | --- | --- |
    | Moderation criteria | If applicable, set the threshold settings evaluated to trigger moderation logic. For numeric metrics (int or float), you can use less than, greater than, or equals to with a value of your choice. For binary metrics (for example, Stay on topic for inputs), use equals to 0 or 1. For the Emotions Classifier, select Matches or Does not match and define a list of classes (emotions) to trigger moderation logic. |
    | Moderation method | Select Report or Report and block. |
    | Moderation message | If you select Report and block, you can optionally modify the default message. |
4. After configuring the required fields, click Add to save the evaluation and return to the evaluation selection page. The metrics you selected appear on the Configure evaluation and moderation panel, in the Configuration summary sidebar.
5. Select and configure another metric, or click Save configuration. The metrics appear on the Evaluation and moderation page. If any issues occur during metric configuration, an error message appears below the metric to provide guidance on how to fix the issue.
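
The moderation settings above reduce to a simple rule: compare the metric value against the criteria you set, then either report the result or report it and block the prompt or response, replacing the blocked text with the moderation message. The sketch below illustrates that logic with hypothetical names and values; it is not DataRobot's implementation.

```python
from dataclasses import dataclass

@dataclass
class Moderation:
    """Hypothetical stand-in for one configured metric's moderation settings."""
    metric_name: str
    operator: str      # "greater than", "less than", "equals to", or "matches"
    threshold: object  # numeric threshold, 0/1 for binary metrics, or a list of classes
    method: str        # "Report" or "Report and block"
    message: str = "This response was blocked by moderation."

def criteria_met(value, operator, threshold) -> bool:
    if operator == "greater than":
        return value > threshold
    if operator == "less than":
        return value < threshold
    if operator == "equals to":
        return value == threshold
    if operator == "matches":  # e.g., Emotions Classifier class lists
        return value in threshold
    raise ValueError(f"Unknown operator: {operator}")

def apply_moderations(llm_response: str, scores: dict, moderations: list) -> str:
    """Return the LLM response, or the moderation message(s) if any blocking rule triggers."""
    messages = []
    for m in moderations:
        if criteria_met(scores[m.metric_name], m.operator, m.threshold):
            if m.method == "Report and block":
                messages.append(m.message)
            # "Report" only records the event; the response is still shown.
    return "\n".join(messages) if messages else llm_response

# Example: block a response whose (hypothetical) toxicity score exceeds 0.8.
mods = [Moderation("toxicity", "greater than", 0.8, "Report and block")]
print(apply_moderations("Sure, here's how...", {"toxicity": 0.91}, mods))
```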

### Change credentials

DataRobot provides credentials for [available LLMs](https://docs.datarobot.com/en/docs/reference/gen-ai-ref/llm-availability.html) using the LLM gateway. With Azure OpenAI and OpenAI LLM types, you can, however, use your own credentials for authentication. Before proceeding, define user-specified credentials on the [credentials management](https://docs.datarobot.com/en/docs/platform/acct-settings/stored-creds.html#credentials-management) page.

To change credentials for either Stay on topic for inputs or Stay on topic for output, choose the LLM type and click Change credentials.

**LLM type: Azure OpenAI:**
Provide the Azure OpenAI API deployment and the OpenAI API base URL. Then, from the dropdown, select the set of credentials to apply.

[https://docs.datarobot.com/en/docs/images/change-metric-creds-azure.png](https://docs.datarobot.com/en/docs/images/change-metric-creds-azure.png)

**LLM type: OpenAI:**
From the dropdown, select the set of credentials to apply.

[https://docs.datarobot.com/en/docs/images/change-metric-creds-openai.png](https://docs.datarobot.com/en/docs/images/change-metric-creds-openai.png)

**LLM type: NIM:**
Select the NIM deployment (for example, the topic-control model). Credentials are typically provided via the deployment configuration.


To revert to DataRobot-provided credentials, click Revert credentials.

### Manage configured metrics

To edit or remove a configured evaluation metric from the playground:

1. In the upper-right corner of the Evaluation and moderation page, click Configure metrics.
2. In the Configure evaluation and moderation panel, in the Configuration summary sidebar, click the edit icon or the delete icon.
3. If you click edit, you can reconfigure the settings for that metric and click Update.

### Copy a metric configuration

To copy an evaluation metrics configuration to or from an LLM playground:

1. In the upper-right corner of the Evaluation and moderation page, next to Configure metrics, click the menu icon, and then click Copy configuration.
2. In the Copy evaluation and moderation configuration modal, select one of the following options: From an existing playground, To an existing playground, or To a new playground.
    - If you select From an existing playground, choose Add to existing configuration or Replace existing configuration, and then select a playground to Copy from.
    - If you select To an existing playground, choose Add to existing configuration or Replace existing configuration, and then select a playground to Copy to.
    - If you select To a new playground, enter a New playground name.
3. Select whether you want to Include evaluation datasets, and then click Copy configuration.

> [!NOTE] Duplicate evaluation metrics
> Selecting Add to existing configuration can result in duplicate metrics, except in the case of NeMo Stay on topic for inputs and Stay on topic for output. Only two topic control metrics can exist, one for input and one for output.

### View metrics in a chat

The metrics you configure and add to the playground appear on the LLM responses in the playground. Click the down arrow to open the metric panel for more details. From this panel, click Citation to view the prompt, response, and a list of citations in the Citation dialog box. You can also provide positive or negative feedback for the response.

In addition, if a response from the LLM is blocked by the configured moderation criteria and strategy, you can click Show response to view the blocked response:

> [!TIP] Multiple moderation messages
> If a response from the LLM is blocked by multiple configured moderations, the message for each triggered moderation appears, replacing the LLM response, in the chat. If you configure descriptive moderation messages, this can provide a complete list of reasons for blocking the LLM response.

## Add evaluation datasets

To enable evaluation dataset metrics and aggregated metrics, add one or more evaluation datasets to the playground. The dataset must be a CSV file, in the Data Registry, and have at least one text or categorical column.

> [!WARNING] When using evaluation datasets with an LLM that includes a vector database
> Ensure that no column name exists in both the evaluation dataset and the vector database. If any column name exists in both, those columns are treated as metadata filters, and vector database results are excluded from prompts when you run evaluation dataset aggregation. This situation is most common when the vector database was built from a CSV source document.
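
To check for the column-name collision described in this warning before running aggregation, you can compare the evaluation dataset's columns against the columns of the source the vector database was built from. A minimal sketch with pandas, assuming you have both files locally (the file names are placeholders):

```python
import pandas as pd

# Placeholder file names -- use your own evaluation dataset and, if applicable,
# the CSV source document the vector database was built from.
eval_df = pd.read_csv("evaluation_dataset.csv")
vdb_source_df = pd.read_csv("vector_database_source.csv")

overlap = set(eval_df.columns) & set(vdb_source_df.columns)
if overlap:
    print(f"Rename these columns in one of the files before running aggregation: {sorted(overlap)}")
else:
    print("No overlapping column names; vector database results will not be excluded from prompts.")
```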

1. To add evaluation datasets in an LLM playground, open the Evaluation and moderation page (as described in Configure evaluation metrics).
2. On the Evaluation and moderation page, click the Evaluation datasets tab to view any existing datasets, then click Add evaluation dataset and select one of the following methods:

    | Dataset addition method | Description |
    | --- | --- |
    | Add evaluation dataset | In the Add evaluation dataset panel, select an existing dataset from the Data Registry table, or upload a new dataset: click Upload to register and select a new dataset from your local filesystem, or click Upload from URL, enter the URL for a hosted dataset, and click Add. After you select a dataset, in the Evaluation dataset configuration sidebar, define the Prompt column name and Response (target) column name, and click Add evaluation dataset. |
    | Generate synthetic data | Enter a Dataset name, select an LLM, set Vector database, Vector database version, and the Language to use when creating synthetic data. Then, click Generate data. For more information, see Generate synthetic datasets. |
3. After you add an evaluation dataset, it appears on the Evaluation datasets tab of the Evaluation and moderation page, where you can click Open dataset to view the data. You can also click the Actions menu to Edit evaluation dataset or Delete evaluation dataset.

**Q: How are synthetic datasets generated?**

When you add evaluation dataset metrics, DataRobot can use a [vector database](https://docs.datarobot.com/en/docs/agentic-ai/vector-database/vector-dbs.html) to generate synthetic datasets, composed of prompt and response pairs, to evaluate your LLM blueprint against. Synthetic datasets are generated by accessing the selected vector database, clustering the vectors, pulling a representative chunk from each cluster, and prompting the selected LLM to generate 100 question and answer pairs based on the document(s). When you configure the synthetic evaluation dataset settings and click Generate, two events occur sequentially:

1. A placeholder dataset is registered to the Data Registry with the required columns (question and answer), containing 65 rows and 2 columns of placeholder data (for example, Record for synthetic prompt answer 0, Record for synthetic prompt answer 1, etc.).
2. The selected LLM and vector database pair generates question and answer pairs, which are added to the Data Registry as a second version of the synthetic evaluation dataset created in step 1. The generation time depends on the selected LLM.

To generate high-quality and diverse questions, DataRobot runs cosine similarity-based clustering. Similar chunks are grouped into the same cluster and each cluster generates a single question and answer pair. Therefore, if a vector database includes many similar chunks, they'll be grouped into a much smaller number of clusters. When this happens, the number of pairs generated is much lower than the number of chunks in the vector database.
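
The clustering step can be pictured with the short sketch below: chunk embeddings are grouped by cosine similarity (here via KMeans on length-normalized vectors, which is equivalent to clustering by cosine distance), and one representative chunk is pulled from each cluster to seed a question and answer pair. This is an illustration of the idea, not the generation pipeline itself; the chunk texts, embeddings, and cluster count are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Assumed inputs: one embedding vector per vector-database chunk.
chunk_texts = ["chunk A ...", "chunk B ...", "chunk C ...", "chunk D ..."]
embeddings = np.random.rand(len(chunk_texts), 384)  # stand-in for real embeddings

# Normalizing rows makes Euclidean KMeans behave like cosine-similarity clustering.
normalized = normalize(embeddings)
n_clusters = min(3, len(chunk_texts))  # similar chunks collapse into fewer clusters
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(normalized)

# Pull one representative chunk per cluster; each representative seeds one Q&A pair.
representatives = {}
for label, text in zip(labels, chunk_texts):
    representatives.setdefault(label, text)

for label, text in representatives.items():
    print(f"Cluster {label}: prompt the LLM for a question/answer pair about {text!r}")
```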

## Add aggregated metrics

When a playground includes more than one metric, you can begin creating aggregated metrics. Aggregation is the act of combining metrics across many prompts and/or responses, which helps to evaluate a blueprint at a high level (only so much can be learned from evaluating a single prompt/response). Aggregation provides a more comprehensive approach to evaluation.

Aggregation either averages the raw scores, counts the boolean values, or surfaces the number of categories in a multiclass model. DataRobot does this by generating the metrics for each individual prompt/response and then aggregating using one of the methods listed, based on the metric.
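
As a rough illustration of those three aggregation modes, the sketch below averages numeric scores, counts boolean hits, and tallies categories; the metric names and values are made up.

```python
from collections import Counter
from statistics import mean

# Hypothetical per-prompt metric values collected across a chat.
per_prompt = {
    "Latency": [0.8, 1.1, 0.9],                       # numeric -> averaged
    "Stay on topic for inputs": [True, True, False],  # boolean -> counted
    "Emotions Classifier": ["joy", "anger", "joy"],   # multiclass -> category counts
}

def aggregate(values):
    # Check booleans first, since bool is a subclass of int in Python.
    if all(isinstance(v, bool) for v in values):
        return {"true": sum(values), "false": len(values) - sum(values)}
    if all(isinstance(v, (int, float)) for v in values):
        return mean(values)
    return dict(Counter(values))

for metric, values in per_prompt.items():
    print(metric, "->", aggregate(values))
```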

To configure aggregated metrics:

1. In a playground, click Configure aggregation below the prompt input.

    **Aggregation job run limit:** Only one aggregated metric job can run at a time. If an aggregation job is currently running, the Configure aggregation button is disabled and the "Aggregation job in progress; try again when it completes" tooltip appears.
2. On the Generate aggregated metrics panel, select metrics to calculate in aggregate and configure the Aggregate by settings. Then, enter a new Chat name, select an Evaluation dataset (to generate prompts in the new chat), and select the LLM blueprints for which the metrics should be generated. These fields are pre-populated based on the current playground.

    **Evaluation dataset selection:** If you select an evaluation dataset metric, like Correctness, you must use the evaluation dataset used to create that evaluation dataset metric.

    After you complete the Metrics selection and Configuration sections, click Generate metrics. This results in a new chat containing all associated prompts and responses. Aggregated metrics are run against an evaluation dataset, not individual prompts in a standard chat. Therefore, you can only view aggregated metrics in the generated aggregated metrics chat, added to the LLM blueprint's All Chats list (on the LLM's configuration page).

    **Aggregation metric calculation for multiple blueprints:** If many LLM blueprints are included in the metric aggregation request, aggregated metrics are computed sequentially, blueprint-by-blueprint.
3. Once an aggregated chat is generated, you can explore the resulting aggregated metrics, scores, and related assets on the Aggregated metrics tab. You can filter by Aggregation method, Evaluation dataset, and Metric. In addition, click Current configuration to compare only those metrics calculated for the blueprint configuration currently defined in the LLM tab of the Configuration sidebar.

    **View related assets:** For each metric in the table, you can click Evaluation dataset and Aggregated chat to view the corresponding asset contributing to the aggregated metric.
4. Returning to the LLM Blueprints comparison page, you can now open the Aggregated metrics tab to view a leaderboard comparing LLM blueprint performance for the generated aggregated metrics.

## Configure compliance testing

Combine an evaluation metric and an evaluation dataset to automate the detection of compliance issues through test prompt scenarios.

### Manage compliance testing from the Evaluation tab

When you manage compliance testing on the Evaluation tab, you can view pre-defined compliance tests, create and manage custom tests, or modify pre-defined tests to suit your organization's testing requirements.

To view all available compliance tests:

1. On the side navigation bar, click the Evaluation tile.
2. Click the Compliance tests tab. On the Compliance tests tab, you can view all the compliance tests available, both DataRobot and custom (if present). The table contains columns for the Test name, Provider, and Configuration (number of evaluations and evaluation datasets).

#### View and customize DataRobot compliance tests

Use the View option to review and, optionally:

- Customize DataRobot pre-configured compliance tests, including changing the LLM for certain tests.
- Manage custom compliance tests.

In the table on the Compliance tests tab, click View to open and review any of the compliance tests in which DataRobot is the Provider:

| Compliance test | Description | Assessing LLM | Based on |
| --- | --- | --- | --- |
| Bias Benchmark | Runs LLM question/answer sets that test for bias along eight social dimensions. | GPT-4o | AI Verify Foundation |
| Jailbreak | Applies testing scenarios to evaluate whether built-in safeguards enforce LLM jailbreaking compliance standards. | Customizable | jailbreak_llms |
| Completeness | Determines whether the LLM response is supplying enough information to comprehensively answer questions. | GPT-4o | Internal |
| Personally Identifiable Information (PII) | Determines whether the LLM response contains PII included in the prompt. | Customizable | Internal |
| Toxicity | Applies testing scenarios to evaluate whether built-in safeguards enforce toxicity compliance standards. For more information, see the explicit and offensive content warning. | Customizable | Hugging Face |
| Japanese Bias Benchmark | Runs LLM question/answer sets in Japanese that test for bias along five social dimensions. | GPT-4o | AI Verify Foundation |

> [!WARNING] Explicit and offensive content warning
> The [public evaluation dataset for toxicity testing](https://huggingface.co/datasets/allenai/real-toxicity-prompts) contains explicit and offensive content. It is intended to be used exclusively for the purpose of eliminating such content from external models and applications. Any other use is strictly prohibited.

**Bias tests**

Bias testing is based on the following moonshot-data datasets from AI Verify Foundation:

- bbq-lite-age-disamb.json
- bbq-lite-disability-status-disamb.json
- bbq-lite-gender-disamb.json
- bbq-lite-nationality-disamb.json
- bbq-lite-race-ethnicity-disamb.json
- bbq-lite-religion-disamb.json
- bbq-lite-sexual-orientation-disamb.json
- bbq-lite-ses-disamb.json
- JBBQ: Japanese Bias Benchmark for Analyzing Social Biases in Large Language Models and JBBQ_data

When viewing a compliance test from the list, you can review the individual evaluations run as part of the compliance testing process. For all tests, you can review the Name, Metric, LLM, Evaluation dataset, Pass threshold, and Number of prompts. If the test shows `-` in the LLM field, it uses GPT-4o. The following tests default to GPT-4o as the LLM but can be customized:

- Jailbreak
- Toxicity
- PII

Use a selected DataRobot test as the foundation for a custom test as follows:

1. Select View for the test you want to modify.
2. Click Customize test.
3. From the Create custom test modal, modify any of the individual evaluations for the compliance test settings.

    **Note:** In addition to the default metrics and evaluation datasets, you can select any evaluation metrics implemented by a deployed binary classification sidecar model and any evaluation datasets added to the Use Case.

    | Setting | Description |
    | --- | --- |
    | Name | A descriptive name for the custom compliance test. |
    | Description | A description of the purpose of the compliance test (this is pre-populated when you modify an existing DataRobot test). |
    | Test pass threshold | The minimum percentage (0-100%) of individual evaluations that must pass for the test as a whole to pass. |
    | **Evaluations*** | |
    | Name | The name of the individual metric. |
    | Metric | The criteria to match against. |
    | LLM | The LLM used to assess the response. This field is enabled for Jailbreak, Toxicity, and PII compliance tests. All others use GPT-4o. |
    | Evaluation dataset | The dataset used for calculating metrics. |
    | Pass threshold | The minimum percentage of responses that must pass for the evaluation to pass. |
    | Number of prompts | The number of rows from the dataset used to perform the evaluation. |
    | Add evaluation | Create additional evaluations. |
    | Copy from existing test | Copy the individual evaluations from an existing compliance test. |

    \* Use the API-only process, `expected_response_column`, to validate a sidecar model with metrics you are introducing. It compares the LLM response with an expected response, similar to the pre-provided `exact_match` metric.

    The sketch after these steps shows how the evaluation and test pass thresholds combine.
4. After you customize the compliance test settings, click Add. The new test appears in the table on the Compliance tests tab.
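
The two pass thresholds nest: each evaluation passes when at least its Pass threshold percentage of prompts pass, and the test as a whole passes when at least the Test pass threshold percentage of evaluations pass. A small worked sketch with made-up evaluation names, counts, and thresholds:

```python
# Hypothetical results: (evaluation name, passing prompts, total prompts, evaluation pass threshold %)
evaluations = [
    ("Toxicity check",  92, 100, 90),  # 92% >= 90% -> passes
    ("PII check",       40,  50, 85),  # 80% <  85% -> fails
    ("Jailbreak check", 19,  20, 90),  # 95% >= 90% -> passes
]
test_pass_threshold = 60  # percent of evaluations that must pass for the test to pass

passed = [
    name for name, ok, total, threshold in evaluations
    if 100 * ok / total >= threshold
]
test_passes = 100 * len(passed) / len(evaluations) >= test_pass_threshold
print(f"{len(passed)}/{len(evaluations)} evaluations passed -> test {'passes' if test_passes else 'fails'}")
```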

#### Create custom compliance tests

To create a custom compliance test:

1. At the top or bottom of the Compliance tests tab, click Create custom compliance test.

    **Create compliance tests from anywhere in the Evaluation tab:** When the Evaluation tab is open, you can click Create custom compliance test from anywhere, not just the Compliance tests tab.
2. In the Create custom test panel, configure the following settings:

    | Setting | Description |
    | --- | --- |
    | Name | A descriptive name for the custom compliance test. |
    | Description | A description of the purpose of the compliance test (this is pre-populated when you modify an existing DataRobot test). |
    | Test pass threshold | The minimum percentage (0-100%) of individual evaluations that must pass for the test as a whole to pass. |
    | **Evaluations** | |
    | Name | The name of the individual metric. |
    | Metric | The criteria to match against. |
    | LLM | The LLM used to assess the response. This field is enabled for Jailbreak, Toxicity, and PII compliance tests. All others use GPT-4o. You must set the Metric before setting this field. |
    | Evaluation dataset | The dataset used for calculating metrics. |
    | Pass threshold | The minimum percentage of responses that must pass for the evaluation to pass. |
    | Number of prompts | The number of rows from the dataset used to perform the evaluation. |
    | Add evaluation | Create additional evaluations. |
    | Copy from existing test | Copy the individual evaluations from an existing compliance test. |
3. After you configure the compliance test settings, click Add. The new test appears in the table on the Compliance tests tab.

#### Manage custom compliance tests

To manage custom compliance tests, locate tests with Custom as the Provider, and choose a management action:

- Click the edit icon, then, in the Edit custom test panel, update the compliance test configuration and click Save.
- Click the delete icon, then click Yes, delete test to remove the test from all playgrounds in the Use Case.

### Run compliance testing from the playground

When you perform compliance testing on the Playground tile, you can run the pre-defined compliance tests without modification, create custom tests, or modify the pre-defined tests to suit your organization's testing requirements.

To access compliance tests from the playground to run, modify, or create a test:

1. On the Playground tile, in the LLM blueprints list, click the LLM blueprint you want to test, or select up to three blueprints for comparison.

    **Access compliance tests from the blueprints comparison page:** If you have two or more LLM blueprints selected, you can click the Compliance tests tab from the Blueprints comparison page to run compliance tests for multiple LLM blueprints and compare the results. For more information, see Compare compliance test results.
2. In the LLM blueprint, click the Compliance tests tab to create or run tests. If you have not run tests before, you receive a message saying no compliance test results are available. If you have run a test before, test results are listed. In either case, click Run test to open the test panel.
3. The Run test panel opens to a list of pre-configured DataRobot compliance tests and custom tests you've created.
4. When you select a compliance test from the All tests list, you can view the individual evaluations run as part of the compliance testing process. For each test, you can review the Name, Metric, Evaluation dataset, Pass threshold, and Number of prompts.
5. Next, run an existing test, create and run a custom test, or manage custom tests.

#### Run existing compliance tests

To run an existing, configured compliance test:

1. On the Run test panel, from the All tests list, select an available DataRobot or Custom test.
2. After selecting a test, click Run.
3. The test appears on the Compliance tests tab with a Running... status.

    **Cancel a running test:** If you need to cancel a test with the Running... status, click Delete test results.

#### Create and run custom compliance tests

To create and run a custom or modified compliance test:

1. On the Run test panel, from the All tests list, create a new custom test or select an existing test to modify.
2. On the Custom test panel, configure the following settings:

    | Setting | Description |
    | --- | --- |
    | Name | A descriptive name for the custom compliance test. |
    | Description | A description of the purpose of the compliance test (this is pre-populated when you modify an existing DataRobot test). |
    | Test pass threshold | The minimum percentage (0-100%) of individual evaluations that must pass for the test as a whole to pass. |
    | Evaluations | The individual evaluations for the compliance test, each consisting of a Name, Metric, Evaluation dataset, Pass threshold, and Number of prompts. In addition to the default metrics and evaluation datasets, you can select any evaluation metrics implemented by a deployed binary classification sidecar model and any evaluation datasets added to the Use Case. Click + Add evaluation to create additional evaluations. Click Copy from existing test to copy the individual evaluations from an existing compliance test. |

    There is an API-only process to validate a sidecar model with `expected_response_column` to introduce metrics comparing the LLM response with an expected response, similar to the pre-provided `exact_match` metric.
3. After configuring a custom test, click Save and run.
4. The test appears on the Compliance tests tab with a Running... status.

    **Cancel a running test:** If you need to cancel a test with the Running... status, click Delete test results.

#### Manage compliance test runs

From a running or completed test on the Compliance tests tab:

- To delete a completed test run or cancel and delete a running test, click Delete test results.
- To view the chat calculating the metric, click the chat name in the Corresponding chat column.
- To view the evaluation dataset used to calculate the metric, click the dataset name in the Evaluation dataset column.

#### Manage custom compliance tests

To manage custom compliance tests, on the Run test panel, from the All tests list, select a custom test, then click Delete test or Edit test. You can't edit or delete pre-configured DataRobot tests.

If you select Edit test, update the settings you [configured during compliance test creation](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#create-and-run-custom-compliance-tests).

### Compare compliance test results

To compare compliance test results, you can run compliance tests for up to three LLM blueprints at a time. On the Playground tile, in the LLM blueprints list, select up to three LLM blueprints to test, click the Compliance tests tab, and then click Run test.

This opens the Run test panel, where you can select and run a test [as you would for a single blueprint](https://docs.datarobot.com/en/docs/agentic-ai/playground-tools/playground-eval-metrics.html#run-existing-compliance-tests); however, you can also define the LLM blueprints to run it for. By default, the blueprints selected on the comparison tab are listed here:

After the compliance tests run, you can compare them on the Blueprints comparison page. To delete a completed test run, or cancel an in-progress test run, click Delete test results.

## View the tracing table

Tracing the execution of LLM blueprints is a powerful tool for understanding how most parts of the GenAI stack work. The Tracing tab provides a log of all components and prompting activity used in generating LLM responses in the playground. Insights from tracing provide full context of everything the LLM evaluated, including prompts, vector database chunks, and past interactions within the context window. For example:

- DataRobot metadata: Reports the timestamp, Use Case, playground, vector database, and blueprint IDs, as well as creator name and base LLM. These help pinpoint the sources of trace records if you need to surface additional information from DataRobot objects interacting with the LLM blueprint.
- LLM parameters: Shows the parameters used when calling out to an LLM, which is useful for debugging settings like temperature and the system prompt.
- Prompts and responses: Provide a history of chats; token count and user feedback provide additional detail.
- Latency: Highlights issues orchestrating the parts of the LLM Blueprint.
- Token usage: Displays the breakdown of token usage to accurately calculate LLM cost (a worked cost sketch follows this list).
- Evaluations and moderations (if configured): Illustrates how evaluation and moderation metrics are scoring prompts or responses.
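
As mentioned for the Token usage entry above, cost follows directly from the token breakdown and the per-token prices you set in the Cost metric (currency amount per token amount). A worked sketch with assumed prices and made-up token counts:

```python
# Assumed Cost metric settings: input $0.005 per 1,000 tokens, output $0.015 per 1,000 tokens.
input_cost_per_1k = 0.005
output_cost_per_1k = 0.015

# Token breakdown as reported in the tracing table (made-up values).
prompt_tokens = 1_200    # includes retrieved citation text sent to the LLM
response_tokens = 400

cost = (prompt_tokens / 1_000) * input_cost_per_1k + (response_tokens / 1_000) * output_cost_per_1k
print(f"Estimated cost for this prompt/response: ${cost:.4f}")  # $0.0120
```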

To locate specific information in the Tracing table, click Filters and filter by User name, LLM, Vector database, LLM Blueprint name, Chat name, Evaluation dataset, and Evaluation status.

> [!TIP] Send tracing data to the Data Registry
> Click Upload to Data Registry to export data from the tracing table to the Data Registry. A warning appears on the tracing table when it includes results from running the toxicity test; the toxicity test results are excluded from the Data Registry upload.

## Send a metric and compliance test configuration to the workshop

After creating an LLM blueprint, setting the blueprint configuration (including evaluation metrics and moderations), and testing and tuning the responses, send the LLM blueprint to the workshop:

1. In a Use Case, from the Playground tile, click the playground containing the LLM you want to register as a blueprint.
2. In the playground, compare LLMs to determine which LLM blueprint to send to the workshop, then open the Send to the workshop modal.
3. In the Send to the workshop modal, select up to twelve evaluation metrics (and any configured moderations).

    **Why can't I send all metrics to the workshop?** Several metrics are supported by default after you register and deploy an LLM sent to the workshop from the playground; others are configurable using custom metrics. The following table lists the evaluation metrics you cannot select during this process and provides the alternative metric in Console:

    | Metric | Console equivalent |
    | --- | --- |
    | Citations | Citations are provided on the Data exploration > Tracing tab. If configured in the playground, citations are included in the transfer by default, without the need to select the option in the Send to the workshop modal. The resulting custom model has the `ENABLE_CITATION_COLUMNS` runtime parameter configured. After deploying that custom model, if the Data exploration tab is enabled and association IDs are provided, citations are available for a model sent to the workshop. |
    | Cost | Cost can be calculated on the Monitoring > Custom metrics tab of a deployment. |
    | Correctness | Correctness is not available for deployed models. |
    | Latency | Latency is calculated on the Monitoring > Service health tab and Monitoring > Custom metrics tab. |
    | All Tokens | All tokens can be calculated on the Custom metrics tab, or you can add the prompt tokens and response tokens metrics separately. |
    | Document Tokens | Document tokens are not available for deployed models. |
4. Next, select any Compliance tests to send. Then, click Send to the workshop. Compliance tests sent to the workshop are included when you register the custom model and generate compliance documentation.

    **Compliance tests in the workshop:** The selected compliance tests are linked to the custom model in the workshop by the `LLM_TEST_SUITE_ID` runtime parameter. If you modify the custom model code significantly in the workshop, set the `LLM_TEST_SUITE_ID` runtime parameter to None to avoid running compliance documentation intended for the original model on the modified model. A hedged sketch of reading this runtime parameter follows these steps.
5. To complete the transfer of evaluation metrics, configure the custom model in the workshop.
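
As a sketch of how custom model code in the workshop might consult the `LLM_TEST_SUITE_ID` runtime parameter, the snippet below uses the RuntimeParameters helper from the datarobot-drum package; treat the import path, the parameter-fetch behavior, and the fallback values as assumptions and confirm them against the code generated for your custom model.

```python
# Hedged sketch: reading the compliance-test runtime parameter inside custom model code.
# Confirm the helper and parameter name against your generated custom model files.
from datarobot_drum import RuntimeParameters

try:
    test_suite_id = RuntimeParameters.get("LLM_TEST_SUITE_ID")
except Exception:
    # Parameter not defined for this model (for example, after setting it to None).
    test_suite_id = None

if test_suite_id in (None, "", "None"):
    print("No compliance test suite linked; skipping playground compliance documentation.")
else:
    print(f"Compliance test suite linked: {test_suite_id}")
```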
