
Use LLM evaluation tools

Premium

LLM evaluation tools are a premium feature. Contact your DataRobot representative or administrator for information on enabling this feature.

The playground's LLM evaluation tools include evaluation metrics, evaluation datasets, aggregated metrics, compliance tests, and tracing:

| LLM evaluation tool | Description |
| --- | --- |
| Evaluation metrics | Report an array of performance, safety, and operational metrics for prompts and responses in the playground and define moderation criteria and actions for any configured metrics. |
| Evaluation datasets | Upload or generate the evaluation datasets used to evaluate an LLM blueprint through evaluation dataset metrics, aggregated metrics, and compliance tests. |
| Aggregated metrics | Combine evaluation metrics across many prompts and responses to evaluate an LLM blueprint at a high level, as only so much can be learned from evaluating a single prompt or response. |
| Compliance tests | Combine an evaluation metric and dataset to automate the detection of compliance issues with pre-configured or custom compliance testing. |
| Tracing table | Trace the execution of LLM blueprints through a log of all components and prompting activity used in generating LLM responses in the playground. |

Configure evaluation metrics

With evaluation metrics, you can configure an array of performance, safety, and operational metrics. Configuring these metrics lets you define moderation methods to intervene when prompts and responses meet the moderation criteria you set. This functionality can help detect and block prompt injection and hateful, toxic, or inappropriate prompts and responses. It can also help identify hallucinations or low-confidence responses and safeguard against the sharing of personally identifiable information (PII).
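As a rough illustration of the report-versus-block pattern these settings describe, the sketch below shows how a configured threshold might trigger moderation. The class, field names, thresholds, and message are illustrative assumptions, not DataRobot internals.

```python
# Illustrative sketch only: the report-vs-block moderation pattern described above.
# All names and thresholds here are hypothetical, not DataRobot's implementation.
from dataclasses import dataclass


@dataclass
class ModerationRule:
    metric_name: str   # e.g., "Toxicity"
    threshold: float   # score at or above which moderation triggers
    block: bool        # True = "Report and block", False = "Report"
    message: str = "This response was blocked by a moderation rule."


def apply_moderation(text: str, score: float, rule: ModerationRule) -> tuple[str, bool]:
    """Return the (possibly replaced) text and whether the rule was triggered."""
    triggered = score >= rule.threshold
    if triggered and rule.block:
        # Blocked: the moderation message replaces the response.
        return rule.message, True
    # Reported only: the original text passes through, but the trigger is logged.
    return text, triggered


# Example: a toxicity score of 0.92 against a 0.5 threshold blocks the response.
rule = ModerationRule(metric_name="Toxicity", threshold=0.5, block=True)
print(apply_moderation("some model output", 0.92, rule))
```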

Evaluation deployment metrics

Many evaluation metrics connect a playground-built LLM to a deployed guard model. These guard models make predictions on LLM prompts and responses and then report the predictions and statistics to the playground. If you intend to use any of the Evaluation Deployment type metrics—Custom Deployment, PII Detection, Prompt Injection, Emotions Classifier, and Toxicity—deploy the required guard models from the NextGen Registry to make predictions on the LLM's prompts or responses.
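As a rough illustration of this guard-model pattern (not DataRobot's actual API), the sketch below posts prompt text to a hypothetical deployment prediction endpoint and reads back a score. The URL, headers, and response shape are placeholders; use the endpoint and payload format shown for your specific deployment.

```python
# Sketch of the guard-model pattern: send prompt text to a deployed classifier and
# read back a score. The URL, headers, and response parsing below are placeholders,
# not a documented DataRobot endpoint; adapt them to your deployment's prediction API.
import requests

PREDICTION_URL = "https://example-prediction-server/deployments/<deployment_id>/predictions"  # placeholder
HEADERS = {"Authorization": "Bearer <api-token>", "Content-Type": "text/csv"}  # placeholder


def score_with_guard(prompt_text: str) -> float:
    # The global guard models described below expect a single column named "text".
    csv_payload = 'text\n"{}"'.format(prompt_text.replace('"', '""'))
    response = requests.post(PREDICTION_URL, headers=HEADERS, data=csv_payload, timeout=30)
    response.raise_for_status()
    # Assumed response shape for illustration; adjust to your deployment's actual output.
    return response.json()["data"][0]["prediction"]
```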

To select and configure evaluation metrics in an LLM playground, do either of the following:

  • If you've added a blueprint to the playground, on the side navigation bar click Configure evaluation and moderation metrics:

  • If you haven't added any blueprints to the playground, in the Evaluate with metrics tile, click Configure evaluation to configure metrics before adding a blueprint:

    The LLM Evaluation and moderation page opens to the Metrics tab. Certain metrics are enabled by default; however, to report a metric value for Citations and ROUGE-1, you must associate a vector database with the LLM blueprint.

Create a new configuration

To create a new evaluation metric configuration for the playground:

  1. In the upper-right corner of the LLM Evaluation and moderation page, click Configure metrics:

  2. In the Configure evaluation and moderation panel, click an evaluation metric and then configure the metric settings. The metrics, requirements, and settings are outlined in the tables below:

    Evaluation metric details

    For more detailed definitions of the evaluation metrics available in DataRobot, see the LLM custom metrics reference.

    | Evaluation metric | Requires | Description |
    | --- | --- | --- |
    | Cost | LLM cost settings | Calculates the cost of generating the LLM response using a default or custom LLM, currency, input cost-per-token, and output cost-per-token values. The cost calculation also includes the cost of citations. |
    | Custom Deployment | Custom deployment | Uses an existing deployment to evaluate and moderate your LLM (supported target types: regression, binary classification, multiclass, text generation). |
    | Emotions Classifier | Emotions Classifier deployment | Classifies prompt or response text by emotion. |
    | PII Detection | Presidio PII Detection deployment | Detects Personally Identifiable Information (PII) in text using the Microsoft Presidio library. |
    | Prompt Injection | Prompt Injection Classifier deployment | Detects input manipulations, such as overwriting or altering system prompts, intended to modify the model's output. |
    | Stay on topic for inputs | NVIDIA NeMo guardrails configuration | Uses NVIDIA NeMo Guardrails to provide topic boundaries, ensuring prompts are topic-relevant and do not use blocked terms. |
    | Stay on topic for output | NVIDIA NeMo guardrails configuration | Uses NVIDIA NeMo Guardrails to provide topic boundaries, ensuring responses are topic-relevant and do not use blocked terms. |
    | Toxicity | Toxicity Classifier deployment | Classifies content toxicity to apply moderation techniques, safeguarding against dissemination of harmful content. |
    | ROUGE-1 | Vector database | Recall-Oriented Understudy for Gisting Evaluation calculates the similarity between the response generated from an LLM blueprint and the documents retrieved from the vector database (a brief sketch of the calculation follows this table). |
    | Citations | Vector database | Reports the documents retrieved by an LLM when prompting a vector database. |
    | All tokens | N/A | Tracks the number of tokens associated with the input to the LLM, output from the LLM, and/or retrieved text from the vector database. |
    | Prompt tokens | N/A | Tracks the number of tokens associated with the input to the LLM. |
    | Response tokens | N/A | Tracks the number of tokens associated with the output from the LLM. |
    | Document tokens | N/A | Tracks the number of tokens associated with the retrieved text from the vector database. |
    | Latency | N/A | Reports the response latency of the LLM blueprint. |
    | Correctness | Playground LLM, evaluation dataset, vector database | Uses either a provided or synthetically generated set of prompts or prompt and response pairs to evaluate aggregated metrics against the provided reference dataset. The Correctness metric uses the LlamaIndex Correctness Evaluator. |
    | Faithfulness | Playground LLM, vector database | Measures whether the LLM response matches the source to identify possible hallucinations. The Faithfulness metric uses the LlamaIndex Faithfulness Evaluator. |
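ROUGE-1 is based on unigram overlap. The following minimal sketch shows the general idea (recall of reference unigrams against the generated response); DataRobot's implementation details, such as tokenization, may differ.

```python
# Minimal unigram-overlap sketch of the idea behind ROUGE-1 recall: how much of the
# retrieved reference text is covered by the generated response.
from collections import Counter


def rouge1_recall(response: str, reference: str) -> float:
    response_counts = Counter(response.lower().split())
    reference_counts = Counter(reference.lower().split())
    # Count reference unigrams that also appear in the response (clipped counts).
    overlap = sum(min(count, response_counts[token]) for token, count in reference_counts.items())
    total_reference_tokens = sum(reference_counts.values())
    return overlap / total_reference_tokens if total_reference_tokens else 0.0


print(rouge1_recall(
    "The warranty covers parts for two years.",
    "The warranty covers parts and labor for two years from purchase.",
))
```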

    Multiclass custom deployment metric limits

    Multiclass custom deployment metrics can have:

    • Up to 10 classes defined in the Matches list for moderation criteria.

    • Up to 100 class names in the guard model.

    The deployments required for PII detection, prompt injection detection, emotion classification, and toxicity classification are available as global models in the registry. The following global models are available:

    | Model | Type | Target | Description |
    | --- | --- | --- | --- |
    | Prompt Injection Classifier | Binary | injection | Classifies text as prompt injection or legitimate. This model requires one column named text, containing the text to classify. For more information, see the deberta-v3-base-injection model details. |
    | Toxicity Classifier | Binary | toxicity | Classifies text as toxic or non-toxic. This model requires one column named text, containing the text to classify. For more information, see the toxic-comment-model details. |
    | Sentiment Classifier | Binary | sentiment | Classifies text sentiment as positive or negative. This model requires one column named text, containing the text to classify. For more information, see the distilbert-base-uncased-finetuned-sst-2-english model details. |
    | Emotions Classifier | Multiclass | target | Classifies text by emotion. This is a multilabel model, meaning that multiple emotions can be applied to the text. This model requires one column named text, containing the text to classify. For more information, see the roberta-base-go_emotions-onnx model details. |
    | Refusal Score | Regression | target | Outputs a maximum similarity score, comparing the input to a list of cases where an LLM has refused to answer a query because the prompt is outside the limits of what the model is configured to answer. |
    | Presidio PII Detection | Binary | contains_pii | Detects and replaces Personally Identifiable Information (PII) in text. This model requires one column named text, containing the text to be classified. The types of PII to detect can optionally be specified in a column named entities, as a comma-separated string; if this column is not specified, all supported entities are detected. Entity types are listed in the PII entities supported by Presidio documentation. In addition to the detection result, the model returns an anonymized_text column, containing an updated version of the input with detected PII replaced with placeholders. For more information, see the Presidio: Data Protection and De-identification SDK documentation; a brief sketch of this behavior follows the table. |
    | Zero-shot Classifier | Binary | target | Performs zero-shot classification on text with user-specified labels. This model requires the text to classify in a column named text and class labels as a comma-separated string in a column named labels. It expects the same set of labels for all rows; therefore, the labels provided in the first row are used. For more information, see the deberta-v3-large-zeroshot-v1 model details. |
    | Python Dummy Binary Classification | Binary | target | Always yields 0.75 for the positive class. For more information, see the python3_dummy_binary model template. |
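Because the Presidio PII Detection guard wraps the open-source Microsoft Presidio libraries, its behavior (a text column in, contains_pii and anonymized_text out) can be sketched with those libraries directly. This is an illustration of the underlying libraries, not the guard model's exact code, and it assumes presidio-analyzer, presidio-anonymizer, and a spaCy English model (for example, en_core_web_lg) are installed.

```python
# Sketch of the Presidio-based behavior described above, using the open-source
# Microsoft Presidio libraries (pip install presidio-analyzer presidio-anonymizer).
# Output keys mirror the guard model's columns described in the table.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()       # requires a spaCy English model to be installed
anonymizer = AnonymizerEngine()


def detect_pii(text: str, entities=None) -> dict:
    # entities=None detects all supported entity types, like the guard's default.
    findings = analyzer.analyze(text=text, language="en", entities=entities)
    anonymized = anonymizer.anonymize(text=text, analyzer_results=findings)
    return {
        "contains_pii": bool(findings),
        "anonymized_text": anonymized.text,
    }


print(detect_pii("Contact Jane Doe at jane.doe@example.com."))
```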

    Depending on the evaluation metric (or evaluation metric type) selected, different configuration options are required:

    Setting Description
    General settings
    Name Enter a unique name if adding multiple instances of the evaluation metric.
    Apply to Select one or both of Prompt and Response, depending on the evaluation metric. Note that when you select Prompt, it's the user prompt, not the final LLM prompt, that is used for metric calculation.
    Custom Deployment, PII Detection, Prompt Injection, Emotions Classifier, and Toxicity settings
    Deployment name For evaluation metrics calculated by a guard model deployment, select the custom model deployment.
    Custom Deployment settings
    Input column name This name is defined by the custom model creator. For global models created by DataRobot, the default input column name is text. If the guard model for the custom deployment has the moderations.input_column_name key value defined, this field is populated automatically.
    Output column name This name is defined by the custom model creator, and needs to refer to the target column for the model. The target name is listed on the deployment's Overview tab (and often has _PREDICTION appended to it). You can confirm the column names by exporting and viewing the CSV data from the custom deployment. If the guard model for the custom deployment has the moderations.output_column_name key value defined, this field is populated automatically.
    Correctness and Faithfulness settings
    LLM Select a Playground LLM for evaluation.
    Stay on topic for input/output settings
    LLM Type Select Azure OpenAI or OpenAI, and then set the following:
    • For the Azure OpenAI LLM type, enter an OpenAI API base URL, OpenAI Credentials, and OpenAI API Deployment.
    • For the OpenAI LLM type, select Credentials.
    Credentials are defined on the Credentials management page.
    Files For the Stay on topic evaluations, next to a file, click the edit icon to modify the NeMo guardrails configuration files. In particular, update prompts.yml with allowed and blocked topics and blocked_terms.txt with the blocked terms, providing rules for NeMo guardrails to enforce. The blocked_terms.txt file is shared between the input and output stay on topic metrics; therefore, modifying blocked_terms.txt in the input metric modifies it for the output metric, and vice versa. Only two NeMo stay on topic metrics can exist in a playground, one for input and one for output.
    Moderation settings
    Configure and apply moderation Enable this setting to expand the Moderation section and define the criteria that determines when moderation logic is applied.
    Cost metric settings

    For the Cost metric, in the row for each LLM type, define a Currency, Input cost in currency amount / tokens amount, and Output cost in currency amount / tokens amount, then click Add:

    The Cost metric doesn't include a Moderation section, so Configure and apply moderation isn't available for it.
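The cost arithmetic itself is simple. A minimal sketch follows, assuming rates are expressed as a cost per block of tokens (for example, per 1,000 tokens); the rates shown are examples only, not defaults.

```python
# Illustrative arithmetic for the Cost metric settings above: cost scales with prompt
# (input) tokens and response (output) tokens at the configured per-token rates.
def llm_call_cost(prompt_tokens: int, response_tokens: int,
                  input_cost: float, input_per_tokens: int,
                  output_cost: float, output_per_tokens: int) -> float:
    # Input and output sides are priced separately, then summed.
    return (prompt_tokens / input_per_tokens) * input_cost + \
           (response_tokens / output_per_tokens) * output_cost


# Example rates: $0.01 per 1,000 input tokens and $0.03 per 1,000 output tokens.
print(round(llm_call_cost(820, 240, 0.01, 1000, 0.03, 1000), 6))  # 0.0154
```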

  3. In the Moderation section, with Configure and apply moderation enabled, for each evaluation metric, set the following:

    | Setting | Description |
    | --- | --- |
    | Moderation criteria | If applicable, set the threshold settings evaluated to trigger moderation logic. For the Emotions Classifier, select Matches or Does not match and define a list of classes (emotions) to trigger moderation logic. |
    | Moderation method | Select Report or Report and block. |
    | Moderation message | If you select Report and block, you can optionally modify the default message. |
  4. After configuring the required fields, click Add to save the evaluation and return to the evaluation selection page.

    The metrics you selected appear on the Configure evaluation and moderation panel, in the Configuration summary sidebar.

  5. Select and configure another metric, or click Save configuration.

    The metrics appear on the LLM Evaluation and moderation page. If any issues occur during metric configuration, an error message appears below the metric to provide guidance on how to fix the issue.

Manage configured metrics

To edit or remove a configured evaluation metric from the playground:

  1. In the upper-right corner of the LLM Evaluation and moderation page, click Configure metrics:

  2. In the Configure evaluation and moderation panel, in the Configuration summary sidebar, click the edit icon or the delete icon:

  3. If you click the edit icon, you can reconfigure the settings for that metric and click Update:

Copy a metric configuration

To copy an evaluation metrics configuration to or from an LLM playground:

  1. In the upper-right corner of the LLM Evaluation and moderation page, next to Configure metrics, click the menu icon, and then click Copy configuration.

  2. In the Copy evaluation and moderation configuration modal, select one of the following options:

    If you select From an existing playground, choose to Add to existing configuration or Replace existing configuration and then select a playground to Copy from.

    If you select To an existing playground, choose to Add to existing configuration or Replace existing configuration and then select a playground to Copy to.

    If you select To a new playground, enter a New playground name.

  3. Select if you want to Include evaluation datasets, and then click Copy configuration.

Duplicate evaluation metrics

Selecting Add to existing configuration can result in duplicate metrics, except in the case of NeMo Stay on topic for inputs and Stay on topic for output. Only two NeMo stay on topic metrics can exist, one for input and one for output.

View metrics in a chat

The metrics you configure and add to the playground appear on the LLM responses in the playground. Click the down arrow to open the metric panel for more details. From this panel, click Citation to view the prompt, response, and a list of citations in the Citation dialog box. You can also provide positive or negative feedback for the response.

In addition, if a response from the LLM is blocked by the configured moderation criteria and strategy, you can click Show response to view the blocked response:

Multiple moderation messages

If a response from the LLM is blocked by multiple configured moderations, the message for each triggered moderation appears, replacing the LLM response, in the chat. If you configure descriptive moderation messages, this can provide a complete list of reasons for blocking the LLM response.

Add evaluation datasets

To enable evaluation dataset metrics and aggregated metrics, add one or more evaluation datasets to the playground.

  1. To select and configure evaluation metrics in an LLM playground, do either of the following:

    • If you've added a blueprint to the playground, on the side navigation bar click Configure evaluation and moderation metrics:

    • If you haven't added any blueprints to the playground, in the Evaluate with metrics tile, click Configure evaluation to configure metrics before adding a blueprint:

      The LLM Evaluation and moderation page opens to the Metrics tab. Certain metrics are enabled by default; however, Citations and ROUGE-1 require a vector database associated with the LLM blueprint to report a metric value.

  2. On the LLM Evaluation and moderation page, click the Evaluation datasets tab to view any existing datasets, then click Add evaluation dataset and select one of the following methods:

    Dataset addition method Description
    Add evaluation dataset In the Add evaluation dataset panel, select an existing dataset from the Data Registry table, or upload a new dataset:
    • Click Upload to register and select a new dataset from your local filesystem.
    • Click the menu next to Upload, click Upload from URL, and then enter the URL for a hosted dataset.
    After you select a dataset, in the Evaluation dataset configuration sidebar, define the Prompt column name and Response (target) column name, and click Add evaluation dataset.
    Generate synthetic data Enter a Dataset name, select an LLM, set Vector database, Vector database version, and the Language to use when creating synthetic data. Then, click Generate data. For more information, see Generate synthetic datasets.
  3. After you add an evaluation dataset, it appears on the Evaluation datasets tab of the LLM Evaluation and moderation page, where you can click Open dataset to view the data. You can also click the Actions menu to Edit evaluation dataset or Delete evaluation dataset:
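If you prepare an evaluation dataset programmatically, the sketch below shows one way to build a file with the prompt and response (target) columns described above and register it in the Data Registry using the DataRobot Python client. Treat the column names and endpoint as assumptions to adapt to your environment, and check your client version's documentation for exact signatures.

```python
# Sketch: build a small prompt/response evaluation dataset and register it in the
# Data Registry with the DataRobot Python client (pip install datarobot pandas).
import datarobot as dr
import pandas as pd

# Endpoint and token are placeholders for your environment.
dr.Client(endpoint="https://app.datarobot.com/api/v2", token="<your-api-token>")

eval_df = pd.DataFrame({
    "prompt": [
        "What does the warranty cover?",
        "How do I reset my password?",
    ],
    "response": [
        "The warranty covers parts and labor for two years.",
        "Use the 'Forgot password' link on the sign-in page.",
    ],
})
eval_df.to_csv("evaluation_dataset.csv", index=False)

# Registers the file as a new dataset in the Data Registry.
dataset = dr.Dataset.create_from_file(file_path="evaluation_dataset.csv")
print(dataset.id)
```

In the playground, you would then select this dataset and map prompt as the Prompt column name and response as the Response (target) column name.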

How are synthetic datasets generated?

When you add evaluation dataset metrics, DataRobot can use a vector database to generate synthetic datasets, composed of prompt and response pairs, to evaluate your LLM blueprint against. Synthetic datasets are generated by accessing the selected vector database, clustering the vectors, pulling a representative chunk from each cluster, and prompting the selected LLM to generate 100 question and answer pairs based on the document(s). When you configure the synthetic evaluation dataset settings and click Generate, two events occur sequentially:

  1. A placeholder dataset is registered to the Data Registry with the required columns (question and answer), containing 65 rows and 2 columns of placeholder data (for example, Record for synthetic prompt answer 0, Record for synthetic prompt answer 1, etc.).

  2. The selected LLM and vector database pair generates question and answer pairs, which are added to the Data Registry as a second version of the synthetic evaluation dataset created in step 1. The generation time depends on the selected LLM.

To generate high-quality and diverse questions, DataRobot runs cosine similarity-based clustering. Similar chunks are grouped into the same cluster and each cluster generates a single question and answer pair. Therefore, if a vector database includes many similar chunks, they'll be grouped into a much smaller number of clusters. When this happens, the number of pairs generated is much lower than the number of chunks in the vector database.
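The following sketch illustrates that clustering idea: collapse similar chunk embeddings with cosine-distance agglomerative clustering and keep one representative chunk per cluster to prompt the LLM for a question/answer pair. It is a simplified stand-in for DataRobot's pipeline; the threshold value is an arbitrary example, and scikit-learn 1.2 or later is assumed for the metric parameter.

```python
# Sketch of cosine-similarity-based clustering for synthetic dataset generation:
# similar chunks collapse into one cluster, and one representative chunk per cluster
# would then be used to generate a question/answer pair.
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # scikit-learn >= 1.2 assumed


def representative_chunks(embeddings: np.ndarray, chunks: list[str],
                          distance_threshold: float = 0.3) -> list[str]:
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,  # smaller threshold = more clusters
        metric="cosine",
        linkage="average",
    ).fit(embeddings)

    representatives = []
    for label in np.unique(clustering.labels_):
        members = np.where(clustering.labels_ == label)[0]
        centroid = embeddings[members].mean(axis=0)
        # Pick the member closest (by cosine similarity) to the cluster centroid.
        sims = embeddings[members] @ centroid / (
            np.linalg.norm(embeddings[members], axis=1) * np.linalg.norm(centroid) + 1e-12
        )
        representatives.append(chunks[members[np.argmax(sims)]])
    # One Q&A pair would then be generated per representative chunk.
    return representatives
```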

Add aggregated metrics

When a playground includes more than one metric, you can begin creating aggregate metrics. Aggregation is the act of combining metrics across many prompts and/or responses, which helps to evaluate a blueprint at a high level (only so much can be learned from evaluating a single prompt/response). Aggregation provides a more comprehensive approach to evaluation.

Aggregation either averages the raw scores, counts the boolean values, or surfaces the number of categories in a multiclass model. DataRobot does this by generating the metrics for each individual prompt/response and then aggregating using one of the methods listed, based on the metric.
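A minimal sketch of these aggregation behaviors, with illustrative metric kinds rather than DataRobot's internal names:

```python
# Sketch of the aggregation methods described above: average numeric scores, count
# boolean flags, and tally class labels for multiclass guard outputs.
from collections import Counter
from statistics import mean


def aggregate(per_prompt_values: list, kind: str):
    if kind == "numeric":      # e.g., latency or ROUGE-1: average the raw scores
        return mean(per_prompt_values)
    if kind == "boolean":      # e.g., prompt injection flags: count the positives
        return sum(bool(v) for v in per_prompt_values)
    if kind == "multiclass":   # e.g., emotions: surface counts per category
        return dict(Counter(per_prompt_values))
    raise ValueError(f"Unknown aggregation kind: {kind}")


print(aggregate([0.42, 0.57, 0.61], "numeric"))        # 0.533...
print(aggregate([True, False, True, True], "boolean"))  # 3
print(aggregate(["joy", "anger", "joy"], "multiclass")) # {'joy': 2, 'anger': 1}
```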

To configure aggregated metrics:

  1. In a playground, click Configure aggregation below the prompt input:

    Aggregation job run limit

    Only one aggregated metric job can run at a time. If an aggregation job is currently running, the Configure aggregation button is disabled and the "Aggregation job in progress; try again when it completes" tooltip appears.

  2. On the Generate aggregated metrics panel, select metrics to calculate in aggregate and configure the Aggregate by settings. Then, enter a new Chat name, select an Evaluation dataset (to generate prompts in the new chat), and select the LLM blueprints for which the metrics should be generated. These fields are pre-populated based on the current playground:

    Evaluation dataset selection

    If you select an evaluation dataset metric, like Correctness, you must use the evaluation dataset that was used to configure that metric.

    After you complete the Metrics selection and Configuration sections, click Generate metrics. This results in a new chat containing all associated prompts and responses:

    Aggregated metrics are run against an evaluation dataset, not individual prompts in a standard chat. Therefore, you can only view aggregated metrics in the generated aggregated metrics chat, added to the LLM blueprint's All Chats list (on the LLM's configuration page).

    Aggregation metric calculation for multiple blueprints

    If many LLM blueprints are included in the metric aggregation request, aggregated metrics are computed sequentially, blueprint-by-blueprint.

  3. Once an aggregated chat is generated, you can explore the resulting aggregated metrics, scores, and related assets on the Aggregated metrics tab. You can filter by Aggregation method, Evaluation dataset, and Metric:

    In addition, click Current configuration to compare only those metrics sharing the configuration currently displayed in the LLM tab of the Configuration sidebar.

    View related assets

    For each metric in the table, you can click Evaluation dataset and Aggregated chat to view the corresponding asset contributing to the aggregated metric.

  4. Returning to the LLM Blueprints comparison page, you can now open the Aggregated metrics tab to view a leaderboard comparing LLM blueprint performance for the generated aggregated metrics:

Configure compliance testing

Combine an evaluation metric and an evaluation dataset to automate the detection of compliance issues through test prompt scenarios.

Manage compliance testing from the Evaluation tab

When you manage compliance testing on the Evaluation tab, you can view pre-defined compliance tests, create and manage custom tests, or modify pre-defined tests to suit your organization's testing requirements.

To view all available compliance tests:

  1. On the side navigation bar click Configure evaluation and moderation metrics.

  2. Click the Compliance tests tab to view all available compliance tests, both DataRobot and custom (if present). The table contains columns for the Test name, Provider, and Configuration (number of evaluations and evaluation datasets).

  3. In the last column of the table, use the available actions to either view and, optionally, customize DataRobot compliance tests or manage custom compliance tests.

View and customize DataRobot compliance tests

To view and, optionally, customize the pre-configured DataRobot compliance tests available:

  1. In the table on the Compliance tests tab, review the tests with DataRobot as the Provider, and click View to open and review any of the following compliance tests:

    | Compliance test | Description |
    | --- | --- |
    | Bias Benchmark | Runs LLM question/answer sets that test for bias along eight social dimensions. |
    | Jailbreak | Applies testing scenarios to evaluate whether built-in safeguards enforce LLM jailbreaking compliance standards. |
    | Completeness | Determines whether the LLM response is supplying enough information to comprehensively answer questions. |
    | Personally Identifiable Information (PII) | Determines whether the LLM response contains PII included in the prompt. |
    | Toxicity | Applies testing scenarios to evaluate whether built-in safeguards enforce toxicity compliance standards. For more information, see the explicit and offensive content warning. |
    | Japanese Bias Benchmark | Runs LLM question/answer sets in Japanese that test for bias along five social dimensions. |

    Explicit and offensive content warning

    The public evaluation dataset for toxicity testing contains explicit and offensive content. It is intended to be used exclusively for the purpose of eliminating such content from customer models and applications. Any other use is strictly prohibited.

  2. When you view a compliance test from the list, you can review the individual evaluations run as part of the compliance testing process. For each test, you can review the Name, Metric, Evaluation dataset, Pass threshold, and Number of prompts. Click Customize test to access and modify these settings.

  3. To use the selected DataRobot test as the foundation for a custom test, on the Create custom test tab, modify the following settings:

    Setting Description
    Name A descriptive name for the custom compliance test.
    Description A description of the purpose of the compliance test (this is pre-populated when you modify an existing DataRobot test).
    Test pass threshold The minimum percentage (0-100%) of individual evaluations that must pass for the test as a whole to pass.
    Evaluations The individual evaluations for the compliance test, each consisting of a Name, Metric, Evaluation dataset, Pass threshold, and Number of prompts. In addition to the default metrics and evaluation datasets, you can select any evaluation metrics implemented by a deployed binary classification sidecar model and any evaluation datasets added to the Use Case.
    • Click + Add evaluation to create additional evaluations.
    • Click Copy from existing test to copy the individual evaluations from an existing compliance test.
    There is an API-only process to validate a sidecar model with expected_response_column to introduce metrics comparing the LLM response with an expected response, similar to the pre-provided exact_match metric.
  4. After you customize the compliance test settings, click Add. The new test appears in the table on the Compliance tests tab.
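The pass-threshold logic described in the settings above can be sketched as two nested percentage checks: each evaluation passes if enough of its prompts meet the metric threshold, and the test passes if enough evaluations pass. The field names and numbers below are illustrative only.

```python
# Sketch of the Test pass threshold behavior: per-evaluation pass rates roll up into
# an overall test result.
def evaluation_passes(per_prompt_pass_flags: list[bool], pass_threshold_pct: float) -> bool:
    passed = sum(per_prompt_pass_flags)
    return (passed / len(per_prompt_pass_flags)) * 100 >= pass_threshold_pct


def test_passes(evaluation_results: list[bool], test_pass_threshold_pct: float) -> bool:
    passed = sum(evaluation_results)
    return (passed / len(evaluation_results)) * 100 >= test_pass_threshold_pct


evaluations = [
    evaluation_passes([True, True, False, True], pass_threshold_pct=75),    # 75% -> passes
    evaluation_passes([True, False, False, False], pass_threshold_pct=75),  # 25% -> fails
]
print(test_passes(evaluations, test_pass_threshold_pct=50))  # True: 1 of 2 evaluations passed
```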

Create custom compliance tests

To create a custom compliance test:

  1. At the top or bottom of the Compliance tests tab, click Create custom compliance test.

    Create compliance tests from anywhere in the Evaluation tab

    When the Evaluation tab is open, you can click Create custom compliance test from anywhere, not just the Compliance tests tab.

  2. In the Create custom test panel, configure the following settings:

    Setting Description
    Name A descriptive name for the custom compliance test.
    Description A description of the purpose of the compliance test (this is pre-populated when you modify an existing DataRobot test).
    Test pass threshold The minimum percentage (0-100%) of individual evaluations that must pass for the test as a whole to pass.
    Evaluations The individual evaluations for the compliance test, each consisting of a Name, Metric, Evaluation dataset, Pass threshold, and Number of prompts. In addition to the default metrics and evaluation datasets, you can select any evaluation metrics implemented by a deployed binary classification sidecar model and any evaluation datasets added to the Use Case.
    • Click + Add evaluation to create additional evaluations.
    • Click Copy from existing test to copy the individual evaluations from an existing compliance test.
    There is an API-only process to validate a sidecar model with expected_response_column to introduce metrics comparing the LLM response with an expected response, similar to the pre-provided exact_match metric.
  3. After you configure the compliance test settings, click Add. The new test appears in the table on the Compliance tests tab.

Manage custom compliance tests

To manage custom compliance tests, locate tests with Custom as the Provider, and choose a management action:

  • Click the edit icon, then, in the Edit custom test panel, update the compliance test configuration and click Save.

  • Click the delete icon, then click Yes, delete test to remove the test from all playgrounds in the Use Case.

Run compliance testing from the Playground tab

When you perform compliance testing on the Playground tab, you can run the pre-defined compliance tests without modification, create custom tests, or modify the pre-defined tests to suit your organization's testing requirements.

To access compliance tests from the Playground tab to run, modify, or create a test:

  1. On the Playground tab, in the LLM blueprints list, click the LLM blueprint you want to test, or select up to three blueprints for comparison.

    Access compliance tests from the blueprints comparison page

    If you have two or more LLM blueprints selected, you can click the Compliance tests tab from the Blueprints comparison page to run compliance tests for multiple LLM blueprints and compare the results. For more information, see Compare compliance test results.

  2. In the LLM blueprint, click the Compliance tests tab, and then do one of the following to create or run tests:

    • If you haven't run a test before, in the center of the tab, under No compliance test results available, click Run compliance test.

    • If you have run a test before, in the right corner of the tab, click Run test.

  3. The Run test panel opens to a list of pre-configured DataRobot compliance tests and custom tests you've created.

  4. When you select a compliance test from the All tests list, you can view the individual evaluations run as part of the compliance testing process. For each test, you can review the Name, Metric, Evaluation dataset, Pass threshold, and Number of prompts.

  5. Next, run an existing test, create and run a custom test, or manage custom tests.

Run existing compliance tests

To run an existing, configured compliance test:

  1. On the Run test panel, from the All tests list, select an available DataRobot or Custom test.

  2. After selecting a test, click Run.

  3. The test appears on the Compliance tests tab with a Running... status.

    Cancel a running test

    If you need to cancel a test with the Running... status, click Delete test results.

Create and run custom compliance tests

To create and run a custom or modified compliance test:

  1. On the Run test panel, from the All tests list:

    • To modify a pre-configured DataRobot test—which results in creating a custom test—select a test from the list and click Customize test.

    • To create a new Custom test, click Create custom test.

  2. On the Custom test panel, configure the following settings:

    Setting Description
    Name A descriptive name for the custom compliance test.
    Description A description of the purpose of the compliance test (this is pre-populated when you modify an existing DataRobot test).
    Test pass threshold The minimum percentage (0-100%) of individual evaluations that must pass for the test as a whole to pass.
    Evaluations The individual evaluations for the compliance test, each consisting of a Name, Metric, Evaluation dataset, Pass threshold, and Number of prompts. In addition to the default metrics and evaluation datasets, you can select any evaluation metrics implemented by a deployed binary classification sidecar model and any evaluation datasets added to the Use Case.
    • Click + Add evaluation to create additional evaluations.
    • Click Copy from existing test to copy the individual evaluations from an existing compliance test.
    There is an API-only process to validate a sidecar model with expected_response_column to introduce metrics comparing the LLM response with an expected response, similar to the pre-provided exact_match metric.
  3. After configuring a custom test, click Save and run.

  4. The test appears on the Compliance tests tab with a Running... status.

    Cancel a running test

    If you need to cancel a test with the Running... status, click Delete test results.

Manage compliance test runs

From a running or completed test on the Compliance tests tab:

  • To delete a completed test run or cancel and delete a running test, click Delete test results.
  • To view the chat calculating the metric, click the chat name in the Corresponding chat column.
  • To view the evaluation dataset used to calculate the metric, click the dataset name in the Evaluation dataset column.

Manage custom compliance tests

To manage custom compliance tests, on the Run test panel, from the All tests list, select a custom test, then click Delete test or Edit test. You can't edit or delete pre-configured DataRobot tests.

If you select Edit test, update the settings you configured during compliance test creation.

Compare compliance test results

To compare compliance test results, you can run compliance tests for up to three LLM blueprints at a time. On the Playground tab, in the LLM blueprints list, select up to three LLM blueprints to test, click the Compliance tests tab, and then click Run test.

This opens the Run test panel, where you can select and run a test as you would for a single blueprint; however, you can also define the LLM blueprints to run it for. By default, the blueprints selected on the comparison tab are listed here:

After the compliance tests run, you can compare them on the Blueprints comparison page. To delete a completed test run, or cancel an in-progress test run, click Delete test results.

View the tracing table

Tracing the execution of LLM blueprints is a powerful tool for understanding how most parts of the GenAI stack work. The Tracing tab provides a log of all components and prompting activity used in generating LLM responses in the playground. Insights from tracing provide full context of everything the LLM evaluated, including prompts, vector database chunks, and past interactions within the context window. For example:

  • DataRobot metadata: Reports the timestamp, Use Case, playground, vector database, and blueprint IDs, as well as creator name and base LLM. These help pinpoint the sources of trace records if you need to surface additional information from DataRobot objects interacting with the LLM blueprint.
  • LLM parameters: Shows the parameters used when calling out to an LLM, which is useful for debugging settings like temperature and the system prompt.
  • Prompts and responses: Provide a history of chats; token count and user feedback provide additional detail.
  • Latency: Highlights issues in orchestrating the parts of the LLM blueprint.
  • Token usage: Displays the breakdown of token usage to accurately calculate LLM cost.
  • Evaluations and moderations (if configured): Illustrates how evaluation and moderation metrics are scoring prompts or responses.

To locate specific information in the Tracing table, click Filters and filter by User name, LLM, Vector database, LLM Blueprint name, Chat name, Evaluation dataset, and Evaluation status.

Send tracing data to the Data Registry

Click Upload to Data Registry to export data from the tracing table to the Data Registry. If the tracing table includes results from running the toxicity test, a warning appears, and the toxicity test results are excluded from the Data Registry upload.

Send a metric and compliance test configuration to the model workshop

After creating an LLM blueprint, setting the blueprint configuration (including evaluation metrics and moderations), and testing and tuning the responses, send the LLM blueprint to the model workshop:

  1. In a Use Case, from the Playground tab, click the playground containing the LLM you want to register as a blueprint.

  2. In the playground, compare LLMs to determine which LLM blueprint to send to the model workshop, then, do either of the following:

    • In the Comparison panel, on the LLM blueprints tab, click the Actions menu, and then click Send to model workshop.

    • In the chat comparison window, on the blueprint's header, click LLM blueprint actions, and then click Send to model workshop.

  3. In the Send to model workshop modal, select up to twelve evaluation metrics (and any configured moderations).

    If configured in the playground, Citations are included in the transfer by default, without the need to select them here. Citations are enabled for a model in the workshop by setting the ENABLE_CITATIONS_CONTENT_COLUMN, ENABLE_CITATIONS_SOURCE_COLUMN, and ENABLE_CITATIONS_PAGE_COLUMN runtime parameters to true.

    The following evaluation metrics aren't supported in the model workshop and cannot be sent during this process: Cost, Correctness, Latency, All Tokens, and Document Tokens.

  4. Next, select any Compliance tests to send. Then, click Send to model workshop:

    Compliance tests sent to the model workshop are included when you register the custom model and generate compliance documentation.

    Compliance tests in the model workshop

    The selected compliance tests are linked to the custom model in the model workshop by the LLM_TEST_SUITE_ID runtime parameter. If you modify the custom model code significantly in the model workshop, set the LLM_TEST_SUITE_ID runtime parameter to None to avoid running compliance tests intended for the original model on the modified model.

  5. To complete the transfer of evaluation metrics, configure the custom model in the model workshop.


Updated January 8, 2025