LLM evaluation

Availability information

Evaluation metrics are off by default. Contact your DataRobot representative or administrator for information on enabling this feature.

Evaluation metrics help your organization report on prompt injection and hateful, toxic, or inappropriate prompts and responses. These metrics can also prevent hallucinations or low-confidence responses and safeguard against the sharing of personally identifiable information (PII). Many evaluation metrics connect a playground-built LLM to a deployed guard model. These guard models make predictions on LLM prompts and responses and then report the predictions and statistics to the playground. The sections below describe the LLM evaluation metric tools.

Add and configure metrics

Before you use any of the Evaluation Deployment type metrics (Custom Deployment, PII Detection, Prompt Injection, Sentiment Classifier, and Toxicity), deploy the required guard models from the NextGen Registry so that they can make predictions on an LLM's prompts or responses.

To select and configure evaluation metrics in an LLM playground, click Configure metrics below the prompt box or LLM evaluation in the side navigation bar:

If you haven't added any blueprints to the playground, in the Evaluate with metrics tile, click Configure metrics to configure metrics before adding a blueprint:

The LLM Evaluation page opens to the Metrics tab. Certain metrics are enabled by default; however, Citations and Rouge 1 require a vector database associated with the LLM blueprint to report a metric value. Click a disabled metric to enable it in the current playground. To configure a new metric for the playground, in the upper-right corner of the LLM Evaluation page, click Configure:

In the Configure evaluation panel, metrics that show + Add in the upper-right corner require no configuration; click one to add it. These metrics may, however, require a vector database associated with the LLM blueprint to report metric values. For metrics that show Configure in the upper-right corner, click the metric to define its settings:

Evaluation metric | Requires | Description
Evaluation dataset metric | Evaluation dataset/vector database | Either provide or synthetically generate a set of prompts or prompt and response pairs to evaluate aggregated metrics against the provided reference dataset.
Cost | LLM cost settings | Calculate the cost of generating the LLM response using default or custom LLM, currency, input cost per token, and output cost per token values.
Token Count | Token count settings | Track the number of tokens associated with the input to the LLM, output from the LLM, and/or retrieved text from the vector database.
Latency | N/A | Report the response latency of the LLM blueprint.
Citations | Vector database | Report the documents retrieved by an LLM when prompting a vector database.
Faithfulness | Vector database | Measure if the LLM response matches the source to identify possible hallucinations.
Rouge 1 | Vector database | Calculate the similarity between the response generated from an LLM blueprint and the documents retrieved from the vector database.
Custom Deployment | Custom deployment | Use any deployment to evaluate and moderate your LLM (supported target types: regression, binary classification, multiclass, text generation).
PII Detection | Presidio PII Detection deployment | Detect Personally Identifiable Information (PII) in text using the Microsoft Presidio library.
Prompt Injection | Prompt Injection Classifier deployment | Detect input manipulations, such as overwriting or altering system prompts, intended to modify the model's output.
Sentiment Classifier | Sentiment Classifier deployment | Classify text sentiment as positive or negative.
Toxicity | Toxicity Classifier deployment | Classify content toxicity to apply moderation techniques, safeguarding against dissemination of harmful content.
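
To illustrate the kind of similarity score the Rouge 1 metric reports, the following is a minimal sketch of a unigram-overlap ROUGE-1 calculation. It is not DataRobot's implementation: the function name rouge_1, the lowercase whitespace tokenization, and the choice to return precision, recall, and F1 together are all assumptions made for illustration.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram-overlap ROUGE-1 (illustrative; whitespace tokenization is a simplification)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most as often as it appears in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical LLM response compared against a hypothetical retrieved document.
scores = rouge_1("the cat sat", "the cat sat on the mat")
print(scores)
```

In this example every word of the response appears in the retrieved text, so precision is 1.0, while recall is 0.5 because the response covers only half of the reference unigrams.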

The deployments required for PII detection, prompt injection detection, sentiment classification, and toxicity classification are available as global models in the registry. The following global models are available:

Model | Type | Target | Description
Prompt Injection Classifier | Binary | injection | Classifies text as prompt injection or legitimate. This model requires one column named text, containing the text to classify. For more information, see the deberta-v3-base-injection model details.
Toxicity Classifier | Binary | toxicity | Classifies text as toxic or non-toxic. This model requires one column named text, containing the text to classify. For more information, see the toxic-comment-model details.
Sentiment Classifier | Binary | sentiment | Classifies text sentiment as positive or negative. This model requires one column named text, containing the text to classify. For more information, see the distilbert-base-uncased-finetuned-sst-2-english model details.
Emotions Classifier | Multiclass | target | Classifies text by emotion. This is a multilabel model, meaning that multiple emotions can be applied to the text. This model requires one column named text, containing the text to classify. For more information, see the roberta-base-go_emotions-onnx model details.
Refusal Score | Regression | target | Outputs a maximum similarity score, comparing the input to a list of cases where an LLM has refused to answer a query because the prompt is outside the limits of what the model is configured to answer.
Presidio PII Detection | Binary | contains_pii | Detects and replaces Personally Identifiable Information (PII) in text. This model requires one column named text, containing the text to be classified. The types of PII to detect can optionally be specified in a column named entities, as a comma-separated string. If this column is not specified, all supported entities are detected. Entity types can be found in the PII entities supported by Presidio documentation. In addition to the detection result, the model returns an anonymized_text column, containing an updated version of the input with detected PII replaced with placeholders. For more information, see the Presidio: Data Protection and De-identification SDK documentation.
Zero-shot Classifier | Binary | target | Performs zero-shot classification on text with user-specified labels. This model requires the text to classify in a column named text and the class labels as a comma-separated string in a column named labels. It expects the same set of labels for all rows; therefore, the labels provided in the first row are used. For more information, see the deberta-v3-large-zeroshot-v1 model details.
Python Dummy Binary Classification | Binary | target | Always yields 0.75 for the positive class. For more information, see the python3_dummy_binary model template.
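
The input layout described in the table above (a required text column, plus an optional entities or labels column holding a comma-separated string) can be sketched as follows. This is only an illustration of the column layout, not a DataRobot API: the helper name build_scoring_csv and the sample rows are hypothetical, and PERSON and EMAIL_ADDRESS are used as example Presidio entity types.

```python
import csv
import io


def build_scoring_csv(texts, entities=None, labels=None) -> str:
    """Build CSV text with a 'text' column and optional 'entities'/'labels' columns."""
    fieldnames = ["text"]
    if entities:
        fieldnames.append("entities")
    if labels:
        fieldnames.append("labels")

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for text in texts:
        row = {"text": text}
        # Both optional columns hold a single comma-separated string per row.
        if entities:
            row["entities"] = ",".join(entities)
        if labels:
            # The same labels are repeated on every row, since only the
            # first row's labels are used by the zero-shot model.
            row["labels"] = ",".join(labels)
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical input for the Presidio PII Detection model.
print(build_scoring_csv(["Contact Jane at jane@example.com"],
                        entities=["PERSON", "EMAIL_ADDRESS"]))
```

Because the entities and labels values contain commas, the csv module quotes them so that each stays in a single column.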

Depending on the evaluation metric (or evaluation metric type) selected, different configuration options are required. The settings for each metric or metric type are described below:

For Evaluation dataset metrics, under Add evaluation dataset to Use Case, do one of the following:

Dataset addition method | Description
Upload evaluation dataset | Click + Select dataset and, from the Data Registry panel, select a dataset and click Select dataset, or click Upload to register and select a new dataset.
Generate synthetic data | Enter a Dataset name, select an LLM and Vector database to use when creating synthetic data, and then click Generate. For more information, see Synthetic datasets.
Select from Use Case | Select an Evaluation dataset from the current Use Case.

After you add a dataset, click the Correctness metric tile to apply that metric to LLM responses in the playground. Correctness calculates a response similarity score between the LLM blueprint and the supplied ground-truth evaluation dataset.

When you add an evaluation dataset, it appears on the LLM Evaluation > Evaluation datasets tab, where you can click the Actions menu to Edit evaluation dataset or Delete evaluation dataset:

For the Cost metric, in the row for each LLM type, define a Currency, Input cost in currency amount / tokens amount, and Output cost in currency amount / tokens amount:
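
As a rough sketch of how these settings could translate into a per-response cost, assume each rate is expressed as a currency amount per a given number of tokens, as the field names suggest. The class CostSettings, the function response_cost, and the sample rates below are hypothetical and are not taken from DataRobot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostSettings:
    """Hypothetical container mirroring the Cost metric fields."""
    currency: str
    input_cost: float          # currency amount charged per input_tokens_amount tokens
    input_tokens_amount: int
    output_cost: float         # currency amount charged per output_tokens_amount tokens
    output_tokens_amount: int


def response_cost(settings: CostSettings, input_tokens: int, output_tokens: int) -> float:
    """Cost = input tokens at the input rate plus output tokens at the output rate."""
    input_part = input_tokens * settings.input_cost / settings.input_tokens_amount
    output_part = output_tokens * settings.output_cost / settings.output_tokens_amount
    return input_part + output_part


# Made-up rates: 0.50 per 1,000 input tokens and 1.50 per 1,000 output tokens.
settings = CostSettings("USD", 0.50, 1000, 1.50, 1000)
cost = response_cost(settings, input_tokens=1200, output_tokens=300)
print(f"{cost:.2f} {settings.currency}")
```

With these made-up rates, 1,200 input tokens cost 0.60 and 300 output tokens cost 0.45, for a total of 1.05.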

For the Token Count metric, select the token types to count: Total token count, Citation token count, Input token count, and Output token count:

For the Evaluation Deployment type metrics—Custom Deployment, PII Detection, Prompt Injection, Sentiment Classifier, and Toxicity—configure the following settings:

Field | Description
Name | Enter a unique name if adding multiple instances of the evaluation metric.
Apply to | Select one or both of Prompt and Response, depending on the evaluation metric.
Deployment name | For evaluation metrics calculated by a guard model, select the custom model deployment. For a Custom Deployment, you must also configure the following:
Moderation criteria | Define the criteria that determine when moderation logic is applied.

After you configure a new metric, it appears on the LLM Evaluation > Metrics tab, where you can enable it:

View metrics

The metrics you configure and add to the playground appear on the LLM responses in the playground. You can click the down arrow to open the metric panel for more details:

You can click Citation to view the prompt, response, and a list of citations in the Citation dialog box.

Aggregated metrics

When a playground includes more than one metric, you can begin creating aggregate metrics. Aggregation is the act of combining metrics across many prompts and/or responses, which helps to evaluate a blueprint at a high level (only so much can be learned from evaluating a single prompt/response). Aggregation provides a more comprehensive approach to evaluation.

Aggregation either averages the raw scores, counts the boolean values, or surfaces the number of categories in a multiclass model. DataRobot does this by generating the metrics for each individual prompt/response and then aggregating using one of the methods listed, based on the metric.
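
The three aggregation methods described above can be sketched as follows. This is an illustrative approximation and not DataRobot's code: the function name aggregate is hypothetical, the method is chosen here from the Python type of the per-prompt values, and "surfacing the number of categories" is interpreted as a count per predicted class.

```python
from collections import Counter
from statistics import mean


def aggregate(values: list) -> dict:
    """Aggregate per-prompt metric values by average, boolean count, or category count."""
    if not values:
        raise ValueError("no metric values to aggregate")
    # Check booleans first, because bool is a subclass of int in Python.
    if all(isinstance(v, bool) for v in values):
        true_count = sum(values)
        return {"method": "count", "true": true_count, "false": len(values) - true_count}
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return {"method": "average", "value": mean(values)}
    if all(isinstance(v, str) for v in values):
        return {"method": "categories", "counts": dict(Counter(values))}
    raise TypeError("mixed or unsupported metric value types")


# Hypothetical per-prompt results for three different metrics.
print(aggregate([0.2, 0.4, 0.6]))          # raw scores are averaged
print(aggregate([True, False, True]))      # boolean flags are counted
print(aggregate(["joy", "anger", "joy"]))  # multiclass labels are counted per class
```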

To configure aggregated metrics, click Configure aggregation below the prompt input:

Aggregation job run limit

Only one aggregated metric job can run at a time. If an aggregation job is currently running, the Configure aggregation button is disabled and the "Aggregation job in progress; try again when it completes" tooltip appears.

On the Generate aggregated metrics panel, select metrics to calculate in aggregate and configure the Aggregate by settings. Then, enter a new Chat name, select an Evaluation dataset (to generate prompts in the new chat), and select the LLM blueprints for which the metrics should be generated. These fields are pre-populated based on the current playground:

Evaluation dataset selection

If you select an evaluation dataset metric, like Correctness, you must use the evaluation dataset that was used to create that evaluation dataset metric.

After you complete the Metrics selection and Configuration sections, click Generate metrics. This results in a new chat containing all associated prompts and responses:

View aggregated metrics

Aggregated metrics are run against an evaluation dataset, not individual prompts in a standard chat. Therefore, you can only view aggregated metrics in the generated Aggregated chat, added to the LLM blueprint's All Chats list (on the configuration page).

Evaluation datasets

When you add an evaluation dataset, it appears on the LLM Evaluation > Evaluation datasets tab, where you can click the Actions menu to Edit evaluation dataset or Delete evaluation dataset:

Synthetic datasets

When you add evaluation dataset metrics, DataRobot can use a vector database to generate synthetic datasets, composed of prompt and response pairs, to evaluate your LLM blueprint against. Synthetic datasets are generated by accessing the selected vector database, clustering the vectors, pulling a representative chunk from each cluster, and prompting the selected LLM to generate 100 question and answer pairs based on the document(s):

When you configure the synthetic evaluation dataset settings and click Generate, two events occur sequentially:

  1. A placeholder dataset is registered to the Data Registry with the required columns (question and answer), containing 65 rows and 2 columns of placeholder data (for example, Record for synthetic prompt answer 0, Record for synthetic prompt answer 1, etc.).

  2. The selected LLM and vector database pair generates question and answer pairs, which are added to the Data Registry as a second version of the synthetic evaluation dataset created in step 1. The generation time depends on the selected LLM.

How is the number of question and answer pairs determined?

To generate high-quality and diverse questions, DataRobot runs cosine similarity-based clustering. Similar chunks are grouped into the same cluster and each cluster generates a single question and answer pair. Therefore, if a vector database includes many similar chunks, they'll be grouped into a much smaller number of clusters. When this happens, the number of pairs generated is much lower than the number of chunks in the vector database.
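
To see why many similar chunks collapse into few question and answer pairs, consider the following minimal sketch of cosine similarity-based clustering. The source does not specify DataRobot's clustering algorithm, so the greedy threshold approach, the function names, the 0.9 threshold, and the toy two-dimensional vectors are all assumptions for illustration.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_chunks(vectors, threshold: float = 0.9):
    """Greedy clustering: a chunk joins the first cluster whose representative it resembles."""
    clusters = []  # each cluster is a list of chunk indices; index 0 is the representative
    for i, vector in enumerate(vectors):
        for cluster in clusters:
            if cosine(vector, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append([i])
    return clusters


# Five toy chunk embeddings: three point one way, two point another.
chunks = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [1.0, 0.01]]
clusters = cluster_chunks(chunks)
print(len(chunks), "chunks ->", len(clusters), "clusters")
```

Here five chunks fall into two clusters, so only two question and answer pairs would be generated, one per cluster representative.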


Tracing

Tracing the execution of LLM blueprints is a powerful tool for understanding how most parts of the GenAI stack work. The tracing tab provides a log of all components and prompting activity used in generating LLM responses in the playground.

Insights from tracing provide full context of everything the LLM evaluated, including prompts, VDB chunks, and past interactions within the context window. For example:

  • DataRobot metadata: Reports the timestamp, Use Case, playground, VDB, and blueprint IDs, as well as creator name and base LLM. These help pinpoint the sources of trace records if you need to surface additional information from DataRobot objects interacting with the LLM blueprint.
  • LLM parameters: Shows the parameters used when calling out to an LLM, which is useful for debugging settings such as temperature and the system prompt.
  • Prompts and responses: Provide a history of chats; token count and user feedback provide additional detail.
  • Latency: Highlights issues orchestrating the parts of the LLM blueprint.
  • Token usage: Displays the breakdown of token usage to accurately calculate LLM cost.
  • Evaluations and moderations (if configured): Illustrates how evaluation and moderation metrics are scoring prompts or responses.

Updated June 4, 2024