# Evaluate metrics

> Evaluate metrics - Configure evaluation metrics, add evaluation datasets, and review tracing for
> agentic workflows in a playground.

This Markdown file sits beside the HTML page at the same path (with a `.md` suffix). It summarizes the topic and lists links for tools and LLM context.

Companion generated at `2026-04-24T16:03:56.227953+00:00` (UTC).

## Primary page

- [Evaluate metrics](https://docs.datarobot.com/en/docs/agentic-ai/agentic-eval/agentic-evaluation-tools.html): Full documentation for this topic (HTML).

## Sections on this page

- [Configure evaluation metrics](https://docs.datarobot.com/en/docs/agentic-ai/agentic-eval/agentic-evaluation-tools.html#configure-evaluation-metrics): In-page section heading.
- [View agentic workflow metrics](https://docs.datarobot.com/en/docs/agentic-ai/agentic-eval/agentic-evaluation-tools.html#view-agentic-workflow-metrics): In-page section heading.
- [Configure playground metrics](https://docs.datarobot.com/en/docs/agentic-ai/agentic-eval/agentic-evaluation-tools.html#configure-playground-metrics): In-page section heading.
- [Copy metric configurations](https://docs.datarobot.com/en/docs/agentic-ai/agentic-eval/agentic-evaluation-tools.html#copy-metric-configurations): In-page section heading.
- [Add evaluation datasets](https://docs.datarobot.com/en/docs/agentic-ai/agentic-eval/agentic-evaluation-tools.html#add-evaluation-datasets): In-page section heading.
- [Add aggregated metrics](https://docs.datarobot.com/en/docs/agentic-ai/agentic-eval/agentic-evaluation-tools.html#add-aggregated-metrics): In-page section heading.

## Related documentation

- [Agentic AI](https://docs.datarobot.com/en/docs/agentic-ai/index.html): Linked from this page.
- [Evaluate](https://docs.datarobot.com/en/docs/agentic-ai/agentic-eval/index.html): Linked from this page.
- [Tracing table](https://docs.datarobot.com/en/docs/agentic-ai/agentic-eval/agentic-tracing.html): Linked from this page.
- [Workshop](https://docs.datarobot.com/en/docs/workbench/nxt-registry/nxt-model-workshop/nxt-configure-evaluation-moderation.html): Linked from this page.
- [OTel collector](https://docs.datarobot.com/en/docs/workbench/nxt-console/nxt-monitoring/nxt-otel-metrics.html): Linked from this page.
- [agent templates](https://docs.datarobot.com/en/docs/agentic-ai/agentic-develop/agentic-install.html): Linked from this page.

## Documentation content

# Evaluate metrics

The playground's agentic evaluation tools include evaluation metrics and datasets, aggregated metrics, compliance tests, and tracing. The metric-focused tools are summarized below:

| Agentic workflow evaluation tool | Description |
| --- | --- |
| Evaluation metrics | Report an array of performance, safety, and operational metrics for prompts and responses in the playground and define moderation criteria and actions for any configured metrics. |
| Evaluation datasets | Upload or generate the evaluation datasets used to evaluate an agentic workflow through evaluation dataset metrics and aggregated metrics. |
| Aggregated metrics | Combine evaluation metrics across many prompts and responses to evaluate an agentic workflow at a high level, as only so much can be learned from evaluating a single prompt or response. |
| Tracing table | Trace the execution of agentic workflows through a log of all components and prompting activity used in generating responses in the playground. |

## Configure evaluation metrics

With evaluation metrics, you can configure performance and operational metrics for agents. You can view these metrics in comparison chats and in chats with individual agents.

Playground metrics require reference information provided through an evaluation dataset and are useful for assessing whether an agentic workflow is operating as expected. Because they require an evaluation dataset, they are only available in the playground. Agentic workflow metrics don't require reference data, so they are available in production and configured in the [Workshop](https://docs.datarobot.com/en/docs/workbench/nxt-registry/nxt-model-workshop/nxt-configure-evaluation-moderation.html).

| Playground metrics | Agentic workflow metrics |
| --- | --- |
| Are configured in a playground. | Are configured in Workshop. |
| Require reference data provided as an evaluation dataset. | Don't require reference data. |
| Can't be computed in production. | Can be computed in production. |
| Can only be applied to the top-level agentic workflow. | Can be applied to the top-level agent and to the sub-agents and sub-tools of the workflow (if they are separate custom models). |

> [!NOTE] Agent moderation
> Agentic workflow-specific metrics don't support setting moderation criteria.

### View agentic workflow metrics

To enable agentic workflow metrics for a workflow, configure [evaluation and moderation in Workshop](https://docs.datarobot.com/en/docs/workbench/nxt-registry/nxt-model-workshop/nxt-configure-evaluation-moderation.html). Click the **Evaluation** tile, and then open the **Agentic workflow metrics** tab to see the configured metrics that are enabled for the agentic workflow.

### Configure playground metrics

To enable playground metrics for your workflows, add one or more evaluation metrics to the agentic playground. In addition, you must provide reference data using [evaluation datasets](https://docs.datarobot.com/en/docs/agentic-ai/agentic-eval/agentic-evaluation-tools.html#add-evaluation-datasets).

1. To select and configure playground evaluation metrics for an agentic playground, do either of the following:

    - **Without connected agents**: If you haven't connected any agents to the playground, in the **Evaluate with metrics** tile, click **Configure metrics** to configure metrics before adding an agent.
    - **With connected agents**: If you've added one or more agents to the playground, on the side navigation bar, click the **Evaluation** tile.
2. On the **Evaluation and moderation** page, click the **Playground metrics** tab, then click **Configure metrics**.
3. On the **Configure evaluation and moderation** page, click an **Operational metric**:

    | Playground metric | Description |
    | --- | --- |
    | Agent latency | Total latency of running the agent workflow for a given request. Includes time for completions, tool calls, and metrics calculated by the moderations library. Always available when the agent is configured for OTel. |
    | Agent total tokens | For agents using the LLM gateway, total tokens are reported in OTel; current templates already do this. For agents using a deployed LLM, that LLM must have token count metrics enabled in the moderations configuration. |
    | Agent cost | Available only for calls to deployed LLMs. The deployed LLM must have a cost metric configured in the moderations configuration. |

    > [!NOTE] Operational metrics and Agentic workflow comparison
    > Agent cost, Agent latency, and Agent total tokens use data from the [OTel collector](https://docs.datarobot.com/en/docs/workbench/nxt-console/nxt-monitoring/nxt-otel-metrics.html), which is configured by default in the [agent templates](https://docs.datarobot.com/en/docs/agentic-ai/agentic-develop/agentic-install.html). These metrics aggregate the reported OTel data and require tracing to be enabled to associate the correct spans with that data. Tracing is not available on the Agentic workflow comparison screen, so these operational evaluation metrics are only available when assessing per-agent responses, not when comparing agentic workflows.

    Then, optionally modify the metric **Name** and click **Add**. The **Apply to** setting is preconfigured for these metrics.
4. On the **Configure evaluation and moderation** page, click a **Quality metric**:

    | Playground metric | Description |
    | --- | --- |
    | Agent Goal Accuracy with Reference | Use a known benchmark to evaluate agentic workflow performance in achieving specified objectives. Requires an evaluation dataset containing an agent goal column. |
    | Tool Call Accuracy | Measure agentic workflow performance when identifying and calling the required tools for a given task. Requires an evaluation dataset containing an expected tool calls column. |

    The example evaluation dataset below (`example_evaluation_dataset.csv`) includes an expected tool calls column (`toolCalls`) required by the **Tool Call Accuracy** metric and an agent goal column (`agentGoal`) required by the **Agent Goal Accuracy with Reference** metric:

    ```csv
    id,promptText,expectedResponse,toolCalls,agentGoal
    1,What is the weather like in New York today?,It is 24 C and sunny in New York today.,"[{""name"":""weather_check"",""args"":{""location"":""New York""}},{""name"":""temperature_conversion"",""args"":{""temperature_fahrenheit"":75}}]",A concise answer to a question about weather.
    2,How many planets are in the solar system?,Our solar system has 8 planets.,[],A concise answer to a question about the solar system.
    ```

    In addition, the DataRobot Python client provides utility classes for constructing the expected tool calls column (a plain-Python sketch of the column format follows these steps). Then, configure the following settings, depending on the metric you selected:

    | Playground metric | Description |
    | --- | --- |
    | Agent Goal Accuracy with Reference | (Optional) Enter a metric **Name**. Select a playground or deployed LLM to evaluate goal accuracy. |
    | Tool Call Accuracy | (Optional) Enter a metric **Name**. |

    After configuring the settings, click **Add**. The **Apply to** setting is preconfigured for these metrics.
5. Select and configure another metric, or click **Save configuration**. After you add one or more metrics to the playground configuration, you can edit or delete those metrics.
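
This page notes that the DataRobot Python client provides utility classes for constructing the expected tool calls column, but it doesn't name them, so the following is a plain-Python sketch that produces the JSON format shown in `example_evaluation_dataset.csv`; the `tool_call` helper is illustrative, not a client API.

```python
import json

def tool_call(name: str, **args) -> dict:
    """Illustrative helper: one expected tool call as {"name": ..., "args": {...}}."""
    return {"name": name, "args": args}

# Serialize the expected calls for one row into the string stored in the
# toolCalls column of the evaluation dataset.
expected_calls = [
    tool_call("weather_check", location="New York"),
    tool_call("temperature_conversion", temperature_fahrenheit=75),
]
tool_calls_cell = json.dumps(expected_calls)

# Rows with no expected tool calls store an empty JSON array, as in the
# second row of the example dataset.
no_calls_cell = json.dumps([])  # "[]"

print(tool_calls_cell)
```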

### Copy metric configurations

To copy an evaluation metrics configuration to or from an agentic playground:

1. In the upper-right corner of the **Evaluation and moderation** page, next to **Configure metrics**, click the menu icon, and then click **Copy configuration**.
2. In the **Copy evaluation and moderation configuration** modal, select one of the following options:

    - **From an existing playground**: Choose to **Add to existing configuration** or **Replace existing configuration**, and then select a playground to **Copy from**.
    - **To an existing playground**: Choose to **Add to existing configuration** or **Replace existing configuration**, and then select a playground to **Copy to**.
    - **To a new playground**: Enter a **New playground name**.
3. Select whether you want to **Include evaluation datasets**, and then click **Copy configuration**.

> [!NOTE] Duplicate evaluation metrics
> Selecting **Add to existing configuration** can result in duplicate metrics.

### Add evaluation datasets

To enable playground evaluation metrics and aggregated metrics, you must add one or more evaluation datasets to the playground to serve as reference data. The dataset must be a CSV file registered in the Data Registry, and it must contain at least one text or categorical column.

1. To add evaluation datasets in an agentic playground, do either of the following:

    - If you haven't connected any agents to the playground, in the **Evaluate with metrics** tile, click **Configure metrics**.
    - If you've added one or more agents to the playground, on the side navigation bar, click the **Evaluation** tile.
2. On the **Evaluation and moderation** page, click the **Evaluation datasets** tab to view any existing datasets, or click **Add evaluation dataset** from any tab, and select one of the following methods:

    | Method | Description |
    | --- | --- |
    | Select an existing dataset | Click a dataset in the **Data Registry** table. |
    | Upload a new dataset | Click **Upload** to register and select a new dataset from your local filesystem, or click **Upload from URL**, then enter the **URL** for a hosted dataset and click **Add**. |

    After you select a dataset, in the **Evaluation dataset configuration** sidebar on the right, define the following columns:

    | Column | Description |
    | --- | --- |
    | Prompt column name | The name of the reference dataset column containing the user prompt. |
    | Response (target) column name | The name of the reference dataset column containing an expected agent response. |
    | Reference goals column name | The name of the reference dataset column containing a description of the expected (goal) output of the agent. This data is used by the **Agent Goal Accuracy with Reference** metric. |
    | Reference tools column name | The name of the reference dataset column containing the expected agentic tool calls. This data is used by the **Tool Call Accuracy** metric. |

    Then, click **Add evaluation dataset**. The example evaluation dataset below (`example_evaluation_dataset.csv`) includes an expected tool calls column (`toolCalls`) required by the **Tool Call Accuracy** metric and an agent goal column (`agentGoal`) required by the **Agent Goal Accuracy with Reference** metric; a scripted way to create and register it is sketched after these steps:

    ```csv
    id,promptText,expectedResponse,toolCalls,agentGoal
    1,What is the weather like in New York today?,It is 24 C and sunny in New York today.,"[{""name"":""weather_check"",""args"":{""location"":""New York""}},{""name"":""temperature_conversion"",""args"":{""temperature_fahrenheit"":75}}]",A concise answer to a question about weather.
    2,How many planets are in the solar system?,Our solar system has 8 planets.,[],A concise answer to a question about the solar system.
    ```

    In addition, the DataRobot Python client provides utility classes for constructing the expected tool calls column.
3. After you add an evaluation dataset, it appears on the **Evaluation datasets** tab of the **Evaluation and moderation** page.
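
To script dataset creation instead of uploading through the UI, the sketch below writes the example dataset with Python's standard `csv` module and then registers it in the Data Registry via the DataRobot Python client. The environment-based `dr.Client()` configuration and the `Dataset.create_from_file` call are assumptions of this sketch rather than steps taken from this page; verify them against your client version.

```python
import csv

import datarobot as dr

rows = [
    {
        "id": 1,
        "promptText": "What is the weather like in New York today?",
        "expectedResponse": "It is 24 C and sunny in New York today.",
        "toolCalls": '[{"name":"weather_check","args":{"location":"New York"}},'
                     '{"name":"temperature_conversion","args":{"temperature_fahrenheit":75}}]',
        "agentGoal": "A concise answer to a question about weather.",
    },
    {
        "id": 2,
        "promptText": "How many planets are in the solar system?",
        "expectedResponse": "Our solar system has 8 planets.",
        "toolCalls": "[]",
        "agentGoal": "A concise answer to a question about the solar system.",
    },
]

# Write the CSV with the prompt, response, tool call, and goal columns that
# the playground metrics expect.
with open("example_evaluation_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)

# Register the file in the Data Registry so it can be selected as an
# evaluation dataset. dr.Client() reads DATAROBOT_ENDPOINT and
# DATAROBOT_API_TOKEN from the environment.
dr.Client()
dataset = dr.Dataset.create_from_file(file_path="example_evaluation_dataset.csv")
print(dataset.id)
```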

## Add aggregated metrics

When a playground includes more than one metric, you can begin creating aggregated metrics. Aggregation combines metric results across many prompts and responses, which helps you evaluate agents at a high level; only so much can be learned from a single prompt or response. Aggregation therefore provides a more comprehensive approach to evaluation.

Aggregation either averages the raw scores, counts the boolean values, or surfaces the number of occurrences of each category for a multiclass metric. DataRobot generates the metrics for each individual prompt and response, and then aggregates them using one of those methods, depending on the metric.
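
As a rough illustration of those three methods (not DataRobot's internal implementation), the sketch below averages numeric scores, counts boolean flags, and tallies category occurrences; the `aggregate` function and its inputs are hypothetical.

```python
from collections import Counter
from statistics import mean

def aggregate(scores: list):
    """Illustrative aggregation over per-prompt/response metric scores."""
    if all(isinstance(s, bool) for s in scores):
        # Boolean metrics: count the True values. (Checked before numerics,
        # because bool is a subclass of int in Python.)
        return sum(scores)
    if all(isinstance(s, (int, float)) for s in scores):
        # Numeric metrics (e.g., ROUGE-1): average the raw scores.
        return mean(scores)
    # Multiclass metrics: surface how often each category occurred.
    return Counter(scores)

print(aggregate([0.41, 0.63, 0.58]))         # numeric -> mean (0.54)
print(aggregate([True, False, True]))        # boolean -> count of True (2)
print(aggregate(["safe", "toxic", "safe"]))  # categorical -> per-category counts
```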

To configure aggregated metrics for an agentic playground:

1. In the agentic playground, click **Configure aggregation** below the prompt input (from the **Workflows** tab, or from an individual agent's **Chats** tab):

    - **Workflows tab**: Each agentic workflow selected is included in the aggregation job.
    - **Chats tab**: From the **Chats** tab for a single agent, only the current agentic workflow is included in the aggregation job.

    > [!NOTE] Aggregation job run limit
    > Only one aggregated metric job can run at a time. If an aggregation job is currently running, the **Configure aggregation** button is disabled and the "Aggregation job in progress; try again when it completes" tooltip appears.
2. On the **Generate aggregated metrics** panel, select metrics to include in aggregation and configure the **Aggregate by** settings. In the right-hand panel, enter a new **Chat name**, select an **Evaluation dataset** (to generate prompts in the new chat), and select the **Workflows** for which the metrics should be generated. These fields are pre-populated based on the current playground. For example, **Agent Goal Accuracy with Reference** and **Tool Call Accuracy** are playground metrics, while **ROUGE-1** and **Response Tokens** are agentic workflow metrics (from [Workshop](https://docs.datarobot.com/en/docs/workbench/nxt-registry/nxt-model-workshop/nxt-configure-evaluation-moderation.html)).

    After you complete the **Metrics selection** and **Configuration** sections, click **Generate metrics**. This results in a chat, identified as a **Metric** chat, containing all associated prompts and responses. Aggregated metrics are run against an evaluation dataset, not individual prompts in a standard chat; therefore, you can only view aggregated metrics in the generated aggregated metrics chat, added to the agent's **All Chats** list (on the agent's individual **Chats** tab).

    > [!NOTE] Aggregation metric calculation for multiple agents
    > If many agents are included in the metric aggregation request, aggregated metrics are computed sequentially, agent-by-agent.
3. Once an aggregated chat is generated, you can explore the resulting aggregated metrics, scores, and related assets on the **Agentic aggregated metrics** and **Evaluation aggregated metrics** tabs. These tabs are available when comparing agentic chats and when viewing a single-agent chat. You can filter by **Aggregation method**, **Evaluation dataset**, and **Metric**.
