Multiple LLM blueprint chat comparison¶
The playground's LLM blueprint tab allows you to:
- View all LLM blueprints in the playground.
- Filter, group, and sort the LLM blueprint list.
- View the playground's chat history.
- Create and compare chats (LLM responses).
To use the comparison:
- Create two or more LLM blueprints in the playground.
- From the LLM blueprints tab, select up to three LLM blueprints for comparison.
- Send a prompt from the central prompting window. Each of the blueprints receives the prompt and responds, allowing you to compare responses.
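The fan-out in the last step can be pictured with a minimal sketch. Everything here is hypothetical and for illustration only: `compare`, `Blueprint`, and the stub blueprints are not DataRobot APIs, and each stub simply echoes its system prompt instead of calling an LLM.

```python
from typing import Callable, Dict

# Hypothetical stand-in for an LLM blueprint: a function that takes a
# prompt and returns a response string. Not a DataRobot API.
Blueprint = Callable[[str], str]

MAX_COMPARED = 3  # the playground compares up to three blueprints at once


def compare(blueprints: Dict[str, Blueprint], prompt: str) -> Dict[str, str]:
    """Send one prompt to every selected blueprint and collect the responses."""
    if not 1 <= len(blueprints) <= MAX_COMPARED:
        raise ValueError(f"select between 1 and {MAX_COMPARED} blueprints")
    return {name: respond(prompt) for name, respond in blueprints.items()}


# Stub blueprints that only echo their system prompt; a real one would call an LLM.
selected = {
    "emoji": lambda p: f"[Answer using emojis] {p}",
    "headline": lambda p: f"[Answer in the style of a news headline] {p}",
    "haiku": lambda p: f"[Answer as a haiku] {p}",
}

responses = compare(selected, "Describe DataRobot")
for name, text in responses.items():
    print(f"{name}: {text}")
```

Each selected blueprint receives the identical prompt, so any difference between the responses comes from the blueprint settings rather than from the input.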
Note
You can only do comparison prompting with workflows that you created. To see the results of prompting another user’s LLM blueprint or agentic flow in a shared Use Case, copy the LLM blueprint or connect to the registered agentic flow. You can chat with the same settings applied. This is intentional behavior because prompting impacts chat history, which can impact the responses that are generated. However, you can provide response feedback on the creator's asset to assist development.
Example comparison¶
The following example compares three LLM blueprints that share the same settings except for the system prompt, which changes the style of the response. When configuring each LLM blueprint, first test its system prompt with a sample prompt, for example: Describe the novel Game of Thrones.
- Enter the system prompt Answer using emojis.
- Enter the system prompt Answer in the style of a news headline.
- Enter the system prompt Answer as a haiku.
See also the note on system prompts.
Compare LLM blueprints¶
To compare multiple LLM blueprints:
- From the LLM blueprint tab, check the box next to each blueprint you want to compare.
- Send a prompt (Describe DataRobot). Each LLM blueprint responds in its configured style.
- Try different prompts (Describe a fish taco, for example) to identify the LLM that best suits the business use case.
Interpret results¶
The simplest way to compare LLM blueprints is to read the responses and judge which are most on point. It also helps to review the evaluation metrics returned with each response. Consider:
- Which LLM blueprint has the lowest latency? Is it consistently the fastest across prompt/response sets?
- Which metrics are excluded from some LLM blueprints and why?
- How do results change when you toggle context awareness?
- Do the LLM blueprints use the citations to inform the response effectively?
- Do they respect the system prompt such that the response has the requested tone, format, succinctness, etc.?
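The latency question above can be checked mechanically once the metrics are recorded. The sketch below is a minimal illustration: the latency values are placeholders invented for the example, not measured results, and `fastest_per_prompt` is a hypothetical helper rather than part of any DataRobot API.

```python
from collections import Counter

# Placeholder measurements, invented for illustration only:
# (prompt, blueprint name, latency in seconds)
results = [
    ("Describe DataRobot", "emoji", 1.2),
    ("Describe DataRobot", "headline", 0.9),
    ("Describe DataRobot", "haiku", 1.5),
    ("Describe a fish taco", "emoji", 1.1),
    ("Describe a fish taco", "headline", 1.3),
    ("Describe a fish taco", "haiku", 1.4),
]


def fastest_per_prompt(rows):
    """Return {prompt: name of the blueprint with the lowest latency}."""
    best = {}
    for prompt, name, latency in rows:
        if prompt not in best or latency < best[prompt][1]:
            best[prompt] = (name, latency)
    return {prompt: name for prompt, (name, _) in best.items()}


winners = fastest_per_prompt(results)
wins = Counter(winners.values())

# The ranking is consistent only if one blueprint wins every prompt.
consistent = len(wins) == 1

print(winners)
print("consistent:", consistent)
```

With these placeholder values a different blueprint is fastest for each prompt, which is the kind of inconsistency the first question is meant to surface.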
Change selected LLM blueprints¶
You can add blueprints to the comparison at any time, although the maximum allowed for comparison at one time is three. To add an LLM blueprint, select the checkbox to the left of its name. If three are already selected, remove a current selection first.
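The three-blueprint limit behaves like a capped selection. The following sketch models that rule under stated assumptions: `ComparisonSelection` and the blueprint names are hypothetical and only mirror the checkbox behavior described above; they are not DataRobot code.

```python
MAX_SELECTED = 3  # limit stated in the text above


class ComparisonSelection:
    """Hypothetical model of the checkbox selection; not a DataRobot API."""

    def __init__(self):
        self.selected = []

    def add(self, name):
        """Select a blueprint, refusing a fourth selection."""
        if name in self.selected:
            return  # already checked, nothing to do
        if len(self.selected) >= MAX_SELECTED:
            raise ValueError("three blueprints selected; remove one first")
        self.selected.append(name)

    def remove(self, name):
        """Clear the checkbox for a selected blueprint."""
        self.selected.remove(name)


selection = ComparisonSelection()
for name in ["emoji", "headline", "haiku"]:
    selection.add(name)

try:
    selection.add("limerick")  # a fourth selection is rejected
    rejected = False
except ValueError:
    rejected = True

# Free a slot, then add the new blueprint.
selection.remove("emoji")
selection.add("limerick")
print(selection.selected)
```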
The comparison panel retrieves the comparison history. Because responses have not been returned for the new LLM blueprint, DataRobot provides a button to initiate that action. Click Generate to include the new results.
Consider system prompts¶
System prompts are not guaranteed to be followed completely, and their exact wording matters. For example, compare the results of the system prompts Answer using emojis (EmojiGPT) and Answer using only emojis (OnlyEmojiGPT).
Chats tab¶
A comparison chat groups together one or more comparison prompts, often across multiple blueprints. Use the Chats tab to access any previous comparison prompts made from the playground or to start a new chat. In this way, you can select up to three LLM blueprints, query them, and then swap in other LLM blueprints to send the same prompts and compare results.
Note
In some cases, you will see a chat named Default chat. This entry contains any chats made in the playground before the new playground functionality was released in April 2024. If no chats were initiated, the default chat is empty. If the playground was created after that date, the default chat isn't present, but a New chat is available for prompting.
Rename or delete chats from the entry name.








