Skip to content

On-premise users: click in-app to access the full platform documentation for your version of DataRobot.

LLM custom metrics reference

DataRobot's custom metrics for LLM evaluation provide basic information about prompts and responses, assess LLM performance, and help your organization report on prompt injection and hateful, toxic, or inappropriate content. These metrics can also safeguard against hallucinations, low-confidence responses, and the sharing of personally identifiable information (PII).

Name Description Requires Output type
Performance
ROUGE-1 Measures the similarity between the response generated from an LLM blueprint and the documents retrieved from the vector database. Vector database 0 to 100%
Faithfulness Evaluates whether a language model's generated answer is factually faithful (not a hallucination). Vector database 1 (faithful) or 0 (not faithful)
Correctness Evaluates the correctness and relevance of a generated answer against a reference answer. Evaluation dataset of prompt and response pairs 1 (worst) to 5 (best)
Safety
Prompt Injection Classifier Detects input manipulations, such as overwriting or altering system prompts, that are intended to modify the model's output. Prompt Injection Classifier deployment 0 (not likely prompt injection) to 1 (likely prompt injection)
Sentiment Classifier
[sidecar metric]
Classifies text sentiment as positive or negative using a pre-trained sentiment classification model. Sentiment Classifier deployment Score from 0 (negative sentiment) to 1 (positive sentiment)
Sentiment Classifier
(NLTK)
Calculates the sentiment of text using the Natural Language Toolkit (NLTK) library. -1 (negative sentiment) to 1 (positive sentiment)
PII Detection Identifies and anonymizes Personally Identifiable Information (PII) in text using the Microsoft Presidio Library to preserve individual privacy. Presidio PII Detection deployment 0 (likely no PII) to 1 (likely includes PII)
Japanese PII Occurrence Count Calculates the total number of occurrences of Personally Identifiable Information (PII) in Japanese text using the Microsoft Presidio analyzer library. Number of PII occurrences
Toxicity
[sidecar metric]
Measures the toxicity of text using a pretrained hate speech classification model to safeguard against harmful content. Toxicity Classifier deployment 0 (not likely toxic) to 1 (likely toxic)
Readability
Dale-Chall Readability Measures the U.S. grade level required to understand a text based on the percentage of difficult words and average sentence length. Text must contain at least 100 words. 0 (easy) to 10 (difficult)
Flesch Reading Ease Measures the ease of readability of text based on the average sentence length and average number of syllables per word. Text must contain at least 100 words. 0 (difficult) to 100 (easy)
Operational
Token Count Measures the number of tokens associated with the input to the LLM, output from the LLM, and/or retrieved text from a vector database. Number of tokens
Cost Estimates the financial cost of using the LLM by calculating the number of tokens in the input, output, and retrieved text, and then applying token pricing. Token pricing information Cost in USD
Latency Measures the response latency of the LLM blueprint. Time in seconds
Completion Tokens Mean Calculates the mean number of tokens in completions for the time period requested. The metric tracks the number of tokens using the tiktoken library. (See Token Count.) Average token count
Prompt Tokens Mean Calculates the mean number of tokens in prompts for the time period requested. The metric tracks the number of tokens using the tiktoken library. (See Token Count.) Average token count
Tokens Mean Calculates the mean number of tokens in prompts and completions. The metric tracks the number of tokens using the tiktoken library. (See Token Count.) Average token count
Text
Completion Reading Time Estimates the average time it takes a person to read text generated by the LLM. Time in seconds
Character Count [Japanese] Calculates the total number of Japanese characters in user prompts sent to the LLM. In DataRobot, by default the metric only analyzes prompt text, but the custom metric code can be edited to analyze completions as well. Number of Japanese characters
Sentence Count Calculates the total number of sentences in user prompts and text generated by the LLM. Number of sentences
Syllable Count Calculates the total number of syllables in the words in user prompts and text generated by the LLM. Number of syllables
Word Count Calculates the total number of words in user prompts and text generated by the LLM. Number of words

Performance

Performance metrics evaluate the factual accuracy of LLM responses.

ROUGE-1

Recall-Oriented Understudy for Gisting Evaluation (ROUGE-1) measures the quality of text generated by an LLM by determining whether the generated response uses relevant information from the retrieved context in a vector database (VDB). Specifically, ROUGE-1 assesses the overlap of unigrams (single "words") between the reference text from the VDB and the generated text. This metric is calculated as follows:

\[ \text{ROUGE-1} = \frac{\sum_{S\in\{\text{RefSummaries}\}} \sum_{gram_1 \in S} Count_{match}(gram_1)}{\sum_{S\in\{\text{RefSummaries}\}} \sum_{gram_1 \in S} Count(gram_1)} \]

In this formula, the variables are as follows:

  • \(S\): The set of reference summaries.
  • \(gram_1\): The unigrams in the reference summaries.
  • \(Count\_{match}(gram_1)\): The maximum number of unigrams co-occurring in the candidate summary and reference summary.
  • \(Count(gram_1)\): The number of unigrams in the reference summary.

In other words, ROUGE-1 is calculated as follows:

\[ \text{ROUGE-1} =\frac{\text{Total number of matching unigrams between generated and reference text}}{\text{Total number of unigrams in the reference text}} \]

ROUGE-1 scores range from 0 to 1 (or 0 to 100%), with higher scores indicating more information overlap between the generated response and the retrieved documents. DataRobot implements the metric using the rouge-score library and returns the max of:

  • Precision: The fraction of unigrams in the generated text and the reference text.
  • Recall: The fraction of unigrams in the reference text that are also in the generated text.

Faithfulness

The Faithfulness metric evaluates if the answer generated by a language model is faithful to the source documents in the vector database or if the answer contains hallucinated information not supported by the sources.

The metric uses the LlamaIndex Faithfulness Evaluator, which takes as input:

  • The generated answer.
  • The source documents/passages the answer should be based on.

The evaluator uses a language model (e.g., GPT-4) to analyze if the answer can be supported by the provided sources. It outputs:

  • A binary "passing" score of 1 (Faithful) or 0 (Not faithful).
  • A text explanation of the reasoning behind the faithfulness assessment. Usage of Faithfulness in a playground counts towards user limits on LLM prompting (see GenAI considerations).

The use of Faithfulness as a deployment guardrail requires users to provide their own OpenAI credentials.

Correctness

The Correctness metric evaluates how well a generated answer matches a reference answer. It outputs a score between 1 and 5, where 1 is the worst and 5 is the best.

The evaluation process uses the LlamaIndex Correctness Evaluator to perform the following steps:

  1. Input: The evaluator takes three inputs: a user query, a reference answer, and a generated answer.
  2. Scoring: The evaluator uses a predefined scoring system to assign a score based on the relevance and correctness of the generated answer:

    • Score 1: The generated answer is not relevant to the user query.
    • Score 2-3: The generated answer is relevant but contains mistakes.
    • Score 4-5: The generated answer is relevant and correct.
  3. Output: The evaluator provides both a score and reasoning for the score. A score greater than or equal to a specified threshold (default is 4.0) is considered passing.

The evaluation is conducted through a chat interface, where the system and user prompts are defined to guide the evaluation process. The system prompt instructs the evaluator on how to judge the answers, while the user prompt provides the specific query, reference answer, and generated answer for evaluation.

In DataRobot, Correctness is only available in playgrounds as an aggregated metric against an evaluation dataset. Usage of Correctness also counts towards user limits on LLM prompting (see GenAI considerations).

Safety

Safety metrics are custom metrics used to monitor the security and privacy of LLM responses and, in some cases, moderate output.

Prompt Injection Classifier

The Prompt Injection score uses the deberta-v3-base-injection model to classify whether a given input contains a prompt injection attempt or not. (This model was fine-tuned on the prompt-injections dataset and achieves an accuracy of 99.14% on the evaluation set.) A Prompt Injection Score of 1 indicates the input likely contains a prompt injection attempt, while a score of 0 means the input appears to be a legitimate request. Using the score provides a layer of security to help prevent prompt injection attacks, but some prompt injection attempts may still bypass detection.

In DataRobot, Prompt Injection score calculation requires a deployed Prompt Injection Classifier, available as a global model in the Registry.

Sentiment Classifier

The Sentiment score uses the distilbert-base-uncased-finetuned-sst-2-english model to classify text as either positive or negative sentiment. (This model was fine-tuned on the Stanford Sentiment Treebank SST-2 dataset and achieves an accuracy of 91.3% on the SST-2 dev set.) The model outputs a probability score between 0 and 1, with lower scores indicating more negative sentiment and higher scores indicating more positive sentiment.

In DataRobot, Sentiment score calculation requires a deployed Sentiment Classifier, available as a global model in the Registry.

Sentiment Classifier (NLTK)

The NLTK Sentiment score uses the SentimentIntensityAnalyzer, a pre-trained model in the NLTK library, to determine the sentiment polarity (positive, negative, neutral) and intensity of a given text. The analyzer is based on the VADER (Valence Aware Dictionary and sEntiment Reasoner) lexicon. Sentiment scores range from -1 to 1:

  • -1 represents a completely negative sentiment.
  • 0 represents a neutral sentiment.
  • 1 represents a completely positive sentiment.

The SentimentIntensityAnalyzer computes sentiment scores as follows:

  1. The analyzer looks up each word of the input text in the VADER lexicon, an extensive list of words and their associated sentiment scores ranging from -1 (very negative) to +1 (very positive).
  2. The analyzer considers linguistic features like capitalization, punctuation, negation, and degree modifiers to adjust the sentiment intensity.
  3. The scores for each word are combined using a weighted average. In DataRobot, the Sentiment custom metric template only evaluates English prompts.

PII Detection

The PII Detection score uses the Microsoft Presidio library to detect and anonymize sensitive Personally Identifiable Information (PII) such as:

  • Names
  • Email addresses
  • Phone numbers
  • Credit card numbers
  • Social security numbers
  • Locations
  • Financial data

Presidio uses regular expressions, rule-based logic, checksums, and named entity recognition models to detect PII with relevant context.

By measuring the PII Detection score, you can assess how well a model preserves individual privacy by identifying information that needs to be protected before release or use in downstream applications, which is critical for complying with data protection laws and maintaining trust.

In DataRobot, PII Detection score calculation requires a deployed Presidio PII Detection model, available as a global model in the Registry.

The PII Detection model also supports:

  • Specifying the types of PII to detect in a column, entities, as a comma-separated string. If this column is not specified, all supported entities are detected. Review the PII entities supported by Presidio documentation.
  • Returning the detection result with an anonymized_text column that contains a version of the input with detected PII replaced with placeholders.

Japanese PII Occurrence Count

The Japanese PII Occurrence Count metric uses the Microsoft Presidio library to quantify the presence of sensitive personal information from prompts written in Japanese text. The metric is useful for assessing privacy risks and compliance with data protection laws.

This metric quantifies the presence of PII including, but not limited to:

  • Names
  • Addresses
  • Email addresses
  • Phone numbers
  • Passport numbers

To calculate the metric:

  1. Japanese text is passed to a Presidio analyzer that scans the text and detects instances of PII based on predefined entity recognition models for Japanese.
  2. For each PII entity type detected, the analyzer returns the number of occurrences and sums up the counts across all PII entity types to get the total PII occurrence count. In DataRobot, by default, the metric only analyzes prompt text, but the custom metric code can be edited to analyze completions as well.

Toxicity

The Toxicity score uses the martin-ha/toxic-comment-model to classify the toxicity of text content. The model is a fine-tuned version of the DistilBERT model trained on data from a Kaggle competition.

The model outputs a probability score between 0 and 1, with higher scores indicating more toxic content. Note that the model may perform poorly on text that mentions certain identity subgroups.

In DataRobot, Toxicity score calculation requires a deployed Toxicity Classifier model, available as a global model in the Registry.

Readability

Readability metrics generally measure how many years of education a reader needs to comprehend text, primarily based on word length or whether words are common in day-to-day use.

Dale-Chall Readability

The Dale-Chall Readability score is a readability metric that assesses the difficulty of English text by considering two main factors: the percentage of "difficult" words and the average sentence length. The formula uses approximately 3,000 words that are considered familiar to most 4th-grade American students. Any word not on this list is considered a "difficult word." The score is calculated using the following formula:

\[ \text{Raw Score} = 0.1579 \left( \frac{\text{Number of Difficult Words}}{\text{Total Number of Words}} \times 100 \right) + 0.0496 \left( \frac{\text{Total Number of Words}}{\text{Total Number of Sentences}} \right) \]

If the percentage of difficult words is greater than 5%, an adjustment factor of 3.6365 is added to the raw score. The Dale-Chall readability score maps to grade levels as follows:

Score U.S. Grade Level
4.9 or lower 4th grade or lower
5.0 - 5.9 5th-6th grade
6.0 - 6.9 7th-8th grade
7.0 - 7.9 9th-10th grade
8.0 - 8.9 11th-12th grade
9.0 - 9.9 13th-15th grade (college)
10.0 or higher 16th grade or higher (college graduate)

In DataRobot, the text analyzed must contain at least 100 words to calculate the results.

Flesch Reading Ease

The Flesch Reading Ease score is a readability metric that indicates how easy text is to understand. It uses a formula based on the average number of syllables per word and the average number of words per sentence. The score is calculated using the following formula:

\[ \text{Flesch Reading Ease} = 206.835 - 1.015 \left(\frac{\text{Total Words}}{\text{Total Sentences}}\right) - 84.6 \left(\frac{\text{Total Syllables}}{\text{Total Words}}\right) \]

Scores typically range from 0 to 100, with higher scores indicating easier readability. Scores can be interpreted as follows:

Score Interpretation U.S. Grade Level
90-100 Very Easy 5th grade
80-89 Easy 6th grade
70-79 Fairly Easy 7th grade
60-69 Standard 8th-9th grade
50-59 Fairly Difficult 10th-12th grade
30-49 Difficult College
0-29 Very Difficult College graduate

The score doesn't account for content complexity or technical jargon. In DataRobot, the text being analyzed must contain at least 100 words to calculate results.

Operational

Operational metrics are custom metrics measuring the LLM's system-related statistics.

Token Count

The Token Count metric tracks the number of tokens associated with text using the cl100k_base tokenization scheme provided by the tiktoken library. This tokenization splits text into tokens in a way that is consistent with OpenAI's GPT-3 and GPT-4 language models. Tokens are the basic units that language models process, representing words or parts of words. In general, shorter text will have a lower token count than longer text; however, the exact number of tokens will depend on the specific words and characters used due to special rules for handling punctuation, rare words, and multibyte characters. The Token Count metric helps with managing the text processed by language models for:

  • Cost estimation: API calls are often priced based on token usage.
  • Token limit management: Ensures inputs don't exceed model token limits.
  • Performance monitoring: Token count affects processing time and resource usage.
  • Output length control: Helps manage the length of generated text.

Different tokenization schemes can produce varying token counts for the same text, and there may be limitations when using the cl100k_base encoding with language models from other providers. In DataRobot, a different encoding can be specified as a runtime parameter.

Cost

The Cost metric estimates the expenses incurred when running language models. It considers the number of tokens in the input prompt to the model, the output generated by the model, and any text retrieved from a vector database. The metric uses the cl100k_base tokenization scheme from the tiktoken library to count tokens (see Token Count) and applies two pricing variables:

  • Prompt token price: The cost per token for input and retrieved text.
  • Completion token price: The cost-per-token for the LLM's output.

The metric is useful for managing expenses related to LLM usage by:

  • Estimating API usage costs before making calls to LLM services.
  • Budgeting and resource allocation.
  • Optimizing prompts and retrieval strategies to minimize costs.
  • Comparing different LLM configurations.

In DataRobot, the token prices for prompts and completions should be specified as runtime parameters. Note that token pricing varies between different LLM providers and models, and tiered pricing or volume discounts are not accounted for in the metric by default.

Text

Text metrics provide basic information about prompt or completion text, for example, the number of words or sentences in a response. These metrics are more interpretable than the input and output token counts.

Completion Reading Time

The Completion Reading Time metric estimates the time required for an average person to read the text generated by a language model. This metric is useful for evaluating the length and complexity of text outputs in terms of human readability and time investment.

The metric is calculated using the readtime library with the following formula:

\[ \text{Reading Time} = \left(\frac{\text{Word Count}}{\text{Words Per Minute}}\right) \times 60 + (\text{Image Count} \times \text{Seconds Per Image}) \]

In this formula, the variables are as follows:

  • \(\text{Word Count}\): The number of words in the text.
  • \(\text{Words Per Minute}\): Set to 265 for an average adult's reading speed.
  • \(\text{Image Count}\): The number of images in the content (if applicable).
  • \(\text{Seconds Per Image}\): The estimated time to process an image, starting at 12 seconds and decreasing one second with each image encountered, with a minimum of 3 seconds.

The limitations of this metric are as follows:

  • The metric assumes an average reading speed, which may not accurately represent all users.
  • The complexity of the content is not considered, only its length.
  • The metric does not consider formatting or structure, which can affect actual reading time.

Sentence Count

The Sentence Count metric returns the sum of the number of sentences from prompts and completions. This metric is useful for evaluating the output of language models, ensuring that the generated text meets length and structure requirements.

The metric is calculated using the NLTK library, which uses natural language processing techniques to identify sentence boundaries based on punctuation and other linguistic cues. There may be limitations with accurate sentence detection when used with very short or informal texts or with unconventional writing styles.

Syllable Count

The Syllable Count metric calculates the total number of syllables in the words written while interacting with a language model. This metric is useful for evaluating the linguistic complexity and readability of text.

The metric is calculated using the NLTK library, which involves the following steps:

  1. Tokenization: The text from prompts and completions is broken down into individual words using word_tokenize.
  2. Syllable Counting: For each word, the number of syllables is determined using cmudict (Carnegie Mellon University Pronouncing Dictionary). This dictionary provides phonetic transcriptions of words for syllable counting.
  3. Summation: The syllable counts for all words are summed. Note that there may be limitations with accurate syllable counts depending on the comprehensiveness of the cmudict dictionary.

Word Count

The Word Count metric calculates the total number of words written while interacting with a language model. This metric is useful for the length and complexity of text.

The metric is computed using the NLTK library by tokenizing the text into individual words with word_tokenize and then counting the tokens, excluding punctuation and other non-word characters.

There may be limitations with accurate word counts depending on how the tokenizer handles punctuation, such as splitting contractions.


Updated September 3, 2024