Guardrail Metrics

Overview

Guardrail Metrics are essential for monitoring and evaluating the performance, safety, and efficiency of Large Language Models (LLMs) in the WhyLabs AI Control Center. They make it possible to define a set of boundaries that you expect your LLM to stay within, detect problematic prompts and responses based on a range of metrics, and take appropriate action in the case of a failure.

These metrics are divided into two main categories: Prompt Metrics and Response Metrics:

Prompt Metrics analyze the input given to the model
Response Metrics evaluate the model's output

Many of these metrics are associated with WhyLabs Secure Policy Rulesets (such as Customer Experience, Bad Actor, Cost, Misuse, or Truthfulness), while others stand alone without a particular ruleset assignment.

These metrics provide valuable insights into different aspects of the LLM's operation, ranging from potential security risks and PII detection to readability assessments and cost implications. Together, they offer a comprehensive framework for understanding and optimizing LLM interactions, whether the metrics are tied to specific rulesets or not.

Guardrail Metrics use LangKit, a language processing toolkit that provides a set of tools for analyzing and processing text data. Learn more about LangKit here. It's worth noting that the Guardrail Metrics are a subset of the ones included in the Secure Container Metric library, which can be found here.

Using Guardrail Metrics in the WhyLabs AI Control Center

By default, the WhyLabs AI Control Center Policy uses rulesets composed of multiple Guardrail Metrics that output a normalized metric score for each ruleset. Rulesets are intended to make policy configuration fast and simple, however you can build a custom policy composed of individual metrics and thresholds, validators, and callbacks. Custom policies can be managed via the Guardrails API or via the UI in the WhyLabs Secure Policy page.

Prompt Metrics

prompt.pii.credit_card

Range: 0 - infinite
Ruleset: Customer Experience
Normalized metric name: prompt.score.customer_experience.prompt.pii.credit_card
Description: Detects credit card numbers in the prompt using Microsoft Presidio.
Formula: NER w/ Spacy + Regex. See https://microsoft.github.io/presidio/analyzer/

prompt.pii.email_address

Range: 0 - infinite
Ruleset: Customer Experience
Normalized metric name: prompt.score.customer_experience.prompt.pii.email_address
Description: Detects email addresses in the prompt using Microsoft Presidio.
Formula: NER w/ Spacy + Regex. See https://microsoft.github.io/presidio/analyzer/

prompt.pii.phone_number

Range: 0 - infinite
Ruleset: Customer Experience
Normalized metric name: prompt.score.customer_experience.prompt.pii.phone_number
Description: Detects phone numbers in the prompt using Microsoft Presidio.
Formula: NER w/ Spacy + Regex. See https://microsoft.github.io/presidio/analyzer/

prompt.pii.redacted

Range: 30/70
Ruleset: Customer Experience
Normalized metric name: prompt.score.customer_experience.prompt.pii.redacted
Description: Indicates whether any PII was identified in the prompt. 30 if no PII was found, 70 otherwise
Formula: Redact/Hash/Replace on top of Analyzers above. See https://microsoft.github.io/presidio/anonymizer/

prompt.pii.us_ssn

Range: 0 - infinite
Ruleset: Customer Experience
Normalized metric name: prompt.score.customer_experience.prompt.pii.us_ssn
Description: Detects US SSN numbers in the prompt using Microsoft Presidio.
Formula: NER w/ Spacy + Regex. See https://microsoft.github.io/presidio/analyzer/

prompt.pii.us_bank_number

Range: 0 - infinite
Ruleset: Customer Experience
Normalized metric name: prompt.score.customer_experience.prompt.pii.us_bank_number
Description: Detects US bank numbers in the prompt using Microsoft Presidio.
Formula: NER w/ Spacy + Regex. See https://microsoft.github.io/presidio/analyzer/

prompt.regex.credit_card_number

Range: 0/1
Ruleset: None
Description: Detects credit card numbers in the prompt using a regular expression.
Formula: see pattern file

prompt.regex.email_address

Range: 0/1
Ruleset: None
Description: Detects email addresses in the prompt using a regular expression.
Formula: see pattern file

prompt.regex.mailing_address

Range: 0/1
Ruleset: None
Description: Detects a mailing address in the prompt using a regular expression.
Formula: see pattern file

prompt.regex.phone_number

Range: 0/1
Ruleset: None
Description: Detects phone numbers in the prompt using a regular expression.
Formula: see pattern file

prompt.regex.ssn

Range: 0/1
Ruleset: None
Description: Detects US SSN numbers in the prompt using a regular expression.
Formula: see pattern file

prompt.sentiment.sentiment_score

Range: -1.0 - 1.0
Ruleset: Customer Experience
Normalized metric name: prompt.score.customer_experience.prompt.sentiment.sentiment_score
Description: May indicate the user getting frustrated. Not considered for the overall customer_experience score to avoid blocking negative user prompts. Negative numbers indicate negative sentiment.
Formula: Sentiment analysis module from NLTK (see SentimentIntensityAnalyzer)

prompt.similarity.injection

Range: 0.0 - 1.0
Ruleset: Bad Actor
Normalized metric name: prompt.score.bad_actors.prompt.similarity.injection
Description: Detects prompt injection attacks by calculating cosine similarity to known injections stored in a vector DB. 0 is no injection, 1 is very likely injection.
Formula: Maximum value of cosine similarity scores computed between the embedding of the prompt and the known injections' embeddings.

prompt.similarity.jailbreak

Range: 0.0 - 1.0
Ruleset: None
Normalized metric name: prompt.score.bad_actors.prompt.similarity.jailbreak
Description: Detects jailbreak attacks, however, this metric will be deprecated in the future as the injection metric can detect both injections and jailbreaks. Formula: Maximum value of cosine similarity scores computed between the embedding of the prompt and the known jailbreaks' embeddings.

prompt.stats.char_count

Range: 0 - infinite
Ruleset: Cost
Normalized metric name: prompt.score.cost.prompt.stats.char_count
Description: Prompt character count may impact LLM usage quotas.
Formula: Returns the number of characters present in the given text (textstat function).
Other: Based on textstat

prompt.stats.difficult_words

Range: 0 - infinite
Ruleset: None
Description: This method returns the number of difficult words in the input text. "Difficult" words are those which do not belong to a list of 3000 words that fourth-grade American students can understand.
Formula: Returns the number of difficult words in the input text. Based on textstat
Other: Based on textstat

prompt.stats.flesch_kincaid_grade

Range: 0 - 18
Ruleset: None
Description: Calculates the Flesh Kincaid Grade Level of the prompt (more details about the approach on Wikipedia). This score was designed to indicate how difficult a reading passage is to understand.
Formula: 0.39 (total words / total sentences) + 11.8 (total syllables / total words) - 15.59
Other: Based on textstat

prompt.stats.flesch_reading_ease

Range: 1 - 100
Ruleset: None
Description: Calculates the Flesh Reading Ease score of the prompt (more details about the approach here)
Formula: 206.835 - 1.015 (total words / total sentences) - 84.6 (total syllables / total words)
Other: Based on textstat. Higher scores indicate material that is easier to read; lower numbers mark passages that are more difficult to read.

prompt.stats.letter_count

Range: 0 - infinite
Ruleset: None
Description: Prompt letter count.
Formula: Returns the number of letters (characters excluding punctuation) present in the given text (textstat function).
Other: Based on textstat

prompt.stats.lexicon_count

Range: 0 - infinite
Ruleset: None
Description: This method returns the number of unique words present in the input text.
Formula: Returns the number of unique words (textstat function)
Other: Based on textstat

prompt.stats.sentence_count

Range: 0 - infinite
Ruleset: None
Description: Number of sentences in the prompt.
Formula: Returns the number of sentences (textstat module)
Other: Based on textstat

prompt.stats.syllable_count

Range: 0 - infinite
Ruleset: None
Description: Number of syllables in the prompt.
Formula: Returns the number of syllables (textstat module)
Other: Based on textstat

prompt.stats.token_count

Range: 0 - infinite
Ruleset: Cost
Normalized metric name: prompt.score.cost.prompt.stats.token_count
Description: Token count in the prompt may impact LLM usage quotas.
Formula: Returns the number of tokens using tiktoken - a Byte-Pair Encoding tokenizer from OpenAI.

prompt.topics.*

Range: 0.0 - 1.0
Ruleset: Misuse
Normalized metric name: prompt.score.misuse.prompt.topics.*
Description: Detects undesirable topics in the prompt. Custom topics are supported (example policy here)
Formula:

Standard topics (legal, medical, financial): cosine similarity between the prompt and the topical references
Custom topics: zero-shot classification using MoritzLaurer's Zeroshot model.

Other: Uses MoritzLaurer's Zeroshot

Response Metrics

response.hallucination.hallucination_score

Range: 0.0 - 1.0
Ruleset: Truthfulness
Normalized metric name: response.score.truthfulness.response.hallucination.hallucination_score
Description: Expresses consistency of the LLM responses when prompted multiple times with the same question.
Formula:

Generates additional samples by prompting the LLM with the same question multiple times
Checks the consistency between target and samples with a combination of two methods: a) semantic-similarity b) asking the LLM if it's consistent
The final score is the average between the two methods

response.pii.credit_card

Range: 0/1
Ruleset: Misuse
Normalized metric name: response.score.misuse.response.pii.credit_card
Description: Detects credit card numbers in the response using Microsoft Presidio.

response.pii.email_address

Range: 0/1
Ruleset: Misuse
Normalized metric name: response.score.misuse.response.pii.email_address
Description: Detects email addresses in the response using Microsoft Presidio.

response.pii.phone_number

Range: 0/1
Ruleset: Misuse
Normalized metric name: response.score.misuse.response.pii.phone_number
Description: Detects phone numbers in the response using Microsoft Presidio.

response.pii.redacted

Range: 30/70
Ruleset: Misuse
Normalized metric name: response.score.misuse.response.pii.redacted
Description: Indicates whether any PII was identified in the response. 30 if no PII was found, 70 otherwise.

response.pii.us_ssn

Range: 0/1
Ruleset: Misuse
Normalized metric name: response.score.misuse.response.pii.us_ssn
Description: Detects US SSN numbers in the response using Microsoft Presidio.

response.pii.us_bank_number

Range: 0/1
Ruleset: Misuse
Normalized metric name: response.score.misuse.response.pii.us_bank_number
Description: Detects US bank numbers in the response using Microsoft Presidio.

response.regex.credit_card_number

Range: 0/1
Ruleset: None
Description: Detects credit card numbers in the response using a regular expression.

response.regex.email_address

Range: 0/1
Ruleset: None
Description: Detects email addresses in the response using a regular expression.

response.regex.mailing_address

Range: 0/1
Ruleset: None
Description: Detects a mailing address in the response using a regular expression.

response.regex.phone_number

Range: 0/1
Ruleset: None
Description: Detects phone numbers in the response using a regular expression.

response.regex.ssn

Range: 0/1
Ruleset: None
Description: Detects US SSN numbers in the response using a regular expression.

response.sentiment.sentiment_score

Range: -1.0 - 1.0
Ruleset: Customer Experience
Normalized metric name: response.score.customer_experience.response.sentiment.sentiment_score
Description: LLM responses with negative sentiment may impact user experience. Negative numbers indicate negative sentiment.
Formula: Sentiment analysis module from NLTK (see SentimentIntensityAnalyzer)

response.similarity.context

Range: 0.0 - 1.0
Ruleset: Truthfulness
Normalized metric name: response.score.truthfulness.response.similarity.context
Description: Measures similarity between the response and the RAG-provided context.
Formula: Maximum similarity score between the response embedding and the RAG context items. The embeddings are generated by all-MiniLM-L6-v2.

response.similarity.prompt

Range: 0.0 - 1.0
Ruleset: Truthfulness
Normalized metric name: response.score.truthfulness.similarity.prompt
Description: Measures relevance of the response to the prompt.
Formula: Cosine similarity score computed between the prompt and response embeddings generated by all-MiniLM-L6-v2

response.regex.refusal

Range: 0.0 - 1.0
Ruleset: Customer Experience
Normalized metric name: response.score.customer_experience.response.regex.refusal
Description: LLM refusing to answer the question is impacting the user experience.
Other: Cosine Similarity w/ all-MiniLM-L6-v2

response.stats.char_count

Range: 0 - infinite
Ruleset: Cost
Normalized metric name: response.score.cost.response.stats.char_count
Description: Response character count may impact LLM usage quotas.
Other: Based on textstat

response.stats.difficult_words

Range: 0 - infinite
Ruleset: None
Description: Counts the number of difficult words in the response. "Difficult" words are those which do not belong to a list of 3000 words that fourth-grade American students can understand.
Other: Based on textstat

response.stats.flesch_kincaid_grade

Range: 0 - 18
Ruleset: None
Description: This method returns the Flesch-Kincaid Grade of the response. This score is a readability test designed to indicate how difficult a reading passage is to understand.

response.stats.flesch_reading_ease

Range: 1 - 100
Ruleset: None
Description: This method returns the Flesch Reading Ease score of the response. The score is based on sentence length and word length. Higher scores indicate material that is easier to read; lower numbers mark passages that are more complex. More details about the approach here.
Other: Based on textstat

response.stats.letter_count

Range: 0 - infinite
Ruleset: None
Description: Letter count in the response.
Other: Based on textstat

response.stats.lexicon_count

Range: 0 - infinite
Ruleset: None
Description: This method returns the number of words present in the input text.
Other: Based on textstat

response.stats.sentence_count

Range: 0 - infinite
Ruleset: None
Description: This method returns the number of sentences present in the input text.
Other: Based on textstat

response.stats.syllable_count

Range: 0 - infinite
Ruleset: None
Description: This method returns the number of syllables present in the input text.
Other: Based on textstat

response.stats.token_count

Range: 0 - infinite
Ruleset: Cost
Normalized metric name: response.score.cost.response.stats.token_count
yarnDescription: May impact LLM usage quotas
Other: Uses tiktoken - a Byte-Pair Encoding tokenizer from OpenAI

response.topics.*

Range: 0.0 - 1.0
Rule set: None
Normalized metric name: response.score.misuse.response.topics.*
Description: Semantic similarity between the response and the given topic
Other: Uses MoritzLaurer's Zeroshot

response.toxicity.toxicity_score

Rule set: Customer Experience
Normalized metric name: response.score.customer_experience.response.toxicity.toxicity_score
Description: Indicates toxicity of LLM responses, which likely impact the user experience
Other: Uses toxic-comment-model

Overview​

Using Guardrail Metrics in the WhyLabs AI Control Center​

Prompt Metrics​

prompt.pii.credit_card​

prompt.pii.email_address​

prompt.pii.phone_number​

prompt.pii.redacted​

prompt.pii.us_ssn​

prompt.pii.us_bank_number​

prompt.regex.credit_card_number​

prompt.regex.email_address​

prompt.regex.mailing_address​

prompt.regex.phone_number​

prompt.regex.ssn​

prompt.sentiment.sentiment_score​

prompt.similarity.injection​

prompt.similarity.jailbreak​

prompt.stats.char_count​

prompt.stats.difficult_words​

prompt.stats.flesch_kincaid_grade​

prompt.stats.flesch_reading_ease​

prompt.stats.letter_count​

prompt.stats.lexicon_count​

prompt.stats.sentence_count​

prompt.stats.syllable_count​

prompt.stats.token_count​

prompt.topics.*​

Response Metrics​

response.hallucination.hallucination_score​

response.pii.credit_card​

response.pii.email_address​

response.pii.phone_number​

response.pii.redacted​

response.pii.us_ssn​

response.pii.us_bank_number​

response.regex.credit_card_number​

response.regex.email_address​

response.regex.mailing_address​

response.regex.phone_number​

response.regex.ssn​

response.sentiment.sentiment_score​

response.similarity.context​

response.similarity.prompt​

response.regex.refusal​

response.stats.char_count​

response.stats.difficult_words​

response.stats.flesch_kincaid_grade​

response.stats.flesch_reading_ease​

response.stats.letter_count​

response.stats.lexicon_count​

response.stats.sentence_count​

response.stats.syllable_count​

response.stats.token_count​

response.topics.*​

response.toxicity.toxicity_score​