Guardrail Metrics
Overview
Guardrail Metrics are essential for monitoring and evaluating the performance, safety, and efficiency of Large Language Models (LLMs) in the WhyLabs AI Control Center. They make it possible to define a set of boundaries that you expect your LLM to stay within, detect problematic prompts and responses based on a range of metrics, and take appropriate action in the case of a failure.
These metrics are divided into two main categories: Prompt Metrics and Response Metrics:
- Prompt Metrics analyze the input given to the model
- Response Metrics evaluate the model's output
Many of these metrics are associated with WhyLabs Secure Policy Rulesets (such as Customer Experience, Bad Actor, Cost, Misuse, or Truthfulness), while others stand alone without a particular ruleset assignment.
These metrics provide valuable insights into different aspects of the LLM's operation, ranging from potential security risks and PII detection to readability assessments and cost implications. Together, they offer a comprehensive framework for understanding and optimizing LLM interactions, whether the metrics are tied to specific rulesets or not.
Guardrail Metrics use LangKit, a language processing toolkit that provides a set of tools for analyzing and processing text data. Learn more about LangKit here. It's worth noting that the Guardrail Metrics are a subset of the ones included in the Secure Container Metric library, which can be found here.
Using Guardrail Metrics in the WhyLabs AI Control Center
By default, the WhyLabs AI Control Center Policy uses rulesets composed of multiple Guardrail Metrics that output a normalized metric score for each ruleset. Rulesets are intended to make policy configuration fast and simple, however you can build a custom policy composed of individual metrics and thresholds, validators, and callbacks. Custom policies can be managed via the Guardrails API or via the UI in the WhyLabs Secure Policy page.
Prompt Metrics
prompt.pii.credit_card
Range: 0 - infinite
Ruleset: Customer Experience
Normalized metric name: prompt.score.customer_experience.prompt.pii.credit_card
Description: Detects credit card numbers in the prompt using Microsoft Presidio.
Formula: NER w/ Spacy + Regex. See https://microsoft.github.io/presidio/analyzer/
prompt.pii.email_address
Range: 0 - infinite
Ruleset: Customer Experience
Normalized metric name: prompt.score.customer_experience.prompt.pii.email_address
Description: Detects email addresses in the prompt using Microsoft Presidio.
Formula: NER w/ Spacy + Regex. See https://microsoft.github.io/presidio/analyzer/
prompt.pii.phone_number
Range: 0 - infinite
Ruleset: Customer Experience
Normalized metric name: prompt.score.customer_experience.prompt.pii.phone_number
Description: Detects phone numbers in the prompt using Microsoft Presidio.
Formula: NER w/ Spacy + Regex. See https://microsoft.github.io/presidio/analyzer/
prompt.pii.redacted
Range: 30/70
Ruleset: Customer Experience
Normalized metric name: prompt.score.customer_experience.prompt.pii.redacted
Description: Indicates whether any PII was identified in the prompt. 30 if no PII was found, 70 otherwise
Formula: Redact/Hash/Replace on top of Analyzers above. See https://microsoft.github.io/presidio/anonymizer/
prompt.pii.us_ssn
Range: 0 - infinite
Ruleset: Customer Experience
Normalized metric name: prompt.score.customer_experience.prompt.pii.us_ssn
Description: Detects US SSN numbers in the prompt using Microsoft Presidio.
Formula: NER w/ Spacy + Regex. See https://microsoft.github.io/presidio/analyzer/
prompt.pii.us_bank_number
Range: 0 - infinite
Ruleset: Customer Experience
Normalized metric name: prompt.score.customer_experience.prompt.pii.us_bank_number
Description: Detects US bank numbers in the prompt using Microsoft Presidio.
Formula: NER w/ Spacy + Regex. See https://microsoft.github.io/presidio/analyzer/
prompt.regex.credit_card_number
Range: 0/1
Ruleset: None
Description: Detects credit card numbers in the prompt using a regular expression.
Formula: see pattern file
prompt.regex.email_address
Range: 0/1
Ruleset: None
Description: Detects email addresses in the prompt using a regular expression.
Formula: see pattern file
prompt.regex.mailing_address
Range: 0/1
Ruleset: None
Description: Detects a mailing address in the prompt using a regular expression.
Formula: see pattern file
prompt.regex.phone_number
Range: 0/1
Ruleset: None
Description: Detects phone numbers in the prompt using a regular expression.
Formula: see pattern file
prompt.regex.ssn
Range: 0/1
Ruleset: None
Description: Detects US SSN numbers in the prompt using a regular expression.
Formula: see pattern file
prompt.sentiment.sentiment_score
Range: -1.0 - 1.0
Ruleset: Customer Experience
Normalized metric name: prompt.score.customer_experience.prompt.sentiment.sentiment_score
Description: May indicate the user getting frustrated. Not considered for the overall customer_experience score to avoid blocking negative user prompts. Negative numbers indicate negative sentiment.
Formula: Sentiment analysis module from NLTK (see SentimentIntensityAnalyzer)
prompt.similarity.injection
Range: 0.0 - 1.0
Ruleset: Bad Actor
Normalized metric name: prompt.score.bad_actors.prompt.similarity.injection
Description: Detects prompt injection attacks by calculating cosine similarity to known injections stored in a vector DB. 0 is no injection, 1 is very likely injection.
Formula: Maximum value of cosine similarity scores computed between the embedding of the prompt and the known injections' embeddings.
prompt.similarity.jailbreak
Range: 0.0 - 1.0
Ruleset: None
Normalized metric name: prompt.score.bad_actors.prompt.similarity.jailbreak
Description: Detects jailbreak attacks, however, this metric will be deprecated in the future as the injection metric can detect both injections and jailbreaks.
Formula: Maximum value of cosine similarity scores computed between the embedding of the prompt and the known jailbreaks' embeddings.
prompt.stats.char_count
Range: 0 - infinite
Ruleset: Cost
Normalized metric name: prompt.score.cost.prompt.stats.char_count
Description: Prompt character count may impact LLM usage quotas.
Formula: Returns the number of characters present in the given text (textstat function).
Other: Based on textstat
prompt.stats.difficult_words
Range: 0 - infinite
Ruleset: None
Description: This method returns the number of difficult words in the input text. "Difficult" words are those which do not belong to a list of 3000 words that fourth-grade American students can understand.
Formula: Returns the number of difficult words in the input text. Based on textstat
Other: Based on textstat
prompt.stats.flesch_kincaid_grade
Range: 0 - 18
Ruleset: None
Description: Calculates the Flesh Kincaid Grade Level of the prompt (more details about the approach on Wikipedia). This score was designed to indicate how difficult a reading passage is to understand.
Formula: 0.39 (total words / total sentences) + 11.8 (total syllables / total words) - 15.59
Other: Based on textstat
prompt.stats.flesch_reading_ease
Range: 1 - 100
Ruleset: None
Description: Calculates the Flesh Reading Ease score of the prompt (more details about the approach here)
Formula: 206.835 - 1.015 (total words / total sentences) - 84.6 (total syllables / total words)
Other: Based on textstat. Higher scores indicate material that is easier to read; lower numbers mark passages that are more difficult to read.
prompt.stats.letter_count
Range: 0 - infinite
Ruleset: None
Description: Prompt letter count.
Formula: Returns the number of letters (characters excluding punctuation) present in the given text (textstat function).
Other: Based on textstat
prompt.stats.lexicon_count
Range: 0 - infinite
Ruleset: None
Description: This method returns the number of unique words present in the input text.
Formula: Returns the number of unique words (textstat function)
Other: Based on textstat
prompt.stats.sentence_count
Range: 0 - infinite
Ruleset: None
Description: Number of sentences in the prompt.
Formula: Returns the number of sentences (textstat module)
Other: Based on textstat
prompt.stats.syllable_count
Range: 0 - infinite
Ruleset: None
Description: Number of syllables in the prompt.
Formula: Returns the number of syllables (textstat module)
Other: Based on textstat
prompt.stats.token_count
Range: 0 - infinite
Ruleset: Cost
Normalized metric name: prompt.score.cost.prompt.stats.token_count
Description: Token count in the prompt may impact LLM usage quotas.
Formula: Returns the number of tokens using tiktoken - a Byte-Pair Encoding tokenizer from OpenAI.
prompt.topics.*
Range: 0.0 - 1.0
Ruleset: Misuse
Normalized metric name: prompt.score.misuse.prompt.topics.*
Description: Detects undesirable topics in the prompt. Custom topics are supported (example policy here)
Formula:
- Standard topics (legal, medical, financial): cosine similarity between the prompt and the topical references
- Custom topics: zero-shot classification using MoritzLaurer's Zeroshot model.
Other: Uses MoritzLaurer's Zeroshot
Response Metrics
response.hallucination.hallucination_score
Range: 0.0 - 1.0
Ruleset: Truthfulness
Normalized metric name: response.score.truthfulness.response.hallucination.hallucination_score
Description: Expresses consistency of the LLM responses when prompted multiple times with the same question.
Formula:
- Generates additional samples by prompting the LLM with the same question multiple times
- Checks the consistency between target and samples with a combination of two methods: a) semantic-similarity b) asking the LLM if it's consistent
- The final score is the average between the two methods
response.pii.credit_card
Range: 0/1
Ruleset: Misuse
Normalized metric name: response.score.misuse.response.pii.credit_card
Description: Detects credit card numbers in the response using Microsoft Presidio.
response.pii.email_address
Range: 0/1
Ruleset: Misuse
Normalized metric name: response.score.misuse.response.pii.email_address
Description: Detects email addresses in the response using Microsoft Presidio.
response.pii.phone_number
Range: 0/1
Ruleset: Misuse
Normalized metric name: response.score.misuse.response.pii.phone_number
Description: Detects phone numbers in the response using Microsoft Presidio.
response.pii.redacted
Range: 30/70
Ruleset: Misuse
Normalized metric name: response.score.misuse.response.pii.redacted
Description: Indicates whether any PII was identified in the response. 30 if no PII was found, 70 otherwise.
response.pii.us_ssn
Range: 0/1
Ruleset: Misuse
Normalized metric name: response.score.misuse.response.pii.us_ssn
Description: Detects US SSN numbers in the response using Microsoft Presidio.
response.pii.us_bank_number
Range: 0/1
Ruleset: Misuse
Normalized metric name: response.score.misuse.response.pii.us_bank_number
Description: Detects US bank numbers in the response using Microsoft Presidio.
response.regex.credit_card_number
Range: 0/1
Ruleset: None
Description: Detects credit card numbers in the response using a regular expression.
response.regex.email_address
Range: 0/1
Ruleset: None
Description: Detects email addresses in the response using a regular expression.
response.regex.mailing_address
Range: 0/1
Ruleset: None
Description: Detects a mailing address in the response using a regular expression.
response.regex.phone_number
Range: 0/1
Ruleset: None
Description: Detects phone numbers in the response using a regular expression.
response.regex.ssn
Range: 0/1
Ruleset: None
Description: Detects US SSN numbers in the response using a regular expression.
response.sentiment.sentiment_score
Range: -1.0 - 1.0
Ruleset: Customer Experience
Normalized metric name: response.score.customer_experience.response.sentiment.sentiment_score
Description: LLM responses with negative sentiment may impact user experience. Negative numbers indicate negative sentiment.
Formula: Sentiment analysis module from NLTK (see SentimentIntensityAnalyzer)
response.similarity.context
Range: 0.0 - 1.0
Ruleset: Truthfulness
Normalized metric name: response.score.truthfulness.response.similarity.context
Description: Measures similarity between the response and the RAG-provided context.
Formula: Maximum similarity score between the response embedding and the RAG context items. The embeddings are generated by all-MiniLM-L6-v2.
response.similarity.prompt
Range: 0.0 - 1.0
Ruleset: Truthfulness
Normalized metric name: response.score.truthfulness.similarity.prompt
Description: Measures relevance of the response to the prompt.
Formula: Cosine similarity score computed between the prompt and response embeddings generated by all-MiniLM-L6-v2
response.regex.refusal
Range: 0.0 - 1.0
Ruleset: Customer Experience
Normalized metric name: response.score.customer_experience.response.regex.refusal
Description: LLM refusing to answer the question is impacting the user experience.
Other: Cosine Similarity w/ all-MiniLM-L6-v2
response.stats.char_count
Range: 0 - infinite
Ruleset: Cost
Normalized metric name: response.score.cost.response.stats.char_count
Description: Response character count may impact LLM usage quotas.
Other: Based on textstat
response.stats.difficult_words
Range: 0 - infinite
Ruleset: None
Description: Counts the number of difficult words in the response. "Difficult" words are those which do not belong to a list of 3000 words that fourth-grade American students can understand.
Other: Based on textstat
response.stats.flesch_kincaid_grade
Range: 0 - 18
Ruleset: None
Description: This method returns the Flesch-Kincaid Grade of the response. This score is a readability test designed to indicate how difficult a reading passage is to understand.
response.stats.flesch_reading_ease
Range: 1 - 100
Ruleset: None
Description: This method returns the Flesch Reading Ease score of the response. The score is based on sentence length and word length. Higher scores indicate material that is easier to read; lower numbers mark passages that are more complex. More details about the approach here.
Other: Based on textstat
response.stats.letter_count
Range: 0 - infinite
Ruleset: None
Description: Letter count in the response.
Other: Based on textstat
response.stats.lexicon_count
Range: 0 - infinite
Ruleset: None
Description: This method returns the number of words present in the input text.
Other: Based on textstat
response.stats.sentence_count
Range: 0 - infinite
Ruleset: None
Description: This method returns the number of sentences present in the input text.
Other: Based on textstat
response.stats.syllable_count
Range: 0 - infinite
Ruleset: None
Description: This method returns the number of syllables present in the input text.
Other: Based on textstat
response.stats.token_count
Range: 0 - infinite
Ruleset: Cost
Normalized metric name: response.score.cost.response.stats.token_count
yarnDescription: May impact LLM usage quotas
Other: Uses tiktoken - a Byte-Pair Encoding tokenizer from OpenAI
response.topics.*
Range: 0.0 - 1.0
Rule set: None
Normalized metric name: response.score.misuse.response.topics.*
Description: Semantic similarity between the response and the given topic
Other: Uses MoritzLaurer's Zeroshot
response.toxicity.toxicity_score
Rule set: Customer Experience
Normalized metric name: response.score.customer_experience.response.toxicity.toxicity_score
Description: Indicates toxicity of LLM responses, which likely impact the user experience
Other: Uses toxic-comment-model