Benchmarks: WhyLabs Secure
This page presents benchmark results for the WhyLabs Secure solution across several datasets.
Last Updated: May 30, 2024
Accuracy Benchmark Results
Metric | Accuracy | F1 | Precision | Recall | Dataset |
---|---|---|---|---|---|
Injections | 0.87 | 0.87 | | | tensor_trust (positive-only) |
Injections | 0.95 | 0.95 | | | JailBreakV-28k (positive-only) |
Injections | 0.73 | | | | PurpleLlama - FRR (negative-only) |
Refusals | 0.95 | 0.82 | 0.90 | 0.75 | chatgpt_refusals |
Sentiment | 0.70 | 0.74 | 0.65 | 0.86 | imdb_sentiment |
Toxicity (default model) | 0.77 | 0.76 | 0.78 | 0.74 | hsol |
Toxicity (detoxify) | 0.82 | 0.82 | 0.83 | 0.82 | hsol |
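For reference, the metrics in the table relate to standard confusion-matrix counts. A minimal sketch (the counts below are illustrative, not taken from the benchmark runs):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Illustrative counts only. On a positive-only dataset (e.g. tensor_trust)
# there are no negatives, so accuracy reduces to recall; on a negative-only
# dataset (PurpleLlama - FRR) it reduces to the true-negative rate, which is
# why those rows report fewer columns.
acc, prec, rec, f1 = classification_metrics(tp=90, fp=10, tn=80, fn=20)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```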
Datasets Information
tensor_trust
- Size: 361 samples
- Source: https://github.com/HumanCompatibleAI/tensor-trust-data (positive samples)
JailBreakV-28k
- Size: 1232 samples
- Source: https://huggingface.co/datasets/JailbreakV-28K/JailBreakV-28k (positive samples)
PurpleLlama - FRR
- Size: 750 samples
- Source: https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks
- Out-of-distribution: No part of this dataset was used for training purposes.
chatgpt_refusals
- Size: 2346 samples (346 positives, 2000 negatives)
- Source:
- Positive samples: https://github.com/maxwellreuter/chatgpt-refusals
- Negative samples: https://huggingface.co/datasets/alespalla/chatbot_instruction_prompts (train split)
imdb_sentiment
- Size: 5000 samples (2506 positive sentiment, 2494 negative sentiment)
- Source: https://huggingface.co/datasets/imdb
hsol
- Size: 5000 samples (2500 positives, 2500 negatives)
- Source: https://paperswithcode.com/dataset/hate-speech-and-offensive-language (train split)
Latency Benchmark Results
Notes:
- Latency was measured on an AWS c5.xlarge instance. The Average column is reported in milliseconds; the P90, P95, and P99 columns are in seconds.
- The P90, P95, and P99 columns represent the 90th, 95th, and 99th percentiles of the latency distribution.
- The latency is measured for a single request.
- The latency may vary depending on the load of the system and the network conditions.
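The percentile columns can be derived from raw per-request latency samples. A hedged sketch using the nearest-rank method (the sample values are synthetic, not benchmark data):

```python
import random


def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]


# Synthetic latencies in seconds, for illustration only.
random.seed(0)
latencies_s = [abs(random.gauss(0.10, 0.02)) for _ in range(1000)]
for p in (90, 95, 99):
    print(f"P{p}: {percentile(latencies_s, p):.3f} s")
```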
The table below shows the average latency of each metric.
Some metrics (e.g., regex-based ones) report a latency of 0 because they are not computationally intensive and their cost falls below the measurement resolution.
Metric | Average (milliseconds) | P90 (seconds) | P95 (seconds) | P99 (seconds) |
---|---|---|---|---|
prompt.toxicity.toxicity_score | 102.158 | 0.136 | 0.143 | 0.15601 |
prompt.similarity.jailbreak | 64.494 | 0.125 | 0.144 | 0.161 |
response.toxicity.toxicity_score | 53.009 | 0.069 | 0.07 | 0.07601 |
prompt.topics.legal|prompt.topics.fishing|promp... | 38.675 | 0.047 | 0.048 | 0.054 |
prompt.topics.misuse1|prompt.topics.misuse2|promp... | 38.675 | 0.047 | 0.048 | 0.054 |
prompt.similarity.injection | 25.978 | 0.033 | 0.036 | 0.042 |
response.pii.phone_number|response.pii.email_ad... | 25.687 | 0.0281 | 0.044 | 0.064 |
prompt.pii.phone_number|prompt.pii.email_addres... | 23.449 | 0.024 | 0.024 | 0.02601 |
prompt.sentiment.sentiment_score | 2.648 | 0.003 | 0.003 | 0.04205 |
response.sentiment.sentiment_score | 1.032 | 0.001 | 0.001 | 0.002 |
response.regex.ssn | 0.04 | 0 | 0 | 0 |
response.stats.token_count | 0.032 | 0 | 0 | 0 |
prompt.stats.token_count | 0.002 | 0 | 0 | 0 |
response.similarity.refusal | 0.001 | 0 | 0 | 0 |
prompt.stats.flesch_reading_ease | 0.001 | 0 | 0 | 0 |
response.stats.syllable_count | 0 | 0 | 0 | 0 |
prompt.regex.credit_card_number | 0 | 0 | 0 | 0 |
prompt.regex.phone_number | 0 | 0 | 0 | 0 |
prompt.regex.ssn | 0 | 0 | 0 | 0 |
prompt.stats.difficult_words | 0 | 0 | 0 | 0 |
prompt.stats.letter_count | 0 | 0 | 0 | 0 |
prompt.stats.lexicon_count | 0 | 0 | 0 | 0 |
prompt.stats.char_count | 0 | 0 | 0 | 0 |
response.stats.char_count | 0 | 0 | 0 | 0 |
response.stats.flesch_reading_ease | 0 | 0 | 0 | 0 |
response.stats.sentence_count | 0 | 0 | 0 | 0 |
response.stats.flesch_kincaid_grade | 0 | 0 | 0 | 0 |
response.regex.mailing_address | 0 | 0 | 0 | 0 |
prompt.regex.email_address | 0 | 0 | 0 | 0 |
response.stats.letter_count | 0 | 0 | 0 | 0 |
response.stats.difficult_words | 0 | 0 | 0 | 0 |
prompt.stats.syllable_count | 0 | 0 | 0 | 0 |
response.regex.phone_number | 0 | 0 | 0 | 0 |
response.regex.email_address | 0 | 0 | 0 | 0 |
response.regex.credit_card_number | 0 | 0 | 0 | 0 |
prompt.stats.flesch_kincaid_grade | 0 | 0 | 0 | 0 |
response.similarity.prompt | 0 | 0 | 0 | 0 |
prompt.stats.sentence_count | 0 | 0 | 0 | 0 |
prompt.regex.mailing_address | 0 | 0 | 0 | 0 |
response.stats.lexicon_count | 0 | 0 | 0 | 0 |
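To illustrate why regex-based metrics such as `response.regex.ssn` report ~0 latency, a precompiled pattern scan over a short text takes only microseconds. The SSN pattern below is illustrative, not the one used by WhyLabs Secure:

```python
import re
import timeit

# Illustrative SSN pattern; not the production pattern.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

text = "Please do not share values like 123-45-6789 in responses."

# Average one search over many iterations to get a stable per-call time.
per_call_s = timeit.timeit(lambda: SSN_RE.search(text), number=10_000) / 10_000
print(f"match found: {bool(SSN_RE.search(text))}, ~{per_call_s * 1e6:.1f} µs/call")
```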