
Metrics overview

The WhyLabs Platform makes it easy to track model and data health across all essential AI use cases. This page gives a high-level overview of the different types of metrics you can track in the platform, along with links to learn more about each. Short code sketches below each table illustrate how these metrics can be captured with the open-source whylogs library.

| Category | Metrics | Use cases | Notes/Resources |
|---|---|---|---|
| Tabular model inputs | Overall: Number of records/rows per time period (hour/day/custom), Number of features in a time period, Data freshness<br/>Continuous: Count, Missing value counts, Cardinality, Distribution (non-parametric histogram), Mean, Median, Max, Min, Schema<br/>Discrete: Count, Cardinality, Missing value counts, Distribution (top 100 frequent item counts), Most common value, Schema | Distribution drift, training-serving skew, missing data, changes in schema | Basic example notebook for logging tabular data metrics |
| Image model inputs (CV) | Overall: Number of records/images per time period (hour/day/custom), Data freshness<br/>Traditional CV features extracted from every model input image: Image brightness (mean, standard deviation), Hue (mean, standard deviation), Saturation (mean, standard deviation), Height/Width, Colorspace<br/>Metadata: EXIF data | Training-serving skew, image quality (e.g. blurry images, obstructed images, desaturated images, switched color channels) | Overview blog with examples of logging image data metrics |
| Text model inputs (basic text) | Overall: Number of records per time period (hour/day/custom), Data freshness<br/>Traditional string features extracted from every model input string/blurb: String length, Word counts, Character counts and distributions, Digit counts and distributions | Training-serving skew, text/string quality (e.g. corrupt text, drift in input length, unexpected characters) | Basic example notebook for logging text/string data metrics |
| LLM prompts and responses (advanced text) | Overall: Number of interaction records per time period (hour/day/custom), Data freshness<br/>Text quality: Readability index, Flesch-Kincaid grade, Flesch reading ease, SMOG index, Syllable count, Lexicon count<br/>Relevance: Semantic similarity between prompt and response, Topic extraction<br/>Toxicity: Sentiment, Toxicity<br/>Security: PII detection, Refusal risk, Jailbreak risk, Abuse risk<br/>Usage: Number of tokens, Latency | Detecting drift in sentiment/toxicity, jailbreaks, abuse, hallucinations | Documentation of the LLM metrics |
| Embeddings | Cosine similarity, Number of clusters, Size of clusters, Distance between centroids | Detecting embedding drift over time | Troubleshooting embedding drift |
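
The linked notebooks cover each input type in depth; as a quick illustration, a minimal sketch of logging tabular input metrics with whylogs might look like the following (the DataFrame and its column names are hypothetical):

```python
import pandas as pd
import whylogs as why

# Hypothetical batch of tabular model inputs
df = pd.DataFrame({
    "age": [34, 51, None, 28],
    "plan": ["basic", "pro", "pro", "basic"],
})

# Profiling the batch captures counts, missing values, cardinality,
# distributions, and inferred schema for each column.
results = why.log(df)

# Inspect the resulting profile locally as a pandas summary.
print(results.view().to_pandas())
```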
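For image inputs, whylogs ships an image extra; the sketch below assumes a recent whylogs release with that extra installed (for example `pip install "whylogs[image]"`) and uses a hypothetical local file path:

```python
from PIL import Image
from whylogs.extras.image_metric import log_image

# Hypothetical image from the model's serving path
img = Image.open("example.jpg")

# Extracts the traditional CV features (brightness, hue, saturation,
# dimensions) plus EXIF metadata into a whylogs profile.
results = log_image(img)
print(results.view().to_pandas())
```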
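For LLM prompts and responses, the metrics above come from the LangKit extension to whylogs. A minimal sketch, assuming LangKit is installed alongside whylogs and using a made-up prompt/response pair:

```python
import whylogs as why
from langkit import llm_metrics

# Schema that wires text quality, relevance, sentiment/toxicity, and
# security metrics into the whylogs profile.
schema = llm_metrics.init()

results = why.log(
    {
        "prompt": "How do I reset my password?",
        "response": "You can reset it from the account settings page.",
    },
    schema=schema,
)
print(results.view().to_pandas())
```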

Performance Metrics (extracted at the model evaluation step)

| Model Type | Metrics | Use cases | Notes/Resources |
|---|---|---|---|
| Binary Classification | AUC, Accuracy, Recall, FPR, Precision, F1, Confusion Matrix, ROC Curve, Precision-Recall Curve | Performance degradation, model A/B testing, performance breakdown across segments/cohorts | Tracking classification model performance |
| Multiclass Classification | Accuracy, F1, Recall, Precision (macro averaged) | Performance degradation, model A/B testing, performance breakdown across segments/cohorts | Tracking classification model performance |
| Regression | Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) | Performance degradation, model A/B testing, performance breakdown across segments/cohorts | Tracking regression model performance |
| Ranking | AUC, Segment AUC, Mean Average Precision (at K), Mean Reciprocal Rank (MRR), NDCG | Performance degradation, model A/B testing, performance breakdown across segments/cohorts | Tracking ranking model performance |
| Summarization (LLM) | ROUGE, text semantic similarity | Performance degradation, model A/B testing, performance breakdown across segments/cohorts | Tracking LLM performance across use cases |
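
The linked performance guides show the full workflow; as a sketch of how binary classification performance metrics can be logged with a recent whylogs 1.x release (the DataFrame and column names here are hypothetical):

```python
import pandas as pd
import whylogs as why

# Hypothetical scored predictions joined with ground truth labels
df = pd.DataFrame({
    "prediction": [1, 0, 1, 1],
    "ground_truth": [1, 0, 0, 1],
    "score": [0.92, 0.18, 0.61, 0.84],
})

# Produces the confusion-matrix-based metrics (accuracy, precision,
# recall, F1) plus the score-based curves (ROC, precision-recall).
results = why.log_classification_metrics(
    df,
    target_column="ground_truth",
    prediction_column="prediction",
    score_column="score",
)

# The resulting profile can then be uploaded to WhyLabs, e.g. with
# results.writer("whylabs").write(), once credentials are configured.
```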