Metrics overview
The WhyLabs Platform makes it easy to track model and data health across all essential AI use cases. This page gives a broad overview of the types of metrics you can track in the platform, along with links for exploring each metric category in more depth.
Category | Metrics | Use cases | Notes/Resources |
---|---|---|---|
Tabular model inputs | Overall: Number of records/rows per time period (hour/day/custom), Number of features in a time period, Data freshness. Continuous: Count, Missing value counts, Cardinality, Distribution (non-parametric histogram), Mean, Median, Max, Min, Schema. Discrete: Count, Cardinality, Missing value counts, Distribution (top 100 frequent item counts), Most common value, Schema | Distribution drift, training-serving skew, missing data, changes in schema | Basic example notebook for logging tabular data metrics (see the whylogs sketch below this table) |
Image model inputs (CV) | Overall: Number of records/images per time period (hour/day/custom), Data freshness. Traditional CV features extracted from every model input image: Image brightness (mean, standard deviation), Hue (mean, standard deviation), Saturation (mean, standard deviation), Height/Width, Colorspace. Metadata: EXIF data | Training-serving skew, image quality (e.g. blurry images, obstructed images, desaturated images, switched color channels) | Overview blog with examples of logging image data metrics (see the image-feature sketch below this table) |
Text model inputs (basic text) | Overall: Number of records per time period (hour/day/custom), Data freshness. Traditional string features extracted from every model input string/blurb: String length, Word counts, Character counts and distributions, Digit counts and distributions | Training-serving skew, text/string quality (e.g. corrupt text, drift in input length, unexpected characters) | Basic example notebook for logging text/string data metrics (see the text-feature sketch below this table) |
LLM model prompts and responses (advanced text) | Overall: Number of interaction records per time period (hour/day/custom), Data freshness. Text quality: Readability index, Flesch-Kincaid grade, Flesch reading ease, SMOG index, Syllable count, Lexicon count. Relevance: Semantic similarity between prompt and response, Topic extraction. Toxicity: Sentiment, Toxicity. Security: PII detection, refusal risk, jailbreak risk, abuse risk. Usage: Number of tokens, latency | Detecting drift in sentiment/toxicity, jailbreaks, abuse, hallucinations | Documentation of the LLM metrics (see the text-quality sketch below this table) |
Embeddings | Cosine similarity, Number of clusters, Size of clusters, Distance between centroids | Detecting embedding drift over time | Troubleshooting embedding drift (see the embedding sketch below this table) |
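The sketches below are minimal, illustrative examples of how some of the metric groups in this table can be computed or logged. They rely on common open-source tooling (whylogs, Pillow, textstat, scikit-learn) and toy data; they are not the platform's internal implementation.

For the tabular row, a minimal whylogs (v1 API) sketch that profiles a small batch; the DataFrame and its column names are made up for illustration:

```python
# Minimal sketch: profiling a tabular batch with whylogs (v1 API).
# The DataFrame and column names are illustrative only.
import pandas as pd
import whylogs as why

df = pd.DataFrame(
    {
        "loan_amount": [1200.0, 530.5, None, 980.0],  # continuous feature
        "home_state": ["CA", "NY", "CA", "TX"],        # discrete feature
    }
)

# why.log profiles the batch: counts, missing values, cardinality,
# distribution sketches, and the inferred schema per column.
results = why.log(df)
profile_view = results.view()

# Inspect the profiled metrics locally as a pandas DataFrame.
print(profile_view.to_pandas())

# To send the profile to WhyLabs, attach a writer (requires WhyLabs
# credentials in the environment), e.g.:
# results.writer("whylabs").write()
```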
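For the image row, a sketch that computes a few of the listed features (brightness, hue, and saturation statistics, height/width, colorspace) with Pillow and NumPy; this illustrates the metrics themselves rather than the platform's extraction pipeline, and the file path is hypothetical:

```python
# Minimal sketch: extracting basic CV features (brightness, hue, saturation
# statistics, height/width, colorspace) from one image with Pillow + NumPy.
import numpy as np
from PIL import Image

def image_features(path: str) -> dict:
    img = Image.open(path)
    width, height = img.size

    # Convert to HSV so hue, saturation, and value (brightness) are separate channels.
    hsv = np.asarray(img.convert("RGB").convert("HSV"), dtype=np.float64)
    hue, saturation, value = hsv[..., 0], hsv[..., 1], hsv[..., 2]

    return {
        "width": width,
        "height": height,
        "colorspace": img.mode,
        "brightness_mean": value.mean(),
        "brightness_std": value.std(),
        "hue_mean": hue.mean(),
        "hue_std": hue.std(),
        "saturation_mean": saturation.mean(),
        "saturation_std": saturation.std(),
    }

# features = image_features("example.jpg")  # path is illustrative
```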
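For the basic text row, the listed string features (string length, word count, character and digit counts) can be computed in plain Python before logging; the feature names are made up for illustration:

```python
# Minimal sketch: basic string features (string length, word count,
# character and digit counts) computed per input text.
def text_features(text: str) -> dict:
    return {
        "string_length": len(text),
        "word_count": len(text.split()),
        "char_count": sum(ch.isalpha() for ch in text),
        "digit_count": sum(ch.isdigit() for ch in text),
    }

print(text_features("Order #42 shipped to 221B Baker Street"))
# {'string_length': 38, 'word_count': 7, 'char_count': 26, 'digit_count': 5}
```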
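For the LLM row, the text-quality statistics (Flesch-Kincaid grade, Flesch reading ease, SMOG index, syllable count, lexicon count) can be computed with the open-source textstat package, used here purely to illustrate what those metrics measure:

```python
# Minimal sketch: readability/text-quality statistics for an LLM response,
# computed with the open-source `textstat` package. The text is illustrative.
import textstat

response = "The model forecasts demand using recent sales and seasonality."

quality = {
    "flesch_kincaid_grade": textstat.flesch_kincaid_grade(response),
    "flesch_reading_ease": textstat.flesch_reading_ease(response),
    "smog_index": textstat.smog_index(response),
    "syllable_count": textstat.syllable_count(response),
    "lexicon_count": textstat.lexicon_count(response),
}
print(quality)
```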
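For the embeddings row, cluster-based summaries (number and size of clusters, centroid comparisons via cosine similarity) can be sketched with scikit-learn and NumPy; the data is synthetic and the cluster matching is deliberately naive:

```python
# Minimal sketch: summarizing an embedding batch with k-means and comparing
# centroids against a reference batch via cosine similarity (synthetic data).
import numpy as np
from sklearn.cluster import KMeans

def summarize(embeddings: np.ndarray, n_clusters: int = 5):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    sizes = np.bincount(km.labels_, minlength=n_clusters)  # size of each cluster
    return km.cluster_centers_, sizes

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 64))         # e.g. yesterday's embeddings
current = rng.normal(loc=0.3, size=(1000, 64))  # e.g. today's embeddings (shifted)

ref_centroids, _ = summarize(reference)
cur_centroids, cur_sizes = summarize(current)

# Compare centroids pairwise by index. Note: labels from two independent fits
# are not aligned, so a real comparison would first match centroids
# (e.g. by nearest neighbor) before measuring drift.
similarities = [cosine(r, c) for r, c in zip(ref_centroids, cur_centroids)]
print("cluster sizes:", cur_sizes)
print("centroid cosine similarities:", [round(s, 3) for s in similarities])
```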
Performance Metrics (extracted at the model evaluation step)
Model Type | Metrics | Use cases | Notes/Resources |
---|---|---|---|
Binary Classification | AUC, Accuracy, Recall, FPR (false positive rate), Precision, F1, Confusion Matrix, ROC Curve, Precision-Recall Curve | Performance degradation, model A/B testing, performance breakdown across segments/cohorts | Tracking classification model performance (see the classification sketch below this table) |
Multiclass Classification | Accuracy, F1, Recall, Precision (macro-averaged) | Performance degradation, model A/B testing, performance breakdown across segments/cohorts | Tracking classification model performance (see the classification sketch below this table) |
Regression | Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) | Performance degradation, model A/B testing, performance breakdown across segments/cohorts | Tracking regression model performance (see the regression sketch below this table) |
Ranking | AUC, Segment AUC, Mean Average Precision at K (MAP@K), Mean Reciprocal Rank (MRR), NDCG | Performance degradation, model A/B testing, performance breakdown across segments/cohorts | Tracking ranking model performance (see the ranking sketch below this table) |
Summarization (LLM) | ROUGE, text semantic similarity | Performance degradation, model A/B testing, performance breakdown across segments/cohorts | Tracking LLM performance across use cases (see the ROUGE sketch below this table) |
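The sketches below illustrate the performance metrics in this table with scikit-learn (and, for ROUGE, the open-source rouge-score package) on toy predictions; they show what the metrics compute, not the platform's own API. First, the binary classification metrics (multiclass models macro-average the same scores):

```python
# Minimal sketch: binary classification metrics on toy labels and scores.
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score,
    recall_score, roc_auc_score,
)

y_true = [0, 1, 1, 0, 1, 0, 1, 0]
y_score = [0.2, 0.9, 0.6, 0.4, 0.3, 0.1, 0.8, 0.7]  # model scores
y_pred = [1 if s >= 0.5 else 0 for s in y_score]     # thresholded predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "auc": roc_auc_score(y_true, y_score),
    "accuracy": accuracy_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "fpr": fp / (fp + tn),
}
print(metrics)

# For multiclass models, macro-average the same scores, e.g.:
# f1_score(y_true, y_pred, average="macro")
```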
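The regression metrics on toy targets and predictions:

```python
# Minimal sketch: regression metrics (MSE, RMSE, MAE) on toy values.
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.2, 5.0, 1.8, 4.4]
y_pred = [2.9, 5.4, 2.0, 4.1]

mse = mean_squared_error(y_true, y_pred)
metrics = {
    "mse": mse,
    "rmse": mse ** 0.5,
    "mae": mean_absolute_error(y_true, y_pred),
}
print(metrics)
```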
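A few of the ranking metrics, with NDCG from scikit-learn and hand-rolled MRR and mean average precision at K; the relevance judgments and scores are made up:

```python
# Minimal sketch: ranking metrics (NDCG, MRR, MAP@K) on toy relevance data.
import numpy as np
from sklearn.metrics import ndcg_score

# One row per query: graded relevance of each candidate, and the model's scores.
true_relevance = np.asarray([[3, 2, 0, 1], [0, 1, 2, 0]])
model_scores = np.asarray([[0.9, 0.2, 0.4, 0.7], [0.1, 0.8, 0.3, 0.5]])

def mean_reciprocal_rank(relevance, scores):
    # Reciprocal rank of the first relevant item in each ranked list.
    ranks = []
    for rel, sc in zip(relevance, scores):
        order = np.argsort(sc)[::-1]
        hits = np.nonzero(rel[order] > 0)[0]
        ranks.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(ranks))

def mean_average_precision_at_k(relevance, scores, k=3):
    # Average precision over the top-k ranked items, averaged across queries.
    aps = []
    for rel, sc in zip(relevance, scores):
        order = np.argsort(sc)[::-1][:k]
        hits, precisions = 0, []
        for i, idx in enumerate(order, start=1):
            if rel[idx] > 0:
                hits += 1
                precisions.append(hits / i)
        aps.append(np.mean(precisions) if precisions else 0.0)
    return float(np.mean(aps))

print("NDCG:", ndcg_score(true_relevance, model_scores))
print("MRR:", mean_reciprocal_rank(true_relevance, model_scores))
print("MAP@3:", mean_average_precision_at_k(true_relevance, model_scores))
```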
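For LLM summarization, ROUGE can be computed with the open-source rouge-score package (semantic similarity would typically come from an embedding model); the reference and generated texts are illustrative:

```python
# Minimal sketch: ROUGE for a generated summary against a reference,
# using the open-source `rouge-score` package.
from rouge_score import rouge_scorer

reference = "The committee approved the budget and postponed the vote on zoning."
generated = "The committee approved the budget but delayed the zoning vote."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)
print({name: round(score.fmeasure, 3) for name, score in scores.items()})
```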