Metrics overview
The WhyLabs Platform makes it easy to track model and data health across all essential AI use cases. This page gives a broad overview of the types of metrics you can track in the platform, along with links for exploring each metric category in more depth.
Category | Metrics | Use cases | Notes/Resources |
---|---|---|---|
Tabular model inputs | Overall: Number of records/rows per time period (hour/day/custom), Number of features in a time period, Data freshness. Continuous: Count, Missing value counts, Cardinality, Distribution (non-parametric histogram), Mean, Median, Max, Min, Schema. Discrete: Count, Cardinality, Missing value counts, Distribution (top 100 frequent item counts), Most common value, Schema | Distribution drift, training-serving skew, missing data, changes in schema | Basic example notebook for logging tabular data metrics (see the whylogs sketch below this table) |
Image model inputs (CV) | Overall: Number of records/images per time period (hour/day/custom), Data freshness. Traditional CV features extracted from every model input image: Image brightness (mean, standard deviation), Hue (mean, standard deviation), Saturation (mean, standard deviation), Height/Width, Colorspace. Metadata: EXIF data | Training-serving skew, image quality (e.g. blurry images, obstructed images, desaturated images, switched color channels) | Overview blog with examples of logging image data metrics (see the image-feature sketch below this table) |
Text model inputs (basic text) | Overall: Number of records per time period (hour/day/custom), Data freshness. Traditional string features extracted from every model input string/blurb: String length, Word counts, Character counts and distributions, Digit counts and distributions | Training-serving skew, text/string quality (e.g. corrupt text, drift in input length, unexpected characters) | Basic example notebook for logging text/string data metrics (see the text-feature sketch below this table) |
LLM model prompts and responses (advanced text) | Overall: Number of interaction records per time period (hour/day/custom), Data freshness. Text quality: Readability index, Flesch-Kincaid grade, Flesch reading ease, SMOG index, Syllable count, Lexicon count. Relevance: Semantic similarity between prompt and response, Topic extraction. Toxicity: Sentiment, Toxicity. Security: PII detection, refusal risk, jailbreak risk, abuse risk. Usage: Number of tokens, latency | Detecting drift in sentiment/toxicity, jailbreaks, abuse, hallucinations | Documentation of the LLM metrics (see the text-quality sketch below this table) |
Embeddings | Cosine similarity, Number of clusters, Size of clusters, Distance between centroids | Detecting embedding drift over time | Troubleshooting embedding drift (see the embedding sketch below this table) |
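The sketches below are minimal, illustrative examples of how some of the metric groups in this table can be computed or logged. They rely on common open-source tooling (whylogs, Pillow, textstat, scikit-learn) and toy data; they are not the platform's internal implementation.

For the tabular row, a minimal whylogs (v1 API) sketch that profiles a small batch; the DataFrame and its column names are made up for illustration:

```python
# Minimal sketch: profiling a tabular batch with whylogs (v1 API).
# The DataFrame and column names are illustrative only.
import pandas as pd
import whylogs as why

df = pd.DataFrame(
    {
        "loan_amount": [1200.0, 530.5, None, 980.0],  # continuous feature
        "home_state": ["CA", "NY", "CA", "TX"],        # discrete feature
    }
)

# why.log profiles the batch: counts, missing values, cardinality,
# distribution sketches, and the inferred schema per column.
results = why.log(df)
profile_view = results.view()

# Inspect the profiled metrics locally as a pandas DataFrame.
print(profile_view.to_pandas())

# To send the profile to WhyLabs, attach a writer (requires WhyLabs
# credentials in the environment), e.g.:
# results.writer("whylabs").write()
```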
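For the image row, a sketch that computes a few of the listed features (brightness, hue, and saturation statistics, height/width, colorspace) with Pillow and NumPy; this illustrates the metrics themselves rather than the platform's extraction pipeline, and the file path is hypothetical:

```python
# Minimal sketch: extracting basic CV features (brightness, hue, saturation
# statistics, height/width, colorspace) from one image with Pillow + NumPy.
import numpy as np
from PIL import Image

def image_features(path: str) -> dict:
    img = Image.open(path)
    width, height = img.size

    # Convert to HSV so hue, saturation, and value (brightness) are separate channels.
    hsv = np.asarray(img.convert("RGB").convert("HSV"), dtype=np.float64)
    hue, saturation, value = hsv[..., 0], hsv[..., 1], hsv[..., 2]

    return {
        "width": width,
        "height": height,
        "colorspace": img.mode,
        "brightness_mean": value.mean(),
        "brightness_std": value.std(),
        "hue_mean": hue.mean(),
        "hue_std": hue.std(),
        "saturation_mean": saturation.mean(),
        "saturation_std": saturation.std(),
    }

# features = image_features("example.jpg")  # path is illustrative
```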
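For the basic text row, the listed string features (string length, word count, character and digit counts) can be computed in plain Python before logging; the feature names are made up for illustration:

```python
# Minimal sketch: basic string features (string length, word count,
# character and digit counts) computed per input text.
def text_features(text: str) -> dict:
    return {
        "string_length": len(text),
        "word_count": len(text.split()),
        "char_count": sum(ch.isalpha() for ch in text),
        "digit_count": sum(ch.isdigit() for ch in text),
    }

print(text_features("Order #42 shipped to 221B Baker Street"))
# {'string_length': 38, 'word_count': 7, 'char_count': 26, 'digit_count': 5}
```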
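For the LLM row, the text-quality statistics (Flesch-Kincaid grade, Flesch reading ease, SMOG index, syllable count, lexicon count) can be computed with the open-source textstat package, used here purely to illustrate what those metrics measure:

```python
# Minimal sketch: readability/text-quality statistics for an LLM response,
# computed with the open-source `textstat` package. The text is illustrative.
import textstat

response = "The model forecasts demand using recent sales and seasonality."

quality = {
    "flesch_kincaid_grade": textstat.flesch_kincaid_grade(response),
    "flesch_reading_ease": textstat.flesch_reading_ease(response),
    "smog_index": textstat.smog_index(response),
    "syllable_count": textstat.syllable_count(response),
    "lexicon_count": textstat.lexicon_count(response),
}
print(quality)
```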
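For the embeddings row, cluster-based summaries (number and size of clusters, centroid comparisons via cosine similarity) can be sketched with scikit-learn and NumPy; the data is synthetic and the cluster matching is deliberately naive:

```python
# Minimal sketch: summarizing an embedding batch with k-means and comparing
# centroids against a reference batch via cosine similarity (synthetic data).
import numpy as np
from sklearn.cluster import KMeans

def summarize(embeddings: np.ndarray, n_clusters: int = 5):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    sizes = np.bincount(km.labels_, minlength=n_clusters)  # size of each cluster
    return km.cluster_centers_, sizes

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 64))         # e.g. yesterday's embeddings
current = rng.normal(loc=0.3, size=(1000, 64))  # e.g. today's embeddings (shifted)

ref_centroids, _ = summarize(reference)
cur_centroids, cur_sizes = summarize(current)

# Compare centroids pairwise by index. Note: labels from two independent fits
# are not aligned, so a real comparison would first match centroids
# (e.g. by nearest neighbor) before measuring drift.
similarities = [cosine(r, c) for r, c in zip(ref_centroids, cur_centroids)]
print("cluster sizes:", cur_sizes)
print("centroid cosine similarities:", [round(s, 3) for s in similarities])
```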
Performance Metrics (extracted at the model evaluation step)
Model Type | Metrics | Use cases | Notes/Resources |
---|---|---|---|
Binary Classification | AUC, Accuracy, Recall, FPR (false positive rate), Precision, F1, Confusion Matrix, ROC Curve, Precision-Recall Curve | Performance degradation, model A/B testing, performance breakdown across segments/cohorts | Tracking classification model performance (see the classification sketch below this table) |
Multiclass Classification | Accuracy, F1, Recall, Precision (macro-averaged) | Performance degradation, model A/B testing, performance breakdown across segments/cohorts | Tracking classification model performance (see the classification sketch below this table) |
Regression | Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) | Performance degradation, model A/B testing, performance breakdown across segments/cohorts | Tracking regression model performance (see the regression sketch below this table) |
Ranking | AUC, Segment AUC, Mean Average Precision at K (MAP@K), Mean Reciprocal Rank (MRR), NDCG | Performance degradation, model A/B testing, performance breakdown across segments/cohorts | Tracking ranking model performance (see the ranking sketch below this table) |
Summarization (LLM) | ROUGE, text semantic similarity | Performance degradation, model A/B testing, performance breakdown across segments/cohorts | Tracking LLM performance across use cases (see the ROUGE sketch below this table) |
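The sketches below illustrate the performance metrics in this table with scikit-learn (and, for ROUGE, the open-source rouge-score package) on toy predictions; they show what the metrics compute, not the platform's own API. First, the binary classification metrics (multiclass models macro-average the same scores):

```python
# Minimal sketch: binary classification metrics on toy labels and scores.
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score,
    recall_score, roc_auc_score,
)

y_true = [0, 1, 1, 0, 1, 0, 1, 0]
y_score = [0.2, 0.9, 0.6, 0.4, 0.3, 0.1, 0.8, 0.7]  # model scores
y_pred = [1 if s >= 0.5 else 0 for s in y_score]     # thresholded predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "auc": roc_auc_score(y_true, y_score),
    "accuracy": accuracy_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "fpr": fp / (fp + tn),
}
print(metrics)

# For multiclass models, macro-average the same scores, e.g.:
# f1_score(y_true, y_pred, average="macro")
```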
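The regression metrics on toy targets and predictions:

```python
# Minimal sketch: regression metrics (MSE, RMSE, MAE) on toy values.
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.2, 5.0, 1.8, 4.4]
y_pred = [2.9, 5.4, 2.0, 4.1]

mse = mean_squared_error(y_true, y_pred)
metrics = {
    "mse": mse,
    "rmse": mse ** 0.5,
    "mae": mean_absolute_error(y_true, y_pred),
}
print(metrics)
```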
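A few of the ranking metrics, with NDCG from scikit-learn and hand-rolled MRR and mean average precision at K; the relevance judgments and scores are made up:

```python
# Minimal sketch: ranking metrics (NDCG, MRR, MAP@K) on toy relevance data.
import numpy as np
from sklearn.metrics import ndcg_score

# One row per query: graded relevance of each candidate, and the model's scores.
true_relevance = np.asarray([[3, 2, 0, 1], [0, 1, 2, 0]])
model_scores = np.asarray([[0.9, 0.2, 0.4, 0.7], [0.1, 0.8, 0.3, 0.5]])

def mean_reciprocal_rank(relevance, scores):
    # Reciprocal rank of the first relevant item in each ranked list.
    ranks = []
    for rel, sc in zip(relevance, scores):
        order = np.argsort(sc)[::-1]
        hits = np.nonzero(rel[order] > 0)[0]
        ranks.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(ranks))

def mean_average_precision_at_k(relevance, scores, k=3):
    # Average precision over the top-k ranked items, averaged across queries.
    aps = []
    for rel, sc in zip(relevance, scores):
        order = np.argsort(sc)[::-1][:k]
        hits, precisions = 0, []
        for i, idx in enumerate(order, start=1):
            if rel[idx] > 0:
                hits += 1
                precisions.append(hits / i)
        aps.append(np.mean(precisions) if precisions else 0.0)
    return float(np.mean(aps))

print("NDCG:", ndcg_score(true_relevance, model_scores))
print("MRR:", mean_reciprocal_rank(true_relevance, model_scores))
print("MAP@3:", mean_average_precision_at_k(true_relevance, model_scores))
```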
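For LLM summarization, ROUGE can be computed with the open-source rouge-score package (semantic similarity would typically come from an embedding model); the reference and generated texts are illustrative:

```python
# Minimal sketch: ROUGE for a generated summary against a reference,
# using the open-source `rouge-score` package.
from rouge_score import rouge_scorer

reference = "The committee approved the budget and postponed the vote on zoning."
generated = "The committee approved the budget but delayed the zoning vote."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)
print({name: round(score.fmeasure, 3) for name, score in scores.items()})
```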