Model Monitoring with WhyLabs
It's painless to start monitoring whylogs profiles after they have been uploaded to the platform. Unlike other monitoring solutions, WhyLabs' monitor requires zero configuration out of the box and can be enabled or disabled with a single click. This approach removes the friction of configuring monitors individually, where each monitor requires a specific metric, threshold, or range to be defined.
It's possible to manually adjust monitor settings for each input feature, model output, or performance metric. This combination of model-level defaults with manual overrides enables fine-grained control over the monitor's behavior.
Monitor Capabilities Overview
WhyLabs Observatory enables a comprehensive set of model health and data health monitors for both inputs and outputs of production models. All monitors are enabled on 100% of data with no data volume limits. Popular monitors include:
- Data drift (distribution similarity, descriptive statistics)
- Data quality (missing values, schema, cardinality)
- Concept drift / label drift
- Model performance monitoring (various performance metrics for classification and regression)
- Top K values for categorical input features
- Data volume (inputs and outputs)
- Data ingestion (detects whether data has been delivered to a particular model)
Users can also split their datasets into segments and monitor the above at the segment level. WhyLabs has the ability to monitor new data profiles against a static reference profile or against a sliding window.
Static reference profile: Users can select a single profile to monitor against. This could be a profile generated from the training set, for example. Alternatively, users can create a static reference profile by merging a number of profiles over a given time interval.
Sliding window: Users can set a sliding window to monitor against. For example, selecting a 7-day sliding window will monitor new data against the aggregate of profiles uploaded within the last 7 days.
Anomaly detection & alerting can be done using either manually set thresholds or learned thresholds.
Manually set thresholds: Users can manually set thresholds for the various metrics tracked by WhyLabs. For example, a user can set a maximum threshold of 10% for the percentage of missing values for a particular feature.
Learned thresholds: WhyLabs uses cutting-edge techniques to dynamically calculate thresholds for features, which can be used in place of manual thresholds. Every dataset is different, and WhyLabs is particularly adept at automatically tuning these thresholds to your unique use case. For example, learned thresholds can be used to automatically set the maximum Hellinger distance allowed between the distribution of new data and the distribution of data from the last 7 days.
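As an illustration of how the two approaches coexist, the SourceMonitorConfig schema documented later in this section lets you set a threshold explicitly or leave it unset/null so that it is learned using num_std_dev. The specific values below are placeholders, not recommendations:
const thresholdExamples: SourceMonitorConfig = {
  schema_version: "0.1.0",
  // Manually set threshold: alert when more than 10% of a feature's values are missing.
  missing_values: {
    enable: true,
    threshold_upper_bound: 0.1,
  },
  // Learned threshold: leaving the threshold null/unset means it is learned
  // dynamically, with num_std_dev controlling the sensitivity.
  distribution: {
    enable: true,
    threshold: null,
  },
  num_std_dev: 3, // placeholder sensitivity
};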
Once a model is onboarded, alerts can be configured. Alerts are triggered automatically and sent to the team's workflow (typically Slack, PagerDuty, email, etc). Read more on our Alerts and Notification Workflows pages.
Monitor Concepts
Batches and Profiles
Dataset profiles are generated by the whylogs library. Profiles contain statistical summaries of some subset of a dataset, typically with an associated time.
Batches refer to profiles of data collected over some period of time. For example, as profiles are uploaded to WhyLabs for the 1:00pm window of an hourly dataset, they'll be merged into a single profile. That merged profile is a batch.
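Conceptually, profiles are mergeable statistical summaries, which is what allows several uploads for the same hour to collapse into one batch without re-reading the raw data. A toy sketch of the idea (whylogs itself tracks far richer metrics, so this is purely illustrative):
// Purely illustrative: a toy column summary with a few mergeable statistics.
interface ToyColumnProfile {
  count: number;
  nullCount: number;
  min: number;
  max: number;
}

// Merging two summaries of the same column yields a summary of the combined
// data, which is how multiple uploads for the same batch are collapsed.
function mergeColumnProfiles(a: ToyColumnProfile, b: ToyColumnProfile): ToyColumnProfile {
  return {
    count: a.count + b.count,
    nullCount: a.nullCount + b.nullCount,
    min: Math.min(a.min, b.min),
    max: Math.max(a.max, b.max),
  };
}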
Discrete vs. Continuous Columns
It's important to keep in mind that some monitoring algorithms only work for discrete probability distributions, that is, distributions of values that can be compared for equality (==). This works well for low-cardinality numbers (e.g. a small set of integers) or strings. It doesn't work well for continuous numbers, such as 0.95 and 0.96, because comparing with strict equality won't capture practical similarity. Instead, we do comparisons within some interval, such as 0.9 <= value < 1.0.
WhyLabs automatically determines if a dataset column is continuous or discrete based on a statistical test of the cardinality of the data relative to the overall count. That classification slightly changes the way some of the monitors below behave.
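The exact statistical test isn't spelled out here, but a simplified sketch of a cardinality-based heuristic (with an arbitrary cutoff chosen purely for illustration) might look like:
// Illustrative heuristic only; the 5% cutoff is arbitrary and is not the
// statistical test WhyLabs actually applies.
function isLikelyDiscrete(estimatedCardinality: number, totalCount: number): boolean {
  if (totalCount === 0) return true;
  // Few distinct values relative to the total count suggests a discrete column.
  return estimatedCardinality / totalCount < 0.05;
}

// Example: 12 distinct values across 10,000 rows would be treated as discrete.
console.log(isLikelyDiscrete(12, 10_000)); // true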
Reference Profile
When a monitor runs, it compares a given batch of data against a reference data profile. That reference profile can represent a single batch of data, configured by its timestamp or reference ID, or a whole time window of data profiles defined by the number of days or hours into the past you want to look.
See MonitorRequestReference in our API docs for more detail.
We have a few different ways you can reference profiles, as shown in the config sketch after this list.
- Trailing Window: A number of previous batches describes how many hours/days/weeks of data prior to the current time to use as a baseline for comparison. The granularity of the batch is controlled by the model's granularity.
- Reference Profile Batch: Any single batch of data logged previously can be used as the baseline for comparison against the current batch. For instance, this could be the very first day of data logged for the dataset.
- Single Reference Profile: An individual profile uploaded via the LogReference API can be used as the baseline for comparison. This profile might represent the training dataset, for instance. This example notebook demonstrates this process.
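For illustration, here is how each of these baseline options maps onto the reference field of the monitor config schema shown later in this section (the timestamp and IDs below are placeholders):
// Trailing window: the previous 7 batches (hours/days depending on the model's granularity).
const trailingWindow: TimeWindowReference = {
  type: "reference_window",
  num_batches: 7,
};

// Reference profile batch: a single previously logged batch, identified by its
// millisecond timestamp (placeholder value shown).
const batchReference: ProfileReference = {
  type: "reference_profile",
  profileTimestamp: 1672531200000,
};

// Single reference profile: a standalone profile uploaded via the LogReference API,
// identified by the ID returned at upload time (placeholder value shown).
const singleReference: ProfileReference = {
  type: "reference_profile",
  profileId: "ref-vLlXzaNvysnuqacL",
};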
Config Defaults and Learned Thresholds
There are a few strategies that we use to determine defaults for various configurations.
- For configuration fields that represent thresholds, we default to dynamically learning them based on analyzing windows of time relative to today. We're constantly tuning the exact approach, so reach out to us on Slack to ask about the details.
- For some configuration parameters that aren't specified explicitly, we have defaults that will be used. When you GetMonitorConfig via our API, any defaults that we applied will be returned as part of the config.
Monitor Types
WhyLabs has several types of monitoring algorithms that can be configured to run, enumerated below.
Data Type Monitor
- Output: 0 or 1 if a column in a profile has a type that is different from the reference profile
- Config: See DatatypeMonitorRequestConfig in our API docs.
This monitor alerts based on the inferred data type of a column compared against the inferred data type of the same column in the reference profile. If the inferred data type is different, this triggers an alarm.
For example, for some column name, if you enable this monitor on a dataset that is configured to use the previous three days, and the previous three days have inferred name to be of type TEXT, but the current batch of data infers name to be of type NULL, then this will trigger an alarm.
The inferred data type for a column in a batch of data is determined to be the most frequent data type observed. For example, if 45% of values in a column are TEXT, 20% are FRACTIONAL, and 25% are NULL, then the inferred type will be TEXT. It works the same way for a window of profiles: if, across the last 7 days of profiles, 45% of the values in a column are TEXT, 20% are FRACTIONAL, and 25% are NULL, then the inferred type will be TEXT.
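As a sketch of the rule described above (not the exact implementation), picking the most frequent observed type could look like:
// Returns the most frequent data type observed in a column, e.g. TEXT when the
// counts are { TEXT: 45, FRACTIONAL: 20, NULL: 25, OTHER: 10 }.
function inferDataType(typeCounts: Record<string, number>): string {
  let inferred = "NULL";
  let bestCount = -1;
  for (const [dataType, count] of Object.entries(typeCounts)) {
    if (count > bestCount) {
      inferred = dataType;
      bestCount = count;
    }
  }
  return inferred;
}

console.log(inferDataType({ TEXT: 45, FRACTIONAL: 20, NULL: 25, OTHER: 10 })); // "TEXT"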
Distribution Monitor
- Output: Value within interval [0, 1] that represents the distance between current and previous batches of data.
- Config: See DistributionMonitorRequestConfig in our API docs.
This monitor alerts based on the distribution distance compared with previous batches of data, as computed by the Hellinger distance. If this monitor is in alert, then it means that the distribution of the batch of data being analyzed is different enough from the reference batch that it exceeded the threshold set in the monitor configuration.
For example, if you enable this monitor on a dataset that is configured to use the previous 7 days with a threshold of 2, then we will compute the Hellinger distance for every new profile that you upload relative to the last 7 days and trigger an alarm if that number is greater than 2.
The calculation for distribution distance can differ slightly for discrete and non-discrete features. Discrete features utilize the most frequent items collected while continuous numerical values utilize the histogram.
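For reference, the Hellinger distance between two discrete probability distributions falls in [0, 1]. A minimal sketch over aligned probability vectors (in practice the frequent items or histogram bins would first be normalized into this form):
// Hellinger distance between two discrete probability distributions defined over
// the same bins/categories. Returns a value in [0, 1]: 0 means identical
// distributions, 1 means the distributions have no overlap at all.
function hellingerDistance(p: number[], q: number[]): number {
  if (p.length !== q.length) {
    throw new Error("Distributions must be defined over the same bins");
  }
  let sumSquaredDiff = 0;
  for (let i = 0; i < p.length; i++) {
    sumSquaredDiff += (Math.sqrt(p[i]) - Math.sqrt(q[i])) ** 2;
  }
  return Math.sqrt(sumSquaredDiff) / Math.SQRT2;
}

// Example: a slight shift between two 3-bin histograms.
console.log(hellingerDistance([0.5, 0.3, 0.2], [0.4, 0.35, 0.25])); // ≈ 0.07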
Missing Value Monitor
- Output: Value within interval [0, 1].
- Config: See MissingValuesMonitorRequestConfig in our API docs.
This monitor alerts based on the ratio of missing values, calculated as num_null_type_values / num_total_values. If the output of the monitor is higher than the configured threshold, then this monitor will trigger an alarm. For example, if the monitor is configured with a threshold of 0.01 and a batch is generated that has 11 null values out of 1000 total values in some column, then an alert will be triggered for that column.
If a dataset is configured to use a time range as a baseline reference, then we use the median missing value ratio for the time range. For example, if we're using this monitor on a model that has been configured to use the past seven days as a baseline, and the missing value ratios across the last seven days were [0, 0, 0.1, 0.2, 0.3, 0.4, 0.6], then we would choose 0.2 as our baseline missing value ratio.
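A minimal sketch of the ratio and the median-of-baseline calculation described above:
// Ratio of missing (null) values in a batch, e.g. 11 / 1000 = 0.011.
function missingValueRatio(nullCount: number, totalCount: number): number {
  return totalCount === 0 ? 0 : nullCount / totalCount;
}

// Median of the per-batch ratios across a trailing window, used as the baseline.
function baselineMissingValueRatio(windowRatios: number[]): number {
  const sorted = [...windowRatios].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // For an even-length window this sketch averages the two middle values.
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

console.log(baselineMissingValueRatio([0, 0, 0.1, 0.2, 0.3, 0.4, 0.6])); // 0.2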
Unique Value Monitor
- Output: Value within interval [0, 1] for non-discrete features, integer value for discrete features.
- Config: See UniqueValuesMonitorRequestConfig in our API docs.
This monitor tracks the unique values ratio and works similarly to the missing value monitor for continuous features. For discrete features, we monitor based on the total count of unique values.
For example, if a dataset is configured to use this monitor with a minimum threshold of 0.2 and a maximum threshold of 0.8, and a column in a new batch of data contains only the values A, B, and C across 100 total rows, then the unique value ratio is 3/100, or 0.03. Since 0.03 falls below the minimum threshold of 0.2, an alert will be triggered for that column.
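A short sketch of the two flavors of the metric described above (ratio for non-discrete columns, raw unique count for discrete ones):
// Unique value metric: a ratio in [0, 1] for non-discrete columns,
// a raw count of distinct values for discrete columns.
function uniqueValueMetric(uniqueCount: number, totalCount: number, isDiscrete: boolean): number {
  if (isDiscrete) {
    return uniqueCount;
  }
  return totalCount === 0 ? 0 : uniqueCount / totalCount;
}

// Example from above: 3 distinct values across 100 rows gives a ratio of 0.03.
console.log(uniqueValueMetric(3, 100, false)); // 0.03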
Missing Profiles Monitor
- Output: 0 or 1
When a dataset ceases to log new profiles, WhyLabs will generate an alert to indicate a potential issue with the integration. By default, WhyLabs will alert after two consecutive missing batches. For example, if you have configured an hourly dataset and something breaks in your system at 8:00 such that profiles are no longer sent to WhyLabs for 9:00 and 10:00, then this will alert when our monitor runs.
Data Ingestion Monitor
- Output: 0 or 1
Turning on this monitor will enable alerts for instances in which no profiles are uploaded with timestamps within the last 2 days (or 2 hours for models with an hourly batch frequency). This number of days/hours can be changed on the backend if desired. Note that this monitor looks at the timestamp associated with the profile, as opposed to the timestamp associated with the upload event. This means the data ingestion monitor is also useful for cases in which a date format issue causes uploaded profiles to be associated with an invalid date (e.g. 1970), even though profiles were uploaded within the last 2 days.
Note: Users are advised to only use this monitor if they expect regular uploads. False positives would be generated otherwise.
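A simplified sketch of the check this monitor performs, keyed off profile timestamps rather than upload times (the 2-day lookback matches the default mentioned above):
// Returns true when no uploaded profile carries a timestamp inside the lookback
// window, which is the condition that triggers the data ingestion alert. Because
// the check uses the profile's own timestamp rather than the upload time, a
// profile mis-dated to e.g. 1970 still counts as missing recent data.
function ingestionAlertNeeded(profileTimestampsMs: number[], lookbackMs: number, nowMs: number): boolean {
  return !profileTimestampsMs.some((ts) => nowMs - ts <= lookbackMs);
}

const TWO_DAYS_MS = 2 * 24 * 60 * 60 * 1000;
console.log(ingestionAlertNeeded([0 /* a 1970-dated profile */], TWO_DAYS_MS, Date.now())); // true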
Seasonal Timeseries Monitor
- Output: upper and lower bounds of the 95% confidence interval for the expected value
We estimate 95% confidence intervals for each point in a time series using Seasonal ARIMA (SARIMA) forecasting. Observed values that fall outside the forecast interval will trigger alerts. This forecasting method requires much more baseline data than the other monitors, but it has worked well with the seasonal data we have seen so far.
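The forecasting itself is too involved for a short example, but the alerting step reduces to checking each observed value against its forecast interval:
interface ForecastInterval {
  lower: number; // lower bound of the 95% confidence interval
  upper: number; // upper bound of the 95% confidence interval
}

// An observation falling outside the forecast interval triggers an alert.
function isSeasonalAnomaly(observed: number, forecast: ForecastInterval): boolean {
  return observed < forecast.lower || observed > forecast.upper;
}

console.log(isSeasonalAnomaly(120, { lower: 40, upper: 100 })); // true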
Configuring the Monitor
The monitor can be configured either through the WhyLabs UI or programmatically via our REST API if you want to override any of our defaults. If you're updating settings through the UI, head to the Monitor Settings page while scoped into a model; it's accessed via the last tab on the model page.
Alternatively, you can access the page directly via its URL, specifying the appropriate model ID: hub.whylabsapp.com/models/model-ID/monitor-settings.
Requests can be made via HTTP REST calls, using the Swagger API as a guide. We also have Python and Java clients that supply types and call logic.
You can supply a configuration for the monitor using our PutMonitorConfig API. It takes a JSON config as the body with the following schema.
interface ProfileReference {
  type: "reference_profile";
  // Pick one of the following to supply
  profileTimestamp?: number; // Millisecond timestamp of the reference profile to use (if it was uploaded with the dataset)
  profileId?: string; // ID of the profile to use (if it was uploaded separately)
}

interface TimeWindowReference {
  type: "reference_window";
  num_batches?: number; // global number of batches to use when monitoring the model/dataset
}

interface SourceMonitorConfig {
  schema_version: "0.1.0";
  /**
   * If the monitored value falls more than the specified number of
   * std dev away from mean, the value is considered an anomaly.
   *
   * This value is used in each of the thresholds in each of the monitor
   * configs that are unset, which means they're implicitly set to be
   * dynamically learned using this standard deviation.
   */
  num_std_dev?: number;
  /**
   * A profile to reference as the baseline.
   */
  reference?: TimeWindowReference | ProfileReference;
  distribution?: {
    enable?: boolean;
    threshold?: number | null;
  };
  missing_values?: {
    enable?: boolean;
    threshold_lower_bound?: number;
    threshold_upper_bound?: number;
  };
  unique_values?: {
    enable?: boolean;
    /**
     * The minimum number of rows of data required before we'll consider
     * alerting. This is to cut down on noise.
     */
    min_record_count?: number;
    min_threshold?: number | null;
    max_threshold?: number | null;
  };
  datatype?: {
    enable?: boolean;
  };
  missingRecentData?: {
    enable?: boolean;
  };
  missingRecentProfiles?: {
    enable?: boolean;
  };
  seasonalARIMA?: {
    enable?: boolean;
    /**
     * The number of batches that comprise the seasonality.
     * For a daily dataset, a seasonality of 7 would indicate that the
     * data is expected to have a weekly cycle.
     */
    seasonalityBatches?: number;
    /**
     * The list of metrics that you want to enable ARIMA for.
     * See WhyLogsMetric below
     */
    metrics?: WhyLogsMetric[];
  };
}

/**
 * Metrics that are present inside of whylogs profiles
 */
enum WhyLogsMetric {
  TotalCount,
  Median,
  Min,
  Max,
  StdDev,
  Mean,
}

interface SourceDatasetMonitorConfig {
  config: SourceMonitorConfig;
  per_feature_config?: Record<string, SourceMonitorConfig>;
}
Here is an example of what a config looks like. The sample has a global configuration that sets the baseline to the profile with an id of ref-vLlXzaNvysnuqacL and configures the distribution monitor to have a threshold of 2. It also overrides the monitor settings for the feature_a and feature_b features in this particular model. feature_a's distribution threshold is nudged a bit higher to 3, and feature_b's datatype monitor is disabled.
const myConfig: SourceDatasetMonitorConfig = {
  config: {
    schema_version: "0.1.0",
    reference: {
      type: "reference_profile",
      profileId: "ref-vLlXzaNvysnuqacL",
    },
    distribution: {
      enable: true,
      threshold: 2,
    },
  },
  per_feature_config: {
    feature_a: {
      schema_version: "0.1.0",
      distribution: {
        enable: true,
        threshold: 3,
      },
    },
    feature_b: {
      schema_version: "0.1.0",
      datatype: {
        enable: false,
      },
    },
  },
};
For quick prototyping, you can open up this playground to customize the myConfig object at the bottom. If you want to use the API to upload a reference profile, then take note of the profile ID that gets returned to you. You can use that as the profileId in the config reference. The playground uses TypeScript as an easy way to validate the types, but the payload will have to be converted into JSON before it can be sent to our API.
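For example, serializing the config above into the JSON body is a one-liner; how you then attach it to the PutMonitorConfig call depends on whether you use raw HTTP or one of the clients, and the endpoint details aren't covered here:
// Convert the TypeScript object into the JSON payload expected by PutMonitorConfig.
const requestBody: string = JSON.stringify(myConfig, null, 2);
console.log(requestBody);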
Adjusting monitor sensitivity
Different models and use cases will require different monitoring settings for the best signal-to-noise ratio. For example, in some instances low levels of missing values within an input feature may be tolerable, while in others any reading above 0% may indicate a critical issue. While our monitors come with preconfigured defaults that should work for most teams, some monitor configuration tweaks may be necessary once patterns begin to emerge in the underlying data.
You can manually adjust how sensitive a particular monitor is by changing its thresholds. This can be accomplished both via the UI and our API. For more information, see the section about the particular monitor you are interested in modifying.
It is also possible to mark a given monitor alert as "unhelpful" through the UI.
This feedback is used to evaluate our default monitor configuration settings, and, in the case of non-deterministic monitoring algorithms such as Seasonal ARIMA, is also utilized as part of the monitor retraining process to minimize the number of false positives in the monitor's output.