Advanced Monitor Configuration

Working with Configurations#

In most cases, monitors can be set up using the monitor manager user interface. This guide is for users of the config investigator, as well as users configuring monitors with the REST API.

Config investigator is currently in beta; contact us to have it enabled for your account. Once enabled, it can be found within the monitor manager.

Analyzers vs Monitors#

Before we get started, what do we mean by analyzers, and how are they different from monitors? Once data profiled with whylogs is uploaded to WhyLabs, it can be analyzed on the platform. This typically means taking a target bucket of time and comparing it either to some baseline data or to a fixed threshold. While analysis is great, many customers want certain anomalies to alert their internal systems. A monitor specifies which anomalies are important enough and where to send notifications (email, PagerDuty, etc.).

Analyzers#

Whylogs is capable of profiling wide datasets with thousands of columns. Inevitably, customers have a subset of columns they wish to focus on for a particular type of analysis. This can be accomplished with the target matrix.

Targeting Columns#

After uploading a profile, each column in the dataset is automatically analyzed with schema inference, making it easy to configure analyzers scoped to certain groups. If schema inference guessed incorrectly or a schema changes, it can be corrected in the entity schema editor.

Allowed options include:

  • group:continuous - Continuous data has an infinite number of possible values that can be measured
  • group:discrete - Discrete data takes a finite number of values that can be counted
  • group:input - By default, columns are considered inputs unless their name contains the word output
  • group:output - Columns whose name contains the word output
  • * - An asterisk wildcard specifies all columns
  • sample_column - The name of the column as it was profiled. Column names are case sensitive

Note: In cases where a column is both included and excluded, it will be excluded.

Example

{
  "targetMatrix": {
    "type": "column",
    "include": [
      "group:discrete",
      "favorite_animal"
    ],
    "exclude": [
      "group:output",
      "sales_engineer_id"
    ]
  }
}

Targeting Segments#

Whylogs can profile both segmented and unsegmented data. WhyLabs can scope analysis to the overall segment, to specific segments, or to all segments. This option lives at the targetMatrix level of the config.

  • [] - An empty tags array indicates you would like analysis to run on the overall/entire dataset merged together.
  • [{"key" : "purpose", "value": "small_business"}] - Indicates you would like analysis on a specific segment. Note tag keys and values are case sensitive.
  • [{"key" : "car_make", "value": "*"}] - Asterisk wildcards are allowed in tag values to in this case generate analysis separately for every car_make in the dataset as well

Example:

{
  "targetMatrix": {
    "segments": [
      {
        "tags": []
      },
      {
        "tags": [
          {
            "key": "purpose",
            "value": "small_business"
          }
        ]
      },
      {
        "tags": [
          {
            "key": "car_make",
            "value": "*"
          }
        ]
      }
    ]
  }
}

Targeting Datasets#

Some analysis operates at the dataset level rather than on individual columns. This includes monitoring model accuracy, missing uploads, and more. For example, this analyzer targets the accuracy metric on a classification model.

{
  "id": "cheerful-lemonchiffon-echidna-2235-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "dataset",
    "segments": []
  },
  "config": {
    "metric": "classification.accuracy",
    "baseline": {
      "type": "TrailingWindow",
      "size": 7
    },
    "type": "diff",
    "mode": "pct",
    "threshold": 2
  }
}

Setting Your Baseline#

Trailing Windows#

The most frequently used baseline is the trailing window. Use the size parameter to indicate how much baseline data to use for comparison. This aligns with the dataset granularity, so a size of 7 on a daily model would use 7 days' worth of data as the baseline. Many metrics can be configured with a minBatchSize to prevent analysis when there's insufficient baseline data, which is useful for making alerts on sparse data less chatty.

{
  "id": "cheerful-lemonchiffon-2244-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "dataset",
    "segments": []
  },
  "config": {
    "metric": "classification.accuracy",
    "baseline": {
      "type": "TrailingWindow",
      "size": 7
    },
    "type": "diff",
    "mode": "pct",
    "threshold": 2
  }
}
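
To require a minimum amount of baseline data before this comparison runs, a minBatchSize can be added at the config level, as several later examples in this guide do. A minimal sketch (the value of 4 is an arbitrary illustration):

{
  "config": {
    "metric": "classification.accuracy",
    "baseline": {
      "type": "TrailingWindow",
      "size": 7
    },
    "minBatchSize": 4,
    "type": "diff",
    "mode": "pct",
    "threshold": 2
  }
  ...
}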

Exclusion Ranges#

Trailing window baselines can exclude time ranges. In this scenario, the first day of January is excluded from the baseline.

{
  "id": "successful-cornsilk-hamster-3862-analyzer",
  "config": {
    "metric": "classification.recall",
    "baseline": {
      "type": "TrailingWindow",
      "size": 7,
      "exclusionRanges": [
        {
          "start": "2021-01-01T00:00:00.000Z",
          "end": "2021-01-02T00:00:00.000Z"
        }
      ]
    },
    "type": "diff",
    "mode": "pct",
    "threshold": 2
  },
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "dataset",
    "segments": []
  }
}

Reference Profiles#

Instead of comparing targets to some variation of a rolling time window, a baseline can be a static reference profile, identified by profileId. This scenario compares each day's target against a profile of known-good data, checking for drift on frequent items. For more information about sending reference profiles to WhyLabs, see the whylogs documentation.

{
  "id": "muddy-green-chinchilla-1108-analyzer",
  "config": {
    "metric": "frequent_items",
    "baseline": {
      "type": "Reference",
      "profileId": "ref-MHxddU9naW0ptlAg"
    },
    "type": "drift",
    "algorithm": "hellinger",
    "threshold": 0.7
  },
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "column",
    "include": [
      "group:discrete"
    ],
    "exclude": [
      "group:output"
    ]
  }
}

Fixed Threshold#

Compare a target value to fixed upper and lower bounds using the fixed configuration. In this example, an anomaly is generated when more than 86400 seconds (one day) have passed since the last upload.

{
  "id": "missing_upload_analyzer",
  "config": {
    "type": "fixed",
    "upper": 86400,
    "lower": 0,
    "metric": "secondsSinceLastUpload"
  },
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "disabled": false,
  "targetMatrix": {
    "type": "dataset",
    "segments": [
      {
        "tags": []
      }
    ]
  }
}

Fixed Time Range#

Time range baselines compare a target against a baseline with a fixed start/end time range. This scenario performs drift detection on the histogram, comparing each target against a fixed period of time that was considered normal.

{
  "id": "continuous-distribution-58f73412",
  "config": {
    "baseline": {
      "type": "TimeRange",
      "range": {
        "start": "2022-02-25T00:00Z",
        "end": "2022-03-25T00:00Z"
      }
    },
    "metric": "histogram",
    "type": "drift",
    "algorithm": "hellinger",
    "threshold": 0.7,
    "minBatchSize": 1
  },
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "disabled": false,
  "targetMatrix": {
    "type": "column",
    "include": [
      "group:continuous"
    ]
  },
  "backfillGracePeriodDuration": "P30D"
}

Seasonal Forecast#

Some datasets have a strong seasonal component. WhyLabs has a proprietary seasonal forecasting algorithm which generates anomalies when a target deviates from what was forecast from the baseline. Contact us to add this feature to your account.
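
As a rough sketch only (the field names and values below are assumptions rather than a confirmed schema; the exact configuration is provided when the feature is enabled), a seasonal analyzer might look like:

{
  "id": "seasonal-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "config": {
    "type": "seasonal",
    "algorithm": "arima",
    "metric": "median",
    "baseline": {
      "type": "TrailingWindow",
      "size": 90
    }
  }
  ...
}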

Scheduling Analysis#

WhyLabs provides a number of configuration options, giving you full control over exactly when analysis and monitoring run.

Dataset Granularity#

When creating a model you're asked to provide a dataset granularity, which determines the analysis granularity and, to a large extent, the analysis cadence. WhyLabs currently supports four options:
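
Granularity is declared at the top level of the monitor configuration alongside the org and dataset identifiers, as in this minimal sketch (the IDs are placeholders):

{
  "orgId": "org-0",
  "datasetId": "model-0",
  "granularity": "daily"
  ...
}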

Hourly#

Hourly datasets are sometimes used when profiling streaming applications. Data can be profiled with any timestamp and the UI will automatically roll the data up to hourly granularity. When analyzing, a single target hour will be compared to the configured baseline.

The default flow waits for the hour to end before beginning analysis, assuming more data could arrive. For example, data logged with timestamps of 1:03pm and 1:35pm will not be analyzed until the hour has ended at 2pm.
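
Analyzers on an hourly dataset are typically scheduled with a matching hourly cadence. A minimal sketch (the id is a placeholder):

{
  "id": "hourly-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "hourly"
  }
  ...
}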

Daily#

Daily datasets conclude at midnight UTC. When analyzing, a single target day will be compared to the configured baseline.

The default flow waits for the day to end before beginning analysis, assuming more data could arrive. If that's too long of a wait and more eager analysis is desired, read the section below on allowPartialTargetBatches.

Weekly#

Weekly datasets conclude on Monday at midnight UTC. When analyzing, a single target week will be compared to the configured baseline.

The default flow waits for the week to end before beginning analysis, assuming more data could arrive. If that's too long of a wait and more eager analysis is desired, read the section below on allowPartialTargetBatches.

Monthly#

Monthly datasets conclude on the 1st of each month at midnight UTC. When analyzing, a single target month will be compared to the configured baseline.

The default flow waits for the month to end before beginning analysis, assuming more data could arrive. If that's too long of a wait and more eager analysis is desired, read the section below on allowPartialTargetBatches.

Gating Analysis#

Some data pipelines run on a fixed schedule, some run when cloud resources are cheaper, some run continuously in a distributed environment, and sometimes it's just running on a laptop.

WhyLabs provides a number of gating options to hold off on analysis until you're done profiling an hour/day/week/month. Analysis is immutable unless explicitly deleted, so controlling when analyzers run avoids analyzing while more data is still expected to arrive.

Data Readiness Duration#

If you need to delay analysis, optionally specify a dataReadinessDuration at the analyzer configuration level. Recall from dataset granularities that a dataset marked as daily would normally be considered ready for analysis at midnight UTC. A common scenario is a customer running their data pipeline later in the day who wants to pause analysis for a minimum fixed amount of time to accommodate; this is a perfect case for the dataReadinessDuration parameter.

Example options: P1D, PT19H

{
  "id": "stddev-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "dataReadinessDuration": "PT1H",
  "config": {
    "baseline": {
      "type": "TimeRange",
      "range": {
        "start": "2022-02-25T00:00Z",
        "end": "2022-03-25T00:00Z"
      }
    },
    "metric": "histogram",
    "type": "drift",
    "algorithm": "hellinger",
    "threshold": 0.7,
    "minBatchSize": 1
  },
  "disabled": false,
  "targetMatrix": {
    "type": "column",
    "include": [
      "group:continuous"
    ]
  },
  "backfillGracePeriodDuration": "P30D"
}

Batch Cooldown#

At the analyzer configuration level, optionally specify a batchCoolDownPeriod. Recall from dataset granularities that a dataset marked as daily would normally be considered ready for analysis at midnight UTC. While analysis already waits for a profile to arrive, this setting additionally delays analysis until there has been a quiet period of the specified length after the most recent profile was received.

Scenario: A customer's data pipeline is a distributed batch job with heavy data skew causing some tasks to take a while. A batchCoolDownPeriod of PT1H delays analysis until it's been one hour since receiving any additional profiles.

Example options: P1D, PT1H, PT30M

{
  "id": "stddev-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily",
    "batchCoolDownPeriod": "PT1H"
  },
  "config": {
    "baseline": {
      "type": "TimeRange",
      "range": {
        "start": "2022-02-25T00:00Z",
        "end": "2022-03-25T00:00Z"
      }
    },
    "metric": "histogram",
    "type": "drift",
    "algorithm": "hellinger",
    "threshold": 0.7,
    "minBatchSize": 1
  },
  "disabled": false,
  "targetMatrix": {
    "type": "column",
    "include": [
      "group:continuous"
    ]
  },
  "backfillGracePeriodDuration": "P30D"
}

Gating Target Completion#

The default flow is to wait for the window specified by the dataset granularity (hourly/daily/weekly/monthly) to end before analyzing a new datapoint.

For example, say a monthly dataset has data logged for the 6th of this month. The assumption is that further logging may introduce more data, so analysis should hold off until the window has ended. If desired, the target can instead be auto-acknowledged as soon as any data has been received so that analysis runs eagerly. This is controlled at the top level of the monitor config by setting allowPartialTargetBatches.

{
  "orgId": "org-0",
  "datasetId": "model-0",
  "granularity": "monthly",
  "allowPartialTargetBatches": true
  ...
}

With allowPartialTargetBatches enabled, analyzers can be fine-tuned. In this scenario, data for the current month can be analyzed after the 10th day.

{
  "id": "stddev-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily",
    "delayDuration": "P10D"
  }
  ...
}

In another scenario with allowPartialTargetBatches enabled, this analyzer config triggers analysis for the month as soon as profiles have arrived, but with one hour of wiggle room for a distributed environment to finish uploading.

{
  "id": "stddev-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily",
    "batchCoolDownPeriod": "PT1H"
  }
  ...
}

Backfills#

Customers commonly upload months or years of data profiles when establishing a new model. WhyLabs will automatically backdate some analysis; how far back is user-configurable at the analyzer level using backfillGracePeriodDuration.

Scenario: A customer backfills 5 years of data for a dataset. With a backfillGracePeriodDuration of P365D, the most recent year of analysis will be filled in automatically. Note that large backfills can take overnight to fully propagate.

{
  "id": "missing-values-ratio-eb484613",
  "backfillGracePeriodDuration": "P365D",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  }
  ...
}

Prevent Notifications From Backfills#

It's typically undesirable for analysis of old data from a backfill to trigger a notification (PagerDuty/email/Slack). Digest monitors can be configured with a datasetTimestampOffset filter at the monitor level to prevent notifications from going out on old data.

Scenario: A customer uploads 365 days of data with backfillGracePeriodDuration=P180D configured on an analyzer. An attached monitor specifies datasetTimestampOffset=P3D. In this scenario, the most recent 180 days would be analyzed, but only datapoints from the last 3 days would trigger notifications.

{
  "id": "outstanding-seagreen-okapi-2337",
  "displayName": "Output missing value ratio",
  "analyzerIds": [
    "outstanding-seagreen-okapi-2337-analyzer"
  ],
  "schedule": {
    "type": "immediate"
  },
  "severity": 3,
  "mode": {
    "type": "DIGEST",
    "datasetTimestampOffset": "P3D"
  }
}

Monitors#

Monitors define if and how to notify when anomalies are detected during analysis.

Digest Notifications#

This monitor runs immediately after analysis has identified anomalies. Notifications include high-level statistics and details for a sample of up to 100 anomalies. There must be at least one anomaly for a digest to be generated.

{
  "id": "outstanding-seagreen-okapi-2337",
  "displayName": "Output missing value ratio",
  "analyzerIds": [
    "outstanding-seagreen-okapi-2337-analyzer"
  ],
  "schedule": {
    "type": "immediate"
  },
  "severity": 3,
  "mode": {
    "type": "DIGEST"
  },
  "actions": [
    {
      "type": "global",
      "target": "slack"
    }
  ]
}

Filter Noisy Anomalies From Notifying#

Some analysis is useful but too noisy or not important enough to be worth notifying systems like PagerDuty. There are multiple filter options for being selective about which alerts get sent:

  • includeColumns - By default the floodgates are wide open; when includeColumns is provided, only columns in this list will be notified on
  • excludeColumns - Exclude notifications on any columns specified
  • minWeight - For customers supplying column/feature weights, this option filters out columns with weights below the threshold
  • maxWeight - Same as minWeight, but setting a cap on the column weight
{
  "monitors": [
    {
      "id": "outstanding-seagreen-okapi-2337",
      "displayName": "Output missing value ratio",
      "analyzerIds": [
        "outstanding-seagreen-okapi-2337-analyzer"
      ],
      "schedule": {
        "type": "immediate"
      },
      "mode": {
        "type": "DIGEST",
        "filter": {
          "includeColumns": [
            "a"
          ],
          "excludeColumns": [
            "very_noisey"
          ],
          "minWeight": 0.5,
          "maxWeight": 0.8
        }
      }
    }
  ]
}

Every Anomaly Notification#

Monitor digests are the most commonly used delivery option, but some customers require being notified of every single anomaly. In this scenario, the Slack channel will be notified of every anomaly generated by the drift_analyzer.

{
  "id": "drift-monitor-1",
  "analyzerIds": [
    "drift_analyzer"
  ],
  "actions": [
    {
      "type": "global",
      "target": "slack"
    }
  ],
  "schedule": {
    "type": "immediate"
  },
  "disabled": false,
  "severity": 2,
  "mode": {
    "type": "EVERY_ANOMALY"
  }
}

Notification Actions#

Internal systems such as email, Slack, and PagerDuty can be notified of anomalies. These are configured as global actions in the UI and subsequently referenced by monitors.

In this scenario, a monitor has been created to deliver every anomaly generated by the drift_analyzer to the Slack channel.

{
  "id": "drift-monitor-1",
  "analyzerIds": [
    "drift_analyzer"
  ],
  "actions": [
    {
      "type": "global",
      "target": "slack"
    }
  ],
  "schedule": {
    "type": "immediate"
  },
  "disabled": false,
  "severity": 2,
  "mode": {
    "type": "EVERY_ANOMALY"
  }
}

Severity#

Monitors can specify a severity, which is included in the delivered notification. In this scenario, a monitor digest that only reacts to anomalies on columns with a feature weight above 0.5 delivers severity level 3 notifications to the slack global action. For more information about setting feature weights, see the whylogs notebook.

{
  "id": "adorable-khaki-kudu-3389",
  "displayName": "adorable-khaki-kudu-3389",
  "severity": 3,
  "analyzerIds": [
    "drift_analyzer"
  ],
  "schedule": {
    "type": "immediate"
  },
  "mode": {
    "type": "DIGEST",
    "filter": {
      "minWeight": 0.5
    }
  },
  "actions": [
    {
      "type": "global",
      "target": "slack"
    }
  ]
}

Comparison#

WhyLabs provides a number of ways to compare targets to baseline data.

Drift#

WhyLabs uses Hellinger distance to calculate drift. Hellinger distance is symmetric (unlike, say, KL divergence), well defined for both categorical and numerical features (unlike, say, the Kolmogorov-Smirnov statistic), and has a clear analogy to Euclidean distance. It's not as popular in the ML community, but it has stronger adoption in both statistics and physics. If additional drift algorithms are needed, contact us.
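
For two discrete distributions P = (p₁, ..., pₖ) and Q = (q₁, ..., qₖ) over the same bins, the Hellinger distance is

H(P, Q) = (1/√2) · √( Σᵢ (√pᵢ − √qᵢ)² )

It is bounded between 0 (identical distributions) and 1 (disjoint distributions), which is why drift thresholds such as the 0.7 in the example below fall within that range.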

{
  "id": "muddy-green-chinchilla-1108-analyzer",
  "config": {
    "metric": "frequent_items",
    "baseline": {
      "type": "Reference",
      "profileId": "ref-MHxddU9naW0ptlAg"
    },
    "type": "drift",
    "algorithm": "hellinger",
    "threshold": 0.7
  },
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "column",
    "include": [
      "group:discrete"
    ],
    "exclude": [
      "group:output"
    ]
  }
}

Diff#

Percent#

A target can be compared to a baseline for percentage change. In this case, a change of more than 2 percent generates an anomaly.
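
For example, if the 7-day trailing window baseline works out to an accuracy of 0.90 and the target day's accuracy is 0.85, the percentage change is |0.85 − 0.90| / 0.90 ≈ 5.6%, which exceeds the threshold of 2 and generates an anomaly. (The numbers here are illustrative.)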

{
  "id": "cheerful-lemonchiffon-echidna-4053-analyzer",
  "config": {
    "metric": "classification.accuracy",
    "type": "diff",
    "mode": "pct",
    "threshold": 2,
    "baseline": {
      "type": "TrailingWindow",
      "size": 7
    }
  },
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "dataset",
    "segments": []
  }
}

Fixed Threshold#

Compare a target against a static upper/lower bound.

{
  "id": "missing-datapoint-analyzer",
  "config": {
    "metric": "missingDatapoint",
    "type": "fixed",
    "upper": 1
  },
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "disabled": false,
  "targetMatrix": {
    "type": "dataset",
    "segments": [
      {
        "tags": []
      }
    ]
  },
  "dataReadinessDuration": "P1DT18H"
}

Standard Deviations#
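
A stddev analyzer derives an expected range from the baseline and flags targets that fall more than factor standard deviations outside of it. See the Median section below for notes on the factor and minBatchSize parameters used in this example.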

{
  "id": "drift_analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "config": {
    "type": "stddev",
    "factor": 5,
    "metric": "median",
    "minBatchSize": 7,
    "baseline": {
      "type": "TrailingWindow",
      "size": 14
    }
  },
  "disabled": false,
  "targetMatrix": {
    "type": "column",
    "include": [
      "*"
    ]
  },
  "backfillGracePeriodDuration": "P30D"
}

Metrics#

WhyLabs provides a wide array of metrics to use for analysis.

Median#

The median metric is derived from the KLL data sketch histogram in whylogs. The following analyzer compares a daily target against the previous 14-day trailing window on all columns, for both the overall segment and the purpose=small_business segment.

Notes:

  • A factor of 5 is the multiplier used to calculate the upper and lower bounds
  • minBatchSize indicates there must be at least 7 days of data present in the 14-day trailing window in order to analyze. This can be used to make analyzers less noisy when there's little data in the baseline to compare against.
{
  "id": "drift_analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "config": {
    "version": 1,
    "type": "stddev",
    "metric": "median",
    "factor": 5,
    "minBatchSize": 7,
    "baseline": {
      "type": "TrailingWindow",
      "size": 14
    }
  },
  "disabled": false,
  "targetMatrix": {
    "type": "column",
    "include": [
      "*"
    ]
  },
  "backfillGracePeriodDuration": "P30D"
}

Frequent Items#

This metric captures the most frequently observed values in a dataset. Capturing this metric can be disabled at the whylogs level for customers profiling sensitive data. In this scenario, a target's frequent items are compared against a reference profile of known-good data. Additionally, this example only targets discrete input columns due to the targetMatrix configuration.

{
  "id": "muddy-green-chinchilla-1108-analyzer",
  "config": {
    "metric": "frequent_items",
    "baseline": {
      "type": "Reference",
      "profileId": "ref-MHxddU9naW0ptlAg"
    },
    "type": "drift",
    "algorithm": "hellinger",
    "threshold": 0.7
  },
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "column",
    "include": [
      "group:discrete"
    ],
    "exclude": [
      "group:output"
    ]
  }
}

Classification Recall#

In this scenario, the classification recall metric compares a target against the previous 7 days with an anomaly threshold of a two percent change. For more information about sending model performance metrics to WhyLabs, see https://nbviewer.org/github/whylabs/whylogs/blob/mainline/python/examples/integrations/writers/Writing_Classification_Performance_Metrics_to_WhyLabs.ipynb

{
  "id": "successful-cornsilk-hamster-3862-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "dataset",
    "segments": []
  },
  "config": {
    "metric": "classification.recall",
    "baseline": {
      "type": "TrailingWindow",
      "size": 7
    },
    "type": "diff",
    "mode": "pct",
    "threshold": 2
  }
}

Classification Precision#

{
  "id": "odd-powderblue-owl-9385-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "dataset",
    "segments": []
  },
  "config": {
    "metric": "classification.precision",
    "baseline": {
      "type": "TrailingWindow",
      "size": 7
    },
    "type": "diff",
    "mode": "pct",
    "threshold": 2
  }
}

Classification FPR#

{
  "id": "odd-powderblue-owl-9385-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "dataset",
    "segments": []
  },
  "config": {
    "metric": "classification.fpr",
    "baseline": {
      "type": "TrailingWindow",
      "size": 7
    },
    "type": "diff",
    "mode": "pct",
    "threshold": 2
  }
}

Classification Accuracy#

{
  "id": "odd-powderblue-owl-9385-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "dataset",
    "segments": []
  },
  "config": {
    "metric": "classification.accuracy",
    "baseline": {
      "type": "TrailingWindow",
      "size": 7
    },
    "type": "diff",
    "mode": "pct",
    "threshold": 2
  }
}

Classification F1#

{
  "id": "odd-powderblue-owl-9385-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "dataset",
    "segments": []
  },
  "config": {
    "metric": "classification.f1",
    "baseline": {
      "type": "TrailingWindow",
      "size": 7
    },
    "type": "diff",
    "mode": "pct",
    "threshold": 2
  }
}

Regression MSE#

{
  "id": "odd-powderblue-owl-9385-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "dataset",
    "segments": []
  },
  "config": {
    "metric": "regression.mse",
    "baseline": {
      "type": "TrailingWindow",
      "size": 7
    },
    "type": "diff",
    "mode": "pct",
    "threshold": 2
  }
}

Regression MAE#

{
  "id": "odd-powderblue-owl-9385-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "dataset",
    "segments": []
  },
  "config": {
    "metric": "regression.mae",
    "baseline": {
      "type": "TrailingWindow",
      "size": 7
    },
    "type": "diff",
    "mode": "pct",
    "threshold": 2
  }
}

Regression RMSE#

{
  "id": "odd-powderblue-owl-9385-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "dataset",
    "segments": []
  },
  "config": {
    "metric": "regression.rmse",
    "baseline": {
      "type": "TrailingWindow",
      "size": 7
    },
    "type": "diff",
    "mode": "pct",
    "threshold": 2
  }
}

Uniqueness#

Uniqueness in whylogs is efficiently measured with the HyperLogLog algorithm, typically with a 2% margin of error.

  • unique_est - The estimated number of unique values
  • unique_est_ratio - The estimated unique count divided by the total count
{
  "id": "pleasant-linen-albatross-6992-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "column",
    "include": [
      "group:discrete"
    ],
    "exclude": [
      "group:output"
    ],
    "segments": []
  },
  "config": {
    "metric": "unique_est_ratio",
    "type": "fixed",
    "upper": 0.5,
    "lower": 0.2
  }
}

Count#

{
  "id": "pleasant-linen-albatross-6992-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "column",
    "include": [
      "group:continuous"
    ],
    "exclude": [
      "group:output"
    ],
    "segments": []
  },
  "config": {
    "metric": "count",
    "type": "fixed",
    "upper": 100,
    "lower": 10
  }
}

Mean#

{
  "id": "pleasant-linen-albatross-6992-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "column",
    "include": [
      "group:continuous"
    ],
    "exclude": [
      "group:output"
    ],
    "segments": []
  },
  "config": {
    "metric": "mean",
    "type": "fixed",
    "lower": 10.0
  }
}

Min/Max#

Min and max values are derived from the KLL sketch.

{
  "id": "pleasant-linen-albatross-6992-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "column",
    "include": [
      "group:continuous"
    ],
    "exclude": [
      "group:output"
    ],
    "segments": []
  },
  "config": {
    "metric": "min",
    "type": "fixed",
    "lower": 10.0
  }
}

Expected Values Comparison#

Compare a numeric metric against a static set of values. In the example below, the count_bool metric is expected to be either 0 or 10; a value of 7 would generate an anomaly.

Operators

  • in - Metric is expected to be contained within these values, generate an anomaly otherwise
  • not_in - Metric is expected to never fall within these values, generate an anomaly if they do

Expected

  • int - Compare two integers
  • float - Compare two floating point numbers down to the unit of least precision (ULP)
{
  "id": "list-comparison-6992-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "column",
    "include": [
      "*"
    ],
    "segments": []
  },
  "config": {
    "metric": "count_bool",
    "type": "list_comparison",
    "operator": "in",
    "expected": [
      {"int": 0},
      {"int": 10}
    ]
  }
}

Schema Count Metrics#

Whylogs performs schema inference, tracking counts for each inferred data type. Each count, as well as the ratio of that count to the total, can be accessed with the following metrics:

  • count_bool
  • count_bool_ratio
  • count_integral
  • count_integral_ratio
  • count_fractional
  • count_fractional_ratio
  • count_string
  • count_string_ratio
  • count_null
  • count_null_ratio
{
  "id": "pleasant-linen-albatross-6992-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "column",
    "include": [
      "group:continuous"
    ],
    "exclude": [
      "group:output"
    ],
    "segments": []
  },
  "config": {
    "metric": "count_bool",
    "type": "fixed",
    "lower": 10
  }
}

Missing Data#

Most metrics wait for profile data to be uploaded before analyzing; missingDatapoint is an exception. This metric is most useful for detecting broken integrations with WhyLabs. Use dataReadinessDuration to control how long to wait before notifying. While very similar in purpose to the secondsSinceLastUpload metric, the missingDatapoint analyzer can also detect misconfigured timestamps at the whylogs level. Note this metric does not fire for datasets which have never had data uploaded.

In the following scenario, the analyzer creates an anomaly for a datapoint which has not been uploaded to WhyLabs after 1 day and 18 hours have passed. Given the empty tags array, it creates an anomaly if no data has been uploaded for the entire dataset. Segmentation can be used to raise alarms for specific segments; see Targeting Segments for more information.

{
  "id": "missing-datapoint-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "disabled": false,
  "targetMatrix": {
    "type": "dataset",
    "segments": []
  },
  "dataReadinessDuration": "P1DT18H",
  "config": {
    "metric": "missingDatapoint",
    "type": "fixed",
    "upper": 0
  }
}

Missing Segment Data#

In the example below, a dataset has been segmented by country. We wish to alert if any country stops receiving data after 18 hours have passed.

{
  "id": "missing-datapoint-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "disabled": false,
  "targetMatrix": {
    "type": "dataset",
    "segments": [
      {
        "tags": [
          {
            "key": "country",
            "value": "*"
          }
        ]
      }
    ]
  },
  "dataReadinessDuration": "PT18H",
  "backfillGracePeriodDuration": "P30D",
  "config": {
    "metric": "missingDatapoint",
    "type": "fixed",
    "upper": 0
  }
}

Seconds Since Last Upload#

Most metrics wait for profile data to be uploaded before analyzing; secondsSinceLastUpload is an exception. In this scenario, an anomaly is generated when it's been more than a day (86400 seconds) since the last upload for this dataset. Note this metric does not fire for datasets which have never had data uploaded.

{
  "id": "missing_upload_analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "config": {
    "type": "fixed",
    "version": 1,
    "metric": "secondsSinceLastUpload",
    "upper": 86400
  },
  "disabled": false,
  "targetMatrix": {
    "type": "dataset",
    "segments": []
  }
}

Frequent String Comparison#

The frequent string analyzer utilizes the frequent items sketch. Capturing this can be disabled at the whylogs level for customers profiling sensitive data. In the example below, a reference profile with the column dayOfWeek has been uploaded with Monday-Sunday as the expected values. A target value of "September" would generate an anomaly.

Note: This analyzer is only suitable for low-cardinality columns (fewer than 100 possible values). Comparisons are case sensitive. In later versions of whylogs, only the first 128 characters of a string are considered significant.

Operators

  • eq - Target is expected to contain every element in the baseline and vice versa. When not the case, generate an anomaly.
  • target_includes_all_baseline - Target is expected to contain every element in the baseline. When not the case, generate an anomaly.
  • baseline_includes_all_target - Baseline is expected to contain every element in the target. When not the case, generate an anomaly.
{
  "id": "frequent-items-comparison-analyzer",
  "schedule": {
    "type": "fixed",
    "cadence": "daily"
  },
  "targetMatrix": {
    "type": "column",
    "include": [
      "dayOfWeek"
    ],
    "segments": []
  },
  "config": {
    "metric": "frequent_items",
    "type": "frequent_string_comparison",
    "operator": "baseline_includes_all_target",
    "baseline": {
      "type": "Reference",
      "profileId": "ref-MHxddU9naW0ptlAg"
    }
  }
}