Mergeability

Mergeability is a powerful property of whylogs profiles. When multiple profiles are uploaded for the same date (or same hour for hourly models), WhyLabs automatically combines these profiles into one.

This means that multiple fragments of your dataset can be profiled asynchronously. The resulting profiles are automatically aggregated into a single profile, equivalent to the one you would get by profiling the entire dataset in one operation.

[Figure: Merged Profile]

whylogs profiles are designed so that merging works correctly for subprofiles with different row counts. Merging does not simply take a straight mean of each subprofile's descriptive statistics.
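
To illustrate why row counts matter, here is a minimal sketch of a mergeable summary for one numeric column. `NumericSummary` is a hypothetical toy class written for this example, not the whylogs profile format or API. It shows only that a merge which tracks counts and sums reproduces the whole-dataset result, while averaging the two means does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumericSummary:
    """Toy mergeable summary of one numeric column (hypothetical, not whylogs)."""
    count: int
    total: float
    minimum: float
    maximum: float

    @classmethod
    def from_values(cls, values):
        return cls(len(values), float(sum(values)), float(min(values)), float(max(values)))

    def merge(self, other):
        # Counts and sums add; extremes combine with min/max.
        return NumericSummary(
            self.count + other.count,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    @property
    def mean(self):
        return self.total / self.count


small = [10.0, 20.0]                      # fragment with 2 rows
large = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]    # fragment with 6 rows

small_summary = NumericSummary.from_values(small)
large_summary = NumericSummary.from_values(large)

merged = small_summary.merge(large_summary)
whole = NumericSummary.from_values(small + large)

# A straight mean of the two fragment means ignores the row counts.
naive_mean = (small_summary.mean + large_summary.mean) / 2

print(merged == whole)   # True
print(merged.mean)       # 6.375
print(naive_mean)        # 9.25
```

The merged summary matches the summary of the full dataset, whereas the naive average over-weights the smaller fragment.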

Mergeability benefits users across a variety of use cases.

Distributed Pipelines

Mergeability allows for easy profiling of data which lives in distributed pipelines.

[Figure: Spark]

In this example, each data partition is profiled independently, and the resulting profiles are merged to produce a holistic view of your dataset within WhyLabs.
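
The partition-level pattern can be sketched in plain Python. This is an illustration only: `profile_partition` is a hypothetical stand-in for a real profiler, the thread pool stands in for distributed workers, and a `Counter` stands in for a profile. None of it is the whylogs or Spark API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def profile_partition(rows):
    """Hypothetical profiler: count rows and missing values in one partition."""
    profile = Counter()
    for value in rows:
        profile["rows"] += 1
        if value is None:
            profile["nulls"] += 1
    return profile


partitions = [[1, None, 3], [4, 5], [None, None, 8, 9]]

# Each partition is profiled independently, here on separate threads.
with ThreadPoolExecutor() as pool:
    partition_profiles = list(pool.map(profile_partition, partitions))

# Merging is plain addition of the per-partition counters.
merged_profile = sum(partition_profiles, Counter())

# Profiling all rows in one pass gives the same result.
full_profile = profile_partition([v for part in partitions for v in part])

print(merged_profile == full_profile)  # True
```

Because the merge is associative and commutative, the order in which partition profiles arrive does not change the merged result.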

Multi-Modal Datasets

In the case of multi-modal models, users may have two distinct datasets which feed into their models. These datasets can be profiled independently and tracked as a single dataset within WhyLabs. In the example below, an image dataset is supplemented with tabular metadata, and the profiles of both are merged within a single WhyLabs model.

[Figure: Multi-Modal Dataset]
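
As a rough sketch of what merging profiles from two modalities could look like, the example below represents a profile as a mapping from column name to value counts. The structure, the `merge_profiles` helper, and the column names are all hypothetical and chosen for illustration. They are not the whylogs data model.

```python
from collections import Counter


def merge_profiles(left, right):
    """Merge two {column name: Counter} profiles (hypothetical structure)."""
    merged = {name: Counter(counts) for name, counts in left.items()}
    for name, counts in right.items():
        # Shared columns add their counts; new columns are carried over.
        merged[name] = merged.get(name, Counter()) + counts
    return merged


# Profile derived from the image dataset (made-up feature name).
image_profile = {"image.brightness_bucket": Counter({"dark": 2, "bright": 5})}

# Profile derived from the tabular metadata (made-up feature name).
metadata_profile = {"camera_model": Counter({"A": 4, "B": 3})}

model_profile = merge_profiles(image_profile, metadata_profile)

print(sorted(model_profile))  # ['camera_model', 'image.brightness_bucket']
```

When the two datasets have no columns in common, the merged profile is simply the union of their columns.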

Reference Profiles

WhyLabs monitors detect anomalies in batch profiles by comparing them against baseline metrics, typically an aggregate of metrics from earlier time points. However, a trailing window baseline may not be appropriate for every application. Some monitors may need to compare batch profiles against a training set, a validation set, or just a sample of the dataset which is known to be healthy. That is what Reference Profiles are for.

Reference profiles are just like batch profiles, except that they are never merged across timestamps and they do not show up in timeseries graphs on the WhyLabs dashboard. Each reference profile is assigned a unique ID which may be used as a baseline in any monitor configuration.

Note that a reference profile should include the same features as other batch profiles in the model. There would be no point in comparing a batch profile containing a feature "annual_income" against a reference profile that does not include that feature.
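
A simple pre-upload check along these lines can catch feature mismatches early. The sketch below assumes you already have the feature names of each profile as plain lists. `missing_features` is a hypothetical helper, and every feature name other than "annual_income" is made up for the example.

```python
def missing_features(batch_columns, reference_columns):
    """Return batch features that have no counterpart in the reference profile."""
    return sorted(set(batch_columns) - set(reference_columns))


batch_columns = ["annual_income", "loan_amount", "credit_score"]
reference_columns = ["loan_amount", "credit_score"]

gaps = missing_features(batch_columns, reference_columns)
if gaps:
    print(f"Reference profile is missing features: {gaps}")
```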

Learn how to upload reference profiles to WhyLabs in the following example notebook:

- Writing Reference Profiles to WhyLabs

Best Practices

Since mergeability is built into whylogs/WhyLabs, users should follow some best practices when uploading profiles to WhyLabs. Most importantly, a particular dataset should only be profiled once per day for daily models and once per hour for hourly models.

If the same dataset is profiled and uploaded more than once within one day, the profiles will always be merged. One profile will never overwrite another. This inflates count-based metrics such as value counts, but won't result in false alerts, since the distribution shape, null value fraction, etc. remain unchanged.
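
The effect of a duplicate upload can be sketched with a toy profile. The dictionary structure and the helper functions here are hypothetical and stand in for a real profile. The point is only that merging a profile with a copy of itself doubles the counts while leaving the ratios untouched.

```python
from collections import Counter


def profile(values):
    """Toy profile: row count, null count, and value counts (hypothetical)."""
    return {
        "rows": len(values),
        "nulls": sum(v is None for v in values),
        "value_counts": Counter(v for v in values if v is not None),
    }


def merge(a, b):
    """Merge two toy profiles by adding their counts."""
    return {
        "rows": a["rows"] + b["rows"],
        "nulls": a["nulls"] + b["nulls"],
        "value_counts": a["value_counts"] + b["value_counts"],
    }


def null_fraction(p):
    return p["nulls"] / p["rows"]


def value_shares(p):
    """Share of each non-null value, a crude proxy for distribution shape."""
    total = sum(p["value_counts"].values())
    return {value: count / total for value, count in p["value_counts"].items()}


daily = profile(["a", "a", "b", None])
uploaded_twice = merge(daily, daily)  # same data profiled and uploaded twice

print(daily["rows"], uploaded_twice["rows"])                   # 4 8
print(null_fraction(daily) == null_fraction(uploaded_twice))   # True
print(value_shares(daily) == value_shares(uploaded_twice))     # True
```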
