Mergeability

Mergeability is a powerful property of whylogs profiles. When multiple profiles are uploaded for the same date (or same hour for hourly models), WhyLabs automatically combines these profiles into one.

This means that fragments of your dataset can be profiled asynchronously. The resulting profiles are automatically aggregated into a single profile that is equivalent to the one you would generate by profiling the entire dataset in one operation.

[Figure: Merged Profile]

Note that whylogs profiles are designed so that merging works for subprofiles with different row counts: merging does not simply take a straight mean of descriptive statistics.
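To make the point about row counts concrete, here is a minimal sketch in plain Python. It is a toy model only: `ToySummary` and its fields are hypothetical names invented for illustration, not the whylogs data structures. The idea it shows is that a summary stays mergeable when it stores counts and sums rather than finished averages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySummary:
    """Hypothetical mergeable summary of one numeric column (not whylogs)."""
    count: int = 0                    # non-null values seen
    nulls: int = 0                    # null values seen
    total: float = 0.0                # sum of non-null values
    minimum: float = float("inf")
    maximum: float = float("-inf")

    @classmethod
    def from_values(cls, values):
        values = list(values)
        present = [v for v in values if v is not None]
        return cls(
            count=len(present),
            nulls=len(values) - len(present),
            total=float(sum(present)),
            minimum=min(present, default=float("inf")),
            maximum=max(present, default=float("-inf")),
        )

    def merge(self, other):
        # Every field combines without needing the raw data again.
        return ToySummary(
            count=self.count + other.count,
            nulls=self.nulls + other.nulls,
            total=self.total + other.total,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")


small = [10.0, 20.0]                       # 2 rows
large = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]     # 6 rows

merged = ToySummary.from_values(small).merge(ToySummary.from_values(large))
whole = ToySummary.from_values(small + large)

# A straight mean of the two sub-means ignores the row counts.
naive_mean = (ToySummary.from_values(small).mean + ToySummary.from_values(large).mean) / 2

print(merged == whole)   # merged summary matches the single-pass summary
print(merged.mean)       # weighted correctly by row count
print(naive_mean)        # differs, because both fragments were weighted equally
```

In this toy case the merged summary equals the one computed over all eight rows at once, while the naive mean of means does not.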

Mergeability benefits users across a variety of use cases.

Distributed Pipelines

Mergeability allows for easy profiling of data which lives in distributed pipelines.

[Figure: Spark]

In this example, each data partition can be profiled independently to produce a holistic view of your dataset within WhyLabs.
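The partition-then-merge pattern can be sketched without any cluster. The example below is an illustration under stated assumptions: `profile_partition` is a hypothetical stand-in for a real profiler, and a `Counter` of category frequencies stands in for a real profile. It shows that partitions can be profiled in parallel and merged in any order.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def profile_partition(rows):
    """Hypothetical per-partition profiler: frequency counts for one column."""
    return Counter(rows)


# Three partitions of different sizes, as a distributed job might produce.
partitions = [
    ["cat", "dog", "cat"],
    ["dog"],
    ["bird", "cat", "dog", "dog"],
]

# Profile each partition independently (threads stand in for workers).
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_profiles = list(pool.map(profile_partition, partitions))

# Merge the partial profiles; Counter addition sums the per-key counts.
merged = reduce(lambda a, b: a + b, partial_profiles, Counter())
merged_reversed = reduce(lambda a, b: a + b, reversed(partial_profiles), Counter())

# Profile of the full dataset in one pass, for comparison.
whole = profile_partition([row for part in partitions for row in part])

print(merged == whole)            # same result as a single pass
print(merged == merged_reversed)  # merge order does not matter
```

Because the merge is associative and commutative, workers do not need to coordinate on when or in what order their results arrive.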

Multi-Modal Datasets

In the case of multi-modal models, users may have two distinct datasets which feed into their models. These datasets can be profiled independently and tracked as a single dataset within WhyLabs. In the example below, an image dataset is supplemented with tabular metadata, and the profiles of both are merged within a single WhyLabs model.

[Figure: Multi-Modal Dataset]
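One way to picture merging profiles from two modalities is as a union of their columns. The sketch below assumes that behavior and is not the whylogs implementation; the column names (`image.brightness`, `camera_id`, and so on) and both helper functions are hypothetical.

```python
def profile_columns(records):
    """Hypothetical profiler: count of non-null values per column."""
    counts = {}
    for record in records:
        for column, value in record.items():
            if value is not None:
                counts[column] = counts.get(column, 0) + 1
    return counts


def merge_profiles(a, b):
    """Union the columns; counts are summed where a column appears in both."""
    merged = dict(a)
    for column, count in b.items():
        merged[column] = merged.get(column, 0) + count
    return merged


# Features extracted from images (illustrative values).
image_features = [
    {"image.brightness": 0.61, "image.width": 640},
    {"image.brightness": 0.48, "image.width": 800},
]

# Tabular metadata describing the same two images.
tabular_metadata = [
    {"camera_id": "A7", "label": "cat"},
    {"camera_id": "B2", "label": None},
]

combined = merge_profiles(
    profile_columns(image_features),
    profile_columns(tabular_metadata),
)

print(sorted(combined))  # columns from both datasets appear in one profile
```

Here the two profiles share no columns, so the merged profile simply carries all four columns side by side.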

Reference Profiles

Normally, profiles are uploaded as batch profiles. Profiles can also be uploaded as reference (static) profiles, which can be used as a baseline for monitors, data drift inspection, and data quality checks in general. Reference profiles are never merged, regardless of the associated timestamp.

If you want to know how to upload reference profiles to WhyLabs, please refer to the following example notebook:

- Writing Reference Profiles to WhyLabs

Best Practices

Since mergeability is built into whylogs/WhyLabs, users must follow some best practices when uploading profiles to WhyLabs. Most importantly, a particular dataset should only be profiled once per day for daily models and once per hour for hourly models.

If multiple profiles are uploaded within one day, these profiles will always be merged. One profile will never overwrite another. Uploading the same data more than once can therefore change the value counts, but it won't result in false alerts, since the distribution shape, null value fraction, etc. remain unchanged.
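The effect of an accidental double upload can be illustrated with a toy summary. This is a sketch under simplifying assumptions, not WhyLabs behavior reproduced exactly: `summarize`, `merge`, and the dictionary keys are hypothetical names.

```python
def summarize(values):
    """Hypothetical summary: row count, null count, and sum of non-null values."""
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "total": sum(present),
    }


def merge(a, b):
    """Merging adds every field; nothing is overwritten."""
    return {key: a[key] + b[key] for key in a}


def null_fraction(summary):
    return summary["nulls"] / summary["rows"]


def mean(summary):
    return summary["total"] / (summary["rows"] - summary["nulls"])


daily_values = [4.0, None, 6.0, 8.0]

once = summarize(daily_values)
twice = merge(once, summarize(daily_values))  # same data uploaded a second time

print(once["rows"], twice["rows"])                # the counts double
print(null_fraction(once), null_fraction(twice))  # the ratio is unchanged
print(mean(once), mean(twice))                    # the mean is unchanged
```

Counts grow with each duplicate upload, while ratios and averages stay the same, which is why profiling each dataset once per batch keeps the counts meaningful.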
