Uploading Profiles to WhyLabs

Collecting Profiles

Data ingestion in WhyLabs is done by uploading whylogs profiles, which are statistical summaries of the source data, to the platform. Profiling can be done offline and locally within a customer's environment. To upload the generated profiles to your project's dashboard, you will need an API key and must explicitly write the profile to WhyLabs.
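
The WhyLabsWriter used below reads its credentials from environment variables (they can also be passed programmatically). A minimal sketch, assuming the environment variable names used by recent whylogs versions; the organization ID, dataset ID, and API key below are placeholders:

import os

# Placeholder values; replace with your organization ID, model/dataset ID,
# and an API key generated in the WhyLabs dashboard.
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "org-0"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-123"
os.environ["WHYLABS_API_KEY"] = "your-api-key-here"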

Once your WhyLabs credentials are configured, profiling your data with whylogs and sending the profile to WhyLabs can be as simple as:

import whylogs as why
from whylogs.api.writer.whylabs import WhyLabsWriter

# profile the DataFrame with whylogs
profile = why.log(df).profile()

# upload the profile to WhyLabs using your configured credentials
writer = WhyLabsWriter()
writer.write(file=profile.view())

If you want to know more about writing profiles to WhyLabs, please refer to the example notebook in the whylogs GitHub repository.

Profile Management in WhyLabs

To understand how profiles are managed in WhyLabs, keep two things in mind:

  • profiles are mergeable
  • profiles have a dataset timestamp

An uploaded profile will be safely stored in WhyLabs, but how it is displayed in the dashboard depends on the project's configuration. For example, if the project was configured to have a daily/hourly/weekly frequency, profiles will be merged and displayed in your dashboard in a daily/hourly/weekly granularity, respectively.

For example, if 17 profiles were uploaded to an hourly project between 02:00 and 03:00, this information would be displayed throughout your dashboard as a single profile representing the statistical summary for the combination of the 17 profiles.
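
For instance, two profiles pinned to the same dataset timestamp will show up as one merged summary in an hourly project. A minimal sketch, assuming df_1 and df_2 are pandas DataFrames and that your WhyLabs credentials are configured as above:

from datetime import datetime, timezone

import whylogs as why
from whylogs.api.writer.whylabs import WhyLabsWriter

# profile two batches of the same dataset
profile_1 = why.log(df_1).profile()
profile_2 = why.log(df_2).profile()

# pin both profiles to the same (UTC) dataset timestamp so they land
# in the same batch in WhyLabs
batch_ts = datetime(2022, 11, 28, 20, 0, 0, tzinfo=timezone.utc)
profile_1.set_dataset_timestamp(batch_ts)
profile_2.set_dataset_timestamp(batch_ts)

# upload both; the dashboard will display a single merged summary for this batch
writer = WhyLabsWriter()
writer.write(file=profile_1.view())
writer.write(file=profile_2.view())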

Integration Options

REST API

Most customers integrating with WhyLabs opt to upload profiles via the mechanism built into the open source whylogs library. Simply obtain an API key, plug it in, and you're ready to go.

Speed of data availability

A profile uploaded for any data timestamp, past or present, will typically become visible in the UI in under a minute (often just a few seconds).

S3

Some enterprise customers operate with very strict network egress rules that block access to cloud-based REST APIs. For such scenarios, we offer the ability to pull profiles written to a blob store such as AWS S3. Contact us for more information!

Speed of data availability

Cross-account blob store integrations currently have a one-day turnaround for new profiles to be processed and visible in the UI.

Distributed Environment Logging (Spark/Flink/Kafka)

Depending on how you use whylogs in your environment, you might have multiple machines in a distributed context generating profiles for the same dataset, time, and segment. whylogs profiles are easily mergeable, making it easy to reduce data egress volume prior to uploading. If profiles are not merged prior to upload, that's okay too: the WhyLabs platform will merge them automatically.
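
If you do want to merge before uploading, the profile views can be combined into one. A minimal sketch, assuming worker_views is a list of DatasetProfileView objects collected from the individual workers for the same dataset, hour, and segment:

from functools import reduce

from whylogs.api.writer.whylabs import WhyLabsWriter

# combine the per-worker views into a single view
merged_view = reduce(lambda a, b: a.merge(b), worker_views)

# one upload instead of one per worker, reducing egress volume
writer = WhyLabsWriter()
writer.write(file=merged_view)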

For example, suppose a Kafka topic is being profiled with whylogs by multiple consumer instances, each emitting a profile once an hour. The profiles will automatically merge and reflect changes throughout the day.

Realtime Profile Merging

If you want to know more about uploading profiles in a distributed environment using specific tools, check the respective documentation in the Integrations section.

Deleting Profiles

To clear any data displayed in the platform, you need to remove profiles using the DeleteDatasetProfiles API. It's important to make sure you're entering the correct start and end time values, which need to be passed as UTC millisecond timestamps. This conversion can be performed programmatically (see the Python code below) or using one of the available converter websites.

from datetime import datetime

def datetime_to_timestamp(dt):
    epoch = datetime.utcfromtimestamp(0)
    return int((dt - epoch).total_seconds() * 1000)

# convert '11/28/2022, 20:00:00' UTC to a Unix timestamp in milliseconds
datetime_to_timestamp(datetime(2022, 11, 28, 20, 0, 0))
>>> 1669665600000

For example, if we need to delete the profile(s) from November 28th 2022 uploaded to model-123 in org-0, we should execute the following command:

curl -I -X 'DELETE' \
  'https://api.whylabsapp.com/v0/organizations/org-0/dataset-profiles/models/model-123?profile_start_timestamp=1669593600000&profile_end_timestamp=1669680000000' \
  -H 'accept: application/json' \
  -H 'X-API-Key: your-api-key-here'

The start timestamp denotes 11-28-2022 00:00:00 and the end timestamp denotes 11-29-2022 00:00:00. Again, it's crucial to remember that these timestamps refer to the UTC time zone. Please be aware that the deletion API won't work if either timestamp is less than one hour old.

The same can be achieved using our Python API client.
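
As a rough sketch of what that might look like, assuming the whylabs-client package's generated DatasetProfileApi exposes a delete_dataset_profiles method mirroring the REST endpoint's parameters (check the client's reference documentation for the exact names):

import whylabs_client
from whylabs_client.api.dataset_profile_api import DatasetProfileApi

configuration = whylabs_client.Configuration(host="https://api.whylabsapp.com")
configuration.api_key["ApiKeyAuth"] = "your-api-key-here"  # auth scheme name is assumed

with whylabs_client.ApiClient(configuration) as api_client:
    api = DatasetProfileApi(api_client)
    # same UTC millisecond timestamps as in the curl example above
    api.delete_dataset_profiles(
        org_id="org-0",
        dataset_id="model-123",
        profile_start_timestamp=1669593600000,
        profile_end_timestamp=1669680000000,
    )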

The deletion happens at the top of the hour and may take up to an hour depending on the volume of data to be removed.

Overwriting Profiles

In short, to overwrite a profile you first need to delete it, wait until the next hour for it to be removed, and then upload the data for the same timestamp.
The deletion instructions are listed here.
After the unwanted profiles are deleted, you can re-upload the corrected data for this period. To ensure the backfill happens with the next monitor run, check that the backfillGracePeriodDuration in your analyzer configuration covers the deleted profile's time period. You can find more details about this parameter on this documentation page.
To suppress any notifications related to past anomalies identified during the backfill, add the datasetTimestampOffset parameter to your monitor configuration as described in our documentation here.
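
As a sketch of the final re-upload step, assuming corrected_df holds the corrected data for the deleted period and your WhyLabs credentials are configured as above:

from datetime import datetime, timezone

import whylogs as why
from whylogs.api.writer.whylabs import WhyLabsWriter

# profile the corrected data and pin it to the original (UTC) batch timestamp
# so it backfills the period that was just deleted
profile = why.log(corrected_df).profile()
profile.set_dataset_timestamp(datetime(2022, 11, 28, 0, 0, 0, tzinfo=timezone.utc))

WhyLabsWriter().write(file=profile.view())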
