Uploading Profiles to WhyLabs
Collecting Profiles
Data ingestion in WhyLabs is done by uploading whylogs profiles, which are statistical summaries of the source data, to the platform. Profiling can be done offline and locally within a customer's environment. To upload the generated profiles to your project's dashboard, you will need an API key and must explicitly write the profile to WhyLabs.
Once your WhyLabs credentials are configured, profiling your data with whylogs and sending the profile to WhyLabs can be as simple as:
import whylogs as why
from whylogs.api.writer.whylabs import WhyLabsWriter

# df is the pandas DataFrame you want to profile
profile = why.log(df).profile()

# The writer uses your configured WhyLabs credentials
writer = WhyLabsWriter()
writer.write(file=profile.view())
If you want to know more about writing profiles to WhyLabs, please refer to the example notebooks in the whylogs GitHub repository.
Profile Management in WhyLabs
To understand how profiles are managed in WhyLabs, keep two of their properties in mind:
- profiles are mergeable
- profiles have a dataset timestamp
An uploaded profile will be safely stored in WhyLabs, but how it is displayed in the dashboard depends on the project's configuration. For example, if the project was configured with a daily/hourly/weekly frequency, profiles will be merged and displayed in your dashboard at daily/hourly/weekly granularity, respectively.
For instance, if 17 profiles were uploaded to an hourly project between 02:00 and 03:00, this information would be displayed throughout your dashboard as a single profile representing the combined statistical summary of all 17 profiles.
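As a minimal sketch of this behavior, the two uploads below (assuming df_morning and df_afternoon are pandas DataFrames captured at different times on the same day) would show up as a single merged entry in a daily project:

import whylogs as why
from whylogs.api.writer.whylabs import WhyLabsWriter
from datetime import datetime, timezone

writer = WhyLabsWriter()

# df_morning and df_afternoon are hypothetical DataFrames from the same day
for df in [df_morning, df_afternoon]:
    profile = why.log(df).profile()
    # Both timestamps fall on the same calendar day, so a daily project
    # merges the two profiles into one dashboard entry
    profile.set_dataset_timestamp(datetime.now(tz=timezone.utc))
    writer.write(file=profile.view())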
Integration Options
REST API
Most customers integrating with WhyLabs opt to upload profiles via the mechanism built into the open source whylogs library. Simply obtain an API key, plug it in, and you're ready to go.
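For example, with the whylogs 1.x environment-variable configuration, plugging in the key looks roughly like this (the org ID, dataset ID, and key below are placeholders):

import os
import whylogs as why
from whylogs.api.writer.whylabs import WhyLabsWriter

# Placeholder credentials; use your own values from the WhyLabs Settings page
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "org-XXXX"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-1"
os.environ["WHYLABS_API_KEY"] = "<your-api-key>"

# The writer picks the credentials up from the environment
writer = WhyLabsWriter()
writer.write(file=why.log(df).profile().view())  # df is your DataFrame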
Speed of data availability
A profile uploaded for any dataset timestamp, past or present, will typically become visible in the UI in under a minute (often just a few seconds).
S3
Some enterprise customers operate under very strict network egress rules that block access to cloud-based REST APIs. For such scenarios we offer the ability to pull profiles written to a blob store such as AWS S3. Contact us for more information!
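As a rough sketch of the producing side of such an integration (the bucket name and key layout below are placeholders to be agreed on with the WhyLabs team), a profile can be serialized to whylogs' binary format and copied to S3:

import boto3
import whylogs as why

# Serialize the profile to a local binary file
profile_view = why.log(df).view()  # df is your DataFrame
profile_view.write("profile_2024-01-01.bin")

# Copy it to the agreed bucket/prefix (placeholder names below)
s3 = boto3.client("s3")
s3.upload_file(
    Filename="profile_2024-01-01.bin",
    Bucket="my-whylogs-profiles",
    Key="whylabs-profiles/profile_2024-01-01.bin",
)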
Speed of data availability
Cross-account blob store integrations currently have a one-day turnaround for new profiles to be processed and visible in the UI.
Distributed Environment Logging (Spark/Flink/Kafka)
Depending on how you use whylogs in your environment, you might have multiple machines in a distributed context generating profiles for the same dataset, time window, and segment. whylogs profiles are easily mergeable, which makes it easy to reduce data egress volume by merging them prior to uploading (see the sketch at the end of this section). If profiles are not merged prior to upload, that's okay too! The WhyLabs platform will merge them automatically.
For example, suppose a Kafka topic is being profiled with whylogs using multiple consumer instances, each emitting a profile once an hour. The profiles will automatically merge, and your dashboard will reflect changes throughout the day.
If you want to know more about uploading profiles in a distributed environment using specific tools, check the respective documentation in the Integrations Section.
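As a minimal sketch of local merging (view_worker_a and view_worker_b stand in for DatasetProfileView objects shipped back from two workers):

from whylogs.api.writer.whylabs import WhyLabsWriter

# view_worker_a and view_worker_b are hypothetical profile views produced on
# two workers for the same dataset, time window, and segment
merged_view = view_worker_a.merge(view_worker_b)

# A single upload instead of two; WhyLabs would merge them anyway if uploaded separately
WhyLabsWriter().write(file=merged_view)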
Batch upload transactions
Sometimes it's practical to upload multiple profiles in an all-or-nothing way, which is what the batch upload transactions API was designed to enable. In such a workflow, the user first initializes a transaction, stages multiple profiles for upload, and then commits the transaction. Only after the transaction is closed are all the staged profiles ingested by the platform. If any of the uploads fails, the entire transaction is aborted and no profiles are ingested.
Below is an example transaction for a use case with five batches of data, one per day:
import whylogs as why
from whylogs.api.writer.whylabs import WhyLabsWriter
from datetime import datetime, timezone, timedelta

writer = WhyLabsWriter()
transaction_id = writer.start_transaction()
print(f"Started transaction {transaction_id}")

for i in range(5):
    # list_daily_batches holds one DataFrame per daily batch
    batch_df = list_daily_batches[i]
    profile = why.log(batch_df)
    # Backdate each profile so the five batches land on consecutive days
    timestamp = datetime.now(tz=timezone.utc) - timedelta(days=i + 1)
    profile.set_dataset_timestamp(timestamp)
    # Stage the profile; it is not ingested until the transaction is committed
    status, id = writer.write(profile)
    print(status, id)

writer.commit_transaction()
If any of the profiles was not uploaded successfully, the writer.commit_transaction() call will throw an exception and the transaction will fail to commit.
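A guarded commit might look like the sketch below (the concrete exception type surfaced here can vary across whylogs versions, so a broad except is used):

try:
    writer.commit_transaction()
    print(f"Committed transaction {transaction_id}")
except Exception as e:
    # None of the staged profiles were ingested; inspect and retry the failed upload
    print(f"Transaction {transaction_id} failed to commit: {e}")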
Distributed batch upload transactions
Transactions are also supported in a distributed workflow using the whylogs + PySpark integration. Assuming that we have multiple Spark DataFrame objects defined and the whylogs[spark] extra installed, we can run the transaction in the following way:
# %pip install 'whylogs[spark]'
from whylogs.api.writer.whylabs import WhyLabsWriter
from whylogs.api.pyspark.experimental import collect_dataset_profile_view
from datetime import datetime, timezone, timedelta

writer = WhyLabsWriter()
transaction_id = writer.start_transaction()
print(f"Started transaction {transaction_id}")

for i in range(3):
    # Collect a profile view for one of the Spark DataFrames
    profile = collect_dataset_profile_view(spark_dataframe)
    # Backdate each profile view so the batches land on consecutive days
    timestamp = datetime.now(tz=timezone.utc) - timedelta(days=i + 1)
    profile.dataset_timestamp = timestamp
    writer.write(profile)

writer.commit_transaction()
print(f"Committed transaction {transaction_id}")