Profile Overview

Introduction

Profiles are statistical summaries of datasets, produced by logging data with the whylogs open source library. The WhyLabs Platform is custom-built to ingest and monitor these statistical summaries.

Profiles have three properties that make them ideal for observability and monitoring use cases: they are efficient, customizable, and mergeable. whylogs can profile all types of data, including tabular data, images, text, and embeddings. Profiles are then sent to the WhyLabs Platform, where they are used for observability and monitoring.

What are profiles

whylogs profiles are the core of the whylogs library. They capture key statistical properties of data, such as the distribution (far beyond simple mean, median, and standard deviation measures), the number of missing values, and a wide range of configurable custom metrics. By capturing these summary statistics, we are able to accurately represent the data and enable all of the use cases described in the introduction.

whylogs profiles have three properties that make them ideal for data logging: they are efficient, customizable, and mergeable.

Efficient: whylogs profiles are compact, high-fidelity descriptions of the datasets they represent, which is what makes them effective snapshots of the data. As discussed in our Data Logging: Sampling versus Profiling blog post, they capture the characteristics of a dataset better than a sample would.

Customizable: The statistics that whylogs profiles collect are easy to configure and customize. This matters because different data types and use cases require different metrics, and whylogs users need to be able to define custom trackers for those metrics. This customizability is what enables our text, image, and other complex data trackers.

Mergeable: One of the most powerful features of whylogs profiles is their mergeability. Profiles can be combined to form a new profile that represents the aggregate of its constituent profiles. A merged profile is typically about the same size as each of its constituents, which allows for an efficient reduce step in any map/reduce system. This enables logging for distributed and streaming systems, and lets users view aggregated data at any time granularity, at scale.
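To make the idea of mergeability concrete, here is a minimal sketch in plain Python. It does not show how whylogs is implemented: `ToySummary` and `summarize` are hypothetical names invented for this illustration, and the summary tracks only a count, a missing-value count, a running mean and variance, and the min and max of a single numeric column. The point is that two summaries built independently can be merged into one that matches a summary built over all the data at once, while staying the same size.

```python
from dataclasses import dataclass
import math


@dataclass
class ToySummary:
    """Hypothetical mergeable summary of one numeric column (not a whylogs class)."""
    count: int = 0
    missing: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def track(self, value):
        """Update the summary with one value; None counts as missing."""
        if value is None:
            self.missing += 1
            return
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other):
        """Return a new summary equivalent to having tracked both inputs."""
        total = self.count + other.count
        if total == 0:
            return ToySummary(missing=self.missing + other.missing)
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / total
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / total
        return ToySummary(
            total,
            self.missing + other.missing,
            mean,
            m2,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    @property
    def variance(self):
        """Sample variance, or 0.0 when fewer than two values were tracked."""
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


def summarize(values):
    """Build a ToySummary from an iterable of numbers and None values."""
    summary = ToySummary()
    for value in values:
        summary.track(value)
    return summary


# Two batches, for example from two workers or two hours of a stream.
batch_a = [1.0, 2.0, None, 4.0]
batch_b = [10.0, None, 12.0]

merged = summarize(batch_a).merge(summarize(batch_b))
whole = summarize(batch_a + batch_b)
```

Here `merged` and `whole` agree on every statistic (up to floating-point rounding), even though `merged` never saw the raw rows of both batches together. This is the property that makes the reduce step cheap in a map/reduce setting.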

How you generate profiles

Once whylogs is installed, it's easy to generate profiles in both Python and Java environments.

To generate a profile from a Pandas dataframe in Python, simply run:

import whylogs as why
import pandas as pd

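# Load the data into a pandas DataFrame
df = pd.read_csv("path/to/file.csv")

# Profile the DataFrame; the returned object holds the resulting profile
results = why.log(df)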

What you do with profiles

Once you’ve generated whylogs profiles, you can upload them to the WhyLabs Platform. From there, you can automatically set up monitoring for your machine learning models and get notified about both data quality issues and data changes (such as data drift).

More information about whylogs and whylogs profiles can be found here.