
Onboarding to the Platform

Overview#

With our self-serve option, users can start getting value out of WhyLabs right away for free. Users can onboard their first project with the following steps:

  1. Install the whylogs library, which generates statistical profiles of any dataset on the client’s side.

  2. Sign up for a WhyLabs account and create your first project for free! For users ready to take advantage of the full functionality of WhyLabs without ever having to go through sales, subscriptions can be purchased via the AWS Marketplace.

  3. Inject whylogs into your data pipeline to deliver dataset profiles to the WhyLabs project you just created in the WhyLabs platform. Our whylogs library is interoperable with any ML/Data infrastructure and framework. See more about our integrations here.

  4. Customize alerts and monitoring to your needs to stay on top of data/model health.

The following example walks through a simple Python integration using a tabular dataset. However, this exercise is also possible using any one of our integrations. As an example, see our Spark integration to begin profiling datasets using Spark instead of general Python.

Install whylogs Library#

The whylogs logging agent is the easiest way to enable logging, testing, and monitoring in an ML/AI application. The lightweight agent profiles data in real-time, collecting thousands of metrics from structured data, unstructured data, and ML model predictions with zero configuration.

Install the whylogs library along with the module containing the WhyLabs writer. The WhyLabs writer will be used for uploading profiles to the WhyLabs platform.

pip install whylogs
pip install "whylogs[whylabs]"

Before even getting set up with WhyLabs, users can start logging statistical properties of dataset features, model inputs, and model outputs to enable exploratory analysis, data unit testing, and monitoring. The Python code below profiles a dataset and generates a DataFrame capturing basic telemetry describing your data. This represents just a portion of the information captured in a profile.

# Note: this step only profiles the data locally; nothing is uploaded to WhyLabs yet
import whylogs as why
import pandas as pd
df = pd.read_csv("https://whylabs-public.s3.us-west-2.amazonaws.com/datasets/tour/current.csv")
# profile dataframe
results = why.log(pandas=df)
# grab profile object from result set
profile = results.profile()
# grab a view object of the profile for inspection
prof_view = profile.view()
# inspect profile as a Pandas DataFrame
prof_df = prof_view.to_pandas()

The next step is for users to begin uploading these profiles to the WhyLabs platform where customizable monitoring/alerting can be done on profiles like this collected over time.

Getting Started in WhyLabs#

Once you sign up for an account, add a new project within the project dashboard.

[Image: Set up model from Model Dashboard]

A Project is any ML model, data pipeline, data stream, or dataset that you want to monitor in WhyLabs. Profiles generated by whylogs can be regularly uploaded to a particular WhyLabs project for ongoing monitoring. There are multiple types of projects available.

A Model project is optimized for monitoring model inputs, outputs, and performance. A Dataset project has many of the same characteristics, but excludes performance monitoring and does not categorize dataset features as inputs or outputs. For this example, we will create a Model project.

From the Project Management page, simply provide a name for your project. For Model projects, you can optionally select the type of model. The type of model will determine which performance metrics will be displayed in the Model Performance tab. The type can be updated later if it's left blank when setting up the model.

[Image: Project Management]

Upload a Profile to a WhyLabs Project#

The process of uploading a profile to a WhyLabs project is slightly different from the process of saving a profile to disk. Users will need the following:

  • A WhyLabs API Key
  • The organization ID in which the target model lives
  • The target Dataset/Model ID

You can create a new API token from the Access Tokens page in the "Settings" section of the platform. The token will be displayed only while the page is loaded. Alternatively, you can use a previously created access token that you have already saved to your local environment.

[Image: Access Tokens]

You will need your organization ID, which can also be found on the API token page:

[Image: Access Tokens]

The Dataset/Model ID for each project is available from the Project Dashboard page. Once you have these 3 items along with a dataset you wish to profile, you can run the following Python example to upload a profile (note that the datasetId parameter is used regardless of the project type).

import whylogs as why
import pandas as pd
import os
from whylogs.api.writer.whylabs import WhyLabsWriter
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "YOUR-ORG-ID"  # the org ID is case sensitive
os.environ["WHYLABS_API_KEY"] = "YOUR-API-KEY"
# can also be provided as the dataset_id param in the WhyLabsWriter constructor
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "YOUR-MODEL-ID"
# point to your local CSV if you have your own data
df = pd.read_csv("path/to/your/data.csv")
# log the dataframe and generate a profile
profile = why.log(pandas=df).profile()
# instantiate the WhyLabs writer
writer = WhyLabsWriter()
# pass a profile view to the writer's write method
writer.write(profile=profile.view())
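
Since the upload depends on three settings being present, a small check before constructing the writer can save a confusing failure later. The sketch below is a minimal illustration that uses only the standard library; the environment variable names come from the example above, while the helper name `missing_whylabs_settings` is hypothetical and not part of whylogs. Note that the dataset ID may legitimately be absent from the environment if it is passed to the writer instead.

```python
import os

# Environment variable names taken from the upload example above.
REQUIRED_VARS = (
    "WHYLABS_DEFAULT_ORG_ID",
    "WHYLABS_API_KEY",
    "WHYLABS_DEFAULT_DATASET_ID",
)


def missing_whylabs_settings(env=None):
    """Return the names of required settings that are unset or blank.

    Hypothetical helper for illustration; it is not a whylogs function.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_whylabs_settings()
    if missing:
        print("Set these before uploading:", ", ".join(missing))
    else:
        print("All WhyLabs settings are present.")
```

Passing a plain dictionary instead of relying on `os.environ` makes the check easy to exercise in a unit test.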

After running this code, users will see a single datapoint appear under the relevant WhyLabs project. The image below shows the “inputs” view as an example.

[Image: Single Data Point]

The real value of WhyLabs is gained from uploading multiple profiles over a span of time, after which point WhyLabs can apply out-of-the-box anomaly detection and automatically send alerts to users about their data and model health!

Tracking Data Profiles Over Time#

By default, an uploaded profile is associated with the current date at the time of upload. If a user uploads two profiles for two different batches of a dataset on the same day, the two profiles are merged into a single aggregated profile for that day.

This property of aggregating profiles is called mergeability and provides great flexibility when integrating whylogs with data pipelines or setting reference profiles. For example, in a distributed data pipeline, profiles of partial datasets can be published from different nodes and will be automatically merged within WhyLabs for a holistic view of the entire dataset.

*For models with an hourly batch frequency, all of the above applies at the hourly level.
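
To make the idea of merging concrete, the sketch below uses a toy summary of a single numeric column. It is not the whylogs profile format or its merge implementation; the class `ToyColumnProfile` and its fields are hypothetical, chosen only to show why summaries built from partial data can be combined without revisiting the raw rows.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ToyColumnProfile:
    """A toy summary of one numeric column (not the whylogs format)."""

    count: int = 0
    null_count: int = 0
    total: float = 0.0
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    @classmethod
    def from_values(cls, values: Iterable[Optional[float]]) -> "ToyColumnProfile":
        """Summarize raw values; None marks a missing entry."""
        values = list(values)
        present = [v for v in values if v is not None]
        return cls(
            count=len(values),
            null_count=len(values) - len(present),
            total=sum(present),
            minimum=min(present) if present else None,
            maximum=max(present) if present else None,
        )

    def merge(self, other: "ToyColumnProfile") -> "ToyColumnProfile":
        """Combine two summaries without access to the raw rows."""
        mins = [m for m in (self.minimum, other.minimum) if m is not None]
        maxs = [m for m in (self.maximum, other.maximum) if m is not None]
        return ToyColumnProfile(
            count=self.count + other.count,
            null_count=self.null_count + other.null_count,
            total=self.total + other.total,
            minimum=min(mins) if mins else None,
            maximum=max(maxs) if maxs else None,
        )


# Two nodes each summarize part of the data; the merged summary matches
# one computed over the whole dataset in a single pass.
node_a = [3.0, None, 7.0]
node_b = [1.0, 9.0]
merged = ToyColumnProfile.from_values(node_a).merge(ToyColumnProfile.from_values(node_b))
whole = ToyColumnProfile.from_values(node_a + node_b)
```

Counts, sums, minimums, and maximums all combine exactly, which is why the order in which partial summaries arrive does not matter in this toy version.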

In light of the above, each batch of data should have its profile published only once, because uploading the same batch twice would count its data twice in the merged profile. Profiles for different batches may be published throughout the day, after which they are merged into a single profile at the day level (or the hourly level for models with an hourly batch frequency).
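
The grouping described above can be pictured as truncating each upload time to the start of its batch. The following sketch is an illustration only: the function `batch_start` is hypothetical, and bucketing in UTC is an assumption made for the example rather than a statement about how the platform assigns batches.

```python
from collections import Counter
from datetime import datetime, timezone


def batch_start(ts: datetime, granularity: str = "daily") -> datetime:
    """Truncate a timezone-aware timestamp to the start of its batch.

    UTC bucketing is an assumption made for this illustration.
    """
    ts = ts.astimezone(timezone.utc)
    if granularity == "daily":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if granularity == "hourly":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported granularity: {granularity!r}")


# Three uploads made on the same day at different times.
uploads = [
    datetime(2022, 2, 7, 9, 15, tzinfo=timezone.utc),
    datetime(2022, 2, 7, 9, 50, tzinfo=timezone.utc),
    datetime(2022, 2, 7, 17, 5, tzinfo=timezone.utc),
]

# Count how many uploads land in each batch at each granularity.
daily = Counter(batch_start(ts, "daily") for ts in uploads)
hourly = Counter(batch_start(ts, "hourly") for ts in uploads)
```

With a daily frequency all three uploads fall into one batch, while an hourly frequency splits them into a 09:00 batch holding two uploads and a 17:00 batch holding one.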

In order to get started with tracking a dataset over time, there are a few options. The first is to continue uploading profiles over a period of multiple days by integrating whylogs with your current pipeline (see integrations here). Another option is to perform backfilling.

When uploading a profile to WhyLabs, users have the option to include a dataset_timestamp parameter associated with the uploaded profile. For example, modifying the previous code example with the following will associate the uploaded profile with a date of February 7th, 2022:

import datetime
# log a dataframe and extract its profile
profile = why.log(df).profile()
# set the dataset timestamp for the profile
profile.set_dataset_timestamp(datetime.datetime(2022, 2, 7, 0, 0))
# write the profile to the WhyLabs platform
writer.write(profile=profile.view())

Users can readily backfill profiles for the last 7 days and view the results in the UI right away. Users may experience up to a 24-hour delay when backdating further than that. Read more about backfilling here.
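
A backfill over the last 7 days needs one dataset timestamp per day. The sketch below generates those timestamps with the standard library only; the helper `backfill_timestamps` is hypothetical, and the use of timezone-aware midnight UTC values is an assumption made to avoid ambiguity about which day a profile belongs to.

```python
from datetime import datetime, timedelta, timezone


def backfill_timestamps(end: datetime, days: int = 7) -> list:
    """Return one midnight-UTC timestamp per day, oldest first, ending on the date of `end`."""
    last = end.astimezone(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
    return [last - timedelta(days=offset) for offset in range(days - 1, -1, -1)]


stamps = backfill_timestamps(datetime(2022, 2, 7, 13, 30, tzinfo=timezone.utc))

# Each timestamp would then be paired with that day's batch, reusing the
# calls shown earlier (df_for_day is a hypothetical per-day DataFrame):
#   profile = why.log(df_for_day).profile()
#   profile.set_dataset_timestamp(ts)
#   writer.write(profile=profile.view())
```

Generating the timestamps separately from the upload loop makes it easy to check the date range before anything is written to a project.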

Users are encouraged to experiment with WhyLabs by manipulating a test dataset (injecting anomalies, warping data distributions, and so on) across a backfilled time range to see the results of WhyLabs monitoring right away.

Monitoring and Alerting#

WhyLabs monitors a variety of dataset properties at the feature level, including distribution distance (how closely a feature's current distribution matches its historical ones), count/ratio of missing values, count/ratio of unique values, and inferred data types.
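
For intuition about these metrics, the sketch below computes toy versions of them on raw values. It is an illustration, not how WhyLabs derives metrics from profiles: the function names are hypothetical, and Hellinger distance is used here as one common choice of distribution distance, which is an assumption rather than something stated above. The distance function expects two non-empty samples.

```python
import math
from collections import Counter


def missing_ratio(values):
    """Fraction of entries that are None (0.0 for an empty list)."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def unique_count(values):
    """Number of distinct non-missing values."""
    return len({v for v in values if v is not None})


def hellinger_distance(baseline, current):
    """Hellinger distance between the category frequencies of two non-empty samples.

    Returns 0.0 for identical frequencies and 1.0 for samples with no shared category.
    """
    p, q = Counter(baseline), Counter(current)
    n_p, n_q = sum(p.values()), sum(q.values())
    squared = sum(
        (math.sqrt(p[k] / n_p) - math.sqrt(q[k] / n_q)) ** 2
        for k in set(p) | set(q)
    )
    return math.sqrt(squared / 2)


# Hypothetical values for a categorical feature on two different days.
last_week = ["paid", "paid", "late", "default"]
today = ["paid", "paid", "paid", "paid"]
drift = hellinger_distance(last_week, today)
```

A distance near 0 means today's category mix looks like the baseline, while a value closer to 1 means the two share little.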

Users have the ability to customize monitor thresholds and have a variety of options for defining a baseline to compare against (trailing window, static reference profile, specific date range). See more about monitoring here.
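
As a rough picture of a threshold applied against a trailing-window baseline, the sketch below flags a daily metric value when it falls more than `k` standard deviations from the mean of the preceding `window` values. This is a generic technique shown for illustration under stated assumptions; the function `flag_anomalies`, its defaults, and the sample series are hypothetical and do not describe the WhyLabs monitor implementation.

```python
from statistics import mean, stdev


def flag_anomalies(series, window=7, k=3.0):
    """Return indices whose value lies outside mean +/- k * stddev of the preceding `window` values."""
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        center, spread = mean(baseline), stdev(baseline)
        if abs(series[i] - center) > k * spread:
            flagged.append(i)
    return flagged


# Hypothetical daily unique-value counts for one feature; day 8 drops sharply.
daily_unique_counts = [12, 11, 12, 13, 12, 11, 12, 12, 4, 12]
anomalous_days = flag_anomalies(daily_unique_counts)
```

One consequence of a trailing window is that an anomalous value enters the baseline for the following days and widens the tolerated range, which is one reason a static reference profile or a fixed date range can be a better baseline for some metrics.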

Users can configure automatic actions to be taken when an anomaly is detected, such as notifications via Slack, PagerDuty, email, etc. See more about alerts and notifications here.

When clicking on a project from the project dashboard, users will be directed to a view containing a summary of anomalies and monitored metrics over the date range chosen in the date picker. This example shows a Model project’s “Inputs” page.

[Image: Lending Inputs View]

Users can click on one of these features to get a more granular view of each of the monitored metrics as well as the events which triggered alerts. For example, we see that the “loan status” feature contained fewer unique values than usual on Feb 7th:

[Image: Loan Status]
