Amazon SageMaker is a managed machine learning platform that enables developers and data scientists to build, train, and deploy models at scale. It provides a variety of tools and capabilities, such as pre-built ML algorithms and libraries, Jupyter notebook-based authoring and collaboration, and easy model deployment to the cloud. SageMaker also integrates with other AWS services, such as S3, to provide a seamless workflow for storing and accessing data, and with AWS IAM for security and access control.
In this section, we will learn how to integrate SageMaker's training and prediction pipelines with whylogs and upload profiles to WhyLabs.
# Monitor data used for model training
A useful first step when integrating your existing models with WhyLabs is to capture and monitor reference profiles. These let you make a static comparison and understand whether the distribution you saw during training is the same as the one you see in production. To do that, we will profile the same data used for training with whylogs and write it to WhyLabs.
Make sure you have whylogs with the WhyLabs extension installed.
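Assuming you use pip, the extension can be installed with the `whylabs` extra:

```shell
pip install "whylogs[whylabs]"
```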
Then, run a batch profiling job to create a whylogs reference profile from the training data and upload it to WhyLabs. This will be the profile used to compare against your model's predictions and to trigger alerts in case drift happens. You can run this code in any environment that has access to S3 and can run Python scripts, such as ECS, Lambda, EKS, etc.
Note that you will need to set your WhyLabs credentials as environment variables.
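For example, using the environment variable names that the whylogs WhyLabs writer reads (the values below are placeholders for your own org id, model id, and API key):

```python
import os

# Placeholders: substitute your own WhyLabs credentials.
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "org-0"        # your org id
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-1"  # the model (dataset) id in WhyLabs
os.environ["WHYLABS_API_KEY"] = "<your-api-key>"
```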
# Monitoring model predictions
In this section, we present a way to log your model's predictions to WhyLabs along with its input features. These profiles will be useful to run monitors and see whether your expectations are met when compared to a reference profile.
To log your model prediction data, you will need to modify the predict function in your entry point. You will also enable the rolling logger, which writes profiles to WhyLabs on a fixed cadence. In this example, we will persist profiles every 15 minutes.
To do so, define the prediction function in your inference entry point script.
Then deploy your model with WhyLabs' credentials set as environment variables, making sure that `predict.py` is in the same path as the model call.
NOTE: You must have `whylogs[whylabs]` available in the inference environment. To do that, you can persist a `requirements.txt` file with the desired `whylogs` package version in the same path as your source directory.
# Online inferences: Decoupling whylogs
To monitor your production online models, another option is to decouple whylogs profiling from inference. Some options are listed below:
- Persist data to a messaging service (like AWS SNS) and have a Lambda function responsible for profiling and uploading to WhyLabs on a schedule
- Use our whylogs container application to host a dedicated profiling endpoint
- For cases where messaging services aren't supported, persist data to distributed storage (like S3) and have a batch PySpark profiling job run on a managed Spark service, such as AWS EMR.
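As an illustration of the first option, the endpoint could publish each inference record to an SNS topic and leave profiling to a downstream Lambda. A minimal sketch, where the function name, topic ARN, and message shape are all assumptions for this example:

```python
import json


def publish_inference(sns_client, topic_arn: str, features: dict, prediction) -> str:
    """Publish one inference record to SNS for downstream profiling.

    A Lambda subscribed to the topic can batch these messages, profile
    them with whylogs, and upload the profiles to WhyLabs on a schedule.
    """
    message = json.dumps({"features": features, "prediction": prediction})
    response = sns_client.publish(TopicArn=topic_arn, Message=message)
    return response["MessageId"]


# Example usage with boto3 (the topic ARN is a placeholder):
# import boto3
# client = boto3.client("sns")
# publish_inference(
#     client,
#     "arn:aws:sns:us-east-1:123456789012:inference-events",
#     {"feature_a": 1.2, "feature_b": "x"},
#     0,
# )
```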
# Get in touch
In this documentation page, we shared some insights on how to integrate WhyLabs with your SageMaker models, both at training and inference time, using whylogs profiles and its built-in WhyLabs writer. If you have questions or wish to learn more about how you can use WhyLabs with your models, contact us anytime!