Embeddings Data

With WhyLabs, you can profile embeddings data by comparing it to reference data points. Users can define these references themselves (helpful when they represent prototypical "ideal" representations of a cluster or scenario), or the references can be chosen programmatically.

To use this functionality, perform the following steps:

Choose reference embeddings
Log with whylogs after adding metric configuration and resolver for the EmbeddingMetric
View and (optionally) add monitors in WhyLabs

Logging and monitoring embeddings data

To get started, install whylogs with the embeddings and whylabs extras:

pip install --upgrade "whylogs[embeddings,whylabs]"

⚠️ Note that embeddings are only available in whylogs >= 1.1.22.

Choosing reference embeddings

Reference embeddings can be chosen manually, but we provide functions for choosing references programmatically as well.

Manual

You may select or create references manually. Ensure the data is in a two-dimensional numpy.ndarray with shape (number of references, dimensionality of embeddings). You may optionally assign a text label to each reference; otherwise, references are identified by integers.

The number of references should stay below 50.
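As a minimal sketch of the manual route, the snippet below stacks a few hand-picked prototype vectors into the required two-dimensional array. The vectors and label names are made up for illustration, and passing the labels as a plain list of strings is an assumption about what the configuration accepts.

```python
import numpy as np

# Hypothetical prototype vectors; in practice these come from your own model.
reference_vectors = {
    "cat": [0.9, 0.1, 0.0, 0.0],
    "dog": [0.1, 0.9, 0.0, 0.0],
    "bird": [0.0, 0.1, 0.9, 0.0],
}

# One text label per reference, in the same order as the rows below.
labels = list(reference_vectors.keys())
references = np.array([reference_vectors[name] for name in labels], dtype=float)

# Shape must be (number of references, dimensionality of embeddings).
assert references.ndim == 2
assert references.shape[0] == len(labels)
# Keep the number of references below 50.
assert references.shape[0] < 50
```

The resulting references and labels can then be used in the same way as the output of the programmatic selectors.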

With labels

Here, X is a two-dimensional numpy.ndarray of training vectors with shape (number of vectors, dimensionality of embeddings), and y is a one-dimensional numpy.ndarray of labels (integers or strings) for the training vectors.

from whylogs.experimental.preprocess.embeddings.selectors import PCACentroidsSelector

references, labels = PCACentroidsSelector(n_components=20).calculate_references(X, y)

The n_components value must be less than the dimensionality of the embeddings, but high enough to capture the primary shape of the data. Values between 10 and 50 often work well in practice.
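One way to sanity-check a candidate n_components is to look at how much variance the leading principal components capture. The sketch below does this with plain NumPy on synthetic data. The helper name cumulative_explained_variance, the synthetic X_demo, and the 95% threshold are all illustrative assumptions, not part of whylogs.

```python
import numpy as np


def cumulative_explained_variance(X: np.ndarray) -> np.ndarray:
    """Cumulative fraction of variance captured by the first k principal components."""
    X_centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional to PCA variances.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2
    return np.cumsum(variances) / variances.sum()


rng = np.random.default_rng(0)
# Synthetic stand-in for training vectors: 64-dimensional data that
# mostly lives in a 5-dimensional subspace, plus a little noise.
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 64))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(500, 64))

cumulative = cumulative_explained_variance(X_demo)
# Smallest number of components capturing at least 95% of the variance
# (the threshold is an arbitrary choice for this illustration).
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"components needed for 95% variance: {n_components}")
```

On real embeddings the curve usually flattens more gradually, so treat the result as a starting point rather than a rule.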

Without labels

Here, X is a two-dimensional numpy.ndarray of training vectors with shape (number of vectors, dimensionality of embeddings).

from whylogs.experimental.preprocess.embeddings.selectors import PCAKMeansSelector

references, labels = PCAKMeansSelector(n_clusters=8, n_components=20).calculate_references(X)

Labels will be consecutive integers starting at 0.

The n_clusters value determines the number of references. The n_components value must be less than the dimensionality of the embeddings, but high enough to capture the primary shape of the data. Values between 10 and 50 often work well in practice.

Log embeddings in whylogs

Here, X is a two-dimensional numpy.ndarray of embedding vectors to log with shape (number of vectors, dimensionality of embeddings). Log using the following code:

import os

import whylogs as why
from whylogs.api.writer.whylabs import WhyLabsWriter
from whylogs.core.resolvers import MetricSpec, ResolverSpec
from whylogs.core.schema import DeclarativeSchema
from whylogs.experimental.extras.embedding_metric import (
    DistanceFunction,
    EmbeddingConfig,
    EmbeddingMetric,
)

# Configuring EmbeddingMetric
config = EmbeddingConfig(
    references=references,
    labels=labels,
    distance_fn=DistanceFunction.euclidean,
)

# Setting resolver
schema = DeclarativeSchema([ResolverSpec(column_name="FEATURE_NAME", metrics=[MetricSpec(EmbeddingMetric, config)])])

# Logging (pass the schema so the EmbeddingMetric is applied to FEATURE_NAME)
profile = why.log(row={"FEATURE_NAME": X}, schema=schema)

# Uploading to WhyLabs
os.environ["WHYLABS_DEFAULT_ORG_ID"] = 'YOUR-ORG-ID'
os.environ["WHYLABS_API_KEY"] = 'YOUR-API-KEY'
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = 'YOUR-MODEL-ID'

writer = WhyLabsWriter()
writer.write(profile)

Viewing and Monitoring in WhyLabs

Visualizations

You can see initial visualizations in WhyLabs, but many more are forthcoming in both whylogs and WhyLabs!

Embeddings space visualization in WhyLabs

In the profile and input pages, we can see distributions of the embeddings data that tell us detailed information about the embeddings space.

Monitoring

In the beta version, we recommend setting monitors on the distributions produced by embeddings logging.

Set a discrete drift monitor on the FEATURE_NAME.closest feature to see overall drifts in the distribution of embeddings data.

Monitor setup for embeddings discrete drift

Set continuous drift monitors on the FEATURE_NAME.REFERENCE_distance features for the individual references you care most about, to track changes in how far the embeddings lie from those references.
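To build intuition for what these two kinds of features represent, the sketch below computes Euclidean distances from each vector to each reference, along with the label of the closest reference, using plain NumPy. It is an illustration of the idea only, not the whylogs implementation; the helper distances_to_references and the toy data are hypothetical.

```python
import numpy as np


def distances_to_references(X: np.ndarray, references: np.ndarray) -> np.ndarray:
    """Return an (n_vectors, n_references) matrix of Euclidean distances."""
    diffs = X[:, None, :] - references[None, :, :]
    return np.linalg.norm(diffs, axis=2)


# Toy data: two labeled references and three vectors in two dimensions.
references = np.array([[0.0, 0.0], [10.0, 0.0]])
labels = ["origin", "far"]
X = np.array([[1.0, 0.0], [9.0, 0.0], [0.0, 2.0]])

distances = distances_to_references(X, references)
# Label of the nearest reference for each vector (a discrete distribution).
closest = [labels[i] for i in distances.argmin(axis=1)]
# Distances to one reference of interest (a continuous distribution).
distance_to_origin = distances[:, labels.index("origin")]
```

A shift in how often each label appears in closest corresponds to the kind of change a discrete drift monitor would flag, while a shift in distance_to_origin corresponds to what a continuous drift monitor on a single reference would flag.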

Additional Resources

Example Notebook

Blog Post
