Embeddings Data

With WhyLabs, you can profile embeddings data by comparing it to reference data points. Users can choose these references by hand (helpful when they represent prototypical "ideal" examples of a cluster or scenario), or the references can be chosen programmatically.

To use this functionality, perform the following steps:

  1. Choose reference embeddings
  2. Log with whylogs after adding metric configuration and resolver for the EmbeddingMetric
  3. View and (optionally) add monitors in WhyLabs

Logging and monitoring embeddings data

To get started, install whylogs with the embeddings and whylabs extras:

⚠️ Note that embeddings are only available in whylogs >= 1.1.22.

pip install --upgrade "whylogs[embeddings,whylabs]"

Choosing reference embeddings

Reference embeddings can be chosen manually, but we provide functions for choosing references programmatically as well.

Here, X is a two-dimensional numpy.ndarray of training vectors with shape (number of vectors, dimensionality of embeddings).

from whylogs.experimental.preprocess.embeddings.selectors import PCAKMeansSelector

references, labels = PCAKMeansSelector(n_clusters=8, n_components=20).calculate_references(X)

Labels will be consecutive integers starting at 0.

n_clusters determines the number of references. n_components must be less than the dimensionality of the embeddings, but high enough to capture the primary shape of the data. Values between 10 and 50 often work well in practice.
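For the manual route, one simple option is to use the mean vector of each known class as its reference. The sketch below illustrates that idea with NumPy only; it is not a whylogs function, and the helper name centroid_references and the toy data are hypothetical.

```python
import numpy as np


def centroid_references(X, y):
    """Return (references, labels): one mean vector per distinct class in y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labels = np.unique(y)  # sorted distinct class labels
    # Average all vectors belonging to each class to get its reference.
    references = np.stack([X[y == label].mean(axis=0) for label in labels])
    return references, labels.tolist()


# Toy data: four 3-dimensional vectors in two classes (illustrative only).
X_train = np.array(
    [[0.0, 0.0, 1.0], [0.0, 2.0, 1.0], [4.0, 4.0, 0.0], [6.0, 4.0, 2.0]]
)
y_train = np.array([0, 0, 1, 1])

manual_references, manual_labels = centroid_references(X_train, y_train)
```

The result has the same form as the selector output: a two-dimensional array of references plus one label per reference, so it should be usable wherever references and labels are expected.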

Log embeddings in whylogs

Here, X is a two-dimensional numpy.ndarray of embedding vectors to log with shape (number of vectors, dimensionality of embeddings). Log using the following code:

import os

import whylogs as why
from whylogs.api.writer.whylabs import WhyLabsWriter
from whylogs.core.resolvers import MetricSpec, ResolverSpec
from whylogs.core.schema import DeclarativeSchema
from whylogs.experimental.extras.embedding_metric import (
    DistanceFunction,
    EmbeddingConfig,
    EmbeddingMetric,
)

# Configuring EmbeddingMetric
config = EmbeddingConfig(
    references=references,
    labels=labels,
    distance_fn=DistanceFunction.euclidean,
)

# Setting resolver: attach the EmbeddingMetric to the embeddings column
schema = DeclarativeSchema([ResolverSpec(column_name="FEATURE_NAME", metrics=[MetricSpec(EmbeddingMetric, config)])])

# Logging (pass the schema so the EmbeddingMetric is applied)
profile = why.log(row={"FEATURE_NAME": X}, schema=schema)

# Uploading to WhyLabs
os.environ["WHYLABS_DEFAULT_ORG_ID"] = 'YOUR-ORG-ID'
os.environ["WHYLABS_API_KEY"] = 'YOUR-API-KEY'
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = 'YOUR-MODEL-ID'

writer = WhyLabsWriter()
writer.write(profile)
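To build intuition for what comparing embeddings to references means, the sketch below computes, for each vector in a batch, its Euclidean distance to every reference and the label of the closest one. This is a conceptual illustration in plain NumPy, not the whylogs implementation, which may differ in detail; the function name nearest_reference and the sample arrays are hypothetical.

```python
import numpy as np


def nearest_reference(X, references, labels):
    """Return per-row distances to every reference and the closest label."""
    X = np.asarray(X, dtype=float)
    references = np.asarray(references, dtype=float)
    # Pairwise differences via broadcasting: (n_vectors, n_references, dim).
    diffs = X[:, None, :] - references[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)  # (n_vectors, n_references)
    closest = [labels[i] for i in distances.argmin(axis=1)]
    return distances, closest


# Two 2-dimensional references and a batch of three vectors (illustrative).
refs = np.array([[0.0, 0.0], [3.0, 4.0]])
batch = np.array([[0.0, 1.0], [3.0, 3.0], [0.0, 0.0]])

distances, closest = nearest_reference(batch, refs, labels=[0, 1])
```

Each column of distances corresponds to one reference, and closest holds one label per input vector. Distributions of quantities like these are what the profile summarizes.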

Viewing and Monitoring in WhyLabs

Visualizations

You can see initial visualizations in WhyLabs, but many more are forthcoming in both whylogs and WhyLabs!

Figure: Embeddings space visualization in WhyLabs

In the profile and input pages, we can see distributions of the embeddings data that tell us detailed information about the embeddings space.

Monitoring

During the beta, you may want to set several monitors on the distributions produced by embeddings logging.

Set a discrete drift monitor on the FEATURE_NAME.closest feature to see overall drifts in the distribution of embeddings data.

Figure: Monitor setup for embeddings discrete drift
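As a rough picture of what discrete drift on the closest-reference feature measures, the sketch below compares how often each reference label is the closest one in a baseline batch versus a current batch. WhyLabs computes drift on its side with whatever algorithm and threshold the monitor is configured with; the Hellinger distance used here is just one illustrative choice, and all names and sample values are hypothetical.

```python
import math
from collections import Counter


def label_frequencies(closest_labels, all_labels):
    """Relative frequency of each reference label (assumes a non-empty batch)."""
    counts = Counter(closest_labels)
    total = len(closest_labels)
    return [counts[label] / total for label in all_labels]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)


all_labels = [0, 1, 2]
# Closest-reference labels for a baseline batch and a current batch.
baseline = label_frequencies([0, 0, 1, 2], all_labels)
current = label_frequencies([2, 2, 2, 2], all_labels)

drift_score = hellinger(baseline, current)
```

A score near 0 means the batch is spread across references much as the baseline was; a score near 1 means the embeddings have shifted toward a different set of references.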

To track changes in distance to specific references, set continuous drift monitors on the FEATURE_NAME.REFERENCE_distance features for the references you care about.

Additional Resources

Example Notebook

Logging Generic Embeddings Data using Reference Distances

Blog Post

How to Troubleshoot Embeddings Without Eye-balling t-SNE or UMAP Plots
