Embeddings Data
⚠️ Embeddings features are currently in beta. Expect changes to the whylogs functionality and platform capabilities.
With WhyLabs, you can profile embeddings by comparing them to reference data points. You can define these references yourself (helpful when they represent prototypical "ideal" representations of a cluster or scenario) or choose them programmatically.
To use this functionality, perform the following steps:
- Choose reference embeddings
- Log with whylogs after adding the metric configuration and resolver for the EmbeddingMetric
- View and (optionally) add monitors in WhyLabs
Logging and monitoring embeddings data
To get started, install whylogs with the embeddings and whylabs extras:
⚠️ Note that embeddings are only available in whylogs >= 1.1.22.
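The version requirement above can be checked before any embeddings code runs. The following is a minimal sketch using only the standard library; the helper names `parse_version` and `supports_embeddings` are illustrative and not part of whylogs, and the install command in the comment assumes the extras are named exactly as in the sentence above.

```python
from importlib import metadata

# Assumed install command, based on the extras named in the text:
#   pip install "whylogs[embeddings,whylabs]"

MIN_WHYLOGS = (1, 1, 22)  # minimum version stated in the note above


def parse_version(text):
    """Turn '1.1.22' or '1.2.0rc1' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def supports_embeddings(version_text):
    """True when the given whylogs version is at least 1.1.22."""
    return parse_version(version_text) >= MIN_WHYLOGS


try:
    installed = metadata.version("whylogs")
    print(f"whylogs {installed}: embeddings supported = {supports_embeddings(installed)}")
except metadata.PackageNotFoundError:
    print("whylogs is not installed")
```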
Choosing reference embeddings
Reference embeddings can be chosen manually, but we also provide functions for choosing references programmatically.
- Manual
- With labels
- Without labels
Manual: You may select or create references yourself. Ensure the data is a two-dimensional numpy.ndarray with shape (number of references, dimensionality of embeddings). You may optionally assign a text label to each reference; otherwise, references are identified by integers.
The number of references should remain below 50.
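The constraints on manually chosen references can be checked up front. Below is a small sketch in plain Python; `check_references` is a hypothetical helper (not a whylogs function), and the example vectors and labels are made up for illustration.

```python
def check_references(references, labels=None, max_references=50):
    """Validate a (n_references, n_dims) table of reference embeddings."""
    rows = [list(row) for row in references]
    if not rows:
        raise ValueError("at least one reference is required")
    if len(rows) >= max_references:
        raise ValueError(f"keep the number of references below {max_references}")
    dim = len(rows[0])
    if dim == 0 or any(len(row) != dim for row in rows):
        raise ValueError("every reference must have the same non-zero dimensionality")
    if labels is None:
        labels = list(range(len(rows)))  # fall back to integer labels
    elif len(labels) != len(rows):
        raise ValueError("provide exactly one label per reference")
    return (len(rows), dim), list(labels)


references = [
    [0.9, 0.1, 0.0],  # hypothetical prototype for "cat"
    [0.1, 0.8, 0.1],  # hypothetical prototype for "dog"
]
shape, labels = check_references(references, labels=["cat", "dog"])
print(shape, labels)
```

Once the check passes, the nested list can be converted with `numpy.asarray(references)` to obtain the two-dimensional array described above.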
With labels: The n_components parameter must be smaller than the embedding dimensionality but large enough to capture the primary shape of the data. Values between 10 and 50 often work well in practice.
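To make the idea of label-based references concrete, here is a deliberately simplified sketch: it averages the embeddings that share a label, giving one reference per label. It is a stand-in for illustration, not the whylogs implementation, and it skips the dimensionality-reduction step that n_components controls. The function name `label_centroids` and the sample data are hypothetical.

```python
from collections import defaultdict


def label_centroids(embeddings, labels):
    """Average the embeddings that share a label; one reference per label."""
    groups = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        groups[label].append(vector)
    ordered = sorted(groups)  # stable, predictable reference order
    centroids = [
        [sum(column) / len(groups[label]) for column in zip(*groups[label])]
        for label in ordered
    ]
    return centroids, ordered


embeddings = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]]
labels = ["a", "a", "b", "b"]
references, reference_labels = label_centroids(embeddings, labels)
print(references, reference_labels)
```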
Without labels: Labels will be consecutive integers starting at 0. The n_clusters parameter determines the number of references. The n_components parameter must be smaller than the embedding dimensionality but large enough to capture the primary shape of the data. Values between 10 and 50 often work well in practice.
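When no labels exist, references can come from clustering. The sketch below uses plain Lloyd's k-means in pure Python to show how n_clusters maps to the number of references and why the labels come out as consecutive integers. It is an illustration under stated assumptions, not the whylogs selector: it omits the dimensionality reduction governed by n_components, and `kmeans_references` is a hypothetical name.

```python
import math
import random


def kmeans_references(embeddings, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, integer labels 0..k-1)."""
    rng = random.Random(seed)
    # Start from n_clusters distinct data points.
    centroids = [list(v) for v in rng.sample(embeddings, n_clusters)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(n_clusters)]
        for vector in embeddings:
            nearest = min(
                range(n_clusters),
                key=lambda i: math.dist(vector, centroids[i]),
            )
            clusters[nearest].append(vector)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, list(range(n_clusters))


embeddings = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
references, reference_labels = kmeans_references(embeddings, n_clusters=2)
print(references, reference_labels)
```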
Log embeddings in whylogs
Log using the following code:
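One way this step might look is sketched below. It assumes that the experimental whylogs API exposes `EmbeddingMetric`, `EmbeddingConfig`, and `DistanceFunction` under `whylogs.experimental.extras.embedding_metric`, and that `MetricSpec`, `ResolverSpec`, and `DeclarativeSchema` live at the import paths shown; because the feature is in beta, verify these against your installed version. The wrapper `log_embeddings` and the column name `"embeddings"` are illustrative choices.

```python
try:
    import whylogs as why
    from whylogs.core.resolvers import MetricSpec, ResolverSpec
    from whylogs.core.schema import DeclarativeSchema
    from whylogs.experimental.extras.embedding_metric import (
        DistanceFunction,
        EmbeddingConfig,
        EmbeddingMetric,
    )

    WHYLOGS_AVAILABLE = True
except ImportError:
    # whylogs (or its embeddings extra) is missing, or the assumed paths differ.
    WHYLOGS_AVAILABLE = False


def log_embeddings(embeddings, references, labels, column_name="embeddings"):
    """Profile a 2-D array of embeddings against the chosen references."""
    if not WHYLOGS_AVAILABLE:
        raise RuntimeError("whylogs with the embeddings extra is required")
    # Metric configuration: which references to compare against, and how.
    config = EmbeddingConfig(
        references=references,
        labels=labels,
        distance_fn=DistanceFunction.euclidean,
    )
    # Resolver: attach the EmbeddingMetric to the embeddings column.
    schema = DeclarativeSchema(
        [ResolverSpec(column_name=column_name, metrics=[MetricSpec(EmbeddingMetric, config)])]
    )
    return why.log(row={column_name: embeddings}, schema=schema)
```

The configuration and the resolver correspond to the second step in the list at the top of this page; the returned result can then be written to WhyLabs in the same way as any other whylogs profile.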
Viewing and Monitoring in WhyLabs
Visualizations
You can see initial visualizations in WhyLabs, but many more are forthcoming in both whylogs and WhyLabs!
In the profile and input pages, we can see distributions of the embeddings data that tell us detailed information about the embeddings space.
Monitoring
For the beta version, you may want to set several monitors on the distributions produced by embeddings logging.
Set a discrete drift monitor on the FEATURE_NAME.closest feature to see overall drifts in the distribution of embeddings data.
Set continuous drift monitors on the FEATURE_NAME.REFERENCE_distance features of the references you care about most, to track how far incoming embeddings move from those references.
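To see why one feature suits a discrete monitor and the others suit continuous monitors, it helps to look at the per-row values involved. The sketch below computes them by hand for a single embedding; it follows the feature naming pattern from the text, assumes Euclidean distance, and is not the whylogs internals. `reference_features` and the sample data are hypothetical.

```python
import math


def reference_features(embedding, references, labels, feature_name="embeddings"):
    """Return the per-row values behind the .closest and _distance features."""
    distances = {
        label: math.dist(embedding, ref) for label, ref in zip(labels, references)
    }
    # One continuous value per reference.
    row = {f"{feature_name}.{label}_distance": d for label, d in distances.items()}
    # One categorical value: the label of the nearest reference.
    row[f"{feature_name}.closest"] = min(distances, key=distances.get)
    return row


references = [[0.0, 0.0], [3.0, 4.0]]
labels = ["origin", "far"]
row = reference_features([0.0, 1.0], references, labels)
print(row)
```

The `.closest` value is a category, so its distribution is compared with a discrete drift monitor, while each `_distance` value is a real number and fits a continuous one.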
Additional Resources
Example Notebook
Logging Generic Embeddings Data using Reference Distances
Blog Post
How to Troubleshoot Embeddings Without Eye-balling t-SNE or UMAP Plots