Glossary

The following is a glossary of terms commonly used throughout the WhyLabs Documentation. Note that some of these terms have a special meaning in the context of whylogs/WhyLabs.

whylogs is a data logging library that captures statistical properties of data and ML models.

A batch is a collection of data points, often grouped by time.
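As a rough illustration of grouping by time, the sketch below buckets timestamped data points into hourly batches. It uses only the Python standard library. The function name `group_into_batches` and the hourly granularity are assumptions made for this example and are not part of the whylogs API.

```python
from collections import defaultdict
from datetime import datetime


def group_into_batches(points):
    """Group (timestamp, value) pairs into hourly batches (hypothetical helper)."""
    batches = defaultdict(list)
    for ts, value in points:
        # Truncate the timestamp to the start of its hour to get the batch key.
        batch_key = ts.replace(minute=0, second=0, microsecond=0)
        batches[batch_key].append(value)
    return dict(batches)


points = [
    (datetime(2021, 1, 1, 9, 5), 1.0),
    (datetime(2021, 1, 1, 9, 40), 2.0),
    (datetime(2021, 1, 1, 10, 15), 3.0),
]
batches = group_into_batches(points)
```

Here the two points logged during the 09:00 hour fall into one batch and the 10:15 point falls into another.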

In batch mode, whylogs processes a dataset in batches.

In streaming mode, whylogs processes individual data points.

A dataset is a collection of related data that will be analyzed together. For tabular data, each column of the table represents a particular variable, and each row represents a record of the dataset. When used alongside a statistical model, the dataset often represents features as columns, with additional columns for the output. For unstructured or complex data, the representation depends on the data type.

A DatasetProfile is a collection of summary statistics and related metadata for a dataset that whylogs has processed.

Data Sketches are a class of algorithms that efficiently extract information from large or streaming datasets in a single pass. This term is sometimes used to refer specifically to the Apache DataSketches project.
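To make the single-pass idea concrete, here is a minimal sketch of one classic algorithm of this kind, the Misra-Gries frequent-items summary. It is an illustration only. It is not the implementation used by whylogs or by Apache DataSketches, and the function name and the parameter `k` are chosen for this example.

```python
def misra_gries(stream, k):
    """Approximate frequent items in one pass using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            # Item is already tracked: count it.
            counters[item] += 1
        elif len(counters) < k - 1:
            # Room for a new counter: start tracking the item.
            counters[item] = 1
        else:
            # No room: decrement every counter and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a", "b", "a", "c", "a", "d", "a"]
summary = misra_gries(stream, k=3)
```

The counts in `summary` are lower bounds on the true frequencies, and each is off by at most `len(stream) / k`. Memory stays bounded by `k` no matter how long the stream is, which is the property that makes sketches suitable for large or streaming datasets.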

A logger represents the whylogs tracking object for a given dataset (in batch mode) or a collection of data points (in streaming mode). A logger is always associated with two timestamps: one for its creation and one for the dataset. Different loggers may write to different storage systems using different output formats.

Metadata is data that describes either a dataset or information from whylogs’ processing of the dataset.

The whylogs output is available in the following formats: protobuf, JSON, and flat. Protobuf is a lightweight binary format that maps one-to-one with the memory representation of a whylogs object. JSON displays the protobuf data in JSON format. The flat format outputs multiple files with both CSV and JSON content, representing different views of the data, including histograms, upper bounds, lower bounds, and frequent values. To apply advanced transformations to whylogs output, we recommend using protobuf.

A pipeline consists of the components data moves through, as well as any infrastructure associated with those components. A project may have multiple ML pipelines, but it’s common to have one pipeline for a multi-stage project.

A Project is any ML model, data pipeline, data stream, or dataset that you want to monitor in WhyLabs. Profiles generated by whylogs can be regularly uploaded to a particular WhyLabs project for ongoing monitoring. There are multiple types of projects available, including Models and Datasets.

A record is a single observation of data. whylogs represents a record as a map from keys (feature names, as strings) to values (numerical or textual data).
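As a small illustration of this mapping, the sketch below turns the rows of a tabular dataset into records using plain Python dictionaries. The column names and values are made up for this example, and no whylogs API is involved.

```python
# Hypothetical column (feature) names and rows of a small table.
columns = ["age", "country", "score"]
rows = [
    (42, "US", 0.87),
    (35, "DE", 0.64),
]

# Each record maps string feature names to numerical or textual values.
records = [dict(zip(columns, row)) for row in rows]
```

Each element of `records` corresponds to one row of the table, keyed by feature name.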

A session represents your configuration for how your application interacts with whylogs, including logger configuration and input and output formats. We recommend using a single session for your application.

A statistical profile is a collection of statistical properties of a feature. The properties can differ between discrete and continuous features. In the context of this documentation, "DatasetProfile" is the whylogs-specific instance of a statistical profile.
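The sketch below illustrates how the properties might differ between the two kinds of features. It is a simplified, hypothetical example built on the Python standard library and does not reflect the actual statistics or structure of a whylogs DatasetProfile. As a simplifying assumption, any all-numeric feature is treated as continuous.

```python
from collections import Counter
import statistics


def profile_feature(values):
    """Summarize one feature (hypothetical helper, not a whylogs function)."""
    if not values:
        raise ValueError("cannot profile an empty feature")
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if numeric:
        # Assumption: numeric features are treated as continuous.
        return {
            "kind": "continuous",
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": statistics.fmean(values),
        }
    # Everything else is treated as discrete and summarized by value counts.
    counts = Counter(values)
    return {
        "kind": "discrete",
        "count": len(values),
        "unique": len(counts),
        "most_common": counts.most_common(1)[0],
    }


score_profile = profile_feature([1.0, 2.0, 3.0])
country_profile = profile_feature(["US", "DE", "US"])
```

The continuous feature is described by its range and mean, while the discrete feature is described by its number of distinct values and its most frequent value.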