
Segmenting Data


Profiling all of the data that passes through a data processing pipeline with whylogs already yields a large amount of information. Slicing the dataset into segments of interest adds visibility into user-defined subgroups that may behave differently from the dataset as a whole.

Manual segmentation

There are often domain- and organization-specific factors that determine which features of the data should be treated as segments. To give you this control, we strongly encourage teams to set up manual segments in their profiling.

For now, segments are configured slightly differently in Python and Spark. In Python, you may select segments either at the feature level (i.e., by column name) or at the feature-value level (i.e., by a specific value of a given column). In Apache Spark, segments can only be chosen at the feature level at this time.

import pandas as pd
from whylogs import get_or_create_session

# Assume a dataset with the following structure:
# index col1 col2 col3
# 0     "a"  1    100
# 1     "b"  2    100
# 2     "c"  1    200
# 3     "a"  2    100
pandas_df = pd.read_csv("demo.csv")
sess = get_or_create_session()

# Option 1: feature level, list of feature names
with sess.logger(dataset_name="my_dataset") as logger:
    logger.log_dataframe(pandas_df, segments=["col1", "col3"])

# OR Option 2: feature-value level, list of dictionary objects
with sess.logger(dataset_name="my_dataset") as logger:
    logger.log_dataframe(pandas_df, segments=[{"col1": "a"}, {"col1": "b"}, {"col3": 100}])
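
To make the difference between the two levels concrete, here is a minimal, library-free sketch of which rows would fall into each segment for the toy dataset. It is a conceptual illustration only, not how whylogs implements segmentation: the helper names `split_by_features` and `split_by_values` are hypothetical, and the sketch assumes that feature-level segmentation yields one segment per distinct combination of values in the listed columns.

```python
from collections import defaultdict

# Toy rows mirroring the example dataset (col1, col2, col3).
rows = [
    {"col1": "a", "col2": 1, "col3": 100},
    {"col1": "b", "col2": 2, "col3": 100},
    {"col1": "c", "col2": 1, "col3": 200},
    {"col1": "a", "col2": 2, "col3": 100},
]


def split_by_features(rows, features):
    """Feature level (assumed): one group per distinct combination of values."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in features)
        groups[key].append(row)
    return dict(groups)


def split_by_values(rows, conditions):
    """Feature-value level: one group per {column: value} condition."""
    return [
        [row for row in rows if all(row[c] == v for c, v in cond.items())]
        for cond in conditions
    ]


feature_groups = split_by_features(rows, ["col1", "col3"])
value_groups = split_by_values(rows, [{"col1": "a"}, {"col1": "b"}, {"col3": 100}])

for key, members in feature_groups.items():
    print(key, len(members))
print([len(group) for group in value_groups])
```

Note that feature-value conditions may overlap (a row with `col1 == "a"` and `col3 == 100` matches two of the conditions here), whereas the feature-level groups in this sketch partition the rows.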

Automatic segmentation

We also provide a simple algorithm for automatic selection of the segmentation features that can be calculated on a static dataset, such as a training dataset. This entropy-based calculation will return a list of features with the highest information gain on which we suggest you segment your data.

This calculation has a number of optional parameters, such as the maximum number of segments allowed.

Automatic segmentation is available in Python. whylogs will automatically ingest this segment data when profiling data using the same dataset name.

import pandas as pd
from whylogs import get_or_create_session

pandas_df = pd.read_csv("demo.csv")
sess = get_or_create_session()

# Estimate which features to segment on, capped at 10 segments
auto_segments = sess.estimate_segments(pandas_df, max_segments=10)
with sess.logger(dataset_name="my_dataset") as logger:
    logger.log_dataframe(pandas_df, segments=auto_segments)
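
For intuition about what "highest information gain" means, the following is a rough, self-contained sketch of ranking features by information gain with respect to a target column. It is not the whylogs implementation, whose details are not described here: the helper names (`entropy`, `information_gain`, `rank_features`), the `max_features` parameter, and the choice of `col3` as the target are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    counts = Counter(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, feature, target):
    """Reduction in target entropy after splitting rows on `feature`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    total = len(rows)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - conditional


def rank_features(rows, candidates, target, max_features=2):
    """Return up to `max_features` candidates, highest information gain first."""
    gains = {f: information_gain(rows, f, target) for f in candidates}
    return sorted(gains, key=gains.get, reverse=True)[:max_features]


# Toy rows mirroring the example dataset; col3 is the (assumed) target.
rows = [
    {"col1": "a", "col2": 1, "col3": 100},
    {"col1": "b", "col2": 2, "col3": 100},
    {"col1": "c", "col2": 1, "col3": 200},
    {"col1": "a", "col2": 2, "col3": 100},
]

ranked = rank_features(rows, ["col1", "col2"], target="col3")
print(ranked)
```

In this toy data, knowing `col1` fully determines `col3`, so it ranks ahead of `col2`. Be aware that plain information gain favors features with many distinct values, so a cap on the number of segments helps keep the result manageable.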