whylogs.core.datasetprofile

Defines the primary interface class for tracking dataset statistics.

SCALAR_NAME_MAPPING#

Defines (some of) the mapping from dataset summary fields to the flat table. Note: ordered dicts are used here to control the ordering of the generated columns; plain dictionaries are also valid.

DatasetProfile Objects#

class DatasetProfile()

Statistics tracking for a dataset.

A dataset refers to a collection of columns.

Parameters#

name : str
A human-readable name for the dataset profile. Could be a model name. This is stored under the "name" tag.
dataset_timestamp : datetime.datetime, optional
The timestamp associated with the data (i.e. a batch run).
session_timestamp : datetime.datetime
Timestamp of the session.
columns : dict
Dictionary lookup of ColumnProfiles.
tags : dict
A dictionary of key->value pairs. Can be used upstream for aggregating data. Tags must match when merging with another dataset profile object.
metadata : dict
Metadata that can store arbitrary string mappings. Metadata is not used when aggregating data and can be dropped when merging with another dataset profile object.
session_id : str
The unique session ID of the run. Should be a UUID.
constraints : DatasetConstraints
Static assertions to be applied to tracked numeric data and profile summaries.
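
A minimal construction sketch based on the parameters above (the name and tag values are illustrative):

```python
import datetime

from whylogs.core.datasetprofile import DatasetProfile

# Tags must match across profiles that will later be merged.
profile = DatasetProfile(
    name="my-model",
    dataset_timestamp=datetime.datetime.now(datetime.timezone.utc),
    tags={"env": "dev"},
)
```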

session_timestamp_ms#

| @property
| session_timestamp_ms()

Return the session timestamp value in epoch milliseconds.

track_metrics#

| track_metrics(targets: List[Union[str, bool, float, int]], predictions: List[Union[str, bool, float, int]], scores: List[float] = None, target_field: str = None, prediction_field: str = None, score_field: str = None)

Function to track metrics based on validation data.

The user may also pass the attribute names associated with the target, prediction, and/or score.

Parameters#

targets : List[Union[str, bool, float, int]]
Actual validated values.
predictions : List[Union[str, bool, float, int]]
Inferred/predicted values.
scores : List[float], optional
Associated scores for each prediction; all values are set to 1 if not passed.
target_field : str, optional
prediction_field : str, optional
score_field : str, optional
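
A short sketch, assuming `profile` is the DatasetProfile created earlier (labels and scores are illustrative):

```python
profile.track_metrics(
    targets=["cat", "dog", "cat"],      # ground-truth labels
    predictions=["cat", "dog", "dog"],  # model outputs
    scores=[0.9, 0.8, 0.6],             # optional confidence scores
)
```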

track#

| track(columns, data=None)

Add value(s) to tracking statistics for column(s).

Parameters#

columns : str, dict
Either the name of a column, or a dictionary specifying column names and the data (value) for each column. If a string, data must be supplied. Otherwise, data is ignored.
data : object, None
Value to track. Specify if columns is a string.
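
Both calling styles, assuming `profile` is a DatasetProfile (column names and values are illustrative):

```python
# Track a single value for one column.
profile.track("age", 42)

# Track several columns at once; data is ignored in this form.
profile.track({"age": 42, "country": "US"})
```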

track_array#

| track_array(x: np.ndarray, columns=None)

Track statistics for a numpy array

Parameters#

x : np.ndarray
2D array to track.
columns : list
Optional column labels.
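
A sketch with a small array (the column labels are illustrative):

```python
import numpy as np

# Rows are records, columns are features.
x = np.array([[1.0, 2.0], [3.0, 4.0]])
profile.track_array(x, columns=["feature_a", "feature_b"])
```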

track_dataframe#

| track_dataframe(df: pd.DataFrame)

Track statistics for a dataframe

Parameters#

df : pandas.DataFrame DataFrame to track
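
A sketch with a small frame, assuming `profile` is a DatasetProfile:

```python
import pandas as pd

df = pd.DataFrame({"age": [42, 17], "country": ["US", "CA"]})
profile.track_dataframe(df)
```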

to_properties#

| to_properties()

Return dataset profile related metadata

Returns#

properties : DatasetProperties The metadata as a protobuf object.

to_summary#

| to_summary()

Generate a summary of the statistics

Returns#

summary : DatasetSummary Protobuf summary message.

generate_constraints#

| generate_constraints() -> DatasetConstraints

Assemble a sparse dict of constraints for all features.

Returns#

summary : DatasetConstraints Protobuf constraints message.

flat_summary#

| flat_summary()

Generate and flatten a summary of the statistics.

See flatten_summary for a description.
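
A usage sketch, assuming the output matches the dict described under flatten_summary below:

```python
flat = profile.flat_summary()
summary_df = flat["summary"]  # per-column summary statistics (pandas.DataFrame)
```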

chunk_iterator#

| chunk_iterator()

Generate an iterator over chunks of data.

validate#

| validate()

Sanity check for this object. Raises an AssertionError if invalid

merge#

| merge(other)

Merge this profile with another dataset profile object.

We will use metadata and timestamps from the current DatasetProfile in the result.

This operation will drop the metadata from the 'other' profile object.

Parameters#

other : DatasetProfile

Returns#

merged : DatasetProfile New, merged DatasetProfile
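
A sketch of non-strict merging, assuming `profile_a` and `profile_b` are DatasetProfile objects with matching tags:

```python
merged = profile_a.merge(profile_b)
# merged keeps the metadata and timestamps of profile_a;
# profile_b's metadata is dropped.
```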

merge_strict#

| merge_strict(other)

Merge this profile with another dataset profile object. This throws an exception if the session_id, timestamps, and tags don't match.

This operation will drop the metadata from the 'other' profile object.

Parameters#

other : DatasetProfile

Returns#

merged : DatasetProfile New, merged DatasetProfile

serialize_delimited#

| serialize_delimited() -> bytes

Write out in delimited format (data is prefixed with the length of the datastream).

This is useful when you are streaming multiple dataset profile objects.

Returns#

data : bytes A sequence of bytes

to_protobuf#

| to_protobuf() -> DatasetProfileMessage

Return the object serialized as a protobuf message

Returns#

message : DatasetProfileMessage

write_protobuf#

| write_protobuf(protobuf_path: str, delimited_file: bool = True)

Write the dataset profile to disk in binary format

Arguments:

  • protobuf_path: the local path for storage. The parent directory must already exist
  • delimited_file: whether to prefix the data with the length of output or not. Default is True

read_protobuf#

| @staticmethod
| read_protobuf(protobuf_path: str, delimited_file: bool = True)

Parse a protobuf file and return a DatasetProfile object

Arguments:

  • protobuf_path: the path of the protobuf data
  • delimited_file: whether the data is delimited or not. Default is True

Returns:

A DatasetProfile object (whylogs.DatasetProfile) if successful.
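
A round-trip sketch combining write_protobuf with read_protobuf (the file path is illustrative):

```python
# Write the profile to disk, then read it back.
profile.write_protobuf("profile.bin", delimited_file=True)
restored = DatasetProfile.read_protobuf("profile.bin", delimited_file=True)
```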

from_protobuf#

| @staticmethod
| from_protobuf(message: DatasetProfileMessage)

Load from a protobuf message

Parameters#

message : DatasetProfileMessage The protobuf message. Should match the output of DatasetProfile.to_protobuf()

Returns#

dataset_profile : DatasetProfile
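
An in-memory round trip, assuming `profile` is a DatasetProfile:

```python
message = profile.to_protobuf()
restored = DatasetProfile.from_protobuf(message)
```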

from_protobuf_string#

| @staticmethod
| from_protobuf_string(data: bytes)

Deserialize a serialized DatasetProfileMessage

Parameters#

data : bytes The serialized message

Returns#

profile : DatasetProfile The deserialized dataset profile

parse_delimited_single#

| @staticmethod
| parse_delimited_single(data: bytes, pos=0)

Parse a single delimited entry from a byte stream

Parameters#

data : bytes
The bytestream.
pos : int
The starting position. Default is zero.

Returns#

pos : int
Current position in the stream after parsing.
profile : DatasetProfile
A dataset profile.

parse_delimited#

| @staticmethod
| parse_delimited(data: bytes)

Parse delimited data (i.e. data prefixed with the message length).

Java protobuf writes delimited messages, which is convenient for storing multiple dataset profiles. This means that the main data is prefixed with the length of the message.

Parameters#

data : bytes The input byte stream

Returns#

profiles : list List of all Dataset profile objects
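
A streaming sketch combining serialize_delimited with parse_delimited, assuming `profile_a` and `profile_b` are DatasetProfile objects:

```python
# Concatenate two length-prefixed profiles into one buffer...
data = profile_a.serialize_delimited() + profile_b.serialize_delimited()

# ...and recover both from it.
profiles = DatasetProfile.parse_delimited(data)
assert len(profiles) == 2
```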

columns_chunk_iterator#

columns_chunk_iterator(iterator, marker: str)

Create an iterator to return column messages in batches

Parameters#

iterator
An iterator which returns protobuf column messages.
marker : str
Value used to mark a group of column messages.

flatten_summary#

flatten_summary(dataset_summary: DatasetSummary) -> dict

Flatten a DatasetSummary

Parameters#

dataset_summary : DatasetSummary Summary to flatten

Returns#

data : dict A dictionary with the following keys:

summary : pandas.DataFrame
Per-column summary statistics
hist : pandas.Series
Series of histogram Series with (column name, histogram) key,
value pairs. Histograms are formatted as a `pandas.Series`
frequent_strings : pandas.Series
Series of frequent string counts with (column name, counts)
key, val pairs. `counts` are a pandas Series.
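
A usage sketch, assuming `profile` is a DatasetProfile (the column name is illustrative):

```python
from whylogs.core.datasetprofile import flatten_summary

flat = flatten_summary(profile.to_summary())
per_column = flat["summary"]   # pandas.DataFrame of per-column statistics
histograms = flat["hist"]      # pandas.Series keyed by column name
```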

Notes#

Some relevant info on the summary mapping:

```python
>>> from whylogs.core.datasetprofile import SCALAR_NAME_MAPPING
>>> import json
>>> print(json.dumps(SCALAR_NAME_MAPPING, indent=2))
```

flatten_dataset_quantiles#

flatten_dataset_quantiles(dataset_summary: DatasetSummary)

Flatten quantiles from a dataset summary

flatten_dataset_histograms#

flatten_dataset_histograms(dataset_summary: DatasetSummary)

Flatten histograms from a dataset summary

flatten_dataset_frequent_numbers#

flatten_dataset_frequent_numbers(dataset_summary: DatasetSummary)

Flatten frequent number counts from a dataset summary

flatten_dataset_frequent_strings#

flatten_dataset_frequent_strings(dataset_summary: DatasetSummary)

Flatten frequent strings summaries from a dataset summary

get_dataset_frame#

get_dataset_frame(dataset_summary: DatasetSummary, mapping: dict = None)

Get a dataframe from scalar values flattened from a dataset summary

Parameters#

dataset_summary : DatasetSummary
The dataset summary.
mapping : dict, optional
Override the default variable mapping.

Returns#

summary : pd.DataFrame
Scalar values, flattened and renamed according to the mapping.
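
A sketch, assuming `profile` is a DatasetProfile:

```python
from whylogs.core.datasetprofile import get_dataset_frame

summary_df = get_dataset_frame(profile.to_summary())
```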

dataframe_profile#

dataframe_profile(df: pd.DataFrame, name: str = None, timestamp: datetime.datetime = None)

Generate a dataset profile for a dataframe

Parameters#

df : pandas.DataFrame
Dataframe to track, treated as a complete dataset.
name : str
Name of the dataset.
timestamp : datetime.datetime, float
Timestamp of the dataset. Defaults to current UTC time. Can be a datetime or UTC epoch seconds.

Returns#

prof : DatasetProfile
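
A convenience-function sketch (the dataset name is illustrative):

```python
import pandas as pd

from whylogs.core.datasetprofile import dataframe_profile

df = pd.DataFrame({"age": [42, 17]})
prof = dataframe_profile(df, name="my-dataset")
```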

array_profile#

array_profile(x: np.ndarray, name: str = None, timestamp: datetime.datetime = None, columns: list = None)

Generate a dataset profile for an array

Parameters#

x : np.ndarray
Array-like object to track. Will be treated as a full dataset.
name : str
Name of the dataset.
timestamp : datetime.datetime
Timestamp of the dataset. Defaults to current UTC time.
columns : list
Optional column labels.

Returns#

prof : DatasetProfile
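
And an equivalent sketch for arrays (the name and column labels are illustrative):

```python
import numpy as np

from whylogs.core.datasetprofile import array_profile

x = np.array([[1.0, 2.0], [3.0, 4.0]])
prof = array_profile(x, name="my-array", columns=["feature_a", "feature_b"])
```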