whylogs.core.datasetprofile

Defines the primary interface class for tracking dataset statistics.

SCALAR_NAME_MAPPING#

Defines (some of) the mapping from dataset summary fields to the flat table. Note: ordered dicts are used here to control the ordering of the generated columns; plain dictionaries are also valid.

DatasetProfile Objects#

class DatasetProfile()

Statistics tracking for a dataset.

A dataset refers to a collection of columns.

Parameters#

name : str
A human-readable name for the dataset profile. Could be a model name. This is stored under the "name" tag.
dataset_timestamp : datetime.datetime, optional
The timestamp associated with the data (i.e. a batch run).
session_timestamp : datetime.datetime
Timestamp of the session.
columns : dict
Dictionary lookup of ColumnProfiles.
tags : dict
A dictionary of key->value pairs. Can be used upstream for aggregating data. Tags must match when merging with another dataset profile object.
metadata : dict
Metadata that can store arbitrary string mappings. Metadata is not used when aggregating data and can be dropped when merging with another dataset profile object.
session_id : str
The unique session ID of the run. Should be a UUID.
constraints : DatasetConstraints
Static assertions to be applied to tracked numeric data and profile summaries.
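
A minimal construction sketch based on the parameters above (the name and tag values are illustrative):

```python
import datetime

from whylogs.core.datasetprofile import DatasetProfile

# Tags must match across profiles that will later be merged.
profile = DatasetProfile(
    name="my-model",
    dataset_timestamp=datetime.datetime.now(datetime.timezone.utc),
    tags={"env": "dev"},
)
```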

session_timestamp_ms#

| @property
| session_timestamp_ms()

Return the session timestamp value in epoch milliseconds.

track_metrics#

| track_metrics(targets: List[Union[str, bool, float, int]], predictions: List[Union[str, bool, float, int]], scores: List[float] = None, target_field: str = None, prediction_field: str = None, score_field: str = None)

Function to track metrics based on validation data.

The user may also pass the attribute names associated with the target, prediction, and/or score.

Parameters#

targets : List[Union[str, bool, float, int]]
Actual validated values.
predictions : List[Union[str, bool, float, int]]
Inferred/predicted values.
scores : List[float], optional
Associated scores for each prediction; all values are set to 1 if not passed.
target_field : str, optional
prediction_field : str, optional
score_field : str, optional
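
A short sketch, assuming `profile` is the DatasetProfile created earlier (labels and scores are illustrative):

```python
profile.track_metrics(
    targets=["cat", "dog", "cat"],      # ground-truth labels
    predictions=["cat", "dog", "dog"],  # model outputs
    scores=[0.9, 0.8, 0.6],             # optional confidence scores
)
```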

track#

| track(columns, data=None)

Add value(s) to tracking statistics for column(s).

Parameters#

columns : str, dict
Either the name of a column, or a dictionary specifying column names and the data (value) for each column. If a string, data must be supplied. Otherwise, data is ignored.
data : object, None
Value to track. Specify if columns is a string.
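
Both calling styles, assuming `profile` is a DatasetProfile (column names and values are illustrative):

```python
# Track a single value for one column.
profile.track("age", 42)

# Track several columns at once; data is ignored in this form.
profile.track({"age": 42, "country": "US"})
```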

track_array#

| track_array(x: np.ndarray, columns=None)

Track statistics for a numpy array

Parameters#

x : np.ndarray
2D array to track.
columns : list
Optional column labels.
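
A sketch with a small array (the column labels are illustrative):

```python
import numpy as np

# Rows are records, columns are features.
x = np.array([[1.0, 2.0], [3.0, 4.0]])
profile.track_array(x, columns=["feature_a", "feature_b"])
```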

track_dataframe#

| track_dataframe(df: pd.DataFrame)

Track statistics for a dataframe

Parameters#

df : pandas.DataFrame DataFrame to track
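
A sketch with a small frame, assuming `profile` is a DatasetProfile:

```python
import pandas as pd

df = pd.DataFrame({"age": [42, 17], "country": ["US", "CA"]})
profile.track_dataframe(df)
```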

to_properties#

| to_properties()

Return dataset profile related metadata

Returns#

properties : DatasetProperties The metadata as a protobuf object.

to_summary#

| to_summary()

Generate a summary of the statistics

Returns#

summary : DatasetSummary Protobuf summary message.

generate_constraints#

| generate_constraints() -> DatasetConstraints

Assemble a sparse dict of constraints for all features.

Returns#

summary : DatasetConstraints Protobuf constraints message.

flat_summary#

| flat_summary()

Generate and flatten a summary of the statistics.

See flatten_summary for a description.
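
A usage sketch, assuming the output matches the dict described under flatten_summary below:

```python
flat = profile.flat_summary()
summary_df = flat["summary"]  # per-column summary statistics (pandas.DataFrame)
```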

chunk_iterator#

| chunk_iterator()

Generate an iterator over chunks of data.

validate#

| validate()

Sanity check for this object. Raises an AssertionError if invalid

merge#

| merge(other)

Merge this profile with another dataset profile object.

We will use metadata and timestamps from the current DatasetProfile in the result.

This operation will drop the metadata from the 'other' profile object.

Parameters#

other : DatasetProfile

Returns#

merged : DatasetProfile New, merged DatasetProfile
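
A sketch of non-strict merging, assuming `profile_a` and `profile_b` are DatasetProfile objects with matching tags:

```python
merged = profile_a.merge(profile_b)
# merged keeps the metadata and timestamps of profile_a;
# profile_b's metadata is dropped.
```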

merge_strict#

| merge_strict(other)

Merge this profile with another dataset profile object. This throws an exception if the session_id, timestamps, and tags don't match.

This operation will drop the metadata from the 'other' profile object.

Parameters#

other : DatasetProfile

Returns#

merged : DatasetProfile New, merged DatasetProfile

serialize_delimited#

| serialize_delimited() -> bytes

Write out in delimited format (data is prefixed with the length of the datastream).

This is useful when you are streaming multiple dataset profile objects.

Returns#

data : bytes A sequence of bytes

to_protobuf#

| to_protobuf() -> DatasetProfileMessage

Return the object serialized as a protobuf message

Returns#

message : DatasetProfileMessage

write_protobuf#

| write_protobuf(protobuf_path: str, delimited_file: bool = True)

Write the dataset profile to disk in binary format

Arguments:

  • protobuf_path: the local path for storage. The parent directory must already exist
  • delimited_file: whether to prefix the data with the length of output or not. Default is True

read_protobuf#

| @staticmethod
| read_protobuf(protobuf_path: str, delimited_file: bool = True)

Parse a protobuf file and return a DatasetProfile object

Arguments:

  • protobuf_path: the path of the protobuf data
  • delimited_file: whether the data is delimited or not. Default is True

Returns:

A DatasetProfile object (whylogs.DatasetProfile) if successful.
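
A round-trip sketch combining write_protobuf with read_protobuf (the file path is illustrative):

```python
# Write the profile to disk, then read it back.
profile.write_protobuf("profile.bin", delimited_file=True)
restored = DatasetProfile.read_protobuf("profile.bin", delimited_file=True)
```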

from_protobuf#

| @staticmethod
| from_protobuf(message: DatasetProfileMessage)

Load from a protobuf message

Parameters#

message : DatasetProfileMessage The protobuf message. Should match the output of DatasetProfile.to_protobuf()

Returns#

dataset_profile : DatasetProfile
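
An in-memory round trip, assuming `profile` is a DatasetProfile:

```python
message = profile.to_protobuf()
restored = DatasetProfile.from_protobuf(message)
```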

from_protobuf_string#

| @staticmethod
| from_protobuf_string(data: bytes)

Deserialize a serialized DatasetProfileMessage

Parameters#

data : bytes The serialized message

Returns#

profile : DatasetProfile The deserialized dataset profile

parse_delimited_single#

| @staticmethod
| parse_delimited_single(data: bytes, pos=0)

Parse a single delimited entry from a byte stream

Parameters#

data : bytes
The bytestream.
pos : int
The starting position. Default is zero.

Returns#

pos : int
Current position in the stream after parsing.
profile : DatasetProfile
A dataset profile.

parse_delimited#

| @staticmethod
| parse_delimited(data: bytes)

Parse delimited data (i.e. data prefixed with the message length).

Java protobuf writes delimited messages, which is convenient for storing multiple dataset profiles. This means that the main data is prefixed with the length of the message.

Parameters#

data : bytes The input byte stream

Returns#

profiles : list List of all Dataset profile objects
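
A streaming sketch combining serialize_delimited with parse_delimited, assuming `profile_a` and `profile_b` are DatasetProfile objects:

```python
# Concatenate two length-prefixed profiles into one buffer...
data = profile_a.serialize_delimited() + profile_b.serialize_delimited()

# ...and recover both from it.
profiles = DatasetProfile.parse_delimited(data)
assert len(profiles) == 2
```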

columns_chunk_iterator#

columns_chunk_iterator(iterator, marker: str)

Create an iterator to return column messages in batches

Parameters#

iterator
An iterator which returns protobuf column messages.
marker : str
Value used to mark a group of column messages.

flatten_summary#

flatten_summary(dataset_summary: DatasetSummary) -> dict

Flatten a DatasetSummary

Parameters#

dataset_summary : DatasetSummary Summary to flatten

Returns#

data : dict A dictionary with the following keys:

summary : pandas.DataFrame
Per-column summary statistics
hist : pandas.Series
Series of histogram Series with (column name, histogram) key,
value pairs. Histograms are formatted as a `pandas.Series`
frequent_strings : pandas.Series
Series of frequent string counts with (column name, counts)
key, val pairs. `counts` are a pandas Series.
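
A usage sketch, assuming `profile` is a DatasetProfile (the column name is illustrative):

```python
from whylogs.core.datasetprofile import flatten_summary

flat = flatten_summary(profile.to_summary())
per_column = flat["summary"]   # pandas.DataFrame of per-column statistics
histograms = flat["hist"]      # pandas.Series keyed by column name
```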

Notes#

Some relevant info on the summary mapping:

```python
>>> from whylogs.core.datasetprofile import SCALAR_NAME_MAPPING
>>> import json
>>> print(json.dumps(SCALAR_NAME_MAPPING, indent=2))
```

flatten_dataset_quantiles#

flatten_dataset_quantiles(dataset_summary: DatasetSummary)

Flatten quantiles from a dataset summary

flatten_dataset_histograms#

flatten_dataset_histograms(dataset_summary: DatasetSummary)

Flatten histograms from a dataset summary

flatten_dataset_frequent_numbers#

flatten_dataset_frequent_numbers(dataset_summary: DatasetSummary)

Flatten frequent number counts from a dataset summary

flatten_dataset_frequent_strings#

flatten_dataset_frequent_strings(dataset_summary: DatasetSummary)

Flatten frequent strings summaries from a dataset summary

get_dataset_frame#

get_dataset_frame(dataset_summary: DatasetSummary, mapping: dict = None)

Get a dataframe from scalar values flattened from a dataset summary

Parameters#

dataset_summary : DatasetSummary
The dataset summary.
mapping : dict, optional
Override the default variable mapping.

Returns#

summary : pd.DataFrame
Scalar values, flattened and renamed according to the mapping.
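
A sketch, assuming `profile` is a DatasetProfile:

```python
from whylogs.core.datasetprofile import get_dataset_frame

summary_df = get_dataset_frame(profile.to_summary())
```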

dataframe_profile#

dataframe_profile(df: pd.DataFrame, name: str = None, timestamp: datetime.datetime = None)

Generate a dataset profile for a dataframe

Parameters#

df : pandas.DataFrame
Dataframe to track, treated as a complete dataset.
name : str
Name of the dataset.
timestamp : datetime.datetime, float
Timestamp of the dataset. Defaults to current UTC time. Can be a datetime or UTC epoch seconds.

Returns#

prof : DatasetProfile
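
A convenience-function sketch (the dataset name is illustrative):

```python
import pandas as pd

from whylogs.core.datasetprofile import dataframe_profile

df = pd.DataFrame({"age": [42, 17]})
prof = dataframe_profile(df, name="my-dataset")
```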

array_profile#

array_profile(x: np.ndarray, name: str = None, timestamp: datetime.datetime = None, columns: list = None)

Generate a dataset profile for an array

Parameters#

x : np.ndarray
Array-like object to track. Will be treated as a full dataset.
name : str
Name of the dataset.
timestamp : datetime.datetime
Timestamp of the dataset. Defaults to current UTC time.
columns : list
Optional column labels.

Returns#

prof : DatasetProfile
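
And an equivalent sketch for arrays (the name and column labels are illustrative):

```python
import numpy as np

from whylogs.core.datasetprofile import array_profile

x = np.array([[1.0, 2.0], [3.0, 4.0]])
prof = array_profile(x, name="my-array", columns=["feature_a", "feature_b"])
```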