Defines the primary interface class for tracking dataset statistics.
DatasetProfile Objects
class DatasetProfile()
Statistics tracking for a dataset.
A dataset refers to a collection of columns.
Parameters
name : str
A human-readable name for the dataset profile, e.g. a model name. This is stored under the "name" tag.
dataset_timestamp: datetime.datetime
The timestamp associated with the data (i.e. batch run). Optional.
session_timestamp : datetime.datetime
Timestamp of the profiling session.
columns : dict
Dictionary lookup of ColumnProfiles
tags : dict
A dictionary of key->value. Can be used upstream for aggregating data. Tags must match when merging
with another dataset profile object.
metadata : dict
Metadata that can store arbitrary string mappings. Metadata is not used when aggregating data and can be dropped when merging with another dataset profile object.
session_id : str
The unique ID of the session run. Should be a UUID.
constraints : DatasetConstraints
Static assertions to be applied to tracked numeric data and profile summaries.
session_timestamp_ms
| @property
| session_timestamp_ms()
Return the session timestamp value in epoch milliseconds.
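The conversion this property performs can be sketched with the standard library (a sketch of the documented behavior, not whylogs' actual implementation; `to_epoch_ms` is a hypothetical helper):

```python
from datetime import datetime, timezone

def to_epoch_ms(dt: datetime) -> int:
    # Convert an aware datetime to epoch milliseconds, which is what
    # session_timestamp_ms is documented to return.
    return int(dt.timestamp() * 1000)

session_timestamp = datetime(2021, 1, 1, tzinfo=timezone.utc)
print(to_epoch_ms(session_timestamp))  # 1609459200000
```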
track_metrics
| track_metrics(targets: List[Union[str, bool, float, int]], predictions: List[Union[str, bool, float, int]], scores: List[float] = None, model_type: ModelType = None, target_field: str = None, prediction_field: str = None, score_field: str = None)
Track metrics based on validation data.
The user may also pass the attribute names associated with the target, prediction, and/or score.
Parameters
targets : List[Union[str, bool, float, int]]
Actual validated values.
predictions : List[Union[str, bool, float, int]]
Inferred/predicted values.
scores : List[float], optional
Associated scores for each prediction; all values are set to 1 if not passed.
model_type : ModelType, optional
Default is the classification type.
target_field : str, optional
Attribute name associated with the target.
prediction_field : str, optional
Attribute name associated with the prediction.
score_field : str, optional
Attribute name associated with the score.
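To make the inputs concrete, here is a small sketch of the kind of per-label counting that classification metric tracking performs (a simplified illustration, not whylogs' actual implementation; `count_confusion` is a hypothetical helper):

```python
from collections import Counter
from typing import List, Optional, Union

Label = Union[str, bool, float, int]

def count_confusion(targets: List[Label], predictions: List[Label],
                    scores: Optional[List[float]] = None) -> Counter:
    # Mirror the documented default: all scores set to 1 if not passed.
    if scores is None:
        scores = [1.0] * len(targets)
    # Count (target, prediction) pairs, as a confusion matrix would.
    return Counter(zip(targets, predictions))

counts = count_confusion(["cat", "dog", "cat"], ["cat", "cat", "cat"])
print(counts[("cat", "cat")])  # 2
```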
track
| track(columns, data=None, character_list=None, token_method=None)
Add value(s) to tracking statistics for column(s).
Parameters
columns : str, dict
Either the name of a column, or a dictionary specifying column names and the data (value) for each column. If a string, data must be supplied; otherwise, data is ignored.
data : object, None
Value to track. Specify if columns is a string.
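The two calling conventions can be illustrated with a small stand-in (`normalize_columns` is a hypothetical helper showing the dispatch described above, not whylogs code):

```python
def normalize_columns(columns, data=None):
    # `columns` may be a dict mapping column names to values...
    if isinstance(columns, dict):
        # ...in which case `data` is ignored, per the docs.
        return dict(columns)
    # ...or a single column name, in which case `data` must be supplied.
    if data is None:
        raise ValueError("data must be supplied when columns is a string")
    return {columns: data}

print(normalize_columns("age", 42))                 # {'age': 42}
print(normalize_columns({"age": 42, "name": "a"}))  # {'age': 42, 'name': 'a'}
```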
track_array
| track_array(x: np.ndarray, columns=None)
Track statistics for a numpy array
Parameters
x : np.ndarray
2D array to track.
columns : list
Optional column labels.
track_dataframe
| track_dataframe(df: pd.DataFrame, character_list=None, token_method=None)
Track statistics for a dataframe
Parameters
df : pandas.DataFrame
DataFrame to track.
to_properties
| to_properties()
Return dataset profile related metadata
Returns
properties : DatasetProperties
The metadata as a protobuf object.
to_summary
| to_summary()
Generate a summary of the statistics
Returns
summary : DatasetSummary
Protobuf summary message.
generate_constraints
| generate_constraints() -> DatasetConstraints
Assemble a sparse dict of constraints for all features.
Returns
constraints : DatasetConstraints
Protobuf constraints message.
flat_summary
| flat_summary()
Generate and flatten a summary of the statistics.
See :func:`flatten_summary` for a description.
chunk_iterator
| chunk_iterator()
Generate an iterator to iterate over chunks of data
validate
| validate()
Sanity check for this object. Raises an AssertionError if invalid
merge
| merge(other)
Merge this profile with another dataset profile object.
We will use metadata and timestamps from the current DatasetProfile in the result.
This operation will drop the metadata from the 'other' profile object.
Parameters
other : DatasetProfile
Returns
merged : DatasetProfile
New, merged DatasetProfile.
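The merge semantics described above can be sketched with plain dicts (a simplified illustration using hypothetical count-only column stats, not whylogs' actual implementation):

```python
def merge_profiles(a, b):
    # Tags must match when merging; metadata and timestamps come from
    # the first profile, and the other's metadata is dropped.
    if a["tags"] != b["tags"]:
        raise AssertionError("tags must match when merging")
    merged_columns = dict(a["columns"])
    for name, count in b["columns"].items():
        merged_columns[name] = merged_columns.get(name, 0) + count
    return {"tags": a["tags"], "metadata": a["metadata"],
            "columns": merged_columns}

a = {"tags": {"env": "prod"}, "metadata": {"run": "1"},
     "columns": {"age": 10}}
b = {"tags": {"env": "prod"}, "metadata": {"run": "2"},
     "columns": {"age": 5, "name": 7}}
merged = merge_profiles(a, b)
print(merged["columns"])  # {'age': 15, 'name': 7}
```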
merge_strict
| merge_strict(other)
Merge this profile with another dataset profile object. This throws an exception if the session_id, timestamps, or tags don't match.
This operation will drop the metadata from the 'other' profile object.
Parameters
other : DatasetProfile
Returns
merged : DatasetProfile
New, merged DatasetProfile.
serialize_delimited
| serialize_delimited() -> bytes
Write out in delimited format (data is prefixed with the length of the datastream).
This is useful when you are streaming multiple dataset profile objects
Returns
data : bytes
A sequence of bytes.
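Delimited framing prefixes each message with its length, which is what lets multiple profiles share one byte stream. A minimal sketch of the idea using protobuf-style varint length prefixes (a simplified illustration, not whylogs' serializer; `encode_varint` and `write_delimited` are hypothetical helpers):

```python
def encode_varint(n: int) -> bytes:
    # Protobuf-style base-128 varint: 7 payload bits per byte,
    # high bit set on every byte except the last.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def write_delimited(messages) -> bytes:
    # Prefix each serialized message with its varint-encoded length.
    stream = bytearray()
    for msg in messages:
        stream += encode_varint(len(msg)) + msg
    return bytes(stream)

stream = write_delimited([b"profile-1", b"profile-2"])
print(stream)  # b'\tprofile-1\tprofile-2'
```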
to_protobuf
| to_protobuf() -> DatasetProfileMessage
Return the object serialized as a protobuf message
Returns
message : DatasetProfileMessage
write_protobuf
| write_protobuf(protobuf_path: str, delimited_file: bool = True)
Write the dataset profile to disk in binary format
Parameters
protobuf_path : str
Local path or any path supported by smart_open: https://github.com/RaRe-Technologies/smart_open#how. The parent directory must already exist.
delimited_file : bool, optional
Whether to prefix the data with the length of the output. Default is True.
read_protobuf
| @staticmethod
| read_protobuf(protobuf_path: str, delimited_file: bool = True) -> "DatasetProfile"
Parse a protobuf file and return a DatasetProfile object
Parameters
protobuf_path : str
the path of the protobuf data, can be local or any other path supported by smart_open: https://github.com/RaRe-Technologies/smart_open#how
delimited_file : bool, optional
whether the data is delimited or not. Default is True
Returns
DatasetProfile
A whylogs.DatasetProfile object parsed from the protobuf data.
from_protobuf
| @staticmethod
| from_protobuf(message: DatasetProfileMessage) -> "DatasetProfile"
Load from a protobuf message
Parameters
message : DatasetProfileMessage
The protobuf message. Should match the output of
DatasetProfile.to_protobuf()
Returns
dataset_profile : DatasetProfile
from_protobuf_string
| @staticmethod
| from_protobuf_string(data: bytes) -> "DatasetProfile"
Deserialize a serialized DatasetProfileMessage
Parameters
data : bytes
The serialized message.
Returns
profile : DatasetProfile
The deserialized dataset profile.
parse_delimited_single
| @staticmethod
| parse_delimited_single(data: bytes, pos=0)
Parse a single delimited entry from a byte stream
Parameters
data : bytes
The bytestream.
pos : int
The starting position. Default is zero.
Returns
pos : int
Current position in the stream after parsing.
profile : DatasetProfile
A dataset profile.
parse_delimited
| @staticmethod
| parse_delimited(data: bytes)
Parse delimited data (i.e. data prefixed with the message length).
Java protobuf writes delimited messages, which is convenient for storing multiple dataset profiles. This means that the main data is prefixed with the length of the message.
Parameters
data : bytes The input byte stream
Returns
profiles : list
List of all DatasetProfile objects.
columns_chunk_iterator
columns_chunk_iterator(iterator, marker: str)
Create an iterator to return column messages in batches
Parameters
iterator
An iterator which returns protobuf column messages.
marker
Value used to mark a group of column messages.
dataframe_profile
dataframe_profile(df: pd.DataFrame, name: str = None, timestamp: datetime.datetime = None)
Generate a dataset profile for a dataframe
Parameters
df : pandas.DataFrame
Dataframe to track, treated as a complete dataset.
name : str
Name of the dataset.
timestamp : datetime.datetime, float
Timestamp of the dataset. Defaults to current UTC time. Can be a datetime or UTC epoch seconds.
Returns
prof : DatasetProfile
array_profile
array_profile(x: np.ndarray, name: str = None, timestamp: datetime.datetime = None, columns: list = None)
Generate a dataset profile for an array
Parameters
x : np.ndarray
Array-like object to track. Will be treated as a full dataset.
name : str
Name of the dataset.
timestamp : datetime.datetime
Timestamp of the dataset. Defaults to current UTC time.
columns : list
Optional column labels.
Returns
prof : DatasetProfile