One of the unique features of the WhyLabs platform is the ability to backfill historical data to use in monitoring. Backfilling historical data allows us to increase the accuracy of alerts by taking into account seasonality in data, decreasing both false negatives and false positives.
Imagine monitoring data in the retail world, where numbers go haywire every Black Friday. Black Friday would appear to be a huge yearly anomaly if it were compared only to a couple of trailing weeks of data. Much like time series forecasting, data monitoring benefits from looking further into the past to establish the most accurate baseline possible.
By default, whylogs logs data using the current timestamp. Backfilling is as simple as specifying a date for your data when profiling, and the sections here show several techniques for specifying timestamps in the past.
The most common technique for backfilling large datasets is to use WhyLabs with Apache Spark. Add a column to your Spark DataFrame indicating the date for each record, then pass that column name to WhyLabs using the "withTimeColumn" option. WhyLabs will automatically profile each day of data independently while still requiring only a single pass over the data. For more detail, see batch profiling.
As with the Scala example, PySpark allows you to specify a date column using the "withTimeColumn" option. WhyLabs will automatically profile each day of data independently while still requiring only a single pass over the data. For more detail, see batch profiling.
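To make the "one pass, one profile per day" idea concrete, here is a minimal plain-Python sketch. It does not use Spark or the WhyLabs API: `profile_by_day`, the `amount` field, and the count/total "profile" are all hypothetical stand-ins chosen for illustration. It only shows how a time column lets each record be routed to its own day's accumulator while the data is read once.

```python
from collections import defaultdict
from datetime import datetime, timezone


def profile_by_day(records, time_column, value_column):
    """Read the records once, keeping a separate accumulator per calendar day.

    The count/total pair is a toy stand-in for a real statistical profile.
    """
    profiles = defaultdict(lambda: {"count": 0, "total": 0.0})
    for record in records:
        # The time column decides which day's profile this record updates.
        day = record[time_column].date()
        profiles[day]["count"] += 1
        profiles[day]["total"] += record[value_column]
    return dict(profiles)


# Hypothetical sample data: two ordinary records and two Black Friday records,
# deliberately out of order to show that no pre-sorting is needed.
utc = timezone.utc
records = [
    {"date": datetime(2023, 11, 23, 9, 0, tzinfo=utc), "amount": 20.0},
    {"date": datetime(2023, 11, 24, 10, 0, tzinfo=utc), "amount": 250.0},
    {"date": datetime(2023, 11, 23, 15, 0, tzinfo=utc), "amount": 30.0},
    {"date": datetime(2023, 11, 24, 18, 0, tzinfo=utc), "amount": 310.0},
]

daily_profiles = profile_by_day(records, "date", "amount")
for day in sorted(daily_profiles):
    print(day, daily_profiles[day])
```

Because each record touches only the accumulator for its own day, the input does not need to be sorted or split by date beforehand.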
Most Pandas users will create a dataframe for each older date and log them separately, specifying the dataset_timestamp for each. For a full-fledged example demonstrating such a backfill performed in a loop, check out the example notebook.
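The shape of such a backfill loop can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: `backfill` and `record_call` are hypothetical names, the batches are plain lists rather than Pandas dataframes, and the logging call is passed in as a callback so that the sketch runs without any particular whylogs version installed. In real use, the callback is where each batch would be profiled with its dataset_timestamp.

```python
from datetime import datetime, timedelta, timezone


def backfill(batches_by_day, log_fn):
    """Log each historical batch with its own dataset timestamp, oldest first."""
    for day in sorted(batches_by_day):
        # Timezone-aware midnight UTC for the day the batch belongs to.
        dataset_timestamp = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
        log_fn(batches_by_day[day], dataset_timestamp)


# Hypothetical stand-in for the real logging call: it only records what it was
# given. Replace it with a function that profiles the batch and attaches
# dataset_timestamp using the API of your installed whylogs version.
logged = []


def record_call(batch, dataset_timestamp):
    logged.append((dataset_timestamp, len(batch)))


# Fabricated sample batches for the three days before a fixed "today".
today = datetime(2024, 1, 8, tzinfo=timezone.utc).date()
batches = {
    today - timedelta(days=i): [{"amount": 10.0 * i}] * (i + 1)
    for i in range(1, 4)
}

backfill(batches, record_call)
for dataset_timestamp, row_count in logged:
    print(dataset_timestamp.isoformat(), row_count)
```

Keeping the logging call behind a callback makes the date handling easy to check on its own before pointing the loop at real data.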