
Backfilling Data

One of the unique features of the WhyLabs platform is the ability to backfill historical data to use in monitoring. Backfilling historical data allows us to increase the accuracy of alerts by taking into account seasonality in data, decreasing both false negatives and false positives.

Imagine monitoring data in the retail world, where numbers go haywire every Black Friday. Black Friday would look like a huge yearly anomaly if compared only to a couple of trailing weeks of data. As with time series forecasting, data monitoring benefits from looking further into the past in order to establish the most accurate baseline possible.

By default, whylogs logs data with the current timestamp. Below we show several techniques for specifying timestamps in the past; backfilling data is as simple as specifying a date for your data when profiling.
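Generating the target dates for a backfill is plain standard-library work. As a minimal sketch (the helper name `backfill_dates` is illustrative, not part of whylogs), this builds timezone-aware midnight-UTC timestamps for the last seven days, which can then be assigned to profiles one per day:

```python
import datetime

def backfill_dates(days, end):
    """Return tz-aware midnight-UTC timestamps for the `days` days ending at `end`."""
    return [
        datetime.datetime.combine(
            end - datetime.timedelta(days=i),
            datetime.time.min,
            tzinfo=datetime.timezone.utc,
        )
        for i in range(days - 1, -1, -1)  # earliest date first
    ]

dates = backfill_dates(7, datetime.date(2022, 2, 7))
# dates[0] is 2022-02-01 00:00 UTC, dates[-1] is 2022-02-07 00:00 UTC
```

Using timezone-aware datetimes avoids any ambiguity about which calendar day a backfilled profile lands on.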

Backfilling (Simple)

Most users will create a dataframe or Python dictionary for older dates and log them separately, specifying the dataset_timestamp. For a full-fledged example demonstrating such a backfill performed in a loop in whylogs v0, see the example notebook.

import whylogs as why
import datetime

# log a dataframe and extract its profile
profile = why.log(df).profile()

# set the dataset timestamp for the profile; a timezone-aware datetime avoids
# ambiguity about which day the profile lands on
profile.set_dataset_timestamp(datetime.datetime(2022, 2, 7, tzinfo=datetime.timezone.utc))
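A loop-style backfill like the one in the example notebook amounts to slicing the historical data by day and profiling each slice with its own timestamp. Here is a pandas-only sketch of that per-day split; the column name `timestamp` and the toy data are illustrative, and the whylogs calls from the snippet above are shown as comments at the point where each daily slice would be logged:

```python
import pandas as pd

# toy data; in practice this is your historical dataset with a date column
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2022-02-05", "2022-02-05", "2022-02-06", "2022-02-07"]),
    "value": [1.0, 2.0, 3.0, 4.0],
})

rows_per_day = {}
for day, day_df in df.groupby(df["timestamp"].dt.date):
    # profile = why.log(day_df).profile()
    # profile.set_dataset_timestamp(
    #     datetime.datetime.combine(day, datetime.time.min, tzinfo=datetime.timezone.utc))
    rows_per_day[day] = len(day_df)  # stand-in for writing the daily profile
```

Each iteration sees only one day's rows, so each resulting profile carries exactly the date of the data it summarizes.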

Backfilling With Spark (Scala)

The most common technique for backfilling large datasets is to use WhyLabs with Apache Spark. Add a column to your Spark DataFrame indicating the date for each record, then pass that column name to whylogs using the "withTimeColumn" option. whylogs will automatically profile each day of data independently while still requiring only a single pass over the data. For more detail see batch profiling.

import org.apache.spark.sql.functions._
// implicit import for WhyLogs to enable newProfilingSession API
import com.whylogs.spark.WhyLogs._

// load the data
val raw_df = spark.read.option("header", "true").csv("/databricks-datasets/timeseries/Fires/Fire_Department_Calls_for_Service.csv")
// note lowercase "yyyy": uppercase "YYYY" is the week-based year and produces wrong dates
val df = raw_df.withColumn("call_date", to_timestamp(col("Call Date"), "MM/dd/yyyy"))

val profiles = df.newProfilingSession("profilingSession") // start a new whylogs profiling job
.withTimeColumn("call_date") // split dataset by call_date
.groupBy("City", "Priority") // tag and group the data with categorical information
.aggProfiles() // runs the aggregation. returns a dataframe of <timestamp, datasetProfile> entries

Backfilling With PySpark

With PySpark in whylogs v1, backfilling is achieved by setting the dataset_timestamp to the desired date. This can be done when creating a profile view as seen in the example below.

# be sure to first install the Spark module for whylogs:
# pip install "whylogs[spark]"

from pyspark.sql import SparkSession
from pyspark import SparkFiles
from whylogs.api.pyspark.experimental import collect_dataset_profile_view
import datetime

spark = SparkSession.builder.appName('whylogs-testing').getOrCreate()
arrow_config_key = "spark.sql.execution.arrow.pyspark.enabled"
spark.conf.set(arrow_config_key, "true")

data_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
spark.sparkContext.addFile(data_url)

spark_dataframe = spark.read.option("delimiter", ";") \
    .option("inferSchema", "true") \
    .csv(SparkFiles.get("winequality-red.csv"), header=True)

# provide a timestamp when creating the profile view
ds_timestamp = datetime.datetime(2022, 2, 7, tzinfo=datetime.timezone.utc)
dataset_profile_view = collect_dataset_profile_view(input_df=spark_dataframe, dataset_timestamp=ds_timestamp)