Backfilling Data

One of the unique features of the WhyLabs platform is the ability to backfill historical data for use in monitoring. Backfilling historical data increases the accuracy of alerts by taking seasonality into account, decreasing both false positives and false negatives.

Imagine monitoring data in the retail world, where numbers go haywire every Black Friday. Black Friday would appear to be a huge yearly anomaly if compared only to a couple of trailing weeks of data. Just as in time series forecasting, data monitoring benefits from looking further into the past in order to establish the most accurate baseline possible.

By default, whylogs logs data using the current timestamp, so here we'll show several techniques for specifying timestamps in the past. Backfilling data is as simple as specifying a date for your data when profiling.
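For example, with the whylogs (v0) Python session API, a profile can be backdated by passing a dataset_timestamp when opening a logger. This is a minimal sketch; the dataset name, column, and date below are placeholders:

import datetime
from whylogs import get_or_create_session

session = get_or_create_session()

# without dataset_timestamp, the profile would be stamped with the current time
with session.logger(dataset_name="my-dataset",  # placeholder name
                    dataset_timestamp=datetime.datetime(2021, 11, 26)) as logger:
    logger.log({"sales": 1299.99})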

Backfilling With Spark (Scala)

The most common technique for backfilling large datasets is to use whylogs with Apache Spark. Add a column to your Spark DataFrame indicating the date of each record, then pass that column name to whylogs with the "withTimeColumn" option. whylogs will automatically profile each day of data independently while still requiring only a single pass over the data. For more detail, see batch profiling.

import org.apache.spark.sql.functions._
// implicit import for WhyLogs to enable the newProfilingSession API
import com.whylogs.spark.WhyLogs._

// load the data
val raw_df = spark.read.option("header", "true").csv("/databricks-datasets/timeseries/Fires/Fire_Department_Calls_for_Service.csv")

// parse the date column; note "yyyy" (calendar year), not "YYYY" (week year)
val df = raw_df.withColumn("call_date", to_timestamp(col("Call Date"), "MM/dd/yyyy"))

val profiles = df.newProfilingSession("profilingSession") // start a new WhyLogs profiling job
  .withTimeColumn("call_date") // split the dataset by call_date
  .groupBy("City", "Priority") // tag and group the data with categorical information
  .aggProfiles() // run the aggregation; returns a dataframe of <timestamp, datasetProfile> entries

Backfilling With Spark (Python)

Much as in the Scala example, PySpark allows you to specify a date column using the "withTimeColumn" option. whylogs will automatically profile each day of data independently while still requiring only a single pass over the data. For more detail, see batch profiling.

# this integration is currently in private beta. Please reach out to [email protected] to get access
import sys

import pandas as pd
import pyspark

whylogs_jar = "/path/to/whylogs/bundle.jar"

spark = (
    pyspark.sql.SparkSession.builder
    .appName("whylogs")
    .config("spark.pyspark.driver.python", sys.executable)
    .config("spark.pyspark.python", sys.executable)
    .config("spark.executor.userClassPathFirst", "true")
    .config("spark.submit.pyFiles", whylogs_jar)
    .config("spark.jars", whylogs_jar)
    .getOrCreate()
)

# whyspark comes from the whylogs bundle jar
import whyspark

pdf = pd.read_csv("demo.csv")
df = spark.createDataFrame(pdf)

session = whyspark.new_profiling_session(df, "my-dataset-name").withTimeColumn("date")
profile_df = session.aggProfiles().cache()
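
Because aggProfiles() returns an ordinary Spark DataFrame, the backfilled profiles can be persisted with standard Spark facilities; the output path below is a placeholder:

# write the <timestamp, datasetProfile> dataframe out like any Spark dataframe
profile_df.write.mode("overwrite").parquet("/tmp/whylogs-profiles.parquet")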

Backfilling With Pandas

Most pandas users will create a dataframe for older dates and log it separately, specifying the dataset_timestamp. For a full-fledged example demonstrating such a backfill performed in a loop, check out the example notebook:

profile = session.new_profile(dataset_name="lendingClub",
                              dataset_timestamp=datetime.datetime(2017, 1, 1, 0, 0))
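
As a rough sketch of what such a loop can look like with the whylogs (v0) session API (the date range, per-day file layout, and data loading below are hypothetical placeholders):

import datetime

import pandas as pd
from whylogs import get_or_create_session

session = get_or_create_session()

# log one backdated profile per historical day
start = datetime.datetime(2017, 1, 1)
for day in range(30):
    ts = start + datetime.timedelta(days=day)
    # hypothetical per-day files; substitute your own data loading
    df = pd.read_csv(f"daily_data/{ts:%Y-%m-%d}.csv")
    with session.logger(dataset_name="lendingClub", dataset_timestamp=ts) as logger:
        logger.log_dataframe(df)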