whylogs profiles are mergeable and therefore well suited to Spark's map-reduce style of processing. Since whylogs requires only a single pass over the data, the integration is highly efficient: no shuffling is required to build whylogs profiles with Spark.
To get started, build the JAR bundle from our GitHub repository:
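As a sketch, assuming the repository uses the standard Gradle wrapper (the repository URL and build task below are assumptions and may differ from the actual setup):

```shell
# Clone the whylogs-java repository and build it with the Gradle wrapper.
# Repository URL and task name are assumptions; adjust to the actual layout.
git clone https://github.com/whylabs/whylogs-java.git
cd whylogs-java
./gradlew build
```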
The JAR bundle is written to `whylogs-java/spark-bundle/build/libs`; you'll need it for the following examples.
- Add the JAR bundle to your Spark session, via either:
  - the `--jars` parameter of your `spark-submit` script (see the Spark documentation)
  - `spark.jars` in your Spark configuration
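For example, with `spark-submit` (the JAR file name, main class, and application JAR below are illustrative placeholders, not values from the build):

```shell
# Submit a Spark application with the whylogs bundle on the classpath.
# All file names and the class name here are placeholders.
spark-submit \
  --jars whylogs-java/spark-bundle/build/libs/whylogs-spark-bundle.jar \
  --class com.example.ProfilingJob \
  target/profiling-job.jar
```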
- [Python only] Configure your Spark session:
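A minimal PySpark configuration sketch, pointing `spark.jars` at the bundle built earlier (the exact JAR file name is a placeholder):

```python
# Sketch: start a PySpark session with the whylogs bundle on the classpath.
# The JAR file name is a placeholder; use the file produced by your build.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("whylogs-profiling")
    .config(
        "spark.jars",
        "whylogs-java/spark-bundle/build/libs/whylogs-spark-bundle.jar",
    )
    .getOrCreate()
)
```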
This example shows how we use whylogs to profile a dataset by time and categorical information. The data comes from the public Fire Department Calls & Incidents dataset.
The following example shows the same workflow as above, but in Python:
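A sketch of what that Python workflow could look like. The module and method names (`new_profiling_session`, `withTimeColumn`, `groupBy`, `aggProfiles`) mirror the Scala-style API and are assumptions here, as are the column names; consult the whylogs documentation for the exact Python bindings.

```python
# Sketch only: profile a Spark DataFrame by time and a categorical column.
# API names and column names below are assumptions, not verified signatures.
from whyspark import new_profiling_session  # assumed helper module

df = spark.read.csv("fire_department_calls.csv", header=True, inferSchema=True)

profiles = (
    new_profiling_session(df, "fireDeptCalls")
    .withTimeColumn("call_date")   # one profile per time window
    .groupBy("city")               # one profile per category value
    .aggProfiles()                 # a DataFrame of serialized profiles
)
```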
You can then extract and analyze individual profiles:
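For instance, assuming the aggregation produced a DataFrame `profiles` containing one serialized whylogs profile per group (the output column name `why_profile` is an assumption about the schema):

```python
# Sketch: pull the aggregated profiles back to the driver for inspection.
# The column name "why_profile" is an assumption about the output schema.
for row in profiles.select("city", "why_profile").collect():
    print(row["city"], len(row["why_profile"]), "bytes of serialized profile")
```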