Thanks to whylogs’ existing integration with Apache Spark, getting whylogs running on Databricks is simple.
First, install the spark, whylabs, and viz modules from whylogs on the desired Spark cluster within Databricks:
- The spark module enables users to profile Spark DataFrames with whylogs.
- The whylabs module enables users to upload these profiles to the WhyLabs AI Observatory.
- The viz module allows users to visualize one or more profiles directly in a Databricks notebook.
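In a Databricks notebook, all three modules can be installed in one cell via pip extras (the extra names below assume whylogs v1 packaging):

```shell
# Install whylogs with the spark, whylabs, and viz extras on the cluster
%pip install "whylogs[spark]" "whylogs[whylabs]" "whylogs[viz]"
```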
With the modules installed, enable Apache Arrow, which speeds up data transfer between Spark and pandas.
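In a Databricks notebook the active SparkSession is already available as `spark`, so Arrow can be switched on with a single configuration setting:

```python
# Enable Arrow-based columnar data transfer between Spark and pandas.
# This speeds up the toPandas() conversions used later when inspecting profiles.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
```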
Next, read your data into a Spark DataFrame. This syntax will be different depending on how your data is stored.
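For example, reading a CSV file might look like the following; the path here is a placeholder, so adjust the format and location to wherever your data lives:

```python
# Read a CSV file into a Spark DataFrame (path is a placeholder).
# Use spark.read.parquet(...), spark.table(...), etc. for other storage formats.
df = spark.read.option("header", True).option("inferSchema", True).csv(
    "/databricks-datasets/my_data.csv"
)
```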
Now, we profile the data and optionally view the result as a Pandas DataFrame.
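A sketch of this step, using the PySpark API shipped with whylogs v1 (the module path reflects that this API is still marked experimental and may move in later releases):

```python
from whylogs.api.pyspark.experimental import collect_dataset_profile_view

# Profile the Spark DataFrame with whylogs; the work is distributed
# across the cluster and a single merged profile view is returned.
profile_view = collect_dataset_profile_view(input_df=df)

# Optionally inspect the profile's summary metrics as a pandas DataFrame
profile_view.to_pandas()
```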
From here, users may wish to build visualizations of their profile directly in the Databricks notebook, as demonstrated in this example notebook.
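A minimal sketch using the viz module's notebook visualizer (assuming the `profile_view` produced above):

```python
from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
# A reference_profile_view can also be set here to compare two profiles
visualization.set_profiles(target_profile_view=profile_view)

# Render an interactive summary of the profile inline in the notebook
visualization.profile_summary()
```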
Users can upload this profile to WhyLabs using the following:
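A sketch of the upload step with the whylabs module's writer; the org ID, API key, and dataset ID below are placeholders to be replaced with your own WhyLabs credentials:

```python
import os

from whylogs.api.writer.whylabs import WhyLabsWriter

# Placeholder credentials -- substitute your own org, API key, and dataset ID
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "org-XXXX"
os.environ["WHYLABS_API_KEY"] = "your-api-key"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-1"

# Upload the profile view produced earlier to the WhyLabs AI Observatory
writer = WhyLabsWriter()
writer.write(profile_view)
```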
For more on uploading profiles to WhyLabs, visit the Onboarding to the Platform page.
The above assumes whylogs >= 1.0 and a Spark cluster running PySpark >= 3.0.
Users of PySpark 2.x will need to use whylogs v0 and load a JAR file specific to their PySpark and Scala versions. Please submit a support request for the appropriate JAR file if your Spark cluster is running PySpark 2.x.