Apache Airflow

To integrate whylogs with Apache Airflow, we have created a Python package, the whylogs Airflow provider. With whylogs and Airflow, users can generate dataset profiles, which they can use to:

  • Track changes in their dataset
  • Create data constraints to know whether their data looks the way it should
  • Quickly visualize key summary statistics about their datasets

The integration simplifies the creation of Airflow operators that use whylogs. It assumes that whylogs profiles have already been generated at some point before the operator runs. See our integration overview page for profile generation options, or reach out to us for recommendations on how best to integrate.

Installation

You can install this package on top of an existing Airflow 2.0+ installation (see Requirements) by running:

pip install airflow-provider-whylogs

To install this provider from source, run the following commands instead:

git clone git@github.com:whylabs/airflow-provider-whylogs.git
cd airflow-provider-whylogs
python3 -m venv .env && source .env/bin/activate
pip3 install -e .

Usage example

To start, you'll need to generate a dataset profile using whylogs. This would typically happen either during model inference or offline in some batch job. The snippet below just creates a profile and writes it locally for this example.

import pandas as pd
import whylogs as why

df = pd.read_csv("some_file.csv")
results = why.log(df)  # profile the DataFrame
results.writer("local").write()  # write the profile to local disk

Next, create an Airflow operator to generate a Summary Drift Report, which will tell you how much drift took place between two profiles.

from whylogs_provider.operators.whylogs import WhylogsSummaryDriftOperator

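summary_drift = WhylogsSummaryDriftOperator(
    task_id="drift_report",
    # Both paths point to the same file to keep this example minimal.
    # In practice, use a separate baseline profile as the reference.
    target_profile_path="data/profile.bin",
    reference_profile_path="data/profile.bin",
    reader="local",
    write_report_path="data/Profile.html",
)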

Or, run a Constraints check, which lets you fail your workflow based on customized checks against a whylogs dataset profile.

from whylogs_provider.operators.whylogs import WhylogsConstraintsOperator
from whylogs.core.constraints.factories import greater_than_number

constraints = WhylogsConstraintsOperator(
    task_id="constraints_check",
    profile_path="data/profile.bin",
    reader="local",
    constraint=greater_than_number(column_name="my_column", number=0.0),
)
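
Conceptually, a "greater than" check like this passes only when every observed value in the column exceeds the threshold, and a failed check raises an error so that the task, and with it the workflow, fails. The sketch below illustrates that idea using only the standard library. The names `check_greater_than` and `ConstraintFailed` are hypothetical, and the function works on raw values rather than on a whylogs profile, so it should not be read as the provider's implementation.

```python
from typing import Iterable


class ConstraintFailed(Exception):
    """Hypothetical error type; raising inside a task is what marks it as failed."""


def check_greater_than(values: Iterable[float], number: float) -> None:
    """Raise ConstraintFailed unless every value is strictly greater than `number`."""
    values = list(values)
    if not values:
        # Assumption for this sketch: an empty column counts as a failed check.
        raise ConstraintFailed("no values to check")
    smallest = min(values)
    # If the minimum is above the threshold, every value is.
    if smallest <= number:
        raise ConstraintFailed(f"minimum {smallest} is not greater than {number}")


# Passes silently: all values are above 0.0.
check_greater_than([0.5, 1.2, 3.0], number=0.0)
```

Comparing only the minimum against the threshold is enough here, which is also why a check of this kind can run on summary statistics instead of the full dataset.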

Macros

The path variables are all templated using Airflow's Jinja-based macros, so you can set the paths to your profiles and reports dynamically. For example, you can use the {{ ds }} macro to include the execution date of your DAG run (formatted as YYYY-MM-DD) in the path to your profile.
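
As a rough illustration of what a templated path turns into, the sketch below substitutes `{{ ds }}` by hand for a fixed date. The directory layout and the `render_ds` helper are assumptions made for this example. Airflow performs the real rendering with Jinja when the task runs, so this only mimics the result.

```python
from datetime import date

# Path arguments as they might be passed to WhylogsSummaryDriftOperator.
# The directory layout is a hypothetical example.
templated_paths = {
    "target_profile_path": "data/{{ ds }}/profile.bin",
    "reference_profile_path": "data/reference/profile.bin",
    "write_report_path": "data/{{ ds }}/drift_report.html",
}


def render_ds(template: str, run_date: date) -> str:
    """Mimic how `{{ ds }}` is rendered: the run's date as YYYY-MM-DD.

    Illustration only; Airflow itself renders templates with Jinja at run time.
    """
    return template.replace("{{ ds }}", run_date.isoformat())


rendered = {
    name: render_ds(path, date(2023, 1, 15))
    for name, path in templated_paths.items()
}
for name, path in rendered.items():
    print(f"{name}: {path}")
```

In a real DAG you would pass the templated strings directly to the operator and leave the rendering to Airflow.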

A full DAG example can be found in the whylogs_provider package directory.

Requirements

The current requirements for using this Airflow provider are listed in the table below.

PIP package         Version required
apache-airflow      >=2.0
whylogs[viz, s3]    >=1.0.10

Contributing

Users are always welcome to ask questions and contribute to this repository by submitting issues or talking with us in our community Slack. Feel free to reach out and help make whylogs even more awesome to use with Airflow.

Happy coding! 😄
