
Apache Airflow

To integrate with Apache Airflow, we have created a Python package for the whylogs provider. With whylogs and Airflow, users are able to generate dataset profiles which they can use to:

  • Track changes in their dataset
  • Create data constraints to know whether their data looks the way it should
  • Quickly visualize key summary statistics about their datasets

The integration simplifies the creation of Airflow operators that use whylogs. It relies on whylogs profiles having been generated at some point before the operator runs. See our integration overview page for profile generation options, or reach out to us for recommendations on how best to integrate.

Installation

You can install this package on top of an existing Airflow 2.0+ installation (see Requirements) by running:

pip install airflow-provider-whylogs

To install this provider from source, run the following instead:

git clone git@github.com:whylabs/airflow-provider-whylogs.git
cd airflow-provider-whylogs
python3 -m venv .env && source .env/bin/activate
pip3 install -e .

Usage example

To start, you'll need to generate a dataset profile using whylogs. This would typically happen either during model inference or offline in some batch job. The snippet below just creates a profile and writes it locally for this example.

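import pandas as pd
import whylogs as why

df = pd.read_csv("some_file.csv")  # load the data to profile
results = why.log(df)  # build a whylogs profile of the DataFrame
results.writer("local").write()  # write the profile to local disk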

Next, create an Airflow operator to generate a Summary Drift Report, which will tell you how much drift took place between two profiles.

from whylogs_provider.operators.whylogs import WhylogsSummaryDriftOperator

summary_drift = WhylogsSummaryDriftOperator(
    task_id="drift_report",
    target_profile_path="data/profile.bin",
    reference_profile_path="data/profile.bin",
    reader="local",
    write_report_path="data/Profile.html",
)

Or, run a Constraints check, which lets you fail your workflow based on customized checks against a whylogs dataset profile.

from whylogs.core.constraints.factories import greater_than_number
from whylogs_provider.operators.whylogs import WhylogsConstraintsOperator

constraints = WhylogsConstraintsOperator(
    task_id="constraints_check",
    profile_path="data/profile.bin",
    reader="local",
    constraint=greater_than_number(column_name="my_column", number=0.0),
)
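
To make the "fail your workflow" behaviour concrete, here is a plain-Python sketch that uses neither whylogs nor Airflow. It assumes a "greater than number" check amounts to comparing the smallest observed value against the threshold, and it signals failure by raising, since an uncaught exception is what marks an Airflow task as failed. The names `ConstraintFailed` and `check_greater_than_number` are hypothetical and not part of either library. The real operator works on a stored profile rather than raw values, and its internals are not shown here.

```python
from typing import Iterable


class ConstraintFailed(Exception):
    """Raised when a check does not hold (hypothetical, for illustration)."""


def check_greater_than_number(values: Iterable[float], number: float) -> None:
    """Pass silently if every value is greater than `number`, else raise."""
    minimum = min(values)
    if not minimum > number:
        raise ConstraintFailed(f"minimum {minimum} is not greater than {number}")


# All values are above 0.0, so the check passes and nothing happens.
check_greater_than_number([0.5, 1.2, 3.0], number=0.0)

# One value is below 0.0, so the check raises; inside an Airflow task,
# leaving this exception uncaught would fail the task.
try:
    check_greater_than_number([0.5, -1.0, 3.0], number=0.0)
except ConstraintFailed as err:
    print(f"check failed: {err}")
```

Downstream tasks that depend on a failed check would then not run under Airflow's default trigger rule, which is what makes a constraints task useful as a gate.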

A full DAG example can be found in the whylogs_provider package directory.

Requirements

The current requirements for using this Airflow provider are listed in the table below.

PIP package         Version required
apache-airflow      >=2.0
whylogs[viz, s3]    >=1.0.10
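
If you want to confirm that an environment meets these minimums before installing the provider, a small standard-library script along the following lines can help. This is only a sketch: the helper names (`parse_version`, `meets_minimum`, `check_requirements`) are made up for this example, and the version parsing is deliberately simple, comparing only the leading numeric components.

```python
from importlib import metadata

# Minimum versions taken from the requirements table above.
MINIMUM_VERSIONS = {
    "apache-airflow": "2.0",
    "whylogs": "1.0.10",
}


def parse_version(version: str) -> tuple:
    """Turn '2.3.1' into (2, 3, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True when the installed version is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '2' and '2.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(minimums: dict = MINIMUM_VERSIONS) -> dict:
    """Map each package to True/False, or None when it is not installed."""
    report = {}
    for package, minimum in minimums.items():
        try:
            report[package] = meets_minimum(metadata.version(package), minimum)
        except metadata.PackageNotFoundError:
            report[package] = None
    return report


if __name__ == "__main__":
    labels = {True: "OK", False: "too old", None: "not installed"}
    for package, ok in check_requirements().items():
        print(f"{package}: {labels[ok]}")
```

For strict handling of pre-releases and other PEP 440 details, the third-party `packaging` library is a more robust choice than this hand-rolled comparison.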

Contributing

Users are always welcome to ask questions and contribute to this repository by submitting issues or reaching out through our community Slack. Feel free to get in touch and help make whylogs even more awesome to use with Airflow.

Happy coding! 😄
