Apache Airflow
To integrate with Apache Airflow, we have created a Python package for the whylogs provider. With whylogs and Airflow, users can generate dataset profiles, which they can use to:
- Track changes in their dataset
- Create data constraints to know whether their data looks the way it should
- Quickly visualize key summary statistics about their datasets
The integration simplifies the creation of Airflow operators that use whylogs. It assumes that whylogs profiles have been generated at some point before the operators run. See our integration overview page for profile generation options, or reach out to us for recommendations on how best to integrate.
Installation
You can install this package on top of an existing Airflow 2.0+ installation (see Requirements below) by running:
pip install airflow-provider-whylogs
To install this provider from source instead, run:
git clone [email protected]:whylabs/airflow-provider-whylogs.git
cd airflow-provider-whylogs
python3 -m venv .env && source .env/bin/activate
pip3 install -e .
Usage example
To start, you'll need to generate a dataset profile using whylogs. This would typically happen either during model inference or offline in some batch job. The snippet below just creates a profile and writes it locally for this example.
import pandas as pd
import whylogs as why

# Profile the dataset and write the resulting profile to the local filesystem
df = pd.read_csv("some_file.csv")
results = why.log(df)
results.writer("local").write()
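In an Airflow pipeline, one way (among others) to run this profiling step is as its own task, for example with a PythonOperator. The sketch below is only illustrative: the input file and the commented-out task wiring are placeholders, not part of the provider.
import pandas as pd
import whylogs as why
from airflow.operators.python import PythonOperator

def profile_dataset():
    # Placeholder input; profile the data and write the profile locally so
    # that downstream whylogs operators can read it with the "local" reader
    df = pd.read_csv("some_file.csv")
    why.log(df).writer("local").write()

# Inside a DAG definition:
# profile_task = PythonOperator(task_id="profile_dataset", python_callable=profile_dataset)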
Next, create an Airflow operator to generate a Summary Drift Report, which will tell you how much drift took place between two profiles.
from whylogs_provider.operators.whylogs import WhylogsSummaryDriftOperator
summary_drift = WhylogsSummaryDriftOperator(
    task_id="drift_report",
    target_profile_path="data/profile.bin",
    reference_profile_path="data/profile.bin",
    reader="local",
    write_report_path="data/Profile.html",
)
Or, run a Constraints check, which lets you fail your workflow based on customized checks against a whylogs dataset profile.
from whylogs_provider.operators.whylogs import WhylogsConstraintsOperator
from whylogs.core.constraints.factories import greater_than_number
constraints = WhylogsConstraintsOperator(
    task_id="constraints_check",
    profile_path="data/profile.bin",
    reader="local",
    constraint=greater_than_number(column_name="my_column", number=0.0),
)
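These operators are regular Airflow operators, so they are declared inside a DAG like any other task. The sketch below is a minimal, hypothetical wiring: the dag_id, schedule, start date, and task ordering are placeholders, and it assumes an earlier step already wrote data/profile.bin.
import pendulum
from airflow import DAG
from whylogs_provider.operators.whylogs import (
    WhylogsConstraintsOperator,
    WhylogsSummaryDriftOperator,
)
from whylogs.core.constraints.factories import greater_than_number

with DAG(
    dag_id="whylogs_example",  # placeholder name
    schedule_interval=None,
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    summary_drift = WhylogsSummaryDriftOperator(
        task_id="drift_report",
        target_profile_path="data/profile.bin",
        reference_profile_path="data/profile.bin",
        reader="local",
        write_report_path="data/Profile.html",
    )
    constraints = WhylogsConstraintsOperator(
        task_id="constraints_check",
        profile_path="data/profile.bin",
        reader="local",
        constraint=greater_than_number(column_name="my_column", number=0.0),
    )
    # Illustrative ordering only; the two checks could also run in parallel
    summary_drift >> constraints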
Macros
The path variables are all templated using Airflow's Jinja-based macros, which means you can use macros to set the paths to your profiles and reports dynamically. For example, you can use the {{ ds }} macro to build a profile path that includes the execution date of your DAG run, as sketched below.
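As a sketch (the per-date directory layout here is hypothetical), the constraints operator from the previous section could point at a profile written for each execution date:
from whylogs_provider.operators.whylogs import WhylogsConstraintsOperator
from whylogs.core.constraints.factories import greater_than_number

# {{ ds }} is rendered by Airflow to the run's logical date before execution,
# e.g. data/2023-01-01/profile.bin
constraints = WhylogsConstraintsOperator(
    task_id="constraints_check",
    profile_path="data/{{ ds }}/profile.bin",
    reader="local",
    constraint=greater_than_number(column_name="my_column", number=0.0),
)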
A full DAG example can be found in the whylogs_provider package directory.
Requirements
The current requirements for using this Airflow provider are described in the table below.
| PIP package | Version required |
|---|---|
| apache-airflow | >=2.0 |
| whylogs[viz, s3] | >=1.0.10 |
Contributing
Users are always welcome to ask questions and contribute to this repository by submitting issues and reaching out to us on our community Slack. Feel free to get in touch and help make whylogs even more awesome to use with Airflow.
Happy coding! 😄