Apache Airflow
To integrate with Apache Airflow, we have created a Python package for the whylogs provider. With whylogs and Airflow, users are able to generate dataset profiles which they can use to:
- Track changes in their dataset
- Create data constraints to know whether their data looks the way it should
- Quickly visualize key summary statistics about their datasets
The integration simplifies the creation of Airflow operators that use whylogs. It requires that whylogs profiles have already been generated at some point before the operator runs. See our integration overview page for profile generation options, or reach out to us for recommendations on how best to integrate.
Installation
You can install this package on top of an existing Airflow 2.0+ installation (see Requirements) by running:
To install this provider from source, run these instead:
Usage example
To start, you'll need to generate a dataset profile using whylogs. This would typically happen either during model inference or offline in some batch job. The snippet below just creates a profile and writes it locally for this example.
Next, create an Airflow operator to generate a Summary Drift Report, which will tell you how much drift took place between two profiles.
Or, run a Constraints check, which lets you fail your workflow based on customized checks against a whylogs dataset profile.
A full DAG example can be found in the whylogs_provider package directory.
Requirements
The current requirements to use this Airflow Provider are described in the table below.
PIP package | Version required |
---|---|
apache-airflow | >=2.0 |
whylogs[viz, s3] | >=1.0.10 |
Contributing
Users are always welcome to ask questions and contribute to this repository by submitting issues and communicating with us through our community Slack. Feel free to reach out and make whylogs even more awesome to use with Airflow.
Happy coding! 😄