The whylogs container is a good integration option for anyone who doesn't want to manually include the whylogs library in their data pipeline. Rather than adding whylogs code to an existing application, you send POST requests with data to this container; that data is converted into whylogs profiles and periodically uploaded to WhyLabs or S3. You can host the container on whichever container platform you prefer, and it can be configured to run in several different modes, covered below.
You can more or less think of the container as a dictionary mapping timestamps to whylogs profiles. As data is uploaded, its timestamp is used to merge it into the existing profile for that timestamp if one exists; otherwise, a new profile is created. Periodically, all of the profiles stored in the container are uploaded one by one according to the container's configuration, and the local copies are erased.
Like whylogs, the container is open source and we welcome contributions and feedback.
The container is configured through environment variables. See the environment variable documentation for a list of all of the variables, their meanings, and their default values.
When you configure the container you're primarily picking two things: a method for getting data into the container to be converted into profiles, and a destination for those profiles to be uploaded to. These options are independent, so you can pair whichever input method you prefer with any upload method.
For ingesting data, the container has a REST interface and a Kafka consumer-based interface. For uploading profiles, the container can target WhyLabs, S3, or the local file system.
The default data ingestion method is the REST interface. Below is a minimal configuration that takes data from REST calls and uploads the resulting profiles to WhyLabs.
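A sketch of that configuration in env-file form. The variable names here are assumptions for illustration; the environment variable documentation is the authoritative reference.

```
UPLOAD_DESTINATION=WHYLABS       # assumed variable name
WHYLABS_API_KEY=<your-api-key>   # access token from your WhyLabs account settings
WHYLABS_ORG_ID=<your-org-id>     # assumed variable name
```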
The REST API of the container can be viewed as a swagger page on the container itself, hosted at http://<container>:<port>/swagger-ui. You can also view the API docs from the most recent build here.
The data format of the REST interface was made with pandas in mind. If you're using pandas, the easiest way to produce the payload for the log API is as follows.
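A sketch of this in Python. The endpoint path, port, and payload field names here are assumptions; the swagger page on your container is the authoritative reference for the request shape.

```python
import pandas as pd
import requests

df = pd.DataFrame({"feature_a": [1.0, 2.0, 3.0], "feature_b": ["x", "y", "z"]})

# pandas' "split" orientation produces a columns/data layout; the
# assumption here is that the log endpoint accepts it directly.
payload = {
    "datasetId": "model-1",      # illustrative dataset id
    "timestamp": 1690000000000,  # epoch millis the data belongs to
    "multiple": df.to_dict(orient="split"),
}

response = requests.post("http://localhost:8080/logs", json=payload)
response.raise_for_status()
```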
The container can also run as a Kafka consumer. The configuration below will consume data from a Kafka cluster located at http://localhost:9092, from the topic my-topic (which we'll say has been configured with 4 partitions), using 4 Kafka consumers with dedicated threads, and upload the resulting profiles to WhyLabs.
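A sketch of that configuration in env-file form. KAFKA_CONSUMER_THREADS is the documented option; the other variable names are assumptions for illustration, so check them against the environment variable documentation.

```
KAFKA_ENABLED=true                            # assumed variable name
KAFKA_BOOTSTRAP_SERVERS=http://localhost:9092 # assumed variable name
KAFKA_TOPICS=my-topic                         # assumed variable name
KAFKA_CONSUMER_THREADS=4                      # one consumer per topic partition
UPLOAD_DESTINATION=WHYLABS                    # assumed variable name
WHYLABS_API_KEY=<your-api-key>
```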
The REST interface is still active when Kafka is enabled. The KAFKA_CONSUMER_THREADS option controls how many consumer instances are started. Each consumer is given a dedicated thread to run on, so one container can host multiple consumers at once. The thread count should generally match the partition count of your Kafka topic, but feel free to reach out for advice while you're configuring.
The container assumes JSON format for the data in the topic. Nested values are flattened into keys like a.b.c by default, but this behavior can be configured via an environment variable.
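For example, a message shaped like this (an illustrative record, not a required schema):

```json
{"a": {"b": {"c": 1}}, "d": 2}
```

would be profiled as two columns, a.b.c and d.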
The REST configuration above also highlighted sending profiles to WhyLabs. This results in daily uploads of each day's profiles to your WhyLabs account. You can get an access token from the token management page in your account settings.
Below is a minimal configuration for uploading profiles to S3.
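A sketch of that configuration in env-file form. The container-specific variable names are assumptions for illustration; the AWS credential variables are the standard ones the AWS SDK checks.

```
UPLOAD_DESTINATION=S3                 # assumed variable name
S3_BUCKET=<your-bucket>               # assumed variable name
# Standard AWS SDK credential variables:
AWS_ACCESS_KEY_ID=<key-id>
AWS_SECRET_ACCESS_KEY=<secret>
AWS_REGION=<region>
```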
The container uses the AWS Java SDK, so authentication happens through the standard environment variables that the SDK checks. Once configured, profiles are uploaded to the specified bucket in the following format.
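Purely as an illustration (this is a placeholder, not the guaranteed layout; the container docs define the exact key format), an uploaded profile might land at a key like:

```
s3://<bucket>/<prefix>/<dataset-id>-<dataset-timestamp>.bin
```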
Below is a minimal configuration for writing files to disk.
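A sketch of that configuration in env-file form; both variable names here are assumptions for illustration.

```
UPLOAD_DESTINATION=FILE_SYSTEM            # assumed variable name
FILE_SYSTEM_WRITER_ROOT=/data/profiles    # assumed variable name and path
```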
This was developed mostly as a debugging tool, but it can come in handy if external storage is mounted to the right location in the container. The profiles are written to disk in the following format.
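Purely as an illustration (a placeholder, not the guaranteed layout), a profile might be written to a path like:

```
<configured-root>/<dataset-id>-<dataset-timestamp>.bin
```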
Let us know if you have a use case around this and we'll work to smooth out some of the rough edges.