To accommodate the breadth of environments and pipelines in use today, we offer multiple ways to transport dataset profiles into WhyLabs. One of them is a container that starts a REST endpoint, accepts data in pandas format, runs whylogs on that data for you, and periodically uploads the resulting dataset profiles to WhyLabs. It is intended to be run on customer premises.
The container is a Java REST service managed by supervisord that accepts data in a JSON wrapper and runs whylogs-java on it for you. The container is configured mostly via environment variables and then run in Docker. The variables you can use are as follows.
And you'll start it with.
The --rm option isn't required; it just ensures the container is discarded
when you're done. If you omit it, the container's on-disk cache will
persist across container restarts.
You can view the API documentation (generated by Swagger) at
https://your-endpoint/swagger-ui. Use whatever language and network library
you prefer to send your data in the following shape.
If you're working in Python with pandas, you can get the right format for
the data field directly out of the pandas DataFrame.
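
As a minimal sketch of that conversion, the snippet below serializes a small DataFrame with pandas' "split" orientation, which yields the column names and the rows as separate lists. It assumes the data field expects this orientation; the sample DataFrame is made up for illustration, so check the Swagger documentation on your endpoint for the exact shape it accepts.

```python
import json

import pandas as pd

# Hypothetical sample data, only for illustration.
df = pd.DataFrame(
    {
        "Brand": ["Honda Civic", "Toyota Corolla"],
        "Price": [22000, 25000],
    }
)

# Assumption: the endpoint's data field takes pandas' "split" orientation.
# to_json returns a JSON string; json.loads turns it back into a dict so it
# can be embedded in a larger request body.
data = json.loads(df.to_json(orient="split"))

print(sorted(data.keys()))
```

Going through to_json rather than to_dict keeps values such as timestamps and NaN in a JSON-safe form before they are placed in the request.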
Contact us to get permissions to our Docker ECR. There are many ways to run Docker containers, so the details will depend heavily on your individual setup and requirements. We're happy to make suggestions in our Slack channel if you're looking for ideas.
The container runs supervisord as its main command so it won't ever exit on its own. You can manipulate the REST server from within the container without shutting down the container by using supervisord as follows.
The REST server uses temporary files on the container's file system. Restarting only the REST server, rather than terminating the container, keeps that ephemeral storage intact; it would be wiped out if the container were destroyed.
The persistence is accomplished through a map abstraction that eventually calls sqlite. You can directly interface with sqlite in the container to verify its contents.
The database is simple: a key column and a value column, where the values are serialized dataset profiles. The profiles are stored as base64-encoded strings of the protobuf serialization output. Everything else is human readable.
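
If you copy the database file out of the container, a short script can recover the raw protobuf bytes from those base64 strings. The sketch below is illustrative only: the table name items and the column names key and value are assumptions, not the container's documented schema, and an in-memory database stands in for the real file.

```python
import base64
import sqlite3


def read_profiles(conn, table="items"):
    """Return {key: protobuf bytes} from a key/value table.

    The table and column names are hypothetical; adjust them to whatever
    the database in your container actually uses.
    """
    rows = conn.execute(f'SELECT "key", "value" FROM "{table}"').fetchall()
    return {key: base64.b64decode(value) for key, value in rows}


# Stand-in for the container's database file, with one fake entry.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE items ("key" TEXT PRIMARY KEY, "value" TEXT)')
fake_profile = b"\x08\x01fake"  # placeholder bytes, not a real profile
conn.execute(
    "INSERT INTO items VALUES (?, ?)",
    ("demo-key", base64.b64encode(fake_profile).decode("ascii")),
)

profiles = read_profiles(conn)
for key, raw in profiles.items():
    print(key, len(raw), "bytes")
```

The decoded bytes are the protobuf serialization described above, so they still need a whylogs deserializer before they are readable as a profile.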
Sometimes it's useful to monitor network traffic if you're having trouble
connecting to WhyLabs or a request is failing for no obvious reason. You can
use ngrep to show the headers, URL, etc.