REST Container

To accommodate the breadth of environments and pipelines in use today, we offer multiple ways to get dataset profiles into WhyLabs. One of them is a container that exposes a REST endpoint: it accepts data in pandas' split-oriented JSON format, runs whylogs on that data for you, and periodically uploads the resulting dataset profiles to WhyLabs. It is intended to run on customer premises.

How it works

REST Container Sequence Diagram

The container is a Java REST service managed by supervisord that accepts data in a JSON wrapper and runs whylogs-java on it for you. The container is configured mostly via environment variables and then run in Docker. The variables you can use are as follows.

## Defaults to HOURS, supports MINUTES, HOURS, DAYS
WHYLOGS_PERIOD=HOURS
## Optionally define options for the java server inside the container.
JAVA_OPTS=-XX:+UseZGC -XX:+UnlockExperimentalVMOptions -XX:-ZUncommit -Xmx4G
## Optional alternate api endpoint for WhyLabs
WHYLABS_API_ENDPOINT=http://localhost:8080
## Required API Key that you'll get from us.
WHYLABS_API_KEY=xxxxxx
## Required key that must be present for each request, set by you.
## A precaution to make sure no one else can upload through your container.
## The key is expected as value for the header key X-API-Key.
CONTAINER_API_KEY=secret-key
## Optional additional set of strings considered to be null values.
## Do not include spaces or quotes around the strings.
NULL_STRINGS=nil,NaN,nan,null

You can then start the container with the following command.

## Assuming you're setting the above env variables with a local.env file.
docker run -it --rm -p 127.0.0.1:8080:8080 --env-file local.env --name whycontainer whycontainer

The --rm option isn't required; it just ensures the container is discarded when you're done. If you omit it, the container's on-disk cache will persist across container restarts.

You can view the API documentation (generated by Swagger) at https://your-endpoint/swagger-ui. Use whatever language and network library you prefer to send your data in the following shape.

{
  "datasetId": "demo-model",
  "tags": {
    "tag1": "value1"
  },
  "multiple": {
    "columns": ["Brand", "Price"],
    "data": [
      ["Honda Civic", 22000],
      ["Toyota Corolla", 25000],
      ["Ford Focus", 27000],
      ["Audi A4", 35000]
    ]
  }
}
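As a sketch of what a client might look like, the following Python builds a request carrying the payload above, using only the standard library. The request path (/logs) is a placeholder assumption; consult the container's Swagger UI for the actual route, and substitute your own endpoint and CONTAINER_API_KEY.

```python
import json
import urllib.request

# Placeholder values: replace with your container's address and the
# CONTAINER_API_KEY you configured. The /logs path is an assumption;
# check /swagger-ui on your container for the real route.
ENDPOINT = "http://localhost:8080/logs"
API_KEY = "secret-key"

payload = {
    "datasetId": "demo-model",
    "tags": {"tag1": "value1"},
    "multiple": {
        "columns": ["Brand", "Price"],
        "data": [
            ["Honda Civic", 22000],
            ["Toyota Corolla", 25000],
            ["Ford Focus", 27000],
            ["Audi A4", 35000],
        ],
    },
}

# The container authenticates requests via the X-API-Key header.
req = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json", "X-API-Key": API_KEY},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to actually send the request
```

The send is left commented out so the snippet can be adapted before pointing it at a live container.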

If you're working in Python with pandas, you can produce the right format for the data field directly from a DataFrame.

import pandas as pd

cars = {
    'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
    'Price': [22000, 25000, 27000, 35000],
}
df = pd.DataFrame(cars, columns=['Brand', 'Price'])
data = df.to_json(orient="split")
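One detail worth noting: to_json(orient="split") also emits an "index" key, which the container's "multiple" field doesn't show. A minimal sketch of assembling the full request body, keeping only the columns/data keys:

```python
import json

import pandas as pd

cars = {
    'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
    'Price': [22000, 25000, 27000, 35000],
}
df = pd.DataFrame(cars, columns=['Brand', 'Price'])

# orient="split" yields {"columns": [...], "index": [...], "data": [...]};
# keep only the keys the "multiple" field expects.
split = json.loads(df.to_json(orient="split"))
payload = {
    "datasetId": "demo-model",
    "tags": {"tag1": "value1"},
    "multiple": {"columns": split["columns"], "data": split["data"]},
}
body = json.dumps(payload)  # send this as the request body
```

The resulting payload matches the JSON shape shown earlier.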

Running

Contact us to get access to our Docker ECR repository. There are many ways to run Docker containers, so the details will depend on your setup and requirements. We're happy to make suggestions in our Slack channel if you're looking for ideas.

Debugging and Troubleshooting

Controlling the live REST service

The container runs supervisord as its main command, so it won't ever exit on its own. You can manage the REST server from within the container, without shutting the container down, by using supervisorctl as follows.

## Connect to the running container, assuming you used `--name whycontainer` to run it.
docker exec -it whycontainer sh
## Restart the server
./scripts/restart-server.sh
## The script is a convenience wrapper around supervisorctl. You can also run it manually:
supervisorctl -c /opt/whylogs/supervisord.conf restart app
supervisorctl -c /opt/whylogs/supervisord.conf start app
supervisorctl -c /opt/whylogs/supervisord.conf stop app

The REST server keeps temporary files on the container's filesystem, so restarting the server inside the container, rather than restarting the container itself, preserves that ephemeral storage in setups where a container restart would discard it (for example, when running with --rm).

Inspecting the persistent storage

Persistence is implemented through a map abstraction backed by SQLite. You can interface with SQLite directly in the container to verify its contents.

## Connect to the running container, assuming you used `--name whycontainer` to run it.
docker exec -it whycontainer sh
## Dump the content of the sqlite database
./scripts/query-profiles.sh
## The script just calls sqlite with the location of the db and a query
sqlite3 /tmp/profile-entries-map.sqlite 'select * from items;'

The database is simple: a key column and a value column, where each value is a serialized dataset profile stored as a base64-encoded string of its protobuf serialization. Everything else is human readable.
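If you'd rather inspect the database programmatically, here is a minimal sketch that reads the stored rows and base64-decodes the values back into raw protobuf bytes. The column names (key, value) are assumed from the description above; verify them with `.schema items` in sqlite3 if your results differ. Deserializing the protobuf itself would additionally require the whylogs message definitions, which is out of scope here.

```python
import base64
import sqlite3

# Database location inside the container, as used by query-profiles.sh.
DB_PATH = "/tmp/profile-entries-map.sqlite"

def load_profiles(db_path=DB_PATH):
    """Return {key: raw serialized profile bytes} for every stored entry.

    Column names are assumed; the values are base64-encoded strings of
    the protobuf-serialized dataset profiles.
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("select key, value from items").fetchall()
    finally:
        conn.close()
    return {key: base64.b64decode(value) for key, value in rows}
```

This can be handy for checking that profiles are accumulating as expected between uploads.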

Monitoring traffic

Sometimes it's useful to monitor network traffic, for example if you're having trouble connecting to WhyLabs or a request is failing for no obvious reason. You can use ngrep to show the headers, URLs, and payloads.

## Connect to the container
docker exec -it whycontainer sh
## Show headers and payloads line by line. You can append a BPF filter,
## e.g. `port 8080`, to restrict the capture to the REST port.
ngrep -q -W byline