REST Container

In order to accommodate the breadth of environments and pipelines that exist today, we offer multiple solutions for transporting dataset profiles into WhyLabs. One of these is a container that hosts a REST endpoint: it accepts data in a pandas-compatible JSON format, runs whylogs on that data for you, and periodically uploads the resulting dataset profiles to WhyLabs. It is intended to be run on customer premises.

How it works#

[Figure: REST Container sequence diagram]

The container is a Java REST service managed by supervisord that accepts data in a JSON wrapper and runs whylogs-java on it for you. The container is configured mostly via environment variables and then run in Docker. The variables you can use are as follows.

##
## Required env variables
##
# Required API Key that you'll get from us.
WHYLABS_API_KEY=xxxxxx
# Required key that must be present for each request, set by you.
# A precaution to make sure no one else can upload through your container.
# The key is expected as value for the header key X-API-Key.
CONTAINER_API_KEY=secret-key
##
## Optional env variables
##
# Define the cadence at which profiles will be uploaded from the container to WhyLabs.
# Defaults to HOURS.
WHYLOGS_PERIOD=HOURS # MINUTES | HOURS | DAYS
# Additional set of strings considered to be null values.
# Do not include spaces or quotes around the strings.
NULL_STRINGS=nil,NaN,nan,null
# Alternate endpoint to send profiles to, rather than WhyLabs production. Mostly useful
# for debugging.
WHYLABS_API_ENDPOINT=http://localhost:8080
# How to queue incoming requests to the container. By default we use a sqlite database to
# buffer requests so that nothing is lost if the container goes down, but that comes with
# overhead that reduces throughput. If the container isn't fast enough for your needs and
# you're willing to trade reliability for speed, set this to IN_MEMORY, which is much
# faster.
REQUEST_QUEUEING_MODE=SQLITE # SQLITE | IN_MEMORY
# Define options for the java server inside the container.
JAVA_OPTS=-XX:+UseZGC -XX:+UnlockExperimentalVMOptions -XX:-ZUncommit -Xmx4G

You can then start the container with:

## Assuming you're setting the above env variables with a local.env file.
docker run -it --rm -p 127.0.0.1:8080:8080 --env-file local.env --name whycontainer whycontainer

The --rm option isn't required; it just ensures the container is discarded when you're done. If you omit it, the container's on-disk cache will persist across container restarts.

You can view the API documentation (generated by Swagger) at https://your-endpoint/swagger-ui, where your-endpoint is wherever you deploy the container.

We support two formats for data logging: one for single data points:

{
  "datasetId": "demo-model",
  "tags": {
    "tag1": "value1"
  },
  "single": {
    "Brand": "Honda Civic",
    "Price": 22000
  }
}
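Posting a single data point from Python might look like the following sketch. The `/logs` path and the localhost address are assumptions (check the Swagger UI for your deployment's actual route), and the X-API-Key value must match the CONTAINER_API_KEY you configured.

```python
import json
import urllib.request

# Payload in the documented single-datapoint format.
payload = {
    "datasetId": "demo-model",
    "tags": {"tag1": "value1"},
    "single": {"Brand": "Honda Civic", "Price": 22000},
}

# The container expects your CONTAINER_API_KEY in the X-API-Key header.
headers = {
    "X-API-Key": "secret-key",
    "Content-Type": "application/json",
}

# NOTE: the endpoint path below is hypothetical; consult /swagger-ui for the real one.
req = urllib.request.Request(
    "http://localhost:8080/logs",
    data=json.dumps(payload).encode("utf-8"),
    headers=headers,
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to actually send the request
```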

And one for bulk data:

{
  "datasetId": "demo-model",
  "tags": {
    "tag1": "value1"
  },
  "multiple": {
    "columns": ["Brand", "Price"],
    "data": [
      ["Honda Civic", 22000],
      ["Toyota Corolla", 25000],
      ["Ford Focus", 27000],
      ["Audi A4", 35000]
    ]
  }
}

If you're working in Python with pandas, you can get the right format for the data field directly out of the pandas DataFrame.

import pandas as pd

cars = {
    'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
    'Price': [22000, 25000, 27000, 35000],
}
df = pd.DataFrame(cars, columns=['Brand', 'Price'])
data = df.to_json(orient="split")
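Note that `to_json(orient="split")` also emits an `index` field, while the bulk format above only shows `columns` and `data`. A minimal sketch of assembling a full request body under that assumption, dropping the index:

```python
import json

import pandas as pd

cars = {
    'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
    'Price': [22000, 25000, 27000, 35000],
}
df = pd.DataFrame(cars)

# orient="split" yields {"columns": [...], "index": [...], "data": [...]}.
split = json.loads(df.to_json(orient="split"))
split.pop("index", None)  # the documented "multiple" format only uses columns and data

request_body = {
    "datasetId": "demo-model",
    "tags": {"tag1": "value1"},
    "multiple": split,
}
```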

Running#

Contact us to get permissions to our Docker ECR. There are many ways to run Docker containers, so the details will depend a lot on your individual setup and requirements. We're happy to make suggestions in our Slack channel if you're looking for ideas.

Debugging and Troubleshooting#

Controlling the live REST service#

The container runs supervisord as its main command so it won't ever exit on its own. You can manipulate the REST server from within the container without shutting down the container by using supervisord as follows.

## Connect to the running container, assuming you used `--name whycontainer` to run it.
docker exec -it whycontainer sh
## Restart the server
./scripts/restart-server.sh
## The script is a convenience around supervisorctl. You can manually run
supervisorctl -c /opt/whylogs/supervisord.conf restart app
supervisorctl -c /opt/whylogs/supervisord.conf start app
supervisorctl -c /opt/whylogs/supervisord.conf stop app

The REST server does make use of temporary files on its file system, so restarting the REST server without terminating the container has the advantage of preserving that ephemeral storage, which would otherwise be lost if the container were destroyed.

Inspecting the persistent storage#

Persistence is accomplished through a map abstraction that is backed by sqlite. You can interface with sqlite directly in the container to verify its contents.

## Connect to the running container, assuming you used `--name whycontainer` to run it.
docker exec -it whycontainer sh
## Dump the content of the sqlite database
./scripts/query-profiles.sh
## The script just calls sqlite with the location of the db and a query
sqlite3 /tmp/profile-entries-map.sqlite 'select * from items;'

The database is pretty simple: just a key column and a value column, where the values are serialized dataset profiles. The dataset profiles are base64-encoded strings of the protobuf serialization output. Everything else is human readable.
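The decoding step can be sketched in Python. The `items` table name comes from the query script above; the `key`/`value` column names are assumptions, and this example builds an in-memory stand-in rather than opening /tmp/profile-entries-map.sqlite inside the container.

```python
import base64
import sqlite3

# In-memory stand-in for the container's profile store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (key TEXT, value TEXT)")
conn.execute(
    "INSERT INTO items VALUES (?, ?)",
    # Fake stand-in bytes; real values are serialized dataset profiles.
    ("demo-model", base64.b64encode(b"\x08\x01").decode()),
)

for key, value in conn.execute("SELECT key, value FROM items"):
    # Values are base64-encoded protobuf-serialized profiles.
    raw = base64.b64decode(value)
    print(key, len(raw), "bytes")
```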

Monitoring traffic#

Sometimes it's useful to monitor network traffic if you're having trouble actually connecting to WhyLabs or a request is failing for a mysterious reason. You can use ngrep inside the container to show the headers, URL, etc.

## Connect to the container
docker exec -it whycontainer sh
ngrep -q -W byline