Snowflake
The Snowflake integration works by running our Python UDFs in your environment. There isn't an automatic setup or a native Snowflake application yet. After the whylogs and WhyLabs UDFs have been set up, you'll profile and upload data through SQL queries that call the UDFs.
Setup
Make sure snowsql is set up with a default schema, database, and warehouse so you can execute SQL statements directly via stdin if you're following along with the examples. If your snowsql is already configured appropriately, you can skip this step; it's only here for the demonstration.
create database whylogs_integration_db;
create warehouse whylogs_integration_warehouse;
# So you can pipe things to snowsql on the command line
[connections]
# ....
warehousename = whylogs_integration_warehouse
dbname = whylogs_integration_db
schemaname = public
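With those defaults in place, you can sanity-check the connection by piping a statement to snowsql, the same way the later examples do:
echo "select current_warehouse(), current_database(), current_schema();" | snowsql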
Allow Conda Packages
Follow the Snowflake instructions for enabling the use of third-party packages. You'll be using whylogs as a Python UDF from the Snowflake Anaconda channel, which is disabled by default until you complete their form.
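Once third-party packages are enabled, you can double-check that whylogs is available to your account by querying Snowflake's package listing. This is just a sanity check, not part of the setup scripts:
-- List the whylogs versions available from the Snowflake Anaconda channel
SELECT package_name, version
FROM information_schema.packages
WHERE language = 'python'
  AND package_name = 'whylogs';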
Create Secrets (Configuration)
Secrets are used to supply the WhyLabs API key and org ID, both of which can be generated in your WhyLabs account settings. This is the only part that you have to edit manually since it's specific to your account; the other instructions work for any account.
⚠️ The whylogs version in Snowflake doesn't understand our new API key format yet. If you generate an API key that ends in :org-12345, you'll have to remove that suffix until we get a newer whylogs version into Snowflake.
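If your key does have the suffix, one way to trim it before pasting it into the secret below (the key shown here is a placeholder):
echo 'xxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxx:org-12345' | cut -d ':' -f 1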
CREATE OR REPLACE SECRET whylabs_api_key
TYPE = GENERIC_STRING
SECRET_STRING = 'xxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx';
CREATE OR REPLACE SECRET whylabs_org_id
TYPE = GENERIC_STRING
SECRET_STRING = 'org-xxxxx';
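You can confirm that both secrets exist (Snowflake never displays the secret values themselves):
SHOW SECRETS LIKE 'whylabs%';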
Quick Install
SQL Script
Each version of the Snowflake integration is published alongside a SQL setup script that creates all of the resources you need in your Snowflake account. You can download this script and execute it with snowsql:
curl https://whylabs-snowflake-udfs.s3.us-west-2.amazonaws.com/udfs/v1/latest/setup.sql | snowsql
Manual Install
The manual setup uses the Snowflake SQL files in our GitHub repo. You can clone it now or copy and paste the commands as you go.
git clone https://github.com/whylabs/whylabs-snowflake-integration.git
cd ./whylabs-snowflake-integration/sql
Create Network Rules
You need two network rules. One allows the WhyLabs upload UDF to connect to WhyLabs. The other doesn't actually provide any network access, but Snowflake requires it in order to expose the secrets that the whylogs UDF needs for configuration.
snowsql -f ./networking.sql
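For reference, networking.sql boils down to CREATE NETWORK RULE statements along the lines of the sketch below. The rule name and host here are assumptions, so check the script for the exact values used by your release.
-- Sketch only: an egress rule that lets the upload UDF reach the WhyLabs API.
-- The rule name and host are illustrative; networking.sql has the real values,
-- plus the second, no-access rule used only to expose the secrets.
CREATE OR REPLACE NETWORK RULE whylabs_api_network_rule
  MODE = EGRESS
  TYPE = HOST_PORT
  VALUE_LIST = ('api.whylabsapp.com');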
Create External Integration
The external access integrations tell Snowflake which network rules and secrets each of the UDFs has access to.
snowsql -f ./integrations.sql
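integrations.sql does roughly the following. The integration and network rule names are illustrative, while the secret names match the ones created above; defer to the script for the exact definitions.
-- Sketch only: bundle the network rule and secrets into an integration
-- that the UDFs can be granted access to. Names are assumptions.
CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION whylabs_upload_integration
  ALLOWED_NETWORK_RULES = (whylabs_api_network_rule)
  ALLOWED_AUTHENTICATION_SECRETS = (whylabs_api_key, whylabs_org_id)
  ENABLED = TRUE;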
Create Stage Integration
This sets up your account to use the public bucket that WhyLabs uses to distribute the UDFs.
snowsql -f ./storage.sql
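Under the hood this amounts to a stage over the public WhyLabs S3 bucket, roughly like the sketch below. The stage name comes from the UDF imports shown later; the URL path is an assumption, so defer to storage.sql for the exact value.
-- Sketch only: a stage over the read-only public bucket that hosts the UDF source.
-- The URL path is an assumption; storage.sql has the real one.
CREATE OR REPLACE STAGE whylabs_udf_stage
  URL = 's3://whylabs-snowflake-udfs/udfs/';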
Create UDFs
This creates the UDFs in your account, referencing the Python source in our read-only public bucket.
snowsql -f ./create-udf.sql
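To confirm the UDFs were registered, you can list them:
-- Both whylogs() and whylabs_upload() should show up here
SHOW USER FUNCTIONS LIKE 'why%';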
If you want to get hold of the UDF source for any particular release, you can view the releases page on the GitHub repository.
Specific Version
If you want to use a specific version of the UDFs rather than the latest one we publish, use this format instead, where RELEASE_SHA is the release tag that you want from the releases page.
create or replace function whylogs(data object)
-- ...
handler = 'whylogs_udf.handler'
imports = ('@whylabs_udf_stage/v1/archive/RELEASE_SHA/whylogs_udf.py') -- This is different
;
create or replace function whylabs_upload(profile_view varchar, segment_partition varchar, segment varchar)
returns table (upload_result varchar)
-- ...
handler = 'whylabs_upload_udf.handler'
imports = ('@whylabs_udf_stage/v1/archive/RELEASE_SHA/whylabs_upload_udf.py') -- This is different
;
Usage
Now you'll be able to use the whylogs and whylabs_upload functions in your SQL queries. The UDFs have to be used in conjunction with partitioning, and there are unfortunately some gotchas around partition size: if the partitions are too large, the UDF will hit an Out of Memory error, and when that happens you'll have to add extra columns to the PARTITION BY clause.
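The general shape looks like this, where my_table and its data column are placeholders for your own table and the object you build from it; full examples follow below.
-- whylogs() is a table UDF: it runs once per partition and returns one row per generated profile
SELECT profile_view, segment_partition, segment, rows_processed, debug_info
FROM my_table, TABLE(whylogs(data) OVER (PARTITION BY day));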
Optional - Dummy Data
If you want to recreate the tables used in the example queries for testing, follow this section. The script creates a warehouse, database, and table populated with dummy data. This will take about a minute to finish.
curl https://raw.githubusercontent.com/whylabs/whylabs-snowflake-integration/master/sql/create-dummy-data.sql | snowsql
Profile Data
This example shows how to profile a table that you intend to segment on STATE (representing the US state).
WITH
raw_data AS ( -- Query a day's worth of data
SELECT
date_trunc('DAY', hire_date) AS day,
state,
object_insert(
object_insert(
object_construct(*),
'DATASET_TIMESTAMP', date_part(EPOCH_MILLISECONDS, hire_date)
),
'SEGMENT_COLUMNS', 'STATE'
) AS data
FROM employees
WHERE day = '2023-10-03 00:00:00.000'::timestamp
)
-- Use whylogs to profile data
SELECT
day,
state,
profile_view,
segment_partition,
segment,
rows_processed,
debug_info
FROM raw_data, TABLE(whylogs(data) OVER (PARTITION BY day, state))
;
The results will look like this.
| PROFILE_VIEW | SEGMENT_PARTITION | SEGMENT | ROWS_PROCESSED | DEBUG_INFO |
|---|---|---|---|---|
| V0hZMQCRBAoOILCT55K... | STATE | ... | 52224 | norm_time: 0.179 ... |
| V0hZMQCRBAoOIOWu55K... | STATE | ... | 44032 | norm_time: 0.144 ... |
| V0hZMQCRBAoOIPLG55K... | STATE | ... | 48128 | norm_time: 0.153 ... |
| V0hZMQCRBAoOIMvg55K... | STATE | ... | 41984 | norm_time: 0.156 ... |
Some information on the returned columns.
- The PROFILE_VIEW column is a base64-encoded version of the whylogs profile generated for that partition.
- The SEGMENT_PARTITION column is the name of the column that was used to segment the data. In this case it's STATE.
- The SEGMENT column is the value of the segment. It's base64 encoded and identifies the specific segment for that profile.
- The ROWS_PROCESSED column is the number of rows that were processed for that profile.
- The DEBUG_INFO column is a string that contains some debug information about the profiling process. It's useful for troubleshooting.
The input to whylogs() is an object that contains the data that you want to profile. The object_insert function is used to add a DATASET_TIMESTAMP field to the object, which whylogs uses to determine the timestamp of the data. Having a DATASET_TIMESTAMP is required, and the format should be a TIMESTAMP_NTZ (whylogs will assume UTC internally).
If you're going to use segments (like this example does), you'll have to insert another field into the object that specifies the segment columns. In this case, the STATE column is used as the segment. The SEGMENT_COLUMNS field tells whylogs which columns are used for segmentation; it's an all-caps, comma-separated list of column names.
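If you want to see exactly what whylogs() receives, you can run the object construction on its own against the same employees table, as in the sketch below. If you segment on more than one column, SEGMENT_COLUMNS is just a longer comma-separated string, e.g. 'STATE,DEPARTMENT' (DEPARTMENT being a hypothetical column here).
-- Inspect the constructed input object for a few rows (same shape as the example above)
SELECT object_insert(
  object_insert(
    object_construct(*),
    'DATASET_TIMESTAMP', date_part(EPOCH_MILLISECONDS, hire_date)
  ),
  'SEGMENT_COLUMNS', 'STATE'
) AS data
FROM employees
LIMIT 3;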
Partition Optimization
The partitioning should line up with the aggregation granularity (day, hour, etc.) and the segments you're using. If you start hitting Out of Memory errors and you have nothing left to partition on, you can add arbitrary numbers that serve as uniform sub-partitions, like so.
WITH
raw_data AS (
SELECT
date_trunc('DAY', hire_date) AS day,
state,
object_insert(
object_insert(
object_construct(*),
'DATASET_TIMESTAMP', date_part(EPOCH_MILLISECONDS, hire_date)
),
'SEGMENT_COLUMNS', 'STATE'
) AS data,
FLOOR(ABS(UNIFORM(0, 9, RANDOM()))) as rand -- 10 additional sub partitions
FROM employees
WHERE day = '2023-10-03 00:00:00.000'::timestamp
)
SELECT
day,
state,
profile_view,
segment_partition,
segment,
rows_processed,
debug_info
FROM raw_data, TABLE(whylogs(data) OVER (PARTITION BY day, state, rand)) -- rand is the new partition
;
You'll end up generating more profiles this way, but you'll reduce the amount of data that each UDF invocation has to process. Only do this if you hit an error though.
Upload Data
The second UDF (whylabs_upload()) is for uploading the profiles that are generated by the first UDF (whylogs()).
WITH
raw_data AS ( -- Same as before
SELECT
date_trunc('DAY', hire_date) AS day,
state,
object_insert(
object_insert(
object_construct(*),
'DATASET_TIMESTAMP', date_part(EPOCH_MILLISECONDS, hire_date)
),
'SEGMENT_COLUMNS', 'STATE'
) AS data
FROM employees
WHERE day <= '2023-10-03 00:00:00.000'::timestamp
and day >= '2023-10-01 00:00:00.000'::timestamp
),
profiled_data AS ( -- Same profiling code as before
SELECT
day,
state,
profile_view,
segment_partition,
segment,
rows_processed,
debug_info
FROM raw_data,
TABLE(whylogs(data) OVER (PARTITION BY day, state))
),
upload_data AS ( -- Format the input for the whylabs upload function
SELECT
day,
state,
object_construct(
'WHYLABS_DATASET_ID', 'model-92', -- Specify the model id we're uploading to
'PROFILE_VIEW', profile_view,
'SEGMENT_PARTITION', segment_partition,
'SEGMENT', segment
) AS upload_object
FROM profiled_data
)
SELECT
dataset_id, result, error, dataset_timestamp, segment
FROM
upload_data,
TABLE(whylabs_upload(upload_object) OVER (PARTITION BY day, state)) -- Upload with the whylabs_upload UDF
;
This query is essentially the profile query, wrapped in a WITH block and then fed into a second query that uses whylabs_upload(). You can default to keeping the partitioning the same for the upload, but you have some flexibility here. In this example, the upload UDF is launched once for every profile result.
Don't Use Segments
If you don't want to use segments, you can just leave out the segment-specific fields. You should still use reasonable partitions though, since the partitioning determines how much data is passed to the UDF; roughly uniform partition sizes work best.
WITH
raw_data AS (
SELECT
date_trunc('DAY', hire_date) AS day,
state,
object_insert(
object_construct(*),
'DATASET_TIMESTAMP', date_part(EPOCH_MILLISECONDS, hire_date) -- No SEGMENT_COLUMNS field this time
) AS data
FROM employees
WHERE day <= '2023-10-03 00:00:00.000'::timestamp
and day > '2023-10-01 00:00:00.000'::timestamp
),
profiled_data AS ( -- Use whylogs to profile data
SELECT
day,
state,
profile_view,
segment_partition,
segment,
rows_processed,
debug_info
FROM raw_data,
TABLE(whylogs(data) OVER (PARTITION BY day))
),
upload_data AS (
SELECT
day,
state,
object_construct(
'WHYLABS_DATASET_ID', 'model-96',
'PROFILE_VIEW', profile_view
) AS upload_object
FROM profiled_data
)
SELECT
dataset_id, result, error, to_timestamp(dataset_timestamp::varchar), segment -- segment will be null in the results
FROM
upload_data,
TABLE(whylabs_upload(upload_object) OVER (PARTITION BY day, state)) -- Still partition by state
;
Performance
Here are some rough performance numbers for profiling and uploading data, taken on an X-Small Snowflake warehouse using the dummy dataset from this page (10 columns).
| Operation | Time | Rows | Days of Data | Column Count |
|---|---|---|---|---|
| Profiling | 17s | 1,212,416 | 4 days | 10 |
| Profiling + Uploading | 34s | 1,212,416 | 4 days | 10 |
| Profiling | 1m 35s | 10,682,368 | 30 days | 10 |
| Profiling + Uploading | 1m 57s | 10,682,368 | 30 days | 10 |
Performance scales with the number of columns in your dataset and the number of rows you're profiling at once. The more data you profile, the more you'll have to partition the query, which results in a larger number of whylogs profiles being generated.