
Snowflake

The Snowflake integration works by leveraging our Python UDFs in your environment. There isn't an automatic setup or native Snowflake application yet. After the whylogs and WhyLabs UDFs have been set up, you'll profile and upload data through SQL queries that call those UDFs.

Setup

Make sure snowsql is set up with a default schema, database, and warehouse so you can execute SQL statements directly via stdin if you're following along with the examples. If you already have snowsql configured appropriately, you can skip this step; it's just for demonstration.

create database whylogs_integration_db;
create warehouse whylogs_integration_warehouse;

# Add these to your snowsql config so you can pipe statements to snowsql on the command line
[connections]
# ....
warehousename = whylogs_integration_warehouse
dbname = whylogs_integration_db
schemaname = public

Allow Conda Packages

Follow the Snowflake instructions for enabling the use of third-party packages. You'll be using whylogs as a Python UDF from the private Snowflake conda repo, and it's disabled by default until you complete their form.

Create Secrets (Configuration)

Secrets are used to supply the WhyLabs API key and org ID. Both credentials can be generated in your WhyLabs account settings. This is the only part that you have to manually edit since it's specific to your account; the other instructions work for any account.

⚠️ The whylogs version in Snowflake doesn't understand our new API key format yet. If you generate an API key that has :org-12345 at the end of it then you'll have to remove that part until we get a newer whylogs version into Snowflake.

CREATE OR REPLACE SECRET whylabs_api_key
TYPE = GENERIC_STRING
SECRET_STRING = 'xxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx';

CREATE OR REPLACE SECRET whylabs_org_id
TYPE = GENERIC_STRING
SECRET_STRING = 'org-xxxxx';

Quick Install

SQL Script

Each version of the Snowflake integration is published alongside a SQL setup script that you can use to create all of the resources you need in your Snowflake account. You can download this script and execute it with snowsql:

curl https://whylabs-snowflake-udfs.s3.us-west-2.amazonaws.com/udfs/v1/latest/setup.sql | snowsql

Manual Install

The setup will use the Snowflake SQL files in our GitHub repo. You can check that out now or copy and paste the commands as you go.

git clone https://github.com/whylabs/whylabs-snowflake-integration.git
cd ./whylabs-snowflake-integration/sql

Create Network Rule

You need two network rules. One network rule allows the WhyLabs Upload UDF to connect to WhyLabs. The other rule doesn't actually provide any network access but is required by Snowflake in order to access secrets, which the whylogs UDF needs for configuration.

snowsql -f ./networking.sql
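
For reference, an egress rule of the kind networking.sql creates looks roughly like this; the rule name and host below are illustrative, so defer to the script for the exact values.

-- Lets the upload UDF reach the WhyLabs API (host is an assumption here)
CREATE OR REPLACE NETWORK RULE whylabs_api_network_rule
TYPE = HOST_PORT
MODE = EGRESS
VALUE_LIST = ('api.whylabsapp.com');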

Create External Integration

The integrations are what tell Snowflake what each of the UDFs has access to in terms of network and secrets.

snowsql -f ./integrations.sql
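
As a rough sketch, an external access integration ties the network rules and secrets together so a UDF can reference them. The integration and rule names here are illustrative; the script has the real definitions.

CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION whylabs_upload_integration
ALLOWED_NETWORK_RULES = (whylabs_api_network_rule)
ALLOWED_AUTHENTICATION_SECRETS = (whylabs_api_key, whylabs_org_id)
ENABLED = TRUE;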

Create Stage Integration

This will set up your account to be able to use the public bucket that WhyLabs uses to distribute our UDFs.

snowsql -f ./storage.sql
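
Roughly speaking, this boils down to creating a stage that points at the public WhyLabs bucket. The URL below is inferred from the setup script location above; check storage.sql for the exact definition.

CREATE OR REPLACE STAGE whylabs_udf_stage
URL = 's3://whylabs-snowflake-udfs/udfs/';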

Create UDFs

This will create the UDFs in your account, referencing the Python source from our read-only public bucket.

snowsql -f ./create-udf.sql
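
To give a sense of what create-udf.sql contains, the whylogs UDF definition is shaped roughly like the sketch below, wiring the function to the stage, secrets, and integration created earlier. The runtime version, package list, import path, and integration name are assumptions; the script has the authoritative values.

create or replace function whylogs(data object)
returns table (profile_view varchar, segment_partition varchar, segment varchar, rows_processed int, debug_info varchar)
language python
runtime_version = '3.10' -- assumption
packages = ('whylogs') -- pulled from the Snowflake conda channel
handler = 'whylogs_udf.handler'
imports = ('@whylabs_udf_stage/v1/latest/whylogs_udf.py') -- illustrative path to the latest published source
external_access_integrations = (whylabs_secrets_integration) -- name is illustrative
secrets = ('whylabs_api_key' = whylabs_api_key, 'whylabs_org_id' = whylabs_org_id);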

If you want to get a hold of the UDF source for any particular release then you can view the releases page on the GitHub repository.

Specific Version

If you want to use a specific version of the UDFs rather than the latest one that we publish then you can use this format instead, where RELEASE_SHA is the release tag that you want to use from the releases page.

create or replace function whylogs(data object)
-- ...
handler = 'whylogs_udf.handler'
imports = ('@whylabs_udf_stage/v1/archive/RELEASE_SHA/whylogs_udf.py') -- This is different
;

create or replace function whylabs_upload(profile_view varchar, segment_partition varchar, segment varchar)
returns table (upload_result varchar)
-- ...
handler = 'whylabs_upload_udf.handler'
imports = ('@whylabs_udf_stage/v1/archive/RELEASE_SHA/whylabs_upload_udf.py') -- This is different
;

Usage

Now you'll be able to use the whylogs and whylabs_upload functions in your SQL queries. The UDFs have to be used in queries in conjunction with partitioning. There are some gotchas with partition size, unfortunately: if the partitions are too large then the UDF will hit an Out of Memory error. When this happens, you'll have to add extra columns to the partition statement (see Partition Optimization below).

Optional - Dummy Data

If you want to recreate the tables used in the example queries for testing then follow this section. This script will create a warehouse, database, and table, populated with dummy data. This will take about a minute to finish.

curl https://raw.githubusercontent.com/whylabs/whylabs-snowflake-integration/master/sql/create-dummy-data.sql | snowsql

Profile Data

This example shows how to profile a table that you intend to segment on STATE (representing the US state).

WITH
raw_data AS ( -- Query a day's worth of data
    SELECT
        date_trunc('DAY', hire_date) AS day,
        state,
        object_insert(
            object_insert(
                object_construct(*),
                'DATASET_TIMESTAMP', date_part(EPOCH_MILLISECONDS, hire_date)
            ),
            'SEGMENT_COLUMNS', 'STATE'
        ) AS data
    FROM employees
    WHERE day = '2023-10-03 00:00:00.000'::timestamp
)
-- Use whylogs to profile data
SELECT
    day,
    state,
    profile_view,
    segment_partition,
    segment,
    rows_processed,
    debug_info
FROM raw_data, TABLE(whylogs(data) OVER (PARTITION BY day, state))
;

The results will look like this.

PROFILE_VIEW           | SEGMENT_PARTITION | SEGMENT | ROWS_PROCESSED | DEBUG_INFO
V0hZMQCRBAoOILCT55K... | STATE             | ...     | 52224          | norm_time: 0.179 ...
V0hZMQCRBAoOIOWu55K... | STATE             | ...     | 44032          | norm_time: 0.144 ...
V0hZMQCRBAoOIPLG55K... | STATE             | ...     | 48128          | norm_time: 0.153 ...
V0hZMQCRBAoOIMvg55K... | STATE             | ...     | 41984          | norm_time: 0.156 ...

Some information on the returned columns:

  • The PROFILE_VIEW column is a base64-encoded version of the generated whylogs profile for that partition.
  • The SEGMENT_PARTITION column is the name of the column that was used to segment the data. In this case it's STATE.
  • The SEGMENT column is the value of the segment. It's base64 encoded, but it specifies the specific segment for that profile.
  • The ROWS_PROCESSED column is the number of rows that were processed for that profile.
  • The DEBUG_INFO column is a string that contains some debug information about the profiling process. It's useful for troubleshooting.

The input to whylogs() is an object that contains the data that you want to profile. The object_insert function is used to add a DATASET_TIMESTAMP field to the object. This is used by whylogs to determine the timestamp of the data. Having a DATASET_TIMESTAMP is required, and the format should be a TIMESTAMP_NTZ (whylogs will assume UTC internally).

If you're going to use segments (like this example does) then you'll have to insert another column into the query that specifies the segment. In this case, the STATE column is used as the segment. The SEGMENT_COLUMNS field is used to tell whylogs which columns are used for segmentation. This is an all-caps CSV.
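
For example, if the data were also segmented on a hypothetical DEPARTMENT column, the object construction would look like this (DEPARTMENT isn't part of the dummy dataset; it's only here to illustrate the CSV format):

object_insert(
    object_insert(
        object_construct(*),
        'DATASET_TIMESTAMP', date_part(EPOCH_MILLISECONDS, hire_date)
    ),
    'SEGMENT_COLUMNS', 'STATE,DEPARTMENT' -- all-caps, comma separated
) AS data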

Partition Optimization

The partitioning should line up with the aggregation mode (day, hour, etc.) and the segments you're using. If you start hitting Out of Memory errors and have nothing left to partition on, you can add a random number column that serves as uniform sub-partitions, like so.

WITH
raw_data AS (
    SELECT
        date_trunc('DAY', hire_date) AS day,
        state,
        object_insert(
            object_insert(
                object_construct(*),
                'DATASET_TIMESTAMP', date_part(EPOCH_MILLISECONDS, hire_date)
            ),
            'SEGMENT_COLUMNS', 'STATE'
        ) AS data,
        FLOOR(ABS(UNIFORM(0, 9, RANDOM()))) AS rand -- 10 additional sub partitions
    FROM employees
    WHERE day = '2023-10-03 00:00:00.000'::timestamp
)
SELECT
    day,
    state,
    profile_view,
    segment_partition,
    segment,
    rows_processed,
    debug_info
FROM raw_data, TABLE(whylogs(data) OVER (PARTITION BY day, state, rand)) -- rand is the new partition
;

You'll end up generating more profiles this way, but you'll reduce the amount of data that each UDF invocation has to process. Only do this if you hit an error though.

Upload Data

The second UDF (whylabs_upload()) is for uploading the profiles that are generated by the first UDF (whylogs()).

WITH
raw_data AS ( -- Same as before
    SELECT
        date_trunc('DAY', hire_date) AS day,
        state,
        object_insert(
            object_insert(
                object_construct(*),
                'DATASET_TIMESTAMP', date_part(EPOCH_MILLISECONDS, hire_date)
            ),
            'SEGMENT_COLUMNS', 'STATE'
        ) AS data
    FROM employees
    WHERE day <= '2023-10-03 00:00:00.000'::timestamp
        AND day >= '2023-10-01 00:00:00.000'::timestamp
),
profiled_data AS ( -- Same profiling code as before
    SELECT
        day,
        state,
        profile_view,
        segment_partition,
        segment,
        rows_processed,
        debug_info
    FROM raw_data,
        TABLE(whylogs(data) OVER (PARTITION BY day, state))
),
upload_data AS ( -- Format the input for the WhyLabs upload function
    SELECT
        day,
        state,
        object_construct(
            'WHYLABS_DATASET_ID', 'model-92', -- Specify the model id we're uploading to
            'PROFILE_VIEW', profile_view,
            'SEGMENT_PARTITION', segment_partition,
            'SEGMENT', segment
        ) AS upload_object
    FROM profiled_data
)
SELECT
    dataset_id, result, error, dataset_timestamp, segment
FROM
    upload_data,
    TABLE(whylabs_upload(upload_object) OVER (PARTITION BY day, state)) -- Upload with the whylabs_upload UDF
;

This query is essentially the profile query, wrapped in a WITH block, and then fed into a second query that uses whylabs_upload(). You can default to keeping the partitioning the same for the upload, but you have some flexibility here. In this example, the upload UDF is launched once for every profile result.

Don't Use Segments

If you don't want to use segments then you can just leave out the segment-specific fields. You should still use reasonable partitions, though, since partitioning determines how much data is passed to the UDF. Roughly uniform partition sizes work best.

WITH
raw_data AS (
    SELECT
        date_trunc('DAY', hire_date) AS day,
        state,
        object_insert(
            object_construct(*),
            'DATASET_TIMESTAMP', date_part(EPOCH_MILLISECONDS, hire_date) -- No SEGMENT_COLUMNS field this time
        ) AS data
    FROM employees
    WHERE day <= '2023-10-03 00:00:00.000'::timestamp
        AND day > '2023-10-01 00:00:00.000'::timestamp
),
profiled_data AS ( -- Use whylogs to profile data
    SELECT
        day,
        state,
        profile_view,
        segment_partition,
        segment,
        rows_processed,
        debug_info
    FROM raw_data,
        TABLE(whylogs(data) OVER (PARTITION BY day))
),
upload_data AS (
    SELECT
        day,
        state,
        object_construct(
            'WHYLABS_DATASET_ID', 'model-96',
            'PROFILE_VIEW', profile_view
        ) AS upload_object
    FROM profiled_data
)
SELECT
    dataset_id, result, error, to_timestamp(dataset_timestamp::varchar), segment -- segment will be null in the results
FROM
    upload_data,
    TABLE(whylabs_upload(upload_object) OVER (PARTITION BY day, state)) -- Still partition by state
;

Performance

Here are some rough performance numbers for profiling and uploading data taken from an X-Small Snowflake warehouse using the dataset in this doc page (10 columns).

Operation           | Time   | Rows       | Days of Data | Column count
Profiling           | 17s    | 1,212,416  | 4 days       | 10
Profiling+Uploading | 34s    | 1,212,416  | 4 days       | 10
Profiling           | 1m 35s | 10,682,368 | 30 days      | 10
Profiling+Uploading | 1m 57s | 10,682,368 | 30 days      | 10

The performance will scale with the number of columns in your dataset and the number of rows you're profiling at once. The more data you profile, the more you'll have to segment the query, which results in a larger number of whylogs profiles being generated.
