Snowflake
The Snowflake integration works by running our Python UDFs in your environment. There isn't an automatic setup or a native Snowflake application yet. After the whylogs and WhyLabs UDFs have been set up, you'll profile and upload data through SQL queries that call the UDFs.
Setup
Make sure snowsql is set up with a default schema, database, and warehouse so you can execute SQL statements directly via stdin if you're following along with the examples. If your snowsql is already configured appropriately, you can skip this step; it's only here for the demonstration.
create database whylogs_integration_db;
create warehouse whylogs_integration_warehouse;
# So you can pipe things to snowsql on the command line
[connections]
# ....
warehousename = whylogs_integration_warehouse
dbname = whylogs_integration_db
schemaname = public
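With those defaults in place, you can sanity-check the connection by piping a statement to snowsql, the same way the later examples do:
echo "select current_warehouse(), current_database(), current_schema();" | snowsql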
Allow Conda Packages
Follow the Snowflake instructions for enabling the use of third-party packages. You'll be using whylogs as a Python UDF from the Snowflake Anaconda channel, which is disabled by default until you complete their form.
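Once third-party packages are enabled, you can double-check that whylogs is available to your account by querying Snowflake's package listing. This is just a sanity check, not part of the setup scripts:
-- List the whylogs versions available from the Snowflake Anaconda channel
SELECT package_name, version
FROM information_schema.packages
WHERE language = 'python'
  AND package_name = 'whylogs';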
Create Secrets (Configuration)
Secrets are used to supply the WhyLabs API key and org ID, both of which can be generated in your WhyLabs account settings. This is the only part that you have to edit manually since it's specific to your account; the other instructions work for any account.
⚠️ The whylogs version in Snowflake doesn't understand our new API key format yet. If you generate an API key that ends in :org-12345, you'll have to remove that suffix until we get a newer whylogs version into Snowflake.
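If your key does have the suffix, one way to trim it before pasting it into the secret below (the key shown here is a placeholder):
echo 'xxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxx:org-12345' | cut -d ':' -f 1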
CREATE OR REPLACE SECRET whylabs_api_key
TYPE = GENERIC_STRING
SECRET_STRING = 'xxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx';
CREATE OR REPLACE SECRET whylabs_org_id
TYPE = GENERIC_STRING
SECRET_STRING = 'org-xxxxx';
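You can confirm that both secrets exist (Snowflake never displays the secret values themselves):
SHOW SECRETS LIKE 'whylabs%';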
Quick Install
SQL Script
Each version of the Snowflake integration is published alongside a SQL setup script that creates all of the resources you need in your Snowflake account. You can download this script and execute it with snowsql:
curl https://whylabs-snowflake-udfs.s3.us-west-2.amazonaws.com/udfs/v1/latest/setup.sql | snowsql
Manual Install
The manual setup uses the Snowflake SQL files in our GitHub repo. You can clone it now or copy and paste the commands as you go.
git clone https://github.com/whylabs/whylabs-snowflake-integration.git
cd ./whylabs-snowflake-integration/sql
Create Network Rules
You need two network rules. One allows the WhyLabs upload UDF to connect to WhyLabs. The other doesn't actually provide any network access, but Snowflake requires it in order to expose the secrets that the whylogs UDF needs for configuration.
snowsql -f ./networking.sql
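For reference, networking.sql boils down to CREATE NETWORK RULE statements along the lines of the sketch below. The rule name and host here are assumptions, so check the script for the exact values used by your release.
-- Sketch only: an egress rule that lets the upload UDF reach the WhyLabs API.
-- The rule name and host are illustrative; networking.sql has the real values,
-- plus the second, no-access rule used only to expose the secrets.
CREATE OR REPLACE NETWORK RULE whylabs_api_network_rule
  MODE = EGRESS
  TYPE = HOST_PORT
  VALUE_LIST = ('api.whylabsapp.com');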
Create External Integration
The external access integrations tell Snowflake which network rules and secrets each of the UDFs has access to.
snowsql -f ./integrations.sql
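integrations.sql does roughly the following. The integration and network rule names are illustrative, while the secret names match the ones created above; defer to the script for the exact definitions.
-- Sketch only: bundle the network rule and secrets into an integration
-- that the UDFs can be granted access to. Names are assumptions.
CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION whylabs_upload_integration
  ALLOWED_NETWORK_RULES = (whylabs_api_network_rule)
  ALLOWED_AUTHENTICATION_SECRETS = (whylabs_api_key, whylabs_org_id)
  ENABLED = TRUE;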
Create Stage Integration
This sets up your account to use the public bucket that WhyLabs uses to distribute the UDFs.
snowsql -f ./storage.sql
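Under the hood this amounts to a stage over the public WhyLabs S3 bucket, roughly like the sketch below. The stage name comes from the UDF imports shown later; the URL path is an assumption, so defer to storage.sql for the exact value.
-- Sketch only: a stage over the read-only public bucket that hosts the UDF source.
-- The URL path is an assumption; storage.sql has the real one.
CREATE OR REPLACE STAGE whylabs_udf_stage
  URL = 's3://whylabs-snowflake-udfs/udfs/';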
Create UDFs
This creates the UDFs in your account, referencing the Python source in our read-only public bucket.
snowsql -f ./create-udf.sql
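To confirm the UDFs were registered, you can list them:
-- Both whylogs() and whylabs_upload() should show up here
SHOW USER FUNCTIONS LIKE 'why%';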
If you want to get hold of the UDF source for any particular release, you can view the releases page on the GitHub repository.
Specific Version
If you want to use a specific version of the UDFs rather than the latest one we publish, use this format instead, where RELEASE_SHA is the release tag that you want from the releases page.
create or replace function whylogs(data object)
-- ...
handler = 'whylogs_udf.handler'
imports = ('@whylabs_udf_stage/v1/archive/RELEASE_SHA/whylogs_udf.py') -- This is different
;
create or replace function whylabs_upload(profile_view varchar, segment_partition varchar, segment varchar)
returns table (upload_result varchar)
-- ...
handler = 'whylabs_upload_udf.handler'
imports = ('@whylabs_udf_stage/v1/archive/RELEASE_SHA/whylabs_upload_udf.py') -- This is different
;
Usage
Now you'll be able to use the whylogs and whylabs_upload functions in your SQL queries. The UDFs have to be used in conjunction with partitioning, and there are unfortunately some gotchas around partition size: if the partitions are too large, the UDF will hit an Out of Memory error, and when that happens you'll have to add extra columns to the PARTITION BY clause.
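The general shape looks like this, where my_table and its data column are placeholders for your own table and the object you build from it; full examples follow below.
-- whylogs() is a table UDF: it runs once per partition and returns one row per generated profile
SELECT profile_view, segment_partition, segment, rows_processed, debug_info
FROM my_table, TABLE(whylogs(data) OVER (PARTITION BY day));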
Optional - Dummy Data
If you want to recreate the tables used in the example queries for testing, follow this section. The script creates a warehouse, database, and table populated with dummy data. This will take about a minute to finish.
curl https://raw.githubusercontent.com/whylabs/whylabs-snowflake-integration/master/sql/create-dummy-data.sql | snowsql
Profile Data
This example shows how to profile a table that you intend to segment on STATE (representing the US state).
WITH
raw_data AS ( -- Query a day's worth of data
SELECT
date_trunc('DAY', hire_date) AS day,
state,
object_insert(
object_insert(
object_construct(*),
'DATASET_TIMESTAMP', date_part(EPOCH_MILLISECONDS, hire_date)
),
'SEGMENT_COLUMNS', 'STATE'
) AS data
FROM employees
WHERE day = '2023-10-03 00:00:00.000'::timestamp
)
-- Use whylogs to profile data
SELECT
day,
state,
profile_view,
segment_partition,
segment,
rows_processed,
debug_info
FROM raw_data, TABLE(whylogs(data) OVER (PARTITION BY day, state))
;
The results will look like this.
| PROFILE_VIEW | SEGMENT_PARTITION | SEGMENT | ROWS_PROCESSED | DEBUG_INFO |
|---|---|---|---|---|
| V0hZMQCRBAoOILCT55K... | STATE | ... | 52224 | norm_time: 0.179 ... |
| V0hZMQCRBAoOIOWu55K... | STATE | ... | 44032 | norm_time: 0.144 ... |
| V0hZMQCRBAoOIPLG55K... | STATE | ... | 48128 | norm_time: 0.153 ... |
| V0hZMQCRBAoOIMvg55K... | STATE | ... | 41984 | norm_time: 0.156 ... |
Some information on the returned columns.
- The PROFILE_VIEW column is a base64-encoded version of the whylogs profile generated for that partition.
- The SEGMENT_PARTITION column is the name of the column that was used to segment the data. In this case it's STATE.
- The SEGMENT column is the value of the segment. It's base64 encoded and identifies the specific segment for that profile.
- The ROWS_PROCESSED column is the number of rows that were processed for that profile.
- The DEBUG_INFO column is a string that contains some debug information about the profiling process. It's useful for troubleshooting.
The input to whylogs() is an object that contains the data that you want to profile. The object_insert function is used to add a DATASET_TIMESTAMP field to the object, which whylogs uses to determine the timestamp of the data. Having a DATASET_TIMESTAMP is required, and the format should be a TIMESTAMP_NTZ (whylogs will assume UTC internally).
If you're going to use segments (like this example does), you'll have to insert another field into the object that specifies the segment columns. In this case, the STATE column is used as the segment. The SEGMENT_COLUMNS field tells whylogs which columns are used for segmentation; it's an all-caps, comma-separated list of column names.
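If you want to see exactly what whylogs() receives, you can run the object construction on its own against the same employees table, as in the sketch below. If you segment on more than one column, SEGMENT_COLUMNS is just a longer comma-separated string, e.g. 'STATE,DEPARTMENT' (DEPARTMENT being a hypothetical column here).
-- Inspect the constructed input object for a few rows (same shape as the example above)
SELECT object_insert(
  object_insert(
    object_construct(*),
    'DATASET_TIMESTAMP', date_part(EPOCH_MILLISECONDS, hire_date)
  ),
  'SEGMENT_COLUMNS', 'STATE'
) AS data
FROM employees
LIMIT 3;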
Partition Optimization
The partitioning should line up with the aggregation granularity (day, hour, etc.) and the segments you're using. If you start hitting Out of Memory errors and you have nothing left to partition on, you can add arbitrary numbers that serve as uniform sub-partitions, like so.
WITH
raw_data AS (
SELECT
date_trunc('DAY', hire_date) AS day,
state,
object_insert(
object_insert(
object_construct(*),
'DATASET_TIMESTAMP', date_part(EPOCH_MILLISECONDS, hire_date)
),
'SEGMENT_COLUMNS', 'STATE'
) AS data,
FLOOR(ABS(UNIFORM(0, 9, RANDOM()))) as rand -- 10 additional sub partitions
FROM employees
WHERE day = '2023-10-03 00:00:00.000'::timestamp
)
SELECT
day,
state,
profile_view,
segment_partition,
segment,
rows_processed,
debug_info
FROM raw_data, TABLE(whylogs(data) OVER (PARTITION BY day, state, rand)) -- rand is the new partition
;
You'll end up generating more profiles this way, but you'll reduce the amount of data that each UDF invocation has to process. Only do this if you hit an error though.
Upload Data
The second UDF (whylabs_upload()) is for uploading the profiles that are generated by the first UDF (whylogs()).
WITH
raw_data AS ( -- Same as before
SELECT
date_trunc('DAY', hire_date) AS day,
state,
object_insert(
object_insert(
object_construct(*),
'DATASET_TIMESTAMP', date_part(EPOCH_MILLISECONDS, hire_date)
),
'SEGMENT_COLUMNS', 'STATE'
) AS data
FROM employees
WHERE day <= '2023-10-03 00:00:00.000'::timestamp
and day >= '2023-10-01 00:00:00.000'::timestamp
),
profiled_data AS ( -- Same profiling code as before
SELECT
day,
state,
profile_view,
segment_partition,
segment,
rows_processed,
debug_info
FROM raw_data,
TABLE(whylogs(data) OVER (PARTITION BY day, state))
),
upload_data AS ( -- Format the input for the whylabs upload function
SELECT
day,
state,
object_construct(
'WHYLABS_DATASET_ID', 'model-92', -- Specify the model id we're uploading to
'PROFILE_VIEW', profile_view,
'SEGMENT_PARTITION', segment_partition,
'SEGMENT', segment
) AS upload_object
FROM profiled_data
)
SELECT
dataset_id, result, error, dataset_timestamp, segment
FROM
upload_data,
TABLE(whylabs_upload(upload_object) OVER (PARTITION BY day, state)) -- Upload with the whylabs_upload UDF
;
This query is essentially the profile query, wrapped in a WITH block and then fed into a second query that uses whylabs_upload(). You can default to keeping the partitioning the same for the upload, but you have some flexibility here. In this example, the upload UDF is launched once for every profile result.
Don't Use Segments
If you don't want to use segments, you can just leave out the segment-specific fields. You should still use reasonable partitions though, since the partitioning determines how much data is passed to the UDF; roughly uniform partition sizes work best.
WITH
raw_data AS (
SELECT
date_trunc('DAY', hire_date) AS day,
state,
object_insert(
object_construct(*),
'DATASET_TIMESTAMP', date_part(EPOCH_MILLISECONDS, hire_date) -- No SEGMENT_COLUMNS field this time
) AS data
FROM employees
WHERE day <= '2023-10-03 00:00:00.000'::timestamp
and day > '2023-10-01 00:00:00.000'::timestamp
),
profiled_data AS ( -- Use whylogs to profile data
SELECT
day,
state,
profile_view,
segment_partition,
segment,
rows_processed,
debug_info
FROM raw_data,
TABLE(whylogs(data) OVER (PARTITION BY day))
),
upload_data AS (
SELECT
day,
state,
object_construct(
'WHYLABS_DATASET_ID', 'model-96',
'PROFILE_VIEW', profile_view
) AS upload_object
FROM profiled_data
)
SELECT
dataset_id, result, error, to_timestamp(dataset_timestamp::varchar), segment -- segment will be null in the results
FROM
upload_data,
TABLE(whylabs_upload(upload_object) OVER (PARTITION BY day, state)) -- Still partition by state
;
Performance
Here are some rough performance numbers for profiling and uploading data, taken on an X-Small Snowflake warehouse using the dummy dataset from this page (10 columns).
| Operation | Time | Rows | Days of Data | Column Count |
|---|---|---|---|---|
| Profiling | 17s | 1,212,416 | 4 days | 10 |
| Profiling + Uploading | 34s | 1,212,416 | 4 days | 10 |
| Profiling | 1m 35s | 10,682,368 | 30 days | 10 |
| Profiling + Uploading | 1m 57s | 10,682,368 | 30 days | 10 |
Performance scales with the number of columns in your dataset and the number of rows you're profiling at once. The more data you profile, the more you'll have to partition the query, which results in a larger number of whylogs profiles being generated.