The open source data profiling library, whylogs, can be used to profile data within a customer's environment. It can be used 100% offline, as nothing is uploaded to the WhyLabs platform unless an API key is configured. Once an API key is configured, only statistical profiles are ever uploaded to WhyLabs. There is no mechanism to send raw data to WhyLabs.
Most customers integrating with WhyLabs opt to upload profiles via the mechanism built into the open source whylogs library. Simply obtain an API key, plug it in, and you're ready to rock.
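As a rough illustration of that flow, here is a minimal sketch assuming whylogs v1 and its `whylabs` writer, which reads its credentials from the `WHYLABS_API_KEY`, `WHYLABS_DEFAULT_ORG_ID`, and `WHYLABS_DEFAULT_DATASET_ID` environment variables. The helper names `missing_settings` and `profile_and_upload` are hypothetical, and the exact set of required variables may differ between whylogs versions, so check the documentation for the version you have installed.

```python
import os

# Assumption: these are the environment variables the whylogs v1
# "whylabs" writer reads its credentials and target dataset from.
REQUIRED_SETTINGS = (
    "WHYLABS_API_KEY",
    "WHYLABS_DEFAULT_ORG_ID",
    "WHYLABS_DEFAULT_DATASET_ID",
)


def missing_settings(env=os.environ):
    """Return the names of the required settings that are unset or empty."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


def profile_and_upload(df):
    """Profile a pandas DataFrame locally; upload only if credentials are set.

    Hypothetical helper, not part of whylogs itself.
    """
    # Imported lazily so this module can be loaded without whylogs installed.
    import whylogs as why

    # Profiling happens entirely in the local process.
    results = why.log(pandas=df)

    if missing_settings():
        # No credentials configured: keep the profile local and skip the upload.
        return results

    # Sends the statistical profile to WhyLabs, not the rows of df.
    results.writer("whylabs").write()
    return results
```

Calling `profile_and_upload(df)` without the environment variables set simply returns the local profile, which matches the offline behavior described above.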
A profile uploaded with a data timestamp in the current week typically becomes visible in the UI in under a minute (often just a few seconds).
Some enterprise customers operate with very strict network egress rules that block access to cloud-based REST APIs. For such scenarios we offer the ability to pull profiles dumped to a blob store such as AWS S3. Contact us for more information!
Cross-account blob store integrations currently have a one-day turnaround for new profiles to be processed and visible in the UI.
Profiling data older than seven days is considered a backfill. Backfills currently have a one-day processing turnaround, so they're ready to view the following day.
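If you want to anticipate which turnaround applies before uploading, the seven-day rule can be sketched as a small check. This is only an illustration: the function name `is_backfill` is hypothetical, and the exact boundary the platform uses (for example, how it treats a timestamp exactly seven days old) is an assumption here.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "older than seven days" is measured from the current UTC time.
BACKFILL_THRESHOLD = timedelta(days=7)


def is_backfill(dataset_timestamp, now=None):
    """Return True if a timezone-aware dataset timestamp is older than seven days."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - dataset_timestamp > BACKFILL_THRESHOLD
```

A profile for which this returns `True` would fall under the one-day backfill turnaround rather than the near-real-time path.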
Depending on how you use whylogs in your environment, you might have multiple machines in a distributed context generating profiles for the same dataset+time+segment. Profiles generated by whylogs are easily mergeable, which makes it easy to reduce data egress volume by merging them prior to uploading. If profiles are not merged prior to upload, that's okay too! The WhyLabs platform will merge them automatically.
For example, suppose a Kafka topic is being profiled with whylogs using multiple consumer instances, each emitting a profile once an hour. Profiles will automatically merge and reflect changes throughout the day.
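To see why merging loses nothing, consider a deliberately simplified sketch of a mergeable summary. This is not the whylogs implementation (whylogs profiles hold richer statistics), and the `ColumnSummary` class is hypothetical. It only shows the property that matters: combining summaries built independently on separate machines gives the same result as summarizing all the data in one place.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ColumnSummary:
    """Toy stand-in for a profile: a few statistics that merge exactly."""

    count: int
    total: float
    minimum: float
    maximum: float

    @classmethod
    def from_values(cls, values: Iterable[float]) -> "ColumnSummary":
        values = list(values)
        if not values:
            raise ValueError("cannot summarize an empty batch")
        return cls(len(values), sum(values), min(values), max(values))

    def merge(self, other: "ColumnSummary") -> "ColumnSummary":
        # Each field combines independently, so merge order does not matter.
        return ColumnSummary(
            count=self.count + other.count,
            total=self.total + other.total,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )

    @property
    def mean(self) -> float:
        return self.total / self.count


# Two consumers each summarize their own share of the same hour of data.
consumer_a = ColumnSummary.from_values([1.0, 4.0, 2.0])
consumer_b = ColumnSummary.from_values([10.0, 3.0])
merged = consumer_a.merge(consumer_b)
```

Here `merged` equals the summary computed over all five values at once, which is why it does not matter whether the merge happens before upload or on the platform.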