Java
The whylogs library is available in both Java and Python. This page covers the Java version, which includes support for Apache Spark integration.
Usage
To get started, add WhyLogs to your Maven POM:
For the full Java API signature, see the Java Documentation.
Spark package (Scala 2.11 or 2.12 only):
For the full Scala API signature, see the Scala API Documentation.
Examples Repo
For examples in different languages, please check out our whylogs-examples repository.
Simple tracking
The following code is a simple tracking example that does not output data to disk:
Serialization and deserialization
WhyLogs uses Protobuf as the backing storage format. To write the data to disk, use the standard Protobuf serialization API as follows.
Merging dataset profiles
In enterprise systems, data is often partitioned across multiple machines for distributed processing. Online systems may also process data on multiple machines, requiring engineers to run ad-hoc analysis using an ETL-based system to build complex metrics, such as counting unique visitors to a website.
WhyLogs resolves this by allowing users to merge sketches from different machines. To merge two WhyLogs DatasetProfile files, those files must:
- Have the same name
- Have the same session ID
- Have the same data timestamp
- Have the same tags
The following is an example of the code for merging files that meet these requirements.
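To make the four requirements concrete, here is a minimal, library-independent sketch. ToyProfile, its fields, and its methods are hypothetical stand-ins and are not the WhyLogs DatasetProfile API. An exact HashSet of visitor IDs takes the place of the approximate sketches a real profile would hold, so the example shows only the metadata check and the "merge per-machine results" idea.

```java
import java.time.Instant;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a dataset profile; NOT the WhyLogs DatasetProfile API.
final class ToyProfile {
    final String name;
    final String sessionId;
    final Instant dataTimestamp;
    final Map<String, String> tags;
    // Exact set of visitor IDs; a real sketch would use an approximate structure.
    final Set<String> visitors = new HashSet<>();

    ToyProfile(String name, String sessionId, Instant dataTimestamp, Map<String, String> tags) {
        this.name = name;
        this.sessionId = sessionId;
        this.dataTimestamp = dataTimestamp;
        this.tags = tags;
    }

    void track(String visitorId) {
        visitors.add(visitorId);
    }

    // True only when name, session ID, data timestamp, and tags all match.
    boolean isMergeableWith(ToyProfile other) {
        return name.equals(other.name)
                && sessionId.equals(other.sessionId)
                && dataTimestamp.equals(other.dataTimestamp)
                && tags.equals(other.tags);
    }

    // Returns a new profile holding the union of both inputs; the inputs are left unchanged.
    ToyProfile merge(ToyProfile other) {
        if (!isMergeableWith(other)) {
            throw new IllegalArgumentException(
                    "Profiles differ in name, session ID, data timestamp, or tags");
        }
        ToyProfile merged = new ToyProfile(name, sessionId, dataTimestamp, tags);
        merged.visitors.addAll(visitors);
        merged.visitors.addAll(other.visitors);
        return merged;
    }
}

class MergeSketch {
    public static void main(String[] args) {
        Instant ts = Instant.parse("2020-10-01T00:00:00Z");
        Map<String, String> tags = Collections.singletonMap("env", "prod");

        // Two machines each profile their own partition of the traffic.
        ToyProfile machineA = new ToyProfile("web-traffic", "session-1", ts, tags);
        ToyProfile machineB = new ToyProfile("web-traffic", "session-1", ts, tags);
        machineA.track("alice");
        machineA.track("bob");
        machineB.track("bob");
        machineB.track("carol");

        // Unique visitors across both machines: alice, bob, carol.
        System.out.println(machineA.merge(machineB).visitors.size()); // prints 3
    }
}
```

Because the merge is a union, "bob" is counted once even though both machines saw him, which is the property that makes per-machine profiles safe to combine.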
Apache Spark integration
Our integration is compatible with Apache Spark 2.x (Spark 3.0 support is planned). This example shows how we use WhyLogs to profile a dataset based on time and categorical information. The data is from the public dataset for Fire Department Calls & Incident.
For further analysis, dataframes can be stored in a Parquet file, or collected to the driver if the number of entries is small enough.
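As a rough illustration of what profiling "based on time and categorical information" means, the sketch below groups records by calendar day and by a category, then summarizes one numeric column per group. It is plain Java with no Spark or WhyLogs calls. CallRecord, its field names, and the sample values are hypothetical and do not reflect the actual dataset schema.

```java
import java.time.LocalDateTime;
import java.util.Arrays;
import java.util.IntSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical call record; field names are illustrative, not the dataset's schema.
final class CallRecord {
    final LocalDateTime receivedAt;
    final String callType;
    final int unitsDispatched;

    CallRecord(LocalDateTime receivedAt, String callType, int unitsDispatched) {
        this.receivedAt = receivedAt;
        this.callType = callType;
        this.unitsDispatched = unitsDispatched;
    }
}

final class GroupedSummarySketch {
    // Groups calls by (calendar day, call type) and summarizes unitsDispatched per group.
    static Map<String, IntSummaryStatistics> summarizeByDayAndType(List<CallRecord> calls) {
        return calls.stream().collect(Collectors.groupingBy(
                c -> c.receivedAt.toLocalDate() + "|" + c.callType, // composite group key
                TreeMap::new,                                        // sorted keys for stable output
                Collectors.summarizingInt(c -> c.unitsDispatched)));
    }

    public static void main(String[] args) {
        List<CallRecord> calls = Arrays.asList(
                new CallRecord(LocalDateTime.parse("2020-10-01T08:15:00"), "Medical Incident", 2),
                new CallRecord(LocalDateTime.parse("2020-10-01T21:40:00"), "Medical Incident", 4),
                new CallRecord(LocalDateTime.parse("2020-10-01T09:00:00"), "Structure Fire", 6),
                new CallRecord(LocalDateTime.parse("2020-10-02T10:30:00"), "Medical Incident", 1));

        // One summary per (day, call type) group.
        summarizeByDayAndType(calls).forEach((key, stats) ->
                System.out.println(key + " count=" + stats.getCount() + " max=" + stats.getMax()));
    }
}
```

In the Spark integration the same grouping happens across the cluster and each group yields a full profile rather than a single summary statistic. The result is one row per group, which is why it can be small enough to collect to the driver.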
Building and Testing
- To build, run ./gradlew build
- To test, run ./gradlew test