Canal Engine - High-performance connector and caching engine

Introduction

Canal is our latest Connector and Caching Engine that provides advanced performance features to your Holistics queries.

It is responsible for connecting to your Data Warehouses, triggering queries, and efficiently retrieving the query results from the Data Warehouses into the Holistics Cache (a.k.a. "Holistics Data Lake"), making the data readily available for further processing (e.g. rendering in browsers, exporting, etc.).

How to enable Canal

Open Beta

Holistics Canal is now in Open Beta!

If you are interested, please fill in this form and we will try to notify you of our new updates! P.S. Make sure to mention the Database type that you would like Canal to support next. 😉

See Enable Canal.

Technologies

Data Streaming

Canal employs a data “streaming” technique that eliminates much of the overhead and many of the bottlenecks in data transfer. Specifically, it transfers the result data in small chunks from your Data Warehouse straight into the Holistics Data Lake.

With data streaming, there is little to no overhead during the transfer, which minimizes the latency between the moment the query finishes on the Data Warehouse and the moment the result is visible to end users.
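To make the chunked transfer concrete, here is a minimal Go sketch. It is not Canal's actual code: `Row`, `streamChunks`, and `fakeWarehouse` are hypothetical names, and the chunk size of 3 is arbitrary. It only shows the general idea that rows are forwarded downstream as soon as a chunk fills up, so the full result set is never buffered in one place.

```go
package main

import "fmt"

// Row is a hypothetical result row; a real engine would carry typed columns.
type Row []string

// streamChunks pulls rows from next() and sends them downstream in chunks of
// at most chunkSize rows. Each chunk is forwarded as soon as it is full.
func streamChunks(next func() (Row, bool), chunkSize int, out chan<- []Row) {
	defer close(out)
	chunk := make([]Row, 0, chunkSize)
	for {
		row, ok := next()
		if !ok {
			break
		}
		chunk = append(chunk, row)
		if len(chunk) == chunkSize {
			out <- chunk
			chunk = make([]Row, 0, chunkSize)
		}
	}
	if len(chunk) > 0 {
		out <- chunk // flush the final, partially filled chunk
	}
}

// fakeWarehouse returns an iterator over n synthetic rows. It stands in for a
// Data Warehouse cursor and is purely an assumption for this example.
func fakeWarehouse(n int) func() (Row, bool) {
	i := 0
	return func() (Row, bool) {
		if i >= n {
			return nil, false
		}
		i++
		return Row{fmt.Sprintf("row-%d", i)}, true
	}
}

func main() {
	out := make(chan []Row) // unbuffered: producer and consumer move in lockstep
	go streamChunks(fakeWarehouse(7), 3, out)
	for chunk := range out {
		fmt.Printf("stored chunk of %d rows\n", len(chunk))
	}
}
```

With 7 rows and a chunk size of 3, the consumer receives chunks of 3, 3, and 1 rows, and can start storing the first chunk before the last row has been read.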

Connection Pooling

While building Holistics Canal, we took the opportunity to implement Connection Pooling as well!

Once a "canal" (or connection) has been established between Holistics and your Data Warehouse, Holistics will try to re-use that same connection for multiple queries.
This cuts down the connection establishment costs (e.g. DNS lookup, authentication, SSL handshake, etc.) when running multiple queries on the same Data Warehouse, which typically saves 100-1000 ms of latency per query.
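The re-use idea can be sketched in a few lines of Go. This is an illustration under stated assumptions, not Canal's implementation: `Conn`, `Pool`, `Get`, `Put`, and `Dials` are hypothetical names, and the "dial" here is just a counter standing in for the expensive setup steps listed above.

```go
package main

import (
	"fmt"
	"sync"
)

// Conn is a hypothetical stand-in for an established warehouse connection.
type Conn struct{ ID int }

// Pool hands out idle connections and only dials when none is available.
type Pool struct {
	mu    sync.Mutex
	idle  []*Conn
	dials int // how many times the expensive connection setup was paid
}

// Get returns an idle connection if there is one, otherwise dials a new one.
func (p *Pool) Get() *Conn {
	p.mu.Lock()
	defer p.mu.Unlock()
	if n := len(p.idle); n > 0 {
		c := p.idle[n-1]
		p.idle = p.idle[:n-1]
		return c
	}
	p.dials++ // a real engine would do DNS lookup, TLS handshake, auth here
	return &Conn{ID: p.dials}
}

// Put returns a connection to the pool so that a later query can re-use it.
func (p *Pool) Put(c *Conn) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.idle = append(p.idle, c)
}

// Dials reports how many connections have been established so far.
func (p *Pool) Dials() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.dials
}

func main() {
	pool := &Pool{}
	for i := 0; i < 5; i++ {
		c := pool.Get()
		// ... run query i on c ...
		pool.Put(c)
	}
	fmt.Println("queries: 5, connections dialed:", pool.Dials())
}
```

Run sequentially, the five queries share a single connection, so the setup cost is paid once rather than five times. Go's standard `database/sql` package provides a built-in pool of this kind behind `sql.DB`; whether Canal relies on it is not stated here.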

Golang

The whole Holistics Canal system (including Data Lake) is written in Golang. This brings several benefits to the system, including:

Faster execution

The old Holistics Connector runs on Ruby, an interpreted programming language in which any compilation happens only at runtime (i.e. Just-in-time compilation). Golang, on the other hand, compiles the code ahead of time, allowing it to execute faster right from the get-go.

Golang also enables us to use more efficient data structures and make lower-level optimizations in our code.

Better concurrency

Golang can spawn many goroutines that run concurrently, and in parallel across CPU cores.

In our canal illustration, Golang allows us to operate on multiple “currents” at the same time, right in the middle of the flow/stream.

In addition, Golang's concurrency model allows memory to be shared between concurrent executions, which also made the Connection Pooling feature mentioned above easier to build.
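As a rough sketch of both points (several goroutines working at once, and memory shared between them), the following Go example fans chunks out to a few workers that update one shared counter under a mutex. The function name `countRows` and the per-chunk "work" (counting rows) are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"sync"
)

// countRows distributes chunks across `workers` goroutines and returns the
// total number of rows. The shared total is guarded by a mutex.
func countRows(chunks [][]string, workers int) int {
	jobs := make(chan []string)
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for chunk := range jobs {
				n := len(chunk) // placeholder for real per-chunk processing
				mu.Lock()
				total += n // shared memory, updated safely
				mu.Unlock()
			}
		}()
	}
	for _, c := range chunks {
		jobs <- c
	}
	close(jobs) // lets the workers' range loops finish
	wg.Wait()   // all updates to total are complete after this
	return total
}

func main() {
	chunks := [][]string{{"a", "b"}, {"c"}, {"d", "e", "f"}}
	fmt.Println("rows processed:", countRows(chunks, 2))
}
```

The same pattern (a shared structure protected by a lock) is what allows a single connection pool to serve many concurrent queries.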

Access to better data processing tools and technologies

- Golang has first-class support from major Databases/Data Warehouses. As a result, Golang database connector libraries are often readily available, more performant, richer in features, and less buggy.
- The Apache Arrow and Apache Parquet libraries are also very well maintained in Golang, while they are still fairly primitive in Ruby at the moment.

Apache Arrow and Apache Parquet

Holistics Canal uses Apache Arrow as the data format for transferring data into the Data Lake. Arrow brings two benefits:

- It avoids the cost of “unloading” and “loading” data (i.e. serialization and deserialization) into and from the Data Lake.
  - For Data Warehouses (e.g. BigQuery and Snowflake) that use columnar storage themselves and provide Apache Arrow as a query output format, this can also avoid the cost of “unloading” data from the Data Warehouses.
- It can be seamlessly stored and processed as columnar data, enabling fast data analytics and retrieval.
  - Currently, we store the Arrow data as Apache Parquet files, which provide storage compression and portability while still being fast enough when queried by DuckDB.
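To illustrate why a columnar layout helps analytics, here is a plain-Go sketch that does not use the Arrow or Parquet libraries. The types `RowRecord` and `ColumnBatch`, the helper functions, and the sample data are all hypothetical. It contrasts one-struct-per-row storage with one-slice-per-column storage, where an aggregate over a single column only scans that column.

```go
package main

import "fmt"

// RowRecord is a row-oriented layout: one struct per row.
type RowRecord struct {
	Country string
	Revenue float64
}

// ColumnBatch is a column-oriented layout: one contiguous slice per column.
type ColumnBatch struct {
	Country []string
	Revenue []float64
}

// toColumns pivots row-oriented records into a columnar batch.
func toColumns(rows []RowRecord) ColumnBatch {
	b := ColumnBatch{
		Country: make([]string, 0, len(rows)),
		Revenue: make([]float64, 0, len(rows)),
	}
	for _, r := range rows {
		b.Country = append(b.Country, r.Country)
		b.Revenue = append(b.Revenue, r.Revenue)
	}
	return b
}

// sumRevenue scans only the Revenue column; the Country column is never read.
func sumRevenue(b ColumnBatch) float64 {
	var sum float64
	for _, v := range b.Revenue {
		sum += v
	}
	return sum
}

func main() {
	rows := []RowRecord{{"VN", 10.5}, {"SG", 20}, {"ID", 4.5}}
	batch := toColumns(rows)
	fmt.Println("total revenue:", sumRevenue(batch))
}
```

When the source already produces columnar data, the pivot step in `toColumns` disappears entirely, which is the kind of conversion cost the first bullet above refers to.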

DuckDB

We use DuckDB as our Cache Query Engine:

- Features: DuckDB provides lots of useful querying and analytics features.
- Speed:
  - DuckDB's vectorized query execution model allows high-performance querying on cached data.
  - DuckDB can output Arrow data, which again is very efficient to transfer to post-processing services.

Future

As the tools and technologies around Apache Arrow and DuckDB evolve every day, we expect to incorporate more features into Holistics and improve its performance even further!

References

For a more detailed comparison between Holistics Canal and our previous engine, please check out this Community Post.
To learn more about how Holistics Caching works, please refer to Caching Mechanism.
