Skip to main content

Aggregate Awareness

Open Beta

Aggregate Awareness is now in open beta.

You can also read the announcement blog post.

Aggregate Awareness is a query performance optimization functionality by Holistics. It inspects the query sent from Holistics BI and intelligently rewrites the query to use the smaller, pre-aggregated tables instead of the raw tables.

This results in fewer rows scanned, hence improving both performance & cost while maintaining query accuracy.

The aggregated tables can be either built using Aggregate Awareness, or using external transformation tools (dbt, Airflow, Dagster).

Glossary

Before we dive deeper into aggregate awareness, let’s build the foundation and align on the same page with some basic terms

  • Pre-aggregate is the aggregation performed and prepared prior to the query time, producing a condensed version of source data with just enough information that needs to be used frequently. This helps reduce the size of the data and improves the query performance.
  • Pre-aggregated table is the output of the pre-aggregate process.
  • Aggregate Awareness is the capability of automatically identifying eligible pre-aggregates and substituting them into queries, making the queries use smaller pre-aggregated tables while still producing accurate results.
  • Granularity is a term that describes the level of detail or complexity of a set of data.
    • Coarse-grained (or low) granularity includes the data set having less detail with a grasp overview frequently used for quick analysis.
    • Fine-grained (or high) granularity, in contrast, includes the data set having more detail frequently used for in-depth analysis or at the execution level.
    • For example:
      • month has coarser-grained (lower) granularity than day
      • (country, city) has finer-grained (higher) granularity than (country)

How It Works

KNOWLEDGE CHECKPOINT

You should know the concepts of Data Model and Dataset.

To make Holistics Aggregate Awareness know which pre-aggregate can be used in the query, there are two jobs to be done:

  • Define the Pre-Aggregate in the dataset to map between the raw tables/models and the pre-aggregated table.
  • Create the pre-aggregated table by using either
    • The external tools (like dbt, Airflow, etc.) or
    • The built-in persistence option in Holistics
How it works - Pre-aggregate

When your query is intended to use the original tables, Holistics will automatically pick the right pre-aggregated tables having enough information to answer the data question.

Therefore, instead of processing millions of millions of records, it only needs fewer than that, leading to faster results.

How it works - Query with pre-aggregate

Quick Example

Background

Assume we’re working with the following raw_transactions table and a model on top of it residing in an ecommerce dataset

Quick example - Raw table
// ---------- Modeling Layer ----------

Dataset ecommerce {
//... other dataset settings
models: [transactions]
}

Model transactions {
dimension id
measure count_transactions { // a custom measure definition
label: 'Count Transactions'
type: 'number'
definition: @aql count(transactions.id)
}
//... other fields' definitions

table_name: 'raw_transactions'
}

// ---------- Database ----------

Table raw_transactions {
id integer [PK]
user_id integer [ref: > users.id]
product_id integer [ref: > products.id]
merchant_id integer [ref: > merchants.id]
status varchar
city varchar
country varchar
order_value double
created_at timestamp
}

Given that, we know the common queries with this table are:

  • Count transactions by day, week, month
  • Count transactions by status, country, city

This table is relatively big (billions of records). So, we know these analytical queries would take time.

Create Pre-aggregated Table

A good approach is to build a pre-aggregated table on the frequently used dimensions and measures to reduce the number of records

Quick example - Create pre-aggregated table
// Create pre-aggregated table agg_transactions

SELECT
created_at::date as created_at_day,
status,
country,
city,
COUNT(1) as count_transactions
FROM raw_transactions
GROUP BY 1, 2, 3, 4

// ---------- Database - Schema: `persisted` ----------

Table agg_transactions {
created_at_day date
status varchar
country varchar
city varchar
count_transactions integer
}

Map Pre-aggregated Table with Model

After that, we use pre-aggregate to define which model dimensions & measures are equivalent to the respective pre-aggregated table fields

Quick example - Pre-aggregate config
Dataset ecommerce {
//... other dataset settings
models: [transactions]

// ----- BEGIN NEW CODE - Add pre-aggregate
pre_aggregates: {
agg_transactions: PreAggregate {
// This dimension is in the pre-aggregated table `agg_transactions`
dimension created_at_day {
// This reference is in the `transactions` model
for: ref('transactions', 'created_at'),
time_granularity: "day"
}
dimension status {
for: ref('transactions', 'status')
}
dimension country {
for: ref('transactions', 'country')
}
dimension city {
for: ref('transactions', 'city')
}
measure count_transactions {
for: ref('transactions', 'id')
aggregation_type: 'count'
}
persistence: FullPersistence {
schema: 'persisted'
}
},
}
// ----- END NEW CODE
}

The Results

When querying and exploring data, analysts and explorers only work with a single transactions model. They shouldn’t need to have knowledge about the underlying aggregation optimizations.

With Aggregate Awareness, Holistics can automatically choose the right underlying table to query based on the dimensions of the query. The mechanism in Holistics’s Query Engine might be conceptually like the following:

Quick example - Query with pre-aggregate

Let us know what you think about this document :)